Analyzing Pronto CycleShare Data with Python and Pandas

demond123, 8 years ago
   <p>This is an excellent introductory article on data analysis with pandas; what follows is a lightly abridged translation.</p>
   <p>This week, Seattle's bike-share system, Pronto CycleShare, turned one year old. To celebrate, Pronto released a cache of data from its first year and announced the Pronto Cycle Share Data Analysis Challenge.</p>
   <p>You could analyze this data with many tools, but my tool of choice is Python. In this post I want to show how to get started analyzing this data with the PyData stack (NumPy, Pandas, Matplotlib, and Seaborn) and how to combine it with other available data sources.</p>
   <p>This post is written as a Jupyter Notebook, an open document format that combines text, code, data, and graphics and is viewed through a web browser. The content of this post can be downloaded as a Notebook file and opened with Jupyter.</p>
   <h2>Downloading Pronto's Data</h2>
   <p>We can download the data files from the Pronto website. The total download is about 70MB, and the unzipped files are about 900MB.</p>
   <p>Next we need to import some Python packages:</p>
   <p>In [2]:</p>
   <pre>  <code class="language-python">%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import seaborn as sns; sns.set()</code></pre>
   <p>Now we use Pandas to load all of the trip records:</p>
   <p>In [3]:</p>
   <pre>  <code class="language-python"># parse the start/stop columns as timestamps while reading
trips = pd.read_csv('2015_trip_data.csv',
                    parse_dates=['starttime', 'stoptime'])
trips.head()</code></pre>
   <p>Out[3]:</p>
   <table>
    <thead>
     <tr>
      <th> </th>
      <th>trip_id</th>
      <th>starttime</th>
      <th>stoptime</th>
      <th>bikeid</th>
      <th>tripduration</th>
      <th>from_station_name</th>
      <th>to_station_name</th>
      <th>from_station_id</th>
      <th>to_station_id</th>
      <th>usertype</th>
      <th>gender</th>
      <th>birthyear</th>
     </tr>
    </thead>
    <tbody>
     <tr>
      <th>0</th>
      <td>431</td>
      <td>2014-10-13 10:31:00</td>
      <td>2014-10-13 10:48:00</td>
      <td>SEA00298</td>
      <td>985.935</td>
      <td>2nd Ave & Spring St</td>
      <td>Occidental Park / Occidental Ave S & S Washing...</td>
      <td>CBD-06</td>
      <td>PS-04</td>
      <td>Annual Member</td>
      <td>Male</td>
      <td>1960</td>
     </tr>
     <tr>
      <th>1</th>
      <td>432</td>
      <td>2014-10-13 10:32:00</td>
      <td>2014-10-13 10:48:00</td>
      <td>SEA00195</td>
      <td>926.375</td>
      <td>2nd Ave & Spring St</td>
      <td>Occidental Park / Occidental Ave S & S Washing...</td>
      <td>CBD-06</td>
      <td>PS-04</td>
      <td>Annual Member</td>
      <td>Male</td>
      <td>1970</td>
     </tr>
     <tr>
      <th>2</th>
      <td>433</td>
      <td>2014-10-13 10:33:00</td>
      <td>2014-10-13 10:48:00</td>
      <td>SEA00486</td>
      <td>883.831</td>
      <td>2nd Ave & Spring St</td>
      <td>Occidental Park / Occidental Ave S & S Washing...</td>
      <td>CBD-06</td>
      <td>PS-04</td>
      <td>Annual Member</td>
      <td>Female</td>
      <td>1988</td>
     </tr>
     <tr>
      <th>3</th>
      <td>434</td>
      <td>2014-10-13 10:34:00</td>
      <td>2014-10-13 10:48:00</td>
      <td>SEA00333</td>
      <td>865.937</td>
      <td>2nd Ave & Spring St</td>
      <td>Occidental Park / Occidental Ave S & S Washing...</td>
      <td>CBD-06</td>
      <td>PS-04</td>
      <td>Annual Member</td>
      <td>Female</td>
      <td>1977</td>
     </tr>
     <tr>
      <th>4</th>
      <td>435</td>
      <td>2014-10-13 10:34:00</td>
      <td>2014-10-13 10:49:00</td>
      <td>SEA00202</td>
      <td>923.923</td>
      <td>2nd Ave & Spring St</td>
      <td>Occidental Park / Occidental Ave S & S Washing...</td>
      <td>CBD-06</td>
      <td>PS-04</td>
      <td>Annual Member</td>
      <td>Male</td>
      <td>1971</td>
     </tr>
    </tbody>
   </table>
   <p>Each row of this trip dataset is a single ride by a single person, and the dataset contains over 140,000 rows.</p>
   <h2>Exploring Trips Over Time</h2>
   <p>Let's start by looking at the trend in the number of daily trips over the year.</p>
   <p>In [4]:</p>
   <pre>  <code class="language-python"># Find the start date
ind = pd.DatetimeIndex(trips.starttime)
trips['date'] = ind.normalize()  # start date as a midnight timestamp
trips['hour'] = ind.hour</code></pre>
   <p>In [5]:</p>
   <pre>  <code class="language-python"># Count trips by date
by_date = trips.pivot_table('trip_id', aggfunc='count',
                            index='date',
                            columns='usertype')</code></pre>
   <p>In [6]:</p>
   <pre>  <code
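class="language-python">fig, ax = plt.subplots(2, figsize=(16, 8))
fig.subplots_adjust(hspace=0.4)
by_date.iloc[:, 0].plot(ax=ax[0], title='Annual Members');
by_date.iloc[:, 1].plot(ax=ax[1], title='Day-Pass Users');</code></pre>
   <p style="text-align:center"><img src="https://simg.open-open.com/show/aea2b41d4cdd8104355b9b73ceae0f66.jpg"></p>
   <p>This plot shows the daily trend, separated into annual members (top) and day-pass users (bottom). From the charts we can draw a few conclusions:</p>
   <ul>
    <li>The big spike in day-pass use in April is probably due to the American Planning Association national conference, held in downtown Seattle. The only other time that comes close is the Fourth of July weekend.</li>
    <li>Day-pass use shows a steady rise and fall with the seasons; use by annual members has not dropped off as noticeably with the arrival of fall.</li>
    <li>Both annual members and day-pass users appear to show a distinct weekly trend.</li>
   </ul>
   <p>Now let's zoom in on the weekly trend and look at how rides are distributed by day of the week. Because the pattern changes around January 2015, we split by both year and day of week:</p>
   <p>In [7]:</p>
   <pre>  <code class="language-python">by_weekday = by_date.groupby([by_date.index.year,
                              by_date.index.dayofweek]).mean()
by_weekday.columns.name = None  # remove label for plot

fig, ax = plt.subplots(1, 2, figsize=(16, 6), sharey=True)
by_weekday.loc[2014].plot(title='Average Use by Day of Week (2014)', ax=ax[0]);
by_weekday.loc[2015].plot(title='Average Use by Day of Week (2015)', ax=ax[1]);
for axi in ax:
    axi.set_xticks(range(7))  # one tick per weekday so the labels line up
    axi.set_xticklabels(['Mon', 'Tues', 'Wed', 'Thurs', 'Fri', 'Sat', 'Sun'])</code></pre>
   <p style="text-align:center"><img src="https://simg.open-open.com/show/293b52ba72ea5623ea82184ec63ec668.jpg"></p>
   <p>We see a complementary pattern: annual members tend to use their bikes on weekdays (that is, as part of a commute), while day-pass users tend to use their bikes on weekends. This pattern did not fully emerge until the start of 2015, especially for annual members: it seems that in the first couple of months, users had not yet built Pronto into their commuting habits.</p>
   <p>It's also interesting to look at average hourly rides on weekdays and weekends. This takes a bit of manipulation:</p>
   <p>In [8]:</p>
   <pre>  <code class="language-python"># count trips by date and by hour
by_hour = trips.pivot_table('trip_id', aggfunc='count',
                            index=['date', 'hour'],
                            columns='usertype').fillna(0).reset_index('hour')

# average these counts by weekend
by_hour['weekend'] = (by_hour.index.dayofweek >= 5)
by_hour = by_hour.groupby(['weekend', 'hour']).mean()
# set_levels returns a new index; recent pandas no longer accepts inplace=True
by_hour.index = by_hour.index.set_levels([['weekday', 'weekend'],
                                          ["{0}:00".format(i) for i in range(24)]])
by_hour.columns.name = None</code></pre>
   <p>Now we can plot the result to see the hourly trend:</p>
   <p>In [9]:</p>
   <pre>  <code class="language-python">fig, ax = plt.subplots(1, 2, figsize=(16, 6), sharey=True)
by_hour.loc['weekday'].plot(title='Average Hourly Use (Mon-Fri)', ax=ax[0])
by_hour.loc['weekend'].plot(title='Average Hourly Use (Sat-Sun)', ax=ax[1])
ax[0].set_ylabel('Average Trips per Hour');</code></pre>
   <p style="text-align:center"><img src="https://simg.open-open.com/show/2a228f78c57ca58515fcaadf4cada029.jpg"></p>
   <p>We see a clear difference between a "commute" pattern, which peaks sharply in the morning and evening, and a "recreational" pattern, which has a broad peak in the afternoon. Interestingly, the weekend behavior of annual members looks almost identical to the weekend behavior of day-pass users.</p>
   <h2>Trip Duration</h2>
   <p>Next, let's look at the duration of trips. Pronto rides are free for up to 30 minutes; any single use longer than that incurs a usage fee of a couple of dollars for the first extra half hour, and roughly ten dollars per hour after that.</p>
   <p>Let's look at the distribution of trip durations for annual members and day-pass users:</p>
   <p>In [10]:</p>
   <pre>  <code class="language-python">trips['minutes'] = trips.tripduration / 60
# density=True replaces the removed matplotlib option normed=True
trips.groupby('usertype')['minutes'].hist(bins=np.arange(61), alpha=0.5, density=True);
plt.xlabel('Duration (minutes)')
plt.ylabel('relative frequency')
plt.title('Trip Durations')
plt.text(34, 0.09, "Free Trips\n\nAdditional Fee", ha='right',
         size=18, rotation=90, alpha=0.5, color='red')
plt.legend(['Annual Members', 'Short-term Pass'])

plt.axvline(30, linestyle='--', color='red', alpha=0.3);</code></pre>
   <p style="text-align:center"><img src="https://simg.open-open.com/show/b82c900a222aee91d8d6a4eda6484f45.jpg"></p>
   <p>Here I have added a red dashed line separating free rides (left) from rides that incur a fee (right). Annual members appear to be much more aware of the system rules: only a small tail of their trip distribution exceeds 30 minutes. Day-pass users, on the other hand, go over the half-hour limit on roughly one trip in four and are charged extra. My guess is that these day-pass users do not fully understand the pricing structure, and may stop using the system after an unhappy experience.</p>
   <h2>Estimating Trip Distances</h2>
   <p>It is also interesting to look at trip distances. The data released by Pronto does not include the distance traveled, so we need to determine it from other sources. Let's start by loading the station data; because some trips begin and end at the Pronto shop, we add it as one more "station":</p>
   <p>In [11]:</p>
   <pre>  <code class="language-python">stations = pd.read_csv('2015_station_data.csv')
pronto_shop = dict(id=54, name="Pronto shop",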
                   terminal="Pronto shop",
                   lat=47.6173156, long=-122.3414776,
                   dockcount=100, online='10/13/2014')
# DataFrame.append was removed in pandas 2.0; concatenate a one-row frame instead
stations = pd.concat([stations, pd.DataFrame([pronto_shop])], ignore_index=True)</code></pre>
   <p>Now we need to find the bicycling distance between pairs of latitude/longitude coordinates. Fortunately, Google Maps has a distance API that we can use for free.</p>
   <p>According to the documentation, free use is limited to 2,500 distances per day and 100 distances per 10 seconds. With 55 stations we have (55 * 54 / 2) = 1485 nonzero distances, so we can query all of the station-to-station distances for free within a few days.</p>
   <p>To do this, we query one row at a time, waiting 10+ seconds between queries (note: we could also use the googlemaps Python package, but that requires obtaining an API key).</p>
   <p>In [12]:</p>
   <pre>  <code class="language-python">from time import sleep

def query_distances(stations=stations):
    """Query the Google API for bicycling distances"""
    latlon_list = ['{0},{1}'.format(lat, long)
                   for (lat, long) in zip(stations.lat, stations.long)]

    def create_url(i):
        URL = ('https://maps.googleapis.com/maps/api/distancematrix/json?'
               'origins={origins}&destinations={destinations}&mode=bicycling')
        return URL.format(origins=latlon_list[i],
                          destinations='|'.join(latlon_list[i + 1:]))

    for i in range(len(latlon_list) - 1):
        url = create_url(i)
        filename = "distances_{0}.json".format(stations.terminal.iloc[i])
        print(i, filename)
        # "!" runs a shell command; this line only works in IPython/Jupyter
        !curl "{url}" -o {filename}
        sleep(11)  # only one query per 10 seconds!

def build_distance_matrix(stations=stations):
    """Build a matrix from the Google API results"""
    import json  # needed to parse the saved API responses
    dist = np.zeros((len(stations), len(stations)), dtype=float)
    for i, term in enumerate(stations.terminal[:-1]):
        filename = 'queried_distances/distances_{0}.json'.format(term)
        with open(filename) as f:
            row = json.load(f)
        # fill the upper triangle: distances from station i to all later stations
        dist[i, i + 1:] = [el['distance']['value'] for el in row['rows'][0]['elements']]
    dist += dist.T  # mirror into the lower triangle
    distances = pd.DataFrame(dist, index=stations.terminal,
                             columns=stations.terminal)
    distances.to_csv('station_distances.csv')
    return distances

# only call this the first time
import os
if not os.path.exists('station_distances.csv'):
    # Note: you can call this function at most ~twice per day!
    query_distances()

    # Move all the queried files into a directory
    # so we don't accidentally overwrite them
    if not os.path.exists('queried_distances'):
        os.makedirs('queried_distances')
    !mv distances_*.json queried_distances

    # Build distance matrix and save to CSV
    distances = build_distance_matrix()</code></pre>
   <p>Here is the first 5x5 block of the distance matrix:</p>
   <p>In [13]:</p>
   <pre>  <code class="language-python">distances = pd.read_csv('station_distances.csv', index_col='terminal')
distances.iloc[:5, :5]</code></pre>
   <p>Out[13]:</p>
   <table>
    <thead>
     <tr>
      <th> </th>
      <th>BT-01</th>
      <th>BT-03</th>
      <th>BT-04</th>
      <th>BT-05</th>
      <th>CBD-13</th>
     </tr>
     <tr>
      <th>terminal</th>
      <th> </th>
      <th> </th>
      <th> </th>
      <th> </th>
      <th> </th>
     </tr>
    </thead>
    <tbody>
     <tr>
      <th>BT-01</th>
      <td>0</td>
      <td>422</td>
      <td>1067</td>
      <td>867</td>
      <td>1342</td>
     </tr>
     <tr>
      <th>BT-03</th>
      <td>422</td>
      <td>0</td>
      <td>838</td>
      <td>445</td>
      <td>920</td>
     </tr>
     <tr>
      <th>BT-04</th>
      <td>1067</td>
      <td>838</td>
      <td>0</td>
      <td>1094</td>
      <td>1121</td>
     </tr>
     <tr>
      <th>BT-05</th>
      <td>867</td>
      <td>445</td>
      <td>1094</td>
      <td>0</td>
      <td>475</td>
     </tr>
     <tr>
      <th>CBD-13</th>
      <td>1342</td>
      <td>920</td>
      <td>1121</td>
      <td>475</td>
      <td>0</td>
     </tr>
    </tbody>
   </table>
   <p>Let's convert these distances to miles and join them to our trip data:</p>
   <p>In [14]:</p>
   <pre>  <code class="language-python">stacked = distances.stack() / 1609.34  # convert meters to miles
stacked.name = 'distance'
trips = trips.join(stacked, on=['from_station_id', 'to_station_id'])</code></pre>
   <p>Now we can plot the distribution of trip distances:</p>
   <p>In [15]:</p>
   <pre>  <code class="language-python">fig, ax = plt.subplots(figsize=(12, 4))
trips.groupby('usertype')['distance'].hist(bins=np.linspace(0, 6.99, 50),
                                           alpha=0.5, ax=ax);
plt.xlabel('Distance between start & end (miles)')
plt.ylabel('relative frequency')
plt.title('Minimum Distance of Trip')
plt.legend(['Annual Members', 'Short-term Pass']);</code></pre>
   <p style="text-align:center"><img src="https://simg.open-open.com/show/62fe74b27738f8a830ac6b4f24eb28a1.jpg"></p>
   <p>Keep in mind that this shows the shortest possible distance between stations, so it is a lower bound on the actual distance ridden on each trip. Many trips (especially those by day-pass users) begin and end within a few blocks. Beyond that, trips peak at around 1 mile, though some users stretch their trips to four miles or more.</p>
   <h2>Rider Speed</h2>
   <p>Given these distances, we can also compute a lower bound on the estimated riding speed. Let's do that, and then look at the distribution of speeds for annual members and day-pass users:</p>
   <p>In [16]:</p>
   <pre>  <code class="language-python">trips['speed'] = trips.distance * 60 / trips.minutes  # miles per hour
# density=True replaces the removed matplotlib option normed=True
trips.groupby('usertype')['speed'].hist(bins=np.linspace(0, 15, 50), alpha=0.5, density=True);
plt.xlabel('lower bound riding speed (MPH)')
plt.ylabel('relative frequency')
plt.title('Rider Speed Lower Bound (MPH)')
plt.legend(['Annual Members', 'Short-term Pass']);</code></pre>
   <p style="text-align:center"><img src="https://simg.open-open.com/show/9d8e88a77a7029461569144abc7cf8d7.jpg"></p>
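   <p>The speed calculation above can be sanity-checked on a few made-up trips. The sketch below is only an illustration: the <code>toy</code> table, its <code>distance_m</code> column, and the helper <code>lower_bound_speed_mph</code> are hypothetical names and values, not part of the Pronto data or the original notebook. It applies the same unit conversions as the article (meters to miles with 1609.34, seconds to hours).</p>

```python
import pandas as pd

METERS_PER_MILE = 1609.34  # same conversion factor used in the article


def lower_bound_speed_mph(distance_m, duration_s):
    """Lower-bound speed in mph from the shortest station-to-station
    distance (meters) and the recorded trip duration (seconds)."""
    miles = distance_m / METERS_PER_MILE
    hours = duration_s / 3600
    return miles / hours


# Hypothetical trips: 1 mile in 10 minutes, 2 miles in 30 minutes,
# and a round trip that starts and ends at the same station.
toy = pd.DataFrame({
    "distance_m": [1609.34, 3218.68, 0.0],
    "tripduration": [600.0, 1800.0, 900.0],
})
toy["speed"] = lower_bound_speed_mph(toy["distance_m"], toy["tripduration"])
print(toy)
```

   <p>The round trip comes out at zero speed even though the rider was moving the whole time, which is why these numbers should be read as lower bounds rather than as actual riding speeds.</p>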
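   <p>Interestingly, the distributions are quite different, with annual members showing a somewhat higher average speed. You might be tempted to conclude that annual members ride faster than day-pass users, but the data alone is not enough to support that conclusion. The same data could also be explained if annual members tend to take the most direct route from point A to point B, while day-pass users tend to meander and reach their destination indirectly. I suspect the reality is a mix of these two effects.</p>
   <p>Let's also look at the relationship between distance and speed:</p>
   <p>In [17]:</p>
   <pre>  <code class="language-python"># FacetGrid's size argument was renamed to height in newer seaborn releases
g = sns.FacetGrid(trips, col="usertype", hue='usertype', height=6)
g.map(plt.scatter, "distance", "speed", s=4, alpha=0.2)

# Add lines and labels
x = np.linspace(0, 10)
g.axes[0, 0].set_ylabel('Lower Bound Speed')
for ax in g.axes.flat:
    ax.set_xlabel('Lower Bound Distance')
    # speed = distance / 0.5 h = 2 * distance marks the 30-minute limit
    ax.plot(x, 2 * x, '--r', alpha=0.3)
    ax.text(9.8, 16.5, "Free Trips\n\nAdditional Fee", ha='right',
            size=18, rotation=39, alpha=0.5, color='red')
    ax.axis([0, 10, 0, 25])</code></pre>
   <p style="text-align:center"><img src="https://simg.open-open.com/show/003733a2e80713bd514581d3f7eb8251.jpg"></p>
   <p>Overall, we see that longer trips tend to be faster, though this is subject to the same lower-bound caveat as above. As before, for reference I have drawn a red line separating trips that incur an additional fee (below the line) from free trips (above the line). Again we see that annual members are much savvier than day-pass users about staying under the half-hour limit: the sharp cutoff in the distribution of points suggests that these users keep track of their time to avoid the extra charge.</p>
   <h2>Elevation</h2>
   <p>One frequently raised concern about the feasibility of bike share in Seattle is that it is a hilly city. Before the service launched, some analysts predicted that there would be a steady flow of bikes from uphill to downhill, making Seattle a poor fit for a bike-share system.</p>
   <p>Elevation data is not included in the data release, but we can again turn to the Google Maps API to get what we need.</p>
   <p>In this case, free use is limited to 2,500 requests per day, with up to 512 elevations per request. Since we only need 55 elevations, we can do it in a single query:</p>
   <p>In [18]:</p>
   <pre>  <code class="language-python">def get_station_elevations(stations):
    """Get station elevations via Google Maps API"""
    URL = "https://maps.googleapis.com/maps/api/elevation/json?locations="
    locs = '|'.join(['{0},{1}'.format(lat, long)
                     for (lat, long) in zip(stations.lat, stations.long)])
    URL += locs
    # "!" runs a shell command; this line only works in IPython/Jupyter
    !curl "{URL}" -o elevations.json


def process_station_elevations():
    """Convert Elevations JSON output to CSV"""
    import json
    D = json.load(open('elevations.json'))
    def unnest(D):
        # flatten the nested 'location' dict into the top-level record
        loc = D.pop('location')
        loc.update(D)
        return loc
    elevs = pd.DataFrame([unnest(item) for item in D['results']])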
    elevs.to_csv('station_elevations.csv')
    return elevs

# only run this the first time:
import os
if not os.path.exists('station_elevations.csv'):
    get_station_elevations(stations)
    process_station_elevations()</code></pre>
   <p>Now let's read in the elevation data:</p>
   <p>In [19]:</p>
   <pre>  <code class="language-python">elevs = pd.read_csv('station_elevations.csv', index_col=0)
elevs.head()</code></pre>
   <p>Out[19]:</p>
   <table>
    <thead>
     <tr>
      <th> </th>
      <th>elevation</th>
      <th>lat</th>
      <th>lng</th>
      <th>resolution</th>
     </tr>
    </thead>
    <tbody>
     <tr>
      <th>0</th>
      <td>37.351780</td>
      <td>47.618418</td>
      <td>-122.350964</td>
      <td>76.351616</td>
     </tr>
     <tr>
      <th>1</th>
      <td>33.815830</td>
      <td>47.615829</td>
      <td>-122.348564</td>
      <td>76.351616</td>
     </tr>
     <tr>
      <th>2</th>
      <td>34.274055</td>
      <td>47.616094</td>
      <td>-122.341102</td>
      <td>76.351616</td>
     </tr>
     <tr>
      <th>3</th>
      <td>44.283257</td>
      <td>47.613110</td>
      <td>-122.344208</td>
      <td>76.351616</td>
     </tr>
     <tr>
      <th>4</th>
      <td>42.460381</td>
      <td>47.610185</td>
      <td>-122.339641</td>
      <td>76.351616</td>
     </tr>
    </tbody>
   </table>
   <p>To validate the result, we double-check that the latitudes and longitudes match:</p>
   <p>In [20]:</p>
   <pre>  <code class="language-python"># double check that locations match
print(np.allclose(stations.long, elevs.lng))
print(np.allclose(stations.lat, elevs.lat))</code></pre>
   <pre>  <code class="language-python">True
True</code></pre>
   <p>Now we can join the elevation data to the trip data:</p>
   <p>In [21]:</p>
   <pre>  <code class="language-python">stations['elevation'] = elevs['elevation']
elevs.index = stations['terminal']  # index elevations by station terminal ID

trips['elevation_start'] = trips.join(elevs, on='from_station_id')['elevation']
trips['elevation_end'] = trips.join(elevs, on='to_station_id')['elevation']
trips['elevation_gain'] = trips['elevation_end'] - trips['elevation_start']</code></pre>
   <p>Let's look at how elevation gain is distributed for each user type:</p>
   <p>In [22]:</p>
   <pre>  <code class="language-python">g = sns.FacetGrid(trips, col="usertype", hue='usertype')
g.map(plt.hist, "elevation_gain", bins=np.arange(-145, 150, 10))
g.fig.set_figheight(6)
g.fig.set_figwidth(16);

# plot some lines to guide the eye
for lim in range(60, 150, 20):
    x = np.linspace(-lim, lim, 3)
    for ax in g.axes.flat:
        ax.fill(x, 100 * (lim - abs(x)),
                color='gray', alpha=0.1, zorder=0)</code></pre>
   <p style="text-align:center"><img src="https://simg.open-open.com/show/f811ad4012072803bbbe09bdbd3f12f9.jpg"></p>
   <p>We have drawn some shading in the background to help guide the eye. There is a big difference between annual members and day-pass users: annual members show a clear preference for downhill trips (the left side of the distribution), while day-pass users show this much less strongly and instead prefer rides that start and end at the same elevation.</p>
   <p>To make the effect of elevation more concrete, let's do some counting:</p>
   <p>In [23]:</p>
   <pre>  <code class="language-python">print("total downhill trips:", (trips.elevation_gain < 0).sum())
print("total uphill trips:  ", (trips.elevation_gain > 0).sum())</code></pre>
   <pre>  <code class="language-python">total downhill trips: 80532
total uphill trips:   50493</code></pre>
   <p>We see that the first year had about 30,000 more downhill trips than uphill trips, which is about 60% more. At current usage levels, this means Pronto staff must shuttle roughly 100 bikes per day from low-lying stations to stations at higher elevation.</p>
   <h2>Weather</h2>
   <p>Another common argument against the feasibility of bike share is the weather. Let's look at how the number of rides changes with temperature and precipitation.</p>
   <p>Fortunately, the data release includes a wide range of weather data:</p>
   <p>In [24]:</p>
   <pre>  <code class="language-python">weather = pd.read_csv('2015_weather_data.csv', index_col='Date', parse_dates=True)
weather.columns</code></pre>
   <p>Out[24]:</p>
   <pre>  <code class="language-python">Index(['Max_Temperature_F', 'Mean_Temperature_F', 'Min_TemperatureF',
       'Max_Dew_Point_F', 'MeanDew_Point_F', 'Min_Dewpoint_F', 'Max_Humidity',
       'Mean_Humidity ', 'Min_Humidity ', 'Max_Sea_Level_Pressure_In ',
       'Mean_Sea_Level_Pressure_In ', 'Min_Sea_Level_Pressure_In ',
       'Max_Visibility_Miles ', 'Mean_Visibility_Miles ',
       'Min_Visibility_Miles ', 'Max_Wind_Speed_MPH ',
       'Mean_Wind_Speed_MPH ',
       'Max_Gust_Speed_MPH', 'Precipitation_In ', 'Events'],
      dtype='object')</code></pre>
   <p>Let's join the weather data to the trip data:</p>
   <p>In [25]:</p>
   <pre>  <code class="language-python">by_date = trips.groupby(['date', 'usertype'])['trip_id'].count()
by_date.name = 'count'
by_date = by_date.reset_index('usertype').join(weather)</code></pre>
   <p>Now we can look at how the number of rides changes with temperature and precipitation, split by weekday and weekend:</p>
   <p>In [26]:</p>
   <pre>  <code class="language-python"># add a flag indicating weekend
by_date['weekend'] = (by_date.index.dayofweek >= 5)

#----------------------------------------------------------------
# Plot Temperature Trend
# (FacetGrid's size argument was renamed to height in newer seaborn releases)
g = sns.FacetGrid(by_date, col="weekend", hue='usertype', height=6)
g.map(sns.regplot, "Mean_Temperature_F", "count")
g.add_legend();

# do some formatting
g.axes[0, 0].set_title('')
g.axes[0, 1].set_title('')
g.axes[0, 0].text(0.05, 0.95, 'Monday - Friday', va='top', size=14,
                  transform=g.axes[0, 0].transAxes)
g.axes[0, 1].text(0.05, 0.95, 'Saturday - Sunday', va='top', size=14,
                  transform=g.axes[0, 1].transAxes)
g.fig.text(0.45, 1, "Trend With Temperature", ha='center', va='top', size=16);

#----------------------------------------------------------------
# Plot Precipitation
# (the column name really does end with a trailing space)
g = sns.FacetGrid(by_date, col="weekend", hue='usertype', height=6)
g.map(sns.regplot, "Precipitation_In ", "count")
g.add_legend();

# do some formatting
g.axes[0, 0].set_ylim(-50, 600);
g.axes[0, 0].set_title('')
g.axes[0, 1].set_title('')
g.axes[0, 0].text(0.95, 0.95, 'Monday - Friday', ha='right', va='top', size=14,
                  transform=g.axes[0, 0].transAxes)
g.axes[0, 1].text(0.95, 0.95, 'Saturday - Sunday', ha='right', va='top', size=14,
                  transform=g.axes[0, 1].transAxes)
g.fig.text(0.45, 1, "Trend With Precipitation", ha='center', va='top', size=16);</code></pre>
   <p style="text-align:center"><img
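src="https://simg.open-open.com/show/b71d04c6257f42c2e819b6d5c29d7ef4.jpg"> <img src="https://simg.open-open.com/show/fedd152f94a04d2236d455e90fee30be.jpg"></p>
   <p>For the effect of weather, we see a clear trend: people prefer warm, sunny weather. But there are some interesting details: on weekdays, all users are affected by the weather, whereas on weekends annual members are affected less. I don't have a good theory for why this trend appears; if you have ideas, please share them.</p>
   <h2>Summary</h2>
   <p>From the analyses above, I think we can draw a few conclusions from this data:</p>
   <ul>
    <li>Annual members and day-pass users behave differently overall: annual members mostly use Pronto for weekday commutes, while day-pass users use Pronto on weekends to explore particular parts of the city.</li>
    <li>Although annual members seem to understand the pricing structure, one in four day-pass trips still exceeds the half-hour limit and incurs additional fees. For the sake of its customers, Pronto should do a better job of explaining this pricing structure to users.</li>
    <li>Elevation and weather affect use just as you would expect: there are 60% more downhill trips than uphill trips, and cold and rain noticeably reduce the number of rides on a given day.</li>
   </ul>
   <p> </p>
   <p>Source: http://ipfans.github.io/2017/02/analyzing-pronto-cycleshare-data-with-python-and-pandas/</p>
   <p> </p>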
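   <p>The uphill and downhill figures quoted in the elevation section and repeated in the summary can be rechecked with a few lines of arithmetic. This is a rough sketch using only the two counts printed by the notebook; the even spread over a 365-day year is my own simplifying assumption, not something stated in the article.</p>

```python
downhill = 80532  # "total downhill trips" printed by the notebook
uphill = 50493    # "total uphill trips" printed by the notebook

excess = downhill - uphill        # extra downhill trips in the first year
relative_excess = excess / uphill  # excess as a fraction of uphill trips
per_day = excess / 365             # assumption: spread evenly over 365 days

print("excess downhill trips:", excess)  # 30039
print("relative excess: {:.1%}".format(relative_excess))
print("bikes to move uphill per day: {:.0f}".format(per_day))
```

   <p>The relative excess lands close to the "about 60%" quoted in the text. The per-day figure depends on how many days are counted and on how trips are spread across stations, so it is best read as the same order of magnitude as the article's estimate of roughly 100 bikes per day.</p>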