In this report, I will do some data preprocessing and then get some basic information about the dataset, New York Citi Bike Trip Histories, via tools.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.basemap import Basemap
import json
import zipfile
import urllib.request
import itertools
from sklearn import metrics
from geopy.distance import vincenty
from sklearn import linear_model
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline

Dataset

Here we’ll use 201707-citibike-tripdata.csv.zip only.

Schema

The data includes:

  • Trip Duration (seconds)
  • Start Time and Date
  • Stop Time and Date
  • Start Station Name
  • End Station Name
  • Station ID
  • Station Lat/Long
  • Bike ID
  • User Type (Customer = 24-hour pass or 3-day pass user; Subscriber = Annual Member)
  • Gender (Zero=unknown; 1=male; 2=female)
  • Year of Birth

This data has been processed to remove trips that are taken by staff as they service and inspect the system, trips that are taken to/from any of our “test” stations (which we were using more in June and July 2013), and any trips that were below 60 seconds in length (potentially false starts or users trying to re-dock a bike to ensure it’s secure).

urllib.request.urlretrieve("https://s3.amazonaws.com/tripdata/201707-citibike-tripdata.csv.zip", "data.zip")

with zipfile.ZipFile("data.zip","r") as zip_ref:
    zip_ref.extractall("./data")
from os import walk

for (dirpath, dirnames, filenames) in walk("./data"):
    print(filenames)
['201707-citibike-tripdata.csv']
df = pd.read_csv("./data/201707-citibike-tripdata.csv")
df.head()
tripduration starttime stoptime start station id start station name start station latitude start station longitude end station id end station name end station latitude end station longitude bikeid usertype birth year gender
0 364 2017-07-01 00:00:00 2017-07-01 00:06:05 539 Metropolitan Ave & Bedford Ave 40.715348 -73.960241 3107 Bedford Ave & Nassau Ave 40.723117 -73.952123 14744 Subscriber 1986.0 1
1 2142 2017-07-01 00:00:03 2017-07-01 00:35:46 293 Lafayette St & E 8 St 40.730207 -73.991026 3425 2 Ave & E 104 St 40.789210 -73.943708 19587 Subscriber 1981.0 1
2 328 2017-07-01 00:00:08 2017-07-01 00:05:37 3242 Schermerhorn St & Court St 40.691029 -73.991834 3397 Court St & Nelson St 40.676395 -73.998699 27937 Subscriber 1984.0 2
3 2530 2017-07-01 00:00:11 2017-07-01 00:42:22 2002 Wythe Ave & Metropolitan Ave 40.716887 -73.963198 398 Atlantic Ave & Furman St 40.691652 -73.999979 26066 Subscriber 1985.0 1
4 2534 2017-07-01 00:00:15 2017-07-01 00:42:29 2002 Wythe Ave & Metropolitan Ave 40.716887 -73.963198 398 Atlantic Ave & Furman St 40.691652 -73.999979 29408 Subscriber 1982.0 2
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1735599 entries, 0 to 1735598
Data columns (total 15 columns):
tripduration               int64
starttime                  object
stoptime                   object
start station id           int64
start station name         object
start station latitude     float64
start station longitude    float64
end station id             int64
end station name           object
end station latitude       float64
end station longitude      float64
bikeid                     int64
usertype                   object
birth year                 float64
gender                     int64
dtypes: float64(5), int64(5), object(5)
memory usage: 198.6+ MB

Preprocess

Missing Values & Anomaly Detection

There might be some noise in the dataset, like strange stations or null values. Please detect it and take proper actions to them.

df.shape
(1735599, 15)
df.isnull().sum()
tripduration                    0
starttime                       0
stoptime                        0
start station id                0
start station name              0
start station latitude          0
start station longitude         0
end station id                  0
end station name                0
end station latitude            0
end station longitude           0
bikeid                          0
usertype                        0
birth year                 228596
gender                          0
dtype: int64

The data is clean that we can simply drop the column "birth year" since we are not going to use this feature anyway.

Here I ues the dropna function to drop any columns with any missing values.

df = df.dropna(axis=1, how='any')
df.shape
(1735599, 14)

Next, to detect and eliminate strange values, we plot all the histograms on numeric values.

fig, axes = plt.subplots(nrows=9, ncols=2, figsize=(15,15))

i = 1
for col in df.columns:
    if df[col].dtype == np.float64 or df[col].dtype == np.int64:
        ax = plt.subplot(9, 2, i)
        df[col].hist(bins=30)
        ax.set_title(col)
        i += 1
        ax = plt.subplot(9, 2, i)
        df[col].hist(bins=30)
        ax.set_title(col+" (log scale)")
        ax.set_yscale('log')
        i += 1

fig.tight_layout()
plt.show()

There are some values that are strange:

  1. The maximum value of trip duration is 2500000 sec(s), which is about 28 days. This is very uncommon for a person to rent a citibike for such a long time.
  2. There is a station that is so far away from other stations. (See the rightmost balue of end station latitude/longitude.)

Here I would deal with the first problem by removing any records whose trip duration is over 20 days.

df = df[df['trip duration'] <= 24*60*60*20]

Now we check if all stations have one-to-one matching among their id, name, latitude, and longitude.

  • station id <—> station name
x1 = len(df['start station id'].unique())
y1 = len(df[['start station id', 'start station name']].drop_duplicates())
x2 = len(df['end station id'].unique())
y2 = len(df[['end station id', 'end station name']].drop_duplicates())

x1 == y1 and x2 == y2
True
  • station id <—> station latitude
x1 = len(df['start station id'].unique())
y2 = len(df[['start station id', 'start station latitude']].drop_duplicates())
x2 = len(df['end station id'].unique())
y2 = len(df[['end station id', 'end station latitude']].drop_duplicates())

x1 == y1 and x2 == y2
True
  • station id <—> station longitude
x1 = len(df['start station id'].unique())
y2 = len(df[['start station id', 'start station longitude']].drop_duplicates())
x2 = len(df['end station id'].unique())
y2 = len(df[['end station id', 'end station longitude']].drop_duplicates())

x1 == y1 and x2 == y2
True

We can also draw all stations on map to see if there is any strange location

Here I use the Basemap package of matplotlib to draw the map. (See examples here)

t1 = df[['start station id', 'start station name', 'start station latitude', 'start station longitude']] \
            .drop_duplicates().rename(columns = {'start station id':'station id', \
                                                 'start station name':'station name', \
                                                 'start station latitude':'station latitude', 
                                                 'start station longitude': 'station longitude'})
t2 = df[['end station id', 'end station name', 'end station latitude', 'end station longitude']] \
        .drop_duplicates().rename(columns = {'end station id':'station id', \
                                             'end station name':'station name', \
                                             'end station latitude':'station latitude', \
                                             'end station longitude': 'station longitude'})
df_loc = pd.concat([t1, t2]).drop_duplicates()
# Initialize plots
fig, ax = plt.subplots(figsize=(15,15))

# determine range to print based on min, max lat and lon of the data
lat = list(df_loc['station latitude'])
lon = list(df_loc['station longitude'])
text = list(df_loc['station id'])
margin = 0.01 # buffer to add to the range
lat_min = min(lat) - margin
lat_max = max(lat) + margin
lon_min = min(lon) - margin
lon_max = max(lon) + margin

# create map using BASEMAP
m = Basemap(llcrnrlon=lon_min,
            llcrnrlat=lat_min,
            urcrnrlon=lon_max,
            urcrnrlat=lat_max,
            lat_0=(lat_max - lat_min)/2,
            lon_0=(lon_max - lon_min)/2,
            projection='lcc',
            resolution = 'f',)

m.drawcoastlines()
m.fillcontinents(lake_color='aqua')
m.drawmapboundary(fill_color='aqua')
m.drawrivers()

# convert lat and lon to map projection coordinates
lons, lats = m(lon, lat)

# plot points as red dots
ax.scatter(lons, lats, marker = 'o', color='r', zorder=5, alpha=0.6)
for i in range(df_loc.shape[0]):
    plt.text(lons[i], lats[i], text[i])
plt.show()

The station “3254” and “3182” seems weird, however, these 2 stations are actually on an island named “Governors Island”.

And the station “3036”, which is at the upper-right corner of the map, does not exist actually, whereas station “3201” and “3192” are confirmed to exist.

(The way that I confirm the above information is to check the station list on the official website & Googe Map.)

So we need to remove data related to this station.

df = df[df['start station id']!=3036]
df = df[df['end station id']!=3036]
df_loc = df_loc[df_loc['station id']!=3036]
df_loc.to_csv("data/station_information.csv", index=None)

So now I want to check if all station ids mentioned in this dataframe exists.

Create Self-defined Features

For future use, we need to calculate in-flow and out-flow for each stations every half hour. The result data set should contains station_id, time, in_flow_count, out_flow_count.

in/out flow of a station are define as the number of trips move to/from the station within the 30 minutes period. So one day can be splitted into 48 segments.

To split one day into 48 segemnts, first we need to transform column "starttime" and "stoptime" into datetime format. (Use the to_datetime function, specifying a format to match the data.)

# format example: 2017-07-01 00:00:00
df['starttime'] = pd.to_datetime(df['starttime'], format='%Y-%m-%d %H:%M:%S')
df['stoptime'] =pd.to_datetime(df['stoptime'], format='%Y-%m-%d %H:%M:%S')
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1735598 entries, 0 to 1735598
Data columns (total 14 columns):
tripduration               int64
starttime                  datetime64[ns]
stoptime                   datetime64[ns]
start station id           int64
start station name         object
start station latitude     float64
start station longitude    float64
end station id             int64
end station name           object
end station latitude       float64
end station longitude      float64
bikeid                     int64
usertype                   object
gender                     int64
dtypes: datetime64[ns](2), float64(4), int64(5), object(3)
memory usage: 198.6+ MB

Now we need to annotate the data according to time segment: (Use DatetimeIndex to retrieve date&time information)

def gen_time_segment(dt):
    if dt.minute < 30:
        minute = "%02d" % 0
    else:
        minute = "%02d" % 30
    return "{}-{}-{} {}:{}".format(dt.year, dt.month, dt.day, dt.hour, minute)

df['start_seg'] = [gen_time_segment(dt) for dt in df['starttime']]
df['stop_seg'] = [gen_time_segment(dt) for dt in df['stoptime']]

df[['start station id', 'starttime', 'start_seg', 'end station id', 'stoptime', 'stop_seg']].head()
start station id starttime start_seg end station id stoptime stop_seg
0 539 2017-07-01 00:00:00 2017-7-1 0:00 3107 2017-07-01 00:06:05 2017-7-1 0:00
1 293 2017-07-01 00:00:03 2017-7-1 0:00 3425 2017-07-01 00:35:46 2017-7-1 0:30
2 3242 2017-07-01 00:00:08 2017-7-1 0:00 3397 2017-07-01 00:05:37 2017-7-1 0:00
3 2002 2017-07-01 00:00:11 2017-7-1 0:00 398 2017-07-01 00:42:22 2017-7-1 0:30
4 2002 2017-07-01 00:00:15 2017-7-1 0:00 398 2017-07-01 00:42:29 2017-7-1 0:30

Then group and count the data according to time annotations:

  • in-flow
inflow = df[['end station id', 'stop_seg']] \
            .groupby(['end station id', 'stop_seg']) \
            .size().reset_index(name='counts') \
            .rename(columns={'end station id':'station id','stop_seg':'time', 'counts':'in_flow_count'})
  • out-flow
outflow = df[['start station id', 'start_seg']] \
            .groupby(['start station id', 'start_seg']) \
            .size().reset_index(name='counts') \
            .rename(columns={'start station id':'station id','start_seg':'time', 'counts':'out_flow_count'})

In the end, merge the inflow and outflow dataframes, considering every time segment at every station.

station_id_list = list(df_loc['station id'])

# Create combinations of time series and station ids
time_seg_list = list(pd.date_range("2017-07-01 00:00:00", "2017-07-31 23:30:00", freq="30min"))
template = pd.DataFrame(list(itertools.product(station_id_list, time_seg_list)), \
                        columns=["station id", "time"])

# Merge in/out flow information & Add zeros to missing data according to every time segment
dat = pd.merge(inflow, outflow, on=['station id', 'time'], how='outer')
dat['time'] = pd.to_datetime(dat['time'], format='%Y-%m-%d %H:%M')
dat = dat.merge(template, on=["station id", "time"], how="right").fillna(0)
dat.head()
station id time in_flow_count out_flow_count
0 72 2017-07-01 00:00:00 1.0 0.0
1 72 2017-07-01 10:00:00 1.0 0.0
2 72 2017-07-01 10:30:00 7.0 7.0
3 72 2017-07-01 11:00:00 1.0 1.0
4 72 2017-07-01 12:00:00 2.0 6.0
dat.to_csv("data/station_flow.csv", index=None)

Query

How many stations are there in this dataset,and what is the average distance between them?

\[average~distance = \frac {\sum_{i!=j} dist(S_i,S_j)}{\frac{N(N-1)}{2}}\]
print("{} stations are found in this dataset.".format(len(station_id_list)))
633 stations are found in this dataset.

To calculate the distance between stations, I use the function vincenty in package GeoPy.

# Create dictionaries for station latitude/longitude
lat_dic = {}
lon_dic = {}
for index, row in df_loc.iterrows():
    lat_dic[row['station id']] = row['station latitude']
    lon_dic[row['station id']] = row['station longitude']

# Generate combinations of pairs of station
c = itertools.combinations(station_id_list, 2)

# Calculate the averge distance of pairs of stations
dist = 0
count = 0
for stn1, stn2 in c:
        dist += vincenty((lat_dic[stn1], lon_dic[stn1]), (lat_dic[stn2], lon_dic[stn2])).meters
        count += 1
print("The average distance between different stations is {} (meters)".format(dist/count))
The average distance between different stations is 5393.467456938468 (meters)

What are the top 3 frequent stations pairs (start station, end station) in weekdays, how about in weekends?

\[(S_i,S_j)~is~not~(S_j,S_i)\]
# Split the dataframe into weekdays information & weekends information
df_weekdays = df[df['starttime'].dt.dayofweek < 5]
df_weekends = df[df['starttime'].dt.dayofweek >= 5]

# Count and sort station pair frequencies
stn_pair_weekdays = df_weekdays[['start station id', 'end station id']] \
                    .groupby(['start station id', 'end station id']) \
                    .size().reset_index(name='counts') \
                    .set_index(['start station id', 'end station id']) \
                    .sort_values(by='counts', ascending=False)
stn_pair_weekends = df_weekends[['start station id', 'end station id']] \
                    .groupby(['start station id', 'end station id']) \
                    .size().reset_index(name='counts') \
                    .set_index(['start station id', 'end station id']) \
                    .sort_values(by='counts', ascending=False)
            
# Find the top 3 station pairs for weekday & weekend
top_weekday_pair = list(stn_pair_weekdays.head(3).index)
top_weekend_pair = list(stn_pair_weekends.head(3).index)

# Print out the result
print("The top 3 frequent stations pairs in weekdays are: {}, {}, and {}.".format(*top_weekday_pair))
print("The top 3 frequent stations pairs in weekends are: {}, {}, and {}.".format(*top_weekend_pair))
The top 3 frequent stations pairs in weekdays are: (432, 3263), (2006, 2006), and (281, 281).
The top 3 frequent stations pairs in weekends are: (3182, 3182), (3182, 3254), and (3254, 3182).

Find the top 3 stations with highest average out-flow, and top 3 highest average in-flow

# Sort the average in/out flow count of each station
average_inflow = dat[['station id', 'in_flow_count']] \
                .groupby(['station id']) \
                .mean() \
                .sort_values(by='in_flow_count', ascending=False)
average_outflow = dat[['station id', 'out_flow_count']] \
                .groupby(['station id']) \
                .mean() \
                .sort_values(by='out_flow_count', ascending=False)
            
# List the top 3 stations
top_inflow = list(average_inflow.head(3).index)
top_outflow = list(average_outflow.head(3).index)

# Print out the result
print("The top 3 stations with highest outflow are: {}, {}, and {}".format(*top_outflow))
print("The top 3 stations with highest inflow are: {}, {}, and {}".format(*top_inflow))
The top 3 stations with highest outflow are: 519, 426, and 514
The top 3 stations with highest inflow are: 426, 519, and 514
# Sum up in/out flow at each time station
dat['flow_count'] = dat['in_flow_count'] + dat['out_flow_count']

# Calculate and sort the average flow count for each station
average_flow = dat[['station id', 'flow_count']] \
                .groupby(['station id']) \
                .mean() \
                .sort_values(by='flow_count', ascending=False)
            
# Find the top 1 station
top_flow = list(average_inflow.head(1).index)

# Print out the result
print("The most popular station is: {}".format(*top_outflow))
The most popular station is: 519

a. Draw the in-flow($A$) and out-flow($B$) for that station in a line chart

Here I use the plot function of pandas.

# Select station & add information in missing time
small_df = dat[dat['station id'] == 519].sort_values(by='time')
small_df = small_df.sort_values(by='time')

# Plot line chart
small_df.plot(x='time', y=['in_flow_count', 'out_flow_count'], kind='line', figsize=(15,15))
plt.show()

b. Calculate the distance function between $A$ and $B$

Here I use pairwise_distance of scikit-learn, with metrics as “euclidean distance”.

dist = metrics.pairwise_distances([small_df['in_flow_count']], [small_df['out_flow_count']], metric='euclidean')
print("The euclidean distance between in-flow and out-flow of this station is: {}".format(dist[0][0]))
The euclidean distance between in-flow and out-flow of this station is: 178.5553135585721

c. Calculate the distance function between $A-mean(A)$ and $B-mean(B)$ , and draw them both.

# Substract the mean
small_df['in_flow_diff'] = small_df['in_flow_count'] - small_df['in_flow_count'].mean()
small_df['out_flow_diff'] = small_df['out_flow_count'] - small_df['out_flow_count'].mean()

# Plot line chart
small_df.plot(x='time', y=['in_flow_diff', 'out_flow_diff'], kind='line', figsize=(15,15))
plt.show()

Now calculate the euclidean distance between the in-flow and the out-flow again.

dist = metrics.pairwise_distances([small_df['in_flow_diff']], [small_df['out_flow_diff']], metric='euclidean')
print("The euclidean distance between in-flow and out-flow of this station is: {}".format(dist[0][0]))
The euclidean distance between in-flow and out-flow of this station is: 178.27760158193377

d. Calculate the distance function between $(A-mean(A))/std(A)$ and $(B-mean(B))/std(B)$ , and draw them both.

# Calculate variance
small_df['in_flow_variance'] = small_df['in_flow_diff'] / small_df['in_flow_count'].std()
small_df['out_flow_variance'] = small_df['out_flow_diff'] / small_df['out_flow_count'].std()

# Plot line chart
small_df.plot(x='time', y=['in_flow_variance', 'out_flow_variance'], kind='line', figsize=(15,15))
plt.show()

Now calculate the euclidean distance between the in-flow and the out-flow again.

dist = metrics.pairwise_distances([small_df['in_flow_variance']], [small_df['out_flow_variance']], metric='euclidean')
print("The euclidean distance between in-flow and out-flow of this station is: {}".format(dist[0][0]))
The euclidean distance between in-flow and out-flow of this station is: 12.27694023559983

e. Calculate the distance function between ${A_i−f(i)|A_i \in A}$ and ${B_i −f(i)|B_i \in B}$,and draw them both.

\[f~is~the~linear~function~that~minimize~\sum(A_i−f(i))^2\]

First, we need to fit the linear regression models $f(i)$ for both $A$ and $B$, using Linear Regression in Generalized Linear Models.

# Prepare input for linear regression model
small_df.sort_values(by='time', ascending=True)
length = small_df.shape[0]
time = np.arange(length).reshape(length, 1)

# Create and fit the models
reg_A = linear_model.LinearRegression()
reg_A.fit(time, list(small_df['in_flow_count']))
reg_B = linear_model.LinearRegression()
reg_B.fit(time, list(small_df['out_flow_count']))

# Save the prediction results
small_df['in_flow_linear'] = y_pred_A = reg_A.predict(time)
small_df['out_flow_linear'] = y_pred_B = reg_B.predict(time)

# Plot the fit reuslt
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(15,10))
ax = plt.subplot(1, 2, 1)
plt.scatter(time, small_df['in_flow_count'].values.reshape(length, 1), color='black', alpha=0.1)
plt.plot(time, y_pred_A, color='blue', linewidth=3)
ax = plt.subplot(1, 2, 2)
plt.scatter(time, small_df['out_flow_count'].values.reshape(length, 1), color='black', alpha=0.1)
plt.plot(time, y_pred_B, color='blue', linewidth=3)

Then we can calculate ${A_i−f(i) | A_i \in A}$ and ${B_i −f(i) | B_i \in B}$, and draw them both.

# Calculate distance to the line drawn by linear model
small_df['in_flow_ols'] = small_df['in_flow_count'] - small_df['in_flow_linear']
small_df['out_flow_ols'] = small_df['out_flow_count'] - small_df['out_flow_linear']

# Plot line chart
small_df.plot(x='time', y=['in_flow_ols', 'out_flow_ols'], kind='line', figsize=(15,15))
plt.show()

Now calculate the euclidean distance between the in-flow and the out-flow again.

dist = metrics.pairwise_distances([small_df['in_flow_ols']], [small_df['out_flow_ols']], metric='euclidean')
print("The euclidean distance between in-flow and out-flow of this station is: {}".format(dist[0][0]))
The euclidean distance between in-flow and out-flow of this station is: 178.27208810218377

f. Calculate the distance function between $Smooth(A)$ and $Smooth(B)$ ,and draw them both.

You can choose any smoothing function,just specify it, or simply use

\[A_i = \frac{A_{i−1}+A_i+A_{i+1}}{3}\]

(or take the average of 5,7,9… elements)

Here I use the function rolling in pandas to take the average mean.

  • window = 9, as I take the average of 9 elements.
  • min_periods = 1 to avoid NA values at the head/tail of the time series.
  • center = True to make it match correctly to the time.
# Calculate the moving averages
small_df['in_flow_smooth'] = small_df['in_flow_count'].rolling(window=9, min_periods=1, center=True).mean()
small_df['out_flow_smooth'] = small_df['out_flow_count'].rolling(window=9, min_periods=1, center=True).mean()

# Plot line chart
small_df.plot(x='time', y=['in_flow_smooth', 'out_flow_smooth'], kind='line', figsize=(15,15))
plt.show()

Now calculate the euclidean distance between the in-flow and the out-flow again.

dist = metrics.pairwise_distances([small_df['in_flow_smooth']], [small_df['out_flow_smooth']], metric='euclidean')
print("The euclidean distance between in-flow and out-flow of this station is: {}".format(dist[0][0]))
The euclidean distance between in-flow and out-flow of this station is: 63.427226939887035

Visualize the flows of citibikes over time

First, annotate each record as in Early-July, Mid-July, or Late-July.

def gen_time_group(dt):
    if dt.day <= 10:
        return "Early-July"
    elif dt.day <= 20:
        return "Mid-July"
    else:
        return "Late-July"
# Calculate and sort the average flow count for each station
flow = dat[['station id', 'time', 'flow_count']] 

# Create time group
flow['time_group'] = [gen_time_group(dt) for dt in flow['time']]

# Summarise flow count according to time group
flow = flow.groupby(["station id", "time_group"], as_index=False) \
            .agg({'flow_count': 'sum'})

# Add latitude/logitude columns
flow['latitude'] = [lat_dic[x] for x in flow['station id']]
flow['longitude'] = [lon_dic[x] for x in flow['station id']]

flow.head()
station id time_group flow_count latitude longitude
0 72 Early-July 2369.0 40.767272 -73.993929
1 72 Late-July 2989.0 40.767272 -73.993929
2 72 Mid-July 2683.0 40.767272 -73.993929
3 79 Early-July 1580.0 40.719116 -74.006667
4 79 Late-July 2282.0 40.719116 -74.006667

Then plot every popular stations on map.

Note that the size of the station point is calculated as

\[Size = 2^{flow}\] \[flow = \frac{sum~of~flow~count~in~this~time~period}{1000}\]
def plot_stations_map(ax, stns, noText=False):
    # determine range to print based on min, max lat and lon of the data
    lat = list(stns['latitude'])
    lon = list(stns['longitude'])
    siz = [(2)**(x/1000) for x in stns['flow_count']]
    margin = 0.01 # buffer to add to the range
    lat_min = min(lat) - margin
    lat_max = max(lat) + margin
    lon_min = min(lon) - margin
    lon_max = max(lon) + margin

    # create map using BASEMAP
    m = Basemap(llcrnrlon=lon_min,
                llcrnrlat=lat_min,
                urcrnrlon=lon_max,
                urcrnrlat=lat_max,
                lat_0=(lat_max - lat_min)/2,
                lon_0=(lon_max - lon_min)/2,
                projection='lcc',
                resolution = 'f',)

    m.drawcoastlines()
    m.fillcontinents(lake_color='aqua')
    m.drawmapboundary(fill_color='aqua')
    m.drawrivers()

    # convert lat and lon to map projection coordinates
    lons, lats = m(lon, lat)

    # plot points as red dots
    if noText:
        ax.scatter(lons, lats, marker = 'o', color='r', zorder=5, alpha=0.6, s=1)
        return
    else:
        ax.scatter(lons, lats, marker = 'o', color='r', zorder=5, alpha=0.3, s=siz)
    
    # annotate popular stations
    for i in range(len(siz)):
        if siz[i] >= 2**6:
            plt.text(lons[i], lats[i], text[i])
pop_flow = flow[flow['flow_count'] > 2000]
fig, axes = plt.subplots(nrows=1, ncols=4, figsize=(15,15))
ax = plt.subplot(1, 4, 1)
ax.set_title("All Stations")
plot_stations_map(ax, flow, noText=True)
ax = plt.subplot(1, 4, 2)
ax.set_title("Popular Stations in Early July")
plot_stations_map(ax, pop_flow[pop_flow['time_group'] == "Early-July"])
ax = plt.subplot(1, 4, 3)
ax.set_title("Popular Stations in Mid July")
plot_stations_map(ax, pop_flow[pop_flow['time_group'] == "Mid-July"])
ax = plt.subplot(1, 4, 4)
ax.set_title("Popular Stations in Late July")
plot_stations_map(ax, pop_flow[pop_flow['time_group'] == "Late-July"])

Here are my observations:

  1. Few people ride citibikes in Jersey City and Brooklyn, that is, the main flow of citibike concentrates in Manhattan.
  2. The poular stations in Brooklyn are nearly the same across the whole month.
  3. Although the top popular regions (locations with large red circle) look similar, the top popular stations are not the same across the whole month. (Check station id carefully you’ll see they are not the same.)
  4. For popular regions, more and more people ride citibikes in Mid July and Late July.