Data Preprocessing and Exploring the New York City Bike Dataset
In this report, I will do some data preprocessing and then extract some basic information from the dataset, New York Citi Bike Trip Histories.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.basemap import Basemap
import json
import zipfile
import urllib.request
import itertools
from sklearn import metrics
from geopy.distance import vincenty
from sklearn import linear_model
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline
Dataset
Here we’ll use `201707-citibike-tripdata.csv.zip` only.
Schema
The data includes:
- Trip Duration (seconds)
- Start Time and Date
- Stop Time and Date
- Start Station Name
- End Station Name
- Station ID
- Station Lat/Long
- Bike ID
- User Type (`Customer` = 24-hour pass or 3-day pass user; `Subscriber` = Annual Member)
- Gender (`0` = unknown; `1` = male; `2` = female)
- Year of Birth
According to the provider, this data has been processed to remove trips taken by staff as they service and inspect the system, trips to/from any of their “test” stations (which were used more in June and July 2013), and any trips below 60 seconds in length (potentially false starts, or users trying to re-dock a bike to ensure it’s secure).
urllib.request.urlretrieve("https://s3.amazonaws.com/tripdata/201707-citibike-tripdata.csv.zip", "data.zip")
with zipfile.ZipFile("data.zip","r") as zip_ref:
zip_ref.extractall("./data")
from os import walk
for (dirpath, dirnames, filenames) in walk("./data"):
print(filenames)
['201707-citibike-tripdata.csv']
df = pd.read_csv("./data/201707-citibike-tripdata.csv")
df.head()
 | tripduration | starttime | stoptime | start station id | start station name | start station latitude | start station longitude | end station id | end station name | end station latitude | end station longitude | bikeid | usertype | birth year | gender
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | 364 | 2017-07-01 00:00:00 | 2017-07-01 00:06:05 | 539 | Metropolitan Ave & Bedford Ave | 40.715348 | -73.960241 | 3107 | Bedford Ave & Nassau Ave | 40.723117 | -73.952123 | 14744 | Subscriber | 1986.0 | 1 |
1 | 2142 | 2017-07-01 00:00:03 | 2017-07-01 00:35:46 | 293 | Lafayette St & E 8 St | 40.730207 | -73.991026 | 3425 | 2 Ave & E 104 St | 40.789210 | -73.943708 | 19587 | Subscriber | 1981.0 | 1 |
2 | 328 | 2017-07-01 00:00:08 | 2017-07-01 00:05:37 | 3242 | Schermerhorn St & Court St | 40.691029 | -73.991834 | 3397 | Court St & Nelson St | 40.676395 | -73.998699 | 27937 | Subscriber | 1984.0 | 2 |
3 | 2530 | 2017-07-01 00:00:11 | 2017-07-01 00:42:22 | 2002 | Wythe Ave & Metropolitan Ave | 40.716887 | -73.963198 | 398 | Atlantic Ave & Furman St | 40.691652 | -73.999979 | 26066 | Subscriber | 1985.0 | 1 |
4 | 2534 | 2017-07-01 00:00:15 | 2017-07-01 00:42:29 | 2002 | Wythe Ave & Metropolitan Ave | 40.716887 | -73.963198 | 398 | Atlantic Ave & Furman St | 40.691652 | -73.999979 | 29408 | Subscriber | 1982.0 | 2 |
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1735599 entries, 0 to 1735598
Data columns (total 15 columns):
tripduration int64
starttime object
stoptime object
start station id int64
start station name object
start station latitude float64
start station longitude float64
end station id int64
end station name object
end station latitude float64
end station longitude float64
bikeid int64
usertype object
birth year float64
gender int64
dtypes: float64(5), int64(5), object(5)
memory usage: 198.6+ MB
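As a quick sanity check on the provider’s claim that sub-60-second trips were removed, we can inspect the minimum trip duration directly (a small sketch using the columns above):
# Verify that no trip is shorter than 60 seconds
print(df['tripduration'].min())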
Preprocess
Missing Values & Anomaly Detection
There might be some noise in the dataset, such as strange stations or null values. We need to detect it and take proper action.
df.shape
(1735599, 15)
df.isnull().sum()
tripduration 0
starttime 0
stoptime 0
start station id 0
start station name 0
start station latitude 0
start station longitude 0
end station id 0
end station name 0
end station latitude 0
end station longitude 0
bikeid 0
usertype 0
birth year 228596
gender 0
dtype: int64
The data is fairly clean, so we can simply drop the column `birth year`, since we are not going to use this feature anyway.
Here I use the `dropna` function to drop any column that contains missing values.
df = df.dropna(axis=1, how='any')
df.shape
(1735599, 14)
Next, to detect and eliminate strange values, we plot histograms of all numeric columns (in both linear and log scale).
fig, axes = plt.subplots(nrows=9, ncols=2, figsize=(15,15))
i = 1
for col in df.columns:
if df[col].dtype == np.float64 or df[col].dtype == np.int64:
ax = plt.subplot(9, 2, i)
df[col].hist(bins=30)
ax.set_title(col)
i += 1
ax = plt.subplot(9, 2, i)
df[col].hist(bins=30)
ax.set_title(col+" (log scale)")
ax.set_yscale('log')
i += 1
fig.tight_layout()
plt.show()
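As a numeric complement to the histograms, `describe` exposes the extremes directly (a quick sketch):
# Summary statistics; the maximum trip duration stands out
print(df['tripduration'].describe())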
There are some strange values:
- The maximum trip duration is about 2,500,000 seconds, which is roughly 28 days. It is very uncommon for a person to rent a Citi Bike for such a long time.
- One station is very far away from all the others. (See the rightmost value of the end station latitude/longitude histograms.)
Here I deal with the first problem by removing any record whose `tripduration` is over 20 days.
df = df[df['tripduration'] <= 24*60*60*20]
Now we check whether all stations have a one-to-one matching among their `id`, `name`, `latitude`, and `longitude`.
`station id` <—> `station name`
x1 = len(df['start station id'].unique())
y1 = len(df[['start station id', 'start station name']].drop_duplicates())
x2 = len(df['end station id'].unique())
y2 = len(df[['end station id', 'end station name']].drop_duplicates())
x1 == y1 and x2 == y2
True
`station id` <—> `station latitude`
x1 = len(df['start station id'].unique())
y1 = len(df[['start station id', 'start station latitude']].drop_duplicates())
x2 = len(df['end station id'].unique())
y2 = len(df[['end station id', 'end station latitude']].drop_duplicates())
x1 == y1 and x2 == y2
True
`station id` <—> `station longitude`
x1 = len(df['start station id'].unique())
y1 = len(df[['start station id', 'start station longitude']].drop_duplicates())
x2 = len(df['end station id'].unique())
y2 = len(df[['end station id', 'end station longitude']].drop_duplicates())
x1 == y1 and x2 == y2
True
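A more compact, equivalent check (a sketch assuming the same column names): every station id should map to exactly one distinct value of each attribute.
# For both trip ends, each station id must map to exactly one
# distinct name, latitude, and longitude
for prefix in ['start', 'end']:
    for attr in ['name', 'latitude', 'longitude']:
        col = '{} station {}'.format(prefix, attr)
        assert df.groupby('{} station id'.format(prefix))[col].nunique().max() == 1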
We can also draw all stations on a map to see whether any location looks strange.
Here I use the `Basemap` toolkit of `matplotlib` to draw the map.
t1 = df[['start station id', 'start station name', 'start station latitude', 'start station longitude']] \
.drop_duplicates().rename(columns = {'start station id':'station id', \
'start station name':'station name', \
'start station latitude':'station latitude',
'start station longitude': 'station longitude'})
t2 = df[['end station id', 'end station name', 'end station latitude', 'end station longitude']] \
.drop_duplicates().rename(columns = {'end station id':'station id', \
'end station name':'station name', \
'end station latitude':'station latitude', \
'end station longitude': 'station longitude'})
df_loc = pd.concat([t1, t2]).drop_duplicates()
# Initialize plots
fig, ax = plt.subplots(figsize=(15,15))
# determine range to print based on min, max lat and lon of the data
lat = list(df_loc['station latitude'])
lon = list(df_loc['station longitude'])
text = list(df_loc['station id'])
margin = 0.01 # buffer to add to the range
lat_min = min(lat) - margin
lat_max = max(lat) + margin
lon_min = min(lon) - margin
lon_max = max(lon) + margin
# create map using BASEMAP
m = Basemap(llcrnrlon=lon_min,
llcrnrlat=lat_min,
urcrnrlon=lon_max,
urcrnrlat=lat_max,
lat_0=(lat_max + lat_min)/2,  # midpoint of the range, not the half-range
lon_0=(lon_max + lon_min)/2,
projection='lcc',
resolution = 'f',)
m.drawcoastlines()
m.fillcontinents(lake_color='aqua')
m.drawmapboundary(fill_color='aqua')
m.drawrivers()
# convert lat and lon to map projection coordinates
lons, lats = m(lon, lat)
# plot points as red dots
ax.scatter(lons, lats, marker = 'o', color='r', zorder=5, alpha=0.6)
for i in range(df_loc.shape[0]):
plt.text(lons[i], lats[i], text[i])
plt.show()
Stations `3254` and `3182` seem weird; however, these two stations are actually on an island named “Governors Island”.
Station `3036`, at the upper-right corner of the map, does not actually exist, whereas stations `3201` and `3192` are confirmed to exist.
(I confirmed the above by checking the station list on the official website and Google Maps.)
So we need to remove all data related to station `3036`.
df = df[df['start station id']!=3036]
df = df[df['end station id']!=3036]
df_loc = df_loc[df_loc['station id']!=3036]
df_loc.to_csv("data/station_information.csv", index=None)
After this removal, every station id mentioned in the dataframe corresponds to a station that actually exists.
Create Self-defined Features
For future use, we need to calculate the in-flow and out-flow of each station every half hour. The resulting data set should contain `station_id`, `time`, `in_flow_count`, and `out_flow_count`.
The in/out flow of a station is defined as the number of trips moving to/from the station within a 30-minute period, so one day can be split into 48 segments.
To split one day into 48 segments, first we need to transform the columns `starttime` and `stoptime` into `datetime` format.
(Use the `to_datetime` function, specifying a format that matches the data.)
# format example: 2017-07-01 00:00:00
df['starttime'] = pd.to_datetime(df['starttime'], format='%Y-%m-%d %H:%M:%S')
df['stoptime'] = pd.to_datetime(df['stoptime'], format='%Y-%m-%d %H:%M:%S')
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1735598 entries, 0 to 1735598
Data columns (total 14 columns):
tripduration int64
starttime datetime64[ns]
stoptime datetime64[ns]
start station id int64
start station name object
start station latitude float64
start station longitude float64
end station id int64
end station name object
end station latitude float64
end station longitude float64
bikeid int64
usertype object
gender int64
dtypes: datetime64[ns](2), float64(4), int64(5), object(3)
memory usage: 198.6+ MB
Now we need to annotate the data according to its time segment (using the `datetime` attributes to retrieve date & time information):
def gen_time_segment(dt):
    # Map a timestamp to the start of its 30-minute segment,
    # e.g. 2017-07-01 00:35:46 -> "2017-7-1 0:30"
    if dt.minute < 30:
        minute = "%02d" % 0
    else:
        minute = "%02d" % 30
    return "{}-{}-{} {}:{}".format(dt.year, dt.month, dt.day, dt.hour, minute)
df['start_seg'] = [gen_time_segment(dt) for dt in df['starttime']]
df['stop_seg'] = [gen_time_segment(dt) for dt in df['stoptime']]
df[['start station id', 'starttime', 'start_seg', 'end station id', 'stoptime', 'stop_seg']].head()
 | start station id | starttime | start_seg | end station id | stoptime | stop_seg
---|---|---|---|---|---|---
0 | 539 | 2017-07-01 00:00:00 | 2017-7-1 0:00 | 3107 | 2017-07-01 00:06:05 | 2017-7-1 0:00 |
1 | 293 | 2017-07-01 00:00:03 | 2017-7-1 0:00 | 3425 | 2017-07-01 00:35:46 | 2017-7-1 0:30 |
2 | 3242 | 2017-07-01 00:00:08 | 2017-7-1 0:00 | 3397 | 2017-07-01 00:05:37 | 2017-7-1 0:00 |
3 | 2002 | 2017-07-01 00:00:11 | 2017-7-1 0:00 | 398 | 2017-07-01 00:42:22 | 2017-7-1 0:30 |
4 | 2002 | 2017-07-01 00:00:15 | 2017-7-1 0:00 | 398 | 2017-07-01 00:42:29 | 2017-7-1 0:30 |
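As a vectorized alternative (a sketch; the rest of the report keeps using the string segments above), pandas can floor each timestamp to its 30-minute bucket directly:
# dt.floor produces the same 30-minute buckets as Timestamps
# (shown for comparison only; not used below)
df['start_seg_ts'] = df['starttime'].dt.floor('30min')
df['stop_seg_ts'] = df['stoptime'].dt.floor('30min')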
Then group and count the data according to time annotations:
in-flow
inflow = df[['end station id', 'stop_seg']] \
.groupby(['end station id', 'stop_seg']) \
.size().reset_index(name='counts') \
.rename(columns={'end station id':'station id','stop_seg':'time', 'counts':'in_flow_count'})
out-flow
outflow = df[['start station id', 'start_seg']] \
.groupby(['start station id', 'start_seg']) \
.size().reset_index(name='counts') \
.rename(columns={'start station id':'station id','start_seg':'time', 'counts':'out_flow_count'})
In the end, merge the `inflow` and `outflow` dataframes, considering every time segment at every station.
station_id_list = list(df_loc['station id'])
# Create combinations of time series and station ids
time_seg_list = list(pd.date_range("2017-07-01 00:00:00", "2017-07-31 23:30:00", freq="30min"))
template = pd.DataFrame(list(itertools.product(station_id_list, time_seg_list)), \
columns=["station id", "time"])
# Merge in/out flow information & Add zeros to missing data according to every time segment
dat = pd.merge(inflow, outflow, on=['station id', 'time'], how='outer')
dat['time'] = pd.to_datetime(dat['time'], format='%Y-%m-%d %H:%M')
dat = dat.merge(template, on=["station id", "time"], how="right").fillna(0)
dat.head()
 | station id | time | in_flow_count | out_flow_count
---|---|---|---|---
0 | 72 | 2017-07-01 00:00:00 | 1.0 | 0.0 |
1 | 72 | 2017-07-01 10:00:00 | 1.0 | 0.0 |
2 | 72 | 2017-07-01 10:30:00 | 7.0 | 7.0 |
3 | 72 | 2017-07-01 11:00:00 | 1.0 | 1.0 |
4 | 72 | 2017-07-01 12:00:00 | 2.0 | 6.0 |
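A quick consistency check (a sketch): after the right-merge onto the template, there should be exactly one row per station per 30-minute segment.
# One row per (station, segment): 48 segments x 31 days for each station
assert len(dat) == len(station_id_list) * len(time_seg_list)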
dat.to_csv("data/station_flow.csv", index=None)
Query
How many stations are there in this dataset, and what is the average distance between them?
\[average~distance = \frac{\sum_{i \ne j} dist(S_i, S_j)}{\frac{N(N-1)}{2}}\]
print("{} stations are found in this dataset.".format(len(station_id_list)))
633 stations are found in this dataset.
To calculate the distance between stations, I use the `vincenty` function from the `GeoPy` package.
# Create dictionaries for station latitude/longitude
lat_dic = {}
lon_dic = {}
for index, row in df_loc.iterrows():
lat_dic[row['station id']] = row['station latitude']
lon_dic[row['station id']] = row['station longitude']
# Generate combinations of pairs of station
c = itertools.combinations(station_id_list, 2)
# Calculate the average distance over all pairs of stations
dist = 0
count = 0
for stn1, stn2 in c:
dist += vincenty((lat_dic[stn1], lon_dic[stn1]), (lat_dic[stn2], lon_dic[stn2])).meters
count += 1
print("The average distance between different stations is {} (meters)".format(dist/count))
The average distance between different stations is 5393.467456938468 (meters)
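A portability note: `vincenty` was removed in geopy 2.0; `geodesic` is the drop-in replacement with the same call pattern (a sketch):
# On geopy >= 2.0, use geodesic instead of vincenty
from geopy.distance import geodesic
d = geodesic((lat_dic[stn1], lon_dic[stn1]), (lat_dic[stn2], lon_dic[stn2])).meters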
What are the top 3 most frequent station pairs `(start station, end station)` on weekdays, and how about on weekends?
\[(S_i, S_j)~is~not~(S_j, S_i)\]
# Split the dataframe into weekdays information & weekends information
df_weekdays = df[df['starttime'].dt.dayofweek < 5]
df_weekends = df[df['starttime'].dt.dayofweek >= 5]
# Count and sort station pair frequencies
stn_pair_weekdays = df_weekdays[['start station id', 'end station id']] \
.groupby(['start station id', 'end station id']) \
.size().reset_index(name='counts') \
.set_index(['start station id', 'end station id']) \
.sort_values(by='counts', ascending=False)
stn_pair_weekends = df_weekends[['start station id', 'end station id']] \
.groupby(['start station id', 'end station id']) \
.size().reset_index(name='counts') \
.set_index(['start station id', 'end station id']) \
.sort_values(by='counts', ascending=False)
# Find the top 3 station pairs for weekday & weekend
top_weekday_pair = list(stn_pair_weekdays.head(3).index)
top_weekend_pair = list(stn_pair_weekends.head(3).index)
# Print out the result
print("The top 3 frequent stations pairs in weekdays are: {}, {}, and {}.".format(*top_weekday_pair))
print("The top 3 frequent stations pairs in weekends are: {}, {}, and {}.".format(*top_weekend_pair))
The top 3 frequent stations pairs in weekdays are: (432, 3263), (2006, 2006), and (281, 281).
The top 3 frequent stations pairs in weekends are: (3182, 3182), (3182, 3254), and (3254, 3182).
Find the top 3 stations with the highest average `out-flow`, and the top 3 with the highest average `in-flow`.
# Sort the average in/out flow count of each station
average_inflow = dat[['station id', 'in_flow_count']] \
.groupby(['station id']) \
.mean() \
.sort_values(by='in_flow_count', ascending=False)
average_outflow = dat[['station id', 'out_flow_count']] \
.groupby(['station id']) \
.mean() \
.sort_values(by='out_flow_count', ascending=False)
# List the top 3 stations
top_inflow = list(average_inflow.head(3).index)
top_outflow = list(average_outflow.head(3).index)
# Print out the result
print("The top 3 stations with highest outflow are: {}, {}, and {}".format(*top_outflow))
print("The top 3 stations with highest inflow are: {}, {}, and {}".format(*top_inflow))
The top 3 stations with highest outflow are: 519, 426, and 514
The top 3 stations with highest inflow are: 426, 519, and 514
What is the most popular station (highest average inflow + outflow)?
# Sum up in-flow and out-flow for each station at each time segment
dat['flow_count'] = dat['in_flow_count'] + dat['out_flow_count']
# Calculate and sort the average flow count for each station
average_flow = dat[['station id', 'flow_count']] \
.groupby(['station id']) \
.mean() \
.sort_values(by='flow_count', ascending=False)
# Find the top 1 station by average total flow
top_flow = list(average_flow.head(1).index)
# Print out the result
print("The most popular station is: {}".format(*top_flow))
The most popular station is: 519
a. Draw the in-flow ($A$) and out-flow ($B$) for that station in a line chart
Here I use the `plot` function of `pandas`.
# Select station & add information in missing time
small_df = dat[dat['station id'] == 519].sort_values(by='time')
small_df = small_df.sort_values(by='time')
# Plot line chart
small_df.plot(x='time', y=['in_flow_count', 'out_flow_count'], kind='line', figsize=(15,15))
plt.show()
b. Calculate the distance function between $A$ and $B$
Here I use `pairwise_distances` from `scikit-learn`, with the “euclidean” metric.
dist = metrics.pairwise_distances([small_df['in_flow_count']], [small_df['out_flow_count']], metric='euclidean')
print("The euclidean distance between in-flow and out-flow of this station is: {}".format(dist[0][0]))
The euclidean distance between in-flow and out-flow of this station is: 178.5553135585721
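Equivalently (a sketch), this is just the L2 norm of the element-wise difference between the two series:
# Euclidean distance = L2 norm of the difference vector
a = small_df['in_flow_count'].values
b = small_df['out_flow_count'].values
print(np.linalg.norm(a - b))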
c. Calculate the distance function between $A - mean(A)$ and $B - mean(B)$, and draw them both.
# Subtract the mean
small_df['in_flow_diff'] = small_df['in_flow_count'] - small_df['in_flow_count'].mean()
small_df['out_flow_diff'] = small_df['out_flow_count'] - small_df['out_flow_count'].mean()
# Plot line chart
small_df.plot(x='time', y=['in_flow_diff', 'out_flow_diff'], kind='line', figsize=(15,15))
plt.show()
Now calculate the euclidean distance between the in-flow and the out-flow again.
dist = metrics.pairwise_distances([small_df['in_flow_diff']], [small_df['out_flow_diff']], metric='euclidean')
print("The euclidean distance between in-flow and out-flow of this station is: {}".format(dist[0][0]))
The euclidean distance between in-flow and out-flow of this station is: 178.27760158193377
d. Calculate the distance function between $(A - mean(A))/std(A)$ and $(B - mean(B))/std(B)$, and draw them both.
# Standardize: divide the mean-centered flows by their standard deviations
small_df['in_flow_variance'] = small_df['in_flow_diff'] / small_df['in_flow_count'].std()
small_df['out_flow_variance'] = small_df['out_flow_diff'] / small_df['out_flow_count'].std()
# Plot line chart
small_df.plot(x='time', y=['in_flow_variance', 'out_flow_variance'], kind='line', figsize=(15,15))
plt.show()
Now calculate the euclidean distance between the in-flow and the out-flow again.
dist = metrics.pairwise_distances([small_df['in_flow_variance']], [small_df['out_flow_variance']], metric='euclidean')
print("The euclidean distance between in-flow and out-flow of this station is: {}".format(dist[0][0]))
The euclidean distance between in-flow and out-flow of this station is: 12.27694023559983
e. Calculate the distance function between $\{A_i − f(i) \mid A_i \in A\}$ and $\{B_i − f(i) \mid B_i \in B\}$, and draw them both.
\[f~is~the~linear~function~that~minimizes~\sum_i (A_i − f(i))^2\]
First, we need to fit the linear regression model $f(i)$ for both $A$ and $B$, using `LinearRegression` from scikit-learn’s generalized linear models.
# Prepare input for the linear regression models
small_df = small_df.sort_values(by='time', ascending=True)
length = small_df.shape[0]
time = np.arange(length).reshape(length, 1)
# Create and fit the models
reg_A = linear_model.LinearRegression()
reg_A.fit(time, list(small_df['in_flow_count']))
reg_B = linear_model.LinearRegression()
reg_B.fit(time, list(small_df['out_flow_count']))
# Save the prediction results
small_df['in_flow_linear'] = y_pred_A = reg_A.predict(time)
small_df['out_flow_linear'] = y_pred_B = reg_B.predict(time)
# Plot the fit result
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(15,10))
ax = plt.subplot(1, 2, 1)
plt.scatter(time, small_df['in_flow_count'].values.reshape(length, 1), color='black', alpha=0.1)
plt.plot(time, y_pred_A, color='blue', linewidth=3)
ax = plt.subplot(1, 2, 2)
plt.scatter(time, small_df['out_flow_count'].values.reshape(length, 1), color='black', alpha=0.1)
plt.plot(time, y_pred_B, color='blue', linewidth=3)
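For a single feature, the same least-squares line can also be obtained with `np.polyfit` (a sketch shown for comparison):
# Degree-1 polyfit returns the same least-squares slope and intercept
slope_A, intercept_A = np.polyfit(np.arange(length), small_df['in_flow_count'].values, 1)
print(slope_A, intercept_A)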
Then we can calculate $\{A_i − f(i) \mid A_i \in A\}$ and $\{B_i − f(i) \mid B_i \in B\}$, and draw them both.
# Calculate residuals from the fitted regression lines
small_df['in_flow_ols'] = small_df['in_flow_count'] - small_df['in_flow_linear']
small_df['out_flow_ols'] = small_df['out_flow_count'] - small_df['out_flow_linear']
# Plot line chart
small_df.plot(x='time', y=['in_flow_ols', 'out_flow_ols'], kind='line', figsize=(15,15))
plt.show()
Now calculate the euclidean distance between the in-flow and the out-flow again.
dist = metrics.pairwise_distances([small_df['in_flow_ols']], [small_df['out_flow_ols']], metric='euclidean')
print("The euclidean distance between in-flow and out-flow of this station is: {}".format(dist[0][0]))
The euclidean distance between in-flow and out-flow of this station is: 178.27208810218377
f. Calculate the distance function between $Smooth(A)$ and $Smooth(B)$, and draw them both.
You can choose any smoothing function, just specify it, or simply use
\[A_i = \frac{A_{i−1} + A_i + A_{i+1}}{3}\]
(or take the average of 5, 7, 9, … elements)
Here I use the `rolling` function in `pandas` to take the moving average:
- `window=9`, as I take the average of 9 elements.
- `min_periods=1` to avoid `NA` values at the head/tail of the time series.
- `center=True` to align each window with its time stamp.
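As a minimal demonstration (a toy sketch), a centered window of 3 reproduces $A_i = \frac{A_{i−1} + A_i + A_{i+1}}{3}$ at interior points:
# Toy series: interior values equal the 3-point centered average
s = pd.Series([1.0, 4.0, 7.0, 10.0])
print(s.rolling(window=3, min_periods=1, center=True).mean().tolist())
# [2.5, 4.0, 7.0, 8.5] -- the endpoints average the 2 available values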
# Calculate the moving averages
small_df['in_flow_smooth'] = small_df['in_flow_count'].rolling(window=9, min_periods=1, center=True).mean()
small_df['out_flow_smooth'] = small_df['out_flow_count'].rolling(window=9, min_periods=1, center=True).mean()
# Plot line chart
small_df.plot(x='time', y=['in_flow_smooth', 'out_flow_smooth'], kind='line', figsize=(15,15))
plt.show()
Now calculate the euclidean distance between the in-flow and the out-flow again.
dist = metrics.pairwise_distances([small_df['in_flow_smooth']], [small_df['out_flow_smooth']], metric='euclidean')
print("The euclidean distance between in-flow and out-flow of this station is: {}".format(dist[0][0]))
The euclidean distance between in-flow and out-flow of this station is: 63.427226939887035
Visualize the flows of citibikes over time
First, annotate each record as Early-July, Mid-July, or Late-July.
def gen_time_group(dt):
if dt.day <= 10:
return "Early-July"
elif dt.day <= 20:
return "Mid-July"
else:
return "Late-July"
# Select the columns needed for the flow visualization
flow = dat[['station id', 'time', 'flow_count']].copy()
# Create time group
flow['time_group'] = [gen_time_group(dt) for dt in flow['time']]
# Summarise flow count according to time group
flow = flow.groupby(["station id", "time_group"], as_index=False) \
.agg({'flow_count': 'sum'})
# Add latitude/longitude columns
flow['latitude'] = [lat_dic[x] for x in flow['station id']]
flow['longitude'] = [lon_dic[x] for x in flow['station id']]
flow.head()
 | station id | time_group | flow_count | latitude | longitude
---|---|---|---|---|---
0 | 72 | Early-July | 2369.0 | 40.767272 | -73.993929 |
1 | 72 | Late-July | 2989.0 | 40.767272 | -73.993929 |
2 | 72 | Mid-July | 2683.0 | 40.767272 | -73.993929 |
3 | 79 | Early-July | 1580.0 | 40.719116 | -74.006667 |
4 | 79 | Late-July | 2282.0 | 40.719116 | -74.006667 |
Then plot all the popular stations on the map.
Note that the size of each station point is calculated as
\[Size = 2^{flow}\]
\[flow = \frac{sum~of~flow~count~in~this~time~period}{1000}\]
def plot_stations_map(ax, stns, noText=False):
# determine range to print based on min, max lat and lon of the data
lat = list(stns['latitude'])
lon = list(stns['longitude'])
siz = [(2)**(x/1000) for x in stns['flow_count']]
margin = 0.01 # buffer to add to the range
lat_min = min(lat) - margin
lat_max = max(lat) + margin
lon_min = min(lon) - margin
lon_max = max(lon) + margin
# create map using BASEMAP
m = Basemap(llcrnrlon=lon_min,
llcrnrlat=lat_min,
urcrnrlon=lon_max,
urcrnrlat=lat_max,
lat_0=(lat_max + lat_min)/2,  # midpoint of the range, not the half-range
lon_0=(lon_max + lon_min)/2,
projection='lcc',
resolution = 'f',)
m.drawcoastlines()
m.fillcontinents(lake_color='aqua')
m.drawmapboundary(fill_color='aqua')
m.drawrivers()
# convert lat and lon to map projection coordinates
lons, lats = m(lon, lat)
# plot points as red dots
if noText:
ax.scatter(lons, lats, marker = 'o', color='r', zorder=5, alpha=0.6, s=1)
return
else:
ax.scatter(lons, lats, marker = 'o', color='r', zorder=5, alpha=0.3, s=siz)
# annotate popular stations with their ids
txt = list(stns['station id'])
for i in range(len(siz)):
    if siz[i] >= 2**6:
        plt.text(lons[i], lats[i], txt[i])
pop_flow = flow[flow['flow_count'] > 2000]
fig, axes = plt.subplots(nrows=1, ncols=4, figsize=(15,15))
ax = plt.subplot(1, 4, 1)
ax.set_title("All Stations")
plot_stations_map(ax, flow, noText=True)
ax = plt.subplot(1, 4, 2)
ax.set_title("Popular Stations in Early July")
plot_stations_map(ax, pop_flow[pop_flow['time_group'] == "Early-July"])
ax = plt.subplot(1, 4, 3)
ax.set_title("Popular Stations in Mid July")
plot_stations_map(ax, pop_flow[pop_flow['time_group'] == "Mid-July"])
ax = plt.subplot(1, 4, 4)
ax.set_title("Popular Stations in Late July")
plot_stations_map(ax, pop_flow[pop_flow['time_group'] == "Late-July"])
Here are my observations:
- Relatively few people ride Citi Bikes in Jersey City and Brooklyn; that is, the main flow of Citi Bike trips concentrates in Manhattan.
- The popular stations in Brooklyn are nearly the same across the whole month.
- Although the top popular regions (locations with large red circles) look similar, the top popular stations are not the same across the whole month. (Check the station ids carefully and you’ll see they differ.)
- In the popular regions, more and more people ride Citi Bikes in mid and late July.