Analyze the NYC Taxi Data
In this document, I will walk through an analysis of New York City Taxi Data (download links in Section II) using Python. Six months of "Yellow" taxi trip data will be loaded and analyzed.
I. Data Analytic Tool/Package Used
The following are the basic libraries used for the data analysis.
import pandas as pd
import numpy as np
import urllib.request
import zipfile
import random
import itertools
import math
On the other hand, to visualize the information extracted from the data, the libraries below are also needed. For example, the Python Shapefile Library (pyshp) provides read and write support for the ESRI Shapefile format, and Matplotlib is a Python 2D plotting library that produces publication-quality figures in a variety of hardcopy formats and interactive environments across platforms.
import shapefile
from shapely.geometry import Polygon
from descartes.patch import PolygonPatch
import matplotlib as mpl
import matplotlib.pyplot as plt
plt.style.use('ggplot')
%matplotlib inline
Furthermore, to deal with the large scale of the data (about 4 GB for 6 months in this case), a database is needed. Here I will use SQLAlchemy, a Python SQL toolkit and Object Relational Mapper that gives application developers the full power and flexibility of SQL.
from sqlalchemy import create_engine
nyc_database = create_engine('sqlite:///nyc_database.db')
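As a quick sanity check (optional, and not part of the original pipeline), we can confirm that the engine connects and list the tables in the SQLite file; the table_record table will appear once the data is loaded in Section II.
from sqlalchemy import inspect
# List the tables currently present in nyc_database.db
print(inspect(nyc_database).get_table_names())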
II. Dataset
The information and download links of NYC Taxi Data can be found in the link below:
http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml
Here we use Python code to download the "Yellow" taxi data from January 2017 to June 2017.
# Download the Trip Record Data
for month in range(1, 7):
    urllib.request.urlretrieve("https://s3.amazonaws.com/nyc-tlc/trip+data/" + \
                               "yellow_tripdata_2017-{0:0=2d}.csv".format(month),
                               "nyc.2017-{0:0=2d}.csv".format(month))
# Download the Location Data
urllib.request.urlretrieve("https://s3.amazonaws.com/nyc-tlc/misc/taxi_zones.zip", "taxi_zones.zip")
with zipfile.ZipFile("taxi_zones.zip", "r") as zip_ref:
    zip_ref.extractall("./shape")
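To verify that the six monthly files downloaded completely, here is a small optional check of the file sizes on disk (my addition; the exact sizes will vary):
import os
for month in range(1, 7):
    fp = "nyc.2017-{0:0=2d}.csv".format(month)
    print(fp, round(os.path.getsize(fp) / 1024**2), "MB")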
A. Trip Record Data
The yellow taxi trip records include fields capturing
- pick-up and drop-off dates/times
- pick-up and drop-off locations
- trip distances
- itemized fares
- rate types
- payment types
- driver-reported passenger counts
and the explanation of each field can be found in this dictionary.
df = pd.read_csv("nyc.2017-01.csv", nrows=10)
df.head()
 | VendorID | tpep_pickup_datetime | tpep_dropoff_datetime | passenger_count | trip_distance | RatecodeID | store_and_fwd_flag | PULocationID | DOLocationID | payment_type | fare_amount | extra | mta_tax | tip_amount | tolls_amount | improvement_surcharge | total_amount
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | 1 | 2017-01-09 11:13:28 | 2017-01-09 11:25:45 | 1 | 3.30 | 1 | N | 263 | 161 | 1 | 12.5 | 0.0 | 0.5 | 2.00 | 0 | 0.3 | 15.30 |
1 | 1 | 2017-01-09 11:32:27 | 2017-01-09 11:36:01 | 1 | 0.90 | 1 | N | 186 | 234 | 1 | 5.0 | 0.0 | 0.5 | 1.45 | 0 | 0.3 | 7.25 |
2 | 1 | 2017-01-09 11:38:20 | 2017-01-09 11:42:05 | 1 | 1.10 | 1 | N | 164 | 161 | 1 | 5.5 | 0.0 | 0.5 | 1.00 | 0 | 0.3 | 7.30 |
3 | 1 | 2017-01-09 11:52:13 | 2017-01-09 11:57:36 | 1 | 1.10 | 1 | N | 236 | 75 | 1 | 6.0 | 0.0 | 0.5 | 1.70 | 0 | 0.3 | 8.50 |
4 | 2 | 2017-01-01 00:00:00 | 2017-01-01 00:00:00 | 1 | 0.02 | 2 | N | 249 | 234 | 2 | 52.0 | 0.0 | 0.5 | 0.00 | 0 | 0.3 | 52.80 |
Now we load the trip record data into our database chunk by chunk.
j, chunksize = 1, 100000
for month in range(1, 7):
    fp = "nyc.2017-{0:0=2d}.csv".format(month)
    for df in pd.read_csv(fp, chunksize=chunksize, iterator=True):
        df = df.rename(columns={c: c.replace(' ', '_') for c in df.columns})
        # extract the hour ("HH") from the datetime strings
        df['pickup_hour'] = [x[11:13] for x in df['tpep_pickup_datetime']]
        df['dropoff_hour'] = [x[11:13] for x in df['tpep_dropoff_datetime']]
        df.index += j
        df.to_sql('table_record', nyc_database, if_exists='append')
        j = df.index[-1] + 1
del df
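After loading, a quick count confirms how many records landed in the database (an optional sanity check I added):
n = pd.read_sql_query('SELECT count(*) AS n FROM table_record', nyc_database)
print("{} trip records loaded.".format(n['n'][0]))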
B. Location Data
Location data, including the TLC taxi zone location IDs, location names, and the corresponding borough for each ID, can be found here. A shapefile containing the boundaries of the taxi zones can be found here.
Note that if you need the exact latitude/longitude of each region, this website can help.
sf = shapefile.Reader("shape/taxi_zones.shp")
fields_name = [field[0] for field in sf.fields[1:]]
shp_dic = dict(zip(fields_name, list(range(len(fields_name)))))
attributes = sf.records()
shp_attr = [dict(zip(fields_name, attr)) for attr in attributes]
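# get_lat_lon() used below is defined in Section IV (Appendix)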
df_loc = pd.DataFrame(shp_attr).join(get_lat_lon(sf).set_index("LocationID"), on="LocationID")
df_loc.head()
 | LocationID | OBJECTID | Shape_Area | Shape_Leng | borough | zone | longitude | latitude
---|---|---|---|---|---|---|---|---
0 | 1 | 1 | 0.000782 | 0.116357 | EWR | Newark Airport | -74.171533 | 40.689483 |
1 | 2 | 2 | 0.004866 | 0.433470 | Queens | Jamaica Bay | -73.822478 | 40.610824 |
2 | 3 | 3 | 0.000314 | 0.084341 | Bronx | Allerton/Pelham Gardens | -73.844953 | 40.865747 |
3 | 4 | 4 | 0.000112 | 0.043567 | Manhattan | Alphabet City | -73.977725 | 40.724137 |
4 | 5 | 5 | 0.000498 | 0.092146 | Staten Island | Arden Heights | -74.187558 | 40.550664 |
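The two map-drawing helpers used below, draw_region_map and draw_zone_map, are defined in Section IV (Appendix).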
fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(15,8))
ax = plt.subplot(1, 2, 1)
ax.set_title("Boroughs in NYC")
draw_region_map(ax, sf)
ax = plt.subplot(1, 2, 2)
ax.set_title("Zones in NYC")
draw_zone_map(ax, sf)
III. Analytic Problems
In the following three analytic problems, I present my analytic results and visualizations along with the Python code that generates the plots.
Q1: Which regions have the most pickups and drop-offs?
To begin, we select the data we need (pickup/drop-off locations and their counts) from the database.
df_pu = pd.read_sql_query('SELECT PULocationID AS LocationID, count(*) AS PUcount \
FROM table_record \
GROUP BY PULocationID', nyc_database)
df_do = pd.read_sql_query('SELECT DOLocationID AS LocationID, count(*) AS DOcount \
FROM table_record \
GROUP BY DOLocationID', nyc_database)
With the selected data, we want to find the zones with the most pickups and drop-offs.
# template covering every LocationID (note the +1 so the largest ID is included)
template = pd.DataFrame([x for x in range(1, max(df_loc['LocationID'].tolist())+1)], columns=["LocationID"])
df_q1 = pd.concat([df_pu, df_do]).join(template.set_index("LocationID"), how = 'outer', on=["LocationID"]).fillna(0) \
.groupby(["LocationID"], as_index=False) \
.agg({'PUcount': 'sum', 'DOcount': 'sum'})\
.sort_values(by=['LocationID'])
df_q1['TOTALcount'] = df_q1['PUcount'] + df_q1['DOcount']
loc = df_loc[["LocationID", "zone", "borough"]]
df_q1 = df_q1.merge(loc, left_on="LocationID", right_on="LocationID")
PUcount = dict(zip(df_q1['LocationID'].tolist(), df_q1['PUcount'].tolist()))
PUtop3 = df_q1.sort_values(by=['PUcount'], ascending=False).set_index("LocationID").head(3)
DOcount = dict(zip(df_q1['LocationID'].tolist(), df_q1['DOcount'].tolist()))
DOtop3 = df_q1.sort_values(by=['DOcount'], ascending=False).set_index("LocationID").head(3)
In the tables below, we can see that in the first half of 2017,
- the top 3 pickup zones are Upper East Side South, Midtown Center, and Upper East Side North, all of which are in Manhattan.
- the top 3 drop-off zones are also Upper East Side South, Midtown Center, and Upper East Side North.
PUtop3
LocationID | PUcount | DOcount | TOTALcount | zone | borough
---|---|---|---|---|---
237 | 1915990.0 | 1692273.0 | 3608263.0 | Upper East Side South | Manhattan |
161 | 1822848.0 | 1781749.0 | 3604597.0 | Midtown Center | Manhattan |
236 | 1774245.0 | 1828630.0 | 3602875.0 | Upper East Side North | Manhattan |
DOtop3
LocationID | PUcount | DOcount | TOTALcount | zone | borough
---|---|---|---|---|---
236 | 1774245.0 | 1828630.0 | 3602875.0 | Upper East Side North | Manhattan |
161 | 1822848.0 | 1781749.0 | 3604597.0 | Midtown Center | Manhattan |
237 | 1915990.0 | 1692273.0 | 3608263.0 | Upper East Side South | Manhattan |
fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(18,8))
ax = plt.subplot(1, 2, 1)
ax.set_title("Zones with most pickups")
draw_zone_map(ax, sf, heat=PUcount, text=PUtop3.index.tolist())
ax = plt.subplot(1, 2, 2)
ax.set_title("Zones with most drop-offs")
draw_zone_map(ax, sf, heat=DOcount, text=DOtop3.index.tolist())
Note that in the figure above, beyond the top 3 pickup/drop-off zones, many other zones in Manhattan are also popular. Also, the zone in the lower-right part of the map, JFK Airport, is the most popular pickup/drop-off zone in New York City outside Manhattan.
Next, we investigate the boroughs with the most pickups and drop-offs. In the tables below, we can see that Manhattan is by far the most popular borough and Staten Island the least popular one. Queens and Brooklyn are also popular, although their pickup/drop-off counts are less than 10% of Manhattan's.
df_q1_region = df_q1.groupby(["borough"], as_index=False) \
.agg({'PUcount': 'sum', 'DOcount': 'sum', 'TOTALcount': 'sum'}) \
.sort_values(by=['TOTALcount'], ascending=False)
df_q1_region
 | borough | PUcount | DOcount | TOTALcount
---|---|---|---|---
3 | Manhattan | 44852180.0 | 43309852.0 | 88162032.0 |
4 | Queens | 2893988.0 | 2473817.0 | 5367805.0 |
1 | Brooklyn | 725594.0 | 2335785.0 | 3061379.0 |
0 | Bronx | 42170.0 | 305066.0 | 347236.0 |
2 | EWR | 3457.0 | 91694.0 | 95151.0 |
5 | Staten Island | 1561.0 | 11871.0 | 13432.0 |
PUcount = dict(zip(df_q1_region['borough'].tolist(), df_q1_region['PUcount'].tolist()))
DOcount = dict(zip(df_q1_region['borough'].tolist(), df_q1_region['DOcount'].tolist()))
fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(15,8))
ax = plt.subplot(1, 2, 1)
ax.set_title("Boroughs with most pickups")
draw_region_map(ax, sf, heat = PUcount)
ax = plt.subplot(1, 2, 2)
ax.set_title("Boroughs with most drop-offs")
draw_region_map(ax, sf, heat = DOcount)
In the figure above, we notice that in the first half of 2017 there were more pickups in Queens than in Brooklyn, while the numbers of drop-offs in the two boroughs were similar.
Q2: When are the peak hours and off-peak hours for taking a taxi?
Again, we start by selecting the data we need from the database.
df_pu = pd.read_sql_query('SELECT pickup_hour AS time, count(*) AS PUcount \
FROM table_record \
GROUP BY pickup_hour', nyc_database)
df_do = pd.read_sql_query('SELECT dropoff_hour AS time, count(*) AS DOcount \
FROM table_record \
GROUP BY dropoff_hour', nyc_database)
df_q2 = df_pu.merge(df_do, on="time")
From the selected data, we arrange the information we need and visualize it.
template = pd.DataFrame(["{0:0=2d}".format(x) for x in range(0,24)], columns=["time"])
df_q2 = df_q2.join(template.set_index("time"), how = 'outer', on=["time"]).fillna(0) \
.groupby(["time"], as_index=False) \
.agg({'PUcount': 'sum', 'DOcount': 'sum'}) \
.rename(columns = {'PUcount':'Pick-ups', 'DOcount': 'Drop-offs'}) \
.sort_values(by='time')
ax = df_q2.plot(x='time', y=['Pick-ups', 'Drop-offs'], kind='line', style="-o", figsize=(15,5))
ax.set_ylabel("count")
plt.show()
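To read the peak and off-peak hours directly from df_q2 rather than eyeballing the plot, here is a short check (my addition):
peak = df_q2.loc[df_q2['Pick-ups'].idxmax(), 'time']
off_peak = df_q2.loc[df_q2['Pick-ups'].idxmin(), 'time']
print("peak hour: {}:00, off-peak hour: {}:00".format(peak, off_peak))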
According to the NYC taxi records from January 2017 to June 2017, we find that
- the peak hours are around 6 PM ~ 7 PM.
- the off-peak hours are around 5 AM.
Q3: What are the differences between short-distance and long-distance taxi trips?
To answer this question, we should first define what counts as a short or a long trip.
1. Define short and long distance
To get a closer look at the distribution of trip distances, we select the trip_distance column and print its summary statistics.
df_dist = pd.read_sql_query('SELECT trip_distance FROM table_record', nyc_database)
df_dist['trip_distance'].describe()
count 4.932189e+07
mean 2.891543e+00
std 3.701831e+00
min 0.000000e+00
25% 9.800000e-01
50% 1.600000e+00
75% 3.000000e+00
max 1.514910e+03
Name: trip_distance, dtype: float64
The distribution of trip_distance is extremely right-skewed, as shown in the figure below.
ax = df_dist['trip_distance'].hist(bins=30, figsize=(15,5))
ax.set_yscale('log')
ax.set_xlabel("trip distance (miles)")
ax.set_ylabel("count")
plt.show()
Given the distribution of trip distances and the fact that it takes about 30 miles to drive across the whole of New York City, we use 30 miles as the threshold for splitting trips into short-distance and long-distance ones.
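To see how extreme a 30-mile trip is, here is a quick quantile check on the df_dist frame built above (my addition):
print(df_dist['trip_distance'].quantile([0.90, 0.99, 0.999]))
print("fraction of trips >= 30 miles:", (df_dist['trip_distance'] >= 30).mean())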
df_q3_short = pd.read_sql_query('SELECT count(*) AS count FROM table_record \
WHERE trip_distance < 30', nyc_database)
df_q3_long = pd.read_sql_query('SELECT count(*) AS count FROM table_record \
WHERE trip_distance >= 30 ', nyc_database)
print("Short Trips: {} records in total.\nLong Trips: {} records in total."\
.format(df_q3_short.values[0][0], df_q3_long.values[0][0]))
Short Trips: 49303249 records in total.
Long Trips: 18640 records in total.
2. Observe The Difference in Temporal Attributes
Intuitively, we expect the pickup/drop-off times of short trips to differ from those of long trips, since the two kinds of trips serve different purposes.
To validate this assumption, we first select the temporal information from our database.
df_q3_short = pd.read_sql_query('SELECT pickup_hour AS PUtime, \
dropoff_hour AS DOtime, count(*) AS count \
FROM table_record \
WHERE trip_distance < 30 \
GROUP BY pickup_hour, dropoff_hour', nyc_database)
df_q3_long = pd.read_sql_query('SELECT pickup_hour AS PUtime, \
dropoff_hour AS DOtime, count(*) AS count \
FROM table_record \
WHERE trip_distance >= 30 \
GROUP BY pickup_hour, dropoff_hour', nyc_database)
Afterwards, we visualize the pickup/drop-off counts at each hour for both short and long trips.
df_q3 = df_q3_short.merge(df_q3_long, on=["PUtime", "DOtime"], suffixes=["_short", "_long"]) \
.rename(columns={"count_short":"short trips", "count_long":"long trips", \
"PUtime":"pickup time", "DOtime":"dropoff time"})
df_q3_PU = df_q3.groupby(["pickup time"], as_index=False) \
.agg({'short trips': 'sum', 'long trips':'sum'}) \
.sort_values(by="pickup time")
df_q3_DO = df_q3.groupby(["dropoff time"], as_index=False) \
.agg({'short trips': 'sum', 'long trips':'sum'}) \
.sort_values(by="dropoff time")
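# diff_short_long_trip_on_time() is defined in Section IV (Appendix)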
diff_short_long_trip_on_time()
Based on the figure shown above, we can observe that
- For short trips,
  - the peak hours for pickups are from 6 PM to 10 PM.
  - the peak hours for drop-offs are also from 6 PM to 10 PM.
- For long trips (> 30 miles),
  - the peak hours for pickups are from 1 PM to 4 PM.
  - the peak hours for drop-offs are from 3 PM to midnight.
- The off-peak hours are similar for both short and long trips.
With these observations, we can guess that
- short trips are mainly contributed by people having dinner and going out at night.
- long trips (> 30 miles) are mainly contributed by travelers arriving in or departing from the city.
3. Observe The Difference in Spatial Attributes
So let's examine whether our guesses are correct by summarizing the spatial attributes of short and long trips.
Here we extract pickup/drop-off locations and their counts from the database, grouping the counts by each ('pickup zone', 'dropoff zone') pair.
df_q3_short = pd.read_sql_query('SELECT PULocationID, DOLocationID, count(*) AS count \
FROM table_record \
WHERE trip_distance < 30 \
GROUP BY PULocationID, DOLocationID', nyc_database)
df_q3_long = pd.read_sql_query('SELECT PULocationID, DOLocationID, count(*) AS count \
FROM table_record \
WHERE trip_distance >= 30 \
GROUP BY PULocationID, DOLocationID', nyc_database)
After extracting the data from the database, we arrange the information and show the top 3 ('pickup zone', 'dropoff zone') pairs for both short and long trips.
df_q3 = df_q3_short.merge(df_q3_long, on=["PULocationID", "DOLocationID"], suffixes=["_short", "_long"]) \
.rename(columns={"count_short":"short trips", "count_long":"long trips"})
df_q3 = df_q3.merge(df_loc[["LocationID", "zone"]], left_on="PULocationID", right_on="LocationID") \
.drop(['LocationID'], axis=1).rename(columns={"zone":"pickup zone"}) \
.merge(df_loc[["LocationID", "zone"]], left_on="DOLocationID", right_on="LocationID") \
.drop(['LocationID'], axis=1).rename(columns={"zone":"dropoff zone"})
In the tables below, we can see that
- The top 3 routes for short trips are all within the most prosperous area of Manhattan, which is filled with restaurants and entertainment venues.
- The top 3 routes for long trips (> 30 miles) all run from one airport zone to another.
  - The second most popular route is a special case in which passengers are picked up from and also dropped off at JFK Airport; these trips may be some kind of city tour taken during a long layover between connecting flights.
These findings support our guesses that long trips are for travel while short trips are for dining and entertainment.
ShortTrip_top3 = df_q3.sort_values(by="short trips", ascending=False).head(3)
ShortTrip_top3[['pickup zone', 'dropoff zone', 'short trips']]
 | pickup zone | dropoff zone | short trips
---|---|---|---
563 | Upper East Side North | Upper East Side North | 238878 |
632 | Upper East Side South | Upper East Side South | 209915 |
1025 | Upper West Side South | Upper West Side North | 137164 |
LongTrip_top3 = df_q3.sort_values(by="long trips", ascending=False).head(3)
LongTrip_top3[['pickup zone', 'dropoff zone', 'long trips']]
 | pickup zone | dropoff zone | long trips
---|---|---|---
29 | JFK Airport | Newark Airport | 1097 |
101 | JFK Airport | JFK Airport | 586 |
31 | LaGuardia Airport | Newark Airport | 463 |
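As a side check (my addition), we can estimate how dominant airports are among long trips by summing the counts of routes that touch any zone whose name contains "Airport":
airport_ids = df_loc[df_loc['zone'].str.contains("Airport")]['LocationID'].tolist()
touches_airport = df_q3['PULocationID'].isin(airport_ids) | df_q3['DOLocationID'].isin(airport_ids)
print("share of long trips touching an airport:",
      df_q3.loc[touches_airport, 'long trips'].sum() / df_q3['long trips'].sum())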
On the other hand, we can also observe the popular zones for short and long trips on the map. Aggregating the trip counts per zone, we show the most popular pickup and drop-off zones for short and long trips.
df_q3_PU = df_q3.groupby("PULocationID", as_index=False).agg({'short trips':'sum', 'long trips':'sum'})
PUtop3_short = df_q3_PU.sort_values(by=['short trips'], ascending=False).set_index("PULocationID").head(3)
PUtop3_long = df_q3_PU.sort_values(by=['long trips'], ascending=False).set_index("PULocationID").head(3)
PUcount_short = dict(zip(df_q3_PU['PULocationID'].tolist(), df_q3_PU['short trips'].tolist()))
PUcount_long = dict(zip(df_q3_PU['PULocationID'].tolist(), df_q3_PU['long trips'].tolist()))
df_q3_DO = df_q3.groupby("DOLocationID", as_index=False).agg({'short trips':'sum', 'long trips':'sum'})
DOtop3_short = df_q3_DO.sort_values(by=['short trips'], ascending=False).set_index("DOLocationID").head(3)
DOtop3_long = df_q3_DO.sort_values(by=['long trips'], ascending=False).set_index("DOLocationID").head(3)
DOcount_short = dict(zip(df_q3_DO['DOLocationID'].tolist(), df_q3_DO['short trips'].tolist()))
DOcount_long = dict(zip(df_q3_DO['DOLocationID'].tolist(), df_q3_DO['long trips'].tolist()))
fig, ax = plt.subplots(nrows=2, ncols=2, figsize=(18,18))
ax = plt.subplot(2, 2, 1)
ax.set_title("Zones with most pickups for Short Trips")
draw_zone_map(ax, sf, heat=PUcount_short, text=PUtop3_short.index.tolist())
ax = plt.subplot(2, 2, 2)
ax.set_title("Zones with most pickups for Long Trips")
draw_zone_map(ax, sf, heat=PUcount_long, text=PUtop3_long.index.tolist())
ax = plt.subplot(2, 2, 3)
ax.set_title("Zones with most drop-offs for Short Trips")
draw_zone_map(ax, sf, heat=DOcount_short, text=DOtop3_short.index.tolist())
ax = plt.subplot(2, 2, 4)
ax.set_title("Zones with most drop-offs for Long Trips")
draw_zone_map(ax, sf, heat=DOcount_long, text=DOtop3_long.index.tolist())
From the figure above, we can find something interesting:
- Surprisingly, JFK Airport is the most popular pickup zone for short trips.
  - More specifically, in 6 months about 1 million taxi trips (about 5,700 per day) started at JFK Airport and drove fewer than 30 miles to reach their destinations.
  - These passengers may include foreign travelers as well as locals who have just returned from abroad.
- There are more drop-offs at JFK Airport than at Newark Airport for short trips.
- There are more drop-offs at Newark Airport than at JFK Airport for long trips.
- For long trips, JFK Airport and LaGuardia Airport are the most popular pickup zones.
4. Other Differences
Lastly, we compare short trips and long trips in some other aspects.
for attr in ["passenger_count", "RatecodeID", "payment_type"]:
    diff_short_long_trip_on(attr, rpr="proportion", kind='bar')
Unexpectedly, the distribution of passenger count is nearly the same for short trips and long trips.
RatecodeID represents the final rate code in effect at the end of the trip:
- 1=Standard rate
- 2=JFK
- 3=Newark
- 4=Nassau or Westchester
- 5=Negotiated fare
- 6=Group ride
It can be seen that about 40 percent of long trips use the Negotiated fare and another 40 percent use the JFK, Newark, or Nassau or Westchester rates, while less than 5 percent of short trips use any of them.
payment_type is a numeric code signifying how the passenger paid for the trip:
- 1=Credit card
- 2=Cash
- 3=No charge
- 4=Dispute
Passengers on long trips paid by credit card a little more frequently, and by cash a little less frequently, than passengers on short trips.
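For reference, the same payment-type comparison can be pulled straight from the database without the helper function (a sketch):
q = 'SELECT payment_type, count(*) AS n FROM table_record WHERE trip_distance {} 30 GROUP BY payment_type'
df_short = pd.read_sql_query(q.format('<'), nyc_database)
df_long = pd.read_sql_query(q.format('>='), nyc_database)
df_short['share'] = df_short['n'] / df_short['n'].sum()
df_long['share'] = df_long['n'] / df_long['n'].sum()
print(df_short.merge(df_long, on='payment_type', suffixes=('_short', '_long')))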
IV. Appendix
Here are the self-defined helper functions used frequently in the sections above.
def diff_short_long_trip_on_time():
    fig, ax = plt.subplots(nrows=2, ncols=2, figsize=(18,18))
    ax = plt.subplot(2, 2, 1, polar=True)
    # histograms binned over the 24 hours
    radii = np.array(df_q3_PU['short trips'].tolist(), dtype="int64")
    title = "Pickup Time for Short Trips"
    plt_clock(ax, radii, title, "#dc143c")
    ax = plt.subplot(2, 2, 2, polar=True)
    radii = np.array(df_q3_PU['long trips'].tolist(), dtype="int64")
    title = "Pickup Time for Long Trips"
    plt_clock(ax, radii, title, "#56B4E9")
    ax = plt.subplot(2, 2, 3, polar=True)
    radii = np.array(df_q3_DO['short trips'].tolist(), dtype="int64")
    title = "Dropoff Time for Short Trips"
    plt_clock(ax, radii, title, "#dc143c")
    ax = plt.subplot(2, 2, 4, polar=True)
    radii = np.array(df_q3_DO['long trips'].tolist(), dtype="int64")
    title = "Dropoff Time for Long Trips"
    plt_clock(ax, radii, title, "#56B4E9")
def plt_clock(ax, radii, title, color):
    N = 24
    bottom = 2
    # create theta for the 24 hours
    theta = np.linspace(0.0, 2 * np.pi, N, endpoint=False)
    # width of each bin on the plot
    width = (2 * np.pi) / N
    bars = ax.bar(theta, radii, width=width, bottom=bottom, color=color, edgecolor="#999999")
    # make the labels go clockwise, starting from the top
    ax.set_theta_zero_location("N")
    # clockwise
    ax.set_theta_direction(-1)
    # set the labels
    ax.set_xticks(theta)
    ticks = ["{}:00".format(x) for x in range(24)]
    ax.set_xticklabels(ticks)
    ax.set_title(title)
def diff_short_long_trip_on(attr, rpr="count", kind='bar'):
    df_q3_short = pd.read_sql_query('SELECT '+attr+', count(*) as count \
                                     FROM table_record \
                                     WHERE trip_distance < 30 \
                                     GROUP BY '+attr, nyc_database)
    df_q3_long = pd.read_sql_query('SELECT '+attr+', avg(trip_distance) AS AVG_trip_distance, count(*) as count \
                                    FROM table_record \
                                    WHERE trip_distance >= 30 \
                                    GROUP BY '+attr, nyc_database)
    if rpr == "proportion":
        s = np.sum(df_q3_short['count'].values)
        df_q3_short['proportion'] = [float(x)/s for x in df_q3_short['count']]
        s = np.sum(df_q3_long['count'].values)
        df_q3_long['proportion'] = [float(x)/s for x in df_q3_long['count']]
    df_q3 = df_q3_short.merge(df_q3_long, on=attr, suffixes=["_short", "_long"]) \
                       .rename(columns={rpr+"_short":"short trips", rpr+"_long":"long trips"})
    ax = df_q3.plot(x=attr, y=['short trips', 'long trips'], kind=kind, figsize=(15,5))
    ax.set_ylabel(rpr)
    ax.set_title(attr.replace("_", " ") + " difference in short/long trip")
def get_lat_lon(sf):
    content = []
    for sr in sf.shapeRecords():
        shape = sr.shape
        rec = sr.record
        loc_id = rec[shp_dic['LocationID']]
        # approximate each zone's center by the midpoint of its bounding box
        x = (shape.bbox[0] + shape.bbox[2]) / 2
        y = (shape.bbox[1] + shape.bbox[3]) / 2
        content.append((loc_id, x, y))
    return pd.DataFrame(content, columns=["LocationID", "longitude", "latitude"])
def get_boundaries(sf):
    lat, lon = [], []
    for shape in list(sf.iterShapes()):
        lat.extend([shape.bbox[0], shape.bbox[2]])
        lon.extend([shape.bbox[1], shape.bbox[3]])
    margin = 0.01 # buffer to add to the range
    lat_min = min(lat) - margin
    lat_max = max(lat) + margin
    lon_min = min(lon) - margin
    lon_max = max(lon) + margin
    return lat_min, lat_max, lon_min, lon_max
def draw_region_map(ax, sf, heat={}):
    continent = [235/256, 151/256, 78/256]
    ocean = (89/256, 171/256, 227/256)
    reg_list = {'Staten Island':1, 'Queens':2, 'Bronx':3, 'Manhattan':4, 'EWR':5, 'Brooklyn':6}
    reg_x = {'Staten Island':[], 'Queens':[], 'Bronx':[], 'Manhattan':[], 'EWR':[], 'Brooklyn':[]}
    reg_y = {'Staten Island':[], 'Queens':[], 'Bronx':[], 'Manhattan':[], 'EWR':[], 'Brooklyn':[]}
    # colorbar
    if len(heat) != 0:
        norm = mpl.colors.Normalize(vmin=math.sqrt(min(heat.values())), vmax=math.sqrt(max(heat.values()))) #norm = mpl.colors.LogNorm(vmin=1,vmax=max(heat))
        cm = plt.get_cmap('Reds')
        #sm = plt.cm.ScalarMappable(cmap=cm, norm=norm)
        #sm.set_array([])
        #plt.colorbar(sm, ticks=np.linspace(min(heat.values()),max(heat.values()),8), \
        #             boundaries=np.arange(min(heat.values())-10,max(heat.values())+10,.1))
    ax.set_facecolor(ocean)
    for sr in sf.shapeRecords():
        shape = sr.shape
        rec = sr.record
        reg_name = rec[shp_dic['borough']]
        if len(heat) == 0:
            norm = mpl.colors.Normalize(vmin=1, vmax=6) #norm = mpl.colors.LogNorm(vmin=1,vmax=max(heat))
            cm = plt.get_cmap('Pastel1')
            R, G, B, A = cm(norm(reg_list[reg_name]))
            col = [R, G, B]
        else:
            R, G, B, A = cm(norm(math.sqrt(heat[reg_name])))
            col = [R, G, B]
        # check number of parts (could use MultiPolygon class of shapely?)
        nparts = len(shape.parts) # total parts
        if nparts == 1:
            polygon = Polygon(shape.points)
            patch = PolygonPatch(polygon, facecolor=col, alpha=1.0, zorder=2)
            ax.add_patch(patch)
        else: # loop over parts of each shape, plot separately
            for ip in range(nparts):
                i0 = shape.parts[ip]
                if ip < nparts - 1:
                    i1 = shape.parts[ip+1] - 1
                else:
                    i1 = len(shape.points)
                polygon = Polygon(shape.points[i0:i1+1])
                patch = PolygonPatch(polygon, facecolor=col, alpha=1.0, zorder=2)
                ax.add_patch(patch)
        reg_x[reg_name].append((shape.bbox[0] + shape.bbox[2]) / 2)
        reg_y[reg_name].append((shape.bbox[1] + shape.bbox[3]) / 2)
    for k in reg_list:
        if len(heat) == 0:
            plt.text(np.mean(reg_x[k]), np.mean(reg_y[k]), k, horizontalalignment='center', verticalalignment='center',
                     bbox=dict(facecolor='black', alpha=0.5), color="white", fontsize=12)
        else:
            plt.text(np.mean(reg_x[k]), np.mean(reg_y[k]), "{}\n({}K)".format(k, heat[k]/1000), horizontalalignment='center',
                     verticalalignment='center', bbox=dict(facecolor='black', alpha=0.5), color="white", fontsize=12)
    # display
    limits = get_boundaries(sf)
    plt.xlim(limits[0], limits[1])
    plt.ylim(limits[2], limits[3])
def draw_zone_map(ax, sf, heat={}, text=[], arrows=[]):
    continent = [235/256, 151/256, 78/256]
    ocean = (89/256, 171/256, 227/256)
    theta = np.linspace(0, 2*np.pi, len(text)+1).tolist()
    ax.set_facecolor(ocean)
    # colorbar
    if len(heat) != 0:
        norm = mpl.colors.Normalize(vmin=min(heat.values()), vmax=max(heat.values())) #norm = mpl.colors.LogNorm(vmin=1,vmax=max(heat))
        cm = plt.get_cmap('Reds')
        sm = plt.cm.ScalarMappable(cmap=cm, norm=norm)
        sm.set_array([])
        plt.colorbar(sm, ticks=np.linspace(min(heat.values()), max(heat.values()), 8),
                     boundaries=np.arange(min(heat.values())-10, max(heat.values())+10, .1))
    for sr in sf.shapeRecords():
        shape = sr.shape
        rec = sr.record
        loc_id = rec[shp_dic['LocationID']]
        zone = rec[shp_dic['zone']]
        if len(heat) == 0:
            col = continent
        else:
            if loc_id not in heat:
                R, G, B, A = cm(norm(0))
            else:
                R, G, B, A = cm(norm(heat[loc_id]))
            col = [R, G, B]
        # check number of parts (could use MultiPolygon class of shapely?)
        nparts = len(shape.parts) # total parts
        if nparts == 1:
            polygon = Polygon(shape.points)
            patch = PolygonPatch(polygon, facecolor=col, alpha=1.0, zorder=2)
            ax.add_patch(patch)
        else: # loop over parts of each shape, plot separately
            for ip in range(nparts):
                i0 = shape.parts[ip]
                if ip < nparts - 1:
                    i1 = shape.parts[ip+1] - 1
                else:
                    i1 = len(shape.points)
                polygon = Polygon(shape.points[i0:i1+1])
                patch = PolygonPatch(polygon, facecolor=col, alpha=1.0, zorder=2)
                ax.add_patch(patch)
        x = (shape.bbox[0] + shape.bbox[2]) / 2
        y = (shape.bbox[1] + shape.bbox[3]) / 2
        if (len(text) == 0 and rec[shp_dic['Shape_Area']] > 0.0001):
            plt.text(x, y, str(loc_id), horizontalalignment='center', verticalalignment='center')
        elif len(text) != 0 and loc_id in text:
            #plt.text(x+0.01, y-0.01, str(loc_id), fontsize=12, color="white", bbox=dict(facecolor='black', alpha=0.5))
            eta_x = 0.05 * np.cos(theta[text.index(loc_id)])
            eta_y = 0.05 * np.sin(theta[text.index(loc_id)])
            ax.annotate("[{}] {}".format(loc_id, zone), xy=(x, y), xytext=(x+eta_x, y+eta_y),
                        bbox=dict(facecolor='black', alpha=0.5), color="white", fontsize=12,
                        arrowprops=dict(facecolor='black', width=3, shrink=0.05))
    if len(arrows) != 0:
        for arr in arrows:
            ax.annotate('', xy=arr['dest'], xytext=arr['src'], size=arr['cnt'],
                        arrowprops=dict(arrowstyle="fancy", fc="0.6", ec="none"))
    # display
    limits = get_boundaries(sf)
    plt.xlim(limits[0], limits[1])
    plt.ylim(limits[2], limits[3])