Mining Association Rules on New York City Bike Dataset
The goal here is to design 3 mining tasks, each with its own definition of transactions, and discover the rules behind them.
For each task, we should
- Try at least two discretization methods (divided by 10, divided by 20, …)
- Try at least two algorithms (Apriori, FP-growth, …) to find association rules.
- List the interesting rules.
- Compare the differences between them.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.basemap import Basemap
import copy
import json
import math
from collections import OrderedDict
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline
Dataset
New York Citi Bike Trip Histories
We’ll use 201707-citibike-tripdata.csv.zip in this homework only.
A. Schema
We have already preprocessed this dataset into the following 2 data frames:
1. Every Station’s Information
- Station ID
- Station Name
- Station Latitude
- Station Longitude
df_loc = pd.read_csv("data/station_information.csv")
df_loc.head()
 | station id | station name | station latitude | station longitude |
---|---|---|---|---|
0 | 539 | Metropolitan Ave & Bedford Ave | 40.715348 | -73.960241 |
1 | 293 | Lafayette St & E 8 St | 40.730207 | -73.991026 |
2 | 3242 | Schermerhorn St & Court St | 40.691029 | -73.991834 |
3 | 2002 | Wythe Ave & Metropolitan Ave | 40.716887 | -73.963198 |
4 | 361 | Allen St & Hester St | 40.716059 | -73.991908 |
2. Every Station’s Flow Data
- Station ID
- Time: one day is split into 48 segments in this case (every 30 minutes)
- In Flow Count: the number of trips moving into the station
- Out Flow Count: the number of trips moving out of the station
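The preprocessing itself is not shown in this notebook. As a rough sketch only (not the actual preprocessing code), station_flow.csv could be derived from the raw trip file along these lines, assuming the standard Citi Bike trip columns starttime, stoptime, start station id, and end station id:
# Sketch only: build 30-minute in/out flow counts from the raw trips.
# The file path and column names are assumptions about the raw data.
trips = pd.read_csv("data/201707-citibike-tripdata.csv",
                    parse_dates=['starttime', 'stoptime'])
out_flow = (trips.set_index('starttime')
                 .groupby('start station id').resample('30min').size()
                 .rename_axis(['station id', 'time']).rename('out_flow_count'))
in_flow = (trips.set_index('stoptime')
                .groupby('end station id').resample('30min').size()
                .rename_axis(['station id', 'time']).rename('in_flow_count'))
# Align the two series on (station id, time); missing buckets become zero.
# Note the provided CSV appears to drop all-zero buckets, so the result
# may not match it row for row.
flow = pd.concat([in_flow, out_flow], axis=1).fillna(0).reset_index()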
df_flow = pd.read_csv("data/station_flow.csv")
df_flow['time'] = pd.to_datetime(df_flow['time'], format='%Y-%m-%d %H:%M:%S')
df_flow.head()
 | station id | time | in_flow_count | out_flow_count |
---|---|---|---|---|
0 | 72 | 2017-07-01 00:00:00 | 1.0 | 0.0 |
1 | 72 | 2017-07-01 10:00:00 | 1.0 | 0.0 |
2 | 72 | 2017-07-01 10:30:00 | 7.0 | 7.0 |
3 | 72 | 2017-07-01 11:00:00 | 1.0 | 1.0 |
4 | 72 | 2017-07-01 12:00:00 | 2.0 | 6.0 |
B. Algorithms
- Apriori
- Github repository: https://github.com/ymoch/apyori
- How to Use: Install with pip:
pip install apyori
Released under the MIT License, Copyright (c) 2016 Yu Mochizuki.
from apyori import apriori
def apriori_find_association_rules(dataset, minsup, minconf):
    # Run apyori's Apriori and materialize the generator into a list.
    records = list(apriori(dataset, min_support=minsup, min_confidence=minconf))
    return records
def apriori_show_mining_results(records):
    # Convert apyori's namedtuples into plain dicts for easier handling.
    ap = []
    for record in records:
        converted_record = record._replace(
            ordered_statistics=[x._asdict() for x in record.ordered_statistics])
        ap.append(converted_record._asdict())
    #print("Frequent Itemsets:\n------------------")
    #for ptn in ap:
    #    print('({}) support = {}'.format(", ".join(ptn["items"]), round(ptn["support"], 3)))
    #print()
    print("Rules:\n------")
    for ptn in ap:
        for rule in ptn["ordered_statistics"]:
            head = rule["items_base"]
            tail = rule["items_add"]
            # Skip degenerate rules with an empty head or tail.
            if len(head) == 0 or len(tail) == 0:
                continue
            confidence = rule["confidence"]
            print('({}) ==> ({}) confidence = {}'.format(
                ', '.join(head), ', '.join(tail), round(confidence, 3)))
    print()
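A quick sanity check of this wrapper on a toy dataset (not part of the homework data):
toy = [['a', 'b'], ['a', 'b'], ['a', 'c'], ['b', 'c'], ['a', 'b', 'c']]
apriori_show_mining_results(apriori_find_association_rules(toy, 0.4, 0.6))
# Expect rules like (a) ==> (b): {a, b} has support 3/5 and confidence 3/4.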
- FP-Growth
- Github repository: https://github.com/evandempsey/fp-growth
- How to Use: Install with pip:
pip install pyfpgrowth
Released under the ISC License, Copyright (c) 2016 Evan Dempsey.
import pyfpgrowth
def fp_find_association_rules(dataset, minsup, minconf):
    # pyfpgrowth expects an absolute support count, so scale the fraction.
    patterns = pyfpgrowth.find_frequent_patterns(dataset, minsup * len(dataset))
    rules = pyfpgrowth.generate_association_rules(patterns, minconf)
    return (patterns, rules)

def fp_show_mining_results(ap, N):
    (patterns, rules) = ap
    #print("Frequent Itemsets:\n------------------")
    #for key, val in patterns.items():
    #    print('{} support = {}'.format(key, round(val/N, 3)))
    #print()
    print("Rules:\n------")
    for key, val in rules.items():
        head = key
        tail = val[0]
        confidence = val[1]
        # Skip degenerate rules with an empty tail.
        if len(tail) == 0:
            continue
        print('({}) ==> ({}) confidence = {}'.format(
            ', '.join(head), ', '.join(tail), round(confidence, 3)))
    print()
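The same toy transactions can double-check that the two wrappers agree on simple input (pyfpgrowth takes an absolute support count, which fp_find_association_rules derives from the fraction):
toy = [['a', 'b'], ['a', 'b'], ['a', 'c'], ['b', 'c'], ['a', 'b', 'c']]
fp_show_mining_results(fp_find_association_rules(toy, 0.4, 0.6), len(toy))
# Ideally prints the same rules as the Apriori smoke test above;
# Task 3 below notes a case where this implementation misses some itemsets.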
Task 1: Find Rules between in-flow and out-flow of Station 519
The first task is to find whether there are any association rules between the in-flow counts and the out-flow counts of station 519.
For example,
\[\{High~inflow~count\} \rightarrow \{High~outflow~count\}\]
A. Define Transactions
A transaction consists of
- in-flow for station id = 519
- out-flow for station id = 519
dat = df_flow[df_flow['station id'] == 519][['in_flow_count', 'out_flow_count']]
dat.head(5)
 | in_flow_count | out_flow_count |
---|---|---|
244329 | 3.0 | 1.0 |
244330 | 1.0 | 1.0 |
244331 | 1.0 | 2.0 |
244332 | 4.0 | 2.0 |
244333 | 7.0 | 4.0 |
Let’s do some observations on the transaction dataset.
- Make sure there are no missing values.
- See the minimum/maximum of the flow count values.
- See the distribution of the flow count values.
pd.isnull(dat).sum()
in_flow_count 0
out_flow_count 0
dtype: int64
print("Min: {}\nMax: {}".format(dat.values.min(), dat.values.max()))
Min: 0.0
Max: 116.0
fig, axes = plt.subplots(nrows=2, ncols=2, figsize=(15,5))
ax = plt.subplot(2, 2, 1)
dat['in_flow_count'].hist(bins=25)
ax.set_title("in flow count")
ax = plt.subplot(2, 2, 2)
dat['out_flow_count'].hist(bins=25)
ax.set_title("out flow count")
ax = plt.subplot(2, 2, 3)
ax.set_yscale('log')
dat['in_flow_count'].hist(bins=25)
ax.set_title("in flow count (log scale)")
ax = plt.subplot(2, 2, 4)
ax.set_yscale('log')
dat['out_flow_count'].hist(bins=25)
ax.set_title("out flow count (log scale)")
fig.tight_layout()
It is found that the flow count values are highly concentrated in the interval 0~20.
I’m going to use the pandas functions cut and qcut to split the in/out flow counts into intervals.
cut
- `bins`: int, sequence of scalars, or IntervalIndex
  - If bins is an int, it defines the number of equal-width bins in the range of x.
  - If bins is a sequence, it defines the bin edges, allowing for non-uniform bin widths.
- `labels`: array or boolean, default None
  - Used as labels for the resulting bins.

qcut
- `q`: integer or array of quantiles
  - If q is an int, it defines the number of quantiles: 10 for deciles, 4 for quartiles, etc.
  - If q is an array, it defines the quantiles themselves, e.g. [0, .25, .5, .75, 1.] for quartiles.
- `labels`: array or boolean, default None
  - Used as labels for the resulting bins.
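As a tiny illustration of the difference (toy values, not the bike data): cut splits the value range into equal-width bins, while qcut splits at quantiles so that each bin holds roughly the same number of points.
s = pd.Series([0, 1, 1, 2, 3, 5, 8, 13, 21, 34])
print(pd.cut(s, bins=2).value_counts())   # equal width: edges at 17, counts 8 vs 2
print(pd.qcut(s, q=2).value_counts())     # equal depth: edge at the median (4.0), counts 5 vs 5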
- Discretization Method (1). equi-sized approach

  With the equi-sized approach, I partition the continuous domain into intervals of equal length. Here I also define their labels:
  - "level-1" = size 0%~20%
  - "level-2" = size 20%~40%
  - "level-3" = size 40%~60%
  - "level-4" = size 60%~80%
  - "level-5" = size 80%~100%
dat_1 = copy.deepcopy(dat)
dat_1['in_flow_count'] = pd.cut(dat_1['in_flow_count'], bins = 5, \
labels = ["in.level-1", "in.level-2", "in.level-3", \
"in.level-4", "in.level-5"]).astype(str)
pd.cut(dat['in_flow_count'], bins = 5).value_counts()
(-0.091, 18.2] 1263
(18.2, 36.4] 141
(36.4, 54.6] 48
(54.6, 72.8] 34
(72.8, 91.0] 2
Name: in_flow_count, dtype: int64
Label | Interval | Depth (Number of Instances) |
---|---|---|
in.level-1 | (-0.091, 18.2] | 1263 |
in.level-2 | (18.2, 36.4] | 141 |
in.level-3 | (36.4, 54.6] | 48 |
in.level-4 | (54.6, 72.8] | 34 |
in.level-5 | (72.8, 91.0] | 2 |
dat_1['out_flow_count'] = pd.cut(dat_1['out_flow_count'], bins = 5, \
labels = ["out.level-1", "out.level-2", "out.level-3", \
"out.level-4", "out.level-5"]).astype(str)
pd.cut(dat['out_flow_count'], bins = 5).value_counts()
(-0.116, 23.2] 1317
(23.2, 46.4] 111
(46.4, 69.6] 40
(69.6, 92.8] 18
(92.8, 116.0] 2
Name: out_flow_count, dtype: int64
Label | Interval | Depth (Number of Instances) |
---|---|---|
out.level-1 | (-0.116, 23.2] | 1317 |
out.level-2 | (23.2, 46.4] | 111 |
out.level-3 | (46.4, 69.6] | 40 |
out.level-4 | (69.6, 92.8] | 18 |
out.level-5 | (92.8, 116.0] | 2 |
- Discretization Method (2). equi-depth approach

  With the equi-depth approach, I partition the data values into intervals of equal size along the ordering of the data. Here I also define their labels:
  - "zero" = quantile 0~20
  - "extreme-low" = quantile 20~40
  - "low" = quantile 40~60
  - "medium" = quantile 60~80
  - "high" = quantile 80~100
dat_2 = copy.deepcopy(dat)
dat_2['in_flow_count'] = pd.qcut(dat_2['in_flow_count'], q = 5, \
labels = ["in.zero", "in.extreme-low", "in.low", \
"in.medium", "in.high"]).astype(str)
pd.qcut(dat['in_flow_count'], q = 5).value_counts()
(-0.001, 1.0] 403
(3.0, 7.0] 315
(7.0, 14.0] 283
(14.0, 91.0] 278
(1.0, 3.0] 209
Name: in_flow_count, dtype: int64
Label | Interval | Depth (Number of Instances) |
---|---|---|
in.zero | (-0.001, 1.0] | 403 |
in.extreme-low | (1.0, 3.0] | 209 |
in.low | (3.0, 7.0] | 315 |
in.medium | (7.0, 14.0] | 283 |
in.high | (14.0, 91.0] | 278 |
dat_2['out_flow_count'] = pd.qcut(dat_2['out_flow_count'], q = 5, \
labels = ["out.zero", "out.extreme-low", "out.low", \
"out.medium", "out.high"]).astype(str)
pd.qcut(dat['out_flow_count'], q = 5).value_counts()
(-0.001, 1.0] 423
(3.0, 7.0] 307
(7.0, 14.0] 292
(14.0, 116.0] 284
(1.0, 3.0] 182
Name: out_flow_count, dtype: int64
Label | Interval | Depth (Number of Instances) |
---|---|---|
out.zero | (-0.001, 1.0] | 423 |
out.extreme-low | (1.0, 3.0] | 182 |
out.low | (3.0, 7.0] | 307 |
out.medium | (7.0, 14.0] | 292 |
out.high | (14.0, 116.0] | 284 |
B. Association Rule Mining
Using the Apriori and FP-Growth algorithms, we want to discover the relationship between the in-flow counts and the out-flow counts of station 519.
The support and confidence thresholds are determined by the quality and quantity of the rules found. That is, the number of rules found should be neither too small nor too large, and the rules found should have confidence as high as possible.
To compare the differences, I use the same support/confidence thresholds to mine rules in the transaction datasets produced by the different discretization approaches.
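One way to settle on these thresholds is a small sweep that counts how many (non-degenerate) rules survive each setting. A sketch using the Apriori wrapper defined above; the grid values here are arbitrary:
def count_rules(records):
    # Count only rules with a non-empty head and tail, matching the printers above.
    n = 0
    for record in records:
        for stat in record.ordered_statistics:
            if stat.items_base and stat.items_add:
                n += 1
    return n

for minsup in [0.05, 0.1, 0.2]:
    for minconf in [0.2, 0.4, 0.6]:
        rules = apriori_find_association_rules(dat_1.values.tolist(), minsup, minconf)
        print("minsup={}, minconf={}: {} rules".format(minsup, minconf, count_rules(rules)))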
- First, we mine the association rules of the transaction dataset discretized with the equi-sized approach.
  - support threshold = 0.1
  - confidence threshold = 0.2
%%time
print("Apriori\n********")
ap = apriori_find_association_rules(dat_1.values.tolist(), 0.1, 0.2)
Apriori
********
CPU times: user 1.88 ms, sys: 540 µs, total: 2.42 ms
Wall time: 2.38 ms
apriori_show_mining_results(ap)
Rules:
------
(in.level-1) ==> (out.level-1) confidence = 0.998
(out.level-1) ==> (in.level-1) confidence = 0.957
%%time
print("FP-Growth\n*********")
fp = fp_find_association_rules(dat_1.values.tolist(), 0.1, 0.2)
FP-Growth
*********
CPU times: user 9.18 ms, sys: 958 µs, total: 10.1 ms
Wall time: 9.31 ms
fp_show_mining_results(fp, dat_1.shape[0])
Rules:
------
(in.level-1) ==> (out.level-1) confidence = 0.998
(out.level-1) ==> (in.level-1) confidence = 0.957
- Next, we mine the association rules of the transaction dataset discretized with the equi-depth approach.
  - support threshold = 0.1
  - confidence threshold = 0.2
%%time
print("Apriori\n********")
ap = apriori_find_association_rules(dat_2.values.tolist(), 0.1, 0.2)
Apriori
********
CPU times: user 3.49 ms, sys: 1.94 ms, total: 5.43 ms
Wall time: 4.91 ms
apriori_show_mining_results(ap)
Rules:
------
(in.high) ==> (out.high) confidence = 0.878
(out.high) ==> (in.high) confidence = 0.859
(in.medium) ==> (out.medium) confidence = 0.53
(out.medium) ==> (in.medium) confidence = 0.514
(in.zero) ==> (out.zero) confidence = 0.821
(out.zero) ==> (in.zero) confidence = 0.783
%%time
print("FP-Growth\n*********")
fp = fp_find_association_rules(dat_2.values.tolist(), 0.1, 0.2)
FP-Growth
*********
CPU times: user 9.77 ms, sys: 641 µs, total: 10.4 ms
Wall time: 9.97 ms
fp_show_mining_results(fp, dat_2.shape[0])
Rules:
------
(in.high) ==> (out.high) confidence = 0.878
(out.high) ==> (in.high) confidence = 0.859
(in.medium) ==> (out.medium) confidence = 0.53
(out.medium) ==> (in.medium) confidence = 0.514
(in.zero) ==> (out.zero) confidence = 0.821
(out.zero) ==> (in.zero) confidence = 0.783
C. Observations
The table below shows the rules found, sorted by their confidence.
Discretization Method | Rules Found | Confidence |
---|---|---|
Method (1). equi-sized | (in.level-1) <==> (out.level-1) | ==> 0.998; <== 0.957 |
Method (2). equi-depth | (in.high) <==> (out.high) | ==> 0.878; <== 0.859 |
Method (2). equi-depth | (in.zero) <==> (out.zero) | ==> 0.821; <== 0.783 |
Method (2). equi-depth | (in.medium) <==> (out.medium) | ==> 0.530; <== 0.514 |
Using the equi-sized approach, we can find only 2 rules, which can be summarized as
- In every 30 minutes, an in-flow count of 0~18 indicates an out-flow count of 0~23, and vice versa. (confidence > 95%)

Using the equi-depth approach, we can find 6 rules, which can be summarized as
- In every 30 minutes, an in-flow count of 15~91 indicates an out-flow count of 15~116, and vice versa. (confidence > 85%)
- In every 30 minutes, an in-flow count of 0~1 indicates an out-flow count of 0~1, and vice versa. (confidence > 78%)
- In every 30 minutes, an in-flow count of 8~14 indicates an out-flow count of 8~14, and vice versa. (confidence > 50%)
D. Comparisons
Rules found by equi-depth
approach cover all ranges of the flow count except the interval 1~7
. With equi-depth
approach, we can find much more rules. I think it is because using equi-sized
approach, number of instances in the buckets are too imbalance (e.g., there are only 2 instances that contains level-5
in-flow count).
Top Rules | Discretization Method (1). equi-sized approach | Discretization Method (2). equi-depth approach |
---|---|---|
confidence > 90% | in-flow count of 0~18 <==> out-flow count of 0~23 | |
confidence > 80% | | in-flow count of 15~91 <==> out-flow count of 15~116 |
confidence > 70% | | in-flow count of 0~1 <==> out-flow count of 0~1 |
confidence > 60% | | |
confidence > 50% | | in-flow count of 8~14 <==> out-flow count of 8~14 |
However, the top rules found by the two discretization approaches are actually very different. The top rule found with the equi-sized approach says that low in-flow tends to co-occur with low out-flow, while the top rule found by the equi-depth approach says that high in-flow tends to co-occur with high out-flow.
Execution time (ms) | Discretization Method (1). equi-sized approach | Discretization Method (2). equi-depth approach |
---|---|---|
Apriori | 2.42 | 5.43 |
FP-Growth | 10.1 | 10.4 |
Surprisingly, FP-Growth runs much slower than Apriori in this case. I think it is because this transaction dataset is so small (it only contains data from station 519) that constructing the FP-tree becomes a significant overhead in the FP-Growth algorithm. Also, I think it is reasonable that the equi-depth approach, which found more rules, costs more time in execution.
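The %%time figures above come from single runs; a steadier comparison would average over many repetitions. A minimal sketch with timeit, reusing the equi-depth transactions:
import timeit
transactions = dat_2.values.tolist()
t_ap = timeit.timeit(lambda: apriori_find_association_rules(transactions, 0.1, 0.2), number=100)
t_fp = timeit.timeit(lambda: fp_find_association_rules(transactions, 0.1, 0.2), number=100)
print("Apriori:   {:.2f} ms/run".format(t_ap * 10))   # timeit returns seconds per 100 runs
print("FP-Growth: {:.2f} ms/run".format(t_fp * 10))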
Task 2: Find Rules between Time of a Day and Flows of Station 519
The second task is to find whether there are any association rules between the time of a day and the flow counts of station 519.
For example,
\[\{Afternoon\} \rightarrow \{High~flow~count\}\]
A. Define Transactions
A transaction consists of
- time of a day
- flow count of station 519 (the sum of the in-flow count and the out-flow count)
dat = df_flow[df_flow['station id'] == 519][['time', 'in_flow_count', 'out_flow_count']]
dat['flow_count'] = dat['in_flow_count'] + dat['out_flow_count']
dat['time'] = ["{:02d}:{:02d}".format(dt.hour, dt.minute) for dt in dat['time']]
dat = dat[['time', 'flow_count']]
dat.head(5)
 | time | flow_count |
---|---|---|
244329 | 00:00 | 4.0 |
244330 | 00:30 | 2.0 |
244331 | 10:00 | 3.0 |
244332 | 10:30 | 6.0 |
244333 | 11:00 | 11.0 |
- Discretization Method (1). Every 2 hours

  I split one day into 12 intervals (every 2 hours as a bucket).
dat_1 = copy.deepcopy(dat)
# 'time' now holds "HH:MM" strings, so take the hour from the first two characters.
dat_1['time'] = ["{:02d}:00~{:02d}:00".format(int(t[:2])//2*2, int(t[:2])//2*2+2) \
                 for t in dat_1['time']]
dat_1['time'] = dat_1['time'].astype(str)
Note that I use the equi-depth approach to deal with the flow counts. (The intervals for low, medium, and high flow counts are shown below.)
dat_1['flow_count'] = pd.qcut(dat_1['flow_count'], q = 3, \
labels = ["low", "medium", "high"]).astype(str)
pd.qcut(dat['flow_count'], q = 3).value_counts()
(-0.001, 5.0] 521
(17.0, 185.0] 486
(5.0, 17.0] 481
Name: flow_count, dtype: int64
- Discretization Method (2). Morning, Noon, Afternoon, Evening, or Night
  - 00:00 ~ 06:00: Night
  - 06:00 ~ 11:00: Morning
  - 11:00 ~ 13:00: Noon
  - 13:00 ~ 16:00: Afternoon
  - 16:00 ~ 22:00: Evening
  - 22:00 ~ 24:00: Night
dat_2 = copy.deepcopy(dat)
mapping = ["Night"]*6 + ["Morning"]*5 + ["Noon"]*2 + ["Afternoon"]*3 + ["Evening"]*6 + ["Night"]*2
# Again parse the hour from the "HH:MM" string to index the 24-entry mapping.
dat_2['time'] = [mapping[int(t[:2])] for t in dat_2['time']]
Note that I use the equi-depth approach again to deal with the flow counts. (The intervals for low, medium, and high flow counts are shown below.)
dat_2['flow_count'] = pd.qcut(dat_2['flow_count'], q = 3, \
labels = ["low", "medium", "high"]).astype(str)
pd.qcut(dat['flow_count'], q = 3).value_counts()
(-0.001, 5.0] 521
(17.0, 185.0] 486
(5.0, 17.0] 481
Name: flow_count, dtype: int64
B. Association Rule Mining
Using Apriori and FP-Growth algorithms, we want to discover the relationship between which time of a day
and flow counts
of station 519
.
The support threshold and confidence threshold are determined by the quality and quantity of rules found. That is, number of rules found should not be too small or too large and the rules found should have confidence as high as possible.
To compare the differences, I use the same support/confidence threshold to mine rules in transaction datasets with different discretization approach.
- First, we mine the association rules of the transaction dataset discretized with the Every 2 hours approach.
  - support threshold = 0.05
  - confidence threshold = 0.6
%%time
print("Apriori\n********")
ap = apriori_find_association_rules(dat_1.values.tolist(), 0.05, 0.6)
Apriori
********
CPU times: user 3.14 ms, sys: 707 µs, total: 3.85 ms
Wall time: 3.42 ms
apriori_show_mining_results(ap)
Rules:
------
(00:00~02:00) ==> (low) confidence = 0.903
(02:00~04:00) ==> (low) confidence = 0.992
(04:00~06:00) ==> (low) confidence = 0.855
(08:00~10:00) ==> (high) confidence = 0.637
(12:00~14:00) ==> (medium) confidence = 0.661
(16:00~18:00) ==> (high) confidence = 0.774
(18:00~20:00) ==> (high) confidence = 0.645
%%time
print("FP-Growth\n*********")
fp = fp_find_association_rules(dat_1.values.tolist(), 0.05, 0.6)
FP-Growth
*********
CPU times: user 9.18 ms, sys: 381 µs, total: 9.56 ms
Wall time: 9.28 ms
fp_show_mining_results(fp, dat_1.shape[0])
Rules:
------
(00:00~02:00) ==> (low) confidence = 0.903
(12:00~14:00) ==> (medium) confidence = 0.661
(16:00~18:00) ==> (high) confidence = 0.774
(18:00~20:00) ==> (high) confidence = 0.645
(02:00~04:00) ==> (low) confidence = 0.992
(04:00~06:00) ==> (low) confidence = 0.855
(08:00~10:00) ==> (high) confidence = 0.637
- Next, we mine the association rules of the transaction dataset discretized with the Morning/Afternoon/... approach.
  - support threshold = 0.05
  - confidence threshold = 0.6
%%time
print("Apriori\n********")
ap = apriori_find_association_rules(dat_2.values.tolist(), 0.05, 0.6)
Apriori
********
CPU times: user 3.34 ms, sys: 1.33 ms, total: 4.67 ms
Wall time: 3.73 ms
apriori_show_mining_results(ap)
Rules:
------
(Night) ==> (low) confidence = 0.837
(low) ==> (Night) confidence = 0.797
(Noon) ==> (medium) confidence = 0.621
%%time
print("FP-Growth\n*********")
fp = fp_find_association_rules(dat_2.values.tolist(), 0.05, 0.6)
FP-Growth
*********
CPU times: user 9.44 ms, sys: 578 µs, total: 10 ms
Wall time: 9.61 ms
fp_show_mining_results(fp, dat_2.shape[0])
Rules:
------
(Noon) ==> (medium) confidence = 0.621
(Night) ==> (low) confidence = 0.837
(low) ==> (Night) confidence = 0.797
C. Observations
The table below shows the rules found, sorted by their confidence.
Discretization Method | Rules Found | Confidence |
---|---|---|
Method (1). Every 2 hours | (02:00~04:00) ==> (low) | 0.992 |
Method (1). Every 2 hours | (00:00~02:00) ==> (low) | 0.903 |
Method (1). Every 2 hours | (04:00~06:00) ==> (low) | 0.855 |
Method (2). Morning/Afternoon/... | (Night) ==> (low) | 0.837 |
Method (2). Morning/Afternoon/... | (low) ==> (Night) | 0.797 |
Method (1). Every 2 hours | (16:00~18:00) ==> (high) | 0.774 |
Method (1). Every 2 hours | (12:00~14:00) ==> (medium) | 0.661 |
Method (1). Every 2 hours | (18:00~20:00) ==> (high) | 0.645 |
Method (1). Every 2 hours | (08:00~10:00) ==> (high) | 0.637 |
Method (2). Morning/Afternoon/... | (Noon) ==> (medium) | 0.621 |
Using the “Every 2 hours” approach, we can find 7 rules, which can be summarized as
- At 00 ~ 06 o’clock, the flow counts tend to be lower than 5 in every half hour. (confidence > 85%)
- At 12 ~ 14 o’clock, the flow counts tend to be in the interval (5, 17] in every half hour. (confidence > 65%)
- At 16 ~ 20 o’clock, the flow counts tend to be more than 17 in every half hour (confidence > 64%), especially at 16 ~ 18 o’clock (confidence > 75%).
- At 08 ~ 10 o’clock, the flow counts tend to be more than 17 in every half hour. (confidence > 60%)

Using the “Morning/Afternoon/...” approach, we can find 3 rules, which can be summarized as
- At night (22 ~ 24 and 00 ~ 06 o’clock), the flow counts tend to be lower than 5 in every half hour, and vice versa. (confidence > 79%)
- At noon (11 ~ 13 o’clock), the flow counts tend to be in the interval (5, 17] in every half hour. (confidence > 60%)
D. Comparisons
With the “Every 2 hours” approach, we can find more rules. I think it is because this approach gives us more buckets, which happens to match the scenario that flow counts change rapidly at the scale of hours.
Top Rules | Discretization Method (1). “Every 2 hours” approach | Discretization Method (2). “Morning/Afternoon/...” approach |
---|---|---|
confidence > 90% | (02:00~04:00) ==> less than 5 flow counts per 30 minutes; (00:00~02:00) ==> less than 5 flow counts per 30 minutes | |
confidence > 80% | (04:00~06:00) ==> less than 5 flow counts per 30 minutes | (Night) ==> less than 5 flow counts per 30 minutes |
confidence > 70% | (16:00~18:00) ==> more than 17 flow counts per 30 minutes | less than 5 flow counts per 30 minutes ==> (Night) |
confidence > 60% | (12:00~14:00) ==> 5 ~ 17 flow counts per 30 minutes; (18:00~20:00) ==> more than 17 flow counts per 30 minutes; (08:00~10:00) ==> more than 17 flow counts per 30 minutes | (Noon) ==> 5 ~ 17 flow counts per 30 minutes |
If you check the rules carefully, you will find that the rules found by Method (1) almost cover the rules found by Method (2). However, the rules found by Method (2) are more intuitive at first sight.
Execution time (ms) | Discretization Method (1). “Every 2 hours” approach | Discretization Method (2). “Morning/Afternoon/...” approach |
---|---|---|
Apriori | 3.85 | 4.67 |
FP-Growth | 9.56 | 10.0 |
FP-Growth runs much slower than Apriori in this case. I think it is again because this transaction dataset is so small (it only contains data from station 519) that constructing the FP-tree becomes a significant overhead in the FP-Growth algorithm.
Task 3: Find Rules between Station Locations and Their Daily Flows
The third task is to find whether there are any association rules between the location of a station and its daily flow counts.
For example,
\[\{40 < latitude \leq 40.5\} \rightarrow \{High~daily~flow~count\}\]
A. Define Transactions
A transaction consists of
- station latitude
- station longitude
- daily flow count (the sum of each day’s in-flow and out-flow counts)

Note that I use the daily flow count instead of the half-hourly flow count to avoid generating only rules about zero or low flow counts.
dat = pd.merge(df_flow, df_loc, on=['station id'], how='left')
dat['flow_count'] = dat['in_flow_count'] + dat['out_flow_count']
dat['day'] = [dt.day for dt in dat['time']]
dat = dat.groupby(['station latitude', 'station longitude', "day"], as_index=False) \
.agg({'flow_count': 'sum'})
dat = dat[['station latitude', 'station longitude', 'flow_count']]
dat.head(5)
 | station latitude | station longitude | flow_count |
---|---|---|---|
0 | 40.6554 | -74.010628 | 0.0 |
1 | 40.6554 | -74.010628 | 0.0 |
2 | 40.6554 | -74.010628 | 0.0 |
3 | 40.6554 | -74.010628 | 0.0 |
4 | 40.6554 | -74.010628 | 0.0 |
- Discretization Method (1). equi-sized approach

  With the equi-sized approach, I partition the continuous domain into intervals of equal length.
pd.cut(dat['station latitude'], bins = 5).value_counts()
(40.715, 40.745] 5425
(40.685, 40.715] 4805
(40.745, 40.774] 3968
(40.655, 40.685] 2945
(40.774, 40.804] 2480
Name: station latitude, dtype: int64
pd.cut(dat['station longitude'], bins = 5).value_counts()
(-74.012, -73.985] 7626
(-73.985, -73.957] 7347
(-73.957, -73.93] 3875
(-74.04, -74.012] 651
(-74.067, -74.04] 124
Name: station longitude, dtype: int64
dat_1 = copy.deepcopy(dat)
dat_1['station latitude'] = pd.cut(dat_1['station latitude'], bins = 5).astype(str)
dat_1['station latitude'] = "latitude = " + dat_1['station latitude']
dat_1['station longitude'] = pd.cut(dat_1['station longitude'], bins = 5).astype(str)
dat_1['station longitude'] = "longitude = "+ dat_1['station longitude']
Note that I use the equi-depth approach to deal with the flow counts. (The intervals for extreme-low, low, medium, high, and extreme-high flow counts are shown below.)
dat_1['flow_count'] = pd.qcut(dat_1['flow_count'], q = 5, \
labels = ["extreme-low", "low", "medium", "high", "extreme-high"]).astype(str)
pd.qcut(dat['flow_count'], q = 5).value_counts()
(-0.001, 51.0] 3943
(51.0, 96.0] 3939
(163.0, 286.0] 3923
(286.0, 1532.0] 3919
(96.0, 163.0] 3899
Name: flow_count, dtype: int64
- Discretization Method (2). equi-depth approach

  With the equi-depth approach, I partition the data values into intervals of equal size along the ordering of the data.
pd.qcut(dat['station latitude'], q = 5).value_counts()
(40.735, 40.762] 3937
(40.691, 40.715] 3937
(40.654, 40.691] 3937
(40.762, 40.804] 3906
(40.715, 40.735] 3906
Name: station latitude, dtype: int64
pd.qcut(dat['station longitude'], q = 5).value_counts()
(-73.976, -73.957] 3937
(-73.998, -73.987] 3937
(-74.068, -73.998] 3937
(-73.957, -73.93] 3906
(-73.987, -73.976] 3906
Name: station longitude, dtype: int64
dat_2 = copy.deepcopy(dat)
dat_2['station latitude'] = pd.qcut(dat_2['station latitude'], q = 5).astype(str)
dat_2['station latitude'] = "latitude = " + dat_2['station latitude']
dat_2['station longitude'] = pd.qcut(dat_2['station longitude'], q = 5).astype(str)
dat_2['station longitude'] = "longitude = "+ dat_2['station longitude']
Note that I use the equi-depth approach again to deal with the flow counts. (The intervals for extreme-low, low, medium, high, and extreme-high flow counts are shown below.)
dat_2['flow_count'] = pd.qcut(dat_2['flow_count'], q = 5, \
labels = ["extreme-low", "low", "medium", "high", "extreme-high"]).astype(str)
pd.qcut(dat['flow_count'], q = 5).value_counts()
(-0.001, 51.0] 3943
(51.0, 96.0] 3939
(163.0, 286.0] 3923
(286.0, 1532.0] 3919
(96.0, 163.0] 3899
Name: flow_count, dtype: int64
B. Association Rule Mining
Using Apriori and FP-Growth algorithms, we want to discover the relationship between station locations
and their daily flow counts
.
The support threshold and confidence threshold are determined by the quality and quantity of rules found. That is, number of rules found should not be too small or too large and the rules found should have confidence as high as possible.
To compare the differences, I use the same support/confidence threshold to mine rules in transaction datasets with different discretization approach.
- First, we mine the association rules of the transaction dataset discretized with the equi-sized approach.
  - support threshold = 0.08
  - confidence threshold = 0.4
%%time
print("Apriori\n********")
ap = apriori_find_association_rules(dat_1.values.tolist(), 0.08, 0.4)
Apriori
********
CPU times: user 30.2 ms, sys: 2.86 ms, total: 33 ms
Wall time: 30.7 ms
apriori_show_mining_results(ap)
Rules:
------
(extreme-high) ==> (latitude = (40.715, 40.745]) confidence = 0.529
(extreme-high) ==> (longitude = (-74.012, -73.985]) confidence = 0.64
(high) ==> (longitude = (-74.012, -73.985]) confidence = 0.526
(latitude = (40.715, 40.745]) ==> (longitude = (-74.012, -73.985]) confidence = 0.514
(latitude = (40.745, 40.774]) ==> (longitude = (-73.985, -73.957]) confidence = 0.492
(low) ==> (longitude = (-73.985, -73.957]) confidence = 0.469
(medium) ==> (longitude = (-73.985, -73.957]) confidence = 0.488
%%time
print("FP-Growth\n*********")
fp = fp_find_association_rules(dat_1.values.tolist(), 0.08, 0.4)
FP-Growth
*********
CPU times: user 166 ms, sys: 2.63 ms, total: 169 ms
Wall time: 168 ms
fp_show_mining_results(fp, dat_1.shape[0])
Rules:
------
(medium) ==> (longitude = (-73.985, -73.957]) confidence = 0.488
(high) ==> (longitude = (-74.012, -73.985]) confidence = 0.526
(low) ==> (longitude = (-73.985, -73.957]) confidence = 0.469
(latitude = (40.745, 40.774]) ==> (longitude = (-73.985, -73.957]) confidence = 0.492
(latitude = (40.715, 40.745]) ==> (longitude = (-74.012, -73.985]) confidence = 0.514
- Next, we mine the association rules of the transaction dataset discretized with the equi-depth approach.
  - support threshold = 0.08
  - confidence threshold = 0.4
%%time
print("Apriori\n********")
ap = apriori_find_association_rules(dat_2.values.tolist(), 0.08, 0.4)
Apriori
********
CPU times: user 29.9 ms, sys: 1.09 ms, total: 31 ms
Wall time: 30.4 ms
apriori_show_mining_results(ap)
Rules:
------
(extreme-high) ==> (latitude = (40.735, 40.762]) confidence = 0.451
(latitude = (40.735, 40.762]) ==> (extreme-high) confidence = 0.449
(extreme-low) ==> (latitude = (40.654, 40.691]) confidence = 0.403
(latitude = (40.654, 40.691]) ==> (extreme-low) confidence = 0.404
%%time
print("FP-Growth\n*********")
fp = fp_find_association_rules(dat_2.values.tolist(), 0.08, 0.4)
FP-Growth
*********
CPU times: user 159 ms, sys: 2.69 ms, total: 162 ms
Wall time: 162 ms
fp_show_mining_results(fp, dat_2.shape[0])
Rules:
------
(extreme-high) ==> (latitude = (40.735, 40.762]) confidence = 0.451
(latitude = (40.735, 40.762]) ==> (extreme-high) confidence = 0.449
(extreme-low) ==> (latitude = (40.654, 40.691]) confidence = 0.403
(latitude = (40.654, 40.691]) ==> (extreme-low) confidence = 0.404
It is strange that the number of rules found by Apriori is greater than the number found by FP-Growth on the transaction dataset discretized by the equi-sized approach. I later confirmed that the rules found by Apriori are all correct. Hence in the following discussion, I am going to use the mining results of Apriori.
So why did this implementation of FP-Growth find fewer rules?
Checking the frequent itemsets found by FP-Growth, I found that this implementation of FP-Growth missed some 1-frequent itemsets. For example, it showed that the itemset ('extreme-high', 'latitude = (40.715, 40.745]') is frequent, but it did not show that the itemset ('extreme-high') is frequent, while it did show that the itemset ('latitude = (40.715, 40.745]') is frequent.
This is the reason why this implementation of FP-Growth cannot find the rule (extreme-high) ==> (latitude = (40.715, 40.745]).
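The gap can also be confirmed programmatically by normalizing both outputs into sets of (head, tail) pairs and diffing them. A sketch that re-mines the equi-sized dataset with both wrappers:
def apriori_rule_set(records):
    # Collect (head, tail) pairs from apyori's output.
    out = set()
    for record in records:
        for stat in record.ordered_statistics:
            if stat.items_base and stat.items_add:
                out.add((frozenset(stat.items_base), frozenset(stat.items_add)))
    return out

def fp_rule_set(result):
    # Collect (head, tail) pairs from pyfpgrowth's output.
    patterns, rules = result
    return {(frozenset(head), frozenset(tail))
            for head, (tail, conf) in rules.items() if tail}

ap1 = apriori_find_association_rules(dat_1.values.tolist(), 0.08, 0.4)
fp1 = fp_find_association_rules(dat_1.values.tolist(), 0.08, 0.4)
for head, tail in apriori_rule_set(ap1) - fp_rule_set(fp1):
    print("missed by FP-Growth:", set(head), "==>", set(tail))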
C. Observations
Note that the function show_discretization_result() used below is defined in the appendix; it draws the two discretization grids over a map of the stations.
show_discretization_result()
The table below shows the rules found, sorted by their confidence.
Discretization Method | Rules Found | Confidence |
---|---|---|
Method (1). equi-sized | (extreme-high) ==> (longitude = (-74.012, -73.985]) | 0.640 |
Method (1). equi-sized | (extreme-high) ==> (latitude = (40.715, 40.745]) | 0.529 |
Method (1). equi-sized | (high) ==> (longitude = (-74.012, -73.985]) | 0.526 |
Method (1). equi-sized | (latitude = (40.715, 40.745]) ==> (longitude = (-74.012, -73.985]) | 0.514 |
Method (1). equi-sized | (latitude = (40.745, 40.774]) ==> (longitude = (-73.985, -73.957]) | 0.492 |
Method (1). equi-sized | (medium) ==> (longitude = (-73.985, -73.957]) | 0.488 |
Method (1). equi-sized | (low) ==> (longitude = (-73.985, -73.957]) | 0.469 |
Method (2). equi-depth | (latitude = (40.735, 40.762]) <==> (extreme-high) | ==> 0.449; <== 0.451 |
Method (2). equi-depth | (latitude = (40.654, 40.691]) <==> (extreme-low) | ==> 0.404; <== 0.403 |
Using the equi-sized approach, we can find 7 rules, which can be summarized as
- Extreme-high daily flow counts (more than 286 per day) tend to occur in the area of latitude 40.715 ~ 40.745 and longitude -74.012 ~ -73.985. (confidence > 52%)
- High daily flow counts (164 ~ 286 per day) tend to occur in the area of longitude -74.012 ~ -73.985. (confidence > 50%)
- Stations at latitude 40.715 ~ 40.745 tend to be located at longitude -74.012 ~ -73.985. (confidence > 50%)
- Stations at latitude 40.745 ~ 40.774 tend to be located at longitude -73.985 ~ -73.957. (confidence $\simeq$ 50%)
- Medium and low daily flow counts (52 ~ 163 per day) tend to occur in the area of longitude -73.985 ~ -73.957. (confidence $\simeq$ 50%)

Using the equi-depth approach, we can find 4 rules, which can be summarized as
- Stations at latitude 40.735 ~ 40.762 tend to have extreme-high daily flow counts (more than 286 per day), and vice versa. (confidence $\simeq$ 45%)
- Stations at latitude 40.654 ~ 40.691 tend to have extreme-low daily flow counts (at most 51 per day), and vice versa. (confidence > 40%)
In general, stations at latitude 40.715 ~ 40.762 (Manhattan & Long Island City, Queens) tend to have high to extreme-high daily flow counts, while stations at latitude 40.654 ~ 40.691 (Williamsburg, Brooklyn & Red Hook, Brooklyn) tend to have low to extreme-low daily flow counts.
D. Comparisons
Top Rules | Discretization Method (1). equi-sized approach | Discretization Method (2). equi-depth approach |
---|---|---|
confidence > 60% | more than 286 flow counts per day ==> (longitude = (-74.012, -73.985]) | |
confidence > 55% | | |
confidence > 50% | more than 286 flow counts per day ==> (latitude = (40.715, 40.745]); 164 ~ 286 flow counts per day ==> (longitude = (-74.012, -73.985]); (latitude = (40.715, 40.745]) ==> (longitude = (-74.012, -73.985]) | |
confidence > 45% | (latitude = (40.745, 40.774]) ==> (longitude = (-73.985, -73.957]); 97 ~ 163 flow counts per day ==> (longitude = (-73.985, -73.957]); 52 ~ 96 flow counts per day ==> (longitude = (-73.985, -73.957]) | more than 286 flow counts per day ==> (latitude = (40.735, 40.762]) |
confidence > 40% | | (latitude = (40.735, 40.762]) ==> more than 286 flow counts per day; (latitude = (40.654, 40.691]) <==> less than 52 flow counts per day |
With the equi-sized approach, we can find more rules with higher confidence. I think this is because the equi-depth approach forces all of the split areas to contain roughly the same number of stations, while a station is actually built according to the population, or say, the estimated flow counts, of its area. As a result, the number of rules mined with the equi-depth approach and their confidence are likely to be low.
Execution time (ms) | Discretization Method (1). equi-sized approach | Discretization Method (2). equi-depth approach |
---|---|---|
Apriori | 33 | 31 |
FP-Growth | 169 | 162 |
Surprisingly, FP-Growth still runs much slower than Apriori in this case. This time we have a much bigger transaction dataset, yet the execution time of FP-Growth is still much larger. In my opinion, the reason may be that this implementation of FP-Growth is simply not written efficiently.
Appendix
The functions for show_discretization_result()
used in section C of Task 3 are defined here.
def plot_stations_map(ax, stns, parallels_val, meridians_val):
    # determine range to print based on min, max lat and lon of the data
    lat = list(stns['station latitude'])
    lon = list(stns['station longitude'])
    siz = [(2)**(x/1000) for x in stns['flow_count']]
    margin = 0.01 # buffer to add to the range
    lat_min = min(lat) - margin
    lat_max = max(lat) + margin
    lon_min = min(lon) - margin
    lon_max = max(lon) + margin
    # create map using BASEMAP
    m = Basemap(llcrnrlon=lon_min,
                llcrnrlat=lat_min,
                urcrnrlon=lon_max,
                urcrnrlat=lat_max,
                lat_0=(lat_max - lat_min)/2,
                lon_0=(lon_max - lon_min)/2,
                projection='lcc',
                resolution='f')
    m.drawcoastlines()
    m.fillcontinents(lake_color='aqua')
    m.drawmapboundary(fill_color='aqua')
    m.drawrivers()
    # draw the discretization grid as parallels and meridians
    m.drawparallels(parallels_val, labels=[False, True, True, False])
    meridians = m.drawmeridians(meridians_val, labels=[True, False, False, True])
    for x in meridians:
        try:
            meridians[x][1][0].set_rotation(45)
        except:
            pass
    # convert lat and lon to map projection coordinates
    lons, lats = m(lon, lat)
    # plot points as red dots
    ax.scatter(lons, lats, marker='o', color='r', zorder=5, alpha=0.6, s=1)
def show_discretization_result():
    fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(15, 15))
    # left panel: equi-sized bin edges
    ax = plt.subplot(1, 2, 1)
    ax.set_title("Equi-Sized")
    parallels_val = np.array([40.655, 40.685, 40.715, 40.745, 40.774, 40.804])
    meridians_val = np.array([-73.93, -73.957, -73.985, -74.012, -74.04, -74.067])
    plot_stations_map(ax, dat, parallels_val, meridians_val)
    # right panel: equi-depth bin edges
    ax = plt.subplot(1, 2, 2)
    ax.set_title("Equi-Depth")
    parallels_val = np.array([40.655, 40.691, 40.715, 40.735, 40.762, 40.804])
    meridians_val = np.array([-73.93, -73.957, -73.976, -73.987, -73.998, -74.068])
    plot_stations_map(ax, dat, parallels_val, meridians_val)