Analyze Airline On-time Performance Dataset
In this document, we will walk through the data exploration of Airline on-time performance dataset with Apache Pig as our exploring tool. We will analyze 8 years of flight data to answer the following 3 analytic questions.
- Find the maximal delays (you should consider both ArrDelay and DepDelay) for each month of 2008.
- How many flights were delayed caused by weather between 2000 ~ 2005? Please show the counting for each year.
- List Top 5 airports which occur delays most in 2007. (Please show the IATA airport code)
I. Analytic Tool
Apache Pig
is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. The salient property of Pig programs is that their structure is amenable to substantial parallelization, which in turns enables them to handle very large data sets.
Here are some useful websites:
Simply saying, the data model of Pig Latin
is fully nested and it allows complex non-atomic datatypes such as map and tuple.
- A field is a piece of data.
- A tuple is a record formed by an ordered set of fields, which can be of any type.
- A tuple is similar to a row in a table of RDBMS.
- Example −
(Raja, 30)
- A bag is an unordered set of tuples. In other words, a collection of tuples (non-unique) is known as a bag. Each tuple can have any number of fields (flexible schema).
- A bag is similar to a table in RDBMS, but it is not necessary that every tuple contain the same schema.
- Example −
{(Raja, 30), (Mohammad, 45)}
- A map (or data map) is a set of key-value pairs. The key needs to be of type chararray and should be unique. The value might be of any type.
- Example −
[name#Raja, age#30]
- Example −
Note that a Relation is the outermost structure of the Pig Latin data model, which is also a bag where inner bags are allowed. Except LOAD
and STORE
, while performing all other operations, Pig Latin statements take a relation as input and produce another relation as output.
II. Dataset
Airline on-time performance dataset consists of flight arrival and departure details for all commercial flights within the USA, from October 1987 to April 2008. This is a large dataset: there are nearly 120 million records in total, and takes up 1.6 gigabytes of space compressed and 12 gigabytes when uncompressed.
A. Supplement Data
If you need further information, the supplement data describes the locations of US airports, listing of carrier codes, information about individual planes, and meteorological data.
B. Flight Data
The Schema of the flight data is as follows
- Year int,
- Month int,
- DayofMonth int,
- DayOfWeek int,
- DepTime int,
- CRSDepTime int,
- ArrTime int,
- CRSArrTime int,
- UniqueCarrier varchar(5),
- FlightNum int,
- TailNum varchar(8),
- ActualElapsedTime int,
- CRSElapsedTime int,
- AirTime int,
- ArrDelay int,
- DepDelay int,
- Origin varchar(3),
- Dest varchar(3),
- Distance int,
- TaxiIn int,
- TaxiOut int,
- Cancelled int,
- CancellationCode varchar(1),
- Diverted varchar(1),
- CarrierDelay int,
- WeatherDelay int,
- NASDelay int,
- SecurityDelay int,
- LateAircraftDelay int
Here the flight dataset from 2000
to 2008
is downloaded to the directory data/
using the shell command wget
.
%sh
mkdir data
for i in 2000 2001 2002 2003 2004 2005 2007 2008
do
wget -nv -O data/$i.csv.bz2 http://stat-computing.org/dataexpo/2009/$i.csv.bz2
done
III. Analytic Problems
To solve the analytic problems, first we load the data from 2000
to 2008
.
%pig
dataset_raw = LOAD 'data/{2000,2001,2002,2003,2004,2005,2007,2008}.csv.bz2'
USING PigStorage(',') AS (Year:chararray, Month:chararray, DayofMonth:chararray,
DayOfWeek:chararray, DepTime:chararray, CRSDepTime:chararray, ArrTime:chararray,
CRSArrTime:chararray, UniqueCarrier:chararray, FlightNum:chararray,
TailNum:chararray, ActualElapsedTime:chararray, CRSElapsedTime:chararray,
AirTime:chararray, ArrDelay:int, DepDelay:int, Origin:chararray, Dest:chararray,
Distance:chararray, TaxiIn:chararray, TaxiOut:chararray, Cancelled:chararray,
CancellationCode:chararray, Diverted:chararray, CarrierDelay:chararray,
WeatherDelay:int, NASDelay:chararray, SecurityDelay:chararray,
LateAircraftDelay:chararray);
dataset = FILTER dataset_raw BY Year != 'Year';
tmp = LIMIT dataset 2;
DUMP tmp;
(2000,1,28,5,1647,1647,1906,1859,HP,154,N808AW,259,252,233,7,0,ATL,PHX,1587,15,11,0,NA,0,NA,,NA,NA,NA)
(2000,1,29,6,1648,1647,1939,1859,HP,154,N653AW,291,252,239,40,1,ATL,PHX,1587,5,47,0,NA,0,NA,,NA,NA,NA)
Q1. Find the maximal delays (you should consider both ArrDelay
and DepDelay
) for each month of 2008
.
From the dataset that consists of flight records from 2000
to 2008
, I extract records of 2008
and group these records acoording to the month
of the flight. For each group, we sort the departure delays and the arrival delays and then pick the top ones to show.
%pig
data_2008 = FILTER dataset BY Year == '2008';
records = FOREACH data_2008 GENERATE Month, ArrDelay, DepDelay,
(ArrDelay + DepDelay) AS SumDelay;
grpd = GROUP records BY Month;
top_delays = FOREACH grpd {
arr_delays = ORDER records BY ArrDelay DESC;
dep_delays = ORDER records BY DepDelay DESC;
sum_delays = ORDER records BY SumDelay DESC;
arr_delays_top1 = LIMIT arr_delays 1;
dep_delays_top1 = LIMIT dep_delays 1;
sum_delays_top1 = LIMIT sum_delays 1;
GENERATE group AS Month, FLATTEN(arr_delays_top1.ArrDelay) AS ArrDelay,
FLATTEN(dep_delays_top1.DepDelay) AS DepDelay,
FLATTEN(sum_delays_top1.SumDelay) AS SumDelay;
};
DUMP top_delays;
(1,1525,1355,2800)
(2,2461,2457,4918)
(3,1490,1521,2980)
(4,2453,2467,4920)
(5,1951,1952,3903)
(6,1707,1710,3417)
(7,1510,1518,3028)
(8,1359,1367,2726)
(9,1583,1552,3135)
(10,1392,1369,2761)
(11,1308,1286,2594)
(12,1655,1597,3252)
The query result is stored into the folder q1
.
%pig
STORE top_delays INTO 'q1';
So that we can read the query result and visualize the top delays for each month.
%python
import glob, os
import pandas as pd
# Read file
all_rec = glob.iglob(os.path.join("q1", "part*"), recursive=True)
dataframes = [pd.read_csv(f, sep="\t", header=None) for f in all_rec]
df = pd.concat(dataframes, ignore_index=True)
df.columns = ['Month','Top Arrival Delay', 'Top Departure Delay', 'Top (Arrival Delay + Departure Delay)']
# Convert minutes to hours
for col in df.columns:
if col == 'Month':
continue
df[col] = [float(x)/60 for x in df[col]]
df
Month Top Arrival Delay Top Departure Delay \
0 1 25.416667 22.583333
1 2 41.016667 40.950000
2 3 24.833333 25.350000
3 4 40.883333 41.116667
4 5 32.516667 32.533333
5 6 28.450000 28.500000
6 7 25.166667 25.300000
7 8 22.650000 22.783333
8 9 26.383333 25.866667
9 10 23.200000 22.816667
10 11 21.800000 21.433333
11 12 27.583333 26.616667
Top (Arrival Delay + Departure Delay)
0 46.666667
1 81.966667
2 49.666667
3 82.000000
4 65.050000
5 56.950000
6 50.466667
7 45.433333
8 52.250000
9 46.016667
10 43.233333
11 54.200000
%python
ax = df.plot(x='Month', y=['Top Arrival Delay', 'Top Departure Delay',
'Top (Arrival Delay + Departure Delay)'],
kind="line", style="-o", figsize=(15,5))
ax.set_title("The Maximal Delays for Each Month in 2008")
ax.set_ylabel('hour(s)')
<matplotlib.text.Text object at 0x114d074e0>
From the table and the figure shown above, we can see that in 2008,
- the largest arrival delay happens in
February
andApril
, which are also when the largest departure delay occurs. - the smallest arrival delay and the smallest departure delay both occur in
November
. - the maximal delays in summer and autumn is smaller than the maximal delays in winter and spring.
Q2. How many flights were delayed caused by weather between 2000 ~ 2005
? Please show the counting for each year.
To answer this question, I retreive flight records needed and group them by the Year
of the flights. For each group, I count the number of flight records whose WeatherDealy
is more than 0 minute(s) and show the calculation results.
%pig
data_q2 = FILTER dataset BY (Year != '2008') AND (Year != '2007');
records = FOREACH data_q2 GENERATE Year, WeatherDelay;
grpd = GROUP records BY Year;
q2 = FOREACH grpd {
weather_delays = FILTER records BY WeatherDelay > 0;
GENERATE group, COUNT(weather_delays);
};
DUMP q2;
(2000,0)
(2001,0)
(2002,0)
(2003,44620)
(2004,115989)
(2005,111388)
The query result is stored into the folder q2
.
%pig
STORE q2 INTO 'q2';
So that we can read the query result and visualize the count of weather delays for each year.
%python
import glob, os
import pandas as pd
# Read file
all_rec = glob.iglob(os.path.join("q2", "part*"), recursive=True)
dataframes = [pd.read_csv(f, sep="\t", header=None) for f in all_rec]
df = pd.concat(dataframes, ignore_index=True)
df.columns = ['Year','Weather Delay']
df
Year Weather Delay
0 2000 0
1 2001 0
2 2002 0
3 2003 44620
4 2004 115989
5 2005 111388
%python
ax = df.plot(x='Year', y='Weather Delay', kind="line", style="-o", figsize=(15,5))
ax.set_title("The Count of Weather Delay for Each Year")
ax.set_ylabel('Count')
<matplotlib.text.Text object at 0x11a972ac8>
From the table and the figure above, we can find that
- There are no weather delays from
2000
to2002
in flight records. However, I think this may due to the missing of the data or the fact that they did not record if a dealy flight is caused by weather before2003
. - There are more weather delays in
2004
than that in2003
, which may be partially driven by the increase in the total number of flights. - Unexpectedly, the count of weather delays in
2005
is less than that in2004
.
Q3. List Top 5 airports which occur delays most in 2007
. (Please show the IATA
airport code)
In the following operation on pig
dataset, I first retreive only the flight data in 2007
, and then group the delay flights by their departure airport to count the frequency of occuring departure delay for each airport.
%pig
data_q3 = FILTER dataset BY (Year == '2007') AND (DepDelay > 0);
records = FOREACH data_q3 GENERATE Origin, DepDelay;
grpd = GROUP records BY Origin;
counts = FOREACH grpd GENERATE group AS Airport,
COUNT(records) AS DelayCount;
sorted = ORDER counts BY DelayCount DESC;
q3_1 = LIMIT sorted 5;
DUMP q3_1;
(ATL,206118)
(ORD,183984)
(DFW,135433)
(DEN,109839)
(PHX,105917)
%pig
data_q3 = FILTER dataset BY (Year == '2007') AND (ArrDelay > 0);
records = FOREACH data_q3 GENERATE Origin, ArrDelay;
grpd = GROUP records BY Origin;
counts = FOREACH grpd GENERATE group AS Airport,
COUNT(records) AS DelayCount;
sorted = ORDER counts BY DelayCount DESC;
q3_2 = LIMIT sorted 5;
DUMP q3_2;
(ATL,209826)
(ORD,188410)
(DFW,148059)
(DEN,122557)
(LAX,113379)
The query result is stored into the folder q3
.
%pig
STORE q3_1 INTO 'q3/dep';
STORE q3_2 INTO 'q3/arr';
So that we can read the query result and visualize the top airports that occur departure/arrival delays the most.
%python
import glob, os
import pandas as pd
# Read file
rec_dep = glob.iglob(os.path.join("q3", "dep", "part*"), recursive=True)
rec_arr = glob.iglob(os.path.join("q3", "arr", "part*"), recursive=True)
df_dep = pd.concat([pd.read_csv(f, sep="\t", header=None) for f in rec_dep],
ignore_index=True)
df_arr = pd.concat([pd.read_csv(f, sep="\t", header=None) for f in rec_arr],
ignore_index=True)
df_dep.columns = ['Airport', 'Departure Delay']
df_arr.columns = ['Airport', 'Arrival Delay']
%python
df_dep
Airport Departure Delay
0 ATL 206118
1 ORD 183984
2 DFW 135433
3 DEN 109839
4 PHX 105917
%python
df_arr
Airport Arrival Delay
0 ATL 209826
1 ORD 188410
2 DFW 148059
3 DEN 122557
4 LAX 113379
%python
ax = df_dep.plot(x='Airport', y='Departure Delay', kind="bar", figsize=(15,7))
ax.set_title("The Top 5 Airports that Occur Departure Delays Most Frequently in 2007")
ax.set_ylabel('Count')
<matplotlib.text.Text object at 0x105a84128>
%python
ax = df_arr.plot(x='Airport', y='Arrival Delay', kind="bar", figsize=(15,7))
ax.set_title("The Top 5 Airports that Occur Arrival Delays Most Frequently in 2007")
ax.set_ylabel('Count')
<matplotlib.text.Text object at 0x10e9df278>
The top 4 airports that occurs delay the most times are the same for departure delay and arrival delay, which are
ATL
- William B Hartsfield-Atlanta Intl Airport in AtlantaORD
- Chicago O’Hare International Airport in ChicagoDFW
- Dallas-Fort Worth International Airport in Dallas-Fort WorthDEN
- Denver Intl Airport in Denver
However, the 5th highest departure delay count occurs at
PHX
- Phoenix Sky Harbor International Airport in Phoenix
while the 5th highest arrival delay count occurs at
LAX
- Los Angeles International in Los Angeles