In this document, we will walk through the data exploration of Airline on-time performance dataset with Apache Pig as our exploring tool. We will analyze 8 years of flight data to answer the following 3 analytic questions.

  1. Find the maximal delays (you should consider both ArrDelay and DepDelay) for each month of 2008.
  2. How many flights were delayed caused by weather between 2000 ~ 2005? Please show the counting for each year.
  3. List Top 5 airports which occur delays most in 2007. (Please show the IATA airport code)

I. Analytic Tool

Apache Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. The salient property of Pig programs is that their structure is amenable to substantial parallelization, which in turns enables them to handle very large data sets.

Here are some useful websites:

Simply saying, the data model of Pig Latin is fully nested and it allows complex non-atomic datatypes such as map and tuple.

  • A field is a piece of data.
  • A tuple is a record formed by an ordered set of fields, which can be of any type.
    • A tuple is similar to a row in a table of RDBMS.
    • Example − (Raja, 30)
  • A bag is an unordered set of tuples. In other words, a collection of tuples (non-unique) is known as a bag. Each tuple can have any number of fields (flexible schema).
    • A bag is similar to a table in RDBMS, but it is not necessary that every tuple contain the same schema.
    • Example − {(Raja, 30), (Mohammad, 45)}
  • A map (or data map) is a set of key-value pairs. The key needs to be of type chararray and should be unique. The value might be of any type.
    • Example − [name#Raja, age#30]

Note that a Relation is the outermost structure of the Pig Latin data model, which is also a bag where inner bags are allowed. Except LOAD and STORE, while performing all other operations, Pig Latin statements take a relation as input and produce another relation as output.

II. Dataset

Airline on-time performance dataset consists of flight arrival and departure details for all commercial flights within the USA, from October 1987 to April 2008. This is a large dataset: there are nearly 120 million records in total, and takes up 1.6 gigabytes of space compressed and 12 gigabytes when uncompressed.

A. Supplement Data

If you need further information, the supplement data describes the locations of US airports, listing of carrier codes, information about individual planes, and meteorological data.

B. Flight Data

The Schema of the flight data is as follows

  • Year int,
  • Month int,
  • DayofMonth int,
  • DayOfWeek int,
  • DepTime int,
  • CRSDepTime int,
  • ArrTime int,
  • CRSArrTime int,
  • UniqueCarrier varchar(5),
  • FlightNum int,
  • TailNum varchar(8),
  • ActualElapsedTime int,
  • CRSElapsedTime int,
  • AirTime int,
  • ArrDelay int,
  • DepDelay int,
  • Origin varchar(3),
  • Dest varchar(3),
  • Distance int,
  • TaxiIn int,
  • TaxiOut int,
  • Cancelled int,
  • CancellationCode varchar(1),
  • Diverted varchar(1),
  • CarrierDelay int,
  • WeatherDelay int,
  • NASDelay int,
  • SecurityDelay int,
  • LateAircraftDelay int

Here the flight dataset from 2000 to 2008 is downloaded to the directory data/ using the shell command wget.

%sh

mkdir data
for i in 2000 2001 2002 2003 2004 2005 2007 2008
do
  wget -nv -O data/$i.csv.bz2 http://stat-computing.org/dataexpo/2009/$i.csv.bz2
done

III. Analytic Problems

To solve the analytic problems, first we load the data from 2000 to 2008.

%pig

dataset_raw = LOAD 'data/{2000,2001,2002,2003,2004,2005,2007,2008}.csv.bz2' 
    USING PigStorage(',') AS (Year:chararray, Month:chararray, DayofMonth:chararray, 
    DayOfWeek:chararray, DepTime:chararray, CRSDepTime:chararray, ArrTime:chararray, 
    CRSArrTime:chararray, UniqueCarrier:chararray, FlightNum:chararray, 
    TailNum:chararray, ActualElapsedTime:chararray, CRSElapsedTime:chararray, 
    AirTime:chararray, ArrDelay:int, DepDelay:int, Origin:chararray, Dest:chararray, 
    Distance:chararray, TaxiIn:chararray, TaxiOut:chararray, Cancelled:chararray, 
    CancellationCode:chararray, Diverted:chararray, CarrierDelay:chararray, 
    WeatherDelay:int, NASDelay:chararray, SecurityDelay:chararray, 
    LateAircraftDelay:chararray); 
dataset = FILTER dataset_raw BY Year != 'Year';

tmp = LIMIT dataset 2;
DUMP tmp;
(2000,1,28,5,1647,1647,1906,1859,HP,154,N808AW,259,252,233,7,0,ATL,PHX,1587,15,11,0,NA,0,NA,,NA,NA,NA)
(2000,1,29,6,1648,1647,1939,1859,HP,154,N653AW,291,252,239,40,1,ATL,PHX,1587,5,47,0,NA,0,NA,,NA,NA,NA)

Q1. Find the maximal delays (you should consider both ArrDelay and DepDelay) for each month of 2008.

From the dataset that consists of flight records from 2000 to 2008, I extract records of 2008 and group these records acoording to the month of the flight. For each group, we sort the departure delays and the arrival delays and then pick the top ones to show.

%pig

data_2008 = FILTER dataset BY Year == '2008';
records = FOREACH data_2008 GENERATE Month, ArrDelay, DepDelay,
            (ArrDelay + DepDelay) AS SumDelay;
grpd = GROUP records BY Month;

top_delays = FOREACH grpd {
 arr_delays = ORDER records BY ArrDelay DESC;
 dep_delays = ORDER records BY DepDelay DESC;
 sum_delays = ORDER records BY SumDelay DESC;
 arr_delays_top1 = LIMIT arr_delays 1;
 dep_delays_top1 = LIMIT dep_delays 1;
 sum_delays_top1 = LIMIT sum_delays 1;
 GENERATE group AS Month, FLATTEN(arr_delays_top1.ArrDelay) AS ArrDelay, 
          FLATTEN(dep_delays_top1.DepDelay) AS DepDelay,
          FLATTEN(sum_delays_top1.SumDelay) AS SumDelay; 
};

DUMP top_delays;
(1,1525,1355,2800)
(2,2461,2457,4918)
(3,1490,1521,2980)
(4,2453,2467,4920)
(5,1951,1952,3903)
(6,1707,1710,3417)
(7,1510,1518,3028)
(8,1359,1367,2726)
(9,1583,1552,3135)
(10,1392,1369,2761)
(11,1308,1286,2594)
(12,1655,1597,3252)

The query result is stored into the folder q1.

%pig

STORE top_delays INTO 'q1';

So that we can read the query result and visualize the top delays for each month.

%python

import glob, os 
import pandas as pd

# Read file

all_rec = glob.iglob(os.path.join("q1", "part*"), recursive=True)    
dataframes = [pd.read_csv(f, sep="\t", header=None) for f in all_rec]
df = pd.concat(dataframes, ignore_index=True)
df.columns = ['Month','Top Arrival Delay', 'Top Departure Delay', 'Top (Arrival Delay + Departure Delay)']

# Convert minutes to hours
for col in df.columns:
    if col == 'Month':
        continue
    df[col] = [float(x)/60 for x in df[col]]
df
    Month  Top Arrival Delay  Top Departure Delay  \
0       1          25.416667            22.583333   
1       2          41.016667            40.950000   
2       3          24.833333            25.350000   
3       4          40.883333            41.116667   
4       5          32.516667            32.533333   
5       6          28.450000            28.500000   
6       7          25.166667            25.300000   
7       8          22.650000            22.783333   
8       9          26.383333            25.866667   
9      10          23.200000            22.816667   
10     11          21.800000            21.433333   
11     12          27.583333            26.616667   

    Top (Arrival Delay + Departure Delay)  
0                               46.666667  
1                               81.966667  
2                               49.666667  
3                               82.000000  
4                               65.050000  
5                               56.950000  
6                               50.466667  
7                               45.433333  
8                               52.250000  
9                               46.016667  
10                              43.233333  
11                              54.200000  
%python

ax = df.plot(x='Month', y=['Top Arrival Delay', 'Top Departure Delay', 
                            'Top (Arrival Delay + Departure Delay)'], 
             kind="line", style="-o", figsize=(15,5))
ax.set_title("The Maximal Delays for Each Month in 2008")
ax.set_ylabel('hour(s)')
<matplotlib.text.Text object at 0x114d074e0>

From the table and the figure shown above, we can see that in 2008,

  1. the largest arrival delay happens in February and April, which are also when the largest departure delay occurs.
  2. the smallest arrival delay and the smallest departure delay both occur in November.
  3. the maximal delays in summer and autumn is smaller than the maximal delays in winter and spring.

Q2. How many flights were delayed caused by weather between 2000 ~ 2005? Please show the counting for each year.

To answer this question, I retreive flight records needed and group them by the Year of the flights. For each group, I count the number of flight records whose WeatherDealy is more than 0 minute(s) and show the calculation results.

%pig
data_q2 = FILTER dataset BY (Year != '2008') AND (Year != '2007');
records = FOREACH data_q2 GENERATE Year, WeatherDelay;
grpd = GROUP records BY Year;
q2 = FOREACH grpd {
    weather_delays = FILTER records BY WeatherDelay > 0;
    GENERATE group, COUNT(weather_delays); 
};

DUMP q2;
(2000,0)
(2001,0)
(2002,0)
(2003,44620)
(2004,115989)
(2005,111388)

The query result is stored into the folder q2.

%pig

STORE q2 INTO 'q2';

So that we can read the query result and visualize the count of weather delays for each year.

%python

import glob, os 
import pandas as pd

# Read file
all_rec = glob.iglob(os.path.join("q2", "part*"), recursive=True)    
dataframes = [pd.read_csv(f, sep="\t", header=None) for f in all_rec]
df = pd.concat(dataframes, ignore_index=True)
df.columns = ['Year','Weather Delay']

df
   Year  Weather Delay
0  2000              0
1  2001              0
2  2002              0
3  2003          44620
4  2004         115989
5  2005         111388
%python

ax = df.plot(x='Year', y='Weather Delay', kind="line", style="-o", figsize=(15,5))
ax.set_title("The Count of Weather Delay for Each Year")
ax.set_ylabel('Count')
<matplotlib.text.Text object at 0x11a972ac8>

From the table and the figure above, we can find that

  1. There are no weather delays from 2000 to 2002 in flight records. However, I think this may due to the missing of the data or the fact that they did not record if a dealy flight is caused by weather before 2003.
  2. There are more weather delays in 2004 than that in 2003, which may be partially driven by the increase in the total number of flights.
  3. Unexpectedly, the count of weather delays in 2005 is less than that in 2004.

Q3. List Top 5 airports which occur delays most in 2007. (Please show the IATA airport code)

In the following operation on pig dataset, I first retreive only the flight data in 2007, and then group the delay flights by their departure airport to count the frequency of occuring departure delay for each airport.

%pig

data_q3 = FILTER dataset BY (Year == '2007') AND (DepDelay > 0);
records = FOREACH data_q3 GENERATE Origin, DepDelay;
grpd = GROUP records BY Origin;
counts = FOREACH grpd GENERATE group AS Airport, 
                               COUNT(records) AS DelayCount;
sorted = ORDER counts BY DelayCount DESC;
q3_1 = LIMIT sorted 5;

DUMP q3_1;
(ATL,206118)
(ORD,183984)
(DFW,135433)
(DEN,109839)
(PHX,105917)
%pig

data_q3 = FILTER dataset BY (Year == '2007') AND (ArrDelay > 0);
records = FOREACH data_q3 GENERATE Origin, ArrDelay;
grpd = GROUP records BY Origin;
counts = FOREACH grpd GENERATE group AS Airport, 
                               COUNT(records) AS DelayCount;
sorted = ORDER counts BY DelayCount DESC;
q3_2 = LIMIT sorted 5;

DUMP q3_2;
(ATL,209826)
(ORD,188410)
(DFW,148059)
(DEN,122557)
(LAX,113379)

The query result is stored into the folder q3.

%pig

STORE q3_1 INTO 'q3/dep';
STORE q3_2 INTO 'q3/arr';

So that we can read the query result and visualize the top airports that occur departure/arrival delays the most.

%python

import glob, os 
import pandas as pd

# Read file
rec_dep = glob.iglob(os.path.join("q3", "dep", "part*"), recursive=True)    
rec_arr = glob.iglob(os.path.join("q3", "arr", "part*"), recursive=True)    
df_dep = pd.concat([pd.read_csv(f, sep="\t", header=None) for f in rec_dep], 
                ignore_index=True)
df_arr = pd.concat([pd.read_csv(f, sep="\t", header=None) for f in rec_arr], 
                ignore_index=True)
df_dep.columns = ['Airport', 'Departure Delay']
df_arr.columns = ['Airport', 'Arrival Delay']
%python

df_dep
  Airport  Departure Delay
0     ATL           206118
1     ORD           183984
2     DFW           135433
3     DEN           109839
4     PHX           105917
%python

df_arr
  Airport  Arrival Delay
0     ATL         209826
1     ORD         188410
2     DFW         148059
3     DEN         122557
4     LAX         113379
%python

ax = df_dep.plot(x='Airport', y='Departure Delay', kind="bar", figsize=(15,7))
ax.set_title("The Top 5 Airports that Occur Departure Delays Most Frequently in 2007")
ax.set_ylabel('Count')
<matplotlib.text.Text object at 0x105a84128>

%python

ax = df_arr.plot(x='Airport', y='Arrival Delay', kind="bar", figsize=(15,7))
ax.set_title("The Top 5 Airports that Occur Arrival Delays Most Frequently in 2007")
ax.set_ylabel('Count')
<matplotlib.text.Text object at 0x10e9df278>

The top 4 airports that occurs delay the most times are the same for departure delay and arrival delay, which are

  1. ATL - William B Hartsfield-Atlanta Intl Airport in Atlanta
  2. ORD - Chicago O’Hare International Airport in Chicago
  3. DFW - Dallas-Fort Worth International Airport in Dallas-Fort Worth
  4. DEN - Denver Intl Airport in Denver

However, the 5th highest departure delay count occurs at

  1. PHX - Phoenix Sky Harbor International Airport in Phoenix

while the 5th highest arrival delay count occurs at

  1. LAX - Los Angeles International in Los Angeles