Data Preprocessing on Airline Data

Data Source and Variable Definition:

Statistical Computing Statistical Graphics We are going to use flight information for 2000.

Python Libraries to be used:

import pandas as pd
from IPython.display import display
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import LabelEncoder
import numpy as np

Load Dataset

pandas.read_csv()

Useful parameters:

sep : str, default ‘,’
header : int or list of ints. Row number(s) to use as the column names, and the start of the data.
index_col : int or sequence or False, default None. Column to use as the row labels of the DataFrame.

df = pd.read_csv('data/air2000_test.csv', header=0, index_col=0)
df.head()

	Year	Month	DayofMonth	DayOfWeek	DepTime	CRSDepTime	ArrTime	CRSArrTime	UniqueCarrier	FlightNum	...	TaxiIn	TaxiOut	Cancelled	CancellationCode	CarrierDelay	WeatherDelay	NASDelay	SecurityDelay	LateAircraftDelay
0	2000	1	28	5	1647.0	1647	1906.0	1859	HP	154	...	15	11	0	NaN	NaN	NaN	NaN	NaN	NaN
1	2000	1	29	6	1648.0	1647	1939.0	1859	HP	154	...	5	47	0	NaN	NaN	NaN	NaN	NaN	NaN
2	2000	1	30	7	NaN	1647	NaN	1859	HP	154	...	0	0	1	NaN	NaN	NaN	NaN	NaN	NaN
3	2000	1	31	1	1645.0	1647	1852.0	1859	HP	154	...	7	14	0	NaN	NaN	NaN	NaN	NaN	NaN
4	2000	1	1	6	842.0	846	1057.0	1101	HP	609	...	3	8	0	NaN	NaN	NaN	NaN	NaN	NaN

5 rows × 29 columns

Dealing with Missing Data

# count the number of missing values per column
display(df.isnull().sum())

Year                    0
Month                   0
DayofMonth              0
DayOfWeek               0
DepTime                40
CRSDepTime              0
ArrTime                42
CRSArrTime              0
UniqueCarrier           0
FlightNum               0
TailNum                 0
ActualElapsedTime      42
CRSElapsedTime          0
AirTime                42
ArrDelay               42
DepDelay               40
Origin                  0
Dest                    0
Distance                0
TaxiIn                  0
TaxiOut                 0
Cancelled               0
CancellationCode     1000
Diverted                0
CarrierDelay         1000
WeatherDelay         1000
NASDelay             1000
SecurityDelay        1000
LateAircraftDelay    1000
dtype: int64

Eliminating Samples or Features with Missing Values

One of the easiest ways to deal with missing data is to simply remove the corresponding features (columns) or samples (rows) from the dataset entirely. We can call the dropna() method of Dataframe to eliminate rows or columns:

# drop columns with ALL NaN
df_drop_col = df.dropna(axis=1, thresh=1)
df_drop_col.head()

	Year	Month	DayofMonth	DayOfWeek	DepTime	CRSDepTime	ArrTime	CRSArrTime	UniqueCarrier	FlightNum	...	AirTime	ArrDelay	DepDelay	Origin	Dest	Distance	TaxiIn	TaxiOut	Cancelled
0	2000	1	28	5	1647.0	1647	1906.0	1859	HP	154	...	233.0	7.0	0.0	ATL	PHX	1587	15	11	0
1	2000	1	29	6	1648.0	1647	1939.0	1859	HP	154	...	239.0	40.0	1.0	ATL	PHX	1587	5	47	0
2	2000	1	30	7	NaN	1647	NaN	1859	HP	154	...	NaN	NaN	NaN	ATL	PHX	1587	0	0	1
3	2000	1	31	1	1645.0	1647	1852.0	1859	HP	154	...	226.0	-7.0	-2.0	ATL	PHX	1587	7	14	0
4	2000	1	1	6	842.0	846	1057.0	1101	HP	609	...	244.0	-4.0	-4.0	ATL	PHX	1587	3	8	0

5 rows × 23 columns

# drop rows with ANY NaN
df_drop_col_row = df_drop_col.dropna(axis=0, thresh=df_drop_col.shape[1])
df_drop_col_row.head()

	Year	Month	DayofMonth	DayOfWeek	DepTime	CRSDepTime	ArrTime	CRSArrTime	UniqueCarrier	FlightNum	...	AirTime	ArrDelay	DepDelay	Origin	Dest	Distance	TaxiIn	TaxiOut
0	2000	1	28	5	1647.0	1647	1906.0	1859	HP	154	...	233.0	7.0	0.0	ATL	PHX	1587	15	11
1	2000	1	29	6	1648.0	1647	1939.0	1859	HP	154	...	239.0	40.0	1.0	ATL	PHX	1587	5	47
3	2000	1	31	1	1645.0	1647	1852.0	1859	HP	154	...	226.0	-7.0	-2.0	ATL	PHX	1587	7	14
4	2000	1	1	6	842.0	846	1057.0	1101	HP	609	...	244.0	-4.0	-4.0	ATL	PHX	1587	3	8
5	2000	1	2	7	849.0	846	1148.0	1101	HP	609	...	267.0	47.0	3.0	ATL	PHX	1587	8	24

5 rows × 23 columns

Split Target Class From Attributes

X = df_drop_col_row.drop('ArrDelay', 1)
y = [int(arrDelay<=0) for arrDelay in df_drop_col_row['ArrDelay']]
#cols = df_drop_col_row.columns
#new_cols = ['ArrDelay'] + list(set(cols)-set(['ArrDelay']))
#X = df_drop_col_row[new_cols]
X.head()

	Year	Month	DayofMonth	DayOfWeek	DepTime	CRSDepTime	ArrTime	CRSArrTime	UniqueCarrier	FlightNum	...	CRSElapsedTime	AirTime	DepDelay	Origin	Dest	Distance	TaxiIn	TaxiOut
0	2000	1	28	5	1647.0	1647	1906.0	1859	HP	154	...	252.0	233.0	0.0	ATL	PHX	1587	15	11
1	2000	1	29	6	1648.0	1647	1939.0	1859	HP	154	...	252.0	239.0	1.0	ATL	PHX	1587	5	47
3	2000	1	31	1	1645.0	1647	1852.0	1859	HP	154	...	252.0	226.0	-2.0	ATL	PHX	1587	7	14
4	2000	1	1	6	842.0	846	1057.0	1101	HP	609	...	255.0	244.0	-4.0	ATL	PHX	1587	3	8
5	2000	1	2	7	849.0	846	1148.0	1101	HP	609	...	255.0	267.0	3.0	ATL	PHX	1587	8	24

5 rows × 22 columns

Dealing with categorical Dara

One-Hot Encoding is to create a new dummy feature column for each unique value in the nominal feature. To perform this transformation, we can use the OneHotEncoder from Scikit-learn:

print('Shape of input before one-hot: {}'.format(X.shape))

Shape of input before one-hot: (958, 22)

Select categorical columns

Recognize non-numeric columns as categorical columns
Manually select some numeric columns (ex. ‘Year’, ‘Month’) as categorical columns

# Recognize non-numeric columns as categorical columns
cols = X.columns
num_cols = X._get_numeric_data().columns
catego_cols = list(set(cols) - set(num_cols))

# Add other categorical columns
catego_cols.extend(['Year', 'Month', 'DayofMonth', 'DayOfWeek', 'FlightNum'])#, 'Origin', 'Dest'])

print('Categorical Columns: {}'.format(catego_cols))

Categorical Columns: ['Origin', 'Dest', 'TailNum', 'UniqueCarrier', 'Year', 'Month', 'DayofMonth', 'DayOfWeek', 'FlightNum']

Encode categorical columns

First, convert string to interger since The input to OneHotEncoder transformer should be a matrix of integers.

# encode label first
catego_le = LabelEncoder()

for i in catego_cols:
    X[i] = catego_le.fit_transform(X[i].values)
    classes_list = catego_le.classes_.tolist()
    
X.head()

	DayofMonth	DayOfWeek	DepTime	CRSDepTime	ArrTime	CRSArrTime	FlightNum	...	CRSElapsedTime	AirTime	DepDelay	Distance	TaxiIn	TaxiOut
0	27	4	1647.0	1647	1906.0	1859	3	...	252.0	233.0	0.0	1587	15	11
1	28	5	1648.0	1647	1939.0	1859	3	...	252.0	239.0	1.0	1587	5	47
3	30	0	1645.0	1647	1852.0	1859	3	...	252.0	226.0	-2.0	1587	7	14
4	0	5	842.0	846	1057.0	1101	13	...	255.0	244.0	-4.0	1587	3	8
5	1	6	849.0	846	1148.0	1101	13	...	255.0	267.0	3.0	1587	8	24

5 rows × 22 columns

Then we can convert categorical columns using OneHotEncoder.

# find the index of the categorical feature
catego_cols_idx = []
for str in catego_cols:
    catego_cols_idx.append(X.columns.tolist().index(str))

# give the column index you want to do one-hot encoding
ohe = OneHotEncoder(categorical_features = catego_cols_idx)

# fit one-hot encoder
onehot_data = ohe.fit_transform(X.values).toarray()
print('Shape of input after one-hot: {}'.format(onehot_data.shape))

Shape of input after one-hot: (958, 449)

data = pd.DataFrame(onehot_data, index=X.index)
data.head()

	0	1	2	3	...	439	440	441	442	443	444	445	446
0	1.0	1.0	0.0	0.0	...	1859.0	259.0	252.0	233.0	0.0	1587.0	15.0	11.0
1	1.0	1.0	0.0	0.0	...	1859.0	291.0	252.0	239.0	1.0	1587.0	5.0	47.0
3	1.0	1.0	0.0	0.0	...	1859.0	247.0	252.0	226.0	-2.0	1587.0	7.0	14.0
4	1.0	1.0	1.0	0.0	...	1101.0	255.0	255.0	244.0	-4.0	1587.0	3.0	8.0
5	1.0	1.0	0.0	1.0	...	1101.0	299.0	255.0	267.0	3.0	1587.0	8.0	24.0

5 rows × 449 columns

Append Target Class Back to Dataset

Note that the target class should be at the last column.

data['ArrDelay'] = y
data.head()

	0	1	2	3	...	440	441	442	443	444	445	446	ArrDelay
0	1.0	1.0	0.0	0.0	...	259.0	252.0	233.0	0.0	1587.0	15.0	11.0	0
1	1.0	1.0	0.0	0.0	...	291.0	252.0	239.0	1.0	1587.0	5.0	47.0	0
3	1.0	1.0	0.0	0.0	...	247.0	252.0	226.0	-2.0	1587.0	7.0	14.0	1
4	1.0	1.0	1.0	0.0	...	255.0	255.0	244.0	-4.0	1587.0	3.0	8.0	1
5	1.0	1.0	0.0	1.0	...	299.0	255.0	267.0	3.0	1587.0	8.0	24.0	0

5 rows × 450 columns

Export Preprocessed Data

Note that we set headre=False to avoid mapreduce function mistaken header as a data row.

data.to_csv('logistic_input', header=False)

An Explorer of Things

Hello, my name is Chihling :)