Data Preprocessing on Airline Data
Data Source and Variable Definition:
Statistical Computing Statistical Graphics We are going to use flight information for 2000.
Python Libraries to be used:
import pandas as pd
from IPython.display import display
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import LabelEncoder
import numpy as np
Load Dataset
pandas.read_csv()
Useful parameters:
- sep : str, default ‘,’
- header : int or list of ints. Row number(s) to use as the column names, and the start of the data.
- index_col : int or sequence or False, default None. Column to use as the row labels of the DataFrame.
df = pd.read_csv('data/air2000_test.csv', header=0, index_col=0)
df.head()
Year | Month | DayofMonth | DayOfWeek | DepTime | CRSDepTime | ArrTime | CRSArrTime | UniqueCarrier | FlightNum | ... | TaxiIn | TaxiOut | Cancelled | CancellationCode | Diverted | CarrierDelay | WeatherDelay | NASDelay | SecurityDelay | LateAircraftDelay | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2000 | 1 | 28 | 5 | 1647.0 | 1647 | 1906.0 | 1859 | HP | 154 | ... | 15 | 11 | 0 | NaN | 0 | NaN | NaN | NaN | NaN | NaN |
1 | 2000 | 1 | 29 | 6 | 1648.0 | 1647 | 1939.0 | 1859 | HP | 154 | ... | 5 | 47 | 0 | NaN | 0 | NaN | NaN | NaN | NaN | NaN |
2 | 2000 | 1 | 30 | 7 | NaN | 1647 | NaN | 1859 | HP | 154 | ... | 0 | 0 | 1 | NaN | 0 | NaN | NaN | NaN | NaN | NaN |
3 | 2000 | 1 | 31 | 1 | 1645.0 | 1647 | 1852.0 | 1859 | HP | 154 | ... | 7 | 14 | 0 | NaN | 0 | NaN | NaN | NaN | NaN | NaN |
4 | 2000 | 1 | 1 | 6 | 842.0 | 846 | 1057.0 | 1101 | HP | 609 | ... | 3 | 8 | 0 | NaN | 0 | NaN | NaN | NaN | NaN | NaN |
5 rows × 29 columns
Dealing with Missing Data
# count the number of missing values per column
display(df.isnull().sum())
Year 0
Month 0
DayofMonth 0
DayOfWeek 0
DepTime 40
CRSDepTime 0
ArrTime 42
CRSArrTime 0
UniqueCarrier 0
FlightNum 0
TailNum 0
ActualElapsedTime 42
CRSElapsedTime 0
AirTime 42
ArrDelay 42
DepDelay 40
Origin 0
Dest 0
Distance 0
TaxiIn 0
TaxiOut 0
Cancelled 0
CancellationCode 1000
Diverted 0
CarrierDelay 1000
WeatherDelay 1000
NASDelay 1000
SecurityDelay 1000
LateAircraftDelay 1000
dtype: int64
Eliminating Samples or Features with Missing Values
One of the easiest ways to deal with missing data is to simply remove the corresponding features (columns) or samples (rows) from the dataset entirely. We can call the dropna() method of Dataframe to eliminate rows or columns:
# drop columns with ALL NaN
df_drop_col = df.dropna(axis=1, thresh=1)
df_drop_col.head()
Year | Month | DayofMonth | DayOfWeek | DepTime | CRSDepTime | ArrTime | CRSArrTime | UniqueCarrier | FlightNum | ... | AirTime | ArrDelay | DepDelay | Origin | Dest | Distance | TaxiIn | TaxiOut | Cancelled | Diverted | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2000 | 1 | 28 | 5 | 1647.0 | 1647 | 1906.0 | 1859 | HP | 154 | ... | 233.0 | 7.0 | 0.0 | ATL | PHX | 1587 | 15 | 11 | 0 | 0 |
1 | 2000 | 1 | 29 | 6 | 1648.0 | 1647 | 1939.0 | 1859 | HP | 154 | ... | 239.0 | 40.0 | 1.0 | ATL | PHX | 1587 | 5 | 47 | 0 | 0 |
2 | 2000 | 1 | 30 | 7 | NaN | 1647 | NaN | 1859 | HP | 154 | ... | NaN | NaN | NaN | ATL | PHX | 1587 | 0 | 0 | 1 | 0 |
3 | 2000 | 1 | 31 | 1 | 1645.0 | 1647 | 1852.0 | 1859 | HP | 154 | ... | 226.0 | -7.0 | -2.0 | ATL | PHX | 1587 | 7 | 14 | 0 | 0 |
4 | 2000 | 1 | 1 | 6 | 842.0 | 846 | 1057.0 | 1101 | HP | 609 | ... | 244.0 | -4.0 | -4.0 | ATL | PHX | 1587 | 3 | 8 | 0 | 0 |
5 rows × 23 columns
# drop rows with ANY NaN
df_drop_col_row = df_drop_col.dropna(axis=0, thresh=df_drop_col.shape[1])
df_drop_col_row.head()
Year | Month | DayofMonth | DayOfWeek | DepTime | CRSDepTime | ArrTime | CRSArrTime | UniqueCarrier | FlightNum | ... | AirTime | ArrDelay | DepDelay | Origin | Dest | Distance | TaxiIn | TaxiOut | Cancelled | Diverted | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2000 | 1 | 28 | 5 | 1647.0 | 1647 | 1906.0 | 1859 | HP | 154 | ... | 233.0 | 7.0 | 0.0 | ATL | PHX | 1587 | 15 | 11 | 0 | 0 |
1 | 2000 | 1 | 29 | 6 | 1648.0 | 1647 | 1939.0 | 1859 | HP | 154 | ... | 239.0 | 40.0 | 1.0 | ATL | PHX | 1587 | 5 | 47 | 0 | 0 |
3 | 2000 | 1 | 31 | 1 | 1645.0 | 1647 | 1852.0 | 1859 | HP | 154 | ... | 226.0 | -7.0 | -2.0 | ATL | PHX | 1587 | 7 | 14 | 0 | 0 |
4 | 2000 | 1 | 1 | 6 | 842.0 | 846 | 1057.0 | 1101 | HP | 609 | ... | 244.0 | -4.0 | -4.0 | ATL | PHX | 1587 | 3 | 8 | 0 | 0 |
5 | 2000 | 1 | 2 | 7 | 849.0 | 846 | 1148.0 | 1101 | HP | 609 | ... | 267.0 | 47.0 | 3.0 | ATL | PHX | 1587 | 8 | 24 | 0 | 0 |
5 rows × 23 columns
Split Target Class From Attributes
X = df_drop_col_row.drop('ArrDelay', 1)
y = [int(arrDelay<=0) for arrDelay in df_drop_col_row['ArrDelay']]
#cols = df_drop_col_row.columns
#new_cols = ['ArrDelay'] + list(set(cols)-set(['ArrDelay']))
#X = df_drop_col_row[new_cols]
X.head()
Year | Month | DayofMonth | DayOfWeek | DepTime | CRSDepTime | ArrTime | CRSArrTime | UniqueCarrier | FlightNum | ... | CRSElapsedTime | AirTime | DepDelay | Origin | Dest | Distance | TaxiIn | TaxiOut | Cancelled | Diverted | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2000 | 1 | 28 | 5 | 1647.0 | 1647 | 1906.0 | 1859 | HP | 154 | ... | 252.0 | 233.0 | 0.0 | ATL | PHX | 1587 | 15 | 11 | 0 | 0 |
1 | 2000 | 1 | 29 | 6 | 1648.0 | 1647 | 1939.0 | 1859 | HP | 154 | ... | 252.0 | 239.0 | 1.0 | ATL | PHX | 1587 | 5 | 47 | 0 | 0 |
3 | 2000 | 1 | 31 | 1 | 1645.0 | 1647 | 1852.0 | 1859 | HP | 154 | ... | 252.0 | 226.0 | -2.0 | ATL | PHX | 1587 | 7 | 14 | 0 | 0 |
4 | 2000 | 1 | 1 | 6 | 842.0 | 846 | 1057.0 | 1101 | HP | 609 | ... | 255.0 | 244.0 | -4.0 | ATL | PHX | 1587 | 3 | 8 | 0 | 0 |
5 | 2000 | 1 | 2 | 7 | 849.0 | 846 | 1148.0 | 1101 | HP | 609 | ... | 255.0 | 267.0 | 3.0 | ATL | PHX | 1587 | 8 | 24 | 0 | 0 |
5 rows × 22 columns
Dealing with categorical Dara
One-Hot Encoding is to create a new dummy feature column for each unique value in the nominal feature. To perform this transformation, we can use the OneHotEncoder from Scikit-learn:
print('Shape of input before one-hot: {}'.format(X.shape))
Shape of input before one-hot: (958, 22)
Select categorical columns
- Recognize non-numeric columns as categorical columns
- Manually select some numeric columns (ex. ‘Year’, ‘Month’) as categorical columns
# Recognize non-numeric columns as categorical columns
cols = X.columns
num_cols = X._get_numeric_data().columns
catego_cols = list(set(cols) - set(num_cols))
# Add other categorical columns
catego_cols.extend(['Year', 'Month', 'DayofMonth', 'DayOfWeek', 'FlightNum'])#, 'Origin', 'Dest'])
print('Categorical Columns: {}'.format(catego_cols))
Categorical Columns: ['Origin', 'Dest', 'TailNum', 'UniqueCarrier', 'Year', 'Month', 'DayofMonth', 'DayOfWeek', 'FlightNum']
Encode categorical columns
First, convert string to interger since The input to OneHotEncoder transformer should be a matrix of integers.
# encode label first
catego_le = LabelEncoder()
for i in catego_cols:
X[i] = catego_le.fit_transform(X[i].values)
classes_list = catego_le.classes_.tolist()
X.head()
Year | Month | DayofMonth | DayOfWeek | DepTime | CRSDepTime | ArrTime | CRSArrTime | UniqueCarrier | FlightNum | ... | CRSElapsedTime | AirTime | DepDelay | Origin | Dest | Distance | TaxiIn | TaxiOut | Cancelled | Diverted | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 0 | 27 | 4 | 1647.0 | 1647 | 1906.0 | 1859 | 0 | 3 | ... | 252.0 | 233.0 | 0.0 | 0 | 0 | 1587 | 15 | 11 | 0 | 0 |
1 | 0 | 0 | 28 | 5 | 1648.0 | 1647 | 1939.0 | 1859 | 0 | 3 | ... | 252.0 | 239.0 | 1.0 | 0 | 0 | 1587 | 5 | 47 | 0 | 0 |
3 | 0 | 0 | 30 | 0 | 1645.0 | 1647 | 1852.0 | 1859 | 0 | 3 | ... | 252.0 | 226.0 | -2.0 | 0 | 0 | 1587 | 7 | 14 | 0 | 0 |
4 | 0 | 0 | 0 | 5 | 842.0 | 846 | 1057.0 | 1101 | 0 | 13 | ... | 255.0 | 244.0 | -4.0 | 0 | 0 | 1587 | 3 | 8 | 0 | 0 |
5 | 0 | 0 | 1 | 6 | 849.0 | 846 | 1148.0 | 1101 | 0 | 13 | ... | 255.0 | 267.0 | 3.0 | 0 | 0 | 1587 | 8 | 24 | 0 | 0 |
5 rows × 22 columns
Then we can convert categorical columns using OneHotEncoder.
# find the index of the categorical feature
catego_cols_idx = []
for str in catego_cols:
catego_cols_idx.append(X.columns.tolist().index(str))
# give the column index you want to do one-hot encoding
ohe = OneHotEncoder(categorical_features = catego_cols_idx)
# fit one-hot encoder
onehot_data = ohe.fit_transform(X.values).toarray()
print('Shape of input after one-hot: {}'.format(onehot_data.shape))
Shape of input after one-hot: (958, 449)
data = pd.DataFrame(onehot_data, index=X.index)
data.head()
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 439 | 440 | 441 | 442 | 443 | 444 | 445 | 446 | 447 | 448 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 1859.0 | 259.0 | 252.0 | 233.0 | 0.0 | 1587.0 | 15.0 | 11.0 | 0.0 | 0.0 |
1 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 1859.0 | 291.0 | 252.0 | 239.0 | 1.0 | 1587.0 | 5.0 | 47.0 | 0.0 | 0.0 |
3 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 1859.0 | 247.0 | 252.0 | 226.0 | -2.0 | 1587.0 | 7.0 | 14.0 | 0.0 | 0.0 |
4 | 1.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 1101.0 | 255.0 | 255.0 | 244.0 | -4.0 | 1587.0 | 3.0 | 8.0 | 0.0 | 0.0 |
5 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 1101.0 | 299.0 | 255.0 | 267.0 | 3.0 | 1587.0 | 8.0 | 24.0 | 0.0 | 0.0 |
5 rows × 449 columns
Append Target Class Back to Dataset
Note that the target class should be at the last column.
data['ArrDelay'] = y
data.head()
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 440 | 441 | 442 | 443 | 444 | 445 | 446 | 447 | 448 | ArrDelay | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 259.0 | 252.0 | 233.0 | 0.0 | 1587.0 | 15.0 | 11.0 | 0.0 | 0.0 | 0 |
1 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 291.0 | 252.0 | 239.0 | 1.0 | 1587.0 | 5.0 | 47.0 | 0.0 | 0.0 | 0 |
3 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 247.0 | 252.0 | 226.0 | -2.0 | 1587.0 | 7.0 | 14.0 | 0.0 | 0.0 | 1 |
4 | 1.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 255.0 | 255.0 | 244.0 | -4.0 | 1587.0 | 3.0 | 8.0 | 0.0 | 0.0 | 1 |
5 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 299.0 | 255.0 | 267.0 | 3.0 | 1587.0 | 8.0 | 24.0 | 0.0 | 0.0 | 0 |
5 rows × 450 columns
Export Preprocessed Data
Note that we set headre=False to avoid mapreduce function mistaken header as a data row.
data.to_csv('logistic_input', header=False)