Try   HackMD

Data Preprocessing on Airline Data

tags: DataScience Python Sklearn Pandas

Data Source and Variable Definition:

Statistical Computing Statistical Graphics
We are going to use flight information for 2000.

Python Libraries to be used:

import pandas as pd
from IPython.display import display
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import LabelEncoder
import numpy as np

Load Dataset

pandas.read_csv()

Useful parameters:

  • sep : str, default ‘,’
  • header : int or list of ints. Row number(s) to use as the column names, and the start of the data.
  • index_col : int or sequence or False, default None. Column to use as the row labels of the DataFrame.
df = pd.read_csv('data/air2000_5000.csv', header=0, index_col=False)
df.head()
Year Month DayofMonth DayOfWeek DepTime CRSDepTime ArrTime CRSArrTime UniqueCarrier FlightNum ... TaxiIn TaxiOut Cancelled CancellationCode Diverted CarrierDelay WeatherDelay NASDelay SecurityDelay LateAircraftDelay
0 2000 1 28 5 1647.0 1647 1906.0 1859 HP 154 ... 15 11 0 NaN 0 NaN NaN NaN NaN NaN
1 2000 1 29 6 1648.0 1647 1939.0 1859 HP 154 ... 5 47 0 NaN 0 NaN NaN NaN NaN NaN
2 2000 1 30 7 NaN 1647 NaN 1859 HP 154 ... 0 0 1 NaN 0 NaN NaN NaN NaN NaN
3 2000 1 31 1 1645.0 1647 1852.0 1859 HP 154 ... 7 14 0 NaN 0 NaN NaN NaN NaN NaN
4 2000 1 1 6 842.0 846 1057.0 1101 HP 609 ... 3 8 0 NaN 0 NaN NaN NaN NaN NaN

5 rows × 29 columns

Dealing with Missing Data

# count the number of missing values per column
display(df.isnull().sum())
​​​​Year                    0
​​​​Month                   0
​​​​DayofMonth              0
​​​​DayOfWeek               0
​​​​DepTime               174
​​​​CRSDepTime              0
​​​​ArrTime               178
​​​​CRSArrTime              0
​​​​UniqueCarrier           0
​​​​FlightNum               0
​​​​TailNum                 0
​​​​ActualElapsedTime     178
​​​​CRSElapsedTime          0
​​​​AirTime               178
​​​​ArrDelay              178
​​​​DepDelay              174
​​​​Origin                  0
​​​​Dest                    0
​​​​Distance                0
​​​​TaxiIn                  0
​​​​TaxiOut                 0
​​​​Cancelled               0
​​​​CancellationCode     4999
​​​​Diverted                0
​​​​CarrierDelay         4999
​​​​WeatherDelay         4999
​​​​NASDelay             4999
​​​​SecurityDelay        4999
​​​​LateAircraftDelay    4999
​​​​dtype: int64

Eliminating Samples or Features with Missing Values

One of the easiest ways to deal with missing data is to simply remove the corresponding features (columns) or samples (rows) from the dataset entirely. We can call the dropna() method of Dataframe to eliminate rows or columns:

# drop columns with ALL NaN
df_drop_col = df.dropna(axis=1, thresh=1)
df_drop_col.head()
Year Month DayofMonth DayOfWeek DepTime CRSDepTime ArrTime CRSArrTime UniqueCarrier FlightNum ... AirTime ArrDelay DepDelay Origin Dest Distance TaxiIn TaxiOut Cancelled Diverted
0 2000 1 28 5 1647.0 1647 1906.0 1859 HP 154 ... 233.0 7.0 0.0 ATL PHX 1587 15 11 0 0
1 2000 1 29 6 1648.0 1647 1939.0 1859 HP 154 ... 239.0 40.0 1.0 ATL PHX 1587 5 47 0 0
2 2000 1 30 7 NaN 1647 NaN 1859 HP 154 ... NaN NaN NaN ATL PHX 1587 0 0 1 0
3 2000 1 31 1 1645.0 1647 1852.0 1859 HP 154 ... 226.0 -7.0 -2.0 ATL PHX 1587 7 14 0 0
4 2000 1 1 6 842.0 846 1057.0 1101 HP 609 ... 244.0 -4.0 -4.0 ATL PHX 1587 3 8 0 0

5 rows × 23 columns

# drop rows with ANY NaN
df_drop_col_row = df_drop_col.dropna(axis=0, thresh=df_drop_col.shape[1])
df_drop_col_row.head()
Year Month DayofMonth DayOfWeek DepTime CRSDepTime ArrTime CRSArrTime UniqueCarrier FlightNum ... AirTime ArrDelay DepDelay Origin Dest Distance TaxiIn TaxiOut Cancelled Diverted
0 2000 1 28 5 1647.0 1647 1906.0 1859 HP 154 ... 233.0 7.0 0.0 ATL PHX 1587 15 11 0 0
1 2000 1 29 6 1648.0 1647 1939.0 1859 HP 154 ... 239.0 40.0 1.0 ATL PHX 1587 5 47 0 0
3 2000 1 31 1 1645.0 1647 1852.0 1859 HP 154 ... 226.0 -7.0 -2.0 ATL PHX 1587 7 14 0 0
4 2000 1 1 6 842.0 846 1057.0 1101 HP 609 ... 244.0 -4.0 -4.0 ATL PHX 1587 3 8 0 0
5 2000 1 2 7 849.0 846 1148.0 1101 HP 609 ... 267.0 47.0 3.0 ATL PHX 1587 8 24 0 0

5 rows × 23 columns

Split Target Class From Attributes

X = df_drop_col_row.drop('ArrDelay', 1)
y = [int(arrDelay<=0) for arrDelay in df_drop_col_row['ArrDelay']]
X.head()
Year Month DayofMonth DayOfWeek DepTime CRSDepTime ArrTime CRSArrTime UniqueCarrier FlightNum ... CRSElapsedTime AirTime DepDelay Origin Dest Distance TaxiIn TaxiOut Cancelled Diverted
0 2000 1 28 5 1647.0 1647 1906.0 1859 HP 154 ... 252 233.0 0.0 ATL PHX 1587 15 11 0 0
1 2000 1 29 6 1648.0 1647 1939.0 1859 HP 154 ... 252 239.0 1.0 ATL PHX 1587 5 47 0 0
3 2000 1 31 1 1645.0 1647 1852.0 1859 HP 154 ... 252 226.0 -2.0 ATL PHX 1587 7 14 0 0
4 2000 1 1 6 842.0 846 1057.0 1101 HP 609 ... 255 244.0 -4.0 ATL PHX 1587 3 8 0 0
5 2000 1 2 7 849.0 846 1148.0 1101 HP 609 ... 255 267.0 3.0 ATL PHX 1587 8 24 0 0

5 rows × 22 columns

Deal with categorical Dara

One-Hot Encoding is to create a new dummy feature column for each unique value in the nominal feature. To perform this transformation, we can use the OneHotEncoder from Scikit-learn:

print('Shape of input before one-hot: {}'.format(X.shape))
​​​​Shape of input before one-hot: (4821, 22)

Select categorical columns

  1. Recognize non-numeric columns as categorical columns
  2. Manually select some numeric columns (ex. 'Year', 'Month') as categorical columns
# Recognize non-numeric columns as categorical columns
cols = X.columns
num_cols = X._get_numeric_data().columns
catego_cols = list(set(cols) - set(num_cols))

# Add other categorical columns
catego_cols.extend(['Year', 'Month', 'DayofMonth', 'DayOfWeek', 'FlightNum'])#, 'Origin', 'Dest'])

print('Categorical Columns: {}'.format(catego_cols))
​​​​Categorical Columns: ['UniqueCarrier', 'TailNum', 'Dest', 'Origin', 'Year', 'Month', 'DayofMonth', 'DayOfWeek', 'FlightNum']

Encode categorical columns

First, convert string to interger since The input to OneHotEncoder transformer should be a matrix of integers.

# encode label first
catego_le = LabelEncoder()

for i in catego_cols:
    X[i] = catego_le.fit_transform(X[i].values)
    classes_list = catego_le.classes_.tolist()
    
X.head()
Year Month DayofMonth DayOfWeek DepTime CRSDepTime ArrTime CRSArrTime UniqueCarrier FlightNum ... CRSElapsedTime AirTime DepDelay Origin Dest Distance TaxiIn TaxiOut Cancelled Diverted
0 0 0 27 4 1647.0 1647 1906.0 1859 3 12 ... 252 233.0 0.0 0 0 1587 15 11 0 0
1 0 0 28 5 1648.0 1647 1939.0 1859 3 12 ... 252 239.0 1.0 0 0 1587 5 47 0 0
3 0 0 30 0 1645.0 1647 1852.0 1859 3 12 ... 252 226.0 -2.0 0 0 1587 7 14 0 0
4 0 0 0 5 842.0 846 1057.0 1101 3 54 ... 255 244.0 -4.0 0 0 1587 3 8 0 0
5 0 0 1 6 849.0 846 1148.0 1101 3 54 ... 255 267.0 3.0 0 0 1587 8 24 0 0

5 rows × 22 columns

Then we can convert categorical columns using OneHotEncoder.

# find the index of the categorical feature
catego_cols_idx = []
for str in catego_cols:
    catego_cols_idx.append(X.columns.tolist().index(str))

# give the column index you want to do one-hot encoding
ohe = OneHotEncoder(categorical_features = catego_cols_idx)

# fit one-hot encoder
onehot_data = ohe.fit_transform(X.values).toarray()
print('Shape of input after one-hot: {}'.format(onehot_data.shape))
​​​​Shape of input after one-hot: (4821, 1361)
data = pd.DataFrame(onehot_data, index=X.index)

# This is the format for testing input
classification_test_input = data

data.head()
0 1 2 3 4 5 6 7 8 9 ... 1351 1352 1353 1354 1355 1356 1357 1358 1359 1360
0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 1859.0 259.0 252.0 233.0 0.0 1587.0 15.0 11.0 0.0 0.0
1 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 1859.0 291.0 252.0 239.0 1.0 1587.0 5.0 47.0 0.0 0.0
3 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 1859.0 247.0 252.0 226.0 -2.0 1587.0 7.0 14.0 0.0 0.0
4 1.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 1101.0 255.0 255.0 244.0 -4.0 1587.0 3.0 8.0 0.0 0.0
5 1.0 1.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 1101.0 299.0 255.0 267.0 3.0 1587.0 8.0 24.0 0.0 0.0

5 rows × 1361 columns

Deal with Negative Values

Since Naive Bayes Classification requires nonnegative feature values, we need to deal with negative values. Here I use softmax function to transform negative values.

def softmax(x):
    """Compute softmax values for each sets of scores in x."""
    e_x = np.exp(x - np.max(x))
    return e_x / e_x.sum()
data_nonneg = data
for col in data_nonneg.columns[(data < 0).any()].tolist():
    data_nonneg[col] = softmax(data[col])

# This isthe format for nonnegative testing input
classification_nonnegative_test_input = data_nonneg

Append Target Class Back to Dataset

Note that the target class should be at the last column.

data['ArrDelay'] = y
data_nonneg['ArrDelay'] = y

# This is the format for training input
classification_train_input = data
# This is the format for nonnegative training input
classification_nonnegative_train_input = data_nonneg

data.head()
0 1 2 3 4 5 6 7 8 9 ... 2484 2485 2486 2487 2488 2489 2490 2491 2492 ArrDelay
2000 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 259.0 252.0 233.0 7.0 1587.0 15.0 11.0 0.0 0.0 1
2000 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 291.0 252.0 239.0 40.0 1587.0 5.0 47.0 0.0 0.0 0
2000 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 247.0 252.0 226.0 -7.0 1587.0 7.0 14.0 0.0 0.0 1
2000 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 255.0 255.0 244.0 -4.0 1587.0 3.0 8.0 0.0 0.0 1
2000 1.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 299.0 255.0 267.0 47.0 1587.0 8.0 24.0 0.0 0.0 0

5 rows × 2494 columns

Export Preprocessed Data

Note that we set headre=False to avoid mapreduce function mistaken header as a data row.

classification_train_input.to_csv('classification_train_input', header=False)
classification_test_input.to_csv('classification_test_input', header=False)
classification_nonnegative_train_input.to_csv('classification_nonnegative_train_input', header=False)
classification_nonnegative_test_input.to_csv('classification_nonnegative_test_input', header=False)

Note that validation file should be split form training file by yourself.