DataScience
Python
Sklearn
Pandas
Data Source and Variable Definition:
Statistical Computing Statistical Graphics
We are going to use flight information for 2000.
Python Libraries to be used:
import pandas as pd
from IPython.display import display
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import LabelEncoder
import numpy as np
Useful parameters:
df = pd.read_csv('data/air2000_5000.csv', header=0, index_col=False)
df.head()
Year | Month | DayofMonth | DayOfWeek | DepTime | CRSDepTime | ArrTime | CRSArrTime | UniqueCarrier | FlightNum | ... | TaxiIn | TaxiOut | Cancelled | CancellationCode | Diverted | CarrierDelay | WeatherDelay | NASDelay | SecurityDelay | LateAircraftDelay | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2000 | 1 | 28 | 5 | 1647.0 | 1647 | 1906.0 | 1859 | HP | 154 | ... | 15 | 11 | 0 | NaN | 0 | NaN | NaN | NaN | NaN | NaN |
1 | 2000 | 1 | 29 | 6 | 1648.0 | 1647 | 1939.0 | 1859 | HP | 154 | ... | 5 | 47 | 0 | NaN | 0 | NaN | NaN | NaN | NaN | NaN |
2 | 2000 | 1 | 30 | 7 | NaN | 1647 | NaN | 1859 | HP | 154 | ... | 0 | 0 | 1 | NaN | 0 | NaN | NaN | NaN | NaN | NaN |
3 | 2000 | 1 | 31 | 1 | 1645.0 | 1647 | 1852.0 | 1859 | HP | 154 | ... | 7 | 14 | 0 | NaN | 0 | NaN | NaN | NaN | NaN | NaN |
4 | 2000 | 1 | 1 | 6 | 842.0 | 846 | 1057.0 | 1101 | HP | 609 | ... | 3 | 8 | 0 | NaN | 0 | NaN | NaN | NaN | NaN | NaN |
5 rows × 29 columns
# count the number of missing values per column
display(df.isnull().sum())
Year 0
Month 0
DayofMonth 0
DayOfWeek 0
DepTime 174
CRSDepTime 0
ArrTime 178
CRSArrTime 0
UniqueCarrier 0
FlightNum 0
TailNum 0
ActualElapsedTime 178
CRSElapsedTime 0
AirTime 178
ArrDelay 178
DepDelay 174
Origin 0
Dest 0
Distance 0
TaxiIn 0
TaxiOut 0
Cancelled 0
CancellationCode 4999
Diverted 0
CarrierDelay 4999
WeatherDelay 4999
NASDelay 4999
SecurityDelay 4999
LateAircraftDelay 4999
dtype: int64
One of the easiest ways to deal with missing data is to simply remove the corresponding features (columns) or samples (rows) from the dataset entirely. We can call the dropna() method of Dataframe to eliminate rows or columns:
# drop columns with ALL NaN
df_drop_col = df.dropna(axis=1, thresh=1)
df_drop_col.head()
Year | Month | DayofMonth | DayOfWeek | DepTime | CRSDepTime | ArrTime | CRSArrTime | UniqueCarrier | FlightNum | ... | AirTime | ArrDelay | DepDelay | Origin | Dest | Distance | TaxiIn | TaxiOut | Cancelled | Diverted | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2000 | 1 | 28 | 5 | 1647.0 | 1647 | 1906.0 | 1859 | HP | 154 | ... | 233.0 | 7.0 | 0.0 | ATL | PHX | 1587 | 15 | 11 | 0 | 0 |
1 | 2000 | 1 | 29 | 6 | 1648.0 | 1647 | 1939.0 | 1859 | HP | 154 | ... | 239.0 | 40.0 | 1.0 | ATL | PHX | 1587 | 5 | 47 | 0 | 0 |
2 | 2000 | 1 | 30 | 7 | NaN | 1647 | NaN | 1859 | HP | 154 | ... | NaN | NaN | NaN | ATL | PHX | 1587 | 0 | 0 | 1 | 0 |
3 | 2000 | 1 | 31 | 1 | 1645.0 | 1647 | 1852.0 | 1859 | HP | 154 | ... | 226.0 | -7.0 | -2.0 | ATL | PHX | 1587 | 7 | 14 | 0 | 0 |
4 | 2000 | 1 | 1 | 6 | 842.0 | 846 | 1057.0 | 1101 | HP | 609 | ... | 244.0 | -4.0 | -4.0 | ATL | PHX | 1587 | 3 | 8 | 0 | 0 |
5 rows × 23 columns
# drop rows with ANY NaN
df_drop_col_row = df_drop_col.dropna(axis=0, thresh=df_drop_col.shape[1])
df_drop_col_row.head()
Year | Month | DayofMonth | DayOfWeek | DepTime | CRSDepTime | ArrTime | CRSArrTime | UniqueCarrier | FlightNum | ... | AirTime | ArrDelay | DepDelay | Origin | Dest | Distance | TaxiIn | TaxiOut | Cancelled | Diverted | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2000 | 1 | 28 | 5 | 1647.0 | 1647 | 1906.0 | 1859 | HP | 154 | ... | 233.0 | 7.0 | 0.0 | ATL | PHX | 1587 | 15 | 11 | 0 | 0 |
1 | 2000 | 1 | 29 | 6 | 1648.0 | 1647 | 1939.0 | 1859 | HP | 154 | ... | 239.0 | 40.0 | 1.0 | ATL | PHX | 1587 | 5 | 47 | 0 | 0 |
3 | 2000 | 1 | 31 | 1 | 1645.0 | 1647 | 1852.0 | 1859 | HP | 154 | ... | 226.0 | -7.0 | -2.0 | ATL | PHX | 1587 | 7 | 14 | 0 | 0 |
4 | 2000 | 1 | 1 | 6 | 842.0 | 846 | 1057.0 | 1101 | HP | 609 | ... | 244.0 | -4.0 | -4.0 | ATL | PHX | 1587 | 3 | 8 | 0 | 0 |
5 | 2000 | 1 | 2 | 7 | 849.0 | 846 | 1148.0 | 1101 | HP | 609 | ... | 267.0 | 47.0 | 3.0 | ATL | PHX | 1587 | 8 | 24 | 0 | 0 |
5 rows × 23 columns
X = df_drop_col_row.drop('ArrDelay', 1)
y = [int(arrDelay<=0) for arrDelay in df_drop_col_row['ArrDelay']]
X.head()
Year | Month | DayofMonth | DayOfWeek | DepTime | CRSDepTime | ArrTime | CRSArrTime | UniqueCarrier | FlightNum | ... | CRSElapsedTime | AirTime | DepDelay | Origin | Dest | Distance | TaxiIn | TaxiOut | Cancelled | Diverted | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2000 | 1 | 28 | 5 | 1647.0 | 1647 | 1906.0 | 1859 | HP | 154 | ... | 252 | 233.0 | 0.0 | ATL | PHX | 1587 | 15 | 11 | 0 | 0 |
1 | 2000 | 1 | 29 | 6 | 1648.0 | 1647 | 1939.0 | 1859 | HP | 154 | ... | 252 | 239.0 | 1.0 | ATL | PHX | 1587 | 5 | 47 | 0 | 0 |
3 | 2000 | 1 | 31 | 1 | 1645.0 | 1647 | 1852.0 | 1859 | HP | 154 | ... | 252 | 226.0 | -2.0 | ATL | PHX | 1587 | 7 | 14 | 0 | 0 |
4 | 2000 | 1 | 1 | 6 | 842.0 | 846 | 1057.0 | 1101 | HP | 609 | ... | 255 | 244.0 | -4.0 | ATL | PHX | 1587 | 3 | 8 | 0 | 0 |
5 | 2000 | 1 | 2 | 7 | 849.0 | 846 | 1148.0 | 1101 | HP | 609 | ... | 255 | 267.0 | 3.0 | ATL | PHX | 1587 | 8 | 24 | 0 | 0 |
5 rows × 22 columns
One-Hot Encoding is to create a new dummy feature column for each unique value in the nominal feature. To perform this transformation, we can use the OneHotEncoder from Scikit-learn:
print('Shape of input before one-hot: {}'.format(X.shape))
Shape of input before one-hot: (4821, 22)
# Recognize non-numeric columns as categorical columns
cols = X.columns
num_cols = X._get_numeric_data().columns
catego_cols = list(set(cols) - set(num_cols))
# Add other categorical columns
catego_cols.extend(['Year', 'Month', 'DayofMonth', 'DayOfWeek', 'FlightNum'])#, 'Origin', 'Dest'])
print('Categorical Columns: {}'.format(catego_cols))
Categorical Columns: ['UniqueCarrier', 'TailNum', 'Dest', 'Origin', 'Year', 'Month', 'DayofMonth', 'DayOfWeek', 'FlightNum']
First, convert string to interger since The input to OneHotEncoder transformer should be a matrix of integers.
# encode label first
catego_le = LabelEncoder()
for i in catego_cols:
X[i] = catego_le.fit_transform(X[i].values)
classes_list = catego_le.classes_.tolist()
X.head()
Year | Month | DayofMonth | DayOfWeek | DepTime | CRSDepTime | ArrTime | CRSArrTime | UniqueCarrier | FlightNum | ... | CRSElapsedTime | AirTime | DepDelay | Origin | Dest | Distance | TaxiIn | TaxiOut | Cancelled | Diverted | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 0 | 27 | 4 | 1647.0 | 1647 | 1906.0 | 1859 | 3 | 12 | ... | 252 | 233.0 | 0.0 | 0 | 0 | 1587 | 15 | 11 | 0 | 0 |
1 | 0 | 0 | 28 | 5 | 1648.0 | 1647 | 1939.0 | 1859 | 3 | 12 | ... | 252 | 239.0 | 1.0 | 0 | 0 | 1587 | 5 | 47 | 0 | 0 |
3 | 0 | 0 | 30 | 0 | 1645.0 | 1647 | 1852.0 | 1859 | 3 | 12 | ... | 252 | 226.0 | -2.0 | 0 | 0 | 1587 | 7 | 14 | 0 | 0 |
4 | 0 | 0 | 0 | 5 | 842.0 | 846 | 1057.0 | 1101 | 3 | 54 | ... | 255 | 244.0 | -4.0 | 0 | 0 | 1587 | 3 | 8 | 0 | 0 |
5 | 0 | 0 | 1 | 6 | 849.0 | 846 | 1148.0 | 1101 | 3 | 54 | ... | 255 | 267.0 | 3.0 | 0 | 0 | 1587 | 8 | 24 | 0 | 0 |
5 rows × 22 columns
Then we can convert categorical columns using OneHotEncoder.
# find the index of the categorical feature
catego_cols_idx = []
for str in catego_cols:
catego_cols_idx.append(X.columns.tolist().index(str))
# give the column index you want to do one-hot encoding
ohe = OneHotEncoder(categorical_features = catego_cols_idx)
# fit one-hot encoder
onehot_data = ohe.fit_transform(X.values).toarray()
print('Shape of input after one-hot: {}'.format(onehot_data.shape))
Shape of input after one-hot: (4821, 1361)
data = pd.DataFrame(onehot_data, index=X.index)
# This is the format for testing input
classification_test_input = data
data.head()
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 1351 | 1352 | 1353 | 1354 | 1355 | 1356 | 1357 | 1358 | 1359 | 1360 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 1859.0 | 259.0 | 252.0 | 233.0 | 0.0 | 1587.0 | 15.0 | 11.0 | 0.0 | 0.0 |
1 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 1859.0 | 291.0 | 252.0 | 239.0 | 1.0 | 1587.0 | 5.0 | 47.0 | 0.0 | 0.0 |
3 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 1859.0 | 247.0 | 252.0 | 226.0 | -2.0 | 1587.0 | 7.0 | 14.0 | 0.0 | 0.0 |
4 | 1.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 1101.0 | 255.0 | 255.0 | 244.0 | -4.0 | 1587.0 | 3.0 | 8.0 | 0.0 | 0.0 |
5 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 1101.0 | 299.0 | 255.0 | 267.0 | 3.0 | 1587.0 | 8.0 | 24.0 | 0.0 | 0.0 |
5 rows × 1361 columns
Since Naive Bayes Classification requires nonnegative feature values, we need to deal with negative values. Here I use softmax function to transform negative values.
def softmax(x):
"""Compute softmax values for each sets of scores in x."""
e_x = np.exp(x - np.max(x))
return e_x / e_x.sum()
data_nonneg = data
for col in data_nonneg.columns[(data < 0).any()].tolist():
data_nonneg[col] = softmax(data[col])
# This isthe format for nonnegative testing input
classification_nonnegative_test_input = data_nonneg
Note that the target class should be at the last column.
data['ArrDelay'] = y
data_nonneg['ArrDelay'] = y
# This is the format for training input
classification_train_input = data
# This is the format for nonnegative training input
classification_nonnegative_train_input = data_nonneg
data.head()
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 2484 | 2485 | 2486 | 2487 | 2488 | 2489 | 2490 | 2491 | 2492 | ArrDelay | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2000 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 259.0 | 252.0 | 233.0 | 7.0 | 1587.0 | 15.0 | 11.0 | 0.0 | 0.0 | 1 |
2000 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 291.0 | 252.0 | 239.0 | 40.0 | 1587.0 | 5.0 | 47.0 | 0.0 | 0.0 | 0 |
2000 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 247.0 | 252.0 | 226.0 | -7.0 | 1587.0 | 7.0 | 14.0 | 0.0 | 0.0 | 1 |
2000 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 255.0 | 255.0 | 244.0 | -4.0 | 1587.0 | 3.0 | 8.0 | 0.0 | 0.0 | 1 |
2000 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 299.0 | 255.0 | 267.0 | 47.0 | 1587.0 | 8.0 | 24.0 | 0.0 | 0.0 | 0 |
5 rows × 2494 columns
Note that we set headre=False to avoid mapreduce function mistaken header as a data row.
classification_train_input.to_csv('classification_train_input', header=False)
classification_test_input.to_csv('classification_test_input', header=False)
classification_nonnegative_train_input.to_csv('classification_nonnegative_train_input', header=False)
classification_nonnegative_test_input.to_csv('classification_nonnegative_test_input', header=False)
Note that validation file should be split form training file by yourself.