Deep Learning for Flight Delay Prediction in the US

Nima Salehi - nsalehi@umich.edu

1. Introduction

The Bureau of Transportation Statistics (BTS) reported a record 3.5 million commercial flights from/to California in 2016. According to BTS, a flight is considered delayed when it departs/arrives 15 or more minutes later than its scheduled time. As the figure below shows, around 82% of all flights are completed on time and about 18% are delayed. The US Department of Transportation (DOT) identifies weather, the National Aviation System (NAS), security, and late-arriving aircraft as the most important causes. In the US, the Federal Aviation Administration [1] estimates that flight delays cost airlines $22 billion yearly. Time is money, and delayed flights are a frequent cause of frustration for both travelers and airlines [2]. Knowledge of the factors leading to specific flight delays can help aviation authorities and airlines take the actions necessary to ensure smooth operations. A delayed flight also has economic (missed connections and cancellations), environmental (wasted fuel), and social (lost productivity and airport congestion) consequences. The objective of this study is to predict the delayed departure of a flight. Correctly predicting flight delays allows passengers to prepare for the disruption of their journey, and airlines to respond proactively to the potential causes of delay and mitigate its impact. This helps airlines save money and lets passengers expect shorter waiting times at airports.
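The 15-minute BTS rule above is exactly the binary label this study predicts. A minimal sketch of how it can be encoded (toy delay values, not drawn from the BTS data):

```python
import pandas as pd

# Toy departure delays in minutes (negative = left early); made-up values.
flights = pd.DataFrame({'DEP_DELAY': [-5.0, 0.0, 14.0, 15.0, 42.0]})

# BTS rule: a flight counts as delayed when it is 15 or more minutes late.
flights['DEP_DEL15'] = (flights['DEP_DELAY'] >= 15).astype(int)

print(flights['DEP_DEL15'].tolist())  # [0, 0, 0, 1, 1]
print(flights['DEP_DEL15'].mean())    # share of delayed flights: 0.4
```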

In [3]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline 
import tensorflow as tf
import pandas as pd
import csv

# Reading data: one CSV file per month of 2016
months = ['jan', 'feb', 'mar', 'apr', 'may', 'jun',
          'jul', 'aug', 'sep', 'oct', 'nov', 'dec']
df = pd.concat(pd.read_csv('/mnt/c/Users/nsalehi/Desktop/airline_ca/%s16.csv' % m,
                           error_bad_lines=False, sep=',')
               for m in months)
df.drop(['Unnamed: 35', 'YEAR'], axis=1, inplace=True)
print df.shape

#Removing rows related to cancelled or diverted flights
df = df.drop(df[df.CANCELLED != 0].index)
df = df.drop(df[df.DIVERTED != 0].index)
#Removing cancelled and diverted columns in df
df.drop(['CANCELLED', 'DIVERTED'], axis=1, inplace=True)
print df.shape
df.to_csv('/mnt/c/Users/nsalehi/Desktop/airline_ca/all_16.csv',sep=',')

#Create new variables: Hour of CRS Departure/Arrival Time
df['CRS_DEP_HR'] = df['CRS_DEP_TIME']//100
df['CRS_ARR_HR'] = df['CRS_ARR_TIME']//100

df['DEP_HR'] = df['DEP_TIME']//100
df['ARR_HR'] = df['ARR_TIME']//100

#Use .loc to avoid SettingWithCopyWarning; hour 24 means midnight (0)
df.loc[df['DEP_HR'] == 24, 'DEP_HR'] = 0
df.loc[df['ARR_HR'] == 24, 'ARR_HR'] = 0

df.loc[df['MONTH'] == 0, 'MONTH'] = 12  #fix only the MONTH column, not entire rows

print 'done!!!!!'
(3433625, 34)
(3103820, 32)
done!!!!!
In [2]:
import seaborn

plt.figure(figsize=(4,6))

dep_ratio = df[['DEP_DEL15']].groupby('DEP_DEL15').size()
print 'Delay ratio to total=', 100*(1-dep_ratio[0]/float(dep_ratio[0]+dep_ratio[1])), "%"
arr_ratio = df[['ARR_DEL15']].groupby('ARR_DEL15').size()
bar_width = 0.4
plt.bar(np.arange(0,2)-0.5*bar_width,dep_ratio, width = bar_width, alpha=0.5, color= 'g')


x_labels = ['On-time', 'Delayed']
plt.xticks(np.arange(0,2), x_labels)
plt.ylabel('Number of Flights')
plt.title('2016 Delay vs. On-time Flights')
Delay ratio to total= 17.925105193 %
Out[2]:
<matplotlib.text.Text at 0x7fa4492ef090>

2. Data Wrangling

In this study we use a dataset from the Bureau of Transportation Statistics (BTS) [3] known as the on-time performance data. This dataset contains scheduled and actual departure and arrival times reported by U.S. air carriers. Additional data elements in this database include departure and arrival delays, origin and destination airports, flight numbers, cancelled or diverted flights, taxi-out and taxi-in times, air time, and non-stop distance. Since the full dataset is very large, we focus only on flights in 2016 from/to all airports in California. First, we delete some unrelated variables. We then check the correlations among the variables and select the appropriate ones. For example, quarter and month are highly correlated and have similar relationships to the departure/arrival delays; between them, we keep only month because it is more detailed. After this process, our final dataset contains 3,103,820 observations and 7 predictors (month, day of week, departure time, unique carrier, origin airport, destination airport, and distance group) that could be used to predict delays, with categorical and continuous response variables corresponding to departure delay and arrival delay. According to their attributes, we divide the predictors into three categories: time, carrier, and location. We analyze the relationships between them and delays in the next section.
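The quarter/month redundancy mentioned above can be checked with a plain correlation. A sketch on a hypothetical 12-row calendar (QUARTER derived from MONTH, as in the BTS schema):

```python
import pandas as pd

# Hypothetical one-year calendar: MONTH 1..12 and the QUARTER it belongs to.
cal = pd.DataFrame({'MONTH': range(1, 13)})
cal['QUARTER'] = (cal['MONTH'] - 1) // 3 + 1

# Pearson correlation between the two candidate predictors.
r = cal['MONTH'].corr(cal['QUARTER'])
print(round(r, 3))  # ~0.97: nearly redundant, so keeping only MONTH is reasonable
```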

2.1. Variable Description

Variable        Description
--------------  -----------------------------------------------------------------
Month           Month
DayOfWeek       Day of Week
DayOfMonth      Day of Month
UniqueCarrier   Unique Carrier Code
Origin          Origin Airport
Dest            Destination Airport
CRSDepTime      CRS Departure Time (local time: hhmm)
DepTime         Actual Departure Time (local time: hhmm)
DepDelay        Difference in minutes between scheduled and actual departure time
DepDel15        Departure Delay Indicator, 15 Minutes or More (1=Yes)
CRSArrTime      CRS Arrival Time (local time: hhmm)
ArrTime         Actual Arrival Time (local time: hhmm)
ArrDelay        Difference in minutes between scheduled and actual arrival time
ArrDel15        Arrival Delay Indicator, 15 Minutes or More (1=Yes)
Cancelled       Cancelled Flight Indicator (1=Yes)
Diverted        Diverted Flight Indicator (1=Yes)
CRSElapsedTime  CRS Elapsed Time of Flight, in Minutes
Distance        Distance between airports (miles)
DistanceGroup   Distance Intervals, every 250 Miles, for Flight Segment

3. Data Exploration Analysis

In this section we look at the flights dataset with respect to the different features introduced above.

3.1 Distribution of the Delays

As a first exploratory analysis, we consider the observed distribution of delays in minutes over the entire dataset. A histogram is the most effective view, looking at departure and arrival delays separately.

In [15]:
import matplotlib.pyplot as plt # module for plotting
%matplotlib inline
import seaborn

#plt arrival delay distribution
plt.figure(figsize=(12, 6))
plt.hist(df.DEP_DELAY.dropna(),bins = 1000,normed=1, alpha=0.5, color= 'b', label='Departure')
plt.hist(df.ARR_DELAY.dropna(),bins = 1000,normed=1, alpha=0.3,color= 'r', label='Arrival')

plt.legend(loc='upper right')
plt.xlim(-65,190)
plt.xlabel('Minutes')
plt.ylabel('Probability')
plt.title('2016 Departure/Arrival Delay Distribution')
Out[15]:
<matplotlib.text.Text at 0x7ff128773d90>

As the figure shows, shorter delays have higher probability for both departure and arrival. The means of both delay distributions appear to be negative, and both have a long, thin right tail, meaning that some flights are delayed for a very long time. The range of arrival delays is slightly wider than that of departure delays. In both cases the mode of the distribution is below zero, meaning that most flights left their gates and arrived at their gates before the scheduled times. The x-axis of both plots is the delay time in minutes. A departure delay compares the scheduled departure time to the actual departure time; an arrival delay compares the scheduled departure time plus the estimated flight duration to the actual arrival time. Airlines may build some buffer into their estimates of time in the air, so the difference between the departure and arrival delay distributions indicates that some departure delays are recovered during the flight thanks to this extra embedded time.
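The buffer-time argument above can be quantified as minutes recovered en route, i.e. DEP_DELAY minus ARR_DELAY. A toy sketch with invented delay values:

```python
import pandas as pd

# Invented departure/arrival delays (minutes) for four flights.
toy = pd.DataFrame({'DEP_DELAY': [20.0, 5.0, 0.0, 45.0],
                    'ARR_DELAY': [10.0, -3.0, 2.0, 30.0]})

# Positive values mean the flight made up time in the air.
toy['RECOVERED'] = toy['DEP_DELAY'] - toy['ARR_DELAY']
print(toy['RECOVERED'].tolist())  # [10.0, 8.0, -2.0, 15.0]
print(toy['RECOVERED'].mean())    # 7.75 minutes recovered on average
```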

3.2. Time Dependant Variables

First, we explore the relationships between delays and several variables related to time.

3.2.1. Distribution of Flights by Month

As the figure shows, there is a significantly lower number of flights in the first six months of the year!

In [4]:
plt.figure(figsize=(12, 6))

dep_avg_month_df = df[['MONTH']].groupby('MONTH').size()

bar_width = 0.4
plt.bar(np.arange(1,13)-0.5*bar_width,dep_avg_month_df, width = bar_width, alpha=0.5, color= 'g')


x_labels = ['January', 'February', 'March', 'April', 'May', 'June', 'July', 'August', 'September', 'October', 'November', 'December']
plt.xticks(np.arange(1,13), x_labels, rotation='vertical')
plt.xlabel('Month')
plt.ylabel('Number of Flights')
plt.title('2016 Number of Flights by Month')
/home/nsalehi/anaconda2/lib/python2.7/site-packages/matplotlib/axes/_axes.py:519: UserWarning: No labelled objects found. Use label='...' kwarg on individual plots.
  warnings.warn("No labelled objects found. "
Out[4]:
<matplotlib.text.Text at 0x7ff16695d610>
In [5]:
plt.figure(figsize=(12, 6))

dep_avg_month_df = df[['MONTH','DEP_DELAY']].groupby('MONTH').mean()
arr_avg_month_df = df[['MONTH','ARR_DELAY']].groupby('MONTH').mean()

bar_width = 0.4
plt.bar(np.arange(1,13)-bar_width,dep_avg_month_df.DEP_DELAY, width = bar_width, alpha=0.5, color= 'b', label='Departure')
plt.bar(np.arange(1,13),arr_avg_month_df.ARR_DELAY, width = bar_width, alpha=0.3, color= 'r', label='Arrival')


x_labels = ['January', 'February', 'March', 'April', 'May', 'June', 'July', 'August', 'September', 'October', 'November', 'December']
plt.xticks(np.arange(1,13), x_labels, rotation='vertical')

plt.legend(loc='upper left')
plt.xlabel('Month')
plt.ylabel('Average')
plt.title('2016 Departure/Arrival Average Delay Time by Month')
Out[5]:
<matplotlib.text.Text at 0x7ff1667e0890>

For both departure and arrival delays, the summer months and December have the highest average delay times, as expected given the high traffic in those months. On the other hand, September, October and November are the months with the least delay, while March also posts high values. A positive correlation between departure and arrival delays can be observed.

3.2.2. Distribution of Flights by Day of the Week

The flights are evenly distributed across the week. The average departure delay on weekends appears slightly lower than on weekdays.

In [6]:
plt.figure(figsize=(12, 6))

dep_avg_week_df = df[['DAY_OF_WEEK']].groupby('DAY_OF_WEEK').size()

bar_width = 0.4
plt.bar(np.arange(1,8)-0.5*bar_width,dep_avg_week_df, width = bar_width, alpha=0.5, color= 'g')

x_labels = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']
plt.xticks(np.arange(1,8), x_labels)
plt.xlabel('Day of the Week')
plt.ylabel('Number of Flights')
plt.title('2016 Number of Flights by Day of The Week')
Out[6]:
<matplotlib.text.Text at 0x7ff1666ac4d0>
In [7]:
plt.figure(figsize=(12, 6))

dep_avg_week_df = df[['DAY_OF_WEEK','DEP_DELAY']].groupby('DAY_OF_WEEK').mean()
arr_avg_week_df = df[['DAY_OF_WEEK','ARR_DELAY']].groupby('DAY_OF_WEEK').mean()

bar_width = 0.4
plt.bar(np.arange(1,8)-bar_width,dep_avg_week_df.DEP_DELAY, width = bar_width, alpha=0.5, color= 'b', label='Departure')
plt.bar(np.arange(1,8),arr_avg_week_df.ARR_DELAY, width = bar_width, alpha=0.3, color= 'r', label='Arrival')

x_labels = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']
plt.xticks(np.arange(1,8), x_labels)

plt.legend(loc='upper left')
plt.xlabel('Day of the Week')
plt.ylabel('Average')
plt.title('2016 Departure/Arrival Average Delay Time by Day of The Week')
Out[7]:
<matplotlib.text.Text at 0x7ff166596150>

The average delays are fairly similar across the week, with the highest delays on Friday and the lowest on Tuesday. A positive correlation between departure and arrival delays can be observed.

3.2.3. Distribution of The Flights by Time of the Day

As expected, there are fewer flights during the early hours of the day (1:00 AM to 5:00 AM).

In [8]:
plt.figure(figsize=(12,6))   
dep_avg_hr_df = df[['DEP_HR']].groupby('DEP_HR').size()
crs_dep_avg_hr_df = df[['CRS_DEP_HR']].groupby('CRS_DEP_HR').size()
bar_width = 0.4
plt.bar(np.arange(0,24),dep_avg_hr_df, width = bar_width, alpha=0.5, color= 'g', label='Actual')
plt.bar(np.arange(0,24),crs_dep_avg_hr_df, width = bar_width, alpha=0.3, color= 'g', label='Scheduled')

plt.legend(loc='upper left')
plt.xticks(np.arange(0,24))
plt.xlabel('Time of the Day')
plt.ylabel('Number of Departures')
plt.title('2016 Number of Departures by Time of the Day')
Out[8]:
<matplotlib.text.Text at 0x7ff1664a59d0>
In [9]:
plt.figure(figsize=(12,6))   
dep_avg_hr_df = df[['CRS_DEP_HR','DEP_DELAY']].groupby('CRS_DEP_HR').mean()
arr_avg_hr_df = df[['CRS_ARR_HR','ARR_DELAY']].groupby('CRS_ARR_HR').mean()

bar_width = 0.4
plt.bar(np.arange(0,24),dep_avg_hr_df.DEP_DELAY, width = bar_width, alpha=0.5, color= 'b', label='Departure')
plt.bar(np.arange(0,24),arr_avg_hr_df.ARR_DELAY, width = bar_width, alpha=0.3, color= 'r', label='Arrival')

plt.legend(loc='upper center')
plt.xticks(np.arange(0,24))
plt.xlabel('Time of the Day')
plt.ylabel('Average Delay in Minutes')
plt.title('2016 Departure/Arrival Average Delay Time by Time of the Day (Scheduled)')
Out[9]:
<matplotlib.text.Text at 0x7ff166259090>

A "V"-shaped pattern with the lowest delays in the early morning hours can be observed, due to the low traffic at those hours. Both departure and arrival delays accumulate from the early morning, reaching their peaks in the evening (18:00 to 21:00); the peak is flat for arrival delays from 19:00 to 22:00. The increasing trend of average flight delays over the hours of the day is mainly caused by delay propagation throughout the day. Although some flights are scheduled with buffer time for unforeseeable delays, this buffer is not sufficient to cover all types of delay. As a result, if a flight is delayed, the next flight has to wait for the late-arriving aircraft to be ready before it can operate. Hence, flight delays for both departures and arrivals propagate over time.
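The propagation pattern described above is just a groupby over the scheduled hour; on a toy table (made-up hours and delays) the accumulation is easy to verify:

```python
import pandas as pd

# Made-up flights: scheduled departure hour and departure delay (minutes).
toy = pd.DataFrame({'CRS_DEP_HR': [6, 6, 12, 12, 19, 19],
                    'DEP_DELAY':  [1.0, 3.0, 8.0, 10.0, 18.0, 22.0]})

# Average delay per scheduled hour, as in the bar plots above.
hourly = toy.groupby('CRS_DEP_HR')['DEP_DELAY'].mean()
print(hourly.tolist())  # [2.0, 9.0, 20.0] -- delays build up over the day
```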

In [32]:
plt.figure(figsize=(12,6))   
dep_avg_hr_df = df[['DEP_HR','DEP_DELAY']].groupby('DEP_HR').mean()
arr_avg_hr_df = df[['ARR_HR','ARR_DELAY']].groupby('ARR_HR').mean()

bar_width = 0.4
plt.bar(np.arange(0,24),dep_avg_hr_df.DEP_DELAY, width = bar_width, alpha=0.5, color= 'b', label='Departure')
plt.bar(np.arange(0,24),arr_avg_hr_df.ARR_DELAY, width = bar_width, alpha=0.3, color= 'r', label='Arrival')

plt.legend(loc='upper center')
plt.xticks(np.arange(0,24))
plt.xlabel('Time of the Day')
plt.ylabel('Average Delay in Minutes')
plt.title('2016 Departure/Arrival Average Delay Time by Time of the Day (Actual)')
Out[32]:
<matplotlib.text.Text at 0x7ff14f57c8d0>

3.3. Location Dependant Variables

3.3.1. Distribution of The Flights by Origin/Destination Airports

There are 308 airports in California. We identify four of them as the most important: SFO (San Francisco), LAX (Los Angeles), SAN (San Diego), and OAK (Oakland). Here we focus on these four:

In [4]:
df_lax = df[(df['ORIGIN'] == 'LAX') | (df['DEST']=='LAX')]
df_sfo = df[(df['ORIGIN'] == 'SFO') | (df['DEST']=='SFO')]
df_san = df[(df['ORIGIN'] == 'SAN') | (df['DEST']=='SAN')]
df_oak = df[(df['ORIGIN'] == 'OAK') | (df['DEST']=='OAK')]
In [4]:
f, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2, sharex='col', sharey='row', figsize=(12,8))   

dep_avg_lax_df = df_lax[['MONTH','DEP_DELAY']].groupby('MONTH').mean()
arr_avg_lax_df = df_lax[['MONTH','ARR_DELAY']].groupby('MONTH').mean()

bar_width = 0.4
ax1.bar(np.arange(1,13)-bar_width,dep_avg_lax_df.DEP_DELAY, width = bar_width, alpha=0.5, color= 'b', label='Departure')
ax1.bar(np.arange(1,13),arr_avg_lax_df.ARR_DELAY, width = bar_width, alpha=0.3, color= 'r', label='Arrival')
ax1.set_title('LAX')
ax1.legend(loc='upper left',prop={'size':6})
ax1.set_xticks(np.arange(1,13))

dep_avg_sfo_df = df_sfo[['MONTH','DEP_DELAY']].groupby('MONTH').mean()
arr_avg_sfo_df = df_sfo[['MONTH','ARR_DELAY']].groupby('MONTH').mean()

ax2.bar(np.arange(1,13)-bar_width,dep_avg_sfo_df.DEP_DELAY, width = bar_width, alpha=0.5, color= 'b', label='Departure')
ax2.bar(np.arange(1,13),arr_avg_sfo_df.ARR_DELAY, width = bar_width, alpha=0.3, color= 'r', label='Arrival')
ax2.set_title('SFO')
ax2.set_xticks(np.arange(1,13))

dep_avg_san_df = df_san[['MONTH','DEP_DELAY']].groupby('MONTH').mean()
arr_avg_san_df = df_san[['MONTH','ARR_DELAY']].groupby('MONTH').mean()

ax3.bar(np.arange(1,13)-bar_width,dep_avg_san_df.DEP_DELAY, width = bar_width, alpha=0.5, color= 'b', label='Departure')
ax3.bar(np.arange(1,13),arr_avg_san_df.ARR_DELAY, width = bar_width, alpha=0.3, color= 'r', label='Arrival')
ax3.set_title('SAN')
ax3.set_xticks(np.arange(1,13))


dep_avg_oak_df = df_oak[['MONTH','DEP_DELAY']].groupby('MONTH').mean()
arr_avg_oak_df = df_oak[['MONTH','ARR_DELAY']].groupby('MONTH').mean()

ax4.bar(np.arange(1,13)-bar_width,dep_avg_oak_df.DEP_DELAY, width = bar_width, alpha=0.5, color= 'b', label='Departure')
ax4.bar(np.arange(1,13),arr_avg_oak_df.ARR_DELAY, width = bar_width, alpha=0.3, color= 'r', label='Arrival')
ax4.set_title('OAK')
ax4.set_xticks(np.arange(1,13))

In the figure above, we can see that flights departing from SFO and LAX tend to have longer delay times than the overall average for every month, as they are the two busiest airports in California. Specifically, flights departing from SFO have longer delays in March, October, and December. Located in a popular tourist city, LAX is expected to be at the peak of its traffic during the summer vacation and Christmas, and the result is consistent with this expectation: flights departing from LAX have longer delays in December, June, July, and August. Among these four airports, SAN has the shortest departure and arrival delays.

3.3.1.1. Interactive from/to delay Heatmap for the main four airports
In [5]:
from ipywidgets import *
from IPython.display import display
import seaborn as sns

def HeatPlotting(delay_threshold):
    
    try:
        fig = plt.figure(figsize=(12,12))
        dep_avg_org_lax = df_lax[['DEST','ORIGIN','DEP_DELAY']].groupby(['ORIGIN', 'DEST']).mean()
        dep_avg_org_lax = dep_avg_org_lax.reset_index()
        dep_avg_org_lax = dep_avg_org_lax[dep_avg_org_lax.DEP_DELAY>delay_threshold]
        origins_lax = dep_avg_org_lax.ORIGIN.unique()
        destinations_lax = dep_avg_org_lax.DEST.unique()

        df_dep_lax = pd.DataFrame(dep_avg_org_lax.values).pivot(0,1,2).fillna(0)
        ax1 = fig.add_subplot(221)
        ax1 = sns.heatmap(df_dep_lax,cmap="YlGnBu")
        ax1.set_title('LAX')

        dep_avg_org_sfo = df_sfo[['DEST','ORIGIN','DEP_DELAY']].groupby(['ORIGIN', 'DEST']).mean()
        dep_avg_org_sfo = dep_avg_org_sfo.reset_index()
        dep_avg_org_sfo = dep_avg_org_sfo[dep_avg_org_sfo.DEP_DELAY>delay_threshold]
        origins_sfo = dep_avg_org_sfo.ORIGIN.unique()
        destinations_sfo = dep_avg_org_sfo.DEST.unique()

        df_dep_sfo = pd.DataFrame(dep_avg_org_sfo.values).pivot(0,1,2).fillna(0)
        ax2 = fig.add_subplot(222)
        ax2 = sns.heatmap(df_dep_sfo,cmap="YlGnBu")
        ax2.set_title('SFO')

        dep_avg_org_san = df_san[['DEST','ORIGIN','DEP_DELAY']].groupby(['ORIGIN', 'DEST']).mean()
        dep_avg_org_san = dep_avg_org_san.reset_index()
        dep_avg_org_san = dep_avg_org_san[dep_avg_org_san.DEP_DELAY>delay_threshold]
        origins_san = dep_avg_org_san.ORIGIN.unique()
        destinations_san = dep_avg_org_san.DEST.unique()

        df_dep_san = pd.DataFrame(dep_avg_org_san.values).pivot(0,1,2).fillna(0)
        ax3 = fig.add_subplot(223)
        ax3 = sns.heatmap(df_dep_san,cmap="YlGnBu")
        ax3.set_title('SAN')

        dep_avg_org_oak = df_oak[['DEST','ORIGIN','DEP_DELAY']].groupby(['ORIGIN', 'DEST']).mean()
        dep_avg_org_oak = dep_avg_org_oak.reset_index()
        dep_avg_org_oak = dep_avg_org_oak[dep_avg_org_oak.DEP_DELAY>delay_threshold]
        origins_oak = dep_avg_org_oak.ORIGIN.unique()
        destinations_oak = dep_avg_org_oak.DEST.unique()

        df_dep_oak = pd.DataFrame(dep_avg_org_oak.values).pivot(0,1,2).fillna(0)
        ax4 = fig.add_subplot(224)
        ax4 = sns.heatmap(df_dep_oak,cmap="YlGnBu")
        ax4.set_title('OAK')
        
    except Exception:
        print ("No available flights in some airports!")
    
w = widgets.IntSlider(
    value=15,
    min=0,
    max=50,
    step=5,
    description='Delay (min):',
    disabled=False,
    continuous_update=False,
    orientation='horizontal',
    readout=True,
    readout_format='i',
    slider_color='white'
)
display (w)
interact(HeatPlotting, delay_threshold=w)
Out[5]:
<function __main__.HeatPlotting>

Most of the longer delays belong to destinations with smaller airports; small airports appear to have higher delays than medium or large ones. This suggests we need a measure of airport size.
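One candidate size measure is simply the number of flights an airport handles. A sketch under that assumption (real IATA codes, but invented counts and delays):

```python
import pandas as pd

# Invented sample: origin airport and departure delay per flight.
toy = pd.DataFrame({'ORIGIN':    ['LAX', 'LAX', 'LAX', 'SMF', 'ACV'],
                    'DEP_DELAY': [5.0, 7.0, 6.0, 20.0, 35.0]})

# Flight count as a crude size proxy, next to the mean delay.
size = toy.groupby('ORIGIN').size().rename('N_FLIGHTS')
delay = toy.groupby('ORIGIN')['DEP_DELAY'].mean().rename('MEAN_DELAY')
summary = pd.concat([size, delay], axis=1).sort_values('N_FLIGHTS')
print(summary)  # smallest airports first; here they show the largest delays
```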

3.3.2. Interactive Geographical Distribution of Delays

In this section, we explore the relationship between delays and the locations of the different airports. To do so, we use the OpenFlights database [4] and merge it with our flight dataset; OpenFlights maps the IATA [5] airport codes to the corresponding columns in the flights dataset. The geographical distribution of the flights in our dataset is illustrated below:

In [6]:
df_airport = pd.read_table('/mnt/c/Users/nsalehi/Desktop/airline_ca/airports.dat', error_bad_lines=False,sep=',', header=None)
df_airport.columns =['Airport ID','Name','City','Country','code','ICAO','lat','long','Altitude','Timezone','DST','Tz database time zone','Type','Source']

df_airport.drop(['Airport ID','Name','City','Country','ICAO','Altitude','Timezone','DST','Tz database time zone','Type','Source'], axis=1, inplace=True)

dep_avg_org_df = df[['ORIGIN','DEP_DELAY']].groupby('ORIGIN').mean()
dep_avg_org_df = dep_avg_org_df.reset_index()

arr_avg_org_df = df[['ORIGIN','ARR_DELAY']].groupby('ORIGIN').mean()
arr_avg_org_df = arr_avg_org_df.reset_index()

dep_map_airport_delay = pd.merge(dep_avg_org_df, df_airport, left_on = 'ORIGIN', right_on = 'code')
arr_map_airport_delay = pd.merge(arr_avg_org_df, df_airport, left_on = 'ORIGIN', right_on = 'code')
In [7]:
import folium
from folium.plugins import MarkerCluster
lax_cord = (33.942809, -118.404706)

 
# create a map centered on LAX
map = folium.Map(location=lax_cord, zoom_start=3,tiles='Mapbox Bright')

for i in dep_map_airport_delay.iterrows():
    folium.CircleMarker(location=[i[1]['lat'],i[1]['long']], radius=i[1]['DEP_DELAY'],
        popup=i[1]['code'] + ', Departure Delay Average=' + str(i[1]['DEP_DELAY']),
                    fill_color='#4c4cff').add_to(map)

for i in arr_map_airport_delay.iterrows():
    folium.CircleMarker(location=[i[1]['lat'],i[1]['long']], radius=i[1]['ARR_DELAY'],
        popup=i[1]['code'] + ', Arrival Delay Average=' + str(i[1]['ARR_DELAY']),
                    fill_color='#ff4c4c').add_to(map)
    
display(map)
3.3.2.1. Interactive Geographical Distribution of Delays for LAX
In [8]:
dep_avg_lax_df = df_lax[['ORIGIN','DEP_DELAY']].groupby('ORIGIN').mean()
dep_avg_lax_df = dep_avg_lax_df.reset_index()

arr_avg_lax_df = df_lax[['ORIGIN','ARR_DELAY']].groupby('ORIGIN').mean()
arr_avg_lax_df = arr_avg_lax_df.reset_index()

dep_map_lax_delay = pd.merge(dep_avg_lax_df, df_airport, left_on = 'ORIGIN', right_on = 'code')
arr_map_lax_delay = pd.merge(arr_avg_lax_df, df_airport, left_on = 'ORIGIN', right_on = 'code')

lax_cord = (33.942809, -118.404706)

 
# create a map centered on LAX
map = folium.Map(location=lax_cord, zoom_start=4,tiles='Mapbox Bright')

for i in dep_map_lax_delay.iterrows():
    folium.CircleMarker(location=[i[1]['lat'],i[1]['long']], radius=i[1]['DEP_DELAY'],
        popup=i[1]['code'] + ', Departure Delay Average=' + str(i[1]['DEP_DELAY']),
                    fill_color='#4c4cff').add_to(map)

for i in arr_map_lax_delay.iterrows():
    folium.CircleMarker(location=[i[1]['lat'],i[1]['long']], radius=i[1]['ARR_DELAY'],
        popup=i[1]['code'] + ', Arrival Delay Average=' + str(i[1]['ARR_DELAY']),
                    fill_color='#ff4c4c').add_to(map)
    
display(map)

As mentioned before, most of the longer delays belong to destinations with smaller airports, which again suggests we need a measure of airport size. Additionally, the distance between the origin and destination airports can be considered as another factor.

3.3.3. Interactive Geographical Distribution of Delays by Flow of the Flights

In [9]:
def RoutePlotting(delay_threshold2,delay_threshold3):
    org_dest = df[['DEST','ORIGIN','DEP_DELAY','ARR_DELAY']].groupby(['ORIGIN', 'DEST']).mean()
    org_dest = org_dest.reset_index()

    org_dest_map = pd.merge(org_dest, df_airport, left_on = 'ORIGIN', right_on = 'code')
    org_dest_map = pd.merge(org_dest_map, df_airport, left_on = 'DEST', right_on = 'code')

    org_dest_map = org_dest_map[(org_dest_map.DEP_DELAY>delay_threshold2) &
                               (org_dest_map.ARR_DELAY>delay_threshold3)]

    lax_cord = (33.942809, -118.404706)
    map = folium.Map(location=lax_cord, zoom_start=3,tiles='Mapbox Bright')

    for i in org_dest_map.iterrows():
        pointA = [i[1]['lat_x'],i[1]['long_x']]
        pointB = [i[1]['lat_y'],i[1]['long_y']]
        if i[1]['ARR_DELAY']>15 or i[1]['DEP_DELAY']>15:
            set_color = '#ff4c4c'
        else:
            set_color = '#4c4cff'
        folium.PolyLine([pointA,pointB], color=set_color, weight=0.2, opacity=0.7).add_to(map)

    display(map)


w2 = widgets.IntSlider(
    value=10,
    min=0,
    max=50,
    step=5,
    description='Departure Delay (min):',
    disabled=False,
    continuous_update=False,
    orientation='horizontal',
    readout=True,
    readout_format='i',
    slider_color='white',
)

display (w2)

w3 = widgets.IntSlider(
    value=10,
    min=0,
    max=50,
    step=5,
    description='Arrival Delay (min):',
    disabled=False,
    continuous_update=False,
    orientation='horizontal',
    readout=True,
    readout_format='i',
    slider_color='white'
)
display (w3)
interact(RoutePlotting, delay_threshold2=w2, delay_threshold3=w3) 
Out[9]:
<function __main__.RoutePlotting>

As the figure shows, all the flights either originate from California or have a destination in California. The figure also reveals the main hubs of domestic and international flights (higher flight density at hubs such as New York, Chicago, and Miami). We can also see that flights to the main hubs in the US, such as Florida, New York, and Boston, have a higher chance of being delayed.

3.4. Distribution of The Flights by Carriers

Next, we look at delays linked to carriers, for all airports in California. Twelve airlines operate in California: 'DL', 'B6', 'AA', 'AS', 'F9', 'VX', 'WN', 'UA', 'OO', 'HA', 'NK', 'EV'. The scope of their operations is illustrated below:

In [45]:
plt.figure(figsize=(12,6))   
num_ca_df = df[['UNIQUE_CARRIER']].groupby('UNIQUE_CARRIER').size()
Carriers = num_ca_df.index  #groupby output is sorted; tick labels must match it

bar_width = 0.4
plt.bar(np.arange(1,13)-0.5*bar_width,num_ca_df, width = bar_width, alpha=0.5, color= 'g')

plt.xticks(np.arange(1,13), Carriers)
plt.xlabel('Carriers')
plt.ylabel('Number of Flights')
plt.title('2016 Number of Flights by Carrier')
Out[45]:
<matplotlib.text.Text at 0x7ff145c3b190>
In [33]:
plt.figure(figsize=(12,6))   
dep_avg_ca_df = df[['UNIQUE_CARRIER','DEP_DELAY']].groupby('UNIQUE_CARRIER').mean()
arr_avg_ca_df = df[['UNIQUE_CARRIER','ARR_DELAY']].groupby('UNIQUE_CARRIER').mean()
Carriers = dep_avg_ca_df.index  #groupby output is sorted; tick labels must match it

bar_width = 0.4
plt.bar(np.arange(1,13)-bar_width,dep_avg_ca_df.DEP_DELAY, width = bar_width, alpha=0.5, color= 'b', label='Departure')
plt.bar(np.arange(1,13),arr_avg_ca_df.ARR_DELAY, width = bar_width, alpha=0.3, color= 'r', label='Arrival')

plt.legend(loc='upper left')
plt.xticks(np.arange(1,13), Carriers)
plt.xlabel('Carriers')
plt.ylabel('Average Delay in Minutes')
plt.title('2016 Departure/Arrival Average Delay Time by Carrier')
Out[33]:
<matplotlib.text.Text at 0x7fa5ce31c110>
In [17]:
fig = plt.figure(figsize=(12,6))

ax1 = fig.add_subplot(221)
dep_avg_lax_df = df_lax[['UNIQUE_CARRIER']].groupby('UNIQUE_CARRIER').size()

lax_carriers = dep_avg_lax_df.index  #match the sorted groupby order
lax_car = len(lax_carriers)+1

bar_width = 0.4
ax1.bar(np.arange(1,lax_car)-0.5*bar_width,dep_avg_lax_df, width = bar_width, alpha=0.5, color= 'g')
ax1.set_title('LAX')
ax1.set_xticks(np.arange(1,lax_car))
ax1.set_xticklabels(lax_carriers, rotation='vertical')

ax2 = fig.add_subplot(222)
dep_avg_sfo_df = df_sfo[['UNIQUE_CARRIER']].groupby('UNIQUE_CARRIER').size()

sfo_carriers = dep_avg_sfo_df.index  #match the sorted groupby order
sfo_car = len(sfo_carriers)+1

bar_width = 0.4
ax2.bar(np.arange(1,sfo_car)-0.5*bar_width,dep_avg_sfo_df, width = bar_width, alpha=0.5, color= 'g')
ax2.set_title('SFO')
ax2.set_xticks(np.arange(1,sfo_car))
ax2.set_xticklabels(sfo_carriers, rotation='vertical')


ax3 = fig.add_subplot(223)
dep_avg_san_df = df_san[['UNIQUE_CARRIER']].groupby('UNIQUE_CARRIER').size()

san_carriers = dep_avg_san_df.index  #match the sorted groupby order
san_car = len(san_carriers)+1

bar_width = 0.4
ax3.bar(np.arange(1,san_car)-0.5*bar_width,dep_avg_san_df, width = bar_width, alpha=0.5, color= 'g')
ax3.set_title('SAN')
ax3.set_xticks(np.arange(1,san_car))
ax3.set_xticklabels(san_carriers, rotation='vertical')

ax4 = fig.add_subplot(224)
dep_avg_oak_df = df_oak[['UNIQUE_CARRIER']].groupby('UNIQUE_CARRIER').size()

oak_carriers = df_oak.UNIQUE_CARRIER.unique()
oak_car = len(oak_carriers)+1

bar_width = 0.4
ax4.bar(np.arange(1,oak_car)-0.5*bar_width,dep_avg_oak_df, width = bar_width, alpha=0.5, color= 'g')
ax4.set_title('OAK')
ax4.set_xticks(np.arange(1,oak_car))
ax4.set_xticklabels(oak_carriers, rotation='vertical')

fig.tight_layout() 

In the first figure, the vertical axis shows the number of flights per carrier. In the second, it shows the average departure/arrival delays by carrier, and in the third the early flights are removed. Across the 12 unique carriers, average flight delays vary considerably. However, this analysis is affected by the number of flights per carrier. Virgin America (VX), Southwest Airlines (WN), United Airlines (UA), and Spirit Airlines (NK) had fewer flights in 2016 than the other eight airlines, which may explain why WN has a low average delay. Another interesting observation is ExpressJet (EV), which has the highest number of flights but a relatively low mean delay. Next, we look into the distribution of departure/arrival delays at four major airports in California (Los Angeles, San Francisco, San Diego, and Oakland):

In [24]:
fig = plt.figure(figsize=(12,6))

ax1 = fig.add_subplot(221)
dep_avg_lax_df = df_lax[['UNIQUE_CARRIER','DEP_DELAY']].groupby('UNIQUE_CARRIER').mean()
arr_avg_lax_df = df_lax[['UNIQUE_CARRIER','ARR_DELAY']].groupby('UNIQUE_CARRIER').mean()

lax_carriers = df_lax.UNIQUE_CARRIER.unique()
lax_car = len(lax_carriers)+1

bar_width = 0.4
ax1.bar(np.arange(1,lax_car)-bar_width,dep_avg_lax_df.DEP_DELAY, width = bar_width, alpha=0.5, color= 'b', label='Departure')
ax1.bar(np.arange(1,lax_car),arr_avg_lax_df.ARR_DELAY, width = bar_width, alpha=0.3, color= 'r', label='Arrival')
ax1.set_title('LAX')
ax1.legend(loc='upper left',prop={'size':6})
ax1.set_xticks(np.arange(1,lax_car))
ax1.set_xticklabels(lax_carriers, rotation='vertical')

ax2 = fig.add_subplot(222)
dep_avg_sfo_df = df_sfo[['UNIQUE_CARRIER','DEP_DELAY']].groupby('UNIQUE_CARRIER').mean()
arr_avg_sfo_df = df_sfo[['UNIQUE_CARRIER','ARR_DELAY']].groupby('UNIQUE_CARRIER').mean()


sfo_carriers = df_sfo.UNIQUE_CARRIER.unique()
sfo_car = len(sfo_carriers)+1

ax2.bar(np.arange(1,sfo_car)-bar_width,dep_avg_sfo_df.DEP_DELAY, width = bar_width, alpha=0.5, color= 'b', label='Departure')
ax2.bar(np.arange(1,sfo_car),arr_avg_sfo_df.ARR_DELAY, width = bar_width, alpha=0.3, color= 'r', label='Arrival')
ax2.set_title('SFO')
ax2.set_xticks(np.arange(1,sfo_car))
ax2.set_xticklabels(sfo_carriers, rotation='vertical')


ax3 = fig.add_subplot(223)
dep_avg_san_df = df_san[['UNIQUE_CARRIER','DEP_DELAY']].groupby('UNIQUE_CARRIER').mean()
arr_avg_san_df = df_san[['UNIQUE_CARRIER','ARR_DELAY']].groupby('UNIQUE_CARRIER').mean()

san_carriers = df_san.UNIQUE_CARRIER.unique()
san_car = len(san_carriers)+1

ax3.bar(np.arange(1,san_car)-bar_width,dep_avg_san_df.DEP_DELAY, width = bar_width, alpha=0.5, color= 'b', label='Departure')
ax3.bar(np.arange(1,san_car),arr_avg_san_df.ARR_DELAY, width = bar_width, alpha=0.3, color= 'r', label='Arrival')
ax3.set_title('SAN')
ax3.set_xticks(np.arange(1,san_car))
ax3.set_xticklabels(san_carriers, rotation='vertical')

ax4 = fig.add_subplot(224)
dep_avg_oak_df = df_oak[['UNIQUE_CARRIER','DEP_DELAY']].groupby('UNIQUE_CARRIER').mean()
arr_avg_oak_df = df_oak[['UNIQUE_CARRIER','ARR_DELAY']].groupby('UNIQUE_CARRIER').mean()

oak_carriers = df_oak.UNIQUE_CARRIER.unique()
oak_car = len(oak_carriers)+1

ax4.bar(np.arange(1,oak_car)-bar_width,dep_avg_oak_df.DEP_DELAY, width = bar_width, alpha=0.5, color= 'b', label='Departure')
ax4.bar(np.arange(1,oak_car),arr_avg_oak_df.ARR_DELAY, width = bar_width, alpha=0.3, color= 'r', label='Arrival')
ax4.set_title('OAK')
ax4.set_xticks(np.arange(1,oak_car))
ax4.set_xticklabels(oak_carriers, rotation='vertical')

fig.tight_layout() 

The first figure illustrates the distribution of flights across these four airports. In the second and third (early flights removed) figures, we can clearly see the effect of the origin/destination airport on the delays. Airports with fewer flights (SAN and OAK) have smaller average delays than the larger airports, which can be attributed to the traffic at the airport and the scope of its operations. Busier airports (LAX and SFO have more flights) typically have higher average delays. Small and large airports also have higher average delays compared to medium-sized airports. We can also see that the mainstream airlines (like Delta) have lower arrival delays on average than smaller airlines.

4. Database and Data Preprocessing

4.1. K-means Clustering for Airport Size

Here we introduce a heuristic measure of airport size as a new feature in the dataset. To do that, we compute the total number of inbound/outbound flights for each airport and cluster the airports into three categories (small, medium, and large) using the K-means algorithm.

In [2]:
from sklearn.cluster import KMeans
df_org_size = df[['ORIGIN']].groupby('ORIGIN').size()
airport_names =list(df_org_size.index)
df_org_size = np.reshape(df_org_size, (len(df_org_size), 1)) 
kmeans = KMeans(n_clusters=3, random_state=0).fit(df_org_size)
air_org_cluster = pd.concat([pd.DataFrame(airport_names),pd.DataFrame(kmeans.labels_)],axis=1)
air_org_cluster.columns = ['Airport', 'Size']

df_dest_size = df[['DEST']].groupby('DEST').size()
airport_names =list(df_dest_size.index)
df_dest_size = np.reshape(df_dest_size, (len(df_dest_size), 1)) 
kmeans = KMeans(n_clusters=3, random_state=0).fit(df_dest_size)
air_dest_cluster = pd.concat([pd.DataFrame(airport_names),pd.DataFrame(kmeans.labels_)],axis=1)
air_dest_cluster.columns = ['Airport', 'Size']
In [3]:
df = pd.merge(df, air_org_cluster, left_on = 'ORIGIN', right_on = 'Airport')
df.drop(['Airport'], axis=1, inplace=True)
df = df.rename(columns={'Size': 'org_size'})

df = pd.merge(df, air_dest_cluster, left_on = 'DEST', right_on = 'Airport')
df.drop(['Airport'], axis=1, inplace=True)
df = df.rename(columns={'Size': 'dest_size'})

4.2. California Database 2016 Transformations

Since neural networks are sensitive to the scale of their inputs, we first need to scale the departure delay times. To do that, we used the Box-Cox method [7] to find an appropriate transformation for the heavily skewed distribution of the departure delay. After running the Box-Cox procedure on the flight data, we decided to go with a cube root transformation; the cube root is appropriate because we have both negative and positive delays. We then center the values by subtracting the mean and finally apply the min-max transformation y_new = (y - y_min) / (y_max - y_min), so our new labels are scaled between 0 and 1. This transformation is necessary since our predictions are outputs of a sigmoid function.
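As a minimal sketch of this chain of transformations (using a made-up delay vector, not the actual dataset):

```python
import numpy as np

# Hypothetical departure delays in minutes (negative = early departure)
delays = np.array([-5.0, -2.0, 0.0, 12.0, 45.0, 180.0])

y = np.cbrt(delays)                      # cube root handles negative and positive delays
y = y - np.mean(y)                       # center on zero
y = (y - y.min()) / (y.max() - y.min())  # min-max scale into [0, 1]

print(y.min(), y.max())  # 0.0 1.0
```

The scaled labels land in [0, 1], matching the range of the sigmoid output node used later.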

The next transformation is on the data features. Since the selected columns are categorical in nature, we consider two encodings for each column: factorized and one-hot [7]. In factorization [8], the distinct observations in each column are coded with integer factor levels. In one-hot encoding, we create a binary indicator column for each factor level of the original dataset. One-hot encoding tends to give more accurate predictions; however, the size of the dataframe increases drastically.
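The difference between the two encodings can be seen on a toy carrier column (a sketch; the actual code below applies the same pandas calls to the full feature frame):

```python
import pandas as pd

s = pd.Series(['AA', 'DL', 'AA', 'WN'])

# Factorized: a single integer column, levels coded 0, 1, 2, ...
codes = pd.factorize(s)[0]
print(codes)                   # [0 1 0 2]

# One-hot: one binary indicator column per factor level
dummies = pd.get_dummies(s)
print(list(dummies.columns))   # ['AA', 'DL', 'WN']
```

With hundreds of origin/destination airports, the one-hot frame grows to hundreds of columns (629 in our case), which is the size explosion mentioned above.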

Finally, we need to divide the data into train and test subsets. To do that, we first randomly shuffle the rows and split the dataset into 70% training and 30% test subsets. Since the resulting dataframes are huge, we store them in the Hierarchical Data Format (HDF5) [9]; otherwise, the data fills the memory quickly and the system crashes. With HDF5 we can load chunks of the data into memory whenever we need them.

In [4]:
##Calif - one-hot

from sklearn.model_selection import train_test_split
import tables
import numpy as np
import h5py

dep_df = df[['MONTH','DAY_OF_WEEK','UNIQUE_CARRIER','ORIGIN','DEST','CRS_DEP_HR','CRS_ARR_HR','DISTANCE_GROUP','DEP_DELAY','DEP_DEL15','org_size','dest_size']]

dep_train, dep_test = train_test_split(dep_df, test_size = 0.3)

dep_train_x = dep_train[['MONTH','DAY_OF_WEEK','UNIQUE_CARRIER','ORIGIN','DEST','CRS_DEP_HR','CRS_ARR_HR','DISTANCE_GROUP','org_size','dest_size']]
dep_train_cy = dep_train['DEP_DEL15']
dep_train_y = dep_train['DEP_DELAY']
dep_train_y = np.cbrt(dep_train_y)
dep_train_y = dep_train_y - np.mean(dep_train_y)
dep_train_y = (dep_train_y -np.min(dep_train_y,axis=0))/(np.max(dep_train_y,axis=0)-np.min(dep_train_y,axis=0))

dep_test_x = dep_test[['MONTH','DAY_OF_WEEK','UNIQUE_CARRIER','ORIGIN','DEST','CRS_DEP_HR','CRS_ARR_HR','DISTANCE_GROUP','org_size','dest_size']]
dep_test_cy = dep_test['DEP_DEL15']
dep_test_y = dep_test['DEP_DELAY']
# Apply the same cube-root, centering, and min-max scaling used for the training labels
dep_test_y = np.cbrt(dep_test_y)
dep_test_y = dep_test_y - np.mean(dep_test_y)
dep_test_y = (dep_test_y -np.min(dep_test_y,axis=0))/(np.max(dep_test_y,axis=0)-np.min(dep_test_y,axis=0))

print dep_train_x.shape
print dep_train_y.shape

del df, dep_df, dep_train, dep_test

train_objs_num = len(dep_train_x)
dataset = pd.concat(objs=[dep_train_x, dep_test_x], axis=0)
dataset_preprocessed = pd.get_dummies(dataset[['MONTH','DAY_OF_WEEK','UNIQUE_CARRIER','ORIGIN','DEST','CRS_DEP_HR','CRS_ARR_HR','DISTANCE_GROUP','org_size','dest_size']])
dep_train_x = dataset_preprocessed[:train_objs_num]
dep_test_x = dataset_preprocessed[train_objs_num:]

dep_train_x.drop(['MONTH','DAY_OF_WEEK','CRS_DEP_HR','CRS_ARR_HR','DISTANCE_GROUP','org_size','dest_size'], axis=1, inplace=True)

col_train = dep_train_x.columns

dep_test_x.drop(['MONTH','DAY_OF_WEEK','CRS_DEP_HR','CRS_ARR_HR','DISTANCE_GROUP','org_size','dest_size'], axis=1, inplace=True)

col_test = dep_test_x.columns

print dep_train_x.shape
print dep_train_y.shape

h5f = h5py.File('/mnt/c/Users/nsalehi/Desktop/airline_ca/data.h5', 'w')
h5f.create_dataset('train_X', data=dep_train_x, chunks=True)
h5f.create_dataset('train_y', data=dep_train_y, chunks=True)
h5f.create_dataset('train_cy',data=dep_train_cy, chunks=True)

h5f.create_dataset('test_X', data=dep_test_x, chunks=True)
h5f.create_dataset('test_y', data=dep_test_y, chunks=True)
h5f.create_dataset('test_cy', data=dep_test_cy, chunks=True)
h5f.close()

print 'saved!'
(2172674, 10)
(2172674,)
/home/nsalehi/anaconda2/lib/python2.7/site-packages/ipykernel/__main__.py:35: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
/home/nsalehi/anaconda2/lib/python2.7/site-packages/ipykernel/__main__.py:39: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
(2172674, 629)
(2172674,)
saved!
In [ ]:
##Calif - factorized


from sklearn.model_selection import train_test_split
import tables
import numpy as np
import h5py

dep_df = df[['MONTH','DAY_OF_WEEK','UNIQUE_CARRIER','ORIGIN','DEST','CRS_DEP_HR','CRS_ARR_HR','DISTANCE_GROUP','DEP_DELAY','DEP_DEL15','org_size','dest_size']]

dep_train, dep_test = train_test_split(dep_df, test_size = 0.3)

dep_train_x = dep_train[['MONTH','DAY_OF_WEEK','UNIQUE_CARRIER','ORIGIN','DEST','CRS_DEP_HR','CRS_ARR_HR','DISTANCE_GROUP','org_size','dest_size']]
dep_train_cy = dep_train['DEP_DEL15']
dep_train_y = dep_train['DEP_DELAY']
# Apply the same cube-root, centering, and min-max scaling used in the one-hot pipeline
dep_train_y = np.cbrt(dep_train_y)
dep_train_y = dep_train_y - np.mean(dep_train_y)
dep_train_y = (dep_train_y -np.min(dep_train_y,axis=0))/(np.max(dep_train_y,axis=0)-np.min(dep_train_y,axis=0))

dep_test_x = dep_test[['MONTH','DAY_OF_WEEK','UNIQUE_CARRIER','ORIGIN','DEST','CRS_DEP_HR','CRS_ARR_HR','DISTANCE_GROUP','org_size','dest_size']]
dep_test_cy = dep_test['DEP_DEL15']
dep_test_y = dep_test['DEP_DELAY']
dep_test_y = np.cbrt(dep_test_y)
dep_test_y = dep_test_y - np.mean(dep_test_y)
dep_test_y = (dep_test_y -np.min(dep_test_y,axis=0))/(np.max(dep_test_y,axis=0)-np.min(dep_test_y,axis=0))


train_objs_num = len(dep_train_x)
dataset = pd.concat(objs=[dep_train_x, dep_test_x], axis=0)
dataset_preprocessed = dataset[['MONTH','DAY_OF_WEEK','UNIQUE_CARRIER','ORIGIN','DEST','CRS_DEP_HR','CRS_ARR_HR','DISTANCE_GROUP','org_size','dest_size']].apply(lambda x: pd.factorize(x)[0])

dep_train_x = dataset_preprocessed[:train_objs_num]
dep_test_x = dataset_preprocessed[train_objs_num:]


h5f = h5py.File('/mnt/c/Users/nsalehi/Desktop/airline_ca/data_fac.h5', 'w')
h5f.create_dataset('train_X', data=dep_train_x)
h5f.create_dataset('train_y', data=dep_train_y)
h5f.create_dataset('train_cy', data=dep_train_cy)

h5f.create_dataset('test_X', data=dep_test_x)
h5f.create_dataset('test_y', data=dep_test_y)
h5f.create_dataset('test_cy', data=dep_test_cy)

h5f.close()

print 'saved!'
In [50]:
## 2016 LAX-ORIGIN Database - One-hot Encoded

from sklearn.model_selection import train_test_split
import tables
import numpy as np
import h5py

df_lax_org = df[df['ORIGIN'] == 'LAX']

dep_df = df_lax_org[['MONTH','DAY_OF_WEEK','UNIQUE_CARRIER','DEST','CRS_DEP_HR','CRS_ARR_HR','DISTANCE_GROUP','DEP_DELAY','DEP_DEL15','dest_size']]

dep_train, dep_test = train_test_split(dep_df, test_size = 0.3)

dep_train_x = dep_train[['MONTH','DAY_OF_WEEK','UNIQUE_CARRIER','DEST','CRS_DEP_HR','CRS_ARR_HR','DISTANCE_GROUP','dest_size']]
dep_train_cy = dep_train['DEP_DEL15']
dep_train_y = dep_train['DEP_DELAY']
dep_train_y = np.cbrt(dep_train_y)
dep_train_y = dep_train_y - np.mean(dep_train_y)
dep_train_y = (dep_train_y -np.min(dep_train_y,axis=0))/(np.max(dep_train_y,axis=0)-np.min(dep_train_y,axis=0))

dep_test_x = dep_test[['MONTH','DAY_OF_WEEK','UNIQUE_CARRIER', 'DEST','CRS_DEP_HR','CRS_ARR_HR','DISTANCE_GROUP','dest_size']]
dep_test_cy = dep_test['DEP_DEL15']
dep_test_y = dep_test['DEP_DELAY']
dep_test_y = np.cbrt(dep_test_y)
dep_test_y = dep_test_y - np.mean(dep_test_y)
dep_test_y = (dep_test_y -np.min(dep_test_y,axis=0))/(np.max(dep_test_y,axis=0)-np.min(dep_test_y,axis=0))


print dep_train_y.shape


train_objs_num = len(dep_train_x)
dataset = pd.concat(objs=[dep_train_x, dep_test_x], axis=0)
dataset_preprocessed = pd.get_dummies(dataset[['MONTH','DAY_OF_WEEK','UNIQUE_CARRIER','DEST','CRS_DEP_HR','CRS_ARR_HR','DISTANCE_GROUP','dest_size']])

dep_train_x = dataset_preprocessed[:train_objs_num]
dep_test_x = dataset_preprocessed[train_objs_num:]


dep_train_x.drop(['MONTH','DAY_OF_WEEK','CRS_DEP_HR','CRS_ARR_HR','DISTANCE_GROUP','dest_size'], axis=1, inplace=True)

col_train = dep_train_x.columns

dep_test_x.drop(['MONTH','DAY_OF_WEEK','CRS_DEP_HR','CRS_ARR_HR','DISTANCE_GROUP','dest_size'], axis=1, inplace=True)

col_test = dep_test_x.columns

h5f = h5py.File('/mnt/c/Users/nsalehi/Desktop/airline_ca/data_lax.h5', 'w')
h5f.create_dataset('train_X', data=dep_train_x)
h5f.create_dataset('train_y', data=dep_train_y)
h5f.create_dataset('train_cy', data=dep_train_cy)

h5f.create_dataset('test_X', data=dep_test_x)
h5f.create_dataset('test_y', data=dep_test_y)
h5f.create_dataset('test_cy', data=dep_test_cy)
h5f.close()

print 'saved!'
(133174,)
saved!
/home/nsalehi/anaconda2/envs/tensorflow/lib/python2.7/site-packages/ipykernel/__main__.py:38: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
/home/nsalehi/anaconda2/envs/tensorflow/lib/python2.7/site-packages/ipykernel/__main__.py:42: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
In [11]:
## 2016 LAX-ORIGIN Database - Factorized
from sklearn.model_selection import train_test_split
import tables
import numpy as np
import h5py

df_lax_org = df[df['ORIGIN'] == 'LAX']

dep_df = df_lax_org[['MONTH','DAY_OF_WEEK','UNIQUE_CARRIER','DEST','CRS_DEP_HR','CRS_ARR_HR','DISTANCE_GROUP','DEP_DELAY','DEP_DEL15','dest_size']]
dep_train, dep_test = train_test_split(dep_df, test_size = 0.3)

dep_train_x = dep_train[['MONTH','DAY_OF_WEEK','UNIQUE_CARRIER','DEST','CRS_DEP_HR','CRS_ARR_HR','DISTANCE_GROUP','dest_size']]
dep_train_cy = dep_train['DEP_DEL15']
dep_train_y = dep_train['DEP_DELAY']
dep_train_y = (dep_train_y -np.min(dep_train_y,axis=0))/(np.max(dep_train_y,axis=0)-np.min(dep_train_y,axis=0))

dep_test_x = dep_test[['MONTH','DAY_OF_WEEK','UNIQUE_CARRIER', 'DEST','CRS_DEP_HR','CRS_ARR_HR','DISTANCE_GROUP','dest_size']]
dep_test_cy = dep_test['DEP_DEL15']
dep_test_y = dep_test['DEP_DELAY']
dep_test_y = (dep_test_y -np.min(dep_test_y,axis=0))/(np.max(dep_test_y,axis=0)-np.min(dep_test_y,axis=0))

print dep_train_y.shape


train_objs_num = len(dep_train_x)
dataset = pd.concat(objs=[dep_train_x, dep_test_x], axis=0)
dataset_preprocessed = dataset[['MONTH','DAY_OF_WEEK','UNIQUE_CARRIER','DEST','CRS_DEP_HR','CRS_ARR_HR','DISTANCE_GROUP','dest_size']].apply(lambda x: pd.factorize(x)[0])

dep_train_x = dataset_preprocessed[:train_objs_num]
dep_test_x = dataset_preprocessed[train_objs_num:]


h5f = h5py.File('/mnt/c/Users/nsalehi/Desktop/airline_ca/data_fac_lax.h5', 'w')
h5f.create_dataset('train_X', data=dep_train_x)
h5f.create_dataset('train_y', data=dep_train_y)
h5f.create_dataset('train_cy', data=dep_train_cy)

h5f.create_dataset('test_X', data=dep_test_x)
h5f.create_dataset('test_y', data=dep_test_y)
h5f.create_dataset('test_cy', data=dep_test_cy)

h5f.close()

print 'saved!'
(133174,)
saved!
In [1]:
import pandas as pd
import csv
import numpy as np
import h5py

h5f = h5py.File('/mnt/c/Users/nsalehi/Desktop/airline_ca/data.h5', 'r')

train_X = h5f['train_X']
train_y = h5f['train_y']
train_y = np.reshape(train_y, (len(train_y), 1)) 


test_X = h5f['test_X']
test_y = np.array(h5f['test_y'])
test_y = np.reshape(test_y, (len(test_y), 1)) 

print 'loaded!'
loaded!

5. Deep Neural Networks (DNN)

In this section we use Deep Neural Networks (DNNs) for predicting the departure delays in the flight dataset. Here, we perform both regression, to predict the amount of delay, and classification, to predict whether a flight will be delayed.

5.1. Network Structure for the Regression

For this study we consider a fully connected structure with one or two hidden layers, depending on the experiment. The number of hidden nodes also differs across experiments; given time constraints, we tried to find the optimal number of nodes for each one. The activation function for each node is the sigmoid function, and the weights and biases are initialized randomly from a normal distribution. The output layer has a single node and returns the scaled predicted delay. We used TensorFlow [10] to model the DNN in Python, with an MSE minimization objective and a gradient descent optimizer to update the weights in each epoch. The learning rates vary by experiment in the [0.01, 1.00] range; we used different schedules (constant, inverse-time decay, and exponential decay) to change the learning rate across epochs, and different numbers of epochs for different experiments. Loosely speaking, we tried to optimize all of these variables for each experiment.
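A NumPy sketch of the forward pass for one such network (one hidden layer, sigmoid activations, normally initialized weights; the layer sizes here are illustrative, not the tuned values from our experiments):

```python
import numpy as np

rng = np.random.RandomState(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

n_features, n_hidden = 10, 64
# Weights and biases drawn from a normal distribution
W1, b1 = rng.normal(0, 0.1, (n_features, n_hidden)), rng.normal(0, 0.1, n_hidden)
W2, b2 = rng.normal(0, 0.1, (n_hidden, 1)), rng.normal(0, 0.1, 1)

X = rng.rand(32, n_features)            # a mini batch of encoded flights
y = rng.rand(32, 1)                     # scaled delays in [0, 1]

hidden = sigmoid(X.dot(W1) + b1)        # hidden layer
y_hat = sigmoid(hidden.dot(W2) + b2)    # single sigmoid output node
mse = np.mean((y_hat - y) ** 2)         # the objective minimized by gradient descent
```

In the actual experiments this graph is built in TensorFlow, which also provides the gradient updates.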

5.2. Network Structure for the Classification

For classification, we one-hot encoded the labels; therefore, the argmax of the predicted outputs gives the actual label. The structure of the network is similar to the regression network in the previous section, except that here we have two nodes in the output layer. The objective of this network is to minimize the softmax cross entropy with logits between the logits and the labels. This is a measure of the probability error in discrete classification tasks in which the classes are mutually exclusive, and minimizing it tends to maximize the accuracy.
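Written out directly, the objective is (a NumPy sketch of softmax cross entropy with one-hot labels; TensorFlow's softmax_cross_entropy_with_logits computes the same quantity):

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max(axis=1, keepdims=True)   # shift for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def softmax_cross_entropy(labels, logits):
    # labels: one-hot (n, 2); logits: raw outputs of the two-node output layer
    return -np.sum(labels * np.log(softmax(logits)), axis=1)

logits = np.array([[2.0, -1.0], [0.5, 1.5]])
labels = np.array([[1.0, 0.0], [0.0, 1.0]])   # [not delayed, delayed]
loss = softmax_cross_entropy(labels, logits)
pred = logits.argmax(axis=1)                  # argmax of the outputs gives the label
```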

5.3. Handling the Imbalanced Dataset

As mentioned in the introduction, only 18% of the labels correspond to delayed flights. This creates a problem commonly known as biased prediction toward the majority class. Conventional algorithms are often biased towards the majority class because their loss functions optimize quantities such as the error rate without taking the class distribution into consideration. In the worst case, minority examples are treated as outliers of the majority class and ignored, and the learning algorithm simply generates a trivial classifier that classifies every example as the majority class.

Indeed, if the goal is to maximize simple accuracy (or, equivalently, minimize the error rate), assigning every example to the majority class is an acceptable solution, usually with very high accuracy. But if we assume that the rare-class examples are much more important to classify correctly, then we have to be more careful and more sophisticated in attacking the problem.

There are several ways to deal with this problem. First, we consider over-/under-sampling using the SMOTE algorithm. We used SMOTE in some of the experiments; however, in general it is a very expensive approach in terms of both time and memory. The second approach we adopted in this study combines class weighting with accounting for recall and precision in the objective. To do that, we use 1/ratio as the weight for the minority class (delayed labels), where the ratio is 18%, and we set the weight of the other class to one. We also changed the objective function to weighted cross entropy with logits, which allows us to trade off recall against precision.
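The weighted objective corresponds to the following (a NumPy sketch of the formula behind TensorFlow's weighted_cross_entropy_with_logits, with the 1/ratio factor as the positive-class weight):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def weighted_cross_entropy(labels, logits, pos_weight):
    # Errors on delayed examples (labels == 1) are penalized pos_weight times harder
    p = sigmoid(logits)
    return -(pos_weight * labels * np.log(p) + (1 - labels) * np.log(1 - p))

ratio = 0.18                     # fraction of delayed flights
pos_weight = 1.0 / ratio         # weight of the minority (delayed) class

labels = np.array([1.0, 0.0])
logits = np.array([-1.0, -1.0])  # the same confident "not delayed" score for both
loss = weighted_cross_entropy(labels, logits, pos_weight)
# Missing a delayed flight now costs far more than the same score on an on-time one
```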

5.4. Computational Results for DNNs

In our first attempt to create a model that can describe the flight dataset, we tried different variations of the linear model, all of which failed to achieve an adjusted R-squared above 0.7%. This indicates that the relationships among the selected predictors cannot be expressed by linear models, so we omit the results for the linear models here. By running a step-wise selection algorithm, we were able to identify month, departure time, and arrival time as the most important predictors. The logistic regression models for classification tend to ignore the minority class (delayed flights); therefore, we need to apply SMOTE before using them to make sure we are not classifying all the examples as not delayed. The confusion matrix of one of our runs for the LAX airport is presented in the figure below:

In [14]:
from IPython.display import Image
Image("/mnt/c/Users/nsalehi/Desktop/111.png")
Out[14]:

As we can see in the figure above, after applying SMOTE to this dataset we can predict flights with an accuracy of 55%, but this accuracy comes with many misclassifications of not-delayed flights. The precision of this classifier is around 27% and the recall is around 57%. These results easily justify using DNNs as nonlinear regression models and nonlinear classifiers.
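For reference, the quoted rates can be read off a confusion matrix as follows (a sketch with hypothetical counts chosen so the rates land near the 27% precision and 57% recall above, not the actual numbers behind the figure):

```python
import numpy as np

# Rows = actual class, columns = predicted class: [not delayed, delayed]
cm = np.array([[520.0, 310.0],   # hypothetical counts
               [ 90.0, 120.0]])

tn, fp = cm[0]
fn, tp = cm[1]

accuracy = (tp + tn) / cm.sum()
precision = tp / (tp + fp)   # of the flights predicted delayed, the share truly delayed
recall = tp / (tp + fn)      # of the truly delayed flights, the share we caught
print(round(precision, 2), round(recall, 2))  # 0.28 0.57
```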

For the classification problem, we used the weighted cross entropy objective instead of SMOTE (since SMOTE is very expensive in terms of both time and memory with 629 categorical predictors). The best accuracy for the prediction is 77%, which is exactly the fraction of not-delayed flights in the test dataset. As mentioned before, this accuracy can be obtained simply by predicting that no flight is delayed; however, such a classifier has very low precision and recall. The classifier obtained with the weighted cross entropy objective yields the following confusion matrix:

In [5]:
Image("/mnt/c/Users/nsalehi/Desktop/222.png")
Out[5]:

As we can see, the accuracy dropped from 77% to 53%; however, the precision and recall increased significantly to 54% and 43%, respectively (we can see this by comparing the color of the delayed-delayed cell across the different settings). For the regression model, we first start with one hidden layer and 2 nodes, then move to 64 nodes, and then consider a deeper model with two hidden layers. The comparison of the MSE of the models after 500 epochs is illustrated in the figure above.

Moreover, in this figure, the MSE converges faster as the number of nodes in the hidden layer increases, and the accuracies improve when an extra hidden layer is added. The regression models seem to predict delays close to zero in most cases, mostly underestimating the actual delays. However, MSE may not be the best way to evaluate the performance of our predictors, since a smooth, nearly constant prediction of the delays is not what we want. Although these predictions minimize the MSE, we are looking for predictions that better capture the delays of the delayed flights (not of the whole dataset). Therefore, the first step is to define an asymmetric MSE with different weights for overestimates and underestimates. Here we used a ratio similar to the one defined in the previous section (Handling the Imbalanced Dataset) as the weight for underestimates of the delayed flights, and a weight of one for overestimates. This way, the DNN tends to produce more variation in the predicted values. Here is a sample prediction for a deep network with two hidden layers, with [93, 64, 16, 1] nodes per layer and a sigmoid activation on each node:

In [6]:
Image("/mnt/c/Users/nsalehi/Desktop/333.png")
Out[6]:

As we can see in the top part of the figure, the model can capture some of the fluctuations (long delays) and no longer systematically underestimates the delays as before. The trade-off is a test MSE of 3.97%, which is considerably higher than before.
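The asymmetric MSE described above can be sketched as follows (a simplified NumPy variant that weights every underestimate, with the 1/ratio factor from the imbalanced-data section as the weight):

```python
import numpy as np

def asymmetric_mse(y_true, y_pred, under_weight):
    # Underestimates (y_pred < y_true, i.e. a missed delay) are weighted harder
    err = y_pred - y_true
    w = np.where(err < 0, under_weight, 1.0)
    return np.mean(w * err ** 2)

y_true = np.array([0.2, 0.8])   # the second flight has a long (scaled) delay
y_pred = np.array([0.2, 0.3])   # ... which the model underestimates
plain = np.mean((y_pred - y_true) ** 2)
skewed = asymmetric_mse(y_true, y_pred, under_weight=1.0 / 0.18)
# skewed > plain: the loss now punishes the underestimated delay much more
```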

Finally, we repeat the same experiments with the whole California database. The designs of the optimized networks differ, but we obtained very similar results with lower accuracies and higher MSEs. This is due to the fact that there are many smaller airports with unusually high departure delays, which reduce the accuracy of our predictions. It is worth mentioning that, since the database for the whole state of California is very large, we use Stochastic Gradient Descent (SGD) to update the weights. The MSE is not very sensitive to our choice of mini-batch size; the reported MSE is the aggregated error over the mini batches. The results of the regression DNN on the California dataset are also shown here.
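The mini-batch update loop can be sketched as follows (a NumPy illustration of SGD on a small synthetic linear model standing in for the full DNN; the batch size and learning rate are illustrative):

```python
import numpy as np

rng = np.random.RandomState(0)
X = rng.rand(1000, 5)                       # synthetic stand-in for the feature matrix
true_w = np.array([0.5, -0.2, 0.1, 0.0, 0.3])
y = X.dot(true_w)                           # noiseless targets for the sketch

w = np.zeros(5)
lr, batch_size = 0.1, 64
for epoch in range(50):
    order = rng.permutation(len(X))         # reshuffle rows every epoch
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]
        grad = 2 * X[idx].T.dot(X[idx].dot(w) - y[idx]) / len(idx)
        w -= lr * grad                      # one update per mini batch

mse = np.mean((X.dot(w) - y) ** 2)          # aggregated error after training
```

In the real runs, only the current HDF5 chunk needs to be resident in memory for each mini batch.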

6. Concluding Remarks

In this project we studied the delays in flight schedules; in particular, we focused on flights in the state of California, one of the main hubs in the United States for both domestic and international flights. We engineered our features based on our knowledge of the dataset (domain knowledge) and the various visualizations presented in the exploratory data analysis section. The results of our feature selection revealed that we need to focus on certain categorical predictors in our dataset (introduced at the end of the data analysis section). The categorical predictors require appropriate encoding and some preprocessing before they can be used for predictive modeling.

Next, we examined several supervised learning methods on the dataset, including linear regression, logistic regression, and deep neural networks. The results from these methods suggest that linear models are incapable of capturing the relationships between our data features and our outputs. The performance of the non-linear methods is superior to the linear models; however, the imbalanced dataset leads to some questionable predictions. We dealt with this problem by over-/under-sampling in some of our experiments and by balancing precision, accuracy, and recall in others. In conclusion, the accuracy of our predictions is acceptable considering that we did not have access to weather information, traffic, or the actual complex structure of the flight network. By adding these features, we would most probably obtain better predictions. For example, severe weather conditions caused a significant amount of delay according to the BTS; and with the tail numbers of the aircraft, we could examine whether they needed any repairs before departure, which can be a cause of departure delays. Passenger traffic at the airports, together with the number of arrivals and departures, is another helpful factor we did not have access to: more crowded airports, in terms of the number of flights, have higher chances of departure/arrival delays. All in all, predicting departure delays at all airports is a very complicated task that requires more sophisticated methodologies such as hierarchical modeling. Therefore, splitting the dataset into smaller subsets with more similar predictors seems to be a reasonable choice in our analysis.

In [10]:
from IPython.display import HTML

HTML('''<script>
code_show=true; 
function code_toggle() {
 if (code_show){
 $('div.input').hide();
 } else {
 $('div.input').show();
 }
 code_show = !code_show
} 
$( document ).ready(code_toggle);
</script>''')
Out[10]: