The Bureau of Transportation Statistics (BTS) reported a record 3.5 million commercial flights to or from California in 2016. According to BTS, a flight is considered delayed when it departs or arrives 15 or more minutes later than its scheduled time. As the figure below shows, around 82% of all flights are on time and about 18% are delayed. The US Department of Transportation (DOT) identifies weather, the National Aviation System (NAS), security, and late-arriving aircraft as the most important causes of delay. In the US, the Federal Aviation Administration [1] estimates that flight delays cost airlines $22 billion yearly. Time is money, and delayed flights are a frequent cause of frustration for both travelers and airlines [2]. Knowing the factors that lead to specific flight delays can help aviation authorities and airlines take the actions necessary to ensure smooth operations. A delayed flight also has economic (missed connections and cancellations), environmental (wasted fuel), and social (lost productivity and airport congestion) consequences. The objective of this study is to predict the delayed departure of a flight. Correctly predicting flight delays lets passengers prepare for the disruption of their journey and lets airlines proactively respond to the potential causes of the delay to mitigate its impact. This helps airlines save money and reduces waiting times for passengers at airports.
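The 15-minute rule above translates directly into the DEP_DEL15 indicator used later in this notebook; a minimal sketch on toy data (the delay values here are made up for illustration):

```python
import pandas as pd

# Toy sample of departure delays in minutes (negative = left early).
flights = pd.DataFrame({"DEP_DELAY": [-5, 0, 14, 15, 42]})

# BTS convention: a flight counts as "delayed" at 15+ minutes late.
flights["DEP_DEL15"] = (flights["DEP_DELAY"] >= 15).astype(int)

print(flights["DEP_DEL15"].tolist())  # [0, 0, 0, 1, 1]
```

Note that a 14-minute departure delay is still "on time" under this definition.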
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import tensorflow as tf
import pandas as pd
import csv
# Reading data: one on-time performance file per month of 2016
months = ['jan', 'feb', 'mar', 'apr', 'may', 'jun',
          'jul', 'aug', 'sep', 'oct', 'nov', 'dec']
path = '/mnt/c/Users/nsalehi/Desktop/airline_ca/{}16.csv'
df = pd.concat(pd.read_csv(path.format(m), error_bad_lines=False, sep=',')
               for m in months)
df.drop(['Unnamed: 35', 'YEAR'], axis=1, inplace=True)
print(df.shape)
#Removing rows related to cancelled or diverted flights
df = df.drop(df[df.CANCELLED != 0].index)
df = df.drop(df[df.DIVERTED != 0].index)
#Removing cancelled and diverted columns in df
df.drop(['CANCELLED', 'DIVERTED'], axis=1, inplace=True)
print(df.shape)
df.to_csv('/mnt/c/Users/nsalehi/Desktop/airline_ca/all_16.csv', sep=',')
#Create new variables: Hour of CRS Departure/Arrival Time
df['CRS_DEP_HR'] = df['CRS_DEP_TIME']//100
df['CRS_ARR_HR'] = df['CRS_ARR_TIME']//100
df['DEP_HR'] = df['DEP_TIME']//100
df['ARR_HR'] = df['ARR_TIME']//100
# Wrap hour 24 to hour 0 and fix MONTH==0 artifacts (use .loc to avoid chained assignment)
df.loc[df['DEP_HR'] == 24, 'DEP_HR'] = 0
df.loc[df['ARR_HR'] == 24, 'ARR_HR'] = 0
df.loc[df['MONTH'] == 0, 'MONTH'] = 12
print('done!')
import seaborn
plt.figure(figsize=(4, 6))
dep_ratio = df[['DEP_DEL15']].groupby('DEP_DEL15').size()
print('Delay ratio to total =', 100 * (1 - dep_ratio[0] / float(dep_ratio[0] + dep_ratio[1])), '%')
arr_ratio = df[['ARR_DEL15']].groupby('ARR_DEL15').size()
bar_width = 0.4
plt.bar(np.arange(0, 2) - 0.5 * bar_width, dep_ratio, width=bar_width, alpha=0.5, color='g')
x_labels = ['On-time', 'Delayed']
plt.xticks(np.arange(0, 2), x_labels)
plt.ylabel('Number of Flights')
plt.title('2016 Delay vs. On-time Flights')
In this study we use a dataset from the Bureau of Transportation Statistics (BTS) [3], known as the on-time performance data. This dataset contains the scheduled and actual departure and arrival times reported by U.S. air carriers. Additional data elements include departure and arrival delays, origin and destination airports, flight numbers, cancelled or diverted flights, taxi-out and taxi-in times, air time, and non-stop distance. Since the full dataset is very large, we focus only on 2016 flights to or from airports in California. First, we delete some unrelated variables. We then check the correlations among the variables and select the appropriate ones. For example, quarter and month are highly correlated and have similar relationships to the departure/arrival delays; between them, we keep only month because it is more detailed. After this process, our final dataset contains 31038207 observations and 7 predictors (month, day of week, departure time, unique carrier, origin airport, destination airport, and distance group) that could be used to predict delays; the categorical and continuous response variables correspond to departure delay and arrival delay. Based on their attributes, we divide the predictors into three categories: time, carrier, and location. We analyze the relationships between them and delays in the next section.
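As an illustration of the quarter/month redundancy noted above, here is a minimal sketch (toy data, not the BTS file) showing that MONTH and the QUARTER it determines are almost perfectly correlated, which is why keeping both adds little information:

```python
import pandas as pd

# Hypothetical slice: MONTH fully determines QUARTER.
sample = pd.DataFrame({"MONTH": list(range(1, 13))})
sample["QUARTER"] = (sample["MONTH"] - 1) // 3 + 1

# Pearson correlation between the two calendar features.
corr = sample["MONTH"].corr(sample["QUARTER"])
print(round(corr, 3))  # 0.972
```

With a near-unit correlation, dropping QUARTER and keeping the finer-grained MONTH is a reasonable choice.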
Variable | Description |
---|---|
Month | Month |
DayOfWeek | Day of Week |
DayOfMonth | Day of Month |
UniqueCarrier | Unique Carrier Code |
Origin | Origin Airport |
Dest | Destination Airport |
CRSDepTime | CRS Departure Time (local time: hhmm) |
DepTime | Actual Departure Time (local time: hhmm) |
DepDelay | Difference in minutes between scheduled and actual departure time |
DepDel15 | Departure Delay Indicator, 15 Minutes or More (1=Yes) |
CRSArrTime | CRS Arrival Time (local time: hhmm) |
ArrTime | Actual Arrival Time (local time: hhmm) |
ArrDelay | Difference in minutes between scheduled and actual arrival time |
ArrDel15 | Arrival Delay Indicator, 15 Minutes or More (1=Yes) |
Cancelled | Cancelled Flight Indicator (1=Yes) |
Diverted | Diverted Flight Indicator (1=Yes) |
CRSElapsedTime | CRS Elapsed Time of Flight, in Minutes |
Distance | Distance between airports (miles) |
DistanceGroup | Distance Intervals, every 250 Miles, for Flight Segment |
In this section we look at the flights dataset with respect to the different features introduced above.
As a first exploratory analysis, we consider the observed distribution of delay minutes over the entire dataset. The most effective way to do this is a histogram, looking at departure and arrival delays separately.
import matplotlib.pyplot as plt # module for plotting
%matplotlib inline
import seaborn
# Plot departure/arrival delay distributions
plt.figure(figsize=(12, 6))
plt.hist(df.DEP_DELAY.dropna(), bins=1000, density=True, alpha=0.5, color='b', label='Departure')
plt.hist(df.ARR_DELAY.dropna(), bins=1000, density=True, alpha=0.3, color='r', label='Arrival')
plt.legend(loc='upper right')
plt.xlim(-65,190)
plt.xlabel('Minutes')
plt.ylabel('Probability')
plt.title('2016 Departure/Arrival Delay Distribution')
As we can see in this figure, shorter delays have higher probability for both departure and arrival. Both distributions have a long, thin right tail, meaning that some flights are delayed for a very long time. The range of arrival delays is slightly wider than that of departure delays. In both cases, the mode of the distribution is below zero, meaning that most flights left their gates and arrived at their gates ahead of schedule. The x-axis of both plots is the delay time in minutes. A departure delay compares the actual departure time to the scheduled departure time; an arrival delay compares the actual arrival time to the scheduled departure time plus the estimated flight duration. Airlines may build buffer time into their estimates of time in the air; the difference between the departure and arrival delay distributions therefore indicates that some departure delays are recovered during the flight thanks to this embedded buffer.
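The in-flight recovery described above can be quantified per flight as departure delay minus arrival delay; a minimal sketch on toy numbers (the real notebook columns are DEP_DELAY and ARR_DELAY, the values here are invented):

```python
import pandas as pd

# Toy flights: departure vs. arrival delay in minutes.
toy = pd.DataFrame({
    "DEP_DELAY": [20, 5, 30, 0],
    "ARR_DELAY": [12, -3, 25, -8],
})

# Positive values mean the schedule buffer absorbed part of the departure delay.
toy["RECOVERED"] = toy["DEP_DELAY"] - toy["ARR_DELAY"]
print(toy["RECOVERED"].mean())  # 7.25 minutes recovered en route, on average
```

Applying the same difference to the full dataset would estimate the average buffer airlines embed in their block times.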
First, we explore the relationships between delays and several variables related to time.
As we can see in this figure, the number of flights is noticeably lower in the first six months of the year.
plt.figure(figsize=(12, 6))
dep_avg_month_df = df[['MONTH']].groupby('MONTH').size()
bar_width = 0.4
plt.bar(np.arange(1, 13) - 0.5 * bar_width, dep_avg_month_df, width=bar_width, alpha=0.5, color='g')
x_labels = ['January', 'February', 'March', 'April', 'May', 'June', 'July',
            'August', 'September', 'October', 'November', 'December']
plt.xticks(np.arange(1, 13), x_labels, rotation='vertical')
plt.xlabel('Month')
plt.ylabel('Number of Flights')
plt.title('2016 Number of Flights by Month')
plt.figure(figsize=(12, 6))
dep_avg_month_df = df[['MONTH','DEP_DELAY']].groupby('MONTH').mean()
arr_avg_month_df = df[['MONTH','ARR_DELAY']].groupby('MONTH').mean()
bar_width = 0.4
plt.bar(np.arange(1,13)-bar_width,dep_avg_month_df.DEP_DELAY, width = bar_width, alpha=0.5, color= 'b', label='Departure')
plt.bar(np.arange(1,13),arr_avg_month_df.ARR_DELAY, width = bar_width, alpha=0.3, color= 'r', label='Arrival')
x_labels = ['January', 'February', 'March', 'April', 'May', 'June', 'July',
            'August', 'September', 'October', 'November', 'December']
plt.xticks(np.arange(1, 13), x_labels, rotation='vertical')
plt.legend(loc='upper left')
plt.xlabel('Month')
plt.ylabel('Average Delay in Minutes')
plt.title('2016 Departure/Arrival Average Delay Time by Month')
For both departure and arrival delays, the summer months and December have the highest average delay times. We expected this result due to the high traffic in these months. On the other hand, September, October, and November are the months with the least delay, while March also posts high delay values. A positive correlation between departure and arrival delays can be observed.
The flights are distributed fairly evenly across the days of the week. The average departure delay on weekends appears slightly lower than on weekdays.
plt.figure(figsize=(12, 6))
dep_avg_week_df = df[['DAY_OF_WEEK']].groupby('DAY_OF_WEEK').size()
bar_width = 0.4
plt.bar(np.arange(1,8)-0.5*bar_width,dep_avg_week_df, width = bar_width, alpha=0.5, color= 'g')
x_labels = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']
plt.xticks(np.arange(1, 8), x_labels)
plt.xlabel('Day of the Week')
plt.ylabel('Number of Flights')
plt.title('2016 Number of Flights by Day of The Week')
plt.figure(figsize=(12, 6))
dep_avg_week_df = df[['DAY_OF_WEEK','DEP_DELAY']].groupby('DAY_OF_WEEK').mean()
arr_avg_week_df = df[['DAY_OF_WEEK','ARR_DELAY']].groupby('DAY_OF_WEEK').mean()
bar_width = 0.4
plt.bar(np.arange(1,8)-bar_width,dep_avg_week_df.DEP_DELAY, width = bar_width, alpha=0.5, color= 'b', label='Departure')
plt.bar(np.arange(1,8),arr_avg_week_df.ARR_DELAY, width = bar_width, alpha=0.3, color= 'r', label='Arrival')
x_labels = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']
plt.xticks(np.arange(1, 8), x_labels)
plt.legend(loc='upper left')
plt.xlabel('Day of the Week')
plt.ylabel('Average Delay in Minutes')
plt.title('2016 Departure/Arrival Average Delay Time by Day of The Week')
The average delays are fairly similar across the week, with the highest on Friday and the lowest on Tuesday. A positive correlation between departure and arrival delays can be observed.
We would expect fewer flights during the early hours of each day (1:00 AM to 5:00 AM).
plt.figure(figsize=(12,6))
dep_avg_hr_df = df[['DEP_HR']].groupby('DEP_HR').size()
crs_dep_avg_hr_df = df[['CRS_DEP_HR']].groupby('CRS_DEP_HR').size()
bar_width = 0.4
plt.bar(np.arange(0, 24) - bar_width, dep_avg_hr_df, width=bar_width, alpha=0.5, color='g', label='Actual')
plt.bar(np.arange(0, 24), crs_dep_avg_hr_df, width=bar_width, alpha=0.3, color='b', label='Scheduled')
plt.legend(loc='upper left')
plt.xticks(np.arange(0,24))
plt.xlabel('Time of the Day')
plt.ylabel('Number of Flights')
plt.title('2016 Number of Departures by Time of the Day')
plt.figure(figsize=(12,6))
dep_avg_hr_df = df[['CRS_DEP_HR','DEP_DELAY']].groupby('CRS_DEP_HR').mean()
arr_avg_hr_df = df[['CRS_ARR_HR','ARR_DELAY']].groupby('CRS_ARR_HR').mean()
bar_width = 0.4
plt.bar(np.arange(0, 24) - bar_width, dep_avg_hr_df.DEP_DELAY, width=bar_width, alpha=0.5, color='b', label='Departure')
plt.bar(np.arange(0, 24), arr_avg_hr_df.ARR_DELAY, width=bar_width, alpha=0.3, color='r', label='Arrival')
plt.legend(loc='upper center')
plt.xticks(np.arange(0,24))
plt.xlabel('Time of the Day')
plt.ylabel('Average Delay in Minutes')
plt.title('2016 Departure/Arrival Average Delay Time by Time of the Day (Scheduled)')
A "V"-shaped pattern, with the lowest delays in the early morning hours, can be observed; this is due to the low traffic at those hours. Both departure and arrival delays accumulate from the early morning, reaching their peaks in the evening (18:00 to 21:00); the peak is flat for arrival delays between 19:00 and 22:00. The increasing trend of average delay over the hours of the day is mainly caused by delay propagation: although some flights are scheduled with buffer time for unforeseeable delays, this buffer is not sufficient to cover all types of delay. If a flight is delayed, the next flight has to wait for the late-arriving aircraft before it can operate, so delays for both departures and arrivals propagate over the day.
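The propagation mechanism described above can be illustrated with a toy model (all numbers hypothetical): each leg of an aircraft's day inherits the previous leg's lateness minus a fixed turnaround buffer, plus any fresh disruption on that leg.

```python
# Toy propagation model: one aircraft flying consecutive legs.
BUFFER = 10                          # minutes of slack per turnaround (assumed)
leg_disruptions = [25, 5, 0, 30, 0]  # fresh delay introduced on each leg

carryover = 0
departure_delays = []
for disruption in leg_disruptions:
    # The buffer absorbs part of the inherited delay, but never goes negative.
    delay = max(0, carryover - BUFFER) + disruption
    departure_delays.append(delay)
    carryover = delay

print(departure_delays)  # [25, 20, 10, 30, 20]
```

Even legs with no fresh disruption (the third and fifth) still depart late, which mirrors the rising delay curve over the day.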
plt.figure(figsize=(12,6))
dep_avg_hr_df = df[['DEP_HR','DEP_DELAY']].groupby('DEP_HR').mean()
arr_avg_hr_df = df[['ARR_HR','ARR_DELAY']].groupby('ARR_HR').mean()
bar_width = 0.4
plt.bar(np.arange(0, 24) - bar_width, dep_avg_hr_df.DEP_DELAY, width=bar_width, alpha=0.5, color='b', label='Departure')
plt.bar(np.arange(0, 24), arr_avg_hr_df.ARR_DELAY, width=bar_width, alpha=0.3, color='r', label='Arrival')
plt.legend(loc='upper center')
plt.xticks(np.arange(0,24))
plt.xlabel('Time of the Day')
plt.ylabel('Average Delay in Minutes')
plt.title('2016 Departure/Arrival Average Delay Time by Time of the Day (Actual)')
There are 308 airports in California. We identify four of them as the most important: SFO (San Francisco), LAX (Los Angeles), SAN (San Diego), and OAK (Oakland). Here we focus on these four main airports:
df_lax = df[(df['ORIGIN'] == 'LAX') | (df['DEST']=='LAX')]
df_sfo = df[(df['ORIGIN'] == 'SFO') | (df['DEST']=='SFO')]
df_san = df[(df['ORIGIN'] == 'SAN') | (df['DEST']=='SAN')]
df_oak = df[(df['ORIGIN'] == 'OAK') | (df['DEST']=='OAK')]
f, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2, sharex='col', sharey='row', figsize=(12, 8))
dep_avg_lax_df = df_lax[['MONTH','DEP_DELAY']].groupby('MONTH').mean()
arr_avg_lax_df = df_lax[['MONTH','ARR_DELAY']].groupby('MONTH').mean()
bar_width = 0.4
ax1.bar(np.arange(1,13)-bar_width,dep_avg_lax_df.DEP_DELAY, width = bar_width, alpha=0.5, color= 'b', label='Departure')
ax1.bar(np.arange(1,13),arr_avg_lax_df.ARR_DELAY, width = bar_width, alpha=0.3, color= 'r', label='Arrival')
ax1.set_title('LAX')
ax1.legend(loc='upper left',prop={'size':6})
ax1.set_xticks(np.arange(1,13))
dep_avg_sfo_df = df_sfo[['MONTH','DEP_DELAY']].groupby('MONTH').mean()
arr_avg_sfo_df = df_sfo[['MONTH','ARR_DELAY']].groupby('MONTH').mean()
ax2.bar(np.arange(1,13)-bar_width,dep_avg_sfo_df.DEP_DELAY, width = bar_width, alpha=0.5, color= 'b', label='Departure')
ax2.bar(np.arange(1,13),arr_avg_sfo_df.ARR_DELAY, width = bar_width, alpha=0.3, color= 'r', label='Arrival')
ax2.set_title('SFO')
ax2.set_xticks(np.arange(1,13))
dep_avg_san_df = df_san[['MONTH','DEP_DELAY']].groupby('MONTH').mean()
arr_avg_san_df = df_san[['MONTH','ARR_DELAY']].groupby('MONTH').mean()
ax3.bar(np.arange(1,13)-bar_width,dep_avg_san_df.DEP_DELAY, width = bar_width, alpha=0.5, color= 'b', label='Departure')
ax3.bar(np.arange(1,13),arr_avg_san_df.ARR_DELAY, width = bar_width, alpha=0.3, color= 'r', label='Arrival')
ax3.set_title('SAN')
ax3.set_xticks(np.arange(1,13))
dep_avg_oak_df = df_oak[['MONTH','DEP_DELAY']].groupby('MONTH').mean()
arr_avg_oak_df = df_oak[['MONTH','ARR_DELAY']].groupby('MONTH').mean()
ax4.bar(np.arange(1,13)-bar_width,dep_avg_oak_df.DEP_DELAY, width = bar_width, alpha=0.5, color= 'b', label='Departure')
ax4.bar(np.arange(1,13),arr_avg_oak_df.ARR_DELAY, width = bar_width, alpha=0.3, color= 'r', label='Arrival')
ax4.set_title('OAK')
ax4.set_xticks(np.arange(1,13))
In the figure above, we can see that flights departing from SFO and LAX tend to have longer delays than the overall average in every month, as they are the two busiest airports in California. Specifically, flights departing from SFO have longer delays in March, October, and December. Located in a popular tourist city, LAX is expected to be at peak traffic during the summer vacation and Christmas, and the result is consistent with this expectation: flights departing from LAX have longer delays in December, June, July, and August. Among these four airports, SAN has the shortest departure and arrival delays.
from ipywidgets import *
from IPython.display import display
import seaborn as sns
def HeatPlotting(delay_threshold):
    try:
        fig = plt.figure(figsize=(12, 12))
        # LAX: average departure delay per (origin, destination) pair
        dep_avg_org_lax = df_lax[['DEST','ORIGIN','DEP_DELAY']].groupby(['ORIGIN', 'DEST']).mean()
        dep_avg_org_lax = dep_avg_org_lax.reset_index()
        dep_avg_org_lax = dep_avg_org_lax[dep_avg_org_lax.DEP_DELAY > delay_threshold]
        df_dep_lax = pd.DataFrame(dep_avg_org_lax.values).pivot(0, 1, 2).fillna(0)
        ax1 = fig.add_subplot(221)
        ax1 = sns.heatmap(df_dep_lax, cmap="YlGnBu")
        ax1.set_title('LAX')
        # SFO
        dep_avg_org_sfo = df_sfo[['DEST','ORIGIN','DEP_DELAY']].groupby(['ORIGIN', 'DEST']).mean()
        dep_avg_org_sfo = dep_avg_org_sfo.reset_index()
        dep_avg_org_sfo = dep_avg_org_sfo[dep_avg_org_sfo.DEP_DELAY > delay_threshold]
        df_dep_sfo = pd.DataFrame(dep_avg_org_sfo.values).pivot(0, 1, 2).fillna(0)
        ax2 = fig.add_subplot(222)
        ax2 = sns.heatmap(df_dep_sfo, cmap="YlGnBu")
        ax2.set_title('SFO')
        # SAN
        dep_avg_org_san = df_san[['DEST','ORIGIN','DEP_DELAY']].groupby(['ORIGIN', 'DEST']).mean()
        dep_avg_org_san = dep_avg_org_san.reset_index()
        dep_avg_org_san = dep_avg_org_san[dep_avg_org_san.DEP_DELAY > delay_threshold]
        df_dep_san = pd.DataFrame(dep_avg_org_san.values).pivot(0, 1, 2).fillna(0)
        ax3 = fig.add_subplot(223)
        ax3 = sns.heatmap(df_dep_san, cmap="YlGnBu")
        ax3.set_title('SAN')
        # OAK (plot the OAK data here, not SAN)
        dep_avg_org_oak = df_oak[['DEST','ORIGIN','DEP_DELAY']].groupby(['ORIGIN', 'DEST']).mean()
        dep_avg_org_oak = dep_avg_org_oak.reset_index()
        dep_avg_org_oak = dep_avg_org_oak[dep_avg_org_oak.DEP_DELAY > delay_threshold]
        df_dep_oak = pd.DataFrame(dep_avg_org_oak.values).pivot(0, 1, 2).fillna(0)
        ax4 = fig.add_subplot(224)
        ax4 = sns.heatmap(df_dep_oak, cmap="YlGnBu")
        ax4.set_title('OAK')
    except Exception:
        print("No available flights for some airports!")
w = widgets.IntSlider(
value=15,
min=0,
max=50,
step=5,
description='Delay (min):',
disabled=False,
continuous_update=False,
orientation='horizontal',
readout=True,
readout_format='d',
slider_color='white'
)
display (w)
interact(HeatPlotting, delay_threshold=w)
Most of the longer delays belong to destinations with smaller airports; small airports appear to have higher delays than medium or large ones. This suggests we need a measure of airport size.
In this section, we explore the relationship between delays and the locations of the different airports. To do that, we use the OpenFlights database [4] and merge it with our flight dataset; OpenFlights maps the IATA [5] airport codes to the corresponding columns in the flights dataset. The geographical distribution of the flights in our dataset is illustrated below:
df_airport = pd.read_table('/mnt/c/Users/nsalehi/Desktop/airline_ca/airports.dat', error_bad_lines=False,sep=',', header=None)
df_airport.columns =['Airport ID','Name','City','Country','code','ICAO','lat','long','Altitude','Timezone','DST','Tz database time zone','Type','Source']
df_airport.drop(['Airport ID','Name','City','Country','ICAO','Altitude','Timezone','DST','Tz database time zone','Type','Source'], axis=1, inplace=True)
dep_avg_org_df = df[['ORIGIN','DEP_DELAY']].groupby('ORIGIN').mean()
dep_avg_org_df = dep_avg_org_df.reset_index()
arr_avg_org_df = df[['ORIGIN','ARR_DELAY']].groupby('ORIGIN').mean()
arr_avg_org_df = arr_avg_org_df.reset_index()
dep_map_airport_delay = pd.merge(dep_avg_org_df, df_airport, left_on = 'ORIGIN', right_on = 'code')
arr_map_airport_delay = pd.merge(arr_avg_org_df, df_airport, left_on = 'ORIGIN', right_on = 'code')
import folium
from folium.plugins import MarkerCluster
lax_cord = (33.942809, -118.404706)
# Create a map centered on LAX
delay_map = folium.Map(location=lax_cord, zoom_start=3, tiles='Mapbox Bright')
for _, row in dep_map_airport_delay.iterrows():
    folium.CircleMarker(location=[row['lat'], row['long']], radius=row['DEP_DELAY'],
                        popup=row['code'] + ', Departure Delay Average=' + str(row['DEP_DELAY']),
                        fill_color='#4c4cff').add_to(delay_map)
for _, row in arr_map_airport_delay.iterrows():
    folium.CircleMarker(location=[row['lat'], row['long']], radius=row['ARR_DELAY'],
                        popup=row['code'] + ', Arrival Delay Average=' + str(row['ARR_DELAY']),
                        fill_color='#ff4c4c').add_to(delay_map)
display(delay_map)
dep_avg_lax_df = df_lax[['ORIGIN','DEP_DELAY']].groupby('ORIGIN').mean()
dep_avg_lax_df = dep_avg_lax_df.reset_index()
arr_avg_lax_df = df_lax[['ORIGIN','ARR_DELAY']].groupby('ORIGIN').mean()
arr_avg_lax_df = arr_avg_lax_df.reset_index()
dep_map_lax_delay = pd.merge(dep_avg_lax_df, df_airport, left_on = 'ORIGIN', right_on = 'code')
arr_map_lax_delay = pd.merge(arr_avg_lax_df, df_airport, left_on = 'ORIGIN', right_on = 'code')
lax_cord = (33.942809, -118.404706)
# Create a map centered on LAX
lax_map = folium.Map(location=lax_cord, zoom_start=4, tiles='Mapbox Bright')
for _, row in dep_map_lax_delay.iterrows():
    folium.CircleMarker(location=[row['lat'], row['long']], radius=row['DEP_DELAY'],
                        popup=row['code'] + ', Departure Delay Average=' + str(row['DEP_DELAY']),
                        fill_color='#4c4cff').add_to(lax_map)
for _, row in arr_map_lax_delay.iterrows():
    folium.CircleMarker(location=[row['lat'], row['long']], radius=row['ARR_DELAY'],
                        popup=row['code'] + ', Arrival Delay Average=' + str(row['ARR_DELAY']),
                        fill_color='#ff4c4c').add_to(lax_map)
display(lax_map)
As mentioned before, most of the longer delays belong to destinations with smaller airports, which again suggests we need a measure of airport size. Additionally, we can consider the distance between the origin and destination airports as another factor.
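Since the OpenFlights table provides latitude/longitude for each airport, origin-destination distance can be derived with the haversine (great-circle) formula; a small self-contained sketch (the LAX and SFO coordinates are approximate, as used elsewhere in this notebook):

```python
from math import radians, sin, cos, asin, sqrt

def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in statute miles between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 3958.8 * asin(sqrt(a))  # Earth radius ~3958.8 miles

# LAX to SFO; should land near the ~337-mile segment distance BTS reports.
print(round(haversine_miles(33.9428, -118.4047, 37.6213, -122.3790)))
```

The same function applied row-wise to the merged lat/long columns would let us cross-check the Distance column in the BTS data.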
def RoutePlotting(delay_threshold2, delay_threshold3):
    org_dest = df[['DEST','ORIGIN','DEP_DELAY','ARR_DELAY']].groupby(['ORIGIN', 'DEST']).mean()
    org_dest = org_dest.reset_index()
    org_dest_map = pd.merge(org_dest, df_airport, left_on='ORIGIN', right_on='code')
    org_dest_map = pd.merge(org_dest_map, df_airport, left_on='DEST', right_on='code')
    # Filter: departure delay against the first slider, arrival delay against the second
    org_dest_map = org_dest_map[(org_dest_map.DEP_DELAY > delay_threshold2) &
                                (org_dest_map.ARR_DELAY > delay_threshold3)]
    lax_cord = (33.942809, -118.404706)
    route_map = folium.Map(location=lax_cord, zoom_start=3, tiles='Mapbox Bright')
    for _, row in org_dest_map.iterrows():
        pointA = [row['lat_x'], row['long_x']]
        pointB = [row['lat_y'], row['long_y']]
        if row['ARR_DELAY'] > 15 or row['DEP_DELAY'] > 15:
            set_color = '#ff4c4c'
        else:
            set_color = '#4c4cff'
        folium.PolyLine([pointA, pointB], color=set_color, weight=0.2, opacity=0.7).add_to(route_map)
    display(route_map)
w2 = widgets.IntSlider(
value=10,
min=0,
max=50,
step=5,
description='Departure Delay (min):',
disabled=False,
continuous_update=False,
orientation='horizontal',
readout=True,
readout_format='d',
slider_color='white',
)
display (w2)
w3 = widgets.IntSlider(
value=10,
min=0,
max=50,
step=5,
description='Arrival Delay (min):',
disabled=False,
continuous_update=False,
orientation='horizontal',
readout=True,
readout_format='d',
slider_color='white'
)
display (w3)
interact(RoutePlotting, delay_threshold2=w2, delay_threshold3=w3)
As we can see in this figure, every flight either originates in California or has its destination in California. We can also see the main hubs of domestic and international traffic (higher flight density at hubs such as New York, Chicago, and Miami), and that flights to major US hubs, such as Florida, New York, and Boston, have a higher chance of being delayed.
Next, we look at delays linked to carriers, for all airports in California. Twelve carriers operate flights to or from California in this dataset: 'DL' 'B6' 'AA' 'AS' 'F9' 'VX' 'WN' 'UA' 'OO' 'HA' 'NK' 'EV'. The scope of their operations is illustrated below:
plt.figure(figsize=(12, 6))
num_ca_df = df[['UNIQUE_CARRIER']].groupby('UNIQUE_CARRIER').size()
Carriers = num_ca_df.index  # use the groupby index so labels match the bars
bar_width = 0.4
plt.bar(np.arange(1, 13) - 0.5 * bar_width, num_ca_df, width=bar_width, alpha=0.5, color='g')
plt.xticks(np.arange(1, 13), Carriers)
plt.xlabel('Carriers')
plt.ylabel('Number of Flights')
plt.title('2016 Number of Flights by Carrier')
plt.figure(figsize=(12, 6))
dep_avg_ca_df = df[['UNIQUE_CARRIER','DEP_DELAY']].groupby('UNIQUE_CARRIER').mean()
arr_avg_ca_df = df[['UNIQUE_CARRIER','ARR_DELAY']].groupby('UNIQUE_CARRIER').mean()
Carriers = dep_avg_ca_df.index  # use the groupby index so labels match the bars
bar_width = 0.4
plt.bar(np.arange(1, 13) - bar_width, dep_avg_ca_df.DEP_DELAY, width=bar_width, alpha=0.5, color='b', label='Departure')
plt.bar(np.arange(1, 13), arr_avg_ca_df.ARR_DELAY, width=bar_width, alpha=0.3, color='r', label='Arrival')
plt.legend(loc='upper left')
plt.xticks(np.arange(1, 13), Carriers)
plt.xlabel('Carriers')
plt.ylabel('Average Delay in Minutes')
plt.title('2016 Departure/Arrival Average Delay Time by Carrier')
fig = plt.figure(figsize=(12, 6))
ax1 = fig.add_subplot(221)
dep_avg_lax_df = df_lax[['UNIQUE_CARRIER']].groupby('UNIQUE_CARRIER').size()
lax_carriers = dep_avg_lax_df.index  # groupby index keeps labels aligned with bars
lax_car = len(lax_carriers) + 1
bar_width = 0.4
ax1.bar(np.arange(1, lax_car) - 0.5 * bar_width, dep_avg_lax_df, width=bar_width, alpha=0.5, color='g')
ax1.set_title('LAX')
ax1.set_xticks(np.arange(1, lax_car))
ax1.set_xticklabels(lax_carriers, rotation='vertical')
ax2 = fig.add_subplot(222)
dep_avg_sfo_df = df_sfo[['UNIQUE_CARRIER']].groupby('UNIQUE_CARRIER').size()
sfo_carriers = dep_avg_sfo_df.index
sfo_car = len(sfo_carriers) + 1
ax2.bar(np.arange(1, sfo_car) - 0.5 * bar_width, dep_avg_sfo_df, width=bar_width, alpha=0.5, color='g')
ax2.set_title('SFO')
ax2.set_xticks(np.arange(1, sfo_car))
ax2.set_xticklabels(sfo_carriers, rotation='vertical')
ax3 = fig.add_subplot(223)
dep_avg_san_df = df_san[['UNIQUE_CARRIER']].groupby('UNIQUE_CARRIER').size()
san_carriers = dep_avg_san_df.index
san_car = len(san_carriers) + 1
ax3.bar(np.arange(1, san_car) - 0.5 * bar_width, dep_avg_san_df, width=bar_width, alpha=0.5, color='g')
ax3.set_title('SAN')
ax3.set_xticks(np.arange(1, san_car))
ax3.set_xticklabels(san_carriers, rotation='vertical')
ax4 = fig.add_subplot(224)
dep_avg_oak_df = df_oak[['UNIQUE_CARRIER']].groupby('UNIQUE_CARRIER').size()
oak_carriers = dep_avg_oak_df.index
oak_car = len(oak_carriers) + 1
ax4.bar(np.arange(1, oak_car) - 0.5 * bar_width, dep_avg_oak_df, width=bar_width, alpha=0.5, color='g')
ax4.set_title('OAK')
ax4.set_xticks(np.arange(1, oak_car))
ax4.set_xticklabels(oak_carriers, rotation='vertical')
fig.tight_layout()
In the first figure, the vertical axis shows the number of flights per carrier. In the second, the vertical axis shows the average departure/arrival delay by carrier, and in the third the early flights are removed. Across the 12 unique carriers, average flight delays vary considerably. However, this analysis is affected by the number of flights per carrier. Virgin America (VX), Southwest Airlines (WN), United Airlines (UA), and Spirit Airlines (NK) operated fewer flights in 2016 than the other eight airlines, which may be why WN shows a low average delay. Another interesting observation is ExpressJet (EV), with the highest number of flights but a relatively low mean delay. Next, we look into the distribution of departure/arrival delays at the four main airports in California (Los Angeles, San Francisco, San Diego, and Oakland):
fig = plt.figure(figsize=(12, 6))
ax1 = fig.add_subplot(221)
dep_avg_lax_df = df_lax[['UNIQUE_CARRIER','DEP_DELAY']].groupby('UNIQUE_CARRIER').mean()
arr_avg_lax_df = df_lax[['UNIQUE_CARRIER','ARR_DELAY']].groupby('UNIQUE_CARRIER').mean()
lax_carriers = dep_avg_lax_df.index  # groupby index keeps labels aligned with bars
lax_car = len(lax_carriers) + 1
bar_width = 0.4
ax1.bar(np.arange(1, lax_car) - bar_width, dep_avg_lax_df.DEP_DELAY, width=bar_width, alpha=0.5, color='b', label='Departure')
ax1.bar(np.arange(1, lax_car), arr_avg_lax_df.ARR_DELAY, width=bar_width, alpha=0.3, color='r', label='Arrival')
ax1.set_title('LAX')
ax1.legend(loc='upper left', prop={'size': 6})
ax1.set_xticks(np.arange(1, lax_car))
ax1.set_xticklabels(lax_carriers, rotation='vertical')
ax2 = fig.add_subplot(222)
dep_avg_sfo_df = df_sfo[['UNIQUE_CARRIER','DEP_DELAY']].groupby('UNIQUE_CARRIER').mean()
arr_avg_sfo_df = df_sfo[['UNIQUE_CARRIER','ARR_DELAY']].groupby('UNIQUE_CARRIER').mean()
sfo_carriers = dep_avg_sfo_df.index
sfo_car = len(sfo_carriers) + 1
ax2.bar(np.arange(1, sfo_car) - bar_width, dep_avg_sfo_df.DEP_DELAY, width=bar_width, alpha=0.5, color='b', label='Departure')
ax2.bar(np.arange(1, sfo_car), arr_avg_sfo_df.ARR_DELAY, width=bar_width, alpha=0.3, color='r', label='Arrival')
ax2.set_title('SFO')
ax2.set_xticks(np.arange(1, sfo_car))
ax2.set_xticklabels(sfo_carriers, rotation='vertical')
ax3 = fig.add_subplot(223)
dep_avg_san_df = df_san[['UNIQUE_CARRIER','DEP_DELAY']].groupby('UNIQUE_CARRIER').mean()
arr_avg_san_df = df_san[['UNIQUE_CARRIER','ARR_DELAY']].groupby('UNIQUE_CARRIER').mean()
san_carriers = dep_avg_san_df.index
san_car = len(san_carriers) + 1
ax3.bar(np.arange(1, san_car) - bar_width, dep_avg_san_df.DEP_DELAY, width=bar_width, alpha=0.5, color='b', label='Departure')
ax3.bar(np.arange(1, san_car), arr_avg_san_df.ARR_DELAY, width=bar_width, alpha=0.3, color='r', label='Arrival')
ax3.set_title('SAN')
ax3.set_xticks(np.arange(1, san_car))
ax3.set_xticklabels(san_carriers, rotation='vertical')
ax4 = fig.add_subplot(224)
dep_avg_oak_df = df_oak[['UNIQUE_CARRIER','DEP_DELAY']].groupby('UNIQUE_CARRIER').mean()
arr_avg_oak_df = df_oak[['UNIQUE_CARRIER','ARR_DELAY']].groupby('UNIQUE_CARRIER').mean()
oak_carriers = dep_avg_oak_df.index
oak_car = len(oak_carriers) + 1
ax4.bar(np.arange(1, oak_car) - bar_width, dep_avg_oak_df.DEP_DELAY, width=bar_width, alpha=0.5, color='b', label='Departure')
ax4.bar(np.arange(1, oak_car), arr_avg_oak_df.ARR_DELAY, width=bar_width, alpha=0.3, color='r', label='Arrival')
ax4.set_title('OAK')
ax4.set_xticks(np.arange(1, oak_car))
ax4.set_xticklabels(oak_carriers, rotation='vertical')
fig.tight_layout()
The first figure illustrates the distribution of the flights across these four airports. In the second and third figures (with early flights subtracted), we can clearly see the effect of the origin/destination airport on the delays. Airports with fewer flights (SAN and OAK) have smaller average delays compared to the larger airports, which can be attributed to the traffic at the airport and the scope of its operations: more crowded airports (LAX and SFO have more flights) tend to have higher average delays. Small and large airports also have higher average delays than medium-sized airports. We can also see that the mainstream airlines (such as Delta) have lower arrival delays on average than smaller airlines.
Here we introduce a heuristic measure of airport size as a new feature. We compute the total number of incoming/outgoing flights for each airport and cluster the airports into three categories (small, medium, and large) using the K-means algorithm.
from sklearn.cluster import KMeans
df_org_size = df[['ORIGIN']].groupby('ORIGIN').size()
airport_names =list(df_org_size.index)
df_org_size = np.reshape(df_org_size, (len(df_org_size), 1))
kmeans = KMeans(n_clusters=3, random_state=0).fit(df_org_size)
air_org_cluster = pd.concat([pd.DataFrame(airport_names),pd.DataFrame(kmeans.labels_)],axis=1)
air_org_cluster.columns = ['Airport', 'Size']
df_dest_size = df[['DEST']].groupby('DEST').size()
airport_names =list(df_dest_size.index)
df_dest_size = np.reshape(df_dest_size, (len(df_dest_size), 1))
kmeans = KMeans(n_clusters=3, random_state=0).fit(df_dest_size)
air_dest_cluster = pd.concat([pd.DataFrame(airport_names),pd.DataFrame(kmeans.labels_)],axis=1)
air_dest_cluster.columns = ['Airport', 'Size']
df = pd.merge(df, air_org_cluster, left_on = 'ORIGIN', right_on = 'Airport')
df.drop(['Airport'], axis=1, inplace=True)
df = df.rename(columns={'Size': 'org_size'})
df = pd.merge(df, air_dest_cluster, left_on = 'DEST', right_on = 'Airport')
df.drop(['Airport'], axis=1, inplace=True)
df = df.rename(columns={'Size': 'dest_size'})
Since neural networks are scale sensitive, we first need to scale the departure delay times. We ran the Box-Cox procedure [7] to find an appropriate transformation for the heavily skewed distribution of the departure delay, and settled on the cube root transformation, which is well defined for both negative (early) and positive delays. We then center the transformed delays on their mean and finally apply the min-max transformation y_new = (y - y_min) / (y_max - y_min), so our new labels are scaled between 0 and 1. This last step is necessary because our predictions are outputs of a sigmoid function.
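The label transform just described, together with the inverse needed to map predictions back to minutes, can be sketched as follows (the helper names and toy delay values are illustrative, not part of the project code):

```python
import numpy as np

def scale_delays(y):
    """Cube root -> mean-centre -> min-max, as described above."""
    c = np.cbrt(y)                       # defined for early (negative) and late delays
    m = c.mean()
    centred = c - m
    lo, hi = centred.min(), centred.max()
    return (centred - lo) / (hi - lo), (m, lo, hi)

def unscale_delays(s, params):
    """Invert the transform to recover delays in minutes."""
    m, lo, hi = params
    return (s * (hi - lo) + lo + m) ** 3

delays = np.array([-12.0, -3.0, 0.0, 8.0, 45.0, 180.0])   # toy delays in minutes
scaled, params = scale_delays(delays)                      # all values in [0, 1]
```

Keeping the three fitted parameters around is what makes the predictions interpretable again after the sigmoid output layer.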
The next transformation is applied to the features. Since the selected columns are categorical, we consider two encodings for each column: factorized and one-hot [7]. In factorization [8], each distinct value in a column is coded with a different integer level. In one-hot encoding, we create a binary indicator column for each level of the original column. One-hot encoding tends to give more accurate predictions; however, the size of the dataframe increases drastically.
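A toy comparison of the two encodings (the carrier codes here are illustrative):

```python
import pandas as pd

col = pd.Series(['AA', 'DL', 'UA', 'DL', 'AA'], name='UNIQUE_CARRIER')

# Factorized: one integer level per category -> stays a single column
levels, uniques = pd.factorize(col)        # levels: [0, 1, 2, 1, 0]

# One-hot: one binary indicator column per category -> frame gets wider
onehot = pd.get_dummies(col)               # columns: ['AA', 'DL', 'UA']
```

With hundreds of airports and carriers, the one-hot frame grows to hundreds of columns while the factorized one keeps the original width, which is the size/accuracy trade-off mentioned above.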
Finally, we need to divide the data into train and test subsets. To do that, we randomly shuffle the rows and split the dataset into 70% training and 30% test subsets. Since the resulting dataframes are huge, we store them in HDF5 (Hierarchical Data Format) files [9]; otherwise the data quickly fills the memory and the system crashes. With HDF5 we can load chunks of the data into memory only when we need them.
##Calif - one-hot
from sklearn.model_selection import train_test_split
import tables
import numpy as np
import h5py
dep_df = df[['MONTH','DAY_OF_WEEK','UNIQUE_CARRIER','ORIGIN','DEST','CRS_DEP_HR','CRS_ARR_HR','DISTANCE_GROUP','DEP_DELAY','DEP_DEL15','org_size','dest_size']]
dep_train, dep_test = train_test_split(dep_df, test_size = 0.3)
dep_train_x = dep_train[['MONTH','DAY_OF_WEEK','UNIQUE_CARRIER','ORIGIN','DEST','CRS_DEP_HR','CRS_ARR_HR','DISTANCE_GROUP','org_size','dest_size']]
dep_train_cy = dep_train['DEP_DEL15']
dep_train_y = dep_train['DEP_DELAY']
dep_train_y = np.cbrt(dep_train_y)
dep_train_y = dep_train_y - np.mean(dep_train_y)
dep_train_y = (dep_train_y -np.min(dep_train_y,axis=0))/(np.max(dep_train_y,axis=0)-np.min(dep_train_y,axis=0))
dep_test_x = dep_test[['MONTH','DAY_OF_WEEK','UNIQUE_CARRIER','ORIGIN','DEST','CRS_DEP_HR','CRS_ARR_HR','DISTANCE_GROUP','org_size','dest_size']]
dep_test_cy = dep_test['DEP_DEL15']
dep_test_y = dep_test['DEP_DELAY']
dep_test_y = np.cbrt(dep_test_y)  # apply the same cube-root + centering transform as the training labels
dep_test_y = dep_test_y - np.mean(dep_test_y)
dep_test_y = (dep_test_y -np.min(dep_test_y,axis=0))/(np.max(dep_test_y,axis=0)-np.min(dep_test_y,axis=0))
print(dep_train_x.shape)
print(dep_train_y.shape)
del df, dep_df, dep_train, dep_test
train_objs_num = len(dep_train_x)
dataset = pd.concat(objs=[dep_train_x, dep_test_x], axis=0)
dataset_preprocessed = pd.get_dummies(dataset[['MONTH','DAY_OF_WEEK','UNIQUE_CARRIER','ORIGIN','DEST','CRS_DEP_HR','CRS_ARR_HR','DISTANCE_GROUP','org_size','dest_size']])
dep_train_x = dataset_preprocessed[:train_objs_num]
dep_test_x = dataset_preprocessed[train_objs_num:]
dep_train_x.drop(['MONTH','DAY_OF_WEEK','CRS_DEP_HR','CRS_ARR_HR','DISTANCE_GROUP','org_size','dest_size'], axis=1, inplace=True)
col_train = dep_train_x.columns
dep_test_x.drop(['MONTH','DAY_OF_WEEK','CRS_DEP_HR','CRS_ARR_HR','DISTANCE_GROUP','org_size','dest_size'], axis=1, inplace=True)
col_test = dep_test_x.columns
print(dep_train_x.shape)
print(dep_train_y.shape)
h5f = h5py.File('/mnt/c/Users/nsalehi/Desktop/airline_ca/data.h5', 'w')
h5f.create_dataset('train_X', data=dep_train_x, chunks=True)
h5f.create_dataset('train_y', data=dep_train_y, chunks=True)
h5f.create_dataset('train_cy',data=dep_train_cy, chunks=True)
h5f.create_dataset('test_X', data=dep_test_x, chunks=True)
h5f.create_dataset('test_y', data=dep_test_y, chunks=True)
h5f.create_dataset('test_cy', data=dep_test_cy, chunks=True)
h5f.close()
print('saved!')
##Calif - factorized
from sklearn.model_selection import train_test_split
import tables
import numpy as np
import h5py
dep_df = df[['MONTH','DAY_OF_WEEK','UNIQUE_CARRIER','ORIGIN','DEST','CRS_DEP_HR','CRS_ARR_HR','DISTANCE_GROUP','DEP_DELAY','DEP_DEL15','org_size','dest_size']]
dep_train, dep_test = train_test_split(dep_df, test_size = 0.3)
dep_train_x = dep_train[['MONTH','DAY_OF_WEEK','UNIQUE_CARRIER','ORIGIN','DEST','CRS_DEP_HR','CRS_ARR_HR','DISTANCE_GROUP','org_size','dest_size']]
dep_train_cy = dep_train['DEP_DEL15']
dep_train_y = dep_train['DEP_DELAY']
dep_train_y = np.cbrt(dep_train_y)  # same cube-root + centering transform as the test labels below
dep_train_y = dep_train_y - np.mean(dep_train_y)
dep_train_y = (dep_train_y -np.min(dep_train_y,axis=0))/(np.max(dep_train_y,axis=0)-np.min(dep_train_y,axis=0))
dep_test_x = dep_test[['MONTH','DAY_OF_WEEK','UNIQUE_CARRIER','ORIGIN','DEST','CRS_DEP_HR','CRS_ARR_HR','DISTANCE_GROUP','org_size','dest_size']]
dep_test_cy = dep_test['DEP_DEL15']
dep_test_y = dep_test['DEP_DELAY']
dep_test_y = np.cbrt(dep_test_y)
dep_test_y = dep_test_y - np.mean(dep_test_y)
dep_test_y = (dep_test_y -np.min(dep_test_y,axis=0))/(np.max(dep_test_y,axis=0)-np.min(dep_test_y,axis=0))
train_objs_num = len(dep_train_x)
dataset = pd.concat(objs=[dep_train_x, dep_test_x], axis=0)
dataset_preprocessed = dataset[['MONTH','DAY_OF_WEEK','UNIQUE_CARRIER','ORIGIN','DEST','CRS_DEP_HR','CRS_ARR_HR','DISTANCE_GROUP','org_size','dest_size']].apply(lambda x: pd.factorize(x)[0])
dep_train_x = dataset_preprocessed[:train_objs_num]
dep_test_x = dataset_preprocessed[train_objs_num:]
h5f = h5py.File('/mnt/c/Users/nsalehi/Desktop/airline_ca/data_fac.h5', 'w')
h5f.create_dataset('train_X', data=dep_train_x)
h5f.create_dataset('train_y', data=dep_train_y)
h5f.create_dataset('train_cy', data=dep_train_cy)
h5f.create_dataset('test_X', data=dep_test_x)
h5f.create_dataset('test_y', data=dep_test_y)
h5f.create_dataset('test_cy', data=dep_test_cy)
h5f.close()
print('saved!')
## 2016 LAX-ORIGIN Database - One-hot Encoded
from sklearn.model_selection import train_test_split
import tables
import numpy as np
import h5py
df_lax_org = df[df['ORIGIN'] == 'LAX']
dep_df = df_lax_org[['MONTH','DAY_OF_WEEK','UNIQUE_CARRIER','DEST','CRS_DEP_HR','CRS_ARR_HR','DISTANCE_GROUP','DEP_DELAY','DEP_DEL15','dest_size']]
dep_train, dep_test = train_test_split(dep_df, test_size = 0.3)
dep_train_x = dep_train[['MONTH','DAY_OF_WEEK','UNIQUE_CARRIER','DEST','CRS_DEP_HR','CRS_ARR_HR','DISTANCE_GROUP','dest_size']]
dep_train_cy = dep_train['DEP_DEL15']
dep_train_y = dep_train['DEP_DELAY']
dep_train_y = np.cbrt(dep_train_y)
dep_train_y = dep_train_y - np.mean(dep_train_y)
dep_train_y = (dep_train_y -np.min(dep_train_y,axis=0))/(np.max(dep_train_y,axis=0)-np.min(dep_train_y,axis=0))
dep_test_x = dep_test[['MONTH','DAY_OF_WEEK','UNIQUE_CARRIER', 'DEST','CRS_DEP_HR','CRS_ARR_HR','DISTANCE_GROUP','dest_size']]
dep_test_cy = dep_test['DEP_DEL15']
dep_test_y = dep_test['DEP_DELAY']
dep_test_y = np.cbrt(dep_test_y)
dep_test_y = dep_test_y - np.mean(dep_test_y)
dep_test_y = (dep_test_y -np.min(dep_test_y,axis=0))/(np.max(dep_test_y,axis=0)-np.min(dep_test_y,axis=0))
print(dep_train_y.shape)
train_objs_num = len(dep_train_x)
dataset = pd.concat(objs=[dep_train_x, dep_test_x], axis=0)
dataset_preprocessed = pd.get_dummies(dataset[['MONTH','DAY_OF_WEEK','UNIQUE_CARRIER','DEST','CRS_DEP_HR','CRS_ARR_HR','DISTANCE_GROUP','dest_size']])
dep_train_x = dataset_preprocessed[:train_objs_num]
dep_test_x = dataset_preprocessed[train_objs_num:]
dep_train_x.drop(['MONTH','DAY_OF_WEEK','CRS_DEP_HR','CRS_ARR_HR','DISTANCE_GROUP','dest_size'], axis=1, inplace=True)
col_train = dep_train_x.columns
dep_test_x.drop(['MONTH','DAY_OF_WEEK','CRS_DEP_HR','CRS_ARR_HR','DISTANCE_GROUP','dest_size'], axis=1, inplace=True)
col_test = dep_test_x.columns
h5f = h5py.File('/mnt/c/Users/nsalehi/Desktop/airline_ca/data_lax.h5', 'w')
h5f.create_dataset('train_X', data=dep_train_x)
h5f.create_dataset('train_y', data=dep_train_y)
h5f.create_dataset('train_cy', data=dep_train_cy)
h5f.create_dataset('test_X', data=dep_test_x)
h5f.create_dataset('test_y', data=dep_test_y)
h5f.create_dataset('test_cy', data=dep_test_cy)
h5f.close()
print('saved!')
## 2016 LAX-ORIGIN Database - Factorized
from sklearn.model_selection import train_test_split
import tables
import numpy as np
import h5py
df_lax_org = df[df['ORIGIN'] == 'LAX']
dep_df = df_lax_org[['MONTH','DAY_OF_WEEK','UNIQUE_CARRIER','DEST','CRS_DEP_HR','CRS_ARR_HR','DISTANCE_GROUP','DEP_DELAY','DEP_DEL15','dest_size']]
dep_train, dep_test = train_test_split(dep_df, test_size = 0.3)
dep_train_x = dep_train[['MONTH','DAY_OF_WEEK','UNIQUE_CARRIER','DEST','CRS_DEP_HR','CRS_ARR_HR','DISTANCE_GROUP','dest_size']]
dep_train_cy = dep_train['DEP_DEL15']
dep_train_y = dep_train['DEP_DELAY']
dep_train_y = np.cbrt(dep_train_y)  # same cube-root + centering transform as in the other subsets
dep_train_y = dep_train_y - np.mean(dep_train_y)
dep_train_y = (dep_train_y -np.min(dep_train_y,axis=0))/(np.max(dep_train_y,axis=0)-np.min(dep_train_y,axis=0))
dep_test_x = dep_test[['MONTH','DAY_OF_WEEK','UNIQUE_CARRIER', 'DEST','CRS_DEP_HR','CRS_ARR_HR','DISTANCE_GROUP','dest_size']]
dep_test_cy = dep_test['DEP_DEL15']
dep_test_y = dep_test['DEP_DELAY']
dep_test_y = np.cbrt(dep_test_y)  # same cube-root + centering transform as in the other subsets
dep_test_y = dep_test_y - np.mean(dep_test_y)
dep_test_y = (dep_test_y -np.min(dep_test_y,axis=0))/(np.max(dep_test_y,axis=0)-np.min(dep_test_y,axis=0))
print(dep_train_y.shape)
train_objs_num = len(dep_train_x)
dataset = pd.concat(objs=[dep_train_x, dep_test_x], axis=0)
dataset_preprocessed = dataset[['MONTH','DAY_OF_WEEK','UNIQUE_CARRIER','DEST','CRS_DEP_HR','CRS_ARR_HR','DISTANCE_GROUP','dest_size']].apply(lambda x: pd.factorize(x)[0])
dep_train_x = dataset_preprocessed[:train_objs_num]
dep_test_x = dataset_preprocessed[train_objs_num:]
h5f = h5py.File('/mnt/c/Users/nsalehi/Desktop/airline_ca/data_fac_lax.h5', 'w')
h5f.create_dataset('train_X', data=dep_train_x)
h5f.create_dataset('train_y', data=dep_train_y)
h5f.create_dataset('train_cy', data=dep_train_cy)
h5f.create_dataset('test_X', data=dep_test_x)
h5f.create_dataset('test_y', data=dep_test_y)
h5f.create_dataset('test_cy', data=dep_test_cy)
h5f.close()
print('saved!')
import pandas as pd
import csv
import numpy as np
import h5py
h5f = h5py.File('/mnt/c/Users/nsalehi/Desktop/airline_ca/data.h5', 'r')
train_X = h5f['train_X']
train_y = h5f['train_y']
train_y = np.reshape(train_y, (len(train_y), 1))
test_X = h5f['test_X']
test_y = np.array(h5f['test_y'])
test_y = np.reshape(test_y, (len(test_y), 1))
print('loaded!')
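The chunked-loading idea behind this block can be sketched on a toy file (the path and dataset below are illustrative stand-ins, not the project's data.h5):

```python
import os
import tempfile
import numpy as np
import h5py

# Build a small HDF5 file standing in for data.h5
path = os.path.join(tempfile.mkdtemp(), 'toy.h5')
with h5py.File(path, 'w') as f:
    f.create_dataset('train_X', data=np.arange(20).reshape(10, 2), chunks=True)

# h5py datasets are lazy: opening the file does not load the array, and
# slicing pulls only the requested rows into memory
batches = []
with h5py.File(path, 'r') as f:
    ds = f['train_X']
    for start in range(0, ds.shape[0], 4):
        batches.append(ds[start:start + 4])   # at most 4 rows at a time

rows_read = sum(len(b) for b in batches)
```

This is what allows the full encoded dataset to be fed to the model batch by batch without exhausting memory.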
In this section we use Deep Neural Networks (DNNs) to predict the departure delays in the flight dataset. We perform both regression, to predict the amount of delay, and classification, to predict whether a flight is going to be delayed.
For this study we consider a fully connected structure with one or two hidden layers, depending on the experiment. The number of hidden nodes also differs across experiments; given the time constraints, we tried to find a near-optimal number of nodes for each one. The activation function for each node is the sigmoid function, and the weights and biases are initialized randomly from a Normal distribution. The output layer has a single node and returns the scaled predicted delay. We used TensorFlow [10] to model the DNN in Python, with an MSE minimization objective and the Gradient Descent optimizer to update the weights in each epoch. The learning rates vary across experiments within the range [0.01, 1.00], and we used different schedules (constant, inverse-time decay, and exponential decay) to change the learning rate over the epochs. The number of epochs also differs between experiments. Loosely speaking, we tried to tune all of these hyperparameters for each experiment.
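The TensorFlow graphs themselves are not shown here; purely as an illustration of the architecture described above (one hidden layer, sigmoid activations, Normal-initialised weights, MSE objective, plain gradient descent), here is a minimal NumPy sketch on toy data:

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# Toy data standing in for the encoded features and scaled delays
X = rng.random((256, 8))
y = sigmoid(X @ rng.normal(size=(8, 1)))            # targets in (0, 1)

# Normal-initialised weights, one hidden layer of 16 sigmoid nodes
W1, b1 = rng.normal(scale=0.1, size=(8, 16)), np.zeros((1, 16))
W2, b2 = rng.normal(scale=0.1, size=(16, 1)), np.zeros((1, 1))
lr = 0.5

mse0 = float(((sigmoid(sigmoid(X @ W1 + b1) @ W2 + b2) - y) ** 2).mean())

for epoch in range(500):
    h = sigmoid(X @ W1 + b1)
    p = sigmoid(h @ W2 + b2)
    # Backpropagation of MSE = mean((p - y)^2)
    dp = 2 * (p - y) / len(X) * p * (1 - p)
    dh = (dp @ W2.T) * h * (1 - h)
    W2 -= lr * (h.T @ dp); b2 -= lr * dp.sum(axis=0, keepdims=True)
    W1 -= lr * (X.T @ dh); b1 -= lr * dh.sum(axis=0, keepdims=True)

mse = float(((p - y) ** 2).mean())                   # should drop below mse0
```

The layer sizes, learning rate, and epoch count here are illustrative; the actual experiments tuned these per dataset as described above.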
For classification we one-hot encoded the labels, so the argmax of the predicted outputs gives the predicted class. The structure of the network is similar to the regression network of the previous section, except that here the output layer has two nodes. The objective is to minimize the softmax cross entropy between the logits and the labels, a measure of the probability error in discrete classification tasks where the classes are mutually exclusive. Minimizing this objective tends to maximize the accuracy.
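The objective and the argmax decoding can be illustrated with a small sketch (the helper name mirrors TensorFlow's softmax cross entropy with logits, but this is a toy NumPy re-implementation with made-up logits):

```python
import numpy as np

def softmax_cross_entropy_with_logits(labels, logits):
    """Numerically stable mean cross-entropy for one-hot labels."""
    z = logits - logits.max(axis=1, keepdims=True)
    log_softmax = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -(labels * log_softmax).sum(axis=1).mean()

logits = np.array([[2.0, -1.0], [0.5, 3.0]])   # two examples, two output nodes
labels = np.array([[1.0, 0.0], [0.0, 1.0]])    # one-hot: not-delayed, delayed
loss = softmax_cross_entropy_with_logits(labels, logits)
pred = logits.argmax(axis=1)                   # class index per example
```

Both examples are classified correctly here, so the loss is small but strictly positive.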
As mentioned in the introduction, only 18% of the labels correspond to delayed flights. This creates a problem commonly known as prediction bias toward the majority class. Conventional algorithms are often biased toward the majority class because their loss functions optimize quantities such as the error rate without taking the class distribution into account. In the worst case, minority examples are treated as outliers of the majority class and ignored, and the learning algorithm simply generates a trivial classifier that assigns every example to the majority class.
Indeed, if the goal is to maximize simple accuracy (or, equivalently, minimize the error rate), classifying every example as the majority class is an acceptable solution with a usually very high accuracy. But if the rare class examples are much more important to classify correctly, we have to attack the problem more carefully.
There are several ways to deal with this problem. First, we consider over-/under-sampling using the SMOTE algorithm. We used SMOTE in some of the experiments; in general, however, it is a very expensive approach in terms of both time and memory. The second approach we adopted in this study combines class weighting with precision and recall in the objective: we use 1/ratio as the weight for the minority class (delayed labels), where the ratio is 18%, and set the weight of the majority class to one. We also changed the objective function to the weighted cross entropy with logits, which allows us to trade off recall against precision.
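The weighting scheme can be sketched as follows (a toy NumPy mirror of TensorFlow's weighted sigmoid cross entropy; the targets and logits are illustrative):

```python
import numpy as np

def weighted_cross_entropy_with_logits(targets, logits, pos_weight):
    """Sigmoid cross-entropy with the positive (delayed) class up-weighted."""
    p = 1.0 / (1.0 + np.exp(-logits))
    return -(pos_weight * targets * np.log(p)
             + (1 - targets) * np.log(1 - p)).mean()

ratio = 0.18                     # share of delayed flights in the data
pos_weight = 1.0 / ratio         # weight for the minority (delayed) class

targets = np.array([1.0, 1.0, 0.0, 0.0])
logits = np.array([-2.0, 1.0, -1.0, -3.0])   # misses the first delayed flight
weighted = weighted_cross_entropy_with_logits(targets, logits, pos_weight)
plain = weighted_cross_entropy_with_logits(targets, logits, 1.0)
```

Because the missed delayed flight now costs roughly 1/0.18 ≈ 5.6 times more, the weighted loss is strictly larger than the plain one, pushing the optimizer toward recalling delayed flights.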
In our first attempt to model the flight dataset we tried different variations of the linear model, and they all failed to reach an adjusted R-squared above 0.7%. This indicates that the relationship between the selected predictors and the response cannot be expressed by linear models, so we omit the detailed results of the linear models here. By running a step-wise selection algorithm, we identified month, departure time, and arrival time as the most important predictors. Logistic regression models for the classification tend to ignore the minority class (delayed flights); therefore, we need to apply SMOTE first to make sure we are not classifying all the examples as not delayed. The confusion matrix of one of our runs for the LAX airport is presented in the Figure below:
from IPython.display import Image
Image("/mnt/c/Users/nsalehi/Desktop/111.png")
As we can see in the Figure above, we reach an accuracy of 55%, obtained at the cost of many misclassifications of the not-delayed flights after applying SMOTE to this dataset. The precision of this classifier is around 27% and the recall is around 57%. These results justify moving to DNNs as nonlinear regression models and nonlinear classifiers.
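For reference, the core SMOTE idea, synthesising minority examples by interpolating toward minority-class neighbours, can be sketched minimally (a toy re-implementation with made-up points, not the library version used in the experiments):

```python
import numpy as np

def smote(X_min, n_new, k=3, seed=0):
    """Minimal SMOTE: interpolate each sample toward a random minority neighbour."""
    rng = np.random.default_rng(seed)
    # pairwise distances within the minority class (self excluded)
    d = np.linalg.norm(X_min[:, None] - X_min[None, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    nn = np.argsort(d, axis=1)[:, :k]             # k nearest minority neighbours
    base = rng.integers(0, len(X_min), n_new)     # random base samples
    nbr = nn[base, rng.integers(0, k, n_new)]     # random neighbour of each base
    gap = rng.random((n_new, 1))                  # random point on the segment
    return X_min[base] + gap * (X_min[nbr] - X_min[base])

X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
synthetic = smote(X_min, n_new=8)
```

Each synthetic point lies on a segment between two real minority points, which is why SMOTE becomes expensive in high-dimensional one-hot spaces: the neighbour search and the extra rows grow quickly.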
For the classification problem, we used the weighted cross entropy objective instead of SMOTE (since SMOTE is very expensive in both time and memory with 629 categorical predictors). The best accuracy for the prediction is 77%, which is exactly the proportion of not-delayed flights in the test dataset. As mentioned before, this accuracy can be obtained simply by predicting every flight as not delayed; such a classifier has very low precision and recall. The classifier trained with the weighted cross entropy objective yields the following confusion matrix:
Image("/mnt/c/Users/nsalehi/Desktop/222.png")
With the weighted objective the accuracy dropped from 77% to 53%; however, the precision and recall increased significantly, to 54% and 43% respectively (visible by comparing the color of the delayed-delayed cell across the different settings). For the regression model, we start with one hidden layer and 2 nodes, then move to 64 nodes, and then consider a deeper model with two hidden layers. The comparison of the MSE of the two models after 500 epochs is illustrated in the Figure above.
In this Figure, the MSE converges faster as we increase the number of nodes in the hidden layer, and the accuracy improves when we add an extra hidden layer. The regression models tend to predict delays close to zero in most cases, which mostly underestimates the actual delays. However, MSE may not be the best way to evaluate our predictors, since a smooth, near-constant prediction of the delays is not what we want: although such predictions minimize the MSE, we are looking for predictions that capture the delays of the delayed flights specifically, not the average over the whole dataset. Therefore, the first step is to define an asymmetric MSE with different weights for overestimation and underestimation. We use the same ratio defined in the previous section (Handling the Imbalanced Dataset) as the weight for underestimations of delayed flights, and a weight of one for overestimations. This pushes the DNN to produce more variation in the predicted values. Here is a sample prediction for a deep network with two hidden layers, [93, 64, 16, 1] nodes per layer, and sigmoid activations on every node:
Image("/mnt/c/Users/nsalehi/Desktop/333.png")
As we can see in the top part of this Figure, the model captures some of the fluctuations (long delays) and no longer systematically underestimates the delays. The trade-off is a test MSE of 3.97%, which is considerably higher than before.
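The asymmetric MSE described above can be sketched as follows (the toy labels and predictions are illustrative; the under-estimation weight reuses the 1/0.18 ratio from the imbalance section):

```python
import numpy as np

def asymmetric_mse(y_true, y_pred, under_weight):
    """MSE that penalises underestimating a delay more than overestimating it."""
    err = y_pred - y_true
    w = np.where(err < 0, under_weight, 1.0)   # err < 0: delay was underestimated
    return float((w * err ** 2).mean())

y_true = np.array([0.2, 0.8, 0.9])             # scaled delays
y_pred = np.array([0.3, 0.5, 0.5])             # model underestimates the long delays
plain = asymmetric_mse(y_true, y_pred, under_weight=1.0)       # ordinary MSE
weighted = asymmetric_mse(y_true, y_pred, under_weight=1/0.18)
```

Under the weighted loss the flat, near-zero predictions that minimize ordinary MSE become expensive, which is what produces the extra variation in the predicted values.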
Finally, we repeat the same experiments on the whole California database. The designs of the optimized networks differ, but we obtained very similar results with lower accuracies and higher MSEs. This is because many smaller airports have unusually high departure delays, which reduced the accuracy of our predictions. It is worth mentioning that, since the database for the whole state is very large, we use Stochastic Gradient Descent (SGD) to update the weights. The MSE is not very sensitive to our choice of mini-batch size and is reported as the aggregated error over the mini-batches. The results of the regression DNN on the California dataset are also shown here.
In this project we studied delays in flight schedules, focusing on flights in the state of California, one of the main hubs in the United States for both domestic and international flights. We engineered our features using domain knowledge of the dataset and the various visualizations presented in the exploratory data analysis section. Our feature selection revealed that we need to focus on several categorical predictors (introduced at the end of the data analysis section), which require appropriate encoding and preprocessing before they can be used for predictive modeling.
Next, we examined several supervised learning methods on the dataset, including linear regression, logistic regression, and deep neural networks. The results suggest that linear models are incapable of capturing the relationships between our features and outputs. The non-linear methods perform better; however, the imbalanced dataset produces some questionable predictions, which we addressed by over-/under-sampling in some of our experiments and by balancing precision, accuracy, and recall in others. In conclusion, the accuracy of our predictions is acceptable considering that we did not have access to weather information, traffic, or the actual complex structure of the flight network. With these features we would most probably obtain better predictions: severe weather causes a significant amount of delay according to the BTS; with the tail numbers we could check whether an aircraft needs repairs before departure, another possible cause of departure delays; and passenger traffic at the airports, which we also lacked, is another helpful factor, since more crowded airports have a higher chance of departure/arrival delays. All in all, predicting departure delays at all airports is a very complicated task that calls for more sophisticated methodologies such as hierarchical modeling; therefore, splitting the dataset into smaller subsets with more similar predictors, as we did in our analysis, seems a reasonable approach.
from IPython.display import HTML
HTML('''<script>
code_show=true;
function code_toggle() {
if (code_show){
$('div.input').hide();
} else {
$('div.input').show();
}
code_show = !code_show
}
$( document ).ready(code_toggle);
</script>''')