Year of Award

2021

Document Type

Thesis

Degree Type

Master of Science (MS)

Other Degree Name/Area of Focus

Data Science

Department or School/College

Mathematical Sciences

Committee Chair

Dr. Javier Perez Alvaro

Committee Members

Dr. Johnathan Bardsley, Dr. Simona Stanmach

Keywords

Machine Learning, Random Forest, Multiple Linear Regression, Grid Search, Feature Importance

Subject Categories

Data Science

Abstract

Flight delays are costly for airlines and reduce passenger satisfaction. In this research work, we predicted the daily percentage of delayed flights from national weather data using multiple linear regression and random forest models. We extracted passenger flight on-time performance data from the Bureau of Transportation Statistics and weather data from the NOAA National Centers for Environmental Information for the years 2015 to 2019. We restricted the flight dataset to flights originating from the Seattle airport. We predicted the daily percentage of delayed Seattle-originated flights from features such as the weather conditions at the origin and at its top 10 destination airports on the date of the flight, the weather at the origin on the day before the flight, the number of daily flights from Seattle to these destinations, the year, the month, and the day of the week. We trained the random forest model and rigorously tuned its hyperparameters. We assessed the fitted model using three evaluation metrics: mean absolute error, root mean squared error, and the coefficient of determination. The random forest model, with scores of 2.68, 4.08, and 0.79, respectively, outperformed the multiple linear regression model in predicting the daily percentage of delayed flights.
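
The workflow described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the thesis code: it assumes scikit-learn, uses synthetic stand-in data instead of the BTS and NOAA datasets, and the feature matrix, parameter grid, and the helper name `evaluate` are all hypothetical. It fits a multiple linear regression baseline, tunes a random forest with a grid search, and reports the three metrics named in the abstract.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data (assumption): one row per day, six weather-like
# features, and a nonlinear target playing the role of "% delayed flights".
rng = np.random.default_rng(0)
n_days = 400
X = rng.normal(size=(n_days, 6))
y = (15 + 4 * np.abs(X[:, 0]) + 3 * np.maximum(X[:, 1], 0)
     + rng.normal(scale=1.0, size=n_days))

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)


def evaluate(model, X_eval, y_eval):
    """Return (MAE, RMSE, R^2) for a fitted model on held-out data."""
    pred = model.predict(X_eval)
    mae = mean_absolute_error(y_eval, pred)
    rmse = float(np.sqrt(mean_squared_error(y_eval, pred)))
    r2 = r2_score(y_eval, pred)
    return mae, rmse, r2


# Baseline: multiple linear regression.
linear = LinearRegression().fit(X_train, y_train)

# Random forest tuned by grid search; this small grid is illustrative only.
param_grid = {"n_estimators": [50, 100], "max_depth": [None, 8]}
search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid,
    cv=3,
    scoring="neg_mean_absolute_error",
)
search.fit(X_train, y_train)
forest = search.best_estimator_

scores = {
    "linear": evaluate(linear, X_test, y_test),
    "forest": evaluate(forest, X_test, y_test),
}
importances = forest.feature_importances_  # one weight per feature

for name, (mae, rmse, r2) in scores.items():
    print(f"{name}: MAE={mae:.2f} RMSE={rmse:.2f} R2={r2:.2f}")
```

With real data, the columns of `X` would hold the origin and destination weather variables, the daily flight counts, and the calendar features listed in the abstract, and the grid would normally cover more hyperparameters than the two shown here.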

Included in

Data Science Commons

© Copyright 2021 Parto Mahmoudi