JessePMelo/anac-flight-delay-prediction

Radar de Atrasos ✈️

Machine Learning system that predicts flight departure delays in Brazil using ANAC aviation data and provides model explainability through SHAP.

🌐 Live Demo: Radar de Atrasos

Project Overview

Flight delays are a major operational challenge for airlines and airports. Late departures impact passengers, increase operational costs, and disrupt airport logistics.

This project uses historical flight data from the Brazilian Civil Aviation Agency (ANAC) to build machine learning models capable of predicting whether a flight will depart late.

The final system includes:

  • Data analysis and feature engineering
  • Machine learning classification models
  • Delay prediction interface
  • Model explainability using SHAP

Architecture Overview

The system architecture integrates the prediction interface, the backend services, and the machine learning model that estimates flight delays. Data moves through the following stages:

  1. ANAC Dataset
  2. Data Cleaning
  3. Feature Engineering
  4. Feature Selection
  5. Model Training (Logistic Regression / Random Forest / XGBoost)
  6. Model Evaluation
  7. SHAP Explainability
  8. Prediction Interface
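The stages above can be sketched end to end with a scikit-learn pipeline. This is a minimal illustration, not the repository's actual code: the toy rows, the column values, and the choice of Logistic Regression as the classifier are assumptions made for the example.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy rows standing in for cleaned ANAC records (all values are made up).
flights = pd.DataFrame({
    "airline": ["AZU", "GLO", "TAM", "AZU", "GLO", "TAM", "AZU", "GLO"],
    "origin": ["SBGR", "SBSP", "SBGR", "SBRJ", "SBGR", "SBSP", "SBGR", "SBRJ"],
    "destination": ["SBRJ", "SBGR", "SBSP", "SBGR", "SBRJ", "SBGR", "SBSP", "SBGR"],
    "flight_hour": [6, 9, 18, 21, 7, 12, 19, 22],
    "delayed": [0, 0, 1, 1, 0, 0, 1, 1],
})

categorical = ["airline", "origin", "destination"]
numeric = ["flight_hour"]

# One-hot encode categorical inputs; unseen categories are ignored at predict time.
preprocess = ColumnTransformer([
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
    ("num", "passthrough", numeric),
])

model = Pipeline([
    ("preprocess", preprocess),
    ("classifier", LogisticRegression(max_iter=1000)),
])

model.fit(flights[categorical + numeric], flights["delayed"])

# Score one simulated flight, as the prediction interface would.
new_flight = pd.DataFrame([
    {"airline": "AZU", "origin": "SBGR", "destination": "SBRJ", "flight_hour": 20}
])
delay_probability = float(model.predict_proba(new_flight)[0, 1])
print(f"Delay probability: {delay_probability:.2f}")
```

Keeping preprocessing and the classifier in one pipeline object means the same encoding is applied at training and at prediction time.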

Live Prediction Interface

The project includes a prediction interface that allows users to simulate flight conditions and obtain delay predictions.

Inputs include:

  • Airline
  • Origin airport
  • Destination airport
  • Flight time

The system returns:

  • Delay probability
  • Predicted status (delayed or on-time)
  • Main factors influencing the prediction
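The three outputs listed above could be assembled as follows. This is a hypothetical sketch: the `DelayPrediction` class, the `build_prediction` helper, the 0.5 decision threshold, and the contribution values are all assumptions for illustration, not names or settings taken from the project.

```python
from dataclasses import dataclass


@dataclass
class DelayPrediction:
    delay_probability: float
    status: str
    top_factors: list


def build_prediction(probability, contributions, threshold=0.5, top_k=3):
    """Turn a model probability and per-feature contributions into a response.

    `contributions` maps feature names to signed contribution scores
    (for example SHAP values); the threshold is an assumed default.
    """
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    status = "delayed" if probability >= threshold else "on-time"
    # Rank features by the magnitude of their contribution, largest first.
    ranked = sorted(contributions.items(), key=lambda item: abs(item[1]), reverse=True)
    factors = [name for name, _ in ranked[:top_k]]
    return DelayPrediction(round(probability, 3), status, factors)


result = build_prediction(
    0.72,
    {"flight_hour": 0.40, "airline": -0.15, "route_volume": 0.22, "origin": 0.05},
)
print(result)
```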

Machine Learning Pipeline

The modeling process follows a structured workflow:

  1. Data collection from ANAC public datasets
  2. Data cleaning and validation
  3. Feature engineering
  4. Leakage prevention
  5. Feature selection (Correlation + VIF)
  6. Model training
  7. Model evaluation
  8. Model explainability using SHAP
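Step 5 (Correlation + VIF) can be illustrated with a small sketch. It relies on the identity that the variance inflation factor of each feature equals the corresponding diagonal entry of the inverse correlation matrix. The helper names, the synthetic data, and the cutoff of 10 (a common rule of thumb) are assumptions; the source does not state which threshold or library was used.

```python
import numpy as np
import pandas as pd


def variance_inflation_factors(features: pd.DataFrame) -> pd.Series:
    """VIF per column, read from the diagonal of the inverse correlation matrix."""
    corr = features.corr().to_numpy()
    vif = np.diag(np.linalg.inv(corr))
    return pd.Series(vif, index=features.columns)


def drop_high_vif(features: pd.DataFrame, max_vif: float = 10.0) -> pd.DataFrame:
    """Drop the highest-VIF column repeatedly until every VIF is at most max_vif."""
    kept = features.copy()
    while kept.shape[1] > 1:
        vif = variance_inflation_factors(kept)
        if vif.max() <= max_vif:
            break
        kept = kept.drop(columns=vif.idxmax())
    return kept


# Synthetic features: airport_traffic is built to be nearly collinear with route_volume.
rng = np.random.default_rng(0)
route_volume = rng.normal(50, 10, size=200)
features = pd.DataFrame({
    "flight_hour": rng.integers(0, 24, size=200).astype(float),
    "route_volume": route_volume,
    "airport_traffic": 2.0 * route_volume + rng.normal(0, 1, size=200),
})

selected = drop_high_vif(features)
print(variance_inflation_factors(features).round(1))
print("Kept columns:", list(selected.columns))
```

Only one of the two collinear columns survives, while the unrelated `flight_hour` column is kept.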

Model Card

Model Type
Binary classification model predicting whether a flight will depart late.

Target Definition
A flight is considered delayed if departure delay > 15 minutes.
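A minimal sketch of how this label could be derived with pandas. The column names `scheduled_departure` and `actual_departure` are hypothetical, and the timestamps are made up; only the strict 15-minute threshold comes from the definition above.

```python
import pandas as pd

DELAY_THRESHOLD_MINUTES = 15

# Hypothetical column names and made-up timestamps.
flights = pd.DataFrame({
    "scheduled_departure": pd.to_datetime(
        ["2024-01-05 08:00", "2024-01-05 09:30", "2024-01-05 11:00"]
    ),
    "actual_departure": pd.to_datetime(
        ["2024-01-05 08:10", "2024-01-05 10:00", "2024-01-05 11:15"]
    ),
})

delay_minutes = (
    flights["actual_departure"] - flights["scheduled_departure"]
).dt.total_seconds() / 60

# Strictly greater than 15 minutes, so a delay of exactly 15 minutes counts as on-time.
flights["delayed"] = (delay_minutes > DELAY_THRESHOLD_MINUTES).astype(int)
print(flights["delayed"].tolist())
```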

Models Tested

  • Logistic Regression
  • Random Forest
  • XGBoost

Selected Model

XGBoost was selected as the primary model due to its performance and ability to capture nonlinear relationships.
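A comparison of candidate models might look like the sketch below. It uses synthetic data and only the two scikit-learn candidates so that it runs without extra dependencies; `xgboost.XGBClassifier` follows the same estimator interface and could be added to the dictionary when xgboost is installed. The choice of recall as the comparison metric and 5-fold cross-validation are assumptions, since the source does not say how the models were compared.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the engineered flight features.
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=4, random_state=0
)

candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=0),
}

# Mean cross-validated recall per candidate.
scores = {
    name: cross_val_score(model, X, y, cv=5, scoring="recall").mean()
    for name, model in candidates.items()
}

best_model = max(scores, key=scores.get)
for name, score in scores.items():
    print(f"{name}: recall={score:.3f}")
print("Best candidate on this toy data:", best_model)
```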

Evaluation Metrics

The models were evaluated using:

  • Accuracy
  • Precision
  • Recall
  • Confusion Matrix
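The four metrics can be computed with scikit-learn as sketched below. The label vectors are made up for illustration and are not results from the project.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    precision_score,
    recall_score,
)

# Made-up labels: 1 = delayed, 0 = on-time.
y_true = [0, 0, 0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 0, 1, 0, 1, 0, 1, 0, 1, 0]

# For binary labels, ravel() yields (TN, FP, FN, TP).
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = accuracy_score(y_true, y_pred)    # (TP + TN) / total
precision = precision_score(y_true, y_pred)  # TP / (TP + FP)
recall = recall_score(y_true, y_pred)        # TP / (TP + FN)

print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f}")
```

When delayed flights are a minority of the data, accuracy alone can look high for a model that rarely predicts a delay, which is why precision and recall are reported alongside it.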

Explainability

SHAP (SHapley Additive Explanations) was used to interpret model predictions and identify the most influential features.
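To show what a SHAP explanation contains, the sketch below computes SHAP values in closed form for a linear model, where (assuming independent features) each value is the coefficient times the feature's deviation from its background mean. This is a deliberate simplification and not the project's code: for a tree model such as XGBoost one would normally use the shap library's `TreeExplainer` instead. The data and feature names here are synthetic.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

feature_names = ["flight_hour", "route_volume", "airport_traffic"]

# Synthetic, standardized features and labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=200) > 0).astype(int)

model = LogisticRegression().fit(X, y)

# Base value: the model's log-odds output at the average input.
background_mean = X.mean(axis=0)
base_value = model.decision_function(background_mean.reshape(1, -1))[0]

# Linear SHAP values for one flight: coefficient * (value - background mean).
x = X[0]
shap_values = model.coef_[0] * (x - background_mean)

# Additivity: base value plus all contributions recovers the model output.
reconstructed = base_value + shap_values.sum()
actual = model.decision_function(x.reshape(1, -1))[0]

for name, value in zip(feature_names, shap_values):
    print(f"{name}: {value:+.3f}")
print(f"base={base_value:+.3f} reconstructed={reconstructed:+.3f} actual={actual:+.3f}")
```

The additivity property is what lets the interface report the main factors behind a single prediction: each feature's signed contribution sums, with the base value, to the model's output.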

Data Dictionary

Main variables used in the project:

Feature          Description
airline          Airline operating the flight
origin           Departure airport
destination      Arrival airport
flight_hour      Hour of the flight
previous_delay   Delay history for that route
airport_traffic  Estimated airport traffic level
route_volume     Volume of flights on the route

Target variable:

Variable         Description
delayed          Binary variable indicating whether the flight departed late
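The route-level features could be derived along these lines. This is a sketch under assumptions: the source does not define how `previous_delay` or `route_volume` are computed, so here `route_volume` is taken as the number of flights per route in the sample and `previous_delay` as the share of earlier flights on the same route that were delayed. The rows and the `date` column are made up.

```python
import pandas as pd

# Made-up flights, ordered by route and then by date.
flights = pd.DataFrame({
    "date": pd.to_datetime(
        ["2024-01-01", "2024-01-02", "2024-01-03", "2024-01-01", "2024-01-02"]
    ),
    "origin": ["SBGR", "SBGR", "SBGR", "SBSP", "SBSP"],
    "destination": ["SBRJ", "SBRJ", "SBRJ", "SBGR", "SBGR"],
    "delayed": [1, 0, 1, 0, 1],
}).sort_values(["origin", "destination", "date"]).reset_index(drop=True)

route = ["origin", "destination"]

# Number of flights observed on each route.
flights["route_volume"] = flights.groupby(route)["delayed"].transform("count")

# Delay rate of earlier flights on the same route. shift(1) excludes the
# current flight, so its own outcome never leaks into its feature.
# The first flight on a route has no history; 0.0 is an assumed default.
flights["previous_delay"] = (
    flights.groupby(route)["delayed"]
    .transform(lambda s: s.shift(1).expanding().mean())
    .fillna(0.0)
)

print(flights[["origin", "destination", "route_volume", "previous_delay"]])
```

Shifting before aggregating is one way to address the leakage-prevention step of the pipeline: a history feature built from the full column would include the very outcome the model is asked to predict.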

Key Insights

Analysis of the ANAC flight dataset revealed several patterns associated with flight delays:

  • Flights scheduled during early morning waves show increased delay probability due to accumulated operational constraints.
  • High route traffic volume increases the likelihood of departure delays.
  • Some airlines demonstrate consistently lower delay rates, suggesting operational efficiency differences.
  • Holiday proximity and peak travel periods contribute to increased delay risk.

These insights highlight the importance of operational context when modeling flight delays and demonstrate how machine learning can support decision making in aviation operations.
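Patterns like these are typically surfaced with simple grouped delay rates. The sketch below shows the idea on made-up rows; the numbers are illustrative only and are not findings from the ANAC data.

```python
import pandas as pd

# Made-up rows; 1 = delayed, 0 = on-time.
flights = pd.DataFrame({
    "airline": ["AZU", "AZU", "GLO", "GLO", "TAM", "TAM"],
    "flight_hour": [6, 18, 6, 18, 6, 18],
    "delayed": [1, 0, 0, 0, 1, 1],
})

# The mean of a 0/1 column is the share of delayed flights in each group.
delay_rate_by_airline = flights.groupby("airline")["delayed"].mean().sort_values()
delay_rate_by_hour = flights.groupby("flight_hour")["delayed"].mean()

print(delay_rate_by_airline)
print(delay_rate_by_hour)
```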

Technologies Used

  • Python
  • Pandas
  • Scikit-Learn
  • XGBoost
  • Matplotlib
  • SHAP
  • Jupyter Notebook

Project Structure

anac-flight-delay-prediction/

backend/
  model/
  services/

data_science/
  artifacts/
  data/
    raw/
    processed/
    sampled/
  model/
  notebooks/
  src/

frontend/
  assets/

requirements.txt
README.md

backend/: API logic and services responsible for running predictions.

data_science/: Data analysis, feature engineering, model training and experimentation.

frontend/: User interface used to simulate flight conditions and visualize predictions.

artifacts/: Saved objects such as trained models or intermediate outputs.

data/: Datasets used in the project.

notebooks/: Exploratory analysis and modeling notebooks.

src/: Core scripts used for data processing and model training.

How to Run

Clone the repository

git clone https://github.com/JessePMelo/anac-flight-delay-prediction.git

Install dependencies

pip install -r requirements.txt

Then open the notebooks in data_science/notebooks/ to explore the analysis, or run the prediction interface.

Author

Jessé Pereira de Melo

About

Predicts flight delays in Brazil using ANAC data and machine learning techniques, including exploratory data analysis and model evaluation.
