
The NASA Airport Throughput Prediction Challenge

Forecast the number of arrivals at airports in the US

NASA
Updates
  • 08 Nov 2024

    We would like to offer a clarification regarding challenge rule #7, specifically the use of pre-trained models as part of enhancing airport throughput predictions.

    Rule #7 states: The use of external datasets outside of those indicated in this competition page is strictly forbidden. However, we encourage the creative use of derivative datasets, such as calculated features, synthetic training data and other data augmentation techniques.

    After careful consideration, it has been determined that participants/teams ARE permitted to incorporate pre-trained general open-source models that use components compatible with the Apache-2.0 license, or other permissive licenses.
    Such permissive licenses grant rights to use, copy, modify, merge, publish, distribute, sublicense, and sell the software for any purpose, free of charge (for example: Apache-2.0, MIT, and BSD-3 licenses). However, participants must follow the guidelines outlined in the “Data Breakdown” section. Specifically, they must avoid data leakage by ensuring that the features used for model training are based exclusively on data that would be available at the time of prediction and do not include any “future” information.

    In order to be eligible for prizes, the winners will be required to transfer the rights to their solution to bitgrit, and therefore should only use permissive licenses in their work. In return, bitgrit grants CrowdPlat and the U.S. Government an irrevocable, royalty-free, non-exclusive, worldwide license to use, modify, distribute, and sublicense the winning solutions. The U.S. Government will receive the complete solution, and the IP rights will be defined by the agreement and applicable law.

Brief

Welcome to The NASA Airport Throughput Prediction Challenge!

The Digital Information Platform (DIP) Sub-Project of Air Traffic Management - eXploration (ATM-X) is seeking to make available in the National Airspace System a variety of live data feeds and services built on that data. The goal is to allow external partners to build advanced, data-driven services using this data, and to make these services available to flight operators, who will use these capabilities to save fuel and avoid delays. Different wind directions, weather conditions at or near the airport, inoperative runways, etc., affect the runway configurations to be used and impact the overall arrival throughput. Knowing the arrival runway and its congestion level ahead of time will enable aviation operators to perform better flight planning and improve flight efficiency. This competition seeks to make better predictions of runway throughputs using machine learning or other techniques. The total prize pool for this competition is $120,000.

This competition engages students, faculty members and other individuals employed by United States universities to develop a machine learning model that provides a short-term forecast of estimated airport runway throughput using simulated real-time information from historical NAS and weather forecast data, as well as other factors such as meteorological conditions, airport runway configuration, and airspace congestion.

Prizes
  • 1st Place - $30,000
  • 2nd Place - $22,000
  • 3rd Place - $15,000
  • 4th Place - $13,000
  • 5th Place - $10,000
  • 6th Place - $10,000
  • 7th Place - $10,000
  • 8th Place - $10,000
Timeline
  • Competition Starts – September 13th, 2024
  • Competition Ends – December 8th, 2024
  • Winners Announcement – January 13th, 2025
Data Breakdown

Welcome to The NASA Airport Throughput Prediction Challenge

Before reading these guidelines, it is very important to read the Rules section, paying special attention to the rules concerning eligibility for prize awards. Make sure to read all the rules carefully before starting to work on this challenge.


The objective of this challenge is to develop a regression model that accurately forecasts the number of arrivals (throughput) at US airports in the near future, using existing estimated arrivals and weather forecast information.

More formally: given an airport ABC and a timestamp T, what is the expected number of arriving flights (throughput) over the next 3 hours, at a resolution of 15-minute time buckets?

The input data can be classified into 2 types: flight information and weather information. It is organized into 4 datasets: FUSER, METAR, TAF, and CWAM, which are explained below. Keep in mind that the data for this competition is quite large (especially FUSER and CWAM), so make sure to have at least 200 GB of free space before downloading all of it.

FUSER data:
The FUSER dataset consists of 8 types of CSV files, each providing various flight and airport-related data. Files are named using the format {airport}_{date_range}.{file_type}_data_set.csv. Below is a breakdown of each file type and its key columns:

configs_data_set (D-ATIS Data):
Columns: airport_id, data_header, src_addr, datis_time (time of D-ATIS message), start_time (time configuration starts), weather_report, departure_runways (parsed departure runways), arrival_runways (parsed arrival runways), timestamp_source_received, timestamp_source_processed, invalid_departure_runways, invalid_arrival_runways, departure_runway_string, arrival_runway_string, airport_configuration_name.
Purpose: Contains airport configuration and weather details extracted from D-ATIS messages, used for understanding airport configurations.

runways_data_set (Arrival/Departure Detection):
Columns: gufi (unique flight identifier), arrival_runway_actual_time (actual time of arrival), arrival_runway_actual (runway used for arrival), departure_runway_actual_time (actual time of departure), departure_runway_actual (runway used for departure).
Purpose: Provides actual times and runways for arrivals and departures. Primary Source for Target Variables: Used to count the number of arrivals in 15-minute intervals for the next 3 hours (target variable).
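
As an illustration, a minimal sketch (assuming pandas; the file name is hypothetical, real files follow the naming convention above) of how arrival counts per 15-minute bucket could be derived from this file:

    import pandas as pd

    # Hypothetical path: actual FUSER files follow the
    # {airport}_{date_range}.runways_data_set.csv naming convention.
    runways = pd.read_csv("KDEN_runways_data_set.csv",
                          parse_dates=["arrival_runway_actual_time"])

    # One possible target construction: arrivals per 15-minute bucket.
    arrivals = (runways
                .dropna(subset=["arrival_runway_actual_time"])
                .set_index("arrival_runway_actual_time")
                .resample("15min")
                .size()
                .rename("throughput"))
    print(arrivals.head())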

first_position_data_set:
Columns: gufi (unique flight identifier), time_first_tracked (time when the flight is first detected).
Purpose: Tracks the initial detection time for each flight.

TBFM_data_set (Time-Based Flow Management Data):
Columns: gufi, timestamp (time estimate is made), arrival_runway_sta (scheduled time of arrival).
Purpose: Contains scheduled times of arrival for flights.

TFM_track_data_set (Traffic Flow Management Data):
Columns: gufi, timestamp (time estimate is made), arrival_runway_estimated_time (estimated time of arrival).
Purpose: Provides estimated arrival times for flights; helps to predict future arrival times.

ETD_data_set (Estimated Time of Departure):
Columns: gufi, timestamp (time estimate is made), departure_runway_estimated_time (estimated departure time).
Purpose: Provides estimated departure times for flights.

LAMP_data_set (Local Aviation MOS Program Data):
Columns: timestamp (time of forecast), forecast_timestamp (forecasted time), temperature (Fahrenheit), wind_direction (0-36 scale), wind_speed (knots), wind_gust (knots), cloud_ceiling (code from 1 to 8), visibility (code from 1 to 7), cloud (categories like CL, FW, SC, BK, OV), lightning_prob (N, L, M, H), precip (True/False).
Purpose: Provides short-term weather forecasts relevant for airport operations.

MFS_data_set (FAA SWIM Feeds):
Columns: gufi, aircraft_engine_class (e.g., JET), aircraft_type (e.g., Boeing 737), arrival_aerodrome_icao_name (arrival airport ICAO code), major_carrier (e.g., UAL for United Airlines), flight_type (e.g., SCHEDULED_AIR_TRANSPORT), isarrival (True/False), isdeparture (True/False), arrival_stand_actual (gate at arrival), arrival_stand_actual_time (time at arrival gate), arrival_runway_actual (actual arrival runway), arrival_runway_actual_time (actual arrival time), departure_stand_actual (gate at departure), departure_stand_actual_time (time of push-back), departure_runway_actual (actual departure runway), departure_runway_actual_time (actual departure time).
Purpose: Detailed flight information from FAA SWIM feeds, used for understanding flight paths, types, and status.

Directory Structure
Files are organized by airport directory (e.g., JFK, ATL) and date ranges, with each file type containing data for a specific period and airport.

METAR data:
METAR (Meteorological Aerodrome Report) is a type of aviation weather observation that provides detailed information about the current weather conditions at airports around the world. These reports are critical for flight operations, providing real-time data on factors such as wind speed and direction, visibility, cloud cover, temperature, dew point, and atmospheric pressure. METAR reports are typically issued hourly, but can be updated more frequently if conditions change rapidly.

A METAR report consists of several components, each giving specific weather information. An example entry from a METAR file might look like:

2022/09/25 00:00
KSEA 250000Z 05010KT 9999 SCT020 FEW021CB 31/25 Q1011


2022/09/25 00:00: Date and time of the observation.
KSEA: Station identifier (e.g., Seattle International Airport).
250000Z: Day of the month and time of the report in UTC (e.g., 25th day at 00:00 UTC).
05010KT: Wind direction (050 degrees) and speed (10 knots).
9999: Visibility in meters (e.g., 9999 meters or 10 kilometers).
SCT020 FEW021CB: Cloud cover information (scattered clouds at 2,000 feet and few cumulonimbus clouds at 2,100 feet).
31/25: Temperature (31°C) and dew point (25°C).
Q1011: Altimeter setting (e.g., 1011 hPa).

The METAR data is provided in text files (.txt), organized by hour, with each file representing the weather conditions observed at various airports around a specific hour. The directory structure is as follows:

    data/
        └── METAR/ 
          ├── metar.20220901.00Z.txt 
          ├── metar.20220901.01Z.txt 
          ├── metar.20220901.02Z.txt
          └── ...     

Each file contains multiple two-row entries, with each pair representing an individual METAR observation for a specific airport.
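
A rough sketch of reading these two-row pairs (assuming the layout shown above; the individual METAR fields would still need to be parsed separately):

    from pathlib import Path

    def read_metar_file(path):
        """Yield (observation time, raw METAR report) pairs from an hourly file.

        Assumes the two-row structure described above: a date/time line
        followed by the METAR report line.
        """
        lines = [l.strip() for l in Path(path).read_text().splitlines() if l.strip()]
        for time_line, report_line in zip(lines[::2], lines[1::2]):
            yield time_line, report_line

    for obs_time, report in read_metar_file("data/METAR/metar.20220901.00Z.txt"):
        station = report.split()[0]   # e.g. "KSEA"
        print(obs_time, station)
        break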

If you want to learn more about METAR data, please refer to https://en.wikipedia.org/wiki/METAR

TAF data:
TAF (Terminal Aerodrome Forecast) is a format used to provide weather forecasts specifically for aviation. TAF reports give a detailed forecast for a 24 to 30-hour period and are updated every 6 hours. They provide predictions about wind, visibility, weather phenomena, and cloud cover that are crucial for planning flights.

A TAF report includes the forecast weather conditions for an airport and is typically longer and more complex than a METAR report. An example entry from a TAF file might look like:

2022/09/24 00:00
TAF KJFK 242200Z 2500/2524 07005KT 9999 SCT025 TX34/2517Z TN23/2508Z PROB30
          TEMPO 2508/2511 VRB01KT 3000BR SCT005
          TEMPO 2517/2522 5000 TSRA FEW018CB SCT020

2022/09/24 00:00: Date and time the forecast was issued.
TAF KJFK: The report type and station identifier (e.g., John F. Kennedy International Airport).
242200Z: Day of the month and time of issuance in UTC (e.g., 24th day at 22:00 UTC).
2500/2524: Valid period of the forecast (e.g., from 00:00 UTC on the 25th to 24:00 UTC on the 25th).
07005KT: Forecasted wind direction (070 degrees) and speed (5 knots).
9999: Forecasted visibility (e.g., 9999 meters or 10 kilometers).
SCT025: Scattered clouds at 2,500 feet.
TX34/2517Z: Maximum temperature of 34°C expected at 17:00 UTC on the 25th.
TN23/2508Z: Minimum temperature of 23°C expected at 08:00 UTC on the 25th.
PROB30 TEMPO: 30% probability of temporary conditions, followed by the specific forecast.

TAF data is provided in text files (.txt), organized by issuance time, with each file representing the forecasts issued for a 6-hour period. The directory structure is as follows:

     data/
       └── TAF/
          ├── taf.20220901.00Z.txt
          ├── taf.20220901.06Z.txt
          ├── taf.20220901.12Z.txt
          ├── taf.20220901.18Z.txt
          ├── taf.20220902.00Z.txt
          └── ...     

Each file contains multiple reports, with each report describing the TAF forecast for a specific airport (a report may span several lines, as in the example above).
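
One possible way to split such a file into individual reports, assuming each report begins with a line starting with "TAF" and that continuation lines are indented as in the example above:

    from pathlib import Path

    def split_taf_reports(path):
        """Group the lines of a TAF file into individual reports.

        Assumes each report begins with a line starting with 'TAF' and that
        indented continuation lines (e.g. TEMPO/PROB30 groups) belong to the
        preceding report, as in the example above. Issuance date stamps are
        skipped here.
        """
        reports, current = [], []
        for raw in Path(path).read_text().splitlines():
            line = raw.strip()
            if not line:
                continue
            if line.startswith("TAF"):
                if current:
                    reports.append(" ".join(current))
                current = [line]
            elif raw[:1].isspace() and current:
                current.append(line)        # indented continuation line
        if current:
            reports.append(" ".join(current))
        return reports

    for report in split_taf_reports("data/TAF/taf.20220901.00Z.txt")[:3]:
        print(report.split()[1], report)    # station identifier, full report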

If you want to learn more about TAF data, please refer to https://en.wikipedia.org/wiki/Terminal_aerodrome_forecast

CWAM data (recommended but optional)
CWAM (Convective Weather Avoidance Model) data provide forecasts of the potential impact of convective weather. For each prediction time, the data consist of lists of polygons at specific altitudes within which the probability of being impacted by convective weather exceeds a given threshold (60%, 70%, or 80%). Polygons are represented by lists of their vertices in latitude and longitude (degrees). Forecasts are generally produced every 15 minutes, and each forecast extends up to 2 hours into the future at 5-minute intervals.

CWAM data are stored as compressed HDF5 files (.h5) with BZIP2 compression (.bz2). The name of the files should look like this:

<YYYY>_<MM>_<DD>_<HH>_<MM>_GMT.Forecast.h5.CWAM.h5.bz2
Example: 2022_12_18_23_45_GMT.Forecast.h5.CWAM.h5.bz2

The data is organized as follows:

     data/
     └── CWAM/
        ├── 2022_12_18_23_45_GMT.Forecast.h5.CWAM.h5.bz2
        ├── 2022_12_19_03_45_GMT.Forecast.h5.CWAM.h5.bz2
        └── ... 

The date and time in the file name (e.g. 2022_12_18_23_45) represent the date and time at which the forecast has been made (December 18th, 2022 at 23:45 UTC). After decompressing and reading the HDF5 file, the data is structured in a hierarchical format containing keys representing different forecast parameters:

Keys Structure: Deviation Probability/FCST<Forecast Time>/FLVL<Flight Level>/Contour/TRSH<Threshold>/POLY<Polygon Number>
Forecast Time (FCST): Represents the forecast minute offset from the file time (e.g., FCST000 is the initial forecast at the file time + 0 minutes; FCST010 is the file time + 10 minutes, i.e., 10 minutes into the future).
Flight Level (FLVL): Represents the altitude of the forecast in flight levels (e.g., FLVL250 for 25,000 feet).
Threshold (TRSH): Represents the deviation probability threshold (e.g., TRSH060 for greater than 60%).
Polygon Number (POLY): Represents the polygon index number.

Data Format:
Each key points to a dataset for one polygon, formatted as a list of latitude (North positive) and longitude (East positive) coordinates representing the vertices of that polygon:

     52.515781, -114.384987, ...
     52.529495, -114.269547, ...
     52.536320, -114.211800, ...
     ...
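
A minimal reading sketch, assuming h5py and the standard bz2 module are used; the exact zero-padding of the FCST/FLVL/TRSH/POLY indices in the key below is an assumption:

    import bz2
    import tempfile
    import h5py

    src = "data/CWAM/2022_12_18_23_45_GMT.Forecast.h5.CWAM.h5.bz2"

    # Decompress the .bz2 archive to a temporary file, then open it with h5py
    # and read one polygon following the key structure described above.
    with tempfile.NamedTemporaryFile(suffix=".h5") as tmp:
        with bz2.open(src, "rb") as f:
            tmp.write(f.read())
        tmp.flush()
        with h5py.File(tmp.name, "r") as h5:
            # Hypothetical key: the exact zero-padding of the indices may differ.
            key = "Deviation Probability/FCST000/FLVL250/Contour/TRSH060/POLY00000"
            if key in h5:
                vertices = h5[key][()]     # polygon vertices (lat, lon)
                print(vertices[:3])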

If you want to learn more about CWAM data, please refer to
https://www.faa.gov/nextgen/programs/weather/tfm_support/translation_products
https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=5347511

Train and Test Split

For this challenge, the data has been split into Training and Testing via the following cyclical structure, starting from 2022/09/01:

Training: 24 days
Testing: 8 days

And so on, for an entire year of data.
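
For illustration, a small sketch of how this 24-day/8-day cycle could be reproduced in code (the exact cycle boundaries should be confirmed against the provided files):

    from datetime import date

    def split_label(day, start=date(2022, 9, 1), train_days=24, test_days=8):
        """Label a date as 'train' or 'test' under the 24-day / 8-day cycle
        starting 2022-09-01 described above."""
        offset = (day - start).days % (train_days + test_days)
        return "train" if offset < train_days else "test"

    print(split_label(date(2022, 9, 10)))   # within the first 24 days -> 'train'
    print(split_label(date(2022, 9, 26)))   # days 25-32 of the cycle  -> 'test'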

Training Data: The training data contains (mostly) continuous information for all variables and can be used to train models. Models should be trained so that, given an airport, a particular datetime, and the information (input data) available at that datetime for that airport (flights and weather, including forecasts made at or before that datetime), the throughput of each 15-minute time bucket up to 3 hours into the future is predicted as the target variable.
Participants are encouraged to create as many training examples as possible from this dataset.

Testing/Prediction Data: The testing data is the input data to be used when making predictions with the model. Because predictions must be made 3 hours into the future for every timestamp T, this data contains 4-hour gaps covering the information that would not yet be available during the 4 hours following timestamp T. The first 3 of these 4 hours are the ones for which throughput predictions should be made. For example, within an 8-day testing cycle, the input data available to the model would include data from 0:00 to 1:00 AM (to predict 1:00 to 4:00 AM), 5:00 to 6:00 AM (to predict 6:00 to 9:00 AM), and so on. This will be clearer when looking at the submission_format.csv file included in the dataset.
The data is structured this way to ensure that no "future" information is available at time T regarding the next 3 hours; the model should be trained using only the information available at time T (including system estimates of future arrivals and weather forecasts made at or before time T). A leakage-safe filtering of the input data is sketched below.
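
As a hedged sketch of this leakage-safe filtering, assuming pandas and using the TFM track file as an example (the file name is hypothetical):

    import pandas as pd

    # Hypothetical example: tfm holds a TFM_track_data_set for one airport,
    # where 'timestamp' is the time at which each arrival-time estimate was made.
    tfm = pd.read_csv("KDEN_TFM_track_data_set.csv",
                      parse_dates=["timestamp", "arrival_runway_estimated_time"])

    T = pd.Timestamp("2022-09-25 01:00:00")

    # Leakage-safe view: keep only estimates issued at or before the prediction
    # time T, and only the latest estimate per flight (gufi).
    available = (tfm[tfm["timestamp"] <= T]
                 .sort_values("timestamp")
                 .groupby("gufi")
                 .tail(1))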

The goal of the competition is to create real-time prediction models that can make inference without the use of any information that is not yet available, so this is very important to keep in mind when selecting the input data.

The arrival buckets to be predicted are described in the submission_format.csv file and look like this:

     ID,Value
     KDEN_220925_0100_15,99
     KDEN_220925_0100_30,99
     KDEN_220925_0100_45,99
     KDEN_220925_0100_60,99
     KDEN_220925_0100_75,99
     ...

Where KDEN is the ICAO code for the airport, 220925 is the date (YYMMDD), 0100 is the time of the prediction (HHMM), and 15, 30, ... are the buckets into the future (from 1:00 to 1:15, from 1:15 to 1:30, etc.).
The order of the predictions should be the same as in the submission_format.csv file. Failing to follow this format will result in an error or lower score.
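
For illustration, a small sketch of how an ID can be decomposed (standard library only, no assumptions beyond the format described above):

    from datetime import datetime, timedelta

    def parse_submission_id(id_str):
        """Split an ID like 'KDEN_220925_0100_15' into airport, prediction
        time, and the start/end of the corresponding 15-minute bucket."""
        airport, yymmdd, hhmm, bucket = id_str.split("_")
        prediction_time = datetime.strptime(yymmdd + hhmm, "%y%m%d%H%M")
        bucket_end = prediction_time + timedelta(minutes=int(bucket))
        bucket_start = bucket_end - timedelta(minutes=15)
        return airport, prediction_time, bucket_start, bucket_end

    print(parse_submission_id("KDEN_220925_0100_15"))
    # -> ('KDEN', 2022-09-25 01:00, 2022-09-25 01:00, 2022-09-25 01:15)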

The evaluation metric is based on Root Mean Squared Error (RMSE). Specifically, the score is calculated as exp(-RMSE/K), where K is a normalization factor of 10 and exp is the exponential function, so that the score is a number between 0.0 and 1.0, with 1.0 representing a perfect score (RMSE = 0).
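
A minimal sketch of this score, assuming NumPy:

    import numpy as np

    def competition_score(y_true, y_pred, k=10.0):
        """exp(-RMSE / K) with K = 10, as described above; 1.0 is a perfect
        score (RMSE = 0)."""
        y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
        rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))
        return np.exp(-rmse / k)

    print(competition_score([10, 12, 8], [10, 12, 8]))    # 1.0
    print(competition_score([10, 12, 8], [12, 10, 11]))   # about 0.79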

The leaderboard will be split into a Public and a Private leaderboard. The preliminary results used to advance to the final evaluation stage will be determined by the Private leaderboard, which will be revealed at the end of the competition period.

We wish you good luck in this challenge. If there are questions, please refer to the FAQ and Rules.

FAQs
Who do I contact if I need help regarding a competition?

For any inquiries, please contact us at competitions AT bitgrit DOT net

How will I know if I've won?
If you are one of the top three winners for this competition, we will email you with the final result and information about how to claim your reward.
How can I report a bug?

Please shoot us an email at competitions AT bitgrit DOT net with details and a description of the bug you are facing, and if possible, please attach a screenshot of the bug itself.

If I win, how can I receive my reward?
Prizes will be paid by bank transfer. If for some reason you are not able to accept payment by bank transfer, please let us know and we will do our best to accommodate your needs.
Rules

1. This competition is governed by the following Terms of Participation. Participants will register using the bitgrit platform and must agree to and comply with these terms to participate.

2. Individual participants must be 18 years old or older, affiliated with a U.S. university to participate and be either a U.S. citizen or a permanent resident of the United States or its territories to be eligible for a prize. Individual participants who do not meet these citizenship and university affiliation requirements may still participate but will not be eligible for a prize.

3. For team entries (i.e., groups of individuals competing together but not representing an established organization, institution, or corporation), the team lead must meet the following requirements: be 18 years old or older, a U.S. citizen or permanent resident of the United States or its territories, and affiliated with an accredited U.S. university as a student or faculty member for the team to qualify for a prize. The team lead will act as the official representative for the team.

4. If there is a dispute over the identity of the team lead who submitted the entry, and it cannot be resolved to the Challenge Sponsor’s satisfaction, the submission will be disqualified. Proof of US citizenship/residency as well as active affiliation to a US university will be required before any prize is awarded.

5. The following restrictions for participation also apply: 
• Solvers may not be an employee of NASA or a competition partner employee. 
• Solvers may not be a Federal employee acting within the scope of their employment. 
• Government contractors working on the same or similar projects are ineligible to participate in the Challenge. 
• Any individuals (including an individual's parent, spouse, or child) or private entities involved with any aspect of the design, production, execution, distribution, or evaluation of the Challenge are not eligible to enter as an individual or member of a team. 
• Federal employees acting in their personal capacities should consult with their respective agency ethics officials to determine whether their participation in this Challenge is permissible. 
• Government contractors that have worked developing a solution under a U.S. government contract, grant, or cooperative agreement, or while performing work for their employer should seek legal advice from their employer’s ethics agency on their conditions of employment which may affect their ability to submit a solution to this challenge and/or to accept an award. 
• Funds from U.S. or foreign government organizations should not be used to directly fund the development of a solution to this Challenge. Solutions that were previously developed with Government/Federal funds, or where Government/Federal funds, including but not limited to, employee time, materials, and reviews, were utilized to prepare the submission or solutions are prohibited. 
• Solvers currently receiving Federal funding through a grant or cooperative agreement unrelated to the scope of this challenge are eligible to compete but may not utilize that Federal funding for competing in this challenge. 

6. Users can make a maximum number of 3 submissions per day. If users want to submit new files after making 3 submissions in a day, they will have to wait until the following day to do so. Please keep this in mind when uploading a submission file. Any attempt to circumvent stated limits will result in disqualification.

7. The use of external datasets outside of those indicated in this competition page is strictly forbidden. However, we encourage the creative use of derivative datasets, such as calculated features, synthetic training data and other data augmentation techniques.

8. Participants are not allowed to upload the competition dataset to other websites or share it without authorization. Users who do not comply with this rule will be immediately disqualified.

9. The final submission has to be selected manually before the end of the competition (a participant may select up to 2); otherwise, the final submission will be selected automatically based on their highest public score.

10. If at the time of the end of the competition two or more participants have the same score on the private leaderboard, the participant who submitted the winning file first will be considered for the following review stage.

11. Once the competition period ends, our team will reach out to top scorers based on the Private Leaderboard score, which will be revealed at this point. Top scorers will be asked to provide the following information by December 15th, 2024, to be qualified for the final review stage. Failure to provide this information may result in disqualification. 
a. All source files required to preprocess the data 
b. All source files required to build, train and make predictions with the model using the processed data 
c. A requirements.txt (or equivalent) file indicating all the required libraries and their versions as needed 
d. A ReadMe file containing the following: 
• Clear and unambiguous instructions on how to reproduce the predictions from start to finish including data pre-processing, feature extraction, derivative data generation, model training, and predictions generation. 
• Environment details regarding where the model was developed and trained, including OS, memory (RAM), disk space, CPU/GPU used, and any required environment configurations required to execute the training code.
• Clear answers to the following questions: 
- Which data files are being used to train the model? 
- How is the training dataset prepared, including derivative data? 
- What is the algorithm used and what are its main hyperparameters? 
- Any other comments considered relevant to understanding, replicating and using the model.

12. The submitted solution should be able to generate exactly the same model and the same inferencing output that gives the corresponding score on the leaderboard. If the score obtained from the code is different from what’s shown on the leaderboard, the new score will be used for the final rankings unless a logical explanation is provided. Participants are advised to make sure to set the seed or random state appropriately so bitgrit can obtain the same result from the code that was originally submitted.

13. To ensure fairness and integrity in the competition, participants are prohibited from exploiting any non-statistical patterns, anomalies, or other data artifacts that may exist within the dataset. Any attempt to identify and utilize such patterns, which do not derive from legitimate model analysis or generalization but rather from specific quirks or errors in the data collection or labeling process, will result in immediate disqualification. This rule is intended to maintain a level playing field and ensure that model performance is based solely on genuine predictive ability rather than incidental characteristics of the data.

14. All code submitted as part of this competition must be the original work of the participant or team. By participating, you agree that all submitted content, including code, documentation, and related materials, is your own creation and does not infringe upon any copyrights, trademarks, or other proprietary rights of any third party. Any use of third-party code or libraries must be clearly disclosed, properly attributed, and compliant with the respective licenses. Failure to comply with these requirements will result in disqualification from the competition and forfeiture of any prizes.

15. Competition prizes will only be awarded after the receipt, successful execution, and confirmation of the validity, integrity, and consistency of both the code and the solution, along with the final challenge score determination.

16. The participants retain ownership of any intellectual property they create as part of their submission to the challenge. In order to be eligible for the prize, the awardees must transfer their solution rights to bitgrit. As such, bitgrit will grant CrowdPlat and the United States Government an irrevocable, royalty-free, non-exclusive, and worldwide license to use and permit others to use all or any part of the winning solutions including, without limitation, the right to reproduce, offer for sale, use, modify, prepare derivative works, and distribute all or any part of such solution, modifications, or combinations thereof and to sublicense (directly or indirectly through multiple tiers) or transfer any and all such rights. The Government will be provided the complete solution and the intellectual property rights will be defined by said agreement and applicable law.

17. Any prize awards are subject to eligibility verification and compliance with these Terms of Participation. All decisions of bitgrit will be final and binding on all matters relating to this Competition.

18. Payments to winners may be subject to local, state, federal and foreign tax reporting and withholding requirements.

If you have any inquiries about this competition, please don’t hesitate to reach out to us at competitions AT bitgrit DOT net.
