The NASA Breath Diagnostics Challenge | bitgrit

The NASA Breath Diagnostics Challenge

Enhance NASA's E-Nose for Accurate Medical Diagnostics

NASA
332 Participants
869 Submissions
Brief

Welcome to The NASA Breath Diagnostics Challenge!

The National Aeronautics and Space Administration (NASA) Science Mission Directorate (SMD) seeks innovative solutions to improve the accuracy of its NASA E-Nose as a potential clinical tool that would measure the molecular composition of human breath to provide diagnostic results. We invite data scientists and AI experts to participate in this challenge, leveraging their expertise to develop a classification model that can accurately discriminate between the breath of COVID-positive and COVID-negative individuals, using data obtained from a recent clinical study. The total prize pool for this competition is $55,000.

The objective of this challenge is to develop a diagnostic model using NASA E-Nose data gathered from the exhaled breath of 63 volunteers in a COVID-19 study. Challenge participants will use advanced data preparation and AI techniques to overcome the limited number of subjects in the COVID-19 study. The innovative solutions emerging from this challenge may assist NASA in advancing the technical capability of the NASA E-Nose for a wide range of clinical applications relevant to human space exploration.

Prizes
  • 1st Place - $20,000
  • 2nd Place - $12,000
  • 3rd Place - $8,000
  • 4th Place - $4,000
  • 5th Place - $4,000
  • 6th Place - $3,000
  • 7th Place - $2,000
  • 8th Place - $2,000
Timeline

Competition Starts – July 5th, 2024
Competition Ends – September 6th, 2024
Winners Announcement – November 8th, 2024

Data Breakdown

IMPORTANT: This competition uses unique criteria to determine which models are eligible, so it is extremely important that participants read the rules carefully before continuing. In particular, please pay special attention to the Rules section before embarking on this challenge.


The objective of this challenge is to develop a classification model that can accurately diagnose patients with COVID-19 based on data captured from the exhaled breath of volunteers using the NASA E-Nose device. The total number of patients, and therefore examples, is 63. This is a limited dataset, so making efficient use of the provided data is extremely important in this challenge. We encourage creativity in dealing with the small sample size, since one of the goals of this competition is to address the challenge of limited training data in scenarios such as emerging diseases with few confirmed cases. The limited data also affects submission testing, so it is very important to understand the rules governing this event and how submissions will ultimately be scored.

The data consists of 63 .txt files representing the 63 patients, numbered 1 to 63.
Each file contains the Patient ID, the COVID-19 Diagnosis Result (POSITIVE or NEGATIVE), and numeric measurements from 64 sensors, D1 to D64. These sensors are installed within the E-Nose device, and each measures different molecular signals in the patients' breath.

All sensor data is indexed by a timestamp in the format Min:Sec, which gives the minute of the hour and the second within that minute at which the sample was taken. The hour has been left out of the timestamp, but when the minute counter resets, the next hour has begun. Keep this in mind when working with this time axis.
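The exact on-disk layout of the patient files (header lines, delimiter, column names) is not spelled out above, so the sketch below is only illustrative: it assumes two metadata lines followed by a whitespace-delimited table with one Min:Sec timestamp column and the 64 sensor columns, and it shows how the timestamps can be turned into a monotonically increasing elapsed-seconds axis despite the missing hour.

```python
import pandas as pd

def load_patient(path):
    """Load one patient .txt file. NOTE: the file layout here is an
    assumption (two metadata lines, then a whitespace-delimited table
    of a Min:Sec timestamp plus sensors D1..D64); adjust skiprows,
    sep and names to match the real files."""
    df = pd.read_csv(path, sep=r"\s+", skiprows=2,
                     names=["time"] + [f"D{i}" for i in range(1, 65)])

    # Recover a monotonic time axis: the hour is missing, so add one hour
    # every time the minute counter resets to a smaller value.
    minutes = df["time"].str.split(":").str[0].astype(int)
    seconds = df["time"].str.split(":").str[1].astype(int)
    hours = (minutes.diff() < 0).cumsum()
    df["elapsed_s"] = hours * 3600 + minutes * 60 + seconds
    df["elapsed_s"] -= df["elapsed_s"].iloc[0]  # measure from the start
    return df
```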

To achieve maximum consistency across patients, each breath sample was presented to the E-Nose device from a pulsation bag that had previously collected the patient's breath. The E-Nose measurement procedure also includes flushing the sensors with ambient air, which can be used to calibrate the readings taken during exposure to human breath.

For every patient, the breath sample was presented to the E-Nose device using the following sequence of exposure windows (a sketch encoding these windows follows the list):

1. 5 min baseline measurement using ambient air
2. 1 min breath sample exposure and measurement, using the filled breath bag
3. 2 min sensor “recovery” using ambient air
4. 1 min breath sample exposure and measurement, using the filled breath bag
5. 2 min sensor “recovery” using ambient air
6. 1 min breath sample exposure and measurement, using the filled breath bag
7. 2 min sensor “recovery” using ambient air
Total time = 14 mins
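Because this schedule is the same for every patient, the window boundaries can be encoded once and used to slice each recording into baseline, breath-exposure, and recovery segments (for example, to calibrate breath readings against the surrounding ambient-air windows). The sketch below assumes the elapsed_s column built in the earlier loading sketch.

```python
# Exposure windows in seconds from the start of the recording,
# taken directly from the 14-minute schedule above.
WINDOWS = [
    ("baseline",    0 * 60,  5 * 60),
    ("breath_1",    5 * 60,  6 * 60),
    ("recovery_1",  6 * 60,  8 * 60),
    ("breath_2",    8 * 60,  9 * 60),
    ("recovery_2",  9 * 60, 11 * 60),
    ("breath_3",   11 * 60, 12 * 60),
    ("recovery_3", 12 * 60, 14 * 60),
]

def segment(df):
    """Split one patient recording (with an elapsed_s column) into the
    labelled exposure windows defined above."""
    return {name: df[(df["elapsed_s"] >= start) & (df["elapsed_s"] < end)]
            for name, start, end in WINDOWS}
```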

The data is distributed into training and test sets:
Train: 45 patients
Test: 18 patients

The dataset also includes two other files: submission_example.csv and train_test_split.csv. The first file shows how all submission files should be formatted. The values should be 0 for NEGATIVE and 1 for POSITIVE. Failing to follow this format will result in an error or a lower score.

The second file (train_test_split.csv) indicates which patient IDs are considered Train (i.e., labeled) and which are Test (i.e., not labeled).

The order of the predictions in the submission file should be the same as the order of the rows marked TEST in the train_test_split.csv file.
The index column of the submission file is NOT the patient ID, but the order of the values in the Result column must follow the order in the train_test_split.csv file. This is very important.
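A minimal sketch of assembling a submission in the required order. The column names used for train_test_split.csv ("Split" and "Patient ID") are assumptions here, as is the exact submission header; mirror submission_example.csv for the real format.

```python
import pandas as pd

# Hypothetical column names; check train_test_split.csv and
# submission_example.csv for the real ones.
split = pd.read_csv("train_test_split.csv")
test_ids = split.loc[split["Split"] == "TEST", "Patient ID"].tolist()

# predictions: {patient_id: 0 or 1} produced by your model
# (all-NEGATIVE placeholder shown here).
predictions = {pid: 0 for pid in test_ids}

# Rows must follow the order of the TEST rows in the split file;
# the index written here is just a running row number, not the patient ID.
submission = pd.DataFrame({"Result": [predictions[pid] for pid in test_ids]})
submission.to_csv("submission.csv")
```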

The evaluation metric is Accuracy.
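Accuracy is simply the fraction of patients whose predicted label matches the true label. A toy illustration with made-up labels (1 = POSITIVE, 0 = NEGATIVE):

```python
import numpy as np

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])  # made-up ground truth
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])  # made-up predictions

accuracy = (y_true == y_pred).mean()  # fraction of correct predictions
print(f"Accuracy: {accuracy:.3f}")    # 0.750 for these toy labels
```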

The leaderboard will be split into a Public and a Private leaderboard. The preliminary results used to determine who advances to the final evaluation stage will come from the Private Leaderboard, which will be revealed at the end of the competition period. Again, please refer to the Rules section, especially rules 7 to 11, to understand the particular evaluation criteria for this challenge.

Please note that the Public Leaderboard is mostly for your reference, as it represents only an approximate assessment of a model's performance; the final score may deviate substantially from it. Any attempt to "game" or artificially inflate the Public Leaderboard score will provide no benefit to the final score and could result in disqualification if the model is found to be purposely overfitting the Public Leaderboard test data.

We wish you good luck in this challenge. If you have questions, please refer to the FAQ or Rules, or send an email to [email protected].

 

FAQs
Who do I contact if I need help regarding a competition?
For any inquiries, please contact us at [email protected]
How will I know if I've won?
If you are one of the winners of this competition, we will email you with the final result and information about how to claim your reward.
How can I report a bug?
Please shoot us an email at [email protected] with details and a description of the bug you are facing, and if possible, please attach a screenshot of the bug itself.
If I win, how can I receive my reward?
Prizes will be paid by bank transfer. If for some reason you are not able to accept payment by bank transfer, please let us know and we will do our best to accommodate your needs.
Rules
  1. This competition is governed by the following Terms of Participation. Participants must agree to and comply with these Terms to participate.
  2. Users can make a maximum of 2 submissions per day. If users want to submit new files after making 2 submissions in a day, they will have to wait until the following day to do so. Please keep this in mind when uploading a submission file. Any attempt to circumvent stated limits will result in disqualification.
  3. The use of external datasets is strictly forbidden. However, we encourage the creative use of derivative datasets, such as calculated features, synthetic training data and other data augmentation techniques.
  4. Uploading the competition dataset to other websites is not allowed. Users who do not comply with this rule will be immediately disqualified.
  5. Final submissions must be selected manually before the end of the competition (you can select up to 2); otherwise, the final submission will be selected automatically based on your highest public score.
  6. If, at the end of the competition, two or more participants have the same score on the private leaderboard, the participant who submitted the winning file first will be considered for the following review stage.
  7. Once the competition period ends, our team will reach out to top scorers based on the Private Leaderboard score, which will be revealed at this point. Top scorers will be asked to provide the following information by September 16th, 2024, to be qualified for the final review stage. Failure to provide this information may result in disqualification.

    a. All source files required to preprocess the data
    b. All source files required to build, train and make predictions with the model using the processed data
    c. A requirements.txt (or equivalent) file indicating all the required libraries and their versions as needed
    d. A ReadMe file containing the following:
      • Clear and unambiguous instructions on how to reproduce the predictions from start to finish.
      • Environment details regarding where the model was developed and trained, including OS, memory (RAM), disk space, CPU/GPU used, and any environment configurations required to execute the training code.
      • Clear answers to the following questions:
            - Which data files are being used to train the model?
            - How is the training dataset prepared, including derivative data?
            - What is the algorithm used and what are its main hyperparameters?
            - Any other comments considered relevant to understanding, replicating and using the model.
  8. The submitted solution should be able to generate exactly the same model and the same inferencing output that gives the corresponding score on the leaderboard. If the score obtained from the code is different from what’s shown on the leaderboard, the new score will be used for the final rankings unless a logical explanation is provided. Please make sure to set the seed or random state appropriately so we can obtain the same result from your code.
  9. To ensure fairness and integrity in the competition, participants are prohibited from exploiting any non-statistical patterns, anomalies, or other data artifacts that may exist within the dataset. Any attempt to identify and utilize such patterns, which do not derive from legitimate model analysis or generalization but rather from specific quirks or errors in the data collection or labeling process, will result in immediate disqualification. This rule is intended to maintain a level playing field and ensure that model performance is based solely on genuine predictive ability rather than incidental characteristics of the data.
  10. The submitted models must be capable of performing inference efficiently on standard consumer-grade hardware, such as a tablet or similar mobile device, within a reasonable time frame, typically less than a minute. This requirement ensures that the models are not only accurate but also practical and scalable for real-world applications where resources may be limited.
  11. Given the particularly small size of the data for this competition, additional measures are required to ensure fairness and integrity. To be considered eligible for winning a prize, participants must meet the following criteria, in order:
    1. All the required information must have been provided (see rule 7).
    2. The scores on the private and public leaderboards must be reproducible (see rule 8).
    3. Their models are able to be used for inference on modern consumer-grade hardware within a reasonable time frame (see rule 10).
    4. The bitgrit and NASA team will calculate an Internal Score by performing cross-validation and similar experiments, including testing with different random seeds and stratified data splits, using the provided model and the same test-set size. If the Internal Score deviates by more than 10% from the Overall Score (Private + Public Scores), only the Internal Score will be considered the Final Score. If the results are consistent, the Final Score will be calculated as the average of the Overall Score and the Internal Score (a rough sketch of this logic follows the rules list).
    5. The final ranking will be determined based on the Final Score, and winners will be awarded according to this ranking.
    6. Any evidence of exploiting non-statistical patterns, data artifacts, or any other anomalies that do not derive from legitimate model generalization will result in immediate disqualification. (see rule 9)
    7. Competition prizes will only be awarded after the receipt, successful execution, and confirmation of the validity, integrity, and consistency of both the code and the solution (see rules 7, 8, 9), along with the final challenge score calculation.
  12. In order to be eligible for the prize, the competition winner must agree to transfer to NASA and the relevant transferee of rights in such Competition all transferable rights, such as copyrights, rights to obtain patents and know-how, etc. in and to all analysis and prediction results, reports, analysis and prediction model, algorithm, source code and documentation for the model reproducibility, etc., and the Submissions contained in the Final Submissions.
  13. Any prize awards are subject to eligibility verification and compliance with these Terms of Participation. All decisions of bitgrit will be final and binding on all matters relating to this Competition.
  14. Payments to winners may be subject to local, state, federal and foreign tax reporting and withholding requirements.
  15. If you have any inquiries about this competition, please don’t hesitate to reach out to us at [email protected].
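As a reading aid for the scoring logic in rule 11 (referenced in criterion 4 above), here is a rough sketch. Two details are assumptions, since the rules do not spell them out: that the Overall Score combines the Public and Private scores as a simple average, and that the 10% deviation is measured relative to the Overall Score.

```python
def final_score(public_score, private_score, internal_score):
    # Assumption: "Overall Score (Private + Public Scores)" is read as
    # the average of the two; the rules do not define the combination.
    overall = (public_score + private_score) / 2
    # Assumption: the 10% deviation is relative to the Overall Score.
    if abs(internal_score - overall) > 0.10 * overall:
        return internal_score              # inconsistent: internal only
    return (overall + internal_score) / 2  # consistent: average of both
```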

 
