
LitCoin NLP Challenge: Part 1

Identify mentions of biomedical entities in research abstracts

NCATS & NASA
232 Participants
455 Submissions
Updates
  • 18 Nov 2021

    Formula updated in the guidelines.
Brief
This two-phase competition is part of the NASA Tournament Lab, hosted by NCATS (the National Center for Advancing Translational Sciences) with contributions from the National Library of Medicine (NLM). These institutions, in collaboration with bitgrit and CrowdPlat, have come together to bring you this challenge, where you can deploy your data-driven technology solutions towards accelerating scientific research in medicine and ensuring that data from biomedical publications can be maximally leveraged and reach a wide range of biomedical researchers.

Each phase of the competition is designed to spur innovation in the field of natural language processing, asking competitors to design systems that can accurately recognize scientific concepts in the text of scientific articles, connect those concepts into knowledge assertions, and determine whether each assertion is a novel finding or background information.

Part 1: Given only an abstract text, the goal is to find all the nodes, or biomedical entities (position in text and BioLink Model Category).

Part 2: Given the abstract and the nodes annotated from it, the goal is to find all the relationships between them (pair of nodes, BioLink Model Predicate and novelty).

*NOTE: The prizes listed will be awarded based on a competitor's combined, weighted scores from both phases of the competition. Please see the Rules section for more information.*

In order to be eligible for the prize money, participants must be a U.S. citizen or a U.S. permanent resident. Non-U.S. citizens and non-permanent residents can participate as well, either as a member of a team that includes a citizen or permanent resident of the U.S., or on their own. However, such non-U.S. citizens and non-permanent residents are not eligible to win a monetary prize (in whole or in part). Their participation as part of a winning team, if applicable, may be recognized when the results are announced. Similarly, if participating on their own, they may be eligible to win a non-cash recognition prize. Proof of citizenship and permanent residency will be required. For more information on competition eligibility requirements, please see https://ncats.nih.gov/funding/challenges/litcoin

The National Center for Advancing Translational Sciences (NCATS, a center of the National Institutes of Health): NCATS is conducting this challenge under the America Creating Opportunities to Meaningfully Promote Excellence in Technology, Education, and Science (COMPETES) Reauthorization Act of 2010. This challenge will spur innovation in NLP to advance the field and allow the generation of more accurate and useful data from biomedical publications, enhancing the ability of data scientists to create tools that foster discovery and generate new hypotheses.

The National Center for Biotechnology Information (NCBI, part of the National Library of Medicine, a division of the National Institutes of Health): NCBI intramural researchers and their collaborators have provided a corpus of annotated abstracts from published scientific research articles, together with knowledge assertions between the annotated concepts, which will be provided to participants for training and testing purposes.

CrowdPlat (Project Company): The LitCoin project was awarded to and is being managed by CrowdPlat under NASA's NOIS2 contract. Located in San Jose, California, CrowdPlat provides crowdsourcing solutions to medium- to large-scale enterprises seeking project execution through a crowdsourced talent pool.
Prizes
  • 1st Prize ($35,000)
  • 2nd Prize ($25,000)
  • 3rd Prize ($20,000)
  • 4th Prize ($5,000)
  • 5th Prize ($5,000)
  • 6th Prize ($5,000)
  • 7th Prize ($5,000)

The prize money displayed is the total prize for both phases of the LitCoin NLP Challenge. Please see the Rules section for more info!
Timeline
  • 09 Nov 2021 Competition Phase-1 Starts
  • 23 Dec 2021 Competition Phase-1 Ends
  • 27 Dec 2021 Competition Phase-2 Starts
Data Breakdown
The goal of the first part of the LitCoin NLP Challenge is to identify the position and type of biomedical concept (entity) mentions within a research paper's title and abstract.

The position of a biomedical entity's mention in the text is determined by two 'offset' numbers, 'offset_start' and 'offset_finish', which indicate the index of the character where a given mention substring begins and ends, respectively. The input string considered for these indices is the concatenation of the title and abstract strings in the following manner: string = title + ' ' + abstract (i.e., there is one extra character between the two when accounting for the offset). Minimal sketches for loading the data, writing a submission, and scoring follow at the end of this section.

The type of the biomedical entities comes from the BioLink Model Categories, and can be one and only one of the following:
・DiseaseOrPhenotypicFeature
・ChemicalEntity
・OrganismTaxon
・GeneOrGeneProduct
・SequenceVariant
・CellLine

To properly understand these categories, it helps to become familiar with the BioLink Model and biomedical ontologies in general. The BioLink Model is a high-level data model of biological entities (genes, diseases, phenotypes, pathways, individuals, substances, etc.) and their associations. It can be thought of as a high-level biomedical ontology that can help categorize and associate concepts coming from different lower-level biomedical ontologies. Ontologies are controlled vocabularies that describe the meaning of data (its semantics) in a human- and machine-readable way. They have been widely used in the biomedical field to help solve the problem of data heterogeneity, enabling advanced data analysis, knowledge organization and reasoning. The following links are a good starting point:
・BioLink Model: https://biolink.github.io/biolink-model/ and http://tree-viz-biolink.herokuapp.com
・Ontologies: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4300097/ , https://disease-ontology.org , https://www.omim.org/ , https://www.bioontology.org/

The number of entity mentions in each abstract varies with the text, and all mentions of entities should be identified, even if they refer to the same concept. For example, if the substring "HIV" appears 3 times in the text and the substring "Human Immunodeficiency Virus" appears another 2 times in the same text, all 5 of these mentions should be identified individually, even though they refer to the same concept.

The available data consists of the following files: abstracts_train.csv, entities_train.csv, relations_train.csv and abstracts_test.csv, described below. All of these are CSV files delimited with tab instead of ",".

・abstracts_train.csv: CSV file containing research papers (input) that can be used for training.
# abstract_id: PubMed ID of the research paper.
# title: title of the research paper.
# abstract: abstract or summary of the research paper.

・entities_train.csv: CSV file containing all the entity mentions (output) found in the texts from abstracts_train.csv that can be used for training.
# id: unique ID of the entity mention.
# abstract_id: PubMed ID of the research paper where the entity mention appears.
# offset_start: position of the character where the entity mention substring begins in the text (title + abstract).
# offset_finish: position of the character where the entity mention substring ends in the text (title + abstract).
# type: type of entity as one of the above-mentioned 6 possible categories.
# mention: substring representing the actual entity mention in the text. Can also be extracted using the offsets and the input text.
# entity_ids*: comma-separated external IDs from a biomedical ontology that specifically identify the concept. Not used in this first phase.

*The ontologies used to obtain these external IDs are the following:
Gene: NCBI Gene
Disease: MEDIC (which is MeSH + OMIM)
Chemical: MeSH
Variant: RS# in dbSNP
Species: NCBI Taxonomy
CellLine: NCBI Taxonomy

・relations_train.csv: CSV file containing all the relations found in the abstracts that can be used for training for the SECOND PART OF THE COMPETITION. NOT USED IN THIS PHASE.
# id: unique ID of the relation.
# abstract_id: PubMed ID of the research paper where the relation appears.
# entity_id_1: external ID of the entity that is the subject of the relation.
# entity_id_2: external ID of the entity that is the object of the relation.
# relation_type: type or predicate connecting the two entities.
# novel: whether the relation corresponds to a novel discovery or not.

・abstracts_test.csv: CSV file containing research papers (input) whose entities have to be identified with a trained model.
# abstract_id: PubMed ID of the research paper.
# title: title of the research paper.
# abstract: abstract or summary of the research paper.

・submission_example.csv: CSV file containing an example of what a submission file (output) should look like in order to be uploaded and scored on the platform. Its format is similar to that of entities_train.csv but without the 'mention' and 'entity_ids' columns. Please note that this is just a small example where types and offsets have been randomly selected, and the number of entities per abstract may be much smaller than usual. The column order must be respected, as must the use of tab as the delimiter and the inclusion of a header with the column titles; failure to comply with this format will result in an error or a lower score.
# id: ID of the entity mention.
# abstract_id: PubMed ID of the research paper where the entity mention appears.
# offset_start: position of the character where the entity mention substring begins in the text (title + abstract).
# offset_finish: position of the character where the entity mention substring ends in the text (title + abstract).
# type: type of entity as one of the above-mentioned 6 possible categories.

The evaluation metric for this problem is a modified version of the Jaccard similarity score. For each abstract_id, given the set of predicted mentions P and the set of correct mentions O, the score is:

|P ∩ O| / |P ∪ O| = |P ∩ O| / (|P| + |O| - |P ∩ O|)

where ∩ denotes intersection, ∪ denotes union, and |·| denotes set size (the number of entity mentions for that abstract_id). Matching mentions (for the intersection) are determined by having the same type and the same (or very similar) offsets. The Jaccard similarity scores of all abstract_ids are then averaged to give the final score.

Final competition results are based on competitors' combined, weighted scores from both phases of the competition: 30% of the total score is determined by problem statement 1 and 70% by problem statement 2.
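To make the offset scheme concrete, here is a minimal sketch (assuming Python with pandas installed) of loading the tab-delimited training files and recovering a mention from its offsets. It assumes offset_finish is exclusive, Python-slice style; verify that against entities_train.csv before relying on it, since the description above does not pin this down.

```python
# Minimal sketch: load the tab-delimited training files and check that
# each annotated mention can be recovered from its offsets.
# File and column names are as documented above; pandas is assumed installed.
import pandas as pd

abstracts = pd.read_csv("abstracts_train.csv", sep="\t")
entities = pd.read_csv("entities_train.csv", sep="\t")

# Offsets index into the concatenation: title + ' ' + abstract.
texts = {
    row.abstract_id: f"{row.title} {row.abstract}"
    for row in abstracts.itertuples()
}

for row in entities.head(10).itertuples():
    text = texts[row.abstract_id]
    # Assumption: offset_finish is exclusive (Python slice convention);
    # check a few rows of the real data to confirm.
    span = text[row.offset_start:row.offset_finish]
    print(row.type, repr(span), "matches mention:", span == row.mention)
```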
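Likewise, a sketch of writing a submission file in the required format: tab-delimited, header row included, documented column order. The predictions list here is a hypothetical placeholder for your model's output.

```python
# Minimal sketch: write a submission file in the required format
# (tab-delimited, header row, documented column order).
import csv

# Hypothetical model output: (abstract_id, offset_start, offset_finish, type).
predictions = [
    (12345678, 0, 11, "ChemicalEntity"),  # toy values for illustration
]

with open("submission.csv", "w", newline="") as f:
    writer = csv.writer(f, delimiter="\t")
    writer.writerow(["id", "abstract_id", "offset_start", "offset_finish", "type"])
    for i, (abstract_id, start, finish, entity_type) in enumerate(predictions):
        writer.writerow([i, abstract_id, start, finish, entity_type])
```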
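Finally, a sketch of the evaluation metric under the simplifying assumption that mentions match only on identical (type, offset_start, offset_finish) triples; the official scorer also accepts "very similar" offsets, which is not reproduced here.

```python
# Minimal sketch of the modified Jaccard score, assuming exact matching
# on (type, offset_start, offset_finish) triples.

def jaccard(predicted: set, gold: set) -> float:
    """Per-abstract score: |P ∩ O| / (|P| + |O| - |P ∩ O|)."""
    inter = len(predicted & gold)
    union = len(predicted) + len(gold) - inter
    return inter / union if union else 0.0

def litcoin_score(pred_by_abstract: dict, gold_by_abstract: dict) -> float:
    """Average the per-abstract Jaccard scores over all abstract_ids."""
    ids = set(pred_by_abstract) | set(gold_by_abstract)
    scores = [
        jaccard(pred_by_abstract.get(a, set()), gold_by_abstract.get(a, set()))
        for a in ids
    ]
    return sum(scores) / len(scores) if scores else 0.0
```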
FAQs
Who do I contact if I need help regarding a competition?
If you have any inquiries about participating in this competition, please don’t hesitate to reach out to us at info@bitgrit.net. For questions about eligibility or prize distribution, email NCATS at litcoin-questions@mail.nih.gov
How will I know if I’ve won?
If you are one of the top seven winners for this competition, we will email you with the final result and information about how to claim your reward.
How can I report a bug?
Please shoot us an email at info@bitgrit.net with details and a description of the bug you are facing, and if possible, please attach a screenshot of the bug itself.
If I win, how can I receive my reward?
The money prize will be awarded by NIH/NCATS directly to the winner (if an individual) or to the Team Lead of the winning team (if a team). Please see rule 7 for eligibility information. Prizes awarded under this Challenge will be paid by electronic funds transfer and may be subject to Federal income taxes. HHS/NIH will comply with Internal Revenue Service withholding and reporting requirements, where applicable.
Rules
1. This competition is governed by the following Terms of Participation ("Participation Rules"). Participants must agree to and comply with the Participation Rules to compete.

2. This competition consists of 2 problem statements, herein considered as competition sub-phases. Winners will be determined by a weighted average of scores from the two competition phases: 30% of the total score will be determined by problem statement 1 and 70% of the total score will be determined by problem statement 2.

3. The competition dates are as follows:
Phase 1 Start Date: November 9th, 2021
Phase 1 Closing Date: December 23rd, 2021
Phase 2 Start Date: December 27th, 2021
Phase 2 Closing Date: February 28th, 2022
Submission (Final Source Code): March 11th, 2022
Winners Announced: April 8th, 2022

4. Participants may participate in an individual capacity or as part of a team.

5. Merging teams midway through the competition is not allowed.

6. Each participant may only be a member of a single team and may not participate as an individual and on a team simultaneously.

7. In order to be eligible for the prize money, participants must be a U.S. citizen or a U.S. permanent resident. Non-U.S. citizens and non-permanent residents can participate as well, either as a member of a team that includes a citizen or permanent resident of the U.S., or on their own. However, such non-U.S. citizens and non-permanent residents are not eligible to win a monetary prize (in whole or in part). Their participation as part of a winning team, if applicable, may be recognized when the results are announced. Similarly, if participating on their own, they may be eligible to win a non-cash recognition prize. Proof of citizenship and permanent residency will be required. For more information on competition eligibility requirements, please see https://ncats.nih.gov/funding/challenges/litcoin

8. In the case of team participation, all submissions must be made by the team lead.

9. All participants who are under the age of 18, or are considered a minor in the country they live in, are required to submit a signed copy of the parent/legal guardian consent form. This form can be found at https://ncats.nih.gov/files/LitCoin-Parental-Consent-Form-508.pdf. Signed forms can be sent to litcoin-questions@nih.gov

10. The use of external datasets for training is allowed, but submissions must be generated using the test corpus provided.

11. During the competition period, participants may make a maximum of 5 submissions per day. If participants reach this limit, the platform will reset to allow an additional 5 submissions the following day. Please keep this in mind when uploading a submission file. Any attempt to circumvent the stated limits will result in disqualification.

12. Participants are not permitted to share or upload the competition dataset to any platform outside of the competition. Participants who do not comply with the confidentiality regulations of the competition will be disqualified.

13. The top seven (7) winning participants will be eligible to receive a competition prize (ranked by performance) after we have received, successfully executed, and confirmed the validity of both the code and the solution (see 15). In order to ensure that at least 7 participants may be awarded prizes, the top fifteen (15) individuals/teams will be asked to submit their source code for evaluation (see 14).

14. Once potential competition winners are determined and our team reaches out to them, the top-scoring participants must provide the following by March 11, 2022 for evaluation in order to be qualified as competition winner(s) and receive their prize:
a. The Winning Model Documentation template, filled in (this document will be available on the "Resources" tab of the competition page of LitCoin NLP Challenge: Part 2)
b. All source files required to preprocess the data
c. All source files required to build, train and make predictions with the model using the processed data
d. A requirements.txt (or equivalent) file indicating all the required libraries and their versions as needed
e. A ReadMe file containing the following:
• Clear and unambiguous instructions on how to reproduce the predictions from start to finish, including data pre-processing, feature extraction, model training and prediction generation
• Environment details regarding where the model was developed and trained, including OS, memory (RAM), disk space, CPU/GPU used, and any environment configurations required to execute the code
• Clear answers to the following questions:
- Which data files are being used?
- How are these files processed?
- What is the algorithm used and what are its main hyperparameters?
- Any other comments considered relevant to understanding and using the model

15. Solution submissions should be able to generate the exact output that gives the corresponding score on the leaderboard. If the score obtained from the code differs from what is shown on the leaderboard, the new score (which may be lower) will be used for the final rankings unless a logical explanation is provided. Please make sure to set the seed, random_state, etc., so we can obtain the same result from your code.

16. Solution submissions will also be used to generate output based on a validation dataset, generated in the same manner as the provided test and training sets and kept hidden from all participants, in order to verify that code was not customized for the provided dataset. This output will not be used to determine leaderboard position, but it may be used to disqualify a participant from receiving a prize if the output is judged to be severely inaccurate by bitgrit, CrowdPlat and NCATS.

17. In order to be eligible for the prize, a competition winner (whether an individual, group of individuals, or entity) must agree to grant to the NIH an irrevocable, paid-up, royalty-free, non-exclusive, worldwide license to reproduce, publish, post, link to, share, and display publicly the submission on the web or elsewhere, and a non-exclusive, non-transferable, irrevocable, paid-up license to practice, or have practiced for or on its behalf, the solution throughout the world. For more detailed information, please visit https://ncats.nih.gov/funding/challenges/litcoin.

18. Any prize awards are subject to verification of eligibility and compliance with these Participation Rules. Novelty and innovation of submissions may also affect the final ranking. All decisions of bitgrit, CrowdPlat and NCATS will be final and binding on all matters relating to this Competition.

19. Cash prizes will be paid directly by NIH/NCATS to the competition winners. In the case of a winning team, the money prize will be paid directly by NIH/NCATS to the Team Lead. Non-U.S. citizens and non-permanent residents are not eligible to receive a cash prize (in whole or in part). Their participation as part of a winning team, if applicable, may be recognized when the results are announced. Prizes awarded under this Challenge will be paid by electronic funds transfer and may be subject to local, state, federal and foreign tax reporting and withholding requirements. HHS/NIH will comply with Internal Revenue Service withholding and reporting requirements, where applicable.

20. If two or more participants have the same score on the leaderboard, an earlier submission will take precedence and be ranked higher than a later submission.

21. If you have any inquiries about participating in this competition, please don't hesitate to reach out to us at info@bitgrit.net. For questions about eligibility or prize distribution, email NCATS at litcoin-questions@mail.nih.gov