Brief
This 2-phase competition is part of the NASA Tournament Lab, hosted by NCATS (The National Center for Advancing Translational Sciences) with contributions from the National Library of Medicine (NLM). These institutions, in collaboration with bitgrit and CrowdPlat, have come together to bring you this challenge, where you can deploy your data-driven technology solutions to accelerate scientific research in medicine and ensure that data from biomedical publications is maximally leveraged and reaches a wide range of biomedical researchers.
Each phase of the competition is designed to spur innovation in the field of natural language processing, asking competitors to design systems that can accurately recognize scientific concepts from the text of scientific articles, connect those concepts into knowledge assertions, and determine if that claim is a novel finding or background information.
Part 1: Given only an abstract text, the goal is to find all the nodes or biomedical entities (position in text and BioLink Model Category).
Part 2: Given the abstract and the nodes annotated from it, the goal is to find all the relationships between them (position in text and BioLink Model Predicate).
*NOTE: The prizes listed will be awarded based on a competitor’s combined, weighted scores from both phases of the competition. Please see the Rules section for more information.*
The National Center for Advancing Translational Sciences (NCATS, a center of the National Institutes of Health):
NCATS is conducting this challenge under the America Creating Opportunities to Meaningfully Promote Excellence in Technology, Education, and Science (COMPETES) Reauthorization Act of 2010. This challenge will spur innovation in NLP to advance the field and allow the generation of more accurate and useful data from biomedical publications, which will enhance the ability of data scientists to create tools that foster discovery and generate new hypotheses.
The National Center for Biotechnology Information (NCBI, part of the National Library of Medicine, a division of the National Institutes of Health):
NCBI intramural researchers and their collaborators have provided a corpus of annotated abstracts from published scientific research articles and knowledge assertions between these concepts, which will be provided to participants for training and testing purposes.
CrowdPlat (Project Company):
The LitCoin project was awarded to and is being managed by CrowdPlat under NASA's NOIS2 contract. Located in San Jose, California, CrowdPlat provides crowdsourcing solutions to medium- and large-scale enterprises seeking project execution through a crowdsourced talent pool.
Prizes
1st Prize ($35,000)
2nd Prize ($25,000)
3rd Prize ($20,000)
4th Prize ($5,000)
5th Prize ($5,000)
6th Prize ($5,000)
7th Prize ($5,000)
The prize money displayed is the total prize for both phases of the LitCoin NLP Challenge. Please see the Rules section for more info!
Timeline
- 23 Dec 2021 Competition Phase-1 Ended
- 27 Dec 2021 Competition Phase-2 Starts
- 28 Feb 2022 Competition Phase-2 Ends
- 08 Apr 2022 Winners Announced (Subject to change based on submission results)
Data Breakdown
The goal of the second part of the LitCoin NLP Challenge is to identify all the relations between biomedical entities within a research paper’s title and abstract.
The type of biomedical relation comes from the BioLink Model Predicates, and can be one and only one of the following:
・Association
・Positive Correlation
・Negative Correlation
・Bind
・Cotreatment
・Comparison
・Drug Interaction
・Conversion
To properly understand these predicates it would be helpful to get familiar with the BioLink Model and biomedical ontologies in general.
The BioLink Model is a high-level data model of biological entities (genes, diseases, phenotypes, pathways, individuals, substances, etc.) and their associations. It can be thought of as a high-level biomedical ontology that helps categorize and associate concepts that might come from different lower-level biomedical ontologies.
Ontologies are controlled vocabularies that allow describing the meaning of data (its semantics) in a human- and machine-readable way. They have been widely used in the biomedical area to help solve the issue of data heterogeneity, enabling advanced data analysis, knowledge organization and reasoning.
To better understand the BioLink Model and learn more about biomedical ontologies it would be helpful to take a look at the following links:
・BioLink Model:
https://biolink.github.io/biolink-model/
http://tree-viz-biolink.herokuapp.com
・Ontologies
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4300097/
https://disease-ontology.org
https://www.omim.org/
https://www.bioontology.org/
The number of relations in each abstract varies depending on the text and all relations should be identified.
The available data is the same as that used for the first part of the competition, but now also includes the entities for the test set:
abstracts_train.csv, entities_train.csv, relations_train.csv, abstracts_test.csv and entities_test.csv, which are described below. All of these are CSV files delimited with tabs instead of commas.
・abstracts_train.csv: CSV file containing research papers that can be used for training.
# abstract_id: PubMed ID of the research paper.
# title: title of the research paper.
# abstract: abstract or summary of the research paper.
・entities_train.csv: CSV file containing all the entities' mentions found in the texts from abstracts_train.csv that can be used for training.
# id: unique ID of the entity's mention
# abstract_id: PubMed ID of the research paper where the entity's mention appears.
# offset_start: position of the character where the entity's mention substring begins in the text (title + abstract).
# offset_finish: position of the character where the entity's mention substring ends in the text (title + abstract).
# type: type of entity as one of the 6 possible categories mentioned in the first part of the competition.
# mention: substring representing the actual entity's mention in the text. Can also be extracted using the offsets and the input text.
# entity_ids*: comma separated external IDs from a biomedical ontology to specifically identify the entity.
*The ontologies used to obtain these external IDs are the following:
Gene: NCBI Gene
Disease: MEDIC (which is MeSH + OMIM)
Chemical: MeSH
Variant: RS# in dbSNP
Species: NCBI Taxonomy
CellLine: NCBI Taxonomy
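Since the mention column can also be reconstructed from the offsets, it is worth sanity-checking the two against each other. The sketch below uses tiny in-memory stand-ins for the tab-delimited files (a real run would read abstracts_train.csv and entities_train.csv with sep="\t"); the example row, the GeneOrGeneProduct type value, and the assumption that offsets index into the title and abstract joined by a single space are all illustrative and should be verified against the actual data.

```python
import io
import pandas as pd

# Tiny in-memory stand-ins for the tab-delimited files (real runs would
# read abstracts_train.csv / entities_train.csv with sep="\t").
abstracts = pd.read_csv(io.StringIO(
    "abstract_id\ttitle\tabstract\n"
    "123\tBRCA1 and cancer\tBRCA1 mutations increase risk.\n"
), sep="\t")
entities = pd.read_csv(io.StringIO(
    "id\tabstract_id\toffset_start\toffset_finish\ttype\tmention\n"
    "1\t123\t0\t5\tGeneOrGeneProduct\tBRCA1\n"
), sep="\t")

# Offsets are described as indexing into "title + abstract"; the exact
# joining (e.g. a separating space) should be checked against the data.
abstracts["text"] = abstracts["title"] + " " + abstracts["abstract"]
merged = entities.merge(abstracts[["abstract_id", "text"]], on="abstract_id")

# Slice each mention out of the text and compare with the mention column.
extracted = merged.apply(
    lambda r: r["text"][r["offset_start"]:r["offset_finish"]], axis=1
)
assert (extracted == merged["mention"]).all()
```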
・relations_train.csv: CSV file containing all the relations found in the abstracts that can be used for training.
# id: unique ID of the relation
# abstract_id: PubMed ID of the research paper where the relation appears.
# type: type or predicate connecting the two entities.
# entity_1_id: external ID of the entity that corresponds to the subject of the relation.
# entity_2_id: external ID of the entity that corresponds to the object of the relation.
# novel: whether the relation found corresponds to a novel discovery or not.
・abstracts_test.csv: CSV file containing research papers whose relations between entities have to be identified with a trained model.
# abstract_id: PubMed ID of the research paper.
# title: title of the research paper.
# abstract: abstract or summary of the research paper.
・entities_test.csv: CSV file containing all the entities' mentions found in the texts from abstracts_test.csv that can be used to identify the test relations and create the submission file.
# id: unique ID of the entity's mention
# abstract_id: PubMed ID of the research paper where the entity's mention appears.
# offset_start: position of the character where the entity's mention substring begins in the text (title + abstract).
# offset_finish: position of the character where the entity's mention substring ends in the text (title + abstract).
# type: type of entity as one of the 6 possible categories mentioned in the first part of the competition.
# mention: substring representing the actual entity's mention in the text. Can also be extracted using the offsets and the input text.
# entity_ids: comma separated external IDs from a biomedical ontology to specifically identify the entity.
・submission_example_2.csv: CSV file containing an example of what a submission file (output) should look like in order to be uploaded and scored on the platform. It is similar to the format of relations_train.csv. Please note that this is just a small example where types and entity_ids have been randomly selected, and the number of relations per abstract might be much smaller than usual. The column order must be respected, as well as using tab as a delimiter and including a header with the column titles. Failure to comply with this format will result in an error or a lower score.
# id: unique ID of the relation
# abstract_id: PubMed ID of the research paper where the relation appears.
# type: type or predicate connecting the two entities.
# entity_1_id: external ID of the entity that corresponds to the subject of the relation.
# entity_2_id: external ID of the entity that corresponds to the object of the relation.
# novel: whether the relation found corresponds to a novel discovery or not.
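As a minimal sketch, a submission with the required tab delimiter, header, and column order could be written as below. The placeholder row is invented for illustration, and the exact encoding of the novel column (e.g. Novel/No) should be checked against relations_train.csv and the provided submission example.

```python
import csv
import io

# Columns in the required order (from the description above).
COLUMNS = ["id", "abstract_id", "type", "entity_1_id", "entity_2_id", "novel"]

# Placeholder predictions -- real values would come from your model.
rows = [
    {"id": 1, "abstract_id": 123, "type": "Association",
     "entity_1_id": "D001943", "entity_2_id": "672", "novel": "Novel"},
]

# io.StringIO is used here for illustration; write to a file for a real run.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=COLUMNS, delimiter="\t")
writer.writeheader()          # header with the column titles is required
writer.writerows(rows)        # tab-delimited, columns in the required order
print(buf.getvalue(), end="")
```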
The evaluation metric for this problem is a modified version of the Jaccard Similarity Score:
・For each abstract_id A, a set of predicted relations P and a set of correct relations O, the formula is: |P⋂O| / (|P| + |O| - |P⋂O|), where ⋂ means intersection, || means length (amount of relations for that abstract_id).
・Matching relations (for the intersection) are determined in the following way: a match between a predicted relation and a correct relation is represented as an "intersection score" between 0 and 1, under the formula intersection_score = 0.25 x {correct pair of entities (irrespective of order)} + 0.5 x {correct pair of entities and correct type}* + 0.25 x {correct pair of entities and correct novelty}. For example, given a correct relation from an abstract, if there is a predicted relation with the same pair of entities, type and novelty, then the "intersection score" of that match is 1. If there is only a predicted relation that contains the same pair of entities, but not the same type or novelty, then the "intersection score" of that match is 0.25.
*In the particular case of the types Positive_Correlation and Negative_Correlation, if there is a match on the pair of entities but there is a misclassification between these 2 types (e.g. Positive_Correlation is indicated instead of Negative_Correlation or vice versa), the score given for the type component will be 0.175 instead of 0.5.
The Jaccard similarity scores of all abstract_ids are then averaged to return the final score.
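The scoring described above can be sketched as follows. This is a non-official reimplementation from the stated formula: entity pairs are treated as unordered, the 0.175 partial credit applies when Positive_Correlation and Negative_Correlation are swapped, and golds are greedily matched to their best unused prediction (the official matching procedure is not specified here, so the greedy step is an assumption).

```python
def intersection_score(pred, gold):
    """Score one predicted relation against one gold relation (0..1)."""
    if frozenset((pred["e1"], pred["e2"])) != frozenset((gold["e1"], gold["e2"])):
        return 0.0
    score = 0.25  # correct (unordered) pair of entities
    if pred["type"] == gold["type"]:
        score += 0.5
    elif {pred["type"], gold["type"]} == {"Positive_Correlation",
                                          "Negative_Correlation"}:
        score += 0.175  # the two correlation types mixed up
    if pred["novel"] == gold["novel"]:
        score += 0.25
    return score

def abstract_jaccard(preds, golds):
    """Modified Jaccard for one abstract: |P∩O| / (|P| + |O| - |P∩O|)."""
    inter, used = 0.0, set()
    for g in golds:
        # Greedily match each gold relation to its best unused prediction.
        best, best_i = 0.0, None
        for i, p in enumerate(preds):
            if i not in used and intersection_score(p, g) > best:
                best, best_i = intersection_score(p, g), i
        if best_i is not None:
            used.add(best_i)
            inter += best
    denom = len(preds) + len(golds) - inter
    return inter / denom if denom else 0.0

gold = [{"e1": "A", "e2": "B", "type": "Association", "novel": "Novel"}]
print(abstract_jaccard(gold, gold))  # exact match scores 1.0
```

Per the description, these per-abstract scores would then be averaged across all abstract_ids to obtain the final score.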
Final competition results are based on competitors' combined, weighted scores from both phases of the competition: 30% of the total score will be determined by problem statement 1 and 70% of the total score will be determined by problem statement 2.
FAQs
Who do I contact if I need help regarding a competition?
If you have any inquiries about participating in this competition, please don’t hesitate to reach out to us at [email protected]. For questions about eligibility or prize distribution, email NCATS at [email protected]
How will I know if I’ve won?
If you are one of the top seven winners for this competition, we will email you with the final result and information about how to claim your reward.
How can I report a bug?
Please shoot us an email at [email protected] with details and a description of the bug you are facing, and if possible, please attach a screenshot of the bug itself.
If I win, how can I receive my reward?
The money prize will be awarded by NIH/NCATS directly to the winner (if an individual) or Team Lead of the winning team (if a team). Prizes awarded under this Challenge will be paid by electronic funds transfer and may be subject to Federal income taxes. HHS/NIH will comply with the Internal Revenue Service withholding and reporting requirements, where applicable.
How is novelty defined in the dataset?
Novelty tags were generated by curators based entirely on the abstract as written, without doing an exhaustive search into the history of the work. In other words, when the curators were looking over this abstract, the language used in it suggested to them that this finding was novel, so these tags are based purely on context within the abstract.
Rules
1. This competition is governed by the following Terms of Participation (“Participation Rules”). Participants must agree to and comply with the Participation Rules to compete.
2. This competition consists of 2 problem statements, herein considered as competition sub-phases. Winners will be determined by a weighted average of scores from the two competition phases: 30% of the total score will be determined by problem statement 1 and 70% of the total score will be determined by problem statement 2.
3. The competition dates are detailed below:
Phase 1 Start Date: November 9th, 2021
Phase 1 Closing Date: December 23rd, 2021
Phase 2 Start Date: December 27th, 2021
Phase 2 Closing Date: February 28th, 2022
Submission (Final Source Code): March 11th, 2022
Winners Announced: April 8th, 2022
4. Participants are allowed to participate in an individual capacity or as part of a team.
5. Merging teams midway through the competition is not allowed.
6. Each participant may only be a member of a single team and may not participate as an individual and on a team simultaneously.
7. In order to participate in this competition and be eligible for the prize money, participants must be a U.S. citizen or a U.S. permanent resident. Non-U.S. citizens and non-permanent residents can participate as well, as a member of a team that includes a citizen or permanent resident of the U.S, or they can participate on their own. However, such non-U.S. citizens and non-permanent residents are not eligible to win a monetary prize (in whole or in part). Their participation as part of a winning team, if applicable, may be recognized when the results are announced. Similarly, if participating on their own, they may be eligible to win a non-cash recognition prize. Proof of citizenship and permanent residency will be required. For more information on competition eligibility requirements, please see https://ncats.nih.gov/funding/challenges/litcoin
8. In the case of a team participation, all submissions must be made by the team lead.
9. The use of external datasets for the purposes of training is allowed, but submissions must be generated using the test corpus provided.
10. During the competition period, participants will be allowed to make a maximum of 5 submissions per day. If participants exceed the set submission limit, the platform will reset to allow an additional 5 submissions the following day. Please keep this in mind when uploading a submission file. Any attempt to circumvent stated limits will result in disqualification.
11. Participants are not permitted to share or upload the competition dataset to any platform outside of competition. Participants that do not comply with the confidentiality regulations of the competition will be disqualified.
12. The top seven (7) winning participants will be eligible to receive a competition prize (ranked by performance) after we have received, successfully executed, and confirmed the validity of both the code and the solution (See 14.). In order to ensure that at least 7 participants may be awarded prizes, the top fifteen (15) individuals/teams will be asked to submit their source code for evaluation (see 13.).
13. Once potential competition winners are determined and our team reaches out to them, the top scoring participants must provide the following by March 11, 2022 for evaluation to be qualified as competition winner(s) and receive their prize:
a. Winning Model Documentation template filled in (this document is available on the “Resources” tab on the competition page)
b. All source files required to preprocess the data
c. All source files required to build, train and make predictions with the model using the processed data
d. A requirements.txt (or equivalent) file indicating all the required libraries and their versions as needed
e. A ReadMe file containing the following:
• Clear and unambiguous instructions on how to reproduce the predictions from start to finish including data pre-processing, feature extraction, model training and predictions generation
• Environment details regarding where the model was developed and trained, including OS, memory (RAM), disk space, CPU/GPU used, and any required environment configurations required to execute the code
• Clear answers to the following questions:
- Which data files are being used?
- How are these files processed?
- What is the algorithm used and what are its main hyperparameters?
- Any other comments considered relevant to understanding and using the model
14. Solution submissions should be able to generate the exact output that gives the corresponding score on the leaderboard. If the score obtained from the code is different from what’s shown on the leaderboard, the new score (which may be lower) will be used for the final rankings unless a logical explanation is provided. Please make sure to set the seed or random_state etc. so we can obtain the same result from your code.
15. Solution submissions will also be used to generate output based on a validation dataset, generated in the same manner with which the provided test and training sets were generated, which will be kept hidden from all participants, in order to verify that code was not customized for the provided dataset. This output will not be used to determine leaderboard position, but could be used to disqualify a participant from receiving a prize if the output is judged to be severely inaccurate by bitgrit, CrowdPlat and NCATS.
16. In order to be eligible for the prize, a competition winner (whether an individual, group of individuals, or entity) must agree to grant to the NIH an irrevocable, paid-up, royalty-free non-exclusive worldwide license to reproduce, publish, post, link to, share, and display publicly the submission on the web or elsewhere, and a non-exclusive, non-transferable, irrevocable, paid-up license to practice or have practiced for or on its behalf, the solution throughout the world. For more detailed information, please visit https://ncats.nih.gov/funding/challenges/litcoin.
17. Any prize awards are subject to verification of eligibility and compliance with these Participation Rules. Novelty and innovation of submissions may also affect the final ranking. All decisions of bitgrit, CrowdPlat and NCATS will be final and binding on all matters relating to this Competition.
18. Cash prizes will be paid directly by NIH/NCATS to the competition winners. In the case of a winning team, the money prize will be paid directly by NIH/NCATS to the Team Lead. Non-U.S. citizens and non-permanent residents are not eligible to receive a cash prize (in whole or in part). Their participation as part of a winning team, if applicable, may be recognized when the results are announced.
Prizes awarded under this Challenge will be paid by electronic funds transfer and may be subject to local, state, federal and foreign tax reporting and withholding requirements. HHS/NIH will comply with the Internal Revenue Service withholding and reporting requirements, where applicable.
19. If two or more participants have the same score on the leaderboard, an earlier submission will take precedence and be ranked higher than a later submission.
20. If you have any inquiries about participating in this competition, please don’t hesitate to reach out to us at [email protected]. For questions about eligibility or prize distribution, email NCATS at [email protected]
Terms of Participation
Agreement regarding confidential information and competition rules
These Terms of Participation (“Agreement”) are hereby entered into on the date of your participation conditional upon your agreement to these terms (“Effective Date”) between you (“Participant”), as a participant in the LitCoin NLP Challenge: Part 2 competition (the “Competition”) hosted at bitgrit.net (the “Competition Site”), and bitgrit Inc. (“bitgrit”).
IMPORTANT, READ CAREFULLY: Your participation in the Competition on the above Competition Site is conditional upon your comprehension of, compliance with, and acceptance of these terms. Please review thoroughly before accepting.
I. General Clauses
1. This competition consists of 2 problem statements, herein considered as competition sub-phases. Winners will be determined by a weighted average of scores from the two competition phases: 30% of the total score will be determined by problem statement 1 and 70% of the total score will be determined by problem statement 2.
2. Participants are allowed to participate in an individual capacity or as part of a team.
3. Merging teams midway through the competition is not allowed.
4. Each participant may only be a member of a single team and may not participate as an individual and on a team simultaneously.
5. In order to be eligible for the prize money, participants must be a U.S. citizen or a U.S. permanent resident. Non-U.S. citizens and non-permanent residents can participate as well, as a member of a team that includes a citizen or permanent resident of the U.S., or they can participate on their own. However, such non-U.S. citizens and non-permanent residents are not eligible to win a monetary prize (in whole or in part). Their participation as part of a winning team, if applicable, may be recognized when the results are announced. Similarly, if participating on their own, they may be eligible to win a non-cash recognition prize. Proof of citizenship and permanent residency will be required. For more information on competition eligibility requirements, please see https://ncats.nih.gov/funding/challenges/litcoin
6. In the case of a team participation, all submissions must be made by the team lead.
7. Participants are not permitted to share or upload the competition dataset to any platform outside of competition. Participants that do not comply with the confidentiality regulations of the competition will be disqualified.
8. All participants who are under the age of 18, or are considered a minor in the country they live in, are required to submit a signed copy of the parent/legal guardian consent form. This form can be found at https://ncats.nih.gov/files/LitCoin-Parental-Consent-Form-508.pdf. Signed forms can be sent to [email protected]
9. The top seven (7) winning participants will be eligible to receive a competition prize (ranked by performance) after we have received, successfully executed, and confirmed the validity of both the code and the solution. In order to ensure that at least 7 participants may be awarded prizes, the top fifteen (15) individuals/teams will be asked to submit their source code for evaluation.
10. Any prize awards are subject to verification of eligibility and compliance with these Terms of Participation. Novelty and innovation of submissions may also affect the final ranking. All decisions of bitgrit, CrowdPlat and NCATS will be final and binding on all matters relating to this Competition.
11. Cash prizes will be paid directly by NIH/NCATS to the competition winners. In the case of a winning team, the money prize will be paid directly by NIH/NCATS to the Team Lead. Non-U.S. citizens and non-permanent residents are not eligible to receive a cash prize (in whole or in part). Their participation as part of a winning team, if applicable, may be recognized when the results are announced.
Payments to winners may be subject to local, state, federal and foreign tax reporting and withholding requirements.
II. Clauses of Non-Disclosure
1. Confidential Information
(1) Confidential Information shall mean any and all information disclosed by bitgrit to the Participant with regard to the entry and participation in the Competition, including (i) metadata, source code, object code, firmware etc. and, in addition to these, (ii) analyses, compilations or any other deliverable produced by the Participant in which such disclosed information is utilized or reflected.
(2) Confidential Information shall not include information which;
(a) is now or hereafter becomes, through no act or omission of the Participant, generally known or available to the public, or enters the public domain;
(b) is acquired by the Participant before receiving such information from bitgrit and such acquisition was without restriction as to the use or disclosure of the same;
(c) is hereafter rightfully furnished to the participant by a third party, without restriction as to use or disclosure of the same.
2. Non-Disclosure Obligation
The Participant agrees:
(a) to hold Confidential Information in strict confidence;
(b) to exercise at least the same care in protecting Confidential Information from disclosure as the party uses with regard to its own confidential information;
(c) not to use any Confidential Information except for as it concerns the Purpose elaborated upon above;
(d) not to disclose such Confidential Information to third parties;
(e) to inform bitgrit if it becomes aware of an unauthorized disclosure of Confidential Information.
3. No Warranty
All Confidential Information is provided “as is.” bitgrit makes no representation, warranty, or assurance of any kind to the Participant regarding the Confidential Information.
4. No Assignment of Rights
The Participant agrees that nothing contained in this Agreement shall be construed as conferring, transferring or granting any rights to the Participant, by license or otherwise, to use any of the Confidential Information.
III Rights to Deliverables
1. Transferable rights / Licenses
In order to be eligible for the prize, a competition winner (whether an individual, group of individuals, or entity) must agree to grant to the NIH an irrevocable, paid-up, royalty-free non-exclusive worldwide license to reproduce, publish, post, link to, share, and display publicly the submission on the web or elsewhere, and a non-transferable, irrevocable, paid-up, royalty-free non-exclusive worldwide license to practice or have practiced for or on its behalf, the solution. For more detailed information, please visit https://ncats.nih.gov/funding/challenges/litcoin.
2. Restrictions on Use
The Participant hereby agrees to not utilize Submitted Algorithms to or for businesses, business endeavors, products, or services in competition with bitgrit or with the Competition co-host.
3. Authorization of Non-compensatory Use
The Participant hereby authorizes and consents to bitgrit and/or relevant third parties utilizing, analyzing, altering, or further reauthorizing the use of the Submitted Algorithm(s) to other third parties and will not make claims or demands for monetary compensation in regard to the above purposes.
4. Representations and Warranties
The Participant hereby declares and warrants that the Participant’s, bitgrit’s, and the related third party’s use of the Submitted Algorithms does not violate or infringe upon the intellectual property rights, business secrets, or other rights of any other third party.
5. Warranty Against Exercising of Moral Rights
The Participant agrees to not exercise moral rights to bitgrit or to related third parties in regard to the Submitted Algorithms.
6. Rights Regarding Modified and Derivative Works
The Participant hereby agrees that Intellectual Property Rights and other rights regarding any modified or derivative works created from the Submitted Algorithms shall belong to the creator of that modified or derivative work.