Training and evaluation corpora for the extraction of causal relationships encoded in biological expression language (BEL)
Success in extracting biological relationships is mainly dependent on the complexity of the task as well as the availability of high-quality training data. Here, we describe the new corpora in the systems biology modeling language BEL for training and testing biological relationship extraction syste...
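Main authors: | Fluck, Juliane; Madan, Sumit; Ansari, Sam; Kodamullil, Alpha T.; Karki, Reagon; Rastegar-Mojarad, Majid; Catlett, Natalie L.; Hayes, William; Szostak, Justyna; Hoeng, Julia; Peitsch, Manuel |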
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Oxford University Press 2016 |
Subjects: | |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4995071/ https://www.ncbi.nlm.nih.gov/pubmed/27554092 http://dx.doi.org/10.1093/database/baw113 |
_version_ | 1782449413885526016 |
---|---|
author | Fluck, Juliane Madan, Sumit Ansari, Sam Kodamullil, Alpha T. Karki, Reagon Rastegar-Mojarad, Majid Catlett, Natalie L. Hayes, William Szostak, Justyna Hoeng, Julia Peitsch, Manuel |
author_sort | Fluck, Juliane |
collection | PubMed |
description | Success in extracting biological relationships is mainly dependent on the complexity of the task as well as the availability of high-quality training data. Here, we describe the new corpora in the systems biology modeling language BEL for training and testing biological relationship extraction systems that we prepared for the BioCreative V BEL track. BEL was designed to capture relationships not only between proteins or chemicals, but also complex events such as biological processes or disease states. A BEL nanopub is the smallest unit of information and represents a biological relationship with its provenance. In BEL relationships (called BEL statements), the entities are normalized to defined namespaces mainly derived from public repositories, such as sequence databases, MeSH or publicly available ontologies. In the BEL nanopubs, the BEL statements are associated with citation information and supportive evidence such as a text excerpt. To enable the training of extraction tools, we prepared BEL resources and made them available to the community. We selected a subset of these resources focusing on a reduced set of namespaces, namely, human and mouse genes, ChEBI chemicals, MeSH diseases and GO biological processes, as well as relationship types ‘increases’ and ‘decreases’. The published training corpus contains 11 000 BEL statements from over 6000 supportive text excerpts. For method evaluation, we selected and re-annotated two smaller subcorpora containing 100 text excerpts. For this re-annotation, the inter-annotator agreement was measured by the BEL track evaluation environment and resulted in a maximal F-score of 91.18% for full statement agreement. In addition, for a set of 100 BEL statements, we do not only provide the gold standard expert annotations, but also text excerpts pre-selected by two automated systems. Those text excerpts were evaluated and manually annotated as true or false supportive in the course of the BioCreative V BEL track task. Database URL: http://wiki.openbel.org/display/BIOC/Datasets |
format | Online Article Text |
id | pubmed-4995071 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2016 |
publisher | Oxford University Press |
record_format | MEDLINE/PubMed |
spelling | pubmed-4995071 2016-08-24 Database (Oxford) Original Article Oxford University Press 2016-08-20 /pmc/articles/PMC4995071/ /pubmed/27554092 http://dx.doi.org/10.1093/database/baw113 Text en © The Author(s) 2016. Published by Oxford University Press. http://creativecommons.org/licenses/by/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited. |
title | Training and evaluation corpora for the extraction of causal relationships encoded in biological expression language (BEL) |
topic | Original Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4995071/ https://www.ncbi.nlm.nih.gov/pubmed/27554092 http://dx.doi.org/10.1093/database/baw113 |
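
The description above says that a BEL nanopub combines a BEL statement with its citation and a supporting text excerpt, and that inter-annotator agreement was reported as an F-score for full statement agreement. The Python sketch below illustrates both ideas under stated assumptions: the `Nanopub` class, its field names, the citation strings, and the example statements are hypothetical and are not taken from the corpus, and `f_score` is an ordinary balanced F-score over exact string matches, which may differ from how the BEL track evaluation environment actually scores statements.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Nanopub:
    """One BEL statement with its provenance (field names are illustrative)."""
    statement: str  # BEL statement text
    citation: str   # placeholder for a literature reference
    evidence: str   # supporting text excerpt


def f_score(gold: set[str], predicted: set[str]) -> float:
    """Balanced F-score, counting only exact full-statement matches."""
    true_positives = len(gold & predicted)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


# Invented annotations from two hypothetical annotators for the same excerpts.
annotator_a = [
    Nanopub('p(HGNC:TNF) increases bp(GOBP:"apoptotic process")', "ref-1", "..."),
    Nanopub("a(CHEBI:glucose) increases p(HGNC:INS)", "ref-2", "..."),
    Nanopub("p(HGNC:AKT1) decreases p(HGNC:FOXO3)", "ref-3", "..."),
]
annotator_b = [
    Nanopub('p(HGNC:TNF) increases bp(GOBP:"apoptotic process")', "ref-1", "..."),
    Nanopub("a(CHEBI:glucose) increases p(HGNC:INS)", "ref-2", "..."),
    Nanopub("p(HGNC:AKT1) increases p(HGNC:FOXO3)", "ref-3", "..."),
]

statements_a = {n.statement for n in annotator_a}
statements_b = {n.statement for n in annotator_b}

# Two of three statements match exactly; the third differs in relationship type.
agreement = f_score(statements_a, statements_b)
print(f"{agreement:.4f}")  # prints 0.6667
```

Exact string matching is the strictest possible comparison: two statements that differ only in the relationship type, as in the third pair, count as a full disagreement.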