Training and evaluation corpora for the extraction of causal relationships encoded in biological expression language (BEL)

Success in extracting biological relationships is mainly dependent on the complexity of the task as well as the availability of high-quality training data. Here, we describe the new corpora in the systems biology modeling language BEL for training and testing biological relationship extraction syste...

Bibliographic Details
Main authors: Fluck, Juliane, Madan, Sumit, Ansari, Sam, Kodamullil, Alpha T., Karki, Reagon, Rastegar-Mojarad, Majid, Catlett, Natalie L., Hayes, William, Szostak, Justyna, Hoeng, Julia, Peitsch, Manuel
Format: Online Article Text
Language: English
Published: Oxford University Press 2016
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4995071/
https://www.ncbi.nlm.nih.gov/pubmed/27554092
http://dx.doi.org/10.1093/database/baw113
author Fluck, Juliane
Madan, Sumit
Ansari, Sam
Kodamullil, Alpha T.
Karki, Reagon
Rastegar-Mojarad, Majid
Catlett, Natalie L.
Hayes, William
Szostak, Justyna
Hoeng, Julia
Peitsch, Manuel
collection PubMed
description Success in extracting biological relationships depends mainly on the complexity of the task and on the availability of high-quality training data. Here, we describe the new corpora in the systems biology modeling language BEL for training and testing biological relationship extraction systems that we prepared for the BioCreative V BEL track. BEL was designed to capture relationships not only between proteins or chemicals, but also complex events such as biological processes or disease states. A BEL nanopub is the smallest unit of information and represents a biological relationship with its provenance. In BEL relationships (called BEL statements), the entities are normalized to defined namespaces mainly derived from public repositories, such as sequence databases, MeSH or publicly available ontologies. In the BEL nanopubs, the BEL statements are associated with citation information and supportive evidence such as a text excerpt. To enable the training of extraction tools, we prepared BEL resources and made them available to the community. We selected a subset of these resources focusing on a reduced set of namespaces, namely, human and mouse genes, ChEBI chemicals, MeSH diseases and GO biological processes, as well as relationship types ‘increases’ and ‘decreases’. The published training corpus contains 11 000 BEL statements from over 6000 supportive text excerpts. For method evaluation, we selected and re-annotated two smaller subcorpora containing 100 text excerpts. For this re-annotation, the inter-annotator agreement was measured by the BEL track evaluation environment and resulted in a maximal F-score of 91.18% for full statement agreement. In addition, for a set of 100 BEL statements, we provide not only the gold standard expert annotations, but also text excerpts pre-selected by two automated systems. Those text excerpts were evaluated and manually annotated as true or false supportive evidence in the course of the BioCreative V BEL track task. Database URL: http://wiki.openbel.org/display/BIOC/Datasets
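As an illustration of the description above, the following minimal Python sketch shows how a BEL nanopub (statement, citation, evidence) might be represented, and how an F-score for full statement agreement between two annotators could be computed. It is not the BEL track evaluation environment: the `Nanopub` class, the `f_score` function, the example statements, and the placeholder citation are all assumptions made for illustration, and agreement is simplified here to exact string matching of whole statements.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Nanopub:
    """Hypothetical container mirroring the description: a BEL statement
    together with its citation and a supporting text excerpt."""
    statement: str  # BEL statement with namespace-normalized entities
    citation: str   # provenance, e.g. a PubMed identifier
    evidence: str   # text excerpt that supports the statement


def f_score(reference: set, candidate: set) -> float:
    """F-score (harmonic mean of precision and recall) over exact matches
    of full statements. Exact matching is an assumption of this sketch."""
    true_positives = len(reference & candidate)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(candidate)
    recall = true_positives / len(reference)
    return 2 * precision * recall / (precision + recall)


# Invented example statements restricted to 'increases' and 'decreases'.
annotator_a = {
    'p(HGNC:TNF) increases bp(GOBP:"inflammatory response")',
    'a(CHEBI:aspirin) decreases p(HGNC:PTGS2)',
    'p(MGI:Il6) increases path(MESHD:Arthritis)',
}
annotator_b = {
    'p(HGNC:TNF) increases bp(GOBP:"inflammatory response")',
    'a(CHEBI:aspirin) decreases p(HGNC:PTGS2)',
    'p(MGI:Il6) decreases path(MESHD:Arthritis)',
}

example = Nanopub(
    statement='p(HGNC:TNF) increases bp(GOBP:"inflammatory response")',
    citation="PMID:0000000",  # placeholder, not a real reference
    evidence="TNF promotes the inflammatory response.",  # invented excerpt
)

agreement = f_score(annotator_a, annotator_b)
print(f"Full statement agreement (F-score): {agreement:.2%}")
```

Two of the three invented statements match exactly, so precision and recall are both 2/3. A real evaluation would probably also need to treat equivalent statements written differently as matches, which this sketch does not attempt.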
format Online
Article
Text
id pubmed-4995071
institution National Center for Biotechnology Information
language English
publishDate 2016
publisher Oxford University Press
record_format MEDLINE/PubMed
spelling pubmed-4995071 2016-08-24 Database (Oxford) Original Article Oxford University Press 2016-08-20 /pmc/articles/PMC4995071/ /pubmed/27554092 http://dx.doi.org/10.1093/database/baw113 Text en © The Author(s) 2016. Published by Oxford University Press. http://creativecommons.org/licenses/by/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
title Training and evaluation corpora for the extraction of causal relationships encoded in biological expression language (BEL)
topic Original Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4995071/
https://www.ncbi.nlm.nih.gov/pubmed/27554092
http://dx.doi.org/10.1093/database/baw113