
RegEl corpus: identifying DNA regulatory elements in the scientific literature

High-throughput technologies led to the generation of a wealth of data on regulatory DNA elements in the human genome. However, results from disease-driven studies are primarily shared in textual form as scientific articles. Information extraction (IE) algorithms allow this information to be (semi-)...


Bibliographic Details
Main authors: Garda, Samuele; Lenihan-Geels, Freyda; Proft, Sebastian; Hochmuth, Stefanie; Schülke, Markus; Seelow, Dominik; Leser, Ulf
Format: Online Article Text
Language: English
Published: Oxford University Press 2022
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9235371/
https://www.ncbi.nlm.nih.gov/pubmed/35758881
http://dx.doi.org/10.1093/database/baac043
author Garda, Samuele
Lenihan-Geels, Freyda
Proft, Sebastian
Hochmuth, Stefanie
Schülke, Markus
Seelow, Dominik
Leser, Ulf
author_sort Garda, Samuele
collection PubMed
description High-throughput technologies led to the generation of a wealth of data on regulatory DNA elements in the human genome. However, results from disease-driven studies are primarily shared in textual form as scientific articles. Information extraction (IE) algorithms allow this information to be (semi-)automatically accessed. Their development, however, is dependent on the availability of annotated corpora. Therefore, we introduce RegEl (Regulatory Elements), the first freely available corpus annotated with regulatory DNA elements comprising 305 PubMed abstracts for a total of 2690 sentences. We focus on enhancers, promoters and transcription factor binding sites. Three annotators worked in two stages, achieving an overall 0.73 F1 inter-annotator agreement and 0.46 for regulatory elements. Depending on the entity type, IE baselines reach F1-scores of 0.48–0.91 for entity detection and 0.71–0.88 for entity normalization. Next, we apply our entity detection models to the entire PubMed collection and extract co-occurrences of genes or diseases with regulatory elements. This generates large collections of regulatory elements associated with 137 870 unique genes and 7420 diseases, which we make openly available. Database URL: https://zenodo.org/record/6418451#.YqcLHvexVqg
format Online
Article
Text
id pubmed-9235371
institution National Center for Biotechnology Information
language English
publishDate 2022
publisher Oxford University Press
record_format MEDLINE/PubMed
spelling pubmed-9235371 2022-06-28 RegEl corpus: identifying DNA regulatory elements in the scientific literature Garda, Samuele Lenihan-Geels, Freyda Proft, Sebastian Hochmuth, Stefanie Schülke, Markus Seelow, Dominik Leser, Ulf Database (Oxford) Original Article High-throughput technologies led to the generation of a wealth of data on regulatory DNA elements in the human genome. However, results from disease-driven studies are primarily shared in textual form as scientific articles. Information extraction (IE) algorithms allow this information to be (semi-)automatically accessed. Their development, however, is dependent on the availability of annotated corpora. Therefore, we introduce RegEl (Regulatory Elements), the first freely available corpus annotated with regulatory DNA elements comprising 305 PubMed abstracts for a total of 2690 sentences. We focus on enhancers, promoters and transcription factor binding sites. Three annotators worked in two stages, achieving an overall 0.73 F1 inter-annotator agreement and 0.46 for regulatory elements. Depending on the entity type, IE baselines reach F1-scores of 0.48–0.91 for entity detection and 0.71–0.88 for entity normalization. Next, we apply our entity detection models to the entire PubMed collection and extract co-occurrences of genes or diseases with regulatory elements. This generates large collections of regulatory elements associated with 137 870 unique genes and 7420 diseases, which we make openly available. Database URL: https://zenodo.org/record/6418451#.YqcLHvexVqg Oxford University Press 2022-06-27 /pmc/articles/PMC9235371/ /pubmed/35758881 http://dx.doi.org/10.1093/database/baac043 Text en © The Author(s) 2022. Published by Oxford University Press.
https://creativecommons.org/licenses/by-nc/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution-NonCommercial License (https://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact journals.permissions@oup.com
title RegEl corpus: identifying DNA regulatory elements in the scientific literature
topic Original Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9235371/
https://www.ncbi.nlm.nih.gov/pubmed/35758881
http://dx.doi.org/10.1093/database/baac043