Accelerated variant curation from scientific literature using biomedical text mining

Biological databases collect and standardize data through biocuration. Even though major model organism databases have adopted some automation of curation methods, a large portion of biocuration is still performed manually. To speed up the extraction of the genomic positions of variants, we have dev...

Bibliographic Details
Main Authors: Mallick, Rishab; Arnaboldi, Valerio; Davis, Paul; Diamantakis, Stavros; Zarowiecki, Magdalena; Howe, Kevin
Format: Online Article Text
Language: English
Published: Caltech Library 2022
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9160977/
https://www.ncbi.nlm.nih.gov/pubmed/35663412
http://dx.doi.org/10.17912/micropub.biology.000578
_version_ 1784719386265780224
author Mallick, Rishab
Arnaboldi, Valerio
Davis, Paul
Diamantakis, Stavros
Zarowiecki, Magdalena
Howe, Kevin
collection PubMed
description Biological databases collect and standardize data through biocuration. Even though major model organism databases have adopted some automation of curation methods, a large portion of biocuration is still performed manually. To speed up the extraction of the genomic positions of variants, we have developed a hybrid approach that combines regular expressions, Named Entity Recognition based on BERT (Bidirectional Encoder Representations from Transformers) and bag-of-words to extract variant genomic locations from C. elegans papers for WormBase. Our model has a precision of 82.59% for the gene-mutation matches tested on extracted text from 100 papers, and even recovers some data not discovered during manual curation. Code at: https://github.com/WormBase/genomic-info-from-papers
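The regular-expression part of the hybrid approach described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the patterns, the function names `extract_pairs` and `precision`, and the sample sentence are hypothetical and are not taken from the WormBase code. The BERT-based Named Entity Recognition and bag-of-words components are not shown.

```python
import re

# Hypothetical patterns, for illustration only (not the authors' patterns).
GENE_RE = re.compile(r"\b[a-z]{3,4}-\d+\b")      # gene names such as daf-2, unc-54
ALLELE_RE = re.compile(r"\b[a-z]{1,3}\d+\b")     # allele names such as e1370, ok1255
PROTEIN_CHANGE_RE = re.compile(                  # amino-acid changes such as P1465S
    r"\b[ACDEFGHIKLMNPQRSTVWY]\d+[ACDEFGHIKLMNPQRSTVWY]\b"
)


def extract_pairs(text):
    """Pair every gene mention with every mutation mention in the same sentence."""
    pairs = []
    # Naive sentence split: break after '.', '!' or '?' followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        genes = GENE_RE.findall(sentence)
        mutations = ALLELE_RE.findall(sentence) + PROTEIN_CHANGE_RE.findall(sentence)
        for gene in genes:
            for mutation in mutations:
                pairs.append((gene, mutation))
    return pairs


def precision(predicted, correct):
    """Fraction of distinct predicted pairs that appear in the curated set."""
    predicted = set(predicted)
    if not predicted:
        return 0.0
    return len(predicted & set(correct)) / len(predicted)


# Illustrative sentence, not taken from any of the evaluated papers.
sample = ("The daf-2(e1370) allele carries a P1465S substitution. "
          "Animals were grown at 20 degrees.")
pairs = extract_pairs(sample)
```

Precision here is the share of extracted gene-mutation pairs that a curator would accept, which is the kind of figure the abstract reports; a same-sentence co-occurrence rule like this one would need further filtering to be useful on full papers.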
format Online
Article
Text
id pubmed-9160977
institution National Center for Biotechnology Information
language English
publishDate 2022
publisher Caltech Library
record_format MEDLINE/PubMed
spelling pubmed-9160977 2022-06-03 Accelerated variant curation from scientific literature using biomedical text mining Mallick, Rishab; Arnaboldi, Valerio; Davis, Paul; Diamantakis, Stavros; Zarowiecki, Magdalena; Howe, Kevin MicroPubl Biol New Methods Biological databases collect and standardize data through biocuration. Even though major model organism databases have adopted some automation of curation methods, a large portion of biocuration is still performed manually. To speed up the extraction of the genomic positions of variants, we have developed a hybrid approach that combines regular expressions, Named Entity Recognition based on BERT (Bidirectional Encoder Representations from Transformers) and bag-of-words to extract variant genomic locations from C. elegans papers for WormBase. Our model has a precision of 82.59% for the gene-mutation matches tested on extracted text from 100 papers, and even recovers some data not discovered during manual curation. Code at: https://github.com/WormBase/genomic-info-from-papers Caltech Library 2022-06-01 /pmc/articles/PMC9160977/ /pubmed/35663412 http://dx.doi.org/10.17912/micropub.biology.000578 Text en Copyright: © 2022 by the authors https://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
title Accelerated variant curation from scientific literature using biomedical text mining
topic New Methods
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9160977/
https://www.ncbi.nlm.nih.gov/pubmed/35663412
http://dx.doi.org/10.17912/micropub.biology.000578