Accelerated variant curation from scientific literature using biomedical text mining

Bibliographic Details
Main authors: Mallick, Rishab; Arnaboldi, Valerio; Davis, Paul; Diamantakis, Stavros; Zarowiecki, Magdalena; Howe, Kevin
Format: Online Article Text
Language: English
Published: Caltech Library 2022
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9160977/
https://www.ncbi.nlm.nih.gov/pubmed/35663412
http://dx.doi.org/10.17912/micropub.biology.000578
Description
Summary: Biological databases collect and standardize data through biocuration. Although major model organism databases have automated some curation methods, a large portion of biocuration is still performed manually. To speed up the extraction of the genomic positions of variants, we have developed a hybrid approach that combines regular expressions, Named Entity Recognition based on BERT (Bidirectional Encoder Representations from Transformers), and bag-of-words to extract variant genomic locations from C. elegans papers for WormBase. Our model achieves a precision of 82.59% on gene-mutation matches, tested on text extracted from 100 papers, and even recovers some data not discovered during manual curation. Code is available at: https://github.com/WormBase/genomic-info-from-papers
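
The regular-expression component of such a hybrid approach can be illustrated with a minimal sketch. This is not the authors' implementation (see the repository linked above for that); the patterns `PROTEIN_SUB` and `GENE`, and the helper names `extract_gene_mutation_pairs` and `precision`, are hypothetical and chosen only for illustration. The sketch assumes mutations are written as single-letter protein substitutions (for example G123A) and that gene names follow the common C. elegans form of three or four lowercase letters, a hyphen, and a number.

```python
import re

# Hypothetical pattern for single-letter protein substitutions such as "G123A"
# or nonsense changes such as "W45*". Illustrative only.
AA1 = "ACDEFGHIKLMNPQRSTVWY"
PROTEIN_SUB = re.compile(rf"\b([{AA1}])(\d+)([{AA1}]|\*)(?![A-Za-z0-9])")

# Hypothetical pattern for C. elegans gene names such as "unc-22" or "daf-16".
GENE = re.compile(r"\b([a-z]{3,4}-\d+)\b")


def extract_gene_mutation_pairs(sentence):
    """Pair every gene name with every mutation mention found in one sentence."""
    genes = GENE.findall(sentence)
    mutations = [m.group(0) for m in PROTEIN_SUB.finditer(sentence)]
    return [(gene, mutation) for gene in genes for mutation in mutations]


def precision(true_positives, false_positives):
    """Fraction of reported gene-mutation matches that are correct."""
    total = true_positives + false_positives
    return true_positives / total if total else 0.0


if __name__ == "__main__":
    text = "The unc-22 allele carries a G123A substitution."
    print(extract_gene_mutation_pairs(text))
```

Pairing by sentence co-occurrence is a deliberate simplification. A fuller pipeline along the lines the summary describes would presumably add a named-entity recognizer and further filtering before candidate matches are reported.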