
ResidueFinder: extracting individual residue mentions from protein literature

BACKGROUND: The revolution in molecular biology has shown how protein function and structure are based on specific sequences of amino acids. Thus, an important feature in many papers is the mention of the significance of individual amino acids in the context of the entire sequence of the protein. Mu...


Bibliographic Details
Main authors: Becker, Ton E, Jakobsson, Eric
Format: Online Article Text
Language: English
Published: BioMed Central 2021
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8293528/
https://www.ncbi.nlm.nih.gov/pubmed/34289903
http://dx.doi.org/10.1186/s13326-021-00243-3
_version_ 1783725058981625856
author Becker, Ton E
Jakobsson, Eric
author_facet Becker, Ton E
Jakobsson, Eric
author_sort Becker, Ton E
collection PubMed
description BACKGROUND: The revolution in molecular biology has shown how protein function and structure are based on specific sequences of amino acids. Thus, an important feature in many papers is the mention of the significance of individual amino acids in the context of the entire sequence of the protein. MutationFinder is a widely used program for finding mentions of specific mutations in texts. We report on augmenting the positive attributes of MutationFinder with a more inclusive regular expression list to create ResidueFinder, which finds mentions of native amino acids as well as mutations. We also consider parameter options for both ResidueFinder and MutationFinder to explore trade-offs between precision, recall, and computational efficiency. We test our methods and software in full text as well as abstracts. RESULTS: We find there is much more variety of formats for mentioning residues in the entire text of papers than in abstracts alone. Failure to take these multiple formats into account results in many false negatives in the program. Since MutationFinder, like several other programs, was primarily tested on abstracts, we found it necessary to build an expanded regular expression list to achieve acceptable recall in full text searches. We also discovered a number of artifacts arising from PDF to text conversion, which we wrote elements in the regular expression library to address. Taking into account those factors resulted in high recall on randomly selected primary research articles. We also developed a streamlined regular expression (called “cut”) which enables a several hundredfold speedup in both MutationFinder and ResidueFinder with only a modest compromise of recall. All regular expressions were tested using expanded F-measure statistics, i.e., we compute F(β) for various values of β, where the larger the value of β the more recall is weighted, and the smaller the value of β the more precision is weighted.
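The F(β) statistic mentioned in RESULTS can be illustrated with a short sketch. It assumes the standard definition F(β) = (1 + β²) · P · R / (β² · P + R), where P is precision and R is recall; the record does not spell out the exact formula the authors used, and the function name `f_beta` and the counts below are hypothetical, chosen only for illustration.

```python
def f_beta(tp: int, fp: int, fn: int, beta: float = 1.0) -> float:
    """Standard F-beta score from raw counts (assumed definition, not taken from the paper).

    tp: true positives, fp: false positives, fn: false negatives.
    beta > 1 weights recall more heavily; beta < 1 weights precision more heavily.
    """
    if beta <= 0:
        raise ValueError("beta must be positive")
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision == 0.0 and recall == 0.0:
        return 0.0  # avoid division by zero when nothing was found correctly
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical counts: precision 0.80, recall about 0.67.
for beta in (0.5, 1.0, 2.0):
    print(f"F({beta}) = {f_beta(80, 20, 40, beta):.3f}")
```

Because precision exceeds recall in this made-up example, the score falls as β grows: a larger β penalizes the missed mentions (false negatives) more strongly.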
CONCLUSIONS: ResidueFinder is a simple, effective, and efficient program for finding individual residue mentions in primary literature starting with text files, implemented in Python, and available in SourceForge.net. The most computationally efficient versions of ResidueFinder could enable creation and maintenance of a database of residue mentions encompassing all articles in PubMed. SUPPLEMENTARY INFORMATION: The online version contains supplementary material available at 10.1186/s13326-021-00243-3.
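As a rough illustration of regular-expression-based residue extraction from plain text, the following minimal sketch matches one common mention format: a three-letter amino acid code followed by a sequence position. It is not ResidueFinder's actual regular expression list, which the abstract describes as far more inclusive; the names `RESIDUE_RE` and `find_residue_mentions` are hypothetical.

```python
import re

# Three-letter codes for the 20 standard amino acids.
THREE_LETTER = (
    "Ala|Arg|Asn|Asp|Cys|Gln|Glu|Gly|His|Ile|"
    "Leu|Lys|Met|Phe|Pro|Ser|Thr|Trp|Tyr|Val"
)

# Hypothetical pattern: code, optional hyphen or space, then a 1-4 digit position.
# Real text (especially PDF-derived full text) uses many more formats than this.
RESIDUE_RE = re.compile(
    r"\b(?P<res>" + THREE_LETTER + r")[- ]?(?P<pos>[1-9][0-9]{0,3})\b"
)


def find_residue_mentions(text: str) -> list[tuple[str, int]]:
    """Return (residue, position) pairs in order of appearance."""
    return [(m.group("res"), int(m.group("pos"))) for m in RESIDUE_RE.finditer(text)]


sentence = "The catalytic triad of chymotrypsin comprises Ser195, His-57 and Asp 102."
print(find_residue_mentions(sentence))
```

A single narrow pattern like this is fast but misses formats such as full residue names or one-letter codes, which is the kind of precision, recall, and speed trade-off the abstract describes.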
format Online
Article
Text
id pubmed-8293528
institution National Center for Biotechnology Information
language English
publishDate 2021
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-8293528 2021-07-21 ResidueFinder: extracting individual residue mentions from protein literature Becker, Ton E Jakobsson, Eric J Biomed Semantics Software BioMed Central 2021-07-21 /pmc/articles/PMC8293528/ /pubmed/34289903 http://dx.doi.org/10.1186/s13326-021-00243-3 Text en © The Author(s) 2021 https://creativecommons.org/licenses/by/4.0/ Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
spellingShingle Software
Becker, Ton E
Jakobsson, Eric
ResidueFinder: extracting individual residue mentions from protein literature
title ResidueFinder: extracting individual residue mentions from protein literature
title_full ResidueFinder: extracting individual residue mentions from protein literature
title_fullStr ResidueFinder: extracting individual residue mentions from protein literature
title_full_unstemmed ResidueFinder: extracting individual residue mentions from protein literature
title_short ResidueFinder: extracting individual residue mentions from protein literature
title_sort residuefinder: extracting individual residue mentions from protein literature
topic Software
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8293528/
https://www.ncbi.nlm.nih.gov/pubmed/34289903
http://dx.doi.org/10.1186/s13326-021-00243-3
work_keys_str_mv AT beckertone residuefinderextractingindividualresiduementionsfromproteinliterature
AT jakobssoneric residuefinderextractingindividualresiduementionsfromproteinliterature