
ThermoScan: Semi-automatic Identification of Protein Stability Data From PubMed


Bibliographic Details
Main authors:	Turina, Paola, Fariselli, Piero, Capriotti, Emidio
Format:	Online Article Text
Language:	English
Published:	Frontiers Media S.A. 2021
Subjects:	Molecular Biosciences
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8027235/ https://www.ncbi.nlm.nih.gov/pubmed/33842537 http://dx.doi.org/10.3389/fmolb.2021.620475

author	Turina, Paola; Fariselli, Piero; Capriotti, Emidio
collection	PubMed
description	In recent years, the increasing number of DNA sequencing and protein mutagenesis studies has generated a large amount of variation data published in the biomedical literature. The collection of such data has been essential for the development and assessment of tools predicting the impact of protein variants at functional and structural levels. Nevertheless, the collection of manually curated data from literature is a highly time-consuming and costly process that requires domain experts. In particular, the development of methods for predicting the effect of amino acid variants on protein stability relies on the thermodynamic data extracted from literature. In the past, such data were deposited in the ProTherm database, which, however, has not been maintained since 2013. To facilitate the collection of protein thermodynamic data from literature, we developed the semi-automatic tool ThermoScan. ThermoScan is a text mining approach for the identification of relevant thermodynamic data on protein stability from full-text articles. The method relies on a regular expression searching for groups of words, including the most common conceptual words appearing in experimental studies on protein stability, several thermodynamic variables, and their units of measure. ThermoScan analyzes full-text articles from the PubMed Central Open Access subset and calculates an empirical score that allows the identification of manuscripts reporting thermodynamic data on protein stability. The method was optimized on a set of publications included in the ProTherm database, and tested on a new curated set of articles, manually selected for the presence of thermodynamic data. The results show that ThermoScan returns accurate predictions and outperforms recently developed text-mining algorithms based on the analysis of publication abstracts. Availability: The ThermoScan server is freely accessible online at https://folding.biofold.org/thermoscan. The ThermoScan Python code and the Google Chrome extension for submitting visualized PMC web pages to the ThermoScan server are available at https://github.com/biofold/ThermoScan.
format	Online Article Text
id	pubmed-8027235
institution	National Center for Biotechnology Information
language	English
publishDate	2021
publisher	Frontiers Media S.A.
record_format	MEDLINE/PubMed
spelling	pubmed-8027235 2021-04-09 ThermoScan: Semi-automatic Identification of Protein Stability Data From PubMed. Turina, Paola; Fariselli, Piero; Capriotti, Emidio. Front Mol Biosci. Molecular Biosciences. Frontiers Media S.A. 2021-03-25. /pmc/articles/PMC8027235/ /pubmed/33842537 http://dx.doi.org/10.3389/fmolb.2021.620475 Text en. Copyright © 2021 Turina, Fariselli and Capriotti. https://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
title	ThermoScan: Semi-automatic Identification of Protein Stability Data From PubMed
topic	Molecular Biosciences
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8027235/ https://www.ncbi.nlm.nih.gov/pubmed/33842537 http://dx.doi.org/10.3389/fmolb.2021.620475
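
The abstract in this record describes a regular expression that searches full text for three groups of terms (conceptual words, thermodynamic variables, and units of measure) and combines the hits into an empirical score. The sketch below only illustrates that general idea. The term lists, the weights, the `THRESHOLD` value, and the function names `score_text` and `is_candidate` are all hypothetical and are not taken from the ThermoScan code at https://github.com/biofold/ThermoScan.

```python
import re

# Hypothetical term groups and weights, chosen for illustration only.
# The real ThermoScan vocabulary and scoring are not given in the abstract.
GROUPS = {
    "concept": (re.compile(r"\b(?:stability|unfolding|denaturation|thermodynamic)\b",
                           re.IGNORECASE), 1.0),
    "variable": (re.compile(r"ΔΔG|ΔG|ΔH|\bTm\b"), 2.0),
    "unit": (re.compile(r"kcal/mol|kJ/mol|°C"), 2.0),
}

# Hypothetical cut-off above which a text is flagged for manual review.
THRESHOLD = 5.0


def score_text(text):
    """Count matches per term group and return (weighted score, counts)."""
    counts = {name: len(rx.findall(text)) for name, (rx, _) in GROUPS.items()}
    score = sum(GROUPS[name][1] * n for name, n in counts.items())
    return score, counts


def is_candidate(text, threshold=THRESHOLD):
    """Flag a text as likely to report thermodynamic stability data."""
    score, _ = score_text(text)
    return score >= threshold


if __name__ == "__main__":
    sample = "The ΔG of unfolding was 5.2 kcal/mol and Tm was 65 °C."
    print(score_text(sample))
    print(is_candidate(sample))
```

Because the tool is described as semi-automatic, a score like this would only rank or flag articles; a curator would still need to confirm that the flagged text actually reports stability measurements.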


Similar Items