
First steps in automatic summarization of transcription factor properties for RegulonDB: classification of sentences about structural domains and regulated processes

The RegulonDB (http://regulondb.ccg.unam.mx) team generates manually elaborated summaries about transcription factors (TFs) of Escherichia coli K-12. These texts involve considerable effort, since they summarize a diverse collection of structural, mechanistic and physiological properties of TFs and,...


Bibliographic Details
Main authors: Méndez-Cruz, Carlos-Francisco, Gama-Castro, Socorro, Mejía-Almonte, Citlalli, Castillo-Villalba, Marco-Polo, Muñiz-Rascado, Luis-José, Collado-Vides, Julio
Format: Online Article Text
Language: English
Published: Oxford University Press 2017
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5737074/
https://www.ncbi.nlm.nih.gov/pubmed/29220462
http://dx.doi.org/10.1093/database/bax070
author Méndez-Cruz, Carlos-Francisco
Gama-Castro, Socorro
Mejía-Almonte, Citlalli
Castillo-Villalba, Marco-Polo
Muñiz-Rascado, Luis-José
Collado-Vides, Julio
collection PubMed
description The RegulonDB (http://regulondb.ccg.unam.mx) team generates manually elaborated summaries about transcription factors (TFs) of Escherichia coli K-12. These texts involve considerable effort, since they summarize a diverse collection of structural, mechanistic and physiological properties of TFs and, due to constant new research, ideally they require frequent updating. In natural language processing, several techniques for automatic summarization have been developed. Therefore, our proposal is to extract, by using those techniques, relevant information about TFs for assisting the curation and elaboration of the manual summaries. Here, we present the results of the automatic classification of sentences about the biological processes regulated by a TF and the information about the structural domains constituting the TF. We tested two classical classifiers, Naïve Bayes and Support Vector Machines (SVMs), with the sentences of the manual summaries as training data. The best classifier was an SVM employing lexical, grammatical, and terminological features (F-score, 0.8689). The sentences of articles analyzed by this classifier were frequently true, but many sentences were set aside (high precision with low recall); consequently, some improvement is required. Nevertheless, automatic summaries of complete articles about five TFs, generated with this classifier, included much of the relevant information of the summaries written by curators (high ROUGE-1 recall). In fact, a manual comparison confirmed that the best summary encompassed 100% of the relevant information. Hence, our empirical results suggest that our proposal is promising for covering more properties of TFs to generate suggested sentences with relevant information to help the curation work without losing quality. DATABASE URL: RegulonDB, http://regulondb.ccg.unam.mx
format Online
Article
Text
id pubmed-5737074
institution National Center for Biotechnology Information
language English
publishDate 2017
publisher Oxford University Press
record_format MEDLINE/PubMed
spelling pubmed-5737074 2018-01-08 First steps in automatic summarization of transcription factor properties for RegulonDB: classification of sentences about structural domains and regulated processes Méndez-Cruz, Carlos-Francisco Gama-Castro, Socorro Mejía-Almonte, Citlalli Castillo-Villalba, Marco-Polo Muñiz-Rascado, Luis-José Collado-Vides, Julio Database (Oxford) Original Article The RegulonDB (http://regulondb.ccg.unam.mx) team generates manually elaborated summaries about transcription factors (TFs) of Escherichia coli K-12. These texts involve considerable effort, since they summarize a diverse collection of structural, mechanistic and physiological properties of TFs and, due to constant new research, ideally they require frequent updating. In natural language processing, several techniques for automatic summarization have been developed. Therefore, our proposal is to extract, by using those techniques, relevant information about TFs for assisting the curation and elaboration of the manual summaries. Here, we present the results of the automatic classification of sentences about the biological processes regulated by a TF and the information about the structural domains constituting the TF. We tested two classical classifiers, Naïve Bayes and Support Vector Machines (SVMs), with the sentences of the manual summaries as training data. The best classifier was an SVM employing lexical, grammatical, and terminological features (F-score, 0.8689). The sentences of articles analyzed by this classifier were frequently true, but many sentences were set aside (high precision with low recall); consequently, some improvement is required. Nevertheless, automatic summaries of complete articles about five TFs, generated with this classifier, included much of the relevant information of the summaries written by curators (high ROUGE-1 recall). In fact, a manual comparison confirmed that the best summary encompassed 100% of the relevant information. Hence, our empirical results suggest that our proposal is promising for covering more properties of TFs to generate suggested sentences with relevant information to help the curation work without losing quality. DATABASE URL: RegulonDB, http://regulondb.ccg.unam.mx Oxford University Press 2017-09-26 /pmc/articles/PMC5737074/ /pubmed/29220462 http://dx.doi.org/10.1093/database/bax070 Text en © The Author(s) 2017. Published by Oxford University Press. http://creativecommons.org/licenses/by/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
title First steps in automatic summarization of transcription factor properties for RegulonDB: classification of sentences about structural domains and regulated processes
topic Original Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5737074/
https://www.ncbi.nlm.nih.gov/pubmed/29220462
http://dx.doi.org/10.1093/database/bax070