First steps in automatic summarization of transcription factor properties for RegulonDB: classification of sentences about structural domains and regulated processes
The RegulonDB (http://regulondb.ccg.unam.mx) team generates manually elaborated summaries about transcription factors (TFs) of Escherichia coli K-12. These texts involve considerable effort, since they summarize a diverse collection of structural, mechanistic and physiological properties of TFs and,...
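Main authors: | Méndez-Cruz, Carlos-Francisco; Gama-Castro, Socorro; Mejía-Almonte, Citlalli; Castillo-Villalba, Marco-Polo; Muñiz-Rascado, Luis-José; Collado-Vides, Julio |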
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Oxford University Press, 2017 |
Subjects: | Original Article |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5737074/ https://www.ncbi.nlm.nih.gov/pubmed/29220462 http://dx.doi.org/10.1093/database/bax070 |
_version_ | 1783287471467921408 |
---|---|
author | Méndez-Cruz, Carlos-Francisco; Gama-Castro, Socorro; Mejía-Almonte, Citlalli; Castillo-Villalba, Marco-Polo; Muñiz-Rascado, Luis-José; Collado-Vides, Julio |
author_sort | Méndez-Cruz, Carlos-Francisco |
collection | PubMed |
description | The RegulonDB (http://regulondb.ccg.unam.mx) team generates manually elaborated summaries about transcription factors (TFs) of Escherichia coli K-12. These texts involve considerable effort, since they summarize a diverse collection of structural, mechanistic and physiological properties of TFs and, due to constant new research, ideally they require frequent updating. In natural language processing, several techniques for automatic summarization have been developed. Therefore, our proposal is to extract, by using those techniques, relevant information about TFs for assisting the curation and elaboration of the manual summaries. Here, we present the results of the automatic classification of sentences about the biological processes regulated by a TF and the information about the structural domains constituting the TF. We tested two classical classifiers, Naïve Bayes and Support Vector Machines (SVMs), with the sentences of the manual summaries as training data. The best classifier was an SVM employing lexical, grammatical, and terminological features (F-score, 0.8689). The sentences of articles analyzed by this classifier were frequently true, but many sentences were set aside (high precision with low recall); consequently, some improvement is required. Nevertheless, automatic summaries of complete articles about five TFs, generated with this classifier, included much of the relevant information of the summaries written by curators (high ROUGE-1 recall). In fact, a manual comparison confirmed that the best summary encompassed 100% of the relevant information. Hence, our empirical results suggest that our proposal is promising for covering more properties of TFs to generate suggested sentences with relevant information to help the curation work without losing quality. DATABASE URL: RegulonDB, http://regulondb.ccg.unam.mx |
format | Online Article Text |
id | pubmed-5737074 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2017 |
publisher | Oxford University Press |
record_format | MEDLINE/PubMed |
spelling | pubmed-5737074 2018-01-08 First steps in automatic summarization of transcription factor properties for RegulonDB: classification of sentences about structural domains and regulated processes Méndez-Cruz, Carlos-Francisco Gama-Castro, Socorro Mejía-Almonte, Citlalli Castillo-Villalba, Marco-Polo Muñiz-Rascado, Luis-José Collado-Vides, Julio Database (Oxford) Original Article Oxford University Press 2017-09-26 /pmc/articles/PMC5737074/ /pubmed/29220462 http://dx.doi.org/10.1093/database/bax070 Text en © The Author(s) 2017. Published by Oxford University Press. http://creativecommons.org/licenses/by/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited. |
title | First steps in automatic summarization of transcription factor properties for RegulonDB: classification of sentences about structural domains and regulated processes |
topic | Original Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5737074/ https://www.ncbi.nlm.nih.gov/pubmed/29220462 http://dx.doi.org/10.1093/database/bax070 |
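
The description field above reports two kinds of scores: an F-score for the sentence classifier (with high precision and low recall on full articles) and ROUGE-1 recall for the automatic summaries. As a rough illustration of what these measures compute, here is a minimal Python sketch. It is not the authors' evaluation code: the tokenizer, the function names, and the use of the balanced F1 formula are assumptions made for this example, and the record does not state the exact ROUGE configuration or how the reported F-score was averaged.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split into alphanumeric tokens (an assumed, simplistic tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def rouge1_recall(reference, candidate):
    """Fraction of reference unigrams that also occur in the candidate.

    Counts are clipped, so a word repeated in the reference is only
    matched as many times as it appears in the candidate.
    """
    ref_counts = Counter(tokenize(reference))
    cand_counts = Counter(tokenize(candidate))
    total = sum(ref_counts.values())
    if total == 0:
        return 0.0
    overlap = sum(min(n, cand_counts[tok]) for tok, n in ref_counts.items())
    return overlap / total


def f_score(precision, recall):
    """Balanced F-score (harmonic mean of precision and recall)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Hypothetical sentences, invented for illustration only.
    manual = "The protein has a DNA-binding domain"
    automatic = "The protein contains a DNA-binding domain at the C terminus"
    print(rouge1_recall(manual, automatic))
    print(f_score(0.9, 0.5))
```

A high ROUGE-1 recall means that most words of the curator-written summary also appear in the automatic one; it says nothing about how much extra material the automatic summary contains, which is why the description also mentions a manual comparison.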