Identification of transcription factor contexts in literature using machine learning approaches

BACKGROUND: Availability of information about transcription factors (TFs) is crucial for genome biology, as TFs play a central role in the regulation of gene expression. While manual literature curation is expensive and labour intensive, the development of semi-automated text mining support is hinde...

Bibliographic Details
Main Authors:	Yang, Hui; Nenadic, Goran; Keane, John A
Format:	Text
Language:	English
Published:	BioMed Central 2008
Subjects:	Proceedings
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2352869/ https://www.ncbi.nlm.nih.gov/pubmed/18426546 http://dx.doi.org/10.1186/1471-2105-9-S3-S11

_version_	1782152859503034368
author	Yang, Hui Nenadic, Goran Keane, John A
author_facet	Yang, Hui Nenadic, Goran Keane, John A
author_sort	Yang, Hui
collection	PubMed
description	BACKGROUND: Availability of information about transcription factors (TFs) is crucial for genome biology, as TFs play a central role in the regulation of gene expression. While manual literature curation is expensive and labour intensive, the development of semi-automated text mining support is hindered by unavailability of training data. There have been no studies on how existing data sources (e.g. TF-related data from the MeSH thesaurus and GO ontology) or potentially noisy example data (e.g. protein-protein interaction, PPI) could be used to provide training data for identification of TF-contexts in literature. RESULTS: In this paper we describe a text-classification system designed to automatically recognise contexts related to transcription factors in literature. A learning model is based on a set of biological features (e.g. protein and gene names, interaction words, other biological terms) that are deemed relevant for the task. We have exploited background knowledge from existing biological resources (MeSH and GO) to engineer such features. Weak and noisy training datasets have been collected from descriptions of TF-related concepts in MeSH and GO, PPI data and data representing non-protein-function descriptions. Three machine-learning methods are investigated, along with a vote-based merging of individual approaches and/or different training datasets. The system achieved highly encouraging results, with most classifiers achieving an F-measure above 90%. CONCLUSIONS: The experimental results have shown that the proposed model can be used for identification of TF-related contexts (i.e. sentences) with high accuracy, with a significantly reduced set of features when compared to traditional bag-of-words approach. The results of considering existing PPI data suggest that there is not as high similarity between TF and PPI contexts as we have expected. 
We have also shown that existing knowledge sources are useful both for feature engineering and for obtaining noisy positive training data.
format	Text
id	pubmed-2352869
institution	National Center for Biotechnology Information
language	English
publishDate	2008
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-23528692008-04-29 Identification of transcription factor contexts in literature using machine learning approaches Yang, Hui Nenadic, Goran Keane, John A BMC Bioinformatics Proceedings BACKGROUND: Availability of information about transcription factors (TFs) is crucial for genome biology, as TFs play a central role in the regulation of gene expression. While manual literature curation is expensive and labour intensive, the development of semi-automated text mining support is hindered by unavailability of training data. There have been no studies on how existing data sources (e.g. TF-related data from the MeSH thesaurus and GO ontology) or potentially noisy example data (e.g. protein-protein interaction, PPI) could be used to provide training data for identification of TF-contexts in literature. RESULTS: In this paper we describe a text-classification system designed to automatically recognise contexts related to transcription factors in literature. A learning model is based on a set of biological features (e.g. protein and gene names, interaction words, other biological terms) that are deemed relevant for the task. We have exploited background knowledge from existing biological resources (MeSH and GO) to engineer such features. Weak and noisy training datasets have been collected from descriptions of TF-related concepts in MeSH and GO, PPI data and data representing non-protein-function descriptions. Three machine-learning methods are investigated, along with a vote-based merging of individual approaches and/or different training datasets. The system achieved highly encouraging results, with most classifiers achieving an F-measure above 90%. CONCLUSIONS: The experimental results have shown that the proposed model can be used for identification of TF-related contexts (i.e. sentences) with high accuracy, with a significantly reduced set of features when compared to traditional bag-of-words approach. 
The results of considering existing PPI data suggest that there is not as high similarity between TF and PPI contexts as we have expected. We have also shown that existing knowledge sources are useful both for feature engineering and for obtaining noisy positive training data. BioMed Central 2008-04-11 /pmc/articles/PMC2352869/ /pubmed/18426546 http://dx.doi.org/10.1186/1471-2105-9-S3-S11 Text en Copyright © 2008 Yang et al.; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an open access article distributed under the terms of the Creative Commons Attribution License ( (http://creativecommons.org/licenses/by/2.0) ), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
spellingShingle	Proceedings Yang, Hui Nenadic, Goran Keane, John A Identification of transcription factor contexts in literature using machine learning approaches
title	Identification of transcription factor contexts in literature using machine learning approaches
title_full	Identification of transcription factor contexts in literature using machine learning approaches
title_fullStr	Identification of transcription factor contexts in literature using machine learning approaches
title_full_unstemmed	Identification of transcription factor contexts in literature using machine learning approaches
title_short	Identification of transcription factor contexts in literature using machine learning approaches
title_sort	identification of transcription factor contexts in literature using machine learning approaches
topic	Proceedings
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2352869/ https://www.ncbi.nlm.nih.gov/pubmed/18426546 http://dx.doi.org/10.1186/1471-2105-9-S3-S11
work_keys_str_mv	AT yanghui identificationoftranscriptionfactorcontextsinliteratureusingmachinelearningapproaches AT nenadicgoran identificationoftranscriptionfactorcontextsinliteratureusingmachinelearningapproaches AT keanejohna identificationoftranscriptionfactorcontextsinliteratureusingmachinelearningapproaches
