Combining classifiers to predict gene function in Arabidopsis thaliana using large-scale gene expression measurements
BACKGROUND: Arabidopsis thaliana is the model species of current plant genomic research with a genome size of 125 Mb and approximately 28,000 genes. The function of half of these genes is currently unknown. The purpose of this study is to infer gene function in Arabidopsis using machine-learning alg...
Main authors: | Lan, Hui; Carson, Rachel; Provart, Nicholas J; Bonner, Anthony J |
---|---|
Format: | Text |
Language: | English |
Published: | BioMed Central 2007 |
Subjects: | Research Article |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2213690/ https://www.ncbi.nlm.nih.gov/pubmed/17888165 http://dx.doi.org/10.1186/1471-2105-8-358 |
_version_ | 1782148935356252160 |
---|---|
author | Lan, Hui; Carson, Rachel; Provart, Nicholas J; Bonner, Anthony J |
author_facet | Lan, Hui; Carson, Rachel; Provart, Nicholas J; Bonner, Anthony J |
author_sort | Lan, Hui |
collection | PubMed |
description | BACKGROUND: Arabidopsis thaliana is the model species of current plant genomic research with a genome size of 125 Mb and approximately 28,000 genes. The function of half of these genes is currently unknown. The purpose of this study is to infer gene function in Arabidopsis using machine-learning algorithms applied to large-scale gene expression data sets, with the goal of identifying genes that are potentially involved in plant response to abiotic stress. RESULTS: Using in-house and publicly available data, we assembled a large set of gene expression measurements for A. thaliana. Using those genes of known function, we first evaluated and compared the ability of basic machine-learning algorithms to predict which genes respond to stress. Predictive accuracy was measured using ROC50 and precision curves derived through cross-validation. To improve accuracy, we developed a method for combining these classifiers using a weighted-voting scheme. The combined classifier was then trained on genes of known function and applied to genes of unknown function, identifying genes that potentially respond to stress. Visual evidence corroborating the predictions was obtained using electronic Northern analysis. Three of the predicted genes were chosen for biological validation. Gene knockout experiments confirmed that all three are involved in a variety of stress responses. The biological analysis of one of these genes (At1g16850) is presented here, where it is shown to be necessary for the normal response to temperature and NaCl. CONCLUSION: Supervised learning methods applied to large-scale gene expression measurements can be used to predict gene function. However, the ability of basic learning methods to predict stress response varies widely and depends heavily on how much dimensionality reduction is used. 
Our method of combining classifiers can improve the accuracy of such predictions – in this case, predictions of genes involved in stress response in plants – and it effectively chooses the appropriate amount of dimensionality reduction automatically. The method provides a useful means of identifying genes in A. thaliana that potentially respond to stress, and we expect it would be useful in other organisms and for other gene functions. |
format | Text |
id | pubmed-2213690 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2007 |
publisher | BioMed Central |
record_format | MEDLINE/PubMed |
spelling | pubmed-2213690 2008-01-25 Combining classifiers to predict gene function in Arabidopsis thaliana using large-scale gene expression measurements Lan, Hui Carson, Rachel Provart, Nicholas J Bonner, Anthony J BMC Bioinformatics Research Article BACKGROUND: Arabidopsis thaliana is the model species of current plant genomic research with a genome size of 125 Mb and approximately 28,000 genes. The function of half of these genes is currently unknown. The purpose of this study is to infer gene function in Arabidopsis using machine-learning algorithms applied to large-scale gene expression data sets, with the goal of identifying genes that are potentially involved in plant response to abiotic stress. RESULTS: Using in-house and publicly available data, we assembled a large set of gene expression measurements for A. thaliana. Using those genes of known function, we first evaluated and compared the ability of basic machine-learning algorithms to predict which genes respond to stress. Predictive accuracy was measured using ROC50 and precision curves derived through cross-validation. To improve accuracy, we developed a method for combining these classifiers using a weighted-voting scheme. The combined classifier was then trained on genes of known function and applied to genes of unknown function, identifying genes that potentially respond to stress. Visual evidence corroborating the predictions was obtained using electronic Northern analysis. Three of the predicted genes were chosen for biological validation. Gene knockout experiments confirmed that all three are involved in a variety of stress responses. The biological analysis of one of these genes (At1g16850) is presented here, where it is shown to be necessary for the normal response to temperature and NaCl. CONCLUSION: Supervised learning methods applied to large-scale gene expression measurements can be used to predict gene function. 
However, the ability of basic learning methods to predict stress response varies widely and depends heavily on how much dimensionality reduction is used. Our method of combining classifiers can improve the accuracy of such predictions – in this case, predictions of genes involved in stress response in plants – and it effectively chooses the appropriate amount of dimensionality reduction automatically. The method provides a useful means of identifying genes in A. thaliana that potentially respond to stress, and we expect it would be useful in other organisms and for other gene functions. BioMed Central 2007-09-21 /pmc/articles/PMC2213690/ /pubmed/17888165 http://dx.doi.org/10.1186/1471-2105-8-358 Text en Copyright © 2007 Lan et al; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. |
spellingShingle | Research Article Lan, Hui Carson, Rachel Provart, Nicholas J Bonner, Anthony J Combining classifiers to predict gene function in Arabidopsis thaliana using large-scale gene expression measurements |
title | Combining classifiers to predict gene function in Arabidopsis thaliana using large-scale gene expression measurements |
title_full | Combining classifiers to predict gene function in Arabidopsis thaliana using large-scale gene expression measurements |
title_fullStr | Combining classifiers to predict gene function in Arabidopsis thaliana using large-scale gene expression measurements |
title_full_unstemmed | Combining classifiers to predict gene function in Arabidopsis thaliana using large-scale gene expression measurements |
title_short | Combining classifiers to predict gene function in Arabidopsis thaliana using large-scale gene expression measurements |
title_sort | combining classifiers to predict gene function in arabidopsis thaliana using large-scale gene expression measurements |
topic | Research Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2213690/ https://www.ncbi.nlm.nih.gov/pubmed/17888165 http://dx.doi.org/10.1186/1471-2105-8-358 |
work_keys_str_mv | AT lanhui combiningclassifierstopredictgenefunctioninarabidopsisthalianausinglargescalegeneexpressionmeasurements AT carsonrachel combiningclassifierstopredictgenefunctioninarabidopsisthalianausinglargescalegeneexpressionmeasurements AT provartnicholasj combiningclassifierstopredictgenefunctioninarabidopsisthalianausinglargescalegeneexpressionmeasurements AT bonneranthonyj combiningclassifierstopredictgenefunctioninarabidopsisthalianausinglargescalegeneexpressionmeasurements |
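
Illustrative note: the abstract above mentions two techniques without giving details, namely scoring classifiers with ROC50 and combining them by weighted voting. The sketch below shows one common reading of each, under stated assumptions. ROC50 is taken here to be the area under the ROC curve up to the first 50 false positives, normalized to the range 0 to 1. The weighted vote is taken to be a weight-normalized average of per-gene classifier scores. The function names `roc_n` and `weighted_vote`, and the idea of using each classifier's cross-validated ROC50 as its weight, are hypothetical choices for illustration and are not taken from the record or confirmed as the authors' procedure.

```python
from typing import List, Sequence


def roc_n(scores: Sequence[float], labels: Sequence[int], n: int = 50) -> float:
    """Normalized area under the ROC curve up to the first n false positives.

    Assumption: labels are 1 (e.g. stress-responsive) or 0, and a higher
    score means "more likely positive". Tied scores keep their input order.
    """
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    total_pos = sum(1 for _, label in ranked if label)
    total_neg = len(ranked) - total_pos
    if total_pos == 0:
        raise ValueError("at least one positive example is required")
    # If fewer than n negatives exist, use all of them.
    n_eff = min(n, total_neg)
    if n_eff == 0:
        return 1.0

    true_pos = 0
    false_pos = 0
    area = 0
    for _, label in ranked:
        if label:
            true_pos += 1
        else:
            false_pos += 1
            # Each false positive adds the positives ranked above it.
            area += true_pos
            if false_pos == n_eff:
                break
    return area / (n_eff * total_pos)


def weighted_vote(score_lists: Sequence[Sequence[float]],
                  weights: Sequence[float]) -> List[float]:
    """Combine per-classifier scores into one score per gene.

    Assumption: every classifier scores the same genes in the same order,
    and weights are non-negative (for example, cross-validated ROC50 values).
    """
    if len(score_lists) != len(weights):
        raise ValueError("need exactly one weight per classifier")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    n_items = len(score_lists[0])
    return [
        sum(w * s[i] for w, s in zip(weights, score_lists)) / total
        for i in range(n_items)
    ]
```

In this reading, a classifier that ranks known stress genes well in cross-validation receives a larger say in the combined score, which is then used to rank genes of unknown function.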