
A Pipeline for Classifying Deleterious Coding Mutations in Agricultural Plants

The impact of deleterious variation on both plant fitness and crop productivity is not completely understood and remains a topic of debate. Deleterious mutations in plants have so far been predicted solely with sequence-conservation methods rather than function-based classifiers, owing to the lack of well-an...


Bibliographic Details
Main authors: Kovalev, Maxim S., Igolkina, Anna A., Samsonova, Maria G., Nuzhdin, Sergey V.
Format: Online Article Text
Language: English
Published: Frontiers Media S.A. 2018
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6279870/
https://www.ncbi.nlm.nih.gov/pubmed/30546376
http://dx.doi.org/10.3389/fpls.2018.01734
author Kovalev, Maxim S.
Igolkina, Anna A.
Samsonova, Maria G.
Nuzhdin, Sergey V.
collection PubMed
description The impact of deleterious variation on both plant fitness and crop productivity is not completely understood and remains a topic of debate. Deleterious mutations in plants have so far been predicted solely with sequence-conservation methods rather than function-based classifiers, owing to the lack of well-annotated mutational datasets in these organisms. Here, we developed a machine learning classifier based on a dataset of deleterious and neutral mutations in Arabidopsis thaliana by extracting 18 informative features that discriminate deleterious from neutral mutations, including 9 novel features not used in previous studies. We examined linear SVM, Gaussian SVM, and Random Forest classifiers, with the last performing best. Random Forest classifiers exhibited markedly higher accuracy than the popular PolyPhen-2 tool on the Arabidopsis dataset. Additionally, we tested whether the Random Forest, trained on the Arabidopsis dataset, accurately predicts deleterious mutations in Oryza sativa and Pisum sativum, and observed satisfactory accuracy (87% and 93%, respectively), higher than that obtained with PolyPhen-2. Applying transfer learning to the classifiers did not improve their performance. To further test the performance of the Random Forest classifier across different angiosperm species, we applied it to annotate deleterious mutations in Cicer arietinum and validated them using population frequency data. Overall, we devised a classifier with the potential to improve the annotation of putative functional mutations in QTL and GWAS hit regions, and to support evolutionary analysis of the proliferation of deleterious mutations during plant domestication, thus optimizing breeding improvement and the development of new cultivars.
format Online
Article
Text
id pubmed-6279870
institution National Center for Biotechnology Information
language English
publishDate 2018
publisher Frontiers Media S.A.
record_format MEDLINE/PubMed
spelling pubmed-6279870 2018-12-13 A Pipeline for Classifying Deleterious Coding Mutations in Agricultural Plants Kovalev, Maxim S. Igolkina, Anna A. Samsonova, Maria G. Nuzhdin, Sergey V. Front Plant Sci Plant Science
Frontiers Media S.A. 2018-11-28 /pmc/articles/PMC6279870/ /pubmed/30546376 http://dx.doi.org/10.3389/fpls.2018.01734 Text en Copyright © 2018 Kovalev, Igolkina, Samsonova and Nuzhdin. http://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
title A Pipeline for Classifying Deleterious Coding Mutations in Agricultural Plants
topic Plant Science
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6279870/
https://www.ncbi.nlm.nih.gov/pubmed/30546376
http://dx.doi.org/10.3389/fpls.2018.01734