TSLRF: Two-Stage Algorithm Based on Least Angle Regression and Random Forest in genome-wide association studies

One of the most important tasks in genome-wide association analysis (GWAS) is the detection of single-nucleotide polymorphisms (SNPs) that are related to target traits. With the development of sequencing technology, traditional statistical methods struggle to analyze the corresponding high-dim...

Bibliographic Details
Main authors: Sun, Jiali; Wu, Qingtai; Shen, Dafeng; Wen, Yangjun; Liu, Fengrong; Gao, Yu; Ding, Jie; Zhang, Jin
Format: Online Article Text
Language: English
Published: Nature Publishing Group UK 2019
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6889171/
https://www.ncbi.nlm.nih.gov/pubmed/31792302
http://dx.doi.org/10.1038/s41598-019-54519-x
author Sun, Jiali
Wu, Qingtai
Shen, Dafeng
Wen, Yangjun
Liu, Fengrong
Gao, Yu
Ding, Jie
Zhang, Jin
collection PubMed
description One of the most important tasks in genome-wide association analysis (GWAS) is the detection of single-nucleotide polymorphisms (SNPs) that are related to target traits. With the development of sequencing technology, traditional statistical methods struggle to analyze the resulting high-dimensional, massive SNP data. Recently, machine learning methods have become more popular in high-dimensional genetic data analysis because of their fast computation. However, most machine learning methods have several drawbacks, such as poor generalization ability, over-fitting, unsatisfactory classification and low detection accuracy. This study proposes a two-stage algorithm based on least angle regression and random forest (TSLRF). It first controls for population structure and polygenic effects, then uses least angle regression (LARS) to select the SNPs that are potentially related to target traits, and finally analyzes this variable subset with random forest (RF) to detect quantitative trait nucleotides (QTNs) associated with target traits. The new method showed more powerful detection in both simulation experiments and real data analyses. In the simulation experiments, compared with existing approaches, the new method improved QTN detection power and model fit, and required less computation time. In addition, it significantly distinguished QTNs from other SNPs. The new method was then applied to five flowering-related traits in Arabidopsis. The distinction between QTNs and unrelated SNPs was more significant than with the other methods. The new method detected 60 genes confirmed to be related to the target trait, significantly more than the other methods, and simultaneously detected multiple gene clusters associated with the target trait.
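The two-stage idea in the description above can be sketched roughly as follows. This is a minimal illustration using scikit-learn's `Lars` and `RandomForestRegressor`, not the authors' implementation: the population-structure and polygenic-effect correction is omitted, the data are simulated, and the function names, the `n_candidates` cutoff and the use of feature importance as the ranking score are all assumptions made for this example.

```python
import numpy as np
from sklearn.linear_model import Lars
from sklearn.ensemble import RandomForestRegressor


def two_stage_screen(genotypes, phenotype, n_candidates=50, n_trees=200, seed=0):
    """Hypothetical helper: LARS pre-selection, then random-forest ranking."""
    # Stage 1: least angle regression keeps at most n_candidates SNPs
    # (those with non-zero coefficients at the end of the LARS path).
    lars = Lars(n_nonzero_coefs=n_candidates)
    lars.fit(genotypes, phenotype)
    candidates = np.flatnonzero(lars.coef_)

    # Stage 2: random forest on the reduced SNP subset; impurity-based
    # feature importance is used here as an assumed ranking score.
    forest = RandomForestRegressor(n_estimators=n_trees, random_state=seed)
    forest.fit(genotypes[:, candidates], phenotype)
    order = np.argsort(forest.feature_importances_)[::-1]
    return candidates[order], forest.feature_importances_[order]


def simulate(n_samples=200, n_snps=500, causal=(10, 250, 400), seed=0):
    """Simulated data (assumption): SNPs coded 0/1/2, three causal SNPs."""
    rng = np.random.default_rng(seed)
    genotypes = rng.integers(0, 3, size=(n_samples, n_snps)).astype(float)
    effects = np.array([2.0, -1.5, 1.0])
    noise = rng.normal(0.0, 0.5, n_samples)
    phenotype = genotypes[:, list(causal)] @ effects + noise
    return genotypes, phenotype


if __name__ == "__main__":
    X, y = simulate()
    ranked_snps, scores = two_stage_screen(X, y)
    print("Top-ranked SNP indices:", ranked_snps[:5])
```

Running LARS first shrinks the search space from all SNPs to a small candidate set, so the random forest is fitted on tens of variables rather than hundreds or more. In a real analysis the phenotype would first need to be adjusted for population structure and polygenic background, as the description states.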
format Online
Article
Text
id pubmed-6889171
institution National Center for Biotechnology Information
language English
publishDate 2019
publisher Nature Publishing Group UK
record_format MEDLINE/PubMed
spelling pubmed-6889171 2019-12-10 Sci Rep Article Nature Publishing Group UK 2019-12-02 /pmc/articles/PMC6889171/ /pubmed/31792302 http://dx.doi.org/10.1038/s41598-019-54519-x Text en © The Author(s) 2019 Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.
title TSLRF: Two-Stage Algorithm Based on Least Angle Regression and Random Forest in genome-wide association studies
topic Article