Selecting Classification Methods for Small Samples of Next-Generation Sequencing Data

Next-generation sequencing has emerged as an essential technology for the quantitative analysis of gene expression. In medical research, RNA sequencing (RNA-seq) data are commonly used to identify which type of disease a patient has. Because of the discrete nature of RNA-seq data, the existing stati...

Bibliographic Details
Main Authors:	Zhu, Jiadi, Yuan, Ziyang, Shu, Lianjie, Liao, Wenhui, Zhao, Mingtao, Zhou, Yan
Format:	Online Article Text
Language:	English
Published:	Frontiers Media S.A. 2021
Subjects:	Genetics
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7969809/ https://www.ncbi.nlm.nih.gov/pubmed/33747051 http://dx.doi.org/10.3389/fgene.2021.642227

author	Zhu, Jiadi Yuan, Ziyang Shu, Lianjie Liao, Wenhui Zhao, Mingtao Zhou, Yan
author_sort	Zhu, Jiadi
collection	PubMed
description	Next-generation sequencing has emerged as an essential technology for the quantitative analysis of gene expression. In medical research, RNA sequencing (RNA-seq) data are commonly used to identify which type of disease a patient has. Because of the discrete nature of RNA-seq data, the existing statistical methods that have been developed for microarray data cannot be directly applied to RNA-seq data. Existing statistical methods usually model RNA-seq data by a discrete distribution, such as the Poisson, the negative binomial, or the mixture distribution with a point mass at zero and a Poisson distribution to further allow for data with an excess of zeros. Consequently, analytic tools corresponding to the above three discrete distributions have been developed: Poisson linear discriminant analysis (PLDA), negative binomial linear discriminant analysis (NBLDA), and zero-inflated Poisson logistic discriminant analysis (ZIPLDA). However, it is unclear what the real distributions would be for these classifications when applied to a new and real dataset. Considering that count datasets are frequently characterized by excess zeros and overdispersion, this paper extends the existing distribution to a mixture distribution with a point mass at zero and a negative binomial distribution and proposes a zero-inflated negative binomial logistic discriminant analysis (ZINBLDA) for classification. More importantly, we compare the above four classification methods from the perspective of model parameters, as an understanding of parameters is necessary for selecting the optimal method for RNA-seq data. Furthermore, we determine that the above four methods could transform into each other in some cases. Using simulation studies, we compare and evaluate the performance of these classification methods in a wide range of settings, and we also present a decision tree model created to help us select the optimal classifier for a new RNA-seq dataset. The results of the two real datasets coincide with the theory and simulation analysis results. The methods used in this work are implemented in open-source R scripts, with the source code freely available at https://github.com/FocusPaka/ZINBLDA.
format	Online Article Text
id	pubmed-7969809
institution	National Center for Biotechnology Information
language	English
publishDate	2021
publisher	Frontiers Media S.A.
record_format	MEDLINE/PubMed
spelling	pubmed-7969809 2021-03-19 Selecting Classification Methods for Small Samples of Next-Generation Sequencing Data Zhu, Jiadi Yuan, Ziyang Shu, Lianjie Liao, Wenhui Zhao, Mingtao Zhou, Yan Front Genet Genetics Frontiers Media S.A. 2021-03-04 /pmc/articles/PMC7969809/ /pubmed/33747051 http://dx.doi.org/10.3389/fgene.2021.642227 Text en Copyright © 2021 Zhu, Yuan, Shu, Liao, Zhao and Zhou. http://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
title	Selecting Classification Methods for Small Samples of Next-Generation Sequencing Data
topic	Genetics