Deep learning methods may not outperform other machine learning methods on analyzing genomic studies

Deep Learning (DL) has been broadly applied to big data problems in biomedical fields, most successfully in image processing. Recently, many DL methods have been applied to genomic studies. However, genomic data usually has too small a sample size to fit a complex network. They...

Bibliographic Details
Main Authors:	Dong, Yao; Zhou, Shaoze; Xing, Li; Chen, Yumeng; Ren, Ziyu; Dong, Yongfeng; Zhang, Xuekui
Format:	Online Article Text
Language:	English
Published:	Frontiers Media S.A. 2022
Subjects:	Genetics
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9537734/ https://www.ncbi.nlm.nih.gov/pubmed/36212148 http://dx.doi.org/10.3389/fgene.2022.992070

_version_	1784803267402792960
author	Dong, Yao Zhou, Shaoze Xing, Li Chen, Yumeng Ren, Ziyu Dong, Yongfeng Zhang, Xuekui
author_sort	Dong, Yao
collection	PubMed
description	Deep Learning (DL) has been broadly applied to big data problems in biomedical fields, most successfully in image processing. Recently, many DL methods have been applied to genomic studies. However, genomic data usually has too small a sample size to fit a complex network, and it lacks the common structural patterns that let image data use pre-trained networks or take advantage of convolution layers. The concern about overusing DL methods motivates us to evaluate the performance of DL methods against popular non-deep Machine Learning (ML) methods for analyzing genomic data across a wide range of sample sizes. In this paper, we conduct a benchmark study using the UK Biobank data and many of its random subsets with different sample sizes. The original UK Biobank data has about 500k participants. Each participant has comprehensive characteristics, disease histories, and genomic information, i.e., the genotypes of millions of Single-Nucleotide Polymorphisms (SNPs). We are interested in predicting the risk of three lung diseases: asthma, COPD, and lung cancer. There are 205,238 participants with recorded disease outcomes for these three diseases. Five prediction models are investigated in this benchmark study: three non-deep machine learning methods (Elastic Net, XGBoost, and SVM) and two deep learning methods (DNN and LSTM). Besides the most popular performance metrics, such as the F1-score, we promote the hit curve, a visual tool for describing performance in predicting rare events. We found that DL methods frequently fail to outperform non-deep ML methods in analyzing genomic data, even in large datasets with over 200k samples. The experimental results suggest not overusing DL methods in genomic studies, even with biobank-level sample sizes. The performance differences between DL and non-deep ML decrease as the sample size increases. This suggests that when the sample size is large, further increasing it leads to more performance gain in DL methods. Hence, DL methods could perform better on genomic data larger than that of this study.
format	Online Article Text
id	pubmed-9537734
institution	National Center for Biotechnology Information
language	English
publishDate	2022
publisher	Frontiers Media S.A.
record_format	MEDLINE/PubMed
spelling	pubmed-9537734 2022-10-08 Deep learning methods may not outperform other machine learning methods on analyzing genomic studies Dong, Yao Zhou, Shaoze Xing, Li Chen, Yumeng Ren, Ziyu Dong, Yongfeng Zhang, Xuekui Front Genet Genetics Frontiers Media S.A. 2022-09-23 /pmc/articles/PMC9537734/ /pubmed/36212148 http://dx.doi.org/10.3389/fgene.2022.992070 Text en Copyright © 2022 Dong, Zhou, Xing, Chen, Ren, Dong and Zhang. https://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
title	Deep learning methods may not outperform other machine learning methods on analyzing genomic studies
topic	Genetics
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9537734/ https://www.ncbi.nlm.nih.gov/pubmed/36212148 http://dx.doi.org/10.3389/fgene.2022.992070
