Principal Component Analysis Applications in COVID-19 Genome Sequence Studies

Coronavirus RNA genomes can be as long as 32 kilobases, and the long sequences of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), the virus that caused the coronavirus disease 2019 (COVID-19) pandemic, make analysis difficult. Over 20,000 sequences have been subm...

Bibliographic Details
Main authors:	Wang, Bo; Jiang, Lin
Format:	Online Article Text
Language:	English
Published:	Springer US 2021
Subjects:	Article
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7804214/ https://www.ncbi.nlm.nih.gov/pubmed/33456620 http://dx.doi.org/10.1007/s12559-020-09790-w

author	Wang, Bo Jiang, Lin
author_sort	Wang, Bo
collection	PubMed
description	Coronavirus RNA genomes can be as long as 32 kilobases, and the long sequences of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), the virus that caused the coronavirus disease 2019 (COVID-19) pandemic, make analysis difficult. Over 20,000 sequences have been submitted to GISAID, and the number grows quickly each day, which further complicates data analysis; however, genome sequence analysis is critical to understanding COVID-19 and preventing the spread of the disease. In this study, principal component analysis (PCA) was applied to large aligned genome sequences after the letters were converted to numerical values using a published method designed for protein sequence cluster analysis. The study began by testing a shortlist of sequences: the PCA score plot showed high tolerance of low-quality data, and the major virus sequences from humans were separated from the pangolin and bat samples. The study also built a model for more than 20,000 sequences, which indicates potential mutation directions for COVID-19 and can serve as a pretreatment method for detailed studies such as decision tree-based methods. In summary, the study provides a fast tool for analyzing high-volume genome sequences such as those of COVID-19; it was successfully applied to more than 20,000 sequences and may provide mutation direction information for COVID-19 studies.
format	Online Article Text
id	pubmed-7804214
institution	National Center for Biotechnology Information
language	English
publishDate	2021
publisher	Springer US
record_format	MEDLINE/PubMed
spelling	pubmed-7804214 2021-01-13 Principal Component Analysis Applications in COVID-19 Genome Sequence Studies Wang, Bo Jiang, Lin Cognit Comput Article
Springer US 2021-01-13 /pmc/articles/PMC7804214/ /pubmed/33456620 http://dx.doi.org/10.1007/s12559-020-09790-w Text en © Springer Science+Business Media, LLC, part of Springer Nature 2021 This article is made available via the PMC Open Access Subset for unrestricted research re-use and secondary analysis in any form or by any means with acknowledgement of the original source. These permissions are granted for the duration of the World Health Organization (WHO) declaration of COVID-19 as a global pandemic.
title	Principal Component Analysis Applications in COVID-19 Genome Sequence Studies
topic	Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7804214/ https://www.ncbi.nlm.nih.gov/pubmed/33456620 http://dx.doi.org/10.1007/s12559-020-09790-w

Similar items