Deciphering the Subtype Differentiation History of SARS-CoV-2 Based on a New Breadth-First Searching Optimized Alignment Method Over a Global Data Set of 24,768 Sequences

Bibliographic Details
Main Authors: Lin, Qianyu; Huang, Yunchuanxiang; Jiang, Ziyi; Wu, Feng; Ma, Lan
Format: Online Article Text
Language: English
Published: Frontiers Media S.A. 2021
Subjects: Genetics
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7831388/
https://www.ncbi.nlm.nih.gov/pubmed/33505425
http://dx.doi.org/10.3389/fgene.2020.591833
author Lin, Qianyu
Huang, Yunchuanxiang
Jiang, Ziyi
Wu, Feng
Ma, Lan
collection PubMed
description SARS-CoV-2 has caused a worldwide pandemic. Existing research on coronavirus mutations is based on small data sets, and multiple sequence alignment using a global-scale data set has yet to be conducted. Statistical analysis of integral mutations and global spread are necessary and could help improve primer design for nucleic acid diagnosis and vaccine development. Here, we optimized multiple sequence alignment using a conserved sequence search algorithm to align 24,768 sequences from the GISAID data set. A phylogenetic tree was constructed using the maximum likelihood (ML) method. Coronavirus subtypes were analyzed via t-SNE clustering. We performed haplotype network analysis and t-SNE clustering to analyze the coronavirus origin and spread. Overall, we identified 33 sense, 17 nonsense, 79 amino acid loss, and 4 amino acid insertion mutations in full-length open reading frames. Phylogenetic trees were successfully constructed and samples clustered into subtypes. The COVID-19 pandemic differed among countries and continents. Samples from the United States and western Europe were more diverse, and those from China and Asia mainly contained specific subtypes. Clades G/GH/GR are more likely to be the origin clades of SARS-CoV-2 compared with clades S/L/V. Conserved sequence searches can be used to segment long sequences, making large-scale multisequence alignment possible, facilitating more comprehensive gene mutation analysis. Mutation analysis of the SARS-CoV-2 can inform primer design for nucleic acid diagnosis to improve virus detection efficiency. In addition, research into the characteristics of viral spread and relationships among geographic regions can help formulate health policies and reduce the increase of imported cases.
format Online
Article
Text
id pubmed-7831388
institution National Center for Biotechnology Information
language English
publishDate 2021
publisher Frontiers Media S.A.
record_format MEDLINE/PubMed
spelling pubmed-7831388 2021-01-26 Lin, Qianyu; Huang, Yunchuanxiang; Jiang, Ziyi; Wu, Feng; Ma, Lan. Front Genet. Genetics. Frontiers Media S.A. 2021-01-11 /pmc/articles/PMC7831388/ /pubmed/33505425 http://dx.doi.org/10.3389/fgene.2020.591833 Text en Copyright © 2021 Lin, Huang, Jiang, Wu and Ma. http://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
title Deciphering the Subtype Differentiation History of SARS-CoV-2 Based on a New Breadth-First Searching Optimized Alignment Method Over a Global Data Set of 24,768 Sequences
topic Genetics