
Model-based clustering with certainty estimation: implication for clade assignment of influenza viruses


Bibliographic Details
Main authors: Zhang, Shunpu, Li, Zhong, Beland, Kevin, Lu, Guoqing
Format: Online Article Text
Language: English
Published: BioMed Central 2016
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4955158/
https://www.ncbi.nlm.nih.gov/pubmed/27439701
http://dx.doi.org/10.1186/s12859-016-1147-x
author Zhang, Shunpu
Li, Zhong
Beland, Kevin
Lu, Guoqing
collection PubMed
description BACKGROUND: Clustering is a common technique used by molecular biologists to group homologous sequences and study evolution. Open issues remain, such as how to cluster molecular sequences accurately and, in particular, how to evaluate the certainty of clustering results. RESULTS: We presented a model-based clustering method to analyze molecular sequences, described a subset bootstrap scheme to evaluate the certainty of the clusters, and showed an intuitive way to examine clusters using 3D visualization. We applied this approach to influenza viral hemagglutinin (HA) sequences. Nine clusters were estimated for highly pathogenic H5N1 avian influenza, which agrees with previous findings. The certainty that a given sequence was correctly assigned to a cluster was 1.0 in all cases, and the certainty for a given cluster was also very high (0.92–1.0), with an overall clustering certainty of 0.95. For influenza A H7 viruses, ten HA clusters were estimated, and the vast majority of sequences could be assigned to a cluster with a certainty of more than 0.99. The certainties for clusters, however, varied from 0.40 to 0.98; this variation is likely attributable to the heterogeneity of sequence data in different clusters. In both cases, the certainty values estimated using the subset bootstrap method were all higher than those calculated using the standard bootstrap method, suggesting that our bootstrap scheme is applicable to the estimation of clustering certainty. CONCLUSIONS: We formulated a clustering analysis approach with the estimation of certainties and 3D visualization of sequence data. We analysed two sets of influenza A HA sequences, and the results indicate that our approach is applicable to clustering analysis of influenza viral sequences. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1186/s12859-016-1147-x) contains supplementary material, which is available to authorized users.
format Online
Article
Text
id pubmed-4955158
institution National Center for Biotechnology Information
language English
publishDate 2016
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-4955158 2016-09-06 Model-based clustering with certainty estimation: implication for clade assignment of influenza viruses Zhang, Shunpu Li, Zhong Beland, Kevin Lu, Guoqing BMC Bioinformatics Methodology Article BioMed Central 2016-07-21 /pmc/articles/PMC4955158/ /pubmed/27439701 http://dx.doi.org/10.1186/s12859-016-1147-x Text en © The Author(s). 2016 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
title Model-based clustering with certainty estimation: implication for clade assignment of influenza viruses
topic Methodology Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4955158/
https://www.ncbi.nlm.nih.gov/pubmed/27439701
http://dx.doi.org/10.1186/s12859-016-1147-x