
A classification approach for genotyping viral sequences based on multidimensional scaling and linear discriminant analysis

BACKGROUND: Accurate classification into genotypes is critical in understanding evolution of divergent viruses. Here we report a new approach, MuLDAS, which classifies a query sequence based on the statistical genotype models learned from the known sequences. Thus, MuLDAS utilizes full spectra of we...


Bibliographic Details
Main authors: Kim, Jiwoong; Ahn, Yongju; Lee, Kichan; Park, Sung Hee; Kim, Sangsoo
Format: Text
Language: English
Published: BioMed Central 2010
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2936400/
https://www.ncbi.nlm.nih.gov/pubmed/20727194
http://dx.doi.org/10.1186/1471-2105-11-434
author Kim, Jiwoong
Ahn, Yongju
Lee, Kichan
Park, Sung Hee
Kim, Sangsoo
collection PubMed
description BACKGROUND: Accurate classification into genotypes is critical in understanding evolution of divergent viruses. Here we report a new approach, MuLDAS, which classifies a query sequence based on the statistical genotype models learned from the known sequences. Thus, MuLDAS utilizes full spectra of well-characterized sequences as references, typically on the order of hundreds, in order to estimate the significance of each genotype assignment. RESULTS: MuLDAS starts by aligning the query sequence to the reference multiple sequence alignment and then calculating the distance matrix among the sequences. The sequences are then mapped to a principal coordinate space by multidimensional scaling, and the coordinates of the reference sequences are used as features in developing linear discriminant models that partition the space by genotype. The genotype of the query is then given as the maximum a posteriori estimate. MuLDAS tests the model confidence by leave-one-out cross-validation and also provides some heuristics for the detection of 'outlier' sequences that fall far outside or in between genotype clusters. We have tested our method by classifying HIV-1 and HCV nucleotide sequences downloaded from NCBI GenBank, achieving overall concordance rates of 99.3% and 96.6%, respectively, with the benchmark test datasets retrieved from the respective databases of Los Alamos National Laboratory. CONCLUSIONS: The highly accurate genotype assignment, coupled with several measures for evaluating the results, makes MuLDAS useful in analyzing the sequences of rapidly evolving viruses such as HIV-1 and HCV. A web-based genotype prediction server is available at http://www.muldas.org/MuLDAS/.
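The pipeline summarized in the description (distance matrix, multidimensional scaling, linear discriminant models, maximum a posteriori assignment, leave-one-out cross-validation) can be illustrated with a minimal sketch. This is not the MuLDAS implementation. The toy alignment, the p-distance measure, the choice of two principal coordinates, the `ridge` stabiliser, and all function names are assumptions made for illustration, and the cross-validation here reuses the coordinates rather than recomputing them.

```python
import numpy as np

def p_distance(a, b):
    """Fraction of aligned positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b)) / len(a)

def classical_mds(dist, k):
    """Principal coordinates from a symmetric distance matrix (classical MDS)."""
    n = dist.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ (dist ** 2) @ centering  # double-centred Gram matrix
    eigvals, eigvecs = np.linalg.eigh(gram)
    top = np.argsort(eigvals)[::-1][:k]                # indices of the k largest eigenvalues
    return eigvecs[:, top] * np.sqrt(np.clip(eigvals[top], 0.0, None))

def lda_posteriors(train_x, train_y, x, ridge=1e-6):
    """Posterior probability of each genotype for point x under LDA
    (Gaussian classes with a shared covariance). `ridge` is an assumed
    stabiliser that keeps the pooled covariance invertible."""
    labels = np.array(train_y)
    classes = sorted(set(train_y))
    dim = train_x.shape[1]
    means, priors, scatter = [], [], np.zeros((dim, dim))
    for c in classes:
        xc = train_x[labels == c]
        means.append(xc.mean(axis=0))
        priors.append(len(xc) / len(train_x))
        scatter += (xc - means[-1]).T @ (xc - means[-1])
    cov = scatter / (len(train_x) - len(classes)) + ridge * np.eye(dim)
    inv = np.linalg.inv(cov)
    # Linear discriminant score of each class, then a numerically stable softmax.
    scores = np.array([x @ inv @ m - 0.5 * m @ inv @ m + np.log(p)
                       for m, p in zip(means, priors)])
    weights = np.exp(scores - scores.max())
    return dict(zip(classes, weights / weights.sum()))

# Hypothetical pre-aligned references: two genotypes, four sequences each.
references = [
    ("A", "ACGTACGTTCGTACGT"), ("A", "ACGTACGTATGTACGT"),
    ("A", "ACGTACGTACATACGT"), ("A", "ACGTACGTACGAACGT"),
    ("B", "CATGCATGACGTTCGT"), ("B", "CATGCATGACGTATGT"),
    ("B", "CATGCATGACGTACAT"), ("B", "CATGCATGACGTACGA"),
]
query = "CATGCATGACGTACGT"

labels = [g for g, _ in references]
seqs = [s for _, s in references] + [query]

# The query is embedded together with the references, as in the description.
dist = np.array([[p_distance(a, b) for b in seqs] for a in seqs])
coords = classical_mds(dist, k=2)
ref_xy, query_xy = coords[:-1], coords[-1]

# Maximum a posteriori genotype of the query.
posteriors = lda_posteriors(ref_xy, labels, query_xy)
genotype = max(posteriors, key=posteriors.get)

# Leave-one-out cross-validation over the reference coordinates.
hits = 0
for i in range(len(labels)):
    keep = [j for j in range(len(labels)) if j != i]
    post = lda_posteriors(ref_xy[keep], [labels[j] for j in keep], ref_xy[i])
    hits += max(post, key=post.get) == labels[i]
loo_accuracy = hits / len(labels)

print(genotype, posteriors, loo_accuracy)
```

On this deliberately idealized alignment the query equals the consensus of the hypothetical genotype B and is assigned to B. With real data the references would number in the hundreds, and a library routine such as scikit-learn's `LinearDiscriminantAnalysis` could take the place of the hand-written discriminant function.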
format Text
id pubmed-2936400
institution National Center for Biotechnology Information
language English
publishDate 2010
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-2936400 2011-07-08 A classification approach for genotyping viral sequences based on multidimensional scaling and linear discriminant analysis Kim, Jiwoong Ahn, Yongju Lee, Kichan Park, Sung Hee Kim, Sangsoo BMC Bioinformatics Methodology Article
BioMed Central 2010-08-21 /pmc/articles/PMC2936400/ /pubmed/20727194 http://dx.doi.org/10.1186/1471-2105-11-434 Text en Copyright ©2010 Kim et al; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
title A classification approach for genotyping viral sequences based on multidimensional scaling and linear discriminant analysis
topic Methodology Article