A classification approach for genotyping viral sequences based on multidimensional scaling and linear discriminant analysis
BACKGROUND: Accurate classification into genotypes is critical in understanding evolution of divergent viruses. Here we report a new approach, MuLDAS, which classifies a query sequence based on the statistical genotype models learned from the known sequences. Thus, MuLDAS utilizes full spectra of we...
Main authors: | Kim, Jiwoong; Ahn, Yongju; Lee, Kichan; Park, Sung Hee; Kim, Sangsoo |
---|---|
Format: | Text |
Language: | English |
Published: | BioMed Central 2010 |
Subjects: | Methodology Article |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2936400/ https://www.ncbi.nlm.nih.gov/pubmed/20727194 http://dx.doi.org/10.1186/1471-2105-11-434 |
_version_ | 1782186487013441536 |
---|---|
author | Kim, Jiwoong; Ahn, Yongju; Lee, Kichan; Park, Sung Hee; Kim, Sangsoo |
author_sort | Kim, Jiwoong |
collection | PubMed |
description | BACKGROUND: Accurate classification into genotypes is critical in understanding evolution of divergent viruses. Here we report a new approach, MuLDAS, which classifies a query sequence based on the statistical genotype models learned from the known sequences. Thus, MuLDAS utilizes full spectra of well characterized sequences as references, typically of an order of hundreds, in order to estimate the significance of each genotype assignment. RESULTS: MuLDAS starts by aligning the query sequence to the reference multiple sequence alignment and calculating the subsequent distance matrix among the sequences. They are then mapped to a principal coordinate space by multidimensional scaling, and the coordinates of the reference sequences are used as features in developing linear discriminant models that partition the space by genotype. The genotype of the query is then given as the maximum a posteriori estimate. MuLDAS tests the model confidence by leave-one-out cross-validation and also provides some heuristics for the detection of 'outlier' sequences that fall far outside or in-between genotype clusters. We have tested our method by classifying HIV-1 and HCV nucleotide sequences downloaded from NCBI GenBank, achieving the overall concordance rates of 99.3% and 96.6%, respectively, with the benchmark test dataset retrieved from the respective databases of Los Alamos National Laboratory. CONCLUSIONS: The highly accurate genotype assignment coupled with several measures for evaluating the results makes MuLDAS useful in analyzing the sequences of rapidly evolving viruses such as HIV-1 and HCV. A web-based genotype prediction server is available at http://www.muldas.org/MuLDAS/. |
format | Text |
id | pubmed-2936400 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2010 |
publisher | BioMed Central |
record_format | MEDLINE/PubMed |
spelling | pubmed-2936400 2011-07-08 A classification approach for genotyping viral sequences based on multidimensional scaling and linear discriminant analysis Kim, Jiwoong; Ahn, Yongju; Lee, Kichan; Park, Sung Hee; Kim, Sangsoo BMC Bioinformatics Methodology Article BioMed Central 2010-08-21 /pmc/articles/PMC2936400/ /pubmed/20727194 http://dx.doi.org/10.1186/1471-2105-11-434 Text en Copyright ©2010 Kim et al; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. |
title | A classification approach for genotyping viral sequences based on multidimensional scaling and linear discriminant analysis |
topic | Methodology Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2936400/ https://www.ncbi.nlm.nih.gov/pubmed/20727194 http://dx.doi.org/10.1186/1471-2105-11-434 |
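
The pipeline summarized in the description (distance matrix, multidimensional scaling, linear discriminant analysis, maximum a posteriori assignment, leave-one-out cross-validation) can be illustrated with a minimal sketch. This is not the authors' implementation. Everything here is an assumption made for illustration: the toy aligned sequences and their genotype labels "A" and "B" are invented, the p-distance is a stand-in because the abstract does not name the distance measure, only the first principal coordinate is kept so that LDA reduces to a scalar formula, and the function names are hypothetical.

```python
import math
import random

def p_distance(a, b):
    """Fraction of aligned, non-gap columns at which two sequences differ."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    return sum(x != y for x, y in pairs) / len(pairs)

def first_principal_coordinate(dist, iterations=300):
    """Classical MDS restricted to one axis: the leading eigenvector of the
    double-centred squared-distance matrix, found by power iteration."""
    n = len(dist)
    sq = [[d * d for d in row] for row in dist]
    row_mean = [sum(row) / n for row in sq]
    grand = sum(row_mean) / n
    b = [[-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + grand)
          for j in range(n)] for i in range(n)]
    rng = random.Random(0)  # fixed seed so the sketch is reproducible
    v = [rng.random() - 0.5 for _ in range(n)]
    for _ in range(iterations):
        w = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigenvalue = sum(v[i] * b[i][j] * v[j] for i in range(n) for j in range(n))
    return [x * math.sqrt(max(eigenvalue, 0.0)) for x in v]

def fit_lda_1d(xs, labels):
    """One-dimensional LDA: class means, class priors, pooled variance."""
    classes = sorted(set(labels))
    means = {c: sum(x for x, l in zip(xs, labels) if l == c) / labels.count(c)
             for c in classes}
    priors = {c: labels.count(c) / len(labels) for c in classes}
    pooled = sum((x - means[l]) ** 2 for x, l in zip(xs, labels))
    pooled /= len(xs) - len(classes)
    return classes, means, priors, pooled

def posterior(model, x):
    """Posterior probability of each class at coordinate x."""
    classes, means, priors, var = model
    scores = {c: x * means[c] / var - means[c] ** 2 / (2 * var)
              + math.log(priors[c]) for c in classes}
    top = max(scores.values())  # subtract the maximum to avoid overflow
    weights = {c: math.exp(s - top) for c, s in scores.items()}
    total = sum(weights.values())
    return {c: w / total for c, w in weights.items()}

# Invented toy data: (aligned sequence, genotype label).
refs = [
    ("ACGTACGTACGTACGT", "A"), ("TCGTACGTACGTACGT", "A"),
    ("ACGTACGTACGTACGA", "A"), ("AGGTACGTACGTACCT", "A"),
    ("ACGTTGCATGCAACGT", "B"), ("ACCTTGCATGCAACGT", "B"),
    ("ACGTTGCATGCATCGT", "B"), ("CCGTTGCATGCAACTT", "B"),
]
query = "ACGAACGTACGTACGT"

# The query is embedded together with the references; only the reference
# coordinates are used to train the discriminant model.
seqs = [s for s, _ in refs] + [query]
dist = [[p_distance(a, b) for b in seqs] for a in seqs]
coords = first_principal_coordinate(dist)
ref_x, labels = coords[:-1], [g for _, g in refs]

post = posterior(fit_lda_1d(ref_x, labels), coords[-1])
predicted = max(post, key=post.get)  # maximum a posteriori genotype

# Leave-one-out cross-validation over the reference coordinates.
hits = 0
for i in range(len(refs)):
    model = fit_lda_1d(ref_x[:i] + ref_x[i + 1:], labels[:i] + labels[i + 1:])
    p = posterior(model, ref_x[i])
    hits += max(p, key=p.get) == labels[i]
loo_concordance = hits / len(refs)

print(predicted, post, loo_concordance)
```

A realistic analysis would keep several principal coordinates and use a full covariance matrix in the discriminant step; the one-axis version is chosen here only to keep the sketch dependency-free.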