
Evaluation of various distance computation methods for construction of haplotype-based phylogenies from large MLST datasets

Multi-locus sequence typing (MLST) is widely used to investigate genetic relationships among eukaryotic taxa, including parasitic pathogens. MLST analysis workflows typically involve construction of alignment-based phylogenetic trees – i.e., where tree structures are computed from nucleotide differe...


Bibliographic Details
Main authors: Jacobson, David, Zheng, Yueli, Plucinski, Mateusz M., Qvarnstrom, Yvonne, Barratt, Joel L. N.
Format: Online Article Text
Language: English
Published: 2022
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10127246/
https://www.ncbi.nlm.nih.gov/pubmed/35963590
http://dx.doi.org/10.1016/j.ympev.2022.107608
author Jacobson, David
Zheng, Yueli
Plucinski, Mateusz M.
Qvarnstrom, Yvonne
Barratt, Joel L. N.
collection PubMed
description Multi-locus sequence typing (MLST) is widely used to investigate genetic relationships among eukaryotic taxa, including parasitic pathogens. MLST analysis workflows typically involve construction of alignment-based phylogenetic trees – i.e., where tree structures are computed from nucleotide differences observed in a multiple sequence alignment (MSA). Notably, alignment-based phylogenetic methods require that all isolates/taxa are represented by a single sequence. When multiple loci are sequenced these sequences may be concatenated to produce one tree that includes information from all loci. Alignment-based phylogenetic techniques are robust and widely used yet possess some shortcomings, including how heterozygous sites are handled, intolerance for missing data (i.e., partial genotypes), and differences in the way insertions-deletions (indels) are scored/treated during tree construction. In certain contexts, ‘haplotype-based’ methods may represent a viable alternative to alignment-based techniques, as they do not possess the aforementioned limitations. This is namely because haplotype-based methods assess genetic similarity based on numbers of shared (i.e., intersecting) haplotypes as opposed to similarities in nucleotide composition observed in an MSA. For haplotype-based comparisons, choosing an appropriate distance statistic is fundamental, and several statistics are available to choose from. However, a comprehensive assessment of various available statistics for their ability to produce a robust haplotype-based phylogenetic reconstruction has not yet been performed. We evaluated seven distance statistics by applying them to extant MLST datasets from the gastrointestinal parasite Cyclospora cayetanensis and two species of pathogenic nematode of the genus Strongyloides. We compare the genetic relationships identified using each statistic to epidemiologic, geographic, and host metadata. 
We show that Barratt’s heuristic definition of genetic distance was the most robust among the statistics evaluated. Consequently, it is proposed that Barratt’s heuristic represents a useful approach for use in the context of challenging MLST datasets possessing features (i.e., high heterozygosity, partial genotypes, and indel or repeat-based polymorphisms) that confound or preclude the use of alignment-based methods.
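As an illustration of the general idea described above (scoring similarity by shared, i.e. intersecting, haplotypes rather than by aligned nucleotides), the following is a minimal sketch that uses a per-locus Jaccard distance. It is not Barratt's heuristic, and this record does not say whether Jaccard distance is among the seven statistics evaluated. The function names, locus names, and haplotype labels are hypothetical, and skipping loci that are untyped in either sample is an assumed way of tolerating partial genotypes.

```python
from itertools import combinations


def jaccard_distance(a, b):
    """Return 1 - |A ∩ B| / |A ∪ B| for two sets of haplotype labels."""
    union = a | b
    if not union:
        return None  # nothing observed in either set
    return 1.0 - len(a & b) / len(union)


def haplotype_distance(s1, s2):
    """Mean per-locus Jaccard distance over loci typed in BOTH samples.

    Each sample is a dict mapping locus name -> set of haplotype labels.
    A heterozygous locus simply holds more than one label. Loci missing
    from either sample are skipped (assumption), so partial genotypes
    can still be compared. Returns None if no locus is typed in both.
    """
    shared_loci = [loc for loc in s1 if s1[loc] and s2.get(loc)]
    if not shared_loci:
        return None
    total = sum(jaccard_distance(s1[loc], s2[loc]) for loc in shared_loci)
    return total / len(shared_loci)


def pairwise_distances(samples):
    """Distance for every unordered pair of sample names."""
    return {
        (x, y): haplotype_distance(samples[x], samples[y])
        for x, y in combinations(sorted(samples), 2)
    }


# Hypothetical toy data: sample C is a partial genotype (locus2 untyped).
samples = {
    "A": {"locus1": {"h1", "h2"}, "locus2": {"h5"}},
    "B": {"locus1": {"h1"}, "locus2": {"h5"}},
    "C": {"locus1": {"h3"}},
}

if __name__ == "__main__":
    for pair, dist in pairwise_distances(samples).items():
        print(pair, dist)
```

The resulting pairwise distances could then be passed to a clustering or tree-building step. Which statistic to use at this stage is the question the study addresses.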
format Online
Article
Text
id pubmed-10127246
institution National Center for Biotechnology Information
language English
publishDate 2022
record_format MEDLINE/PubMed
spelling pubmed-10127246 2023-04-25 Mol Phylogenet Evol Article 2022-12 2022-08-11 /pmc/articles/PMC10127246/ /pubmed/35963590 http://dx.doi.org/10.1016/j.ympev.2022.107608 Text en This is an open access article under the CC BY-NC-ND license (https://creativecommons.org/licenses/by-nc-nd/4.0/).
title Evaluation of various distance computation methods for construction of haplotype-based phylogenies from large MLST datasets
topic Article