Employing phylogenetic tree shape statistics to resolve the underlying host population structure
BACKGROUND: Host population structure is a key determinant of pathogen and infectious disease transmission patterns. Pathogen phylogenetic trees are useful tools to reveal the population structure underlying an epidemic. Determining whether a population is structured or not is useful in informing th...
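Main authors: | Kayondo, Hassan W.; Ssekagiri, Alfred; Nabakooza, Grace; Bbosa, Nicholas; Ssemwanga, Deogratius; Kaleebu, Pontiano; Mwalili, Samuel; Mango, John M.; Leigh Brown, Andrew J.; Saenz, Roberto A.; Galiwango, Ronald; Kitayimbwa, John M. |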
---|---|
Format: | Online Article Text |
Language: | English |
Published: | BioMed Central 2021 |
Subjects: | |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8579572/ https://www.ncbi.nlm.nih.gov/pubmed/34758743 http://dx.doi.org/10.1186/s12859-021-04465-1 |
_version_ | 1784596453386092544 |
---|---|
author | Kayondo, Hassan W. Ssekagiri, Alfred Nabakooza, Grace Bbosa, Nicholas Ssemwanga, Deogratius Kaleebu, Pontiano Mwalili, Samuel Mango, John M. Leigh Brown, Andrew J. Saenz, Roberto A. Galiwango, Ronald Kitayimbwa, John M. |
author_facet | Kayondo, Hassan W. Ssekagiri, Alfred Nabakooza, Grace Bbosa, Nicholas Ssemwanga, Deogratius Kaleebu, Pontiano Mwalili, Samuel Mango, John M. Leigh Brown, Andrew J. Saenz, Roberto A. Galiwango, Ronald Kitayimbwa, John M. |
author_sort | Kayondo, Hassan W. |
collection | PubMed |
description | BACKGROUND: Host population structure is a key determinant of pathogen and infectious disease transmission patterns. Pathogen phylogenetic trees are useful tools to reveal the population structure underlying an epidemic. Determining whether a population is structured or not is useful in informing the type of phylogenetic methods to be used in a given study. We employ tree statistics derived from phylogenetic trees and machine learning classification techniques to reveal an underlying population structure. RESULTS: In this paper, we simulate phylogenetic trees from both structured and non-structured host populations. We compute eight statistics for the simulated trees, which are: the number of cherries; Sackin, Colless and total cophenetic indices; ladder length; maximum depth; maximum width, and width-to-depth ratio. Based on the estimated tree statistics, we classify the simulated trees as from either a non-structured or a structured population using the decision tree (DT), K-nearest neighbor (KNN) and support vector machine (SVM). We incorporate the basic reproductive number ([Formula: see text] ) in our tree simulation procedure. Sensitivity analysis is done to investigate whether the classifiers are robust to different choice of model parameters and to size of trees. Cross-validated results for area under the curve (AUC) for receiver operating characteristic (ROC) curves yield mean values of over 0.9 for most of the classification models. CONCLUSIONS: Our classification procedure distinguishes well between trees from structured and non-structured populations using the classifiers, the two-sample Kolmogorov-Smirnov, Cucconi and Podgor-Gastwirth tests and the box plots. SVM models were more robust to changes in model parameters and tree size compared to KNN and DT classifiers. 
Our classification procedure was applied to real-world data and the structured population was revealed with high accuracy of [Formula: see text] using the SVM-polynomial classifier. |
format | Online Article Text |
id | pubmed-8579572 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2021 |
publisher | BioMed Central |
record_format | MEDLINE/PubMed |
spelling | pubmed-85795722021-11-10 Employing phylogenetic tree shape statistics to resolve the underlying host population structure Kayondo, Hassan W. Ssekagiri, Alfred Nabakooza, Grace Bbosa, Nicholas Ssemwanga, Deogratius Kaleebu, Pontiano Mwalili, Samuel Mango, John M. Leigh Brown, Andrew J. Saenz, Roberto A. Galiwango, Ronald Kitayimbwa, John M. BMC Bioinformatics Research Article BACKGROUND: Host population structure is a key determinant of pathogen and infectious disease transmission patterns. Pathogen phylogenetic trees are useful tools to reveal the population structure underlying an epidemic. Determining whether a population is structured or not is useful in informing the type of phylogenetic methods to be used in a given study. We employ tree statistics derived from phylogenetic trees and machine learning classification techniques to reveal an underlying population structure. RESULTS: In this paper, we simulate phylogenetic trees from both structured and non-structured host populations. We compute eight statistics for the simulated trees, which are: the number of cherries; Sackin, Colless and total cophenetic indices; ladder length; maximum depth; maximum width, and width-to-depth ratio. Based on the estimated tree statistics, we classify the simulated trees as from either a non-structured or a structured population using the decision tree (DT), K-nearest neighbor (KNN) and support vector machine (SVM). We incorporate the basic reproductive number ([Formula: see text] ) in our tree simulation procedure. Sensitivity analysis is done to investigate whether the classifiers are robust to different choice of model parameters and to size of trees. Cross-validated results for area under the curve (AUC) for receiver operating characteristic (ROC) curves yield mean values of over 0.9 for most of the classification models. 
CONCLUSIONS: Our classification procedure distinguishes well between trees from structured and non-structured populations using the classifiers, the two-sample Kolmogorov-Smirnov, Cucconi and Podgor-Gastwirth tests and the box plots. SVM models were more robust to changes in model parameters and tree size compared to KNN and DT classifiers. Our classification procedure was applied to real-world data and the structured population was revealed with high accuracy of [Formula: see text] using the SVM-polynomial classifier. BioMed Central 2021-11-10 /pmc/articles/PMC8579572/ /pubmed/34758743 http://dx.doi.org/10.1186/s12859-021-04465-1 Text en © The Author(s) 2021 https://creativecommons.org/licenses/by/4.0/ Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ (https://creativecommons.org/licenses/by/4.0/). The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/ (https://creativecommons.org/publicdomain/zero/1.0/)) applies to the data made available in this article, unless otherwise stated in a credit line to the data. |
spellingShingle | Research Article Kayondo, Hassan W. Ssekagiri, Alfred Nabakooza, Grace Bbosa, Nicholas Ssemwanga, Deogratius Kaleebu, Pontiano Mwalili, Samuel Mango, John M. Leigh Brown, Andrew J. Saenz, Roberto A. Galiwango, Ronald Kitayimbwa, John M. Employing phylogenetic tree shape statistics to resolve the underlying host population structure |
title | Employing phylogenetic tree shape statistics to resolve the underlying host population structure |
title_full | Employing phylogenetic tree shape statistics to resolve the underlying host population structure |
title_fullStr | Employing phylogenetic tree shape statistics to resolve the underlying host population structure |
title_full_unstemmed | Employing phylogenetic tree shape statistics to resolve the underlying host population structure |
title_short | Employing phylogenetic tree shape statistics to resolve the underlying host population structure |
title_sort | employing phylogenetic tree shape statistics to resolve the underlying host population structure |
topic | Research Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8579572/ https://www.ncbi.nlm.nih.gov/pubmed/34758743 http://dx.doi.org/10.1186/s12859-021-04465-1 |
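
The abstract above lists tree shape statistics (number of cherries, Sackin and Colless indices, maximum depth, maximum width, width-to-depth ratio) without defining them. As a rough illustration only, the sketch below computes six of them for a rooted binary tree using their standard textbook definitions. It is not the authors' code: the nested-tuple tree representation, the function names `is_leaf` and `tree_shape_stats`, and the choice to count all nodes (not only tips) per depth level for the width are assumptions made for this example. Ladder length and the total cophenetic index are left out.

```python
from collections import Counter


def is_leaf(node):
    """A leaf is any non-tuple label; an internal node is a (left, right) tuple."""
    return not isinstance(node, tuple)


def tree_shape_stats(tree):
    """Return a dict of shape statistics for a rooted binary tree.

    Assumed definitions (standard, but not taken from the article):
      cherries : internal nodes whose two children are both leaves
      sackin   : sum of leaf depths (root has depth 0)
      colless  : sum over internal nodes of |leaves(left) - leaves(right)|
      max_depth: largest depth of any node
      max_width: largest number of nodes found at a single depth
    """
    cherries = 0
    sackin = 0
    colless = 0
    width = Counter()  # depth -> number of nodes at that depth

    def visit(node, depth):
        nonlocal cherries, sackin, colless
        width[depth] += 1
        if is_leaf(node):
            sackin += depth
            return 1  # one leaf below (and including) this node
        left, right = node
        n_left = visit(left, depth + 1)
        n_right = visit(right, depth + 1)
        if is_leaf(left) and is_leaf(right):
            cherries += 1
        colless += abs(n_left - n_right)
        return n_left + n_right

    n_leaves = visit(tree, 0)
    max_depth = max(width)            # deepest level reached
    max_width = max(width.values())   # most populated level
    return {
        "n_leaves": n_leaves,
        "cherries": cherries,
        "sackin": sackin,
        "colless": colless,
        "max_depth": max_depth,
        "max_width": max_width,
        "width_to_depth": max_width / max_depth,
    }


# Two four-tip trees: fully balanced versus fully unbalanced (caterpillar).
balanced = (("a", "b"), ("c", "d"))
caterpillar = ((("a", "b"), "c"), "d")

if __name__ == "__main__":
    print(tree_shape_stats(balanced))
    print(tree_shape_stats(caterpillar))
```

The balanced tree has a Colless index of 0 and two cherries, while the caterpillar has a higher Sackin index and a single cherry. Contrasts of this kind are what a classifier trained on such statistics would have to pick up. The recursive traversal is adequate for small examples; very deep trees would need an iterative version.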