
A cautionary note on the use of unsupervised machine learning algorithms to characterise malaria parasite population structure from genetic distance matrices

Genetic surveillance of malaria parasites supports malaria control programmes, treatment guidelines and elimination strategies. Surveillance studies often pose questions about malaria parasite ancestry (e.g. how antimalarial resistance has spread) and employ statistical methods that characterise par...


Bibliographic Details
Main Authors: Watson, James A., Taylor, Aimee R., Ashley, Elizabeth A., Dondorp, Arjen, Buckee, Caroline O., White, Nicholas J., Holmes, Chris C.
Format: Online Article Text
Language: English
Published: Public Library of Science 2020
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7577480/
https://www.ncbi.nlm.nih.gov/pubmed/33035220
http://dx.doi.org/10.1371/journal.pgen.1009037
author Watson, James A.
Taylor, Aimee R.
Ashley, Elizabeth A.
Dondorp, Arjen
Buckee, Caroline O.
White, Nicholas J.
Holmes, Chris C.
author_facet Watson, James A.
Taylor, Aimee R.
Ashley, Elizabeth A.
Dondorp, Arjen
Buckee, Caroline O.
White, Nicholas J.
Holmes, Chris C.
author_sort Watson, James A.
collection PubMed
description Genetic surveillance of malaria parasites supports malaria control programmes, treatment guidelines and elimination strategies. Surveillance studies often pose questions about malaria parasite ancestry (e.g. how antimalarial resistance has spread) and employ statistical methods that characterise parasite population structure. Many of the methods used to characterise structure are unsupervised machine learning algorithms which depend on a genetic distance matrix, notably principal coordinates analysis (PCoA) and hierarchical agglomerative clustering (HAC). PCoA and HAC are sensitive to both the definition of genetic distance and algorithmic specification. Importantly, neither algorithm infers malaria parasite ancestry. As such, PCoA and HAC can inform (e.g. via exploratory data visualisation and hypothesis generation), but not answer comprehensively, key questions about malaria parasite ancestry. We illustrate the sensitivity of PCoA and HAC using 393 Plasmodium falciparum whole genome sequences collected from Cambodia and neighbouring regions (where antimalarial resistance has emerged and spread recently) and we provide tentative guidance for the use and interpretation of PCoA and HAC in malaria parasite genetic epidemiology. This guidance includes a call for fully transparent and reproducible analysis pipelines that feature (i) a clearly outlined scientific question; (ii) a clear justification of analytical methods used to answer the scientific question along with discussion of any inferential limitations; (iii) publicly available genetic distance matrices when downstream analyses depend on them; and (iv) sensitivity analyses. To bridge the inferential disconnect between the output of non-inferential unsupervised learning algorithms and the scientific questions of interest, tailor-made statistical models are needed to infer malaria parasite ancestry. 
In the absence of such models speculative reasoning should feature only as discussion but not as results.
format Online
Article
Text
id pubmed-7577480
institution National Center for Biotechnology Information
language English
publishDate 2020
publisher Public Library of Science
record_format MEDLINE/PubMed
spelling pubmed-75774802020-10-26 A cautionary note on the use of unsupervised machine learning algorithms to characterise malaria parasite population structure from genetic distance matrices Watson, James A. Taylor, Aimee R. Ashley, Elizabeth A. Dondorp, Arjen Buckee, Caroline O. White, Nicholas J. Holmes, Chris C. PLoS Genet Research Article Genetic surveillance of malaria parasites supports malaria control programmes, treatment guidelines and elimination strategies. Surveillance studies often pose questions about malaria parasite ancestry (e.g. how antimalarial resistance has spread) and employ statistical methods that characterise parasite population structure. Many of the methods used to characterise structure are unsupervised machine learning algorithms which depend on a genetic distance matrix, notably principal coordinates analysis (PCoA) and hierarchical agglomerative clustering (HAC). PCoA and HAC are sensitive to both the definition of genetic distance and algorithmic specification. Importantly, neither algorithm infers malaria parasite ancestry. As such, PCoA and HAC can inform (e.g. via exploratory data visualisation and hypothesis generation), but not answer comprehensively, key questions about malaria parasite ancestry. We illustrate the sensitivity of PCoA and HAC using 393 Plasmodium falciparum whole genome sequences collected from Cambodia and neighbouring regions (where antimalarial resistance has emerged and spread recently) and we provide tentative guidance for the use and interpretation of PCoA and HAC in malaria parasite genetic epidemiology. 
This guidance includes a call for fully transparent and reproducible analysis pipelines that feature (i) a clearly outlined scientific question; (ii) a clear justification of analytical methods used to answer the scientific question along with discussion of any inferential limitations; (iii) publicly available genetic distance matrices when downstream analyses depend on them; and (iv) sensitivity analyses. To bridge the inferential disconnect between the output of non-inferential unsupervised learning algorithms and the scientific questions of interest, tailor-made statistical models are needed to infer malaria parasite ancestry. In the absence of such models speculative reasoning should feature only as discussion but not as results. Public Library of Science 2020-10-09 /pmc/articles/PMC7577480/ /pubmed/33035220 http://dx.doi.org/10.1371/journal.pgen.1009037 Text en © 2020 Watson et al http://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/) , which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
spellingShingle Research Article
Watson, James A.
Taylor, Aimee R.
Ashley, Elizabeth A.
Dondorp, Arjen
Buckee, Caroline O.
White, Nicholas J.
Holmes, Chris C.
A cautionary note on the use of unsupervised machine learning algorithms to characterise malaria parasite population structure from genetic distance matrices
title A cautionary note on the use of unsupervised machine learning algorithms to characterise malaria parasite population structure from genetic distance matrices
topic Research Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7577480/
https://www.ncbi.nlm.nih.gov/pubmed/33035220
http://dx.doi.org/10.1371/journal.pgen.1009037