A machine learning-based typing scheme refinement for Listeria monocytogenes core genome multilocus sequence typing with high discriminatory power for common source outbreak tracking
BACKGROUND: As whole-genome sequencing for pathogen genomes becomes increasingly popular, the typing methods of gene-by-gene comparison, such as core genome multilocus sequence typing (cgMLST) and whole-genome multilocus sequence typing (wgMLST), are being routinely implemented in molecular epidemio...
Main authors: | Liu, Yen-Yi; Chen, Chih-Chieh |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Public Library of Science 2021 |
Subjects: | |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8604304/ https://www.ncbi.nlm.nih.gov/pubmed/34797875 http://dx.doi.org/10.1371/journal.pone.0260293 |
_version_ | 1784601929474637824 |
---|---|
author | Liu, Yen-Yi; Chen, Chih-Chieh |
author_sort | Liu, Yen-Yi |
collection | PubMed |
description | BACKGROUND: As whole-genome sequencing for pathogen genomes becomes increasingly popular, the typing methods of gene-by-gene comparison, such as core genome multilocus sequence typing (cgMLST) and whole-genome multilocus sequence typing (wgMLST), are being routinely implemented in molecular epidemiology. However, some intrinsic problems remain. For example, genomic sequences with varying read depths, read lengths, and assemblers influence the genome assemblies, introducing error or missing alleles into the generated allelic profiles. These errors and missing alleles might create “specious discrepancy” among closely related isolates, thus making accurate epidemiological interpretation challenging. In addition, the rapid growth of the cgMLST allelic profile database can cause problems related to storage and maintenance as well as long query search times. METHODS: We attempted to resolve these issues by decreasing the scheme size to reduce the occurrence of error and missing alleles, alleviate the storage burden, and improve the query search time. The challenge in this approach is maintaining the typing resolution when using fewer loci. We achieved this by using a popular artificial intelligence technique, XGBoost, coupled with Shapley additive explanations for feature selection. Finally, 370 loci from the original 1701 cgMLST loci of Listeria monocytogenes were selected. RESULTS: Although the size of the final scheme (LmScheme_370) was approximately 80% lower than that of the original cgMLST scheme, its discriminatory power, tested for 35 outbreaks, was concordant with that of the original cgMLST scheme. Although we used L. monocytogenes as a demonstration in this study, the approach can be applied to other schemes and pathogens. Our findings might help elucidate gene-by-gene–based epidemiology. |
format | Online Article Text |
id | pubmed-8604304 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2021 |
publisher | Public Library of Science |
record_format | MEDLINE/PubMed |
spelling | pubmed-8604304 2021-11-20 PLoS One Research Article Public Library of Science 2021-11-19 /pmc/articles/PMC8604304/ /pubmed/34797875 http://dx.doi.org/10.1371/journal.pone.0260293 Text en © 2021 Liu, Chen https://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. |
spellingShingle | Research Article Liu, Yen-Yi Chen, Chih-Chieh A machine learning-based typing scheme refinement for Listeria monocytogenes core genome multilocus sequence typing with high discriminatory power for common source outbreak tracking |
title | A machine learning-based typing scheme refinement for Listeria monocytogenes core genome multilocus sequence typing with high discriminatory power for common source outbreak tracking |
topic | Research Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8604304/ https://www.ncbi.nlm.nih.gov/pubmed/34797875 http://dx.doi.org/10.1371/journal.pone.0260293 |
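
The abstract describes reducing a cgMLST scheme to fewer loci while keeping isolates distinguishable, and notes that missing alleles can distort comparisons. The following is a minimal sketch of that idea, not the authors' implementation: it counts pairwise allelic differences while skipping loci with a missing call, then compares a full scheme with a reduced one. The isolate names, locus names, allele IDs, the use of `None` for a missing allele, and the `reduced_scheme` list are all hypothetical; the XGBoost and SHAP selection step itself is not reproduced here.

```python
from itertools import combinations

MISSING = None  # assumption: a missing allele call is stored as None


def allelic_distance(profile_a, profile_b, loci):
    """Count loci where both isolates have a call and the allele IDs differ."""
    distance = 0
    for locus in loci:
        a = profile_a.get(locus, MISSING)
        b = profile_b.get(locus, MISSING)
        if a is MISSING or b is MISSING:
            continue  # skip loci that are missing in either isolate
        if a != b:
            distance += 1
    return distance


def pairwise_distances(profiles, loci):
    """Return {(isolate_1, isolate_2): distance} for every pair of isolates."""
    return {
        (x, y): allelic_distance(profiles[x], profiles[y], loci)
        for x, y in combinations(sorted(profiles), 2)
    }


# Toy allelic profiles; locus names and allele IDs are made up.
profiles = {
    "isolate_A": {"locus1": 1, "locus2": 4, "locus3": 7, "locus4": 2},
    "isolate_B": {"locus1": 1, "locus2": 4, "locus3": None, "locus4": 2},
    "isolate_C": {"locus1": 3, "locus2": 5, "locus3": 7, "locus4": 2},
}
full_scheme = ["locus1", "locus2", "locus3", "locus4"]
# Hypothetical output of a feature-selection step (not computed here).
reduced_scheme = ["locus1", "locus2"]

full = pairwise_distances(profiles, full_scheme)
reduced = pairwise_distances(profiles, reduced_scheme)

for pair in full:
    print(pair, "full:", full[pair], "reduced:", reduced[pair])
```

In this toy data the reduced scheme yields the same pairwise distances as the full scheme, which is the kind of concordance the abstract reports for LmScheme_370. Real schemes would need the comparison repeated across known outbreak clusters.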