Machine learning classification by fitting amplicon sequences to existing OTUs

Bibliographic Details
Main authors: Armour, Courtney R., Sovacool, Kelly L., Close, William L., Topçuoğlu, Begüm D., Wiens, Jenna, Schloss, Patrick D.
Format: Online Article Text
Language: English
Published: American Society for Microbiology 2023
Subjects: Observation
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10597446/
https://www.ncbi.nlm.nih.gov/pubmed/37615431
http://dx.doi.org/10.1128/msphere.00336-23
collection PubMed
description The ability to use 16S rRNA gene sequence data to train machine learning classification models offers the opportunity to diagnose patients based on the composition of their microbiome. In some applications, the taxonomic resolution that provides the best models may require the use of de novo operational taxonomic units (OTUs) whose composition changes when new data are added. We previously developed a new reference-based approach, OptiFit, that fits new sequence data to existing de novo OTUs without changing the composition of the original OTUs. While OptiFit produces OTUs that are as high quality as de novo OTUs, it is unclear whether this method for fitting new sequence data into existing OTUs will impact the performance of classification models relative to models trained and tested only using de novo OTUs. We used OptiFit to cluster sequences into existing OTUs and evaluated model performance in classifying a dataset containing samples from patients with and without colonic screen-relevant neoplasia (SRN). We compared the performance of this model to standard methods including de novo and database-reference-based clustering. We found that using OptiFit performed as well as or better than these standard methods in classifying SRNs. OptiFit can streamline the process of classifying new samples by avoiding the need to retrain models using reclustered sequences.
IMPORTANCE: There is great potential for using microbiome data to aid in diagnosis. A challenge with de novo operational taxonomic unit (OTU)-based classification models is that 16S rRNA gene sequences are often assigned to OTUs based on similarity to other sequences in the dataset. If data are generated from new patients, the old and new sequences must be reclustered to OTUs and the classification model retrained. Yet there is a desire to have a single, validated model that can be widely deployed. To overcome this obstacle, we applied the OptiFit clustering algorithm to fit new sequence data to existing OTUs, allowing for reuse of the model. A random forest model implemented using OptiFit performed as well as the traditional reassign and retrain approach. This result shows that it is possible to train and apply machine learning models based on OTU relative abundance data that do not require retraining or the use of a reference database.
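The idea summarized in the description, fitting new sequences to a fixed set of OTUs so that a trained model sees the same features, can be illustrated with a minimal Python sketch. This is not OptiFit: it is a simplified stand-in that assigns each new sequence to the nearest existing OTU representative and discards sequences that match none. The function names, the toy sequences, the use of one representative per OTU, and the 3% distance threshold are all assumptions made for illustration and do not come from the record.

```python
from collections import Counter


def mismatch_fraction(a: str, b: str) -> float:
    """Fraction of differing positions between two aligned, equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def fit_to_existing_otus(seqs, otu_reps, threshold=0.03):
    """Count new sequences per existing OTU; the OTU set itself never changes.

    Assumption: each OTU is summarized by a single representative sequence,
    and a sequence farther than `threshold` from every representative is dropped.
    """
    counts = Counter()
    for seq in seqs:
        best_otu, best_dist = None, None
        for otu, rep in otu_reps.items():
            dist = mismatch_fraction(seq, rep)
            if best_dist is None or dist < best_dist:
                best_otu, best_dist = otu, dist
        if best_dist is not None and best_dist <= threshold:
            counts[best_otu] += 1
    return counts


def relative_abundance(counts, otu_order):
    """Return relative abundances in a fixed OTU order (the model's feature order)."""
    total = sum(counts.values())
    return [counts[otu] / total if total else 0.0 for otu in otu_order]


# Hypothetical toy data: two existing OTUs and one new sample of four reads.
otu_reps = {"Otu1": "A" * 40, "Otu2": "C" * 40}
new_sample = ["A" * 40, "A" * 39 + "G", "C" * 40, "G" * 40]

counts = fit_to_existing_otus(new_sample, otu_reps)
features = relative_abundance(counts, ["Otu1", "Otu2"])
print(counts, features)
```

Because the feature vector always follows the original OTU order, a classifier trained on the original OTU table could in principle be applied to `features` without reclustering or retraining, which is the workflow the description attributes to OptiFit.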
id pubmed-10597446
institution National Center for Biotechnology Information
record_format MEDLINE/PubMed
spelling pubmed-10597446 2023-10-25 mSphere Observation American Society for Microbiology 2023-08-24 /pmc/articles/PMC10597446/ /pubmed/37615431 http://dx.doi.org/10.1128/msphere.00336-23 Text en Copyright © 2023 Armour et al. This is an open-access article distributed under the terms of the Creative Commons Attribution 4.0 International license (https://creativecommons.org/licenses/by/4.0/).
topic Observation