Streaming histogram sketching for rapid microbiome analytics

BACKGROUND: The growth in publicly available microbiome data in recent years has yielded an invaluable resource for genomic research, allowing for the design of new studies, augmentation of novel datasets and reanalysis of published works. This vast amount of microbiome data, as well as the widesp...

Bibliographic Details
Main Authors: Rowe, Will PM, Carrieri, Anna Paola, Alcon-Giner, Cristina, Caim, Shabhonam, Shaw, Alex, Sim, Kathleen, Kroll, J. Simon, Hall, Lindsay J., Pyzer-Knapp, Edward O., Winn, Martyn D.
Format: Online Article Text
Language: English
Published: BioMed Central 2019
Subjects: Methodology
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6420756/
https://www.ncbi.nlm.nih.gov/pubmed/30878035
http://dx.doi.org/10.1186/s40168-019-0653-2
author Rowe, Will PM
Carrieri, Anna Paola
Alcon-Giner, Cristina
Caim, Shabhonam
Shaw, Alex
Sim, Kathleen
Kroll, J. Simon
Hall, Lindsay J.
Pyzer-Knapp, Edward O.
Winn, Martyn D.
author_sort Rowe, Will PM
collection PubMed
description BACKGROUND: The growth in publicly available microbiome data in recent years has yielded an invaluable resource for genomic research, allowing for the design of new studies, augmentation of novel datasets and reanalysis of published works. This vast amount of microbiome data, as well as the widespread proliferation of microbiome research and the looming era of clinical metagenomics, means there is an urgent need to develop analytics that can process huge amounts of data in a short amount of time. To address this need, we propose a new method for the compact representation of microbiome sequencing data using similarity-preserving sketches of streaming k-mer spectra. These sketches allow for dissimilarity estimation, rapid microbiome catalogue searching and classification of microbiome samples in near real time. RESULTS: We apply streaming histogram sketching to microbiome samples as a form of dimensionality reduction, creating a compressed ‘histosketch’ that can efficiently represent microbiome k-mer spectra. Using public microbiome datasets, we show that histosketches can be clustered by sample type using the pairwise Jaccard similarity estimation, consequently allowing for rapid microbiome similarity searches via a locality sensitive hashing indexing scheme. Furthermore, we use a ‘real life’ example to show that histosketches can train machine learning classifiers to accurately label microbiome samples. Specifically, using a collection of 108 novel microbiome samples from a cohort of premature neonates, we trained and tested a random forest classifier that could accurately predict whether the neonate had received antibiotic treatment (97% accuracy, 96% precision) and could subsequently be used to classify microbiome data streams in less than 3 s. CONCLUSIONS: Our method offers a new approach to rapidly process microbiome data streams, allowing samples to be rapidly clustered, indexed and classified. We also provide our implementation, Histosketching Using Little K-mers (HULK), which can histosketch a typical 2 GB microbiome in 50 s on a standard laptop using four cores, with the sketch occupying 3000 bytes of disk space (https://github.com/will-rowe/hulk).
format Online
Article
Text
id pubmed-6420756
institution National Center for Biotechnology Information
language English
publishDate 2019
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-64207562019-03-28 Streaming histogram sketching for rapid microbiome analytics Rowe, Will PM Carrieri, Anna Paola Alcon-Giner, Cristina Caim, Shabhonam Shaw, Alex Sim, Kathleen Kroll, J. Simon Hall, Lindsay J. Pyzer-Knapp, Edward O. Winn, Martyn D. Microbiome Methodology BACKGROUND: The growth in publicly available microbiome data in recent years has yielded an invaluable resource for genomic research, allowing for the design of new studies, augmentation of novel datasets and reanalysis of published works. This vast amount of microbiome data, as well as the widespread proliferation of microbiome research and the looming era of clinical metagenomics, means there is an urgent need to develop analytics that can process huge amounts of data in a short amount of time. To address this need, we propose a new method for the compact representation of microbiome sequencing data using similarity-preserving sketches of streaming k-mer spectra. These sketches allow for dissimilarity estimation, rapid microbiome catalogue searching and classification of microbiome samples in near real time. RESULTS: We apply streaming histogram sketching to microbiome samples as a form of dimensionality reduction, creating a compressed ‘histosketch’ that can efficiently represent microbiome k-mer spectra. Using public microbiome datasets, we show that histosketches can be clustered by sample type using the pairwise Jaccard similarity estimation, consequently allowing for rapid microbiome similarity searches via a locality sensitive hashing indexing scheme. Furthermore, we use a ‘real life’ example to show that histosketches can train machine learning classifiers to accurately label microbiome samples. Specifically, using a collection of 108 novel microbiome samples from a cohort of premature neonates, we trained and tested a random forest classifier that could accurately predict whether the neonate had received antibiotic treatment (97% accuracy, 96% precision) and could subsequently be used to classify microbiome data streams in less than 3 s. CONCLUSIONS: Our method offers a new approach to rapidly process microbiome data streams, allowing samples to be rapidly clustered, indexed and classified. We also provide our implementation, Histosketching Using Little K-mers (HULK), which can histosketch a typical 2 GB microbiome in 50 s on a standard laptop using four cores, with the sketch occupying 3000 bytes of disk space (https://github.com/will-rowe/hulk). BioMed Central 2019-03-16 /pmc/articles/PMC6420756/ /pubmed/30878035 http://dx.doi.org/10.1186/s40168-019-0653-2 Text en © The Author(s). 2019 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
title Streaming histogram sketching for rapid microbiome analytics
topic Methodology
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6420756/
https://www.ncbi.nlm.nih.gov/pubmed/30878035
http://dx.doi.org/10.1186/s40168-019-0653-2