
Metagenomic Geolocation Using Read Signatures

We present a novel approach to the Metagenomic Geolocation Challenge based on random projection of the sample reads from each location. This approach explores the direct use of k-mer composition to characterise samples so that we can avoid the computationally demanding step of aligning reads to avai...


Bibliographic Details
Main Authors: Chappell, Timothy; Geva, Shlomo; Hogan, James M.; Lovell, David; Trotman, Andrew; Perrin, Dimitri
Format: Online Article Text
Language: English
Published: Frontiers Media S.A. 2022
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8918732/
https://www.ncbi.nlm.nih.gov/pubmed/35295949
http://dx.doi.org/10.3389/fgene.2022.643592
_version_ 1784668795257749504
author Chappell, Timothy
Geva, Shlomo
Hogan, James M.
Lovell, David
Trotman, Andrew
Perrin, Dimitri
collection PubMed
description We present a novel approach to the Metagenomic Geolocation Challenge based on random projection of the sample reads from each location. This approach explores the direct use of k-mer composition to characterise samples so that we can avoid the computationally demanding step of aligning reads to available microbial reference sequences. Each variable-length read is converted into a fixed-length, k-mer-based read signature. Read signatures are then clustered into location signatures which provide a more compact characterisation of the reads at each location. Classification is then treated as a problem in ranked retrieval of locations, where signature similarity is used as a measure of similarity in microbial composition. We evaluate our approach using the CAMDA 2020 Challenge dataset and obtain promising results based on nearest neighbour classification. The main findings of this study are that k-mer representations carry sufficient information to reveal the origin of many of the CAMDA 2020 Challenge metagenomic samples, and that this reference-free approach can be achieved with much less computation than methods that need reads to be assigned to operational taxonomic units—advantages which become clear through comparison to previously published work on the CAMDA 2019 Challenge data.
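The pipeline outlined in the description (k-mer-based read signatures, location signatures, ranked retrieval) can be illustrated with a minimal sketch. This is not the authors' implementation: the signature width, the k-mer length, the hash-seeded random vectors, the L1 distance, and all function names are assumptions chosen for illustration, and per-bit averaging stands in for the clustering step the paper describes.

```python
import hashlib
import random

SIG_BITS = 64  # signature width (assumed, not taken from the paper)
K = 5          # k-mer length (assumed, not taken from the paper)


def kmer_vector(kmer, bits=SIG_BITS):
    """Return a deterministic pseudo-random +1/-1 vector for one k-mer."""
    seed = int.from_bytes(hashlib.sha256(kmer.encode()).digest()[:8], "big")
    rng = random.Random(seed)
    return [1 if rng.random() < 0.5 else -1 for _ in range(bits)]


def read_signature(read, k=K, bits=SIG_BITS):
    """Collapse a variable-length read into a fixed-length binary signature."""
    totals = [0] * bits
    for i in range(len(read) - k + 1):
        for j, value in enumerate(kmer_vector(read[i:i + k], bits)):
            totals[j] += value
    # Keep only the sign of each accumulated component.
    return [1 if t > 0 else 0 for t in totals]


def location_signature(reads, k=K, bits=SIG_BITS):
    """Per-bit mean of read signatures (a simple stand-in for clustering)."""
    sigs = [read_signature(r, k, bits) for r in reads]
    return [sum(column) / len(sigs) for column in zip(*sigs)]


def rank_locations(query_reads, locations):
    """Rank known locations by L1 distance to the query sample, nearest first."""
    query = location_signature(query_reads)

    def distance(name):
        return sum(abs(a - b) for a, b in zip(query, locations[name]))

    return sorted(locations, key=distance)


# Toy usage with made-up reads and hypothetical location names.
site_a_reads = ["ACGTACGTACGTACGT", "ACGTTGCAACGTTGCA"]
site_b_reads = ["TTTTGGGGCCCCAAAA", "GGGGTTTTAAAACCCC"]
known = {
    "site_a": location_signature(site_a_reads),
    "site_b": location_signature(site_b_reads),
}
ranking = rank_locations(site_a_reads, known)
```

The first entry of `ranking` is the nearest-neighbour prediction. No reads are aligned to reference sequences at any point, which is the property the description highlights.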
format Online
Article
Text
id pubmed-8918732
institution National Center for Biotechnology Information
language English
publishDate 2022
publisher Frontiers Media S.A.
record_format MEDLINE/PubMed
spelling pubmed-8918732 2022-03-15 Metagenomic Geolocation Using Read Signatures Chappell, Timothy; Geva, Shlomo; Hogan, James M.; Lovell, David; Trotman, Andrew; Perrin, Dimitri Front Genet Genetics Frontiers Media S.A. 2022-02-28 /pmc/articles/PMC8918732/ /pubmed/35295949 http://dx.doi.org/10.3389/fgene.2022.643592 Text en Copyright © 2022 Chappell, Geva, Hogan, Lovell, Trotman and Perrin. https://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY).
The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
title Metagenomic Geolocation Using Read Signatures
topic Genetics