Metagenomic Geolocation Using Read Signatures
We present a novel approach to the Metagenomic Geolocation Challenge based on random projection of the sample reads from each location. This approach explores the direct use of k-mer composition to characterise samples so that we can avoid the computationally demanding step of aligning reads to avai...
Main authors: | Chappell, Timothy; Geva, Shlomo; Hogan, James M.; Lovell, David; Trotman, Andrew; Perrin, Dimitri |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Frontiers Media S.A. 2022 |
Subjects: | Genetics |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8918732/ https://www.ncbi.nlm.nih.gov/pubmed/35295949 http://dx.doi.org/10.3389/fgene.2022.643592 |
_version_ | 1784668795257749504 |
---|---|
author | Chappell, Timothy Geva, Shlomo Hogan, James M. Lovell, David Trotman, Andrew Perrin, Dimitri |
author_facet | Chappell, Timothy Geva, Shlomo Hogan, James M. Lovell, David Trotman, Andrew Perrin, Dimitri |
author_sort | Chappell, Timothy |
collection | PubMed |
description | We present a novel approach to the Metagenomic Geolocation Challenge based on random projection of the sample reads from each location. This approach explores the direct use of k-mer composition to characterise samples so that we can avoid the computationally demanding step of aligning reads to available microbial reference sequences. Each variable-length read is converted into a fixed-length, k-mer-based read signature. Read signatures are then clustered into location signatures which provide a more compact characterisation of the reads at each location. Classification is then treated as a problem in ranked retrieval of locations, where signature similarity is used as a measure of similarity in microbial composition. We evaluate our approach using the CAMDA 2020 Challenge dataset and obtain promising results based on nearest neighbour classification. The main findings of this study are that k-mer representations carry sufficient information to reveal the origin of many of the CAMDA 2020 Challenge metagenomic samples, and that this reference-free approach can be achieved with much less computation than methods that need reads to be assigned to operational taxonomic units—advantages which become clear through comparison to previously published work on the CAMDA 2019 Challenge data. |
format | Online Article Text |
id | pubmed-8918732 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2022 |
publisher | Frontiers Media S.A. |
record_format | MEDLINE/PubMed |
spelling | pubmed-8918732 2022-03-15 Metagenomic Geolocation Using Read Signatures Chappell, Timothy Geva, Shlomo Hogan, James M. Lovell, David Trotman, Andrew Perrin, Dimitri Front Genet Genetics We present a novel approach to the Metagenomic Geolocation Challenge based on random projection of the sample reads from each location. This approach explores the direct use of k-mer composition to characterise samples so that we can avoid the computationally demanding step of aligning reads to available microbial reference sequences. Each variable-length read is converted into a fixed-length, k-mer-based read signature. Read signatures are then clustered into location signatures which provide a more compact characterisation of the reads at each location. Classification is then treated as a problem in ranked retrieval of locations, where signature similarity is used as a measure of similarity in microbial composition. We evaluate our approach using the CAMDA 2020 Challenge dataset and obtain promising results based on nearest neighbour classification. The main findings of this study are that k-mer representations carry sufficient information to reveal the origin of many of the CAMDA 2020 Challenge metagenomic samples, and that this reference-free approach can be achieved with much less computation than methods that need reads to be assigned to operational taxonomic units—advantages which become clear through comparison to previously published work on the CAMDA 2019 Challenge data. Frontiers Media S.A. 2022-02-28 /pmc/articles/PMC8918732/ /pubmed/35295949 http://dx.doi.org/10.3389/fgene.2022.643592 Text en Copyright © 2022 Chappell, Geva, Hogan, Lovell, Trotman and Perrin. https://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY).
The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms. |
spellingShingle | Genetics Chappell, Timothy Geva, Shlomo Hogan, James M. Lovell, David Trotman, Andrew Perrin, Dimitri Metagenomic Geolocation Using Read Signatures |
title | Metagenomic Geolocation Using Read Signatures |
title_full | Metagenomic Geolocation Using Read Signatures |
title_fullStr | Metagenomic Geolocation Using Read Signatures |
title_full_unstemmed | Metagenomic Geolocation Using Read Signatures |
title_short | Metagenomic Geolocation Using Read Signatures |
title_sort | metagenomic geolocation using read signatures |
topic | Genetics |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8918732/ https://www.ncbi.nlm.nih.gov/pubmed/35295949 http://dx.doi.org/10.3389/fgene.2022.643592 |
work_keys_str_mv | AT chappelltimothy metagenomicgeolocationusingreadsignatures AT gevashlomo metagenomicgeolocationusingreadsignatures AT hoganjamesm metagenomicgeolocationusingreadsignatures AT lovelldavid metagenomicgeolocationusingreadsignatures AT trotmanandrew metagenomicgeolocationusingreadsignatures AT perrindimitri metagenomicgeolocationusingreadsignatures |
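
The description field above outlines a pipeline: each read becomes a fixed-length k-mer-based signature via random projection, and locations are then ranked by signature similarity. The following Python sketch illustrates that general idea only; it is not the authors' implementation. The k-mer length `K`, the signature width `WIDTH`, the SHA-256 seeding, the Hamming similarity measure, and all function names are assumptions chosen for illustration, and the clustering of read signatures into location signatures is skipped: each location is simply represented by a list of read signatures.

```python
import hashlib
import random

K = 5        # k-mer length (assumed value, for illustration only)
WIDTH = 64   # signature width in bits (assumed value)


def kmer_vector(kmer, width=WIDTH):
    """Map a k-mer to a reproducible pseudo-random vector of +1/-1 values."""
    seed = int.from_bytes(hashlib.sha256(kmer.encode()).digest()[:8], "big")
    rng = random.Random(seed)
    return [1 if rng.random() < 0.5 else -1 for _ in range(width)]


def read_signature(read, k=K, width=WIDTH):
    """Sum the vectors of all k-mers in a read, then keep one bit per position."""
    totals = [0] * width
    for i in range(len(read) - k + 1):
        for j, value in enumerate(kmer_vector(read[i:i + k], width)):
            totals[j] += value
    return [1 if total > 0 else 0 for total in totals]


def hamming_similarity(a, b):
    """Fraction of positions at which two equal-length signatures agree."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def rank_locations(query, location_signatures):
    """Order location names by their best-matching signature, highest first."""
    scores = {
        name: max(hamming_similarity(query, sig) for sig in sigs)
        for name, sigs in location_signatures.items()
    }
    return sorted(scores, key=scores.get, reverse=True)


if __name__ == "__main__":
    # Toy reads with hypothetical location labels; not CAMDA data.
    locations = {
        "site_A": [read_signature("ACGTACGTTGCAACGTAGCT")],
        "site_B": [read_signature("TTTTGGGGCCCCAAAATTTT")],
    }
    query = read_signature("ACGTACGTTGCAACGTAGCT")
    print(rank_locations(query, locations))
```

Because reads of any length map to a `WIDTH`-bit signature, comparing samples reduces to comparing fixed-length bit vectors, which is what makes a reference-free, alignment-free comparison cheap in this kind of scheme.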