Higher recall in metagenomic sequence classification exploiting overlapping reads
BACKGROUND: In recent years, several different fields, such as ecology, medicine and microbiology, have experienced an unprecedented development due to the possibility of direct sequencing of microbiomic samples. Among the problems that researchers in the field have to deal with, taxonomic classificatio...
Main authors: | Girotto, Samuele; Comin, Matteo; Pizzi, Cinzia |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | BioMed Central 2017 |
Subjects: | Research |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5731601/ https://www.ncbi.nlm.nih.gov/pubmed/29244002 http://dx.doi.org/10.1186/s12864-017-4273-6 |
_version_ | 1783286533473697792 |
---|---|
author | Girotto, Samuele Comin, Matteo Pizzi, Cinzia |
author_facet | Girotto, Samuele Comin, Matteo Pizzi, Cinzia |
author_sort | Girotto, Samuele |
collection | PubMed |
description | BACKGROUND: In recent years, several different fields, such as ecology, medicine and microbiology, have experienced an unprecedented development due to the possibility of direct sequencing of microbiomic samples. Among the problems that researchers in the field have to deal with, taxonomic classification of metagenomic reads is one of the most challenging. State-of-the-art methods classify single reads with almost 100% precision. However, very often, the performance in terms of recall falls to about 50%. As a consequence, state-of-the-art methods are capable of correctly classifying only half of the reads in the sample. How to achieve better performance in terms of overall quality of classification remains a largely unsolved problem. RESULTS: In this paper we propose a method for metagenomics CLassification Improvement with Overlapping Reads (CLIOR) that exploits the information carried by the overlapping reads graph of the input read dataset to improve recall, f-measure, and the estimated abundance of species. In this work, we applied CLIOR on top of the classification produced by the classifier Clark-l. Experiments on simulated and synthetic metagenomes show that CLIOR can lead to a substantial improvement of the recall rate, sometimes doubling it. On average, on simulated datasets, the increase of recall is paired with a higher precision too, while on synthetic datasets it comes at the expense of a small loss of precision. In experiments on real metagenomes, CLIOR is able to assign many more reads while keeping the abundance ratios in line with previous studies. CONCLUSIONS: Our results showed that with CLIOR it is possible to boost the recall of a state-of-the-art metagenomic classifier by inferring and/or correcting the assignment of reads with missing or erroneous labeling. CLIOR is not restricted to the read classification algorithm used in our experiments, but it may be applied to other methods too. Finally, CLIOR does not need large computational resources, and it can be run on a laptop. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1186/s12864-017-4273-6) contains supplementary material, which is available to authorized users. |
format | Online Article Text |
id | pubmed-5731601 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2017 |
publisher | BioMed Central |
record_format | MEDLINE/PubMed |
spelling | pubmed-5731601 2017-12-19 Higher recall in metagenomic sequence classification exploiting overlapping reads Girotto, Samuele Comin, Matteo Pizzi, Cinzia BMC Genomics Research BACKGROUND: In recent years, several different fields, such as ecology, medicine and microbiology, have experienced an unprecedented development due to the possibility of direct sequencing of microbiomic samples. Among the problems that researchers in the field have to deal with, taxonomic classification of metagenomic reads is one of the most challenging. State-of-the-art methods classify single reads with almost 100% precision. However, very often, the performance in terms of recall falls to about 50%. As a consequence, state-of-the-art methods are capable of correctly classifying only half of the reads in the sample. How to achieve better performance in terms of overall quality of classification remains a largely unsolved problem. RESULTS: In this paper we propose a method for metagenomics CLassification Improvement with Overlapping Reads (CLIOR) that exploits the information carried by the overlapping reads graph of the input read dataset to improve recall, f-measure, and the estimated abundance of species. In this work, we applied CLIOR on top of the classification produced by the classifier Clark-l. Experiments on simulated and synthetic metagenomes show that CLIOR can lead to a substantial improvement of the recall rate, sometimes doubling it. On average, on simulated datasets, the increase of recall is paired with a higher precision too, while on synthetic datasets it comes at the expense of a small loss of precision. In experiments on real metagenomes, CLIOR is able to assign many more reads while keeping the abundance ratios in line with previous studies. CONCLUSIONS: Our results showed that with CLIOR it is possible to boost the recall of a state-of-the-art metagenomic classifier by inferring and/or correcting the assignment of reads with missing or erroneous labeling. CLIOR is not restricted to the read classification algorithm used in our experiments, but it may be applied to other methods too. Finally, CLIOR does not need large computational resources, and it can be run on a laptop. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1186/s12864-017-4273-6) contains supplementary material, which is available to authorized users. BioMed Central 2017-12-06 /pmc/articles/PMC5731601/ /pubmed/29244002 http://dx.doi.org/10.1186/s12864-017-4273-6 Text en © The Author(s) 2017 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated. |
spellingShingle | Research Girotto, Samuele Comin, Matteo Pizzi, Cinzia Higher recall in metagenomic sequence classification exploiting overlapping reads |
title | Higher recall in metagenomic sequence classification exploiting overlapping reads |
title_full | Higher recall in metagenomic sequence classification exploiting overlapping reads |
title_fullStr | Higher recall in metagenomic sequence classification exploiting overlapping reads |
title_full_unstemmed | Higher recall in metagenomic sequence classification exploiting overlapping reads |
title_short | Higher recall in metagenomic sequence classification exploiting overlapping reads |
title_sort | higher recall in metagenomic sequence classification exploiting overlapping reads |
topic | Research |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5731601/ https://www.ncbi.nlm.nih.gov/pubmed/29244002 http://dx.doi.org/10.1186/s12864-017-4273-6 |
work_keys_str_mv | AT girottosamuele higherrecallinmetagenomicsequenceclassificationexploitingoverlappingreads AT cominmatteo higherrecallinmetagenomicsequenceclassificationexploitingoverlappingreads AT pizzicinzia higherrecallinmetagenomicsequenceclassificationexploitingoverlappingreads |
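
Note on the abstract: the description in this record contrasts near-100% precision with roughly 50% recall, and says recall can be raised by using overlaps between reads to infer missing labels. The sketch below only illustrates that idea on toy data; it is not the CLIOR algorithm. The function names, the read identifiers, the single-pass majority vote over overlapping neighbours, and the definition of recall as correctly labeled reads over all reads are assumptions made for this example.

```python
from collections import Counter


def precision_recall_f1(true_labels, predicted):
    """Score a classification; `predicted` maps read id -> label or None (unassigned).

    Assumed definitions: precision = correct / assigned reads,
    recall = correct / all reads, f-measure = their harmonic mean.
    """
    assigned = [r for r, label in predicted.items() if label is not None]
    correct = sum(1 for r in assigned if predicted[r] == true_labels[r])
    precision = correct / len(assigned) if assigned else 0.0
    recall = correct / len(true_labels) if true_labels else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def propagate_labels(predicted, overlaps):
    """Hypothetical one-pass step: give each unassigned read the majority
    label among the already-labeled reads it overlaps with."""
    neighbours = {}
    for a, b in overlaps:
        neighbours.setdefault(a, []).append(b)
        neighbours.setdefault(b, []).append(a)

    updated = dict(predicted)
    for read, label in predicted.items():
        if label is not None:
            continue  # keep labels the base classifier already assigned
        votes = Counter(
            predicted[n]
            for n in neighbours.get(read, [])
            if predicted.get(n) is not None
        )
        if votes:
            updated[read] = votes.most_common(1)[0][0]
    return updated


# Toy data (invented): four reads from two species, half left unassigned.
true_labels = {"r1": "A", "r2": "A", "r3": "B", "r4": "B"}
predicted = {"r1": "A", "r2": None, "r3": "B", "r4": None}
overlaps = [("r1", "r2"), ("r3", "r4")]  # edges of the overlap graph

before = precision_recall_f1(true_labels, predicted)
after = precision_recall_f1(true_labels, propagate_labels(predicted, overlaps))
print("before:", before)
print("after: ", after)
```

On this toy input the base labeling has perfect precision but labels only half of the reads; after the vote every read carries a label. Real overlap graphs contain spurious edges between reads of different species, which is one way such a step could trade some precision for recall.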