
Higher recall in metagenomic sequence classification exploiting overlapping reads

BACKGROUND: In recent years several different fields, such as ecology, medicine and microbiology, have experienced an unprecedented development due to the possibility of direct sequencing of microbiomic samples. Among problems that researchers in the field have to deal with, taxonomic classificatio...

Bibliographic Details
Main Authors: Girotto, Samuele; Comin, Matteo; Pizzi, Cinzia
Format: Online Article Text
Language: English
Published: BioMed Central 2017
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5731601/
https://www.ncbi.nlm.nih.gov/pubmed/29244002
http://dx.doi.org/10.1186/s12864-017-4273-6
author Girotto, Samuele
Comin, Matteo
Pizzi, Cinzia
collection PubMed
description BACKGROUND: In recent years several different fields, such as ecology, medicine and microbiology, have experienced an unprecedented development due to the possibility of direct sequencing of microbiomic samples. Among the problems that researchers in the field have to deal with, taxonomic classification of metagenomic reads is one of the most challenging. State-of-the-art methods classify single reads with almost 100% precision. However, very often, the performance in terms of recall drops to about 50%. As a consequence, state-of-the-art methods can correctly classify only about half of the reads in the sample. How to achieve better performance in terms of overall quality of classification remains a largely unsolved problem. RESULTS: In this paper we propose a method for metagenomics CLassification Improvement with Overlapping Reads (CLIOR), which exploits the information carried by the overlapping reads graph of the input read dataset to improve recall, f-measure, and the estimated abundance of species. In this work, we applied CLIOR on top of the classification produced by the classifier Clark-l. Experiments on simulated and synthetic metagenomes show that CLIOR can lead to a substantial improvement of the recall rate, sometimes doubling it. On average, on simulated datasets, the increase in recall is paired with higher precision too, while on synthetic datasets it comes at the expense of a small loss of precision. In experiments on real metagenomes, CLIOR is able to assign many more reads while keeping the abundance ratios in line with previous studies. CONCLUSIONS: Our results showed that with CLIOR it is possible to boost the recall of a state-of-the-art metagenomic classifier by inferring and/or correcting the assignment of reads with missing or erroneous labeling. CLIOR is not restricted to the read classification algorithm used in our experiments, but it may be applied to other methods too. Finally, CLIOR does not need large computational resources, and it can be run on a laptop. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1186/s12864-017-4273-6) contains supplementary material, which is available to authorized users.
format Online
Article
Text
id pubmed-5731601
institution National Center for Biotechnology Information
language English
publishDate 2017
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-5731601 2017-12-19 Higher recall in metagenomic sequence classification exploiting overlapping reads Girotto, Samuele Comin, Matteo Pizzi, Cinzia BMC Genomics Research BioMed Central 2017-12-06 /pmc/articles/PMC5731601/ /pubmed/29244002 http://dx.doi.org/10.1186/s12864-017-4273-6 Text en © The Author(s) 2017 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
title Higher recall in metagenomic sequence classification exploiting overlapping reads
topic Research