
A Statistical Approach to Correcting Cross-Annotations in a Metagenomic Functional Profile Generated by Short Reads

BACKGROUND: Categorizing protein coding sequences into one family, if the proteins they encode perform the same biochemical function, and then tabulating the relative abundances among all the families, is a widely-adopted practice for functional profiling of a metagenomic sample. By homology searchi...

Bibliographic Details
Main authors: Du, Ruofei; Mercante, Donald; An, Lingling; Fang, Zhide
Format: Online Article Text
Language: English
Published: 2014
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5922784/
https://www.ncbi.nlm.nih.gov/pubmed/29710879
http://dx.doi.org/10.4172/2155-6180.1000208
author Du, Ruofei
Mercante, Donald
An, Lingling
Fang, Zhide
collection PubMed
description BACKGROUND: Grouping protein-coding sequences into a family when the proteins they encode perform the same biochemical function, and then tabulating the relative abundances of all families, is a widely adopted practice for functional profiling of a metagenomic sample. When metagenomic sequencing reads are searched for homology against a protein database, the relative abundance of a family can be represented by the number of reads aligned to its members. However, some short reads generated by next-generation sequencing platforms are erroneously assigned to functional families with which they are not associated. This common phenomenon is termed cross-annotation. Current methods for functional profiling of a metagenomic sample either use empirical cutoff values to select alignments and ignore the cross-annotation problem, or employ a summary equation to make a simple adjustment. RESULT: By introducing latent variables, we use Probabilistic Latent Semantic Analysis to model the proportions of reads assigned to functional families in a metagenomic sample. The approach can be applied to a metagenomic sample once the list of true functional families has been obtained or estimated. It was applied to metagenomic samples functionally characterized with the Clusters of Orthologous Groups of proteins database, and it successfully addressed the cross-annotation issue on in vitro-simulated metagenomic samples, on samples simulated with bioinformatics tools, and on a real-world data set. CONCLUSIONS: Correcting cross-annotation will increase the accuracy of the functional profile of a metagenome generated by short reads. It will further benefit differential abundance analysis of metagenomic samples under different conditions.
format Online
Article
Text
id pubmed-5922784
institution National Center for Biotechnology Information
language English
publishDate 2014
record_format MEDLINE/PubMed
spelling pubmed-5922784 2018-04-27 A Statistical Approach to Correcting Cross-Annotations in a Metagenomic Functional Profile Generated by Short Reads. Du, Ruofei; Mercante, Donald; An, Lingling; Fang, Zhide. J Biom Biostat. Article. 2014-11-10 2014 /pmc/articles/PMC5922784/ /pubmed/29710879 http://dx.doi.org/10.4172/2155-6180.1000208 Text en http://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
title A Statistical Approach to Correcting Cross-Annotations in a Metagenomic Functional Profile Generated by Short Reads
topic Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5922784/
https://www.ncbi.nlm.nih.gov/pubmed/29710879
http://dx.doi.org/10.4172/2155-6180.1000208