A Statistical Approach to Correcting Cross-Annotations in a Metagenomic Functional Profile Generated by Short Reads
BACKGROUND: Categorizing protein coding sequences into one family, if the proteins they encode perform the same biochemical function, and then tabulating the relative abundances among all the families, is a widely-adopted practice for functional profiling of a metagenomic sample. By homology searchi...
Main authors: | Du, Ruofei; Mercante, Donald; An, Lingling; Fang, Zhide |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | 2014 |
Subjects: | |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5922784/ https://www.ncbi.nlm.nih.gov/pubmed/29710879 http://dx.doi.org/10.4172/2155-6180.1000208 |
_version_ | 1783318227369066496 |
---|---|
author | Du, Ruofei Mercante, Donald An, Lingling Fang, Zhide |
author_facet | Du, Ruofei Mercante, Donald An, Lingling Fang, Zhide |
author_sort | Du, Ruofei |
collection | PubMed |
description | BACKGROUND: Categorizing protein coding sequences into one family, if the proteins they encode perform the same biochemical function, and then tabulating the relative abundances among all the families, is a widely-adopted practice for functional profiling of a metagenomic sample. By homology searching of metagenomic sequencing reads against a protein database, the relative abundance of a family can be represented by the number of reads aligned to its members. However, it has been observed that some short reads generated by next-generation sequencing platforms are erroneously assigned to functional families with which they are not associated. This common phenomenon is termed cross-annotation. Current methods for functional profiling of a metagenomic sample either use empirical cutoff values to select alignments, ignoring the cross-annotation problem, or employ a summarized equation to make a simple adjustment. RESULT: By introducing latent variables, we use Probabilistic Latent Semantic Analysis to model the proportions of reads assigned to functional families in a metagenomic sample. The approach can be applied to a metagenomic sample once the list of true functional families has been obtained or estimated. It was applied to metagenomic samples functionally characterized with the Clusters of Orthologous Groups of proteins database, and it successfully addressed the cross-annotation issue on in vitro-simulated samples, samples simulated with bioinformatics tools, and a real-world dataset. CONCLUSIONS: Correcting cross-annotation will increase the accuracy of the functional profile of a metagenome generated from short reads. It will further benefit differential abundance analysis of metagenomic samples under different conditions. |
format | Online Article Text |
id | pubmed-5922784 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2014 |
record_format | MEDLINE/PubMed |
spelling | pubmed-5922784 2018-04-27 A Statistical Approach to Correcting Cross-Annotations in a Metagenomic Functional Profile Generated by Short Reads Du, Ruofei Mercante, Donald An, Lingling Fang, Zhide J Biom Biostat Article BACKGROUND: Categorizing protein coding sequences into one family, if the proteins they encode perform the same biochemical function, and then tabulating the relative abundances among all the families, is a widely-adopted practice for functional profiling of a metagenomic sample. By homology searching of metagenomic sequencing reads against a protein database, the relative abundance of a family can be represented by the number of reads aligned to its members. However, it has been observed that some short reads generated by next-generation sequencing platforms are erroneously assigned to functional families with which they are not associated. This common phenomenon is termed cross-annotation. Current methods for functional profiling of a metagenomic sample either use empirical cutoff values to select alignments, ignoring the cross-annotation problem, or employ a summarized equation to make a simple adjustment. RESULT: By introducing latent variables, we use Probabilistic Latent Semantic Analysis to model the proportions of reads assigned to functional families in a metagenomic sample. The approach can be applied to a metagenomic sample once the list of true functional families has been obtained or estimated. It was applied to metagenomic samples functionally characterized with the Clusters of Orthologous Groups of proteins database, and it successfully addressed the cross-annotation issue on in vitro-simulated samples, samples simulated with bioinformatics tools, and a real-world dataset. CONCLUSIONS: Correcting cross-annotation will increase the accuracy of the functional profile of a metagenome generated from short reads. It will further benefit differential abundance analysis of metagenomic samples under different conditions. 2014-11-10 2014 /pmc/articles/PMC5922784/ /pubmed/29710879 http://dx.doi.org/10.4172/2155-6180.1000208 Text en http://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. |
spellingShingle | Article Du, Ruofei Mercante, Donald An, Lingling Fang, Zhide A Statistical Approach to Correcting Cross-Annotations in a Metagenomic Functional Profile Generated by Short Reads |
title | A Statistical Approach to Correcting Cross-Annotations in a Metagenomic Functional Profile Generated by Short Reads |
title_full | A Statistical Approach to Correcting Cross-Annotations in a Metagenomic Functional Profile Generated by Short Reads |
title_fullStr | A Statistical Approach to Correcting Cross-Annotations in a Metagenomic Functional Profile Generated by Short Reads |
title_full_unstemmed | A Statistical Approach to Correcting Cross-Annotations in a Metagenomic Functional Profile Generated by Short Reads |
title_short | A Statistical Approach to Correcting Cross-Annotations in a Metagenomic Functional Profile Generated by Short Reads |
title_sort | statistical approach to correcting cross-annotations in a metagenomic functional profile generated by short reads |
topic | Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5922784/ https://www.ncbi.nlm.nih.gov/pubmed/29710879 http://dx.doi.org/10.4172/2155-6180.1000208 |