Detecting false positive sequence homology: a machine learning approach
BACKGROUND: Accurate detection of homologous relationships of biological sequences (DNA or amino acid) amongst organisms is an important and often difficult task that is essential to various evolutionary studies, ranging from building phylogenies to predicting functional gene annotations. There are...
Main Authors: | Fujimoto, M. Stanley; Suvorov, Anton; Jensen, Nicholas O.; Clement, Mark J.; Bybee, Seth M. |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | BioMed Central 2016 |
Subjects: | Methodology Article |
Online Access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4765110/ https://www.ncbi.nlm.nih.gov/pubmed/26911862 http://dx.doi.org/10.1186/s12859-016-0955-3 |
_version_ | 1782417502098161664 |
---|---|
author | Fujimoto, M. Stanley Suvorov, Anton Jensen, Nicholas O. Clement, Mark J. Bybee, Seth M. |
author_sort | Fujimoto, M. Stanley |
collection | PubMed |
description | BACKGROUND: Accurate detection of homologous relationships of biological sequences (DNA or amino acid) amongst organisms is an important and often difficult task that is essential to various evolutionary studies, ranging from building phylogenies to predicting functional gene annotations. There are many existing heuristic tools, most commonly based on bidirectional BLAST searches that are used to identify homologous genes and combine them into two fundamentally distinct classes: orthologs and paralogs. Due to only using heuristic filtering based on significance score cutoffs and having no cluster post-processing tools available, these methods can often produce multiple clusters constituting unrelated (non-homologous) sequences. Therefore sequencing data extracted from incomplete genome/transcriptome assemblies originated from low coverage sequencing or produced by de novo processes without a reference genome are susceptible to high false positive rates of homology detection. RESULTS: In this paper we develop biologically informative features that can be extracted from multiple sequence alignments of putative homologous genes (orthologs and paralogs) and further utilized in context of guided experimentation to verify false positive outcomes. We demonstrate that our machine learning method trained on both known homology clusters obtained from OrthoDB and randomly generated sequence alignments (non-homologs), successfully determines apparent false positives inferred by heuristic algorithms especially among proteomes recovered from low-coverage RNA-seq data. Almost ~42 % and ~25 % of predicted putative homologies by InParanoid and HaMStR respectively were classified as false positives on experimental data set. 
CONCLUSIONS: Our process increases the quality of output from other clustering algorithms by providing a novel post-processing method that is both fast and efficient at removing low quality clusters of putative homologous genes recovered by heuristic-based approaches. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1186/s12859-016-0955-3) contains supplementary material, which is available to authorized users. |
format | Online Article Text |
id | pubmed-4765110 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2016 |
publisher | BioMed Central |
record_format | MEDLINE/PubMed |
spelling | pubmed-47651102016-02-25 Detecting false positive sequence homology: a machine learning approach Fujimoto, M. Stanley Suvorov, Anton Jensen, Nicholas O. Clement, Mark J. Bybee, Seth M. BMC Bioinformatics Methodology Article BACKGROUND: Accurate detection of homologous relationships of biological sequences (DNA or amino acid) amongst organisms is an important and often difficult task that is essential to various evolutionary studies, ranging from building phylogenies to predicting functional gene annotations. There are many existing heuristic tools, most commonly based on bidirectional BLAST searches that are used to identify homologous genes and combine them into two fundamentally distinct classes: orthologs and paralogs. Due to only using heuristic filtering based on significance score cutoffs and having no cluster post-processing tools available, these methods can often produce multiple clusters constituting unrelated (non-homologous) sequences. Therefore sequencing data extracted from incomplete genome/transcriptome assemblies originated from low coverage sequencing or produced by de novo processes without a reference genome are susceptible to high false positive rates of homology detection. RESULTS: In this paper we develop biologically informative features that can be extracted from multiple sequence alignments of putative homologous genes (orthologs and paralogs) and further utilized in context of guided experimentation to verify false positive outcomes. We demonstrate that our machine learning method trained on both known homology clusters obtained from OrthoDB and randomly generated sequence alignments (non-homologs), successfully determines apparent false positives inferred by heuristic algorithms especially among proteomes recovered from low-coverage RNA-seq data. Almost ~42 % and ~25 % of predicted putative homologies by InParanoid and HaMStR respectively were classified as false positives on experimental data set. 
CONCLUSIONS: Our process increases the quality of output from other clustering algorithms by providing a novel post-processing method that is both fast and efficient at removing low quality clusters of putative homologous genes recovered by heuristic-based approaches. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1186/s12859-016-0955-3) contains supplementary material, which is available to authorized users. BioMed Central 2016-02-24 /pmc/articles/PMC4765110/ /pubmed/26911862 http://dx.doi.org/10.1186/s12859-016-0955-3 Text en © Fujimoto et al. 2016 Open AccessThis article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated. |
title | Detecting false positive sequence homology: a machine learning approach |
topic | Methodology Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4765110/ https://www.ncbi.nlm.nih.gov/pubmed/26911862 http://dx.doi.org/10.1186/s12859-016-0955-3 |