Detecting false positive sequence homology: a machine learning approach

BACKGROUND: Accurate detection of homologous relationships of biological sequences (DNA or amino acid) amongst organisms is an important and often difficult task that is essential to various evolutionary studies, ranging from building phylogenies to predicting functional gene annotations. There are...

Bibliographic Details
Main authors: Fujimoto, M. Stanley; Suvorov, Anton; Jensen, Nicholas O.; Clement, Mark J.; Bybee, Seth M.
Format: Online Article Text
Language: English
Published: BioMed Central 2016
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4765110/
https://www.ncbi.nlm.nih.gov/pubmed/26911862
http://dx.doi.org/10.1186/s12859-016-0955-3
author Fujimoto, M. Stanley
Suvorov, Anton
Jensen, Nicholas O.
Clement, Mark J.
Bybee, Seth M.
author_sort Fujimoto, M. Stanley
collection PubMed
description BACKGROUND: Accurate detection of homologous relationships among biological sequences (DNA or amino acid) across organisms is an important and often difficult task that is essential to various evolutionary studies, ranging from building phylogenies to predicting functional gene annotations. Many heuristic tools exist, most commonly based on bidirectional BLAST searches, that identify homologous genes and assign them to two fundamentally distinct classes: orthologs and paralogs. Because these methods rely only on heuristic filtering based on significance score cutoffs and have no cluster post-processing tools available, they can often produce multiple clusters consisting of unrelated (non-homologous) sequences. Sequencing data extracted from incomplete genome/transcriptome assemblies, whether originating from low-coverage sequencing or produced by de novo processes without a reference genome, are therefore susceptible to high false positive rates of homology detection. RESULTS: In this paper we develop biologically informative features that can be extracted from multiple sequence alignments of putative homologous genes (orthologs and paralogs) and further used in the context of guided experimentation to verify false positive outcomes. We demonstrate that our machine learning method, trained on both known homology clusters obtained from OrthoDB and randomly generated sequence alignments (non-homologs), successfully identifies apparent false positives inferred by heuristic algorithms, especially among proteomes recovered from low-coverage RNA-seq data. Approximately 42 % and 25 % of the putative homologies predicted by InParanoid and HaMStR, respectively, were classified as false positives on an experimental data set.
CONCLUSIONS: Our process improves the quality of output from other clustering algorithms by providing a novel post-processing method that is both fast and efficient at removing low-quality clusters of putative homologous genes recovered by heuristic-based approaches. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1186/s12859-016-0955-3) contains supplementary material, which is available to authorized users.
format Online
Article
Text
id pubmed-4765110
institution National Center for Biotechnology Information
language English
publishDate 2016
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-4765110 2016-02-25 Detecting false positive sequence homology: a machine learning approach Fujimoto, M. Stanley Suvorov, Anton Jensen, Nicholas O. Clement, Mark J. Bybee, Seth M. BMC Bioinformatics Methodology Article BioMed Central 2016-02-24 /pmc/articles/PMC4765110/ /pubmed/26911862 http://dx.doi.org/10.1186/s12859-016-0955-3 Text en © Fujimoto et al. 2016 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
title Detecting false positive sequence homology: a machine learning approach
topic Methodology Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4765110/
https://www.ncbi.nlm.nih.gov/pubmed/26911862
http://dx.doi.org/10.1186/s12859-016-0955-3