Detecting false positive sequence homology: a machine learning approach

BACKGROUND: Accurate detection of homologous relationships of biological sequences (DNA or amino acid) amongst organisms is an important and often difficult task that is essential to various evolutionary studies, ranging from building phylogenies to predicting functional gene annotations. There are...

Bibliographic Details
Main Authors:	Fujimoto, M. Stanley; Suvorov, Anton; Jensen, Nicholas O.; Clement, Mark J.; Bybee, Seth M.
Format:	Online Article Text
Language:	English
Published:	BioMed Central 2016
Subjects:	Methodology Article
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4765110/ https://www.ncbi.nlm.nih.gov/pubmed/26911862 http://dx.doi.org/10.1186/s12859-016-0955-3

author	Fujimoto, M. Stanley; Suvorov, Anton; Jensen, Nicholas O.; Clement, Mark J.; Bybee, Seth M.
collection	PubMed
description	BACKGROUND: Accurate detection of homologous relationships of biological sequences (DNA or amino acid) amongst organisms is an important and often difficult task that is essential to various evolutionary studies, ranging from building phylogenies to predicting functional gene annotations. There are many existing heuristic tools, most commonly based on bidirectional BLAST searches, that are used to identify homologous genes and combine them into two fundamentally distinct classes: orthologs and paralogs. Because they rely only on heuristic filtering based on significance score cutoffs and have no cluster post-processing tools available, these methods can often produce multiple clusters consisting of unrelated (non-homologous) sequences. Therefore, sequencing data extracted from incomplete genome/transcriptome assemblies that originate from low-coverage sequencing or are produced by de novo processes without a reference genome are susceptible to high false positive rates of homology detection. RESULTS: In this paper we develop biologically informative features that can be extracted from multiple sequence alignments of putative homologous genes (orthologs and paralogs) and further utilized in the context of guided experimentation to verify false positive outcomes. We demonstrate that our machine learning method, trained on both known homology clusters obtained from OrthoDB and randomly generated sequence alignments (non-homologs), successfully identifies apparent false positives inferred by heuristic algorithms, especially among proteomes recovered from low-coverage RNA-seq data. Approximately 42 % and 25 % of the putative homologies predicted by InParanoid and HaMStR, respectively, were classified as false positives on the experimental data set.
CONCLUSIONS: Our process increases the quality of output from other clustering algorithms by providing a novel post-processing method that is both fast and efficient at removing low-quality clusters of putative homologous genes recovered by heuristic-based approaches. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1186/s12859-016-0955-3) contains supplementary material, which is available to authorized users.
format	Online Article Text
id	pubmed-4765110
institution	National Center for Biotechnology Information
language	English
publishDate	2016
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-4765110 2016-02-25 BMC Bioinformatics Methodology Article BioMed Central 2016-02-24 /pmc/articles/PMC4765110/ /pubmed/26911862 http://dx.doi.org/10.1186/s12859-016-0955-3 Text en © Fujimoto et al. 2016 Open Access. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
title	Detecting false positive sequence homology: a machine learning approach
topic	Methodology Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4765110/ https://www.ncbi.nlm.nih.gov/pubmed/26911862 http://dx.doi.org/10.1186/s12859-016-0955-3
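
The abstract in this record describes extracting features from multiple sequence alignments of putative homologs and training a classifier on them. As a rough illustration only, the Python sketch below computes three simple alignment summaries: gap fraction, mean pairwise identity, and the fraction of fully conserved columns. The features themselves, the function name `alignment_features`, and the toy sequences are assumptions made for illustration. The record does not list the features the authors actually used, and no classifier is trained here.

```python
from itertools import combinations


def alignment_features(alignment):
    """Return simple summary features for a multiple sequence alignment.

    `alignment` is a list of equal-length strings that use '-' for gaps.
    These features are illustrative assumptions, not the paper's feature set.
    """
    if len(alignment) < 2:
        raise ValueError("need at least two aligned sequences")
    length = len(alignment[0])
    if length == 0 or any(len(seq) != length for seq in alignment):
        raise ValueError("sequences must be non-empty and of equal length")

    # Fraction of all alignment cells that are gaps.
    total_cells = length * len(alignment)
    gap_fraction = sum(seq.count("-") for seq in alignment) / total_cells

    # Mean pairwise identity over columns where both sequences have a residue.
    identities = []
    for a, b in combinations(alignment, 2):
        pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
        if pairs:
            identities.append(sum(x == y for x, y in pairs) / len(pairs))
    mean_identity = sum(identities) / len(identities) if identities else 0.0

    # Fraction of columns that are gap-free and hold a single residue type.
    conserved = sum(
        1 for col in zip(*alignment) if "-" not in col and len(set(col)) == 1
    )

    return {
        "gap_fraction": gap_fraction,
        "mean_pairwise_identity": mean_identity,
        "conserved_column_fraction": conserved / length,
    }


# Hypothetical toy alignments: one plausibly related cluster, one unrelated.
related = ["MKVLAAGIV", "MKVLSAGIV", "MKV-AAGIV"]
unrelated = ["MKVLAAGIV", "WQ--PTHED", "CYNRF-SQG"]

print(alignment_features(related))
print(alignment_features(unrelated))
```

In a pipeline like the one the abstract outlines, feature vectors of this kind would be computed for known homology clusters and for randomly generated alignments, and a classifier trained on the two groups would then score clusters produced by tools such as InParanoid or HaMStR.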
