Homology Induction: the use of machine learning to improve sequence similarity searches

BACKGROUND: The inference of homology between proteins is a key problem in molecular biology. The current best approaches only identify ~50% of homologies (with a false positive rate set at 1/1000). RESULTS: We present Homology Induction (HI), a new approach to inferring homology. HI uses machine lea...

Bibliographic Details
Main Authors:	Karwath, Andreas, King, Ross D
Format:	Text
Language:	English
Published:	BioMed Central 2002
Subjects:	Methodology article
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC107726/ https://www.ncbi.nlm.nih.gov/pubmed/11972320 http://dx.doi.org/10.1186/1471-2105-3-11

author	Karwath, Andreas King, Ross D
collection	PubMed
description	BACKGROUND: The inference of homology between proteins is a key problem in molecular biology. The current best approaches only identify ~50% of homologies (with a false positive rate set at 1/1000). RESULTS: We present Homology Induction (HI), a new approach to inferring homology. HI uses machine learning to bootstrap from standard sequence similarity search methods. First a standard method is run; then HI learns rules which are true for sequences of high similarity to the target (assumed homologues) and not true for general sequences. These rules are then used to discriminate sequences in the twilight zone. To learn the rules, HI describes the sequences in a novel way based on a bioinformatic knowledge base, and uses the machine learning method of inductive logic programming. To evaluate HI we used the PDB40D benchmark, which lists sequences of known homology but low sequence similarity. We compared the HI methodology with PSI-BLAST alone and found HI performed significantly better. In addition, Receiver Operating Characteristic (ROC) curve analysis showed that these improvements were robust for all reasonable error costs. The predictive homology rules learnt by HI can be interpreted biologically to provide insight into conserved features of homologous protein families. CONCLUSIONS: HI is a new technique for the detection of remote protein homology, a central bioinformatic problem. HI with PSI-BLAST is shown to outperform PSI-BLAST for all error costs. It is expected that similar improvements would be obtained using HI with any sequence similarity method.
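The bootstrapping idea in the description above can be illustrated with a small toy sketch. It is not the authors' implementation: the paper uses inductive logic programming over a bioinformatic knowledge base, whereas this sketch swaps in a much simpler single-feature frequency filter. The `Hit` class, the feature names, and all thresholds (`strict`, `loose`, `min_pos`, `max_neg`) are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """One hit returned by a similarity search (hypothetical structure)."""
    seq_id: str
    e_value: float        # score reported by the search method
    features: frozenset   # annotations drawn from a knowledge base


def split_hits(hits, strict=1e-6, loose=10.0):
    """Split hits into assumed homologues and twilight-zone candidates.

    The thresholds are illustrative assumptions, not values from the paper.
    """
    assumed = [h for h in hits if h.e_value <= strict]
    twilight = [h for h in hits if strict < h.e_value <= loose]
    return assumed, twilight


def learn_rules(positives, background, min_pos=0.8, max_neg=0.2):
    """Keep features common in assumed homologues and rare in general sequences.

    `positives` and `background` are lists of feature sets. This frequency
    filter is a stand-in for inductive logic programming.
    """
    if not positives or not background:
        return set()
    candidates = set().union(*positives)
    rules = set()
    for feat in candidates:
        pos_rate = sum(feat in f for f in positives) / len(positives)
        neg_rate = sum(feat in f for f in background) / len(background)
        if pos_rate >= min_pos and neg_rate <= max_neg:
            rules.add(feat)
    return rules


def classify(twilight, rules):
    """Return ids of twilight-zone hits that satisfy at least one learnt rule."""
    return [h.seq_id for h in twilight if rules & h.features]
```

In use, one would call `split_hits` on the search output, pass the feature sets of the assumed homologues and of a background sample to `learn_rules`, and then call `classify` on the twilight-zone hits. A real system would need relational rules and a proper evaluation, for example the ROC analysis the description mentions.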
format	Text
id	pubmed-107726
institution	National Center for Biotechnology Information
language	English
publishDate	2002
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-107726 2002-05-09 Homology Induction: the use of machine learning to improve sequence similarity searches Karwath, Andreas King, Ross D BMC Bioinformatics Methodology article BioMed Central 2002-04-23 /pmc/articles/PMC107726/ /pubmed/11972320 http://dx.doi.org/10.1186/1471-2105-3-11 Text en Copyright ©2002 Karwath and King; licensee BioMed Central Ltd. This is an Open Access article: verbatim copying and redistribution of this article are permitted in all media for any purpose, provided this notice is preserved along with the article's original URL.
title	Homology Induction: the use of machine learning to improve sequence similarity searches
topic	Methodology article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC107726/ https://www.ncbi.nlm.nih.gov/pubmed/11972320 http://dx.doi.org/10.1186/1471-2105-3-11

Similar Items