Evaluating a variety of text-mined features for automatic protein function prediction with GOstruct
Most computational methods that predict protein function do not take advantage of the large amount of information contained in the biomedical literature. In this work we evaluate both ontology term co-mention and bag-of-words features mined from the biomedical literature and analyze their impact in...
Main Authors: | Funk, Christopher S; Kahanda, Indika; Ben-Hur, Asa; Verspoor, Karin M |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | BioMed Central 2015 |
Subjects: | |
Online Access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4441003/ https://www.ncbi.nlm.nih.gov/pubmed/26005564 http://dx.doi.org/10.1186/s13326-015-0006-4 |
_version_ | 1782372725921153024 |
---|---|
author | Funk, Christopher S Kahanda, Indika Ben-Hur, Asa Verspoor, Karin M |
author_sort | Funk, Christopher S |
collection | PubMed |
description | Most computational methods that predict protein function do not take advantage of the large amount of information contained in the biomedical literature. In this work we evaluate both ontology term co-mention and bag-of-words features mined from the biomedical literature and analyze their impact in the context of a structured output support vector machine model, GOstruct. We find that even simple literature based features are useful for predicting human protein function (F-max: Molecular Function =0.408, Biological Process =0.461, Cellular Component =0.608). One advantage of using literature features is their ability to offer easy verification of automated predictions. We find through manual inspection of misclassifications that some false positive predictions could be biologically valid predictions based upon support extracted from the literature. Additionally, we present a “medium-throughput” pipeline that was used to annotate a large subset of co-mentions; we suggest that this strategy could help to speed up the rate at which proteins are curated. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1186/s13326-015-0006-4) contains supplementary material, which is available to authorized users. |
format | Online Article Text |
id | pubmed-4441003 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2015 |
publisher | BioMed Central |
record_format | MEDLINE/PubMed |
spelling | pubmed-44410032015-05-23 Evaluating a variety of text-mined features for automatic protein function prediction with GOstruct Funk, Christopher S Kahanda, Indika Ben-Hur, Asa Verspoor, Karin M J Biomed Semantics Research Most computational methods that predict protein function do not take advantage of the large amount of information contained in the biomedical literature. In this work we evaluate both ontology term co-mention and bag-of-words features mined from the biomedical literature and analyze their impact in the context of a structured output support vector machine model, GOstruct. We find that even simple literature based features are useful for predicting human protein function (F-max: Molecular Function =0.408, Biological Process =0.461, Cellular Component =0.608). One advantage of using literature features is their ability to offer easy verification of automated predictions. We find through manual inspection of misclassifications that some false positive predictions could be biologically valid predictions based upon support extracted from the literature. Additionally, we present a “medium-throughput” pipeline that was used to annotate a large subset of co-mentions; we suggest that this strategy could help to speed up the rate at which proteins are curated. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1186/s13326-015-0006-4) contains supplementary material, which is available to authorized users. BioMed Central 2015-03-18 /pmc/articles/PMC4441003/ /pubmed/26005564 http://dx.doi.org/10.1186/s13326-015-0006-4 Text en © Funk et al.; licensee BioMed Central. 2015 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. 
The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated. |
title | Evaluating a variety of text-mined features for automatic protein function prediction with GOstruct |
topic | Research |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4441003/ https://www.ncbi.nlm.nih.gov/pubmed/26005564 http://dx.doi.org/10.1186/s13326-015-0006-4 |