LocText: relation extraction of protein localizations to assist database curation

BACKGROUND: The subcellular localization of a protein is an important aspect of its function. However, the experimental annotation of locations is not even complete for well-studied model organisms. Text mining might aid database curators to add experimental annotations from the scientific literatur...

Bibliographic Details
Main authors: Cejuela, Juan Miguel; Vinchurkar, Shrikant; Goldberg, Tatyana; Prabhu Shankar, Madhukar Sollepura; Baghudana, Ashish; Bojchevski, Aleksandar; Uhlig, Carsten; Ofner, André; Raharja-Liu, Pandu; Jensen, Lars Juhl; Rost, Burkhard
Format: Online Article Text
Language: English
Published: BioMed Central 2018
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5773052/
https://www.ncbi.nlm.nih.gov/pubmed/29343218
http://dx.doi.org/10.1186/s12859-018-2021-9
_version_ 1783293499701985280
author Cejuela, Juan Miguel
Vinchurkar, Shrikant
Goldberg, Tatyana
Prabhu Shankar, Madhukar Sollepura
Baghudana, Ashish
Bojchevski, Aleksandar
Uhlig, Carsten
Ofner, André
Raharja-Liu, Pandu
Jensen, Lars Juhl
Rost, Burkhard
author_facet Cejuela, Juan Miguel
Vinchurkar, Shrikant
Goldberg, Tatyana
Prabhu Shankar, Madhukar Sollepura
Baghudana, Ashish
Bojchevski, Aleksandar
Uhlig, Carsten
Ofner, André
Raharja-Liu, Pandu
Jensen, Lars Juhl
Rost, Burkhard
author_sort Cejuela, Juan Miguel
collection PubMed
description BACKGROUND: The subcellular localization of a protein is an important aspect of its function. However, the experimental annotation of locations is not even complete for well-studied model organisms. Text mining might help database curators add experimental annotations from the scientific literature. Existing extraction methods have difficulty distinguishing relationships between proteins and cellular locations co-mentioned in the same sentence. RESULTS: LocText was created as a new method to extract protein locations from abstracts and full texts. LocText learned patterns from syntax parse trees and was trained and evaluated on a newly improved LocTextCorpus. Combined with an automatic named-entity recognizer, LocText achieved high precision (P = 86%±4). After completing development, we mined the latest research publications for three organisms: human (Homo sapiens), budding yeast (Saccharomyces cerevisiae), and thale cress (Arabidopsis thaliana). Examining 60 novel, text-mined annotations, we found that 65% (human), 85% (yeast), and 80% (cress) were correct. Of all validated annotations, 40% were completely novel, i.e., they appeared neither in the annotations nor in the text descriptions of Swiss-Prot. CONCLUSIONS: LocText provides a cost-effective, semi-automated workflow to assist database curators in identifying novel protein localization annotations. The annotations suggested through text mining would be verified by experts to guarantee the high quality standards of manually curated databases such as Swiss-Prot. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1186/s12859-018-2021-9) contains supplementary material, which is available to authorized users.
format Online
Article
Text
id pubmed-5773052
institution National Center for Biotechnology Information
language English
publishDate 2018
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-57730522018-01-26 LocText: relation extraction of protein localizations to assist database curation Cejuela, Juan Miguel Vinchurkar, Shrikant Goldberg, Tatyana Prabhu Shankar, Madhukar Sollepura Baghudana, Ashish Bojchevski, Aleksandar Uhlig, Carsten Ofner, André Raharja-Liu, Pandu Jensen, Lars Juhl Rost, Burkhard BMC Bioinformatics Research Article BACKGROUND: The subcellular localization of a protein is an important aspect of its function. However, the experimental annotation of locations is not even complete for well-studied model organisms. Text mining might aid database curators to add experimental annotations from the scientific literature. Existing extraction methods have difficulties to distinguish relationships between proteins and cellular locations co-mentioned in the same sentence. RESULTS: LocText was created as a new method to extract protein locations from abstracts and full texts. LocText learned patterns from syntax parse trees and was trained and evaluated on a newly improved LocTextCorpus. Combined with an automatic named-entity recognizer, LocText achieved high precision (P = 86%±4). After completing development, we mined the latest research publications for three organisms: human (Homo sapiens), budding yeast (Saccharomyces cerevisiae), and thale cress (Arabidopsis thaliana). Examining 60 novel, text-mined annotations, we found that 65% (human), 85% (yeast), and 80% (cress) were correct. Of all validated annotations, 40% were completely novel, i.e. did neither appear in the annotations nor the text descriptions of Swiss-Prot. CONCLUSIONS: LocText provides a cost-effective, semi-automated workflow to assist database curators in identifying novel protein localization annotations. The annotations suggested through text-mining would be verified by experts to guarantee high-quality standards of manually-curated databases such as Swiss-Prot. 
ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1186/s12859-018-2021-9) contains supplementary material, which is available to authorized users. BioMed Central 2018-01-17 /pmc/articles/PMC5773052/ /pubmed/29343218 http://dx.doi.org/10.1186/s12859-018-2021-9 Text en © The Author(s) 2018 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License(http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver(http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
spellingShingle Research Article
Cejuela, Juan Miguel
Vinchurkar, Shrikant
Goldberg, Tatyana
Prabhu Shankar, Madhukar Sollepura
Baghudana, Ashish
Bojchevski, Aleksandar
Uhlig, Carsten
Ofner, André
Raharja-Liu, Pandu
Jensen, Lars Juhl
Rost, Burkhard
LocText: relation extraction of protein localizations to assist database curation
title LocText: relation extraction of protein localizations to assist database curation
title_full LocText: relation extraction of protein localizations to assist database curation
title_fullStr LocText: relation extraction of protein localizations to assist database curation
title_full_unstemmed LocText: relation extraction of protein localizations to assist database curation
title_short LocText: relation extraction of protein localizations to assist database curation
title_sort loctext: relation extraction of protein localizations to assist database curation
topic Research Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5773052/
https://www.ncbi.nlm.nih.gov/pubmed/29343218
http://dx.doi.org/10.1186/s12859-018-2021-9
work_keys_str_mv AT cejuelajuanmiguel loctextrelationextractionofproteinlocalizationstoassistdatabasecuration
AT vinchurkarshrikant loctextrelationextractionofproteinlocalizationstoassistdatabasecuration
AT goldbergtatyana loctextrelationextractionofproteinlocalizationstoassistdatabasecuration
AT prabhushankarmadhukarsollepura loctextrelationextractionofproteinlocalizationstoassistdatabasecuration
AT baghudanaashish loctextrelationextractionofproteinlocalizationstoassistdatabasecuration
AT bojchevskialeksandar loctextrelationextractionofproteinlocalizationstoassistdatabasecuration
AT uhligcarsten loctextrelationextractionofproteinlocalizationstoassistdatabasecuration
AT ofnerandre loctextrelationextractionofproteinlocalizationstoassistdatabasecuration
AT raharjaliupandu loctextrelationextractionofproteinlocalizationstoassistdatabasecuration
AT jensenlarsjuhl loctextrelationextractionofproteinlocalizationstoassistdatabasecuration
AT rostburkhard loctextrelationextractionofproteinlocalizationstoassistdatabasecuration