Towards semi-automated curation: using text mining to recreate the HIV-1, human protein interaction database

Manual curation has long been used for extracting key information found within the primary literature for input into biological databases. The human immunodeficiency virus type 1 (HIV-1), human protein interaction database (HHPID), for example, contains 2589 manually extracted interactions, linked t...

Bibliographic Details
Main Authors: Jamieson, Daniel G.; Gerner, Martin; Sarafraz, Farzaneh; Nenadic, Goran; Robertson, David L.
Format: Online Article Text
Language: English
Published: Oxford University Press 2012
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3332570/
https://www.ncbi.nlm.nih.gov/pubmed/22529179
http://dx.doi.org/10.1093/database/bas023
author Jamieson, Daniel G.
Gerner, Martin
Sarafraz, Farzaneh
Nenadic, Goran
Robertson, David L.
collection PubMed
description Manual curation has long been used for extracting key information found within the primary literature for input into biological databases. The human immunodeficiency virus type 1 (HIV-1), human protein interaction database (HHPID), for example, contains 2589 manually extracted interactions, linked to 14 312 mentions in 3090 articles. The advancement of text-mining (TM) techniques has offered a possibility to rapidly retrieve such data from large volumes of text to a high degree of accuracy. Here, we present a recreation of the HHPID using the current state of the art in TM. To retrieve interactions, we performed gene/protein named entity recognition (NER) and applied two molecular event extraction tools on all abstracts and titles cited in the HHPID. Our best NER scores for precision, recall and F-score were 87.5%, 90.0% and 88.6%, respectively, while event extraction achieved 76.4%, 84.2% and 80.1%, respectively. We demonstrate that over 50% of the HHPID interactions can be recreated from abstracts and titles. Furthermore, from 49 available open-access full-text articles, we extracted a total of 237 unique HIV-1–human interactions, as opposed to 187 interactions recorded in the HHPID from the same articles. On average, we extracted 23 times more mentions of interactions and events from a full-text article than from an abstract and title, with a 6-fold increase in the number of unique interactions. We further demonstrated that more frequently occurring interactions extracted by TM are more likely to be true positives. Overall, the results demonstrate that TM was able to recover a large proportion of interactions, many of which were found within the HHPID, making TM a useful assistant in the manual curation process. Finally, we also retrieved other types of interactions in the context of HIV-1 that are not currently present in the HHPID, thus, expanding the scope of this data set. All data is available at http://gnode1.mib.man.ac.uk/HIV1-text-mining.
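The precision, recall and F-score figures quoted in the description can be related with a short calculation. The sketch below assumes the F-score is the balanced F1 (the harmonic mean of precision and recall), which the record does not state explicitly; the function name `f_score` and the `reported` dictionary are illustrative and not part of the authors' tooling. Recomputing from the rounded percentages gives about 88.7% for NER, marginally above the reported 88.6%, which would be consistent with the published value having been derived from unrounded figures; event extraction reproduces the reported 80.1%.

```python
def f_score(precision: float, recall: float, beta: float = 1.0) -> float:
    """Return the F-beta score; beta=1 gives the balanced F1 (harmonic mean).

    Assumption: the abstract's "F-score" is F1. The record does not say so.
    """
    if precision < 0 or recall < 0:
        raise ValueError("precision and recall must be non-negative")
    denominator = beta**2 * precision + recall
    if denominator == 0:
        # Both inputs are zero: define the score as 0 rather than divide by 0.
        return 0.0
    return (1 + beta**2) * precision * recall / denominator


# (precision, recall) pairs as quoted in the description, as fractions.
reported = {
    "NER": (0.875, 0.900),
    "event extraction": (0.764, 0.842),
}

# Recompute F1 from the rounded figures for each task.
recomputed = {task: f_score(p, r) for task, (p, r) in reported.items()}

for task, value in recomputed.items():
    print(f"{task}: F1 = {value:.1%}")
```

Passing `beta=2.0` would weight recall more heavily than precision, which may suit a curation-assistant setting where missed interactions cost more than false positives that a curator can discard; that choice is a suggestion here, not something the record reports.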
format Online
Article
Text
id pubmed-3332570
institution National Center for Biotechnology Information
language English
publishDate 2012
publisher Oxford University Press
record_format MEDLINE/PubMed
spelling pubmed-3332570 2012-04-23 Database (Oxford) Original Article Oxford University Press 2012-04-23 /pmc/articles/PMC3332570/ /pubmed/22529179 http://dx.doi.org/10.1093/database/bas023 Text en © The Author(s) 2012. Published by Oxford University Press. http://creativecommons.org/licenses/by-nc/3.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0), which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
title Towards semi-automated curation: using text mining to recreate the HIV-1, human protein interaction database
topic Original Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3332570/
https://www.ncbi.nlm.nih.gov/pubmed/22529179
http://dx.doi.org/10.1093/database/bas023