An analysis of gene/protein associations at PubMed scale

BACKGROUND: Event extraction following the GENIA Event corpus and BioNLP shared task models has been a considerable focus of recent work in biomedical information extraction. This work includes efforts applying event extraction methods to the entire PubMed literature database, far beyond the narrow...


Bibliographic Details
Main Authors: Pyysalo, Sampo, Ohta, Tomoko, Tsujii, Jun’ichi
Format: Online Article Text
Language: English
Published: BioMed Central 2011
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3239305/
https://www.ncbi.nlm.nih.gov/pubmed/22166173
http://dx.doi.org/10.1186/2041-1480-2-S5-S5
_version_ 1782219163954053120
author Pyysalo, Sampo
Ohta, Tomoko
Tsujii, Jun’ichi
author_facet Pyysalo, Sampo
Ohta, Tomoko
Tsujii, Jun’ichi
author_sort Pyysalo, Sampo
collection PubMed
description BACKGROUND: Event extraction following the GENIA Event corpus and BioNLP shared task models has been a considerable focus of recent work in biomedical information extraction. This work includes efforts applying event extraction methods to the entire PubMed literature database, far beyond the narrow subdomains of biomedicine for which annotated resources for extraction method development are available. RESULTS: In the present study, our aim is to estimate the coverage of all statements of gene/protein associations in PubMed that existing resources for event extraction can provide. We base our analysis on a recently released corpus automatically annotated for gene/protein entities and syntactic analyses covering the entire PubMed, and use named entity co-occurrence, shortest dependency paths and an unlexicalized classifier to identify likely statements of gene/protein associations. A set of high-frequency/high-likelihood association statements are then manually analyzed with reference to the GENIA ontology. CONCLUSIONS: We present a first estimate of the overall coverage of gene/protein associations provided by existing resources for event extraction. Our results suggest that for event-type associations this coverage may be over 90%. We also identify several biologically significant associations of genes and proteins that are not addressed by these resources, suggesting directions for further extension of extraction coverage.
format Online
Article
Text
id pubmed-3239305
institution National Center for Biotechnology Information
language English
publishDate 2011
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-32393052011-12-16 An analysis of gene/protein associations at PubMed scale Pyysalo, Sampo Ohta, Tomoko Tsujii, Jun’ichi J Biomed Semantics Research BACKGROUND: Event extraction following the GENIA Event corpus and BioNLP shared task models has been a considerable focus of recent work in biomedical information extraction. This work includes efforts applying event extraction methods to the entire PubMed literature database, far beyond the narrow subdomains of biomedicine for which annotated resources for extraction method development are available. RESULTS: In the present study, our aim is to estimate the coverage of all statements of gene/protein associations in PubMed that existing resources for event extraction can provide. We base our analysis on a recently released corpus automatically annotated for gene/protein entities and syntactic analyses covering the entire PubMed, and use named entity co-occurrence, shortest dependency paths and an unlexicalized classifier to identify likely statements of gene/protein associations. A set of high-frequency/high-likelihood association statements are then manually analyzed with reference to the GENIA ontology. CONCLUSIONS: We present a first estimate of the overall coverage of gene/protein associations provided by existing resources for event extraction. Our results suggest that for event-type associations this coverage may be over 90%. We also identify several biologically significant associations of genes and proteins that are not addressed by these resources, suggesting directions for further extension of extraction coverage. BioMed Central 2011-10-06 /pmc/articles/PMC3239305/ /pubmed/22166173 http://dx.doi.org/10.1186/2041-1480-2-S5-S5 Text en Copyright ©2011 Pyysalo et al; licensee BioMed Central Ltd. 
http://creativecommons.org/licenses/by/2.0 This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
spellingShingle Research
Pyysalo, Sampo
Ohta, Tomoko
Tsujii, Jun’ichi
An analysis of gene/protein associations at PubMed scale
title An analysis of gene/protein associations at PubMed scale
title_full An analysis of gene/protein associations at PubMed scale
title_fullStr An analysis of gene/protein associations at PubMed scale
title_full_unstemmed An analysis of gene/protein associations at PubMed scale
title_short An analysis of gene/protein associations at PubMed scale
title_sort analysis of gene/protein associations at pubmed scale
topic Research
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3239305/
https://www.ncbi.nlm.nih.gov/pubmed/22166173
http://dx.doi.org/10.1186/2041-1480-2-S5-S5
work_keys_str_mv AT pyysalosampo ananalysisofgeneproteinassociationsatpubmedscale
AT ohtatomoko ananalysisofgeneproteinassociationsatpubmedscale
AT tsujiijunichi ananalysisofgeneproteinassociationsatpubmedscale
AT pyysalosampo analysisofgeneproteinassociationsatpubmedscale
AT ohtatomoko analysisofgeneproteinassociationsatpubmedscale
AT tsujiijunichi analysisofgeneproteinassociationsatpubmedscale