Detection and categorization of bacteria habitats using shallow linguistic analysis

BACKGROUND: Information regarding bacteria biotopes is important for several research areas including health sciences, microbiology, and food processing and preservation. One of the challenges for scientists in these domains is the huge amount of information buried in the text of electronic resource...

Bibliographic Details
Main authors: Karadeniz, İlknur; Özgür, Arzucan
Format: Online Article Text
Language: English
Published: BioMed Central 2015
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4511461/
https://www.ncbi.nlm.nih.gov/pubmed/26201262
http://dx.doi.org/10.1186/1471-2105-16-S10-S5
_version_ 1782382339780771840
author Karadeniz, İlknur
Özgür, Arzucan
author_facet Karadeniz, İlknur
Özgür, Arzucan
author_sort Karadeniz, İlknur
collection PubMed
description BACKGROUND: Information regarding bacteria biotopes is important for several research areas including health sciences, microbiology, and food processing and preservation. One of the challenges for scientists in these domains is the huge amount of information buried in the text of electronic resources. Developing methods to automatically extract bacteria habitat relations from the text of these electronic resources is crucial for facilitating research in these areas. METHODS: We introduce a linguistically motivated rule-based approach for recognizing and normalizing names of bacteria habitats in biomedical text by using an ontology. Our approach is based on a shallow syntactic analysis of the text, which includes sentence segmentation, part-of-speech (POS) tagging, partial parsing, and lemmatization. In addition, we propose two methods for identifying bacteria habitat localization relations. The underlying assumption for the first method is that discourse changes with a new paragraph. Therefore, it operates on a paragraph basis. The second method performs a more fine-grained analysis of the text and operates on a sentence basis. We also develop a novel anaphora resolution method for bacteria coreferences and incorporate it into the sentence-based relation extraction approach. RESULTS: We participated in the Bacteria Biotope (BB) Task of the BioNLP Shared Task 2013. Our system (Boun) achieved the second best performance with 68% Slot Error Rate (SER) in Sub-task 1 (Entity Detection and Categorization), and ranked third with an F-score of 27% in Sub-task 2 (Localization Event Extraction). This paper reports the system that is implemented for the shared task, including the novel methods developed and the improvements obtained after the official evaluation. The extensions include the expansion of the OntoBiotope ontology using the training set for Sub-task 1, and the novel sentence-based relation extraction method incorporated with anaphora resolution for Sub-task 2. These extensions resulted in promising results for Sub-task 1 with a SER of 68%, and state-of-the-art performance for Sub-task 2 with an F-score of 53%. CONCLUSIONS: Our results show that a linguistically oriented approach based on the shallow syntactic analysis of the text is as effective as machine learning approaches for the detection and ontology-based normalization of habitat entities. Furthermore, the newly developed sentence-based relation extraction system with the anaphora resolution module significantly outperforms the paragraph-based one, as well as the other systems that participated in the BB Shared Task 2013.
format Online
Article
Text
id pubmed-4511461
institution National Center for Biotechnology Information
language English
publishDate 2015
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-4511461 2015-07-28 Detection and categorization of bacteria habitats using shallow linguistic analysis Karadeniz, İlknur Özgür, Arzucan BMC Bioinformatics Research BioMed Central 2015-07-13 /pmc/articles/PMC4511461/ /pubmed/26201262 http://dx.doi.org/10.1186/1471-2105-16-S10-S5 Text en Copyright © 2015 Karadeniz and Özgür; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/4.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
spellingShingle Research
Karadeniz, İlknur
Özgür, Arzucan
Detection and categorization of bacteria habitats using shallow linguistic analysis
title Detection and categorization of bacteria habitats using shallow linguistic analysis
topic Research
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4511461/
https://www.ncbi.nlm.nih.gov/pubmed/26201262
http://dx.doi.org/10.1186/1471-2105-16-S10-S5
work_keys_str_mv AT karadenizilknur detectionandcategorizationofbacteriahabitatsusingshallowlinguisticanalysis
AT ozgurarzucan detectionandcategorizationofbacteriahabitatsusingshallowlinguisticanalysis