Detection and categorization of bacteria habitats using shallow linguistic analysis
BACKGROUND: Information regarding bacteria biotopes is important for several research areas including health sciences, microbiology, and food processing and preservation. One of the challenges for scientists in these domains is the huge amount of information buried in the text of electronic resource...
Main authors: | Karadeniz, İlknur; Özgür, Arzucan |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | BioMed Central 2015 |
Subjects: | |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4511461/ https://www.ncbi.nlm.nih.gov/pubmed/26201262 http://dx.doi.org/10.1186/1471-2105-16-S10-S5 |
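The record's description contrasts a paragraph-based and a sentence-based method for pairing bacteria with habitats. The sketch below illustrates only that difference in granularity, using plain co-occurrence within a text unit. It is not the authors' rule set: the function names, the substring matching, and the regex-based sentence splitter are all assumptions made for illustration, and anaphora resolution is left out.

```python
import re
from itertools import product


def split_paragraphs(text):
    """Split on blank lines (assumed paragraph delimiter)."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


def split_sentences(paragraph):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", paragraph) if s.strip()]


def cooccurrence_pairs(units, bacteria, habitats):
    """Pair every bacterium with every habitat mentioned in the same unit."""
    pairs = set()
    for unit in units:
        lowered = unit.lower()
        found_bacteria = [b for b in bacteria if b.lower() in lowered]
        found_habitats = [h for h in habitats if h.lower() in lowered]
        pairs.update(product(found_bacteria, found_habitats))
    return pairs


def paragraph_based(text, bacteria, habitats):
    """Coarse variant: the unit of analysis is the paragraph."""
    return cooccurrence_pairs(split_paragraphs(text), bacteria, habitats)


def sentence_based(text, bacteria, habitats):
    """Fine-grained variant: the unit of analysis is the sentence."""
    sentences = [
        sentence
        for paragraph in split_paragraphs(text)
        for sentence in split_sentences(paragraph)
    ]
    return cooccurrence_pairs(sentences, bacteria, habitats)
```

With sentence-level units, a habitat mentioned in a neighbouring sentence is no longer paired with the bacterium. That reduces spurious pairs but misses relations expressed across sentences, which is the gap an anaphora resolution step is meant to close.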
_version_ | 1782382339780771840 |
---|---|
author | Karadeniz, İlknur; Özgür, Arzucan |
author_facet | Karadeniz, İlknur; Özgür, Arzucan |
author_sort | Karadeniz, İlknur |
collection | PubMed |
description | BACKGROUND: Information regarding bacteria biotopes is important for several research areas including health sciences, microbiology, and food processing and preservation. One of the challenges for scientists in these domains is the huge amount of information buried in the text of electronic resources. Developing methods to automatically extract bacteria habitat relations from the text of these electronic resources is crucial for facilitating research in these areas. METHODS: We introduce a linguistically motivated rule-based approach for recognizing and normalizing names of bacteria habitats in biomedical text by using an ontology. Our approach is based on shallow syntactic analysis of the text, which includes sentence segmentation, part-of-speech (POS) tagging, partial parsing, and lemmatization. In addition, we propose two methods for identifying bacteria habitat localization relations. The underlying assumption for the first method is that discourse changes with a new paragraph. Therefore, it operates on a paragraph basis. The second method performs a more fine-grained analysis of the text and operates on a sentence basis. We also develop a novel anaphora resolution method for bacteria coreferences and incorporate it into the sentence-based relation extraction approach. RESULTS: We participated in the Bacteria Biotope (BB) Task of the BioNLP Shared Task 2013. Our system (Boun) achieved the second-best performance with 68% Slot Error Rate (SER) in Sub-task 1 (Entity Detection and Categorization), and ranked third with an F-score of 27% in Sub-task 2 (Localization Event Extraction). This paper describes the system implemented for the shared task, including the novel methods developed and the improvements obtained after the official evaluation. The extensions include the expansion of the OntoBiotope ontology using the training set for Sub-task 1, and the novel sentence-based relation extraction method incorporated with anaphora resolution for Sub-task 2. These extensions yielded promising results for Sub-task 1 with a SER of 68%, and state-of-the-art performance for Sub-task 2 with an F-score of 53%. CONCLUSIONS: Our results show that a linguistically oriented approach based on shallow syntactic analysis of the text is as effective as machine learning approaches for the detection and ontology-based normalization of habitat entities. Furthermore, the newly developed sentence-based relation extraction system with the anaphora resolution module significantly outperforms the paragraph-based one, as well as the other systems that participated in the BB Shared Task 2013. |
format | Online Article Text |
id | pubmed-4511461 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2015 |
publisher | BioMed Central |
record_format | MEDLINE/PubMed |
spelling | pubmed-4511461 2015-07-28 Detection and categorization of bacteria habitats using shallow linguistic analysis Karadeniz, İlknur Özgür, Arzucan BMC Bioinformatics Research BioMed Central 2015-07-13 /pmc/articles/PMC4511461/ /pubmed/26201262 http://dx.doi.org/10.1186/1471-2105-16-S10-S5 Text en Copyright © 2015 Karadeniz and Özgür; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/4.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated. |
spellingShingle | Research Karadeniz, İlknur Özgür, Arzucan Detection and categorization of bacteria habitats using shallow linguistic analysis |
title | Detection and categorization of bacteria habitats using shallow linguistic analysis |
title_sort | detection and categorization of bacteria habitats using shallow linguistic analysis |
topic | Research |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4511461/ https://www.ncbi.nlm.nih.gov/pubmed/26201262 http://dx.doi.org/10.1186/1471-2105-16-S10-S5 |
work_keys_str_mv | AT karadenizilknur detectionandcategorizationofbacteriahabitatsusingshallowlinguisticanalysis AT ozgurarzucan detectionandcategorizationofbacteriahabitatsusingshallowlinguisticanalysis |
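The description reports results as Slot Error Rate (SER, lower is better) and F-score (higher is better). As a reading aid, the sketch below computes both in their standard textbook forms: SER as substitutions plus deletions plus insertions divided by the number of reference slots, and F-score as the harmonic mean of precision and recall. The function and parameter names are illustrative assumptions, and the shared task's own scoring of partial matches is not reproduced here.

```python
def slot_error_rate(substitutions, deletions, insertions, reference_slots):
    """Standard SER: (S + D + I) / N, where N counts reference slots."""
    if reference_slots <= 0:
        raise ValueError("reference_slots must be positive")
    return (substitutions + deletions + insertions) / reference_slots


def f_score(true_positives, false_positives, false_negatives):
    """Balanced F-score: harmonic mean of precision and recall."""
    predicted = true_positives + false_positives
    expected = true_positives + false_negatives
    precision = true_positives / predicted if predicted else 0.0
    recall = true_positives / expected if expected else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

Because insertions count as errors, SER can exceed 1.0 when a system predicts many spurious entities, so it is not directly comparable to an F-score.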