Automated recognition of malignancy mentions in biomedical literature
BACKGROUND: The rapid proliferation of biomedical text makes it increasingly difficult for researchers to identify, synthesize, and utilize developed knowledge in their fields of interest. Automated information extraction procedures can assist in the acquisition and management of this knowledge. Pre...
Main authors: | Jin, Yang; McDonald, Ryan T; Lerman, Kevin; Mandel, Mark A; Carroll, Steven; Liberman, Mark Y; Pereira, Fernando C; Winters, Raymond S; White, Peter S |
---|---|
Format: | Text |
Language: | English |
Published: | BioMed Central 2006 |
Subjects: | |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1657036/ https://www.ncbi.nlm.nih.gov/pubmed/17090325 http://dx.doi.org/10.1186/1471-2105-7-492 |
_version_ | 1782131025978064896 |
---|---|
author | Jin, Yang; McDonald, Ryan T; Lerman, Kevin; Mandel, Mark A; Carroll, Steven; Liberman, Mark Y; Pereira, Fernando C; Winters, Raymond S; White, Peter S |
author_sort | Jin, Yang |
collection | PubMed |
description | BACKGROUND: The rapid proliferation of biomedical text makes it increasingly difficult for researchers to identify, synthesize, and utilize developed knowledge in their fields of interest. Automated information extraction procedures can assist in the acquisition and management of this knowledge. Previous efforts in biomedical text mining have focused primarily upon named entity recognition of well-defined molecular objects such as genes, but less work has been performed to identify disease-related objects and concepts. Furthermore, promise has been tempered by an inability to efficiently scale approaches in ways that minimize manual efforts and still perform with high accuracy. Here, we have applied a machine-learning approach previously successful for identifying molecular entities to a disease concept to determine if the underlying probabilistic model effectively generalizes to unrelated concepts with minimal manual intervention for model retraining. RESULTS: We developed a named entity recognizer (MTag), an entity tagger for recognizing clinical descriptions of malignancy presented in text. The application uses the machine-learning technique Conditional Random Fields with additional domain-specific features. MTag was tested with 1,010 training and 432 evaluation documents pertaining to cancer genomics. Overall, our experiments resulted in 0.85 precision, 0.83 recall, and 0.84 F-measure on the evaluation set. Compared with a baseline system using string matching of text with a neoplasm term list, MTag performed with a much higher recall rate (92.1% vs. 42.1% recall) and demonstrated the ability to learn new patterns. Application of MTag to all MEDLINE abstracts yielded the identification of 580,002 unique and 9,153,340 overall mentions of malignancy. Significantly, addition of an extensive lexicon of malignancy mentions as a feature set for extraction had minimal impact in performance. 
CONCLUSION: Together, these results suggest that the identification of disparate biomedical entity classes in free text may be achievable with high accuracy and only moderate additional effort for each new application domain. |
format | Text |
id | pubmed-1657036 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2006 |
publisher | BioMed Central |
record_format | MEDLINE/PubMed |
spelling | pubmed-1657036 2006-11-22 Automated recognition of malignancy mentions in biomedical literature Jin, Yang; McDonald, Ryan T; Lerman, Kevin; Mandel, Mark A; Carroll, Steven; Liberman, Mark Y; Pereira, Fernando C; Winters, Raymond S; White, Peter S BMC Bioinformatics Research Article BioMed Central 2006-11-07 /pmc/articles/PMC1657036/ /pubmed/17090325 http://dx.doi.org/10.1186/1471-2105-7-492 Text en Copyright © 2006 Jin et al; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. |
title | Automated recognition of malignancy mentions in biomedical literature |
topic | Research Article |
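
The description above reports 0.85 precision, 0.83 recall, and 0.84 F-measure on the evaluation set. As a quick consistency check, the sketch below recomputes the F-measure from the two reported values. It assumes the balanced form (beta = 1, the harmonic mean of precision and recall), which the record does not state explicitly; the function name `f_measure` is illustrative and not taken from the article.

```python
def f_measure(precision: float, recall: float, beta: float = 1.0) -> float:
    """Weighted harmonic mean of precision and recall.

    beta = 1 gives the balanced F1 score. Assumption: the record's
    "F-measure" is this balanced form; the abstract does not say so.
    """
    if precision <= 0.0 or recall <= 0.0:
        # The harmonic mean is taken to be 0 when either term is 0.
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Values reported in the description for the evaluation set.
reported = f_measure(0.85, 0.83)
print(round(reported, 2))  # 0.84
```

Under that assumption the recomputed value rounds to 0.84, which agrees with the figure given in the description.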