Exploring subdomain variation in biomedical language

BACKGROUND: Applications of Natural Language Processing (NLP) technology to biomedical texts have generated significant interest in recent years. In this paper we identify and investigate the phenomenon of linguistic subdomain variation within the biomedical domain, i.e., the extent to which differe...

Bibliographic Details
Main Authors:	Lippincott, Thomas; Séaghdha, Diarmuid Ó; Korhonen, Anna
Format:	Online Article Text
Language:	English
Published:	BioMed Central 2011
Subjects:	Research Article
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3118171/ https://www.ncbi.nlm.nih.gov/pubmed/21619603 http://dx.doi.org/10.1186/1471-2105-12-212

_version_	1782206429192519680
author	Lippincott, Thomas Séaghdha, Diarmuid Ó Korhonen, Anna
collection	PubMed
description	BACKGROUND: Applications of Natural Language Processing (NLP) technology to biomedical texts have generated significant interest in recent years. In this paper we identify and investigate the phenomenon of linguistic subdomain variation within the biomedical domain, i.e., the extent to which different subject areas of biomedicine are characterised by different linguistic behaviour. While variation at a coarser domain level such as between newswire and biomedical text is well-studied and known to affect the portability of NLP systems, we are the first to conduct an extensive investigation into more fine-grained levels of variation. RESULTS: Using the large OpenPMC text corpus, which spans the many subdomains of biomedicine, we investigate variation across a number of lexical, syntactic, semantic and discourse-related dimensions. These dimensions are chosen for their relevance to the performance of NLP systems. We use clustering techniques to analyse commonalities and distinctions among the subdomains. CONCLUSIONS: We find that while patterns of inter-subdomain variation differ somewhat from one feature set to another, robust clusters can be identified that correspond to intuitive distinctions such as that between clinical and laboratory subjects. In particular, subdomains relating to genetics and molecular biology, which are the most common sources of material for training and evaluating biomedical NLP tools, are not representative of all biomedical subdomains. We conclude that an awareness of subdomain variation is important when considering the practical use of language processing applications by biomedical researchers.
format	Online Article Text
id	pubmed-3118171
institution	National Center for Biotechnology Information
language	English
publishDate	2011
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-31181712011-06-19 Exploring subdomain variation in biomedical language Lippincott, Thomas Séaghdha, Diarmuid Ó Korhonen, Anna BMC Bioinformatics Research Article BACKGROUND: Applications of Natural Language Processing (NLP) technology to biomedical texts have generated significant interest in recent years. In this paper we identify and investigate the phenomenon of linguistic subdomain variation within the biomedical domain, i.e., the extent to which different subject areas of biomedicine are characterised by different linguistic behaviour. While variation at a coarser domain level such as between newswire and biomedical text is well-studied and known to affect the portability of NLP systems, we are the first to conduct an extensive investigation into more fine-grained levels of variation. RESULTS: Using the large OpenPMC text corpus, which spans the many subdomains of biomedicine, we investigate variation across a number of lexical, syntactic, semantic and discourse-related dimensions. These dimensions are chosen for their relevance to the performance of NLP systems. We use clustering techniques to analyse commonalities and distinctions among the subdomains. CONCLUSIONS: We find that while patterns of inter-subdomain variation differ somewhat from one feature set to another, robust clusters can be identified that correspond to intuitive distinctions such as that between clinical and laboratory subjects. In particular, subdomains relating to genetics and molecular biology, which are the most common sources of material for training and evaluating biomedical NLP tools, are not representative of all biomedical subdomains. We conclude that an awareness of subdomain variation is important when considering the practical use of language processing applications by biomedical researchers. 
BioMed Central 2011-05-27 /pmc/articles/PMC3118171/ /pubmed/21619603 http://dx.doi.org/10.1186/1471-2105-12-212 Text en Copyright ©2011 Lippincott et al; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
spellingShingle	Research Article Lippincott, Thomas Séaghdha, Diarmuid Ó Korhonen, Anna Exploring subdomain variation in biomedical language
title	Exploring subdomain variation in biomedical language
topic	Research Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3118171/ https://www.ncbi.nlm.nih.gov/pubmed/21619603 http://dx.doi.org/10.1186/1471-2105-12-212
