
The structural and content aspects of abstracts versus bodies of full text journal articles are different

BACKGROUND: An increase in work on the full text of journal articles and the growth of PubMedCentral have the opportunity to create a major paradigm shift in how biomedical text mining is done. However, until now there has been no comprehensive characterization of how the bodies of full text journal...


Bibliographic Details
Main authors: Cohen, K Bretonnel; Johnson, Helen L; Verspoor, Karin; Roeder, Christophe; Hunter, Lawrence E
Format: Text
Language: English
Published: BioMed Central 2010
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3098079/
https://www.ncbi.nlm.nih.gov/pubmed/20920264
http://dx.doi.org/10.1186/1471-2105-11-492
author Cohen, K Bretonnel
Johnson, Helen L
Verspoor, Karin
Roeder, Christophe
Hunter, Lawrence E
author_sort Cohen, K Bretonnel
collection PubMed
description BACKGROUND: An increase in work on the full text of journal articles and the growth of PubMedCentral have the opportunity to create a major paradigm shift in how biomedical text mining is done. However, until now there has been no comprehensive characterization of how the bodies of full text journal articles differ from the abstracts that until now have been the subject of most biomedical text mining research. RESULTS: We examined the structural and linguistic aspects of abstracts and bodies of full text articles, the performance of text mining tools on both, and the distribution of a variety of semantic classes of named entities between them. We found marked structural differences, with longer sentences in the article bodies and much heavier use of parenthesized material in the bodies than in the abstracts. We found content differences with respect to linguistic features. Three out of four of the linguistic features that we examined were statistically significantly differently distributed between the two genres. We also found content differences with respect to the distribution of semantic features. There were significantly different densities per thousand words for three out of four semantic classes, and clear differences in the extent to which they appeared in the two genres. With respect to the performance of text mining tools, we found that a mutation finder performed equally well in both genres, but that a wide variety of gene mention systems performed much worse on article bodies than they did on abstracts. POS tagging was also more accurate in abstracts than in article bodies. CONCLUSIONS: Aspects of structure and content differ markedly between article abstracts and article bodies. A number of these differences may pose problems as the text mining field moves more into the area of processing full-text articles. 
However, these differences also present a number of opportunities for the extraction of data types, particularly that found in parenthesized text, that is present in article bodies but not in article abstracts.
format Text
id pubmed-3098079
institution National Center for Biotechnology Information
language English
publishDate 2010
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-3098079 2011-05-20 The structural and content aspects of abstracts versus bodies of full text journal articles are different Cohen, K Bretonnel Johnson, Helen L Verspoor, Karin Roeder, Christophe Hunter, Lawrence E BMC Bioinformatics Research Article BioMed Central 2010-09-29 /pmc/articles/PMC3098079/ /pubmed/20920264 http://dx.doi.org/10.1186/1471-2105-11-492 Text en Copyright ©2010 Cohen et al; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
title The structural and content aspects of abstracts versus bodies of full text journal articles are different
topic Research Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3098079/
https://www.ncbi.nlm.nih.gov/pubmed/20920264
http://dx.doi.org/10.1186/1471-2105-11-492