
Framework for automatic information extraction from research papers on nanocrystal devices

To support nanocrystal device development, we have been working on a computational framework to utilize information in research papers on nanocrystal devices. We developed an annotated corpus called “NaDev” (Nanocrystal Device Development) for this purpose. We also proposed an automatic information...


Bibliographic Details
Main authors:	Dieb, Thaer M; Yoshioka, Masaharu; Hara, Shinjiro; Newton, Marcus C
Format:	Online Article Text
Language:	English
Published:	Beilstein-Institut 2015
Subjects:	Full Research Paper
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4660922/ https://www.ncbi.nlm.nih.gov/pubmed/26665057 http://dx.doi.org/10.3762/bjnano.6.190

_version_	1782402899909804032
author	Dieb, Thaer M; Yoshioka, Masaharu; Hara, Shinjiro; Newton, Marcus C
author_facet	Dieb, Thaer M; Yoshioka, Masaharu; Hara, Shinjiro; Newton, Marcus C
author_sort	Dieb, Thaer M
collection	PubMed
description	To support nanocrystal device development, we have been working on a computational framework to utilize information in research papers on nanocrystal devices. We developed an annotated corpus called “NaDev” (Nanocrystal Device Development) for this purpose. We also proposed an automatic information extraction system called “NaDevEx” (Nanocrystal Device Automatic Information Extraction Framework). NaDevEx aims at extracting information from research papers on nanocrystal devices using the NaDev corpus and machine-learning techniques. However, the characteristics of NaDevEx were not examined in detail. In this paper, we conduct system evaluation experiments for NaDevEx using the NaDev corpus. We discuss three main issues: system performance compared with human annotators; the effect of paper type (synthesis or characterization) on system performance; and the effects of domain knowledge features (e.g., a chemical named entity recognition system and a list of names of physical quantities) on system performance. We found that overall system performance was 89% in precision and 69% in recall. Under loose agreement, in which a term that intersects a correct term of the same information category counts as a correct identification (in many cases, appropriate head nouns such as temperature or pressure match between the two terms), the overall performance is 95% in precision and 74% in recall. The system performance is almost comparable with that of human annotators for information categories with rich domain knowledge information (source material). For other information categories, given the relatively large number of terms that exist only in one paper, recall of individual information categories is not high (39–73%), although precision is better (75–97%). The average performance for synthesis papers is better than that for characterization papers because of the lack of training examples for characterization papers. Based on these results, we discuss future research plans for improving the performance of the system.
format	Online Article Text
id	pubmed-4660922
institution	National Center for Biotechnology Information
language	English
publishDate	2015
publisher	Beilstein-Institut
record_format	MEDLINE/PubMed
spelling	pubmed-4660922 2015-12-09 Framework for automatic information extraction from research papers on nanocrystal devices Dieb, Thaer M; Yoshioka, Masaharu; Hara, Shinjiro; Newton, Marcus C Beilstein J Nanotechnol Full Research Paper To support nanocrystal device development, we have been working on a computational framework to utilize information in research papers on nanocrystal devices. We developed an annotated corpus called “NaDev” (Nanocrystal Device Development) for this purpose. We also proposed an automatic information extraction system called “NaDevEx” (Nanocrystal Device Automatic Information Extraction Framework). NaDevEx aims at extracting information from research papers on nanocrystal devices using the NaDev corpus and machine-learning techniques. However, the characteristics of NaDevEx were not examined in detail. In this paper, we conduct system evaluation experiments for NaDevEx using the NaDev corpus. We discuss three main issues: system performance compared with human annotators; the effect of paper type (synthesis or characterization) on system performance; and the effects of domain knowledge features (e.g., a chemical named entity recognition system and a list of names of physical quantities) on system performance. We found that overall system performance was 89% in precision and 69% in recall. Under loose agreement, in which a term that intersects a correct term of the same information category counts as a correct identification (in many cases, appropriate head nouns such as temperature or pressure match between the two terms), the overall performance is 95% in precision and 74% in recall. The system performance is almost comparable with that of human annotators for information categories with rich domain knowledge information (source material). For other information categories, given the relatively large number of terms that exist only in one paper, recall of individual information categories is not high (39–73%), although precision is better (75–97%). The average performance for synthesis papers is better than that for characterization papers because of the lack of training examples for characterization papers. Based on these results, we discuss future research plans for improving the performance of the system. Beilstein-Institut 2015-09-07 /pmc/articles/PMC4660922/ /pubmed/26665057 http://dx.doi.org/10.3762/bjnano.6.190 Text en Copyright © 2015, Dieb et al. https://creativecommons.org/licenses/by/2.0 https://www.beilstein-journals.org/bjnano/terms This is an Open Access article under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The license is subject to the Beilstein Journal of Nanotechnology terms and conditions: (https://www.beilstein-journals.org/bjnano/terms)
title	Framework for automatic information extraction from research papers on nanocrystal devices
topic	Full Research Paper
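
The abstract in this record reports precision and recall under both exact and loose agreement. The following is a minimal sketch of how the two scoring modes differ, not the authors' evaluation code: the `Term` type, the character-offset representation, the category labels, and the counting rule (a term counts as correct if any term on the other side matches it) are all assumptions made for illustration.

```python
from typing import NamedTuple


class Term(NamedTuple):
    start: int      # character offset where the term begins (inclusive)
    end: int        # character offset where the term ends (exclusive)
    category: str   # information category label (hypothetical names below)


def overlaps(a: Term, b: Term) -> bool:
    """Loose agreement: same category and the two spans intersect."""
    return a.category == b.category and a.start < b.end and b.start < a.end


def precision_recall(predicted, gold, loose=False):
    """Return (precision, recall) under exact or loose agreement."""
    match = overlaps if loose else (lambda a, b: a == b)
    # Predicted terms that agree with at least one gold term.
    correct_pred = sum(any(match(p, g) for g in gold) for p in predicted)
    # Gold terms that were found by at least one predicted term.
    found_gold = sum(any(match(g, p) for p in predicted) for g in gold)
    precision = correct_pred / len(predicted) if predicted else 0.0
    recall = found_gold / len(gold) if gold else 0.0
    return precision, recall


# Toy data: the second prediction covers only part of a gold term.
gold = [
    Term(0, 10, "source_material"),
    Term(20, 38, "parameter"),
    Term(50, 60, "parameter"),
]
predicted = [
    Term(0, 10, "source_material"),
    Term(27, 38, "parameter"),
]

strict = precision_recall(predicted, gold)
relaxed = precision_recall(predicted, gold, loose=True)
print("strict:", strict)
print("loose: ", relaxed)
```

In this toy case the partially matching prediction is an error under exact agreement but a hit under loose agreement, so both precision and recall rise, which is the same direction as the shift from 89%/69% to 95%/74% reported in the abstract.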
