
Extraction of data deposition statements from the literature: a method for automatically tracking research results

Motivation: Research in the biomedical domain can have a major impact through open sharing of the data produced. For this reason, it is important to be able to identify instances of data production and deposition for potential re-use. Herein, we report on the automatic identification of data deposit...


Bibliographic Details
Main authors: Névéol, Aurélie; Wilbur, W. John; Lu, Zhiyong
Format: Online Article Text
Language: English
Published: Oxford University Press 2011
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3223368/
https://www.ncbi.nlm.nih.gov/pubmed/21998156
http://dx.doi.org/10.1093/bioinformatics/btr573
_version_ 1782217286478725120
author Névéol, Aurélie
Wilbur, W. John
Lu, Zhiyong
author_facet Névéol, Aurélie
Wilbur, W. John
Lu, Zhiyong
author_sort Névéol, Aurélie
collection PubMed
description Motivation: Research in the biomedical domain can have a major impact through open sharing of the data produced. For this reason, it is important to be able to identify instances of data production and deposition for potential re-use. Herein, we report on the automatic identification of data deposition statements in research articles. Results: We apply machine learning algorithms to sentences extracted from full-text articles in PubMed Central in order to automatically determine whether a given article contains a data deposition statement, and retrieve the specific statements. With a Support Vector Machine classifier using conditional random field determined deposition features, articles containing deposition statements are correctly identified with 81% F-measure. An error analysis shows that almost half of the articles classified as containing a deposition statement by our method but not by the gold standard do indeed contain a deposition statement. In addition, our system was used to process articles in PubMed Central, predicting that a total of 52 932 articles report data deposition, many of which are not currently included in the Secondary Source Identifier [si] field for MEDLINE citations. Availability: All annotated datasets described in this study are freely available from the NLM/NCBI website at http://www.ncbi.nlm.nih.gov/CBBresearch/Fellows/Neveol/DepositionDataSets.zip Contact: aurelie.neveol@nih.gov; john.wilbur@nih.gov; zhiyong.lu@nih.gov Supplementary Information: Supplementary data are available at Bioinformatics online.
format Online
Article
Text
id pubmed-3223368
institution National Center for Biotechnology Information
language English
publishDate 2011
publisher Oxford University Press
record_format MEDLINE/PubMed
spelling pubmed-32233682011-11-25 Extraction of data deposition statements from the literature: a method for automatically tracking research results Névéol, Aurélie Wilbur, W. John Lu, Zhiyong Bioinformatics Original Papers Motivation: Research in the biomedical domain can have a major impact through open sharing of the data produced. For this reason, it is important to be able to identify instances of data production and deposition for potential re-use. Herein, we report on the automatic identification of data deposition statements in research articles. Results: We apply machine learning algorithms to sentences extracted from full-text articles in PubMed Central in order to automatically determine whether a given article contains a data deposition statement, and retrieve the specific statements. With a Support Vector Machine classifier using conditional random field determined deposition features, articles containing deposition statements are correctly identified with 81% F-measure. An error analysis shows that almost half of the articles classified as containing a deposition statement by our method but not by the gold standard do indeed contain a deposition statement. In addition, our system was used to process articles in PubMed Central, predicting that a total of 52 932 articles report data deposition, many of which are not currently included in the Secondary Source Identifier [si] field for MEDLINE citations. Availability: All annotated datasets described in this study are freely available from the NLM/NCBI website at http://www.ncbi.nlm.nih.gov/CBBresearch/Fellows/Neveol/DepositionDataSets.zip Contact: aurelie.neveol@nih.gov; john.wilbur@nih.gov; zhiyong.lu@nih.gov Supplementary Information: Supplementary data are available at Bioinformatics online. Oxford University Press 2011-12-01 2011-10-13 /pmc/articles/PMC3223368/ /pubmed/21998156 http://dx.doi.org/10.1093/bioinformatics/btr573 Text en © The Author(s) 2011. Published by Oxford University Press.
http://creativecommons.org/licenses/by-nc/3.0/uk/ This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0), which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
spellingShingle Original Papers
Névéol, Aurélie
Wilbur, W. John
Lu, Zhiyong
Extraction of data deposition statements from the literature: a method for automatically tracking research results
title Extraction of data deposition statements from the literature: a method for automatically tracking research results
topic Original Papers
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3223368/
https://www.ncbi.nlm.nih.gov/pubmed/21998156
http://dx.doi.org/10.1093/bioinformatics/btr573