Increasing metadata coverage of SRA BioSample entries using deep learning–based named entity recognition

High-quality metadata annotations for data hosted in large public repositories are essential for research reproducibility and for conducting fast, powerful and scalable meta-analyses. Currently, a majority of sequencing samples in the National Center for Biotechnology Information’s Sequence Read Arc...

Bibliographic Details
Main authors:	Klie, Adam; Tsui, Brian Y; Mollah, Shamim; Skola, Dylan; Dow, Michelle; Hsu, Chun-Nan; Carter, Hannah
Format:	Online Article Text
Language:	English
Published:	Oxford University Press 2021
Subjects:	Original Article
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8083811/ https://www.ncbi.nlm.nih.gov/pubmed/33914028 http://dx.doi.org/10.1093/database/baab021

author	Klie, Adam; Tsui, Brian Y; Mollah, Shamim; Skola, Dylan; Dow, Michelle; Hsu, Chun-Nan; Carter, Hannah
collection	PubMed
description	High-quality metadata annotations for data hosted in large public repositories are essential for research reproducibility and for conducting fast, powerful and scalable meta-analyses. Currently, a majority of sequencing samples in the National Center for Biotechnology Information’s Sequence Read Archive (SRA) are missing metadata across several categories. In an effort to improve the metadata coverage of these samples, we leveraged almost 44 million attribute–value pairs from SRA BioSample to train a scalable, recurrent neural network that predicts missing metadata via named entity recognition (NER). The network was first trained to classify short text phrases according to 11 metadata categories and achieved an overall accuracy and area under the receiver operating characteristic curve of 85.2% and 0.977, respectively. We then applied our classifier to predict 11 metadata categories from the longer TITLE attribute of samples, evaluating performance on a set of samples withheld from model training. Prediction accuracies were high when extracting sample Genus/Species (94.85%), Condition/Disease (95.65%) and Strain (82.03%) from TITLEs, with lower accuracies and lack of predictions for other categories highlighting multiple issues with the current metadata annotations in BioSample. These results indicate the utility of recurrent neural networks for NER-based metadata prediction and the potential for models such as the one presented here to increase metadata coverage in BioSample while minimizing the need for manual curation. Database URL: https://github.com/cartercompbio/PredictMEE
format	Online Article Text
id	pubmed-8083811
institution	National Center for Biotechnology Information
language	English
publishDate	2021
publisher	Oxford University Press
record_format	MEDLINE/PubMed
spelling	pubmed-8083811 2021-05-05 Increasing metadata coverage of SRA BioSample entries using deep learning–based named entity recognition. Klie, Adam; Tsui, Brian Y; Mollah, Shamim; Skola, Dylan; Dow, Michelle; Hsu, Chun-Nan; Carter, Hannah. Database (Oxford). Original Article. Oxford University Press 2021-04-29 /pmc/articles/PMC8083811/ /pubmed/33914028 http://dx.doi.org/10.1093/database/baab021 Text en © The Author(s) 2021. Published by Oxford University Press. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
title	Increasing metadata coverage of SRA BioSample entries using deep learning–based named entity recognition
topic	Original Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8083811/ https://www.ncbi.nlm.nih.gov/pubmed/33914028 http://dx.doi.org/10.1093/database/baab021
