
Systematic tissue annotations of genomics samples by modeling unstructured metadata

There are currently >1.3 million human –omics samples that are publicly available. This valuable resource remains acutely underused because discovering particular samples from this ever-growing data collection remains a significant challenge. The major impediment is that sample attributes are rou...


Bibliographic Details
Main authors: Hawkins, Nathaniel T., Maldaver, Marc, Yannakopoulos, Anna, Guare, Lindsay A., Krishnan, Arjun
Format: Online Article Text
Language: English
Published: Nature Publishing Group UK 2022
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9643451/
https://www.ncbi.nlm.nih.gov/pubmed/36347858
http://dx.doi.org/10.1038/s41467-022-34435-x
author Hawkins, Nathaniel T.
Maldaver, Marc
Yannakopoulos, Anna
Guare, Lindsay A.
Krishnan, Arjun
collection PubMed
description There are currently >1.3 million human –omics samples that are publicly available. This valuable resource remains acutely underused because discovering particular samples from this ever-growing data collection remains a significant challenge. The major impediment is that sample attributes are routinely described using varied terminologies written in unstructured natural language. We propose a natural-language-processing-based machine learning approach (NLP-ML) to infer tissue and cell-type annotations for genomics samples based only on their free-text metadata. NLP-ML works by creating numerical representations of sample descriptions and using these representations as features in a supervised learning classifier that predicts tissue/cell-type terms. Our approach significantly outperforms an advanced graph-based reasoning annotation method (MetaSRA) and a baseline exact string matching method (TAGGER). Model similarities between related tissues demonstrate that NLP-ML models capture biologically-meaningful signals in text. Additionally, these models correctly classify tissue-associated biological processes and diseases based on their text descriptions alone. NLP-ML models are nearly as accurate as models based on gene-expression profiles in predicting sample tissue annotations but have the distinct capability to classify samples irrespective of the genomics experiment type based on their text metadata. Python NLP-ML prediction code and trained tissue models are available at https://github.com/krishnanlab/txt2onto.
format Online
Article
Text
id pubmed-9643451
institution National Center for Biotechnology Information
language English
publishDate 2022
publisher Nature Publishing Group UK
record_format MEDLINE/PubMed
spelling pubmed-9643451 2022-11-15 Nat Commun Article
Nature Publishing Group UK 2022-11-08 /pmc/articles/PMC9643451/ /pubmed/36347858 http://dx.doi.org/10.1038/s41467-022-34435-x Text en © The Author(s) 2022 https://creativecommons.org/licenses/by/4.0/ Open Access. This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.
title Systematic tissue annotations of genomics samples by modeling unstructured metadata
topic Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9643451/
https://www.ncbi.nlm.nih.gov/pubmed/36347858
http://dx.doi.org/10.1038/s41467-022-34435-x
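
The description in this record says that NLP-ML creates numerical representations of sample descriptions and uses them as features in a supervised classifier that predicts tissue/cell-type terms. The following is a minimal sketch of that general idea, not the authors' txt2onto code. It assumes a plain bag-of-words representation and a small logistic regression trained by stochastic gradient descent, both swapped in purely for illustration; the record does not say which representation or classifier the authors use. The training sentences, the single "liver" label, and all function names are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase free text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_vocab(texts):
    """Assign a column index to every token seen in the training texts."""
    vocab = {}
    for text in texts:
        for tok in tokenize(text):
            vocab.setdefault(tok, len(vocab))
    return vocab


def vectorize(text, vocab):
    """Numerical representation: an L2-normalized bag-of-words vector."""
    counts = Counter(t for t in tokenize(text) if t in vocab)
    vec = [0.0] * len(vocab)
    for tok, n in counts.items():
        vec[vocab[tok]] = float(n)
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logreg(X, y, epochs=500, lr=0.5):
    """Fit a binary logistic regression with per-sample gradient steps."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            w = [wj - lr * err * xj for wj, xj in zip(w, xi)]
            b -= lr * err
    return w, b


def predict_proba(text, vocab, w, b):
    """Probability that a free-text description belongs to the tissue."""
    x = vectorize(text, vocab)
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)


# Hypothetical toy metadata: label 1 = liver, label 0 = any other tissue.
TRAIN = [
    ("hepatocytes isolated from adult liver biopsy", 1),
    ("liver tissue from hepatocellular carcinoma patient", 1),
    ("primary hepatocyte culture, liver donor", 1),
    ("whole blood from healthy donor", 0),
    ("cortical neurons from brain tissue", 0),
    ("lung fibroblast cell line", 0),
]

texts = [t for t, _ in TRAIN]
labels = [label for _, label in TRAIN]
vocab = build_vocab(texts)
X = [vectorize(t, vocab) for t in texts]
w, b = train_logreg(X, labels)

p_liver = predict_proba("RNA-seq of liver biopsy", vocab, w, b)
p_blood = predict_proba("peripheral blood sample", vocab, w, b)
print(f"liver-like description: {p_liver:.3f}")
print(f"blood-like description: {p_blood:.3f}")
```

In this sketch one binary model is trained for one tissue term; covering many tissue or cell-type terms would presumably mean training one such model per term, which is an assumption rather than something stated in the record. Tokens that never appeared in training (such as "RNA" or "seq" here) are simply ignored by the bag-of-words step.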