Systematic tissue annotations of genomics samples by modeling unstructured metadata
There are currently >1.3 million human –omics samples that are publicly available. This valuable resource remains acutely underused because discovering particular samples from this ever-growing data collection remains a significant challenge. The major impediment is that sample attributes are rou...
Main authors: | Hawkins, Nathaniel T.; Maldaver, Marc; Yannakopoulos, Anna; Guare, Lindsay A.; Krishnan, Arjun |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Nature Publishing Group UK, 2022 |
Subjects: | |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9643451/ https://www.ncbi.nlm.nih.gov/pubmed/36347858 http://dx.doi.org/10.1038/s41467-022-34435-x |
_version_ | 1784826530139996160 |
---|---|
author | Hawkins, Nathaniel T. Maldaver, Marc Yannakopoulos, Anna Guare, Lindsay A. Krishnan, Arjun |
author_facet | Hawkins, Nathaniel T. Maldaver, Marc Yannakopoulos, Anna Guare, Lindsay A. Krishnan, Arjun |
author_sort | Hawkins, Nathaniel T. |
collection | PubMed |
description | There are currently >1.3 million human –omics samples that are publicly available. This valuable resource remains acutely underused because discovering particular samples from this ever-growing data collection remains a significant challenge. The major impediment is that sample attributes are routinely described using varied terminologies written in unstructured natural language. We propose a natural-language-processing-based machine learning approach (NLP-ML) to infer tissue and cell-type annotations for genomics samples based only on their free-text metadata. NLP-ML works by creating numerical representations of sample descriptions and using these representations as features in a supervised learning classifier that predicts tissue/cell-type terms. Our approach significantly outperforms an advanced graph-based reasoning annotation method (MetaSRA) and a baseline exact string matching method (TAGGER). Model similarities between related tissues demonstrate that NLP-ML models capture biologically-meaningful signals in text. Additionally, these models correctly classify tissue-associated biological processes and diseases based on their text descriptions alone. NLP-ML models are nearly as accurate as models based on gene-expression profiles in predicting sample tissue annotations but have the distinct capability to classify samples irrespective of the genomics experiment type based on their text metadata. Python NLP-ML prediction code and trained tissue models are available at https://github.com/krishnanlab/txt2onto. |
format | Online Article Text |
id | pubmed-9643451 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2022 |
publisher | Nature Publishing Group UK |
record_format | MEDLINE/PubMed |
spelling | pubmed-9643451 2022-11-15 Systematic tissue annotations of genomics samples by modeling unstructured metadata Hawkins, Nathaniel T. Maldaver, Marc Yannakopoulos, Anna Guare, Lindsay A. Krishnan, Arjun Nat Commun Article Nature Publishing Group UK 2022-11-08 /pmc/articles/PMC9643451/ /pubmed/36347858 http://dx.doi.org/10.1038/s41467-022-34435-x Text en © The Author(s) 2022 https://creativecommons.org/licenses/by/4.0/ Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/. |
spellingShingle | Article Hawkins, Nathaniel T. Maldaver, Marc Yannakopoulos, Anna Guare, Lindsay A. Krishnan, Arjun Systematic tissue annotations of genomics samples by modeling unstructured metadata |
title | Systematic tissue annotations of genomics samples by modeling unstructured metadata |
title_full | Systematic tissue annotations of genomics samples by modeling unstructured metadata |
title_fullStr | Systematic tissue annotations of genomics samples by modeling unstructured metadata |
title_full_unstemmed | Systematic tissue annotations of genomics samples by modeling unstructured metadata |
title_short | Systematic tissue annotations of genomics samples by modeling unstructured metadata |
title_sort | systematic tissue annotations of genomics samples by modeling unstructured metadata |
topic | Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9643451/ https://www.ncbi.nlm.nih.gov/pubmed/36347858 http://dx.doi.org/10.1038/s41467-022-34435-x |
work_keys_str_mv | AT hawkinsnathanielt systematictissueannotationsofgenomicssamplesbymodelingunstructuredmetadata AT maldavermarc systematictissueannotationsofgenomicssamplesbymodelingunstructuredmetadata AT yannakopoulosanna systematictissueannotationsofgenomicssamplesbymodelingunstructuredmetadata AT guarelindsaya systematictissueannotationsofgenomicssamplesbymodelingunstructuredmetadata AT krishnanarjun systematictissueannotationsofgenomicssamplesbymodelingunstructuredmetadata |
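
The description field above says that NLP-ML turns free-text sample descriptions into numerical representations and feeds them to a supervised classifier that predicts tissue terms. The following is a minimal sketch of that general idea only, not the authors' method and not the txt2onto API: it assumes a plain bag-of-words representation and a nearest-centroid classifier, both swapped in for illustration. The training texts, labels, and every function name are hypothetical.

```python
import math
import re
from collections import Counter, defaultdict

# Hypothetical toy training data: (free-text sample description, tissue label).
TRAIN = [
    ("total RNA from adult liver biopsy, hepatocytes", "liver"),
    ("hepatic tissue, liver carcinoma patient", "liver"),
    ("whole blood, peripheral leukocytes", "blood"),
    ("PBMC isolated from peripheral blood donor", "blood"),
]


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def vectorize(text):
    """Numerical representation: an L2-normalised bag-of-words vector (as a dict)."""
    counts = Counter(tokenize(text))
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {tok: c / norm for tok, c in counts.items()} if norm else {}


def train_centroids(examples):
    """Supervised step (assumed): average the vectors of each label into a centroid."""
    sums = defaultdict(Counter)
    sizes = Counter()
    for text, label in examples:
        sizes[label] += 1
        for tok, weight in vectorize(text).items():
            sums[label][tok] += weight
    return {
        label: {tok: w / sizes[label] for tok, w in vec.items()}
        for label, vec in sums.items()
    }


def predict(centroids, text):
    """Return the label whose centroid has the largest dot product with the text vector."""
    vec = vectorize(text)
    scores = {
        label: sum(w * centroid.get(tok, 0.0) for tok, w in vec.items())
        for label, centroid in centroids.items()
    }
    return max(scores, key=scores.get)


centroids = train_centroids(TRAIN)
prediction = predict(centroids, "RNA-seq of liver biopsy")
print(prediction)
```

Because the classifier sees only the text, the same call works regardless of the experiment type that produced the sample, which is the property the abstract highlights. A description that shares no tokens with any training text scores zero for every label, so a real system would need a richer representation and a confidence threshold.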