A Literature-Derived Knowledge Graph Augments the Interpretation of Single Cell RNA-seq Datasets

Technology to generate single cell RNA-sequencing (scRNA-seq) datasets and tools to annotate them have advanced rapidly in the past several years. Such tools generally rely on existing transcriptomic datasets or curated databases of cell type defining genes, while the application of scalable natural...

Bibliographic Details
Main Authors: Doddahonnaiah, Deeksha, Lenehan, Patrick J., Hughes, Travis K., Zemmour, David, Garcia-Rivera, Enrique, Venkatakrishnan, A. J., Chilaka, Ramakrishna, Khare, Apoorv, Kasaraneni, Akhil, Garg, Abhinav, Anand, Akash, Barve, Rakesh, Thiagarajan, Viswanathan, Soundararajan, Venky
Format: Online Article Text
Language: English
Published: MDPI 2021
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8229796/
https://www.ncbi.nlm.nih.gov/pubmed/34200671
http://dx.doi.org/10.3390/genes12060898
_version_ 1783713062979633152
author Doddahonnaiah, Deeksha
Lenehan, Patrick J.
Hughes, Travis K.
Zemmour, David
Garcia-Rivera, Enrique
Venkatakrishnan, A. J.
Chilaka, Ramakrishna
Khare, Apoorv
Kasaraneni, Akhil
Garg, Abhinav
Anand, Akash
Barve, Rakesh
Thiagarajan, Viswanathan
Soundararajan, Venky
author_sort Doddahonnaiah, Deeksha
collection PubMed
description Technology to generate single cell RNA-sequencing (scRNA-seq) datasets and tools to annotate them have advanced rapidly in the past several years. Such tools generally rely on existing transcriptomic datasets or curated databases of cell type defining genes, while the application of scalable natural language processing (NLP) methods to enhance analysis workflows has not been adequately explored. Here we deployed an NLP framework to objectively quantify associations between a comprehensive set of over 20,000 human protein-coding genes and over 500 cell type terms across over 26 million biomedical documents. The resultant gene-cell type associations (GCAs) are significantly stronger between a curated set of matched cell type-marker pairs than the complementary set of mismatched pairs (Mann-Whitney p = 6.15 × 10^(−76), r = 0.24; Cohen's d = 2.6). Building on this, we developed an augmented annotation algorithm (single cell Annotation via Literature Encoding, or scALE) that leverages GCAs to categorize cell clusters identified in scRNA-seq datasets, and we tested its ability to predict the cellular identity of 133 clusters from nine datasets of human breast, colon, heart, joint, ovary, prostate, skin, and small intestine tissues. With the optimized settings, the true cellular identity matched the top prediction in 59% of tested clusters and was present among the top five predictions for 91% of clusters. scALE slightly outperformed an existing method for reference data driven automated cluster annotation, and we demonstrate that integration of scALE can meaningfully improve the annotations derived from such methods. Further, contextualization of differential expression analyses with these GCAs highlights poorly characterized markers of well-studied cell types, such as CLIC6 and DNASE1L3 in retinal pigment epithelial cells and endothelial cells, respectively. Taken together, this study illustrates for the first time how the systematic application of a literature-derived knowledge graph can expedite and enhance the annotation and interpretation of scRNA-seq data.
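The abstract reports how often a cluster's true identity is the top prediction or among the top five. The Python sketch below illustrates that kind of top-k evaluation on a toy example. It is not the scALE algorithm: the abstract does not specify how GCAs are combined, so ranking by the mean score is an assumption, and the score table, the function names rank_cell_types and top_k_accuracy, and the example clusters are all hypothetical.

```python
from statistics import mean

# Hypothetical toy association scores (gene -> cell type -> score).
# These numbers are invented for illustration; they are not GCAs from the study.
TOY_GCA = {
    "CD3E": {"T cell": 0.9, "B cell": 0.2, "endothelial cell": 0.05},
    "CD79A": {"T cell": 0.1, "B cell": 0.95, "endothelial cell": 0.05},
    "PECAM1": {"T cell": 0.2, "B cell": 0.2, "endothelial cell": 0.9},
    "VWF": {"T cell": 0.05, "B cell": 0.05, "endothelial cell": 0.85},
}


def rank_cell_types(marker_genes, gca):
    """Rank cell types by the mean association score over a cluster's marker genes.

    Averaging is an assumed aggregation rule, chosen only to keep the sketch simple.
    """
    genes = [g for g in marker_genes if g in gca]
    if not genes:
        return []
    cell_types = sorted({ct for g in genes for ct in gca[g]})
    scores = {ct: mean(gca[g].get(ct, 0.0) for g in genes) for ct in cell_types}
    return sorted(cell_types, key=lambda ct: scores[ct], reverse=True)


def top_k_accuracy(clusters, gca, k):
    """Fraction of clusters whose true label appears in the top-k ranked cell types."""
    hits = sum(
        1 for markers, truth in clusters
        if truth in rank_cell_types(markers, gca)[:k]
    )
    return hits / len(clusters)


# Hypothetical clusters: (marker genes of the cluster, manually assigned label).
TOY_CLUSTERS = [
    (["PECAM1", "VWF"], "endothelial cell"),
    (["CD3E"], "T cell"),
    (["CD79A", "CD3E"], "T cell"),
]

if __name__ == "__main__":
    for k in (1, 2):
        print(f"top-{k} accuracy: {top_k_accuracy(TOY_CLUSTERS, TOY_GCA, k):.2f}")
```

In this toy setup the third cluster is ranked "B cell" first and "T cell" second, so it counts as a miss at k = 1 and a hit at k = 2, which mirrors the gap between the top-1 and top-5 figures quoted in the abstract.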
format Online
Article
Text
id pubmed-8229796
institution National Center for Biotechnology Information
language English
publishDate 2021
publisher MDPI
record_format MEDLINE/PubMed
spelling pubmed-8229796 2021-06-26 Genes (Basel) Article MDPI 2021-06-10 /pmc/articles/PMC8229796/ /pubmed/34200671 http://dx.doi.org/10.3390/genes12060898 Text en © 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
title A Literature-Derived Knowledge Graph Augments the Interpretation of Single Cell RNA-seq Datasets
topic Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8229796/
https://www.ncbi.nlm.nih.gov/pubmed/34200671
http://dx.doi.org/10.3390/genes12060898