Using topic modeling via non-negative matrix factorization to identify relationships between genetic variants and disease phenotypes: A case study of Lipoprotein(a) (LPA)

Bibliographic Details
Main authors: Zhao, Juan; Feng, QiPing; Wu, Patrick; Warner, Jeremy L.; Denny, Joshua C.; Wei, Wei-Qi
Format: Online Article Text
Language: English
Published: Public Library of Science 2019
Subjects: Research Article
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6374022/
https://www.ncbi.nlm.nih.gov/pubmed/30759150
http://dx.doi.org/10.1371/journal.pone.0212112
Description: Genome-wide and phenome-wide association studies are commonly used to identify important relationships between genetic variants and phenotypes. Most studies have treated diseases as independent variables and suffered from the burden of multiple adjustment due to the large number of genetic variants and disease phenotypes. In this study, we used topic modeling via non-negative matrix factorization (NMF) for identifying associations between disease phenotypes and genetic variants. Topic modeling is an unsupervised machine learning approach that can be used to learn patterns from electronic health record data. We chose the single nucleotide polymorphism (SNP) rs10455872 in LPA as the predictor since it has been shown to be associated with increased risk of hyperlipidemia and cardiovascular diseases (CVD). Using data of 12,759 individuals with electronic health records (EHR) and linked DNA samples at Vanderbilt University Medical Center, we trained a topic model using NMF from 1,853 distinct phenotypes and identified six topics. We tested their associations with rs10455872 in LPA. Topics enriched for CVD and hyperlipidemia had positive correlations with rs10455872 (P < 0.001), replicating a previous finding. We also identified a negative correlation between LPA and a topic enriched for lung cancer (P < 0.001) which was not previously identified via phenome-wide scanning. We were able to replicate the top finding in a separate dataset. Our results demonstrate the applicability of topic modeling in exploring the relationship between genetic variants and clinical diseases.
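The workflow the description outlines (factor an individuals-by-phenotypes matrix into topics with NMF, then relate each person's topic weights to a genotype) can be sketched on toy data. This is only an illustration under stated assumptions, not the authors' pipeline: the data are synthetic, the matrix is 60 x 6 rather than 12,759 x 1,853, two topics are used instead of six, the factorization uses textbook multiplicative updates, and Pearson correlation stands in for whatever association test the study actually applied. All function and variable names are hypothetical.

```python
import math
import random

def matmul(A, B):
    """Plain-Python matrix product of two lists of rows."""
    cols = list(zip(*B))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def sq_error(V, W, H):
    """Squared Frobenius norm of V - WH (the reconstruction error)."""
    WH = matmul(W, H)
    return sum((v - r) ** 2 for rv, rr in zip(V, WH) for v, r in zip(rv, rr))

def nmf(V, k, iters=300, seed=0, eps=1e-9):
    """Factor non-negative V (n x m) into W (n x k) and H (k x m)
    with multiplicative updates for squared error."""
    rng = random.Random(seed)
    n, m = len(V), len(V[0])
    W = [[rng.uniform(0.1, 1.0) for _ in range(k)] for _ in range(n)]
    H = [[rng.uniform(0.1, 1.0) for _ in range(m)] for _ in range(k)]
    for _ in range(iters):
        # H <- H * (W^T V) / (W^T W H)
        Wt = transpose(W)
        num, den = matmul(Wt, V), matmul(matmul(Wt, W), H)
        H = [[H[a][j] * num[a][j] / (den[a][j] + eps) for j in range(m)]
             for a in range(k)]
        # W <- W * (V H^T) / (W H H^T)
        Ht = transpose(H)
        num, den = matmul(V, Ht), matmul(W, matmul(H, Ht))
        W = [[W[i][a] * num[i][a] / (den[i][a] + eps) for a in range(k)]
             for i in range(n)]
    return W, H

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Synthetic toy data (assumption): 60 people, 6 binary phenotype codes.
# People 0-29 carry phenotypes 0-2, people 30-59 carry phenotypes 3-5.
n_people = 60
V = [[1.0, 1.0, 1.0, 0.0, 0.0, 0.0] if i < 30 else
     [0.0, 0.0, 0.0, 1.0, 1.0, 1.0] for i in range(n_people)]
# Hypothetical carrier status: common in the first group, rare in the second.
genotype = [1 if (i < 30 and i % 5 != 0) or (i >= 30 and i % 10 == 0) else 0
            for i in range(n_people)]

W, H = nmf(V, k=2)  # W: person-by-topic weights, H: topic-by-phenotype loadings
topic_r = [pearson([row[a] for row in W], genotype) for a in range(2)]
for a, r in enumerate(topic_r):
    print(f"topic {a}: loadings={[round(h, 2) for h in H[a]]}, r={r:.2f}")
```

Each row of `H` shows which phenotypes a topic is enriched for, and the sign of `r` indicates whether carriers tend to have higher or lower weight on that topic. On real EHR data a library implementation and a proper significance test would be the more likely choice.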
Record ID: pubmed-6374022 (record format: MEDLINE/PubMed; institution: National Center for Biotechnology Information)
Journal: PLoS One
Published online: 2019-02-13
License: © 2019 Zhao et al. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.