Using topic modeling via non-negative matrix factorization to identify relationships between genetic variants and disease phenotypes: A case study of Lipoprotein(a) (LPA)
Genome-wide and phenome-wide association studies are commonly used to identify important relationships between genetic variants and phenotypes. Most studies have treated diseases as independent variables and suffered from the burden of multiple adjustment due to the large number of genetic variants...
Main authors: | Zhao, Juan; Feng, QiPing; Wu, Patrick; Warner, Jeremy L.; Denny, Joshua C.; Wei, Wei-Qi |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Public Library of Science 2019 |
Subjects: | Research Article |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6374022/ https://www.ncbi.nlm.nih.gov/pubmed/30759150 http://dx.doi.org/10.1371/journal.pone.0212112 |
_version_ | 1783395093471821824 |
---|---|
author | Zhao, Juan; Feng, QiPing; Wu, Patrick; Warner, Jeremy L.; Denny, Joshua C.; Wei, Wei-Qi |
author_facet | Zhao, Juan; Feng, QiPing; Wu, Patrick; Warner, Jeremy L.; Denny, Joshua C.; Wei, Wei-Qi |
author_sort | Zhao, Juan |
collection | PubMed |
description | Genome-wide and phenome-wide association studies are commonly used to identify important relationships between genetic variants and phenotypes. Most studies have treated diseases as independent variables and suffered from the burden of multiple adjustment due to the large number of genetic variants and disease phenotypes. In this study, we used topic modeling via non-negative matrix factorization (NMF) to identify associations between disease phenotypes and genetic variants. Topic modeling is an unsupervised machine learning approach that can be used to learn patterns from electronic health record data. We chose the single nucleotide polymorphism (SNP) rs10455872 in LPA as the predictor since it has been shown to be associated with increased risk of hyperlipidemia and cardiovascular diseases (CVD). Using data from 12,759 individuals with electronic health records (EHR) and linked DNA samples at Vanderbilt University Medical Center, we trained a topic model using NMF from 1,853 distinct phenotypes and identified six topics. We tested their associations with rs10455872 in LPA. Topics enriched for CVD and hyperlipidemia had positive correlations with rs10455872 (P < 0.001), replicating a previous finding. We also identified a negative correlation between LPA and a topic enriched for lung cancer (P < 0.001), which was not previously identified via phenome-wide scanning. We were able to replicate the top finding in a separate dataset. Our results demonstrate the applicability of topic modeling in exploring the relationship between genetic variants and clinical diseases. |
format | Online Article Text |
id | pubmed-6374022 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2019 |
publisher | Public Library of Science |
record_format | MEDLINE/PubMed |
spelling | pubmed-6374022 2019-03-01 PLoS One Research Article Public Library of Science 2019-02-13 /pmc/articles/PMC6374022/ /pubmed/30759150 http://dx.doi.org/10.1371/journal.pone.0212112 Text en © 2019 Zhao et al http://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. |
title | Using topic modeling via non-negative matrix factorization to identify relationships between genetic variants and disease phenotypes: A case study of Lipoprotein(a) (LPA) |
topic | Research Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6374022/ https://www.ncbi.nlm.nih.gov/pubmed/30759150 http://dx.doi.org/10.1371/journal.pone.0212112 |
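
The description above outlines a two-step workflow: factor a patient-by-phenotype matrix into topics with NMF, then test each topic's per-patient weight against the genotype. The sketch below illustrates that idea on a small synthetic dataset. It is not the authors' pipeline: the data, the choice of two topics, the Lee-Seung multiplicative updates, and the use of a plain Pearson correlation in place of the study's association test are all assumptions made for illustration, and every function and variable name is hypothetical.

```python
import random


def matmul(a, b):
    """Plain-Python product of two list-of-lists matrices."""
    cols_b = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols_b] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def nmf(v, k, iters=200, seed=0, eps=1e-9):
    """Factor non-negative V (n x m) into W (n x k) and H (k x m).

    Uses Lee-Seung multiplicative updates for the squared-error objective
    (an assumed solver; the record does not say which one the study used).
    """
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    w = [[rng.random() + 0.1 for _ in range(k)] for _ in range(n)]
    h = [[rng.random() + 0.1 for _ in range(m)] for _ in range(k)]
    for _ in range(iters):
        # H <- H * (W^T V) / (W^T W H)
        wt = transpose(w)
        num = matmul(wt, v)
        den = matmul(matmul(wt, w), h)
        h = [[h[i][j] * num[i][j] / (den[i][j] + eps) for j in range(m)]
             for i in range(k)]
        # W <- W * (V H^T) / (W H H^T)
        ht = transpose(h)
        num = matmul(v, ht)
        den = matmul(w, matmul(h, ht))
        w = [[w[i][j] * num[i][j] / (den[i][j] + eps) for j in range(k)]
             for i in range(n)]
    return w, h


def reconstruction_error(v, w, h):
    """Squared Frobenius norm of V - WH."""
    wh = matmul(w, h)
    return sum((a - b) ** 2 for rv, rw in zip(v, wh) for a, b in zip(rv, rw))


def pearson(x, y):
    """Pearson correlation coefficient; returns 0.0 if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy) if sx > 0 and sy > 0 else 0.0


# Synthetic cohort (assumption): 60 patients, 8 binary phenotypes.
# Phenotypes 0-3 become more likely with each copy of the risk allele;
# phenotypes 4-7 are independent of genotype.
rng = random.Random(42)
n_patients, n_topics = 60, 2
genotype = [rng.choice([0, 0, 0, 1, 1, 2]) for _ in range(n_patients)]
V = []
for g in genotype:
    p_linked = 0.15 + 0.35 * g
    row = [1.0 if rng.random() < p_linked else 0.0 for _ in range(4)]
    row += [1.0 if rng.random() < 0.3 else 0.0 for _ in range(4)]
    V.append(row)

# W holds per-patient topic weights, H holds per-topic phenotype loadings.
W, H = nmf(V, n_topics)

# Correlate each topic's patient weights with the allele count.
correlations = [
    pearson([W[i][t] for i in range(n_patients)], genotype)
    for t in range(n_topics)
]
for t, r in enumerate(correlations):
    print(f"topic {t}: correlation with genotype = {r:.3f}")
```

In a real analysis the rows of `H` would be inspected to see which phenotypes dominate each topic, and the correlation step would normally be replaced by a regression that adjusts for covariates and reports a P value.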