Collective feature selection to identify crucial epistatic variants

BACKGROUND: Machine learning methods have gained popularity and practicality in identifying linear and non-linear effects of variants associated with complex diseases and traits. Detection of epistatic interactions remains a challenge due to the large number of features and relatively small sample...

Bibliographic Details
Main authors: Verma, Shefali S., Lucas, Anastasia, Zhang, Xinyuan, Veturi, Yogasudha, Dudek, Scott, Li, Binglan, Li, Ruowang, Urbanowicz, Ryan, Moore, Jason H., Kim, Dokyoon, Ritchie, Marylyn D.
Format: Online Article Text
Language: English
Published: BioMed Central 2018
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5907720/
https://www.ncbi.nlm.nih.gov/pubmed/29713383
http://dx.doi.org/10.1186/s13040-018-0168-6
collection PubMed
description BACKGROUND: Machine learning methods have gained popularity and practicality in identifying linear and non-linear effects of variants associated with complex diseases and traits. Detection of epistatic interactions remains a challenge because the input typically has a large number of features and a relatively small sample size, the so-called “short fat data” problem. The efficiency of machine learning methods can be increased by limiting the number of input features, so it is very important to perform variable selection before searching for epistasis. Many methods have been proposed and evaluated for feature selection, but no single method works best in all scenarios. We demonstrate this by conducting two separate simulation analyses to evaluate the proposed collective feature selection approach.

RESULTS: Through our simulation study we propose a collective feature selection approach that selects features in the “union” of the best performing methods. We explored various parametric, non-parametric, and data mining approaches to perform feature selection. We chose our top performing methods and took the union of the resulting variables to downstream analysis, based on a user-defined percentage of variants selected from each method. Our simulation analysis shows that non-parametric data mining approaches, such as MDR, may work best under one simulation criterion for the high effect size (penetrance) datasets, while non-parametric methods designed for feature selection, such as Ranger and Gradient boosting, work best under other simulation criteria. Thus, a collective approach proves more beneficial for selecting variables with epistatic effects, including in low effect size datasets and across different genetic architectures. Following this, we applied our proposed collective feature selection approach to select the top 1% of variables to identify potential interacting variables associated with Body Mass Index (BMI) in ~44,000 samples obtained from Geisinger’s MyCode Community Health Initiative (on behalf of the DiscovEHR collaboration).

CONCLUSIONS: In this study, we showed via simulation studies that selecting variables using a collective feature selection approach could help in selecting true positive epistatic variables more frequently than applying any single feature selection method. We demonstrated the effectiveness of collective feature selection along with a comparison of many methods in our simulation analysis. We also applied our method to identify non-linear networks associated with obesity.
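The union step described in the abstract can be illustrated with a minimal sketch. It assumes each feature selection method yields one importance score per variant, with higher scores meaning more important; the function names, method names, and scores below are hypothetical and are not taken from the authors' pipeline.

```python
from math import ceil
from typing import Dict, Set


def top_fraction(scores: Dict[str, float], fraction: float) -> Set[str]:
    """Return the names of the highest-scoring features for one method.

    `fraction` is the user-defined share of features to keep (0 < fraction <= 1);
    at least one feature is always returned.
    """
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    k = max(1, ceil(len(scores) * fraction))
    # Rank feature names by descending score (assumption: higher = more important).
    ranked = sorted(scores, key=scores.get, reverse=True)
    return set(ranked[:k])


def collective_selection(
    method_scores: Dict[str, Dict[str, float]], fraction: float
) -> Set[str]:
    """Union of the top `fraction` of features chosen by each method."""
    selected: Set[str] = set()
    for scores in method_scores.values():
        selected |= top_fraction(scores, fraction)
    return selected


# Hypothetical importance scores from two unnamed methods (illustration only).
method_scores = {
    "method_a": {"snp1": 0.9, "snp2": 0.1, "snp3": 0.4, "snp4": 0.2},
    "method_b": {"snp1": 0.2, "snp2": 0.8, "snp3": 0.3, "snp4": 0.1},
}

selected = collective_selection(method_scores, fraction=0.25)
print(sorted(selected))
```

Because the result is a union, a variant ranked highly by any one method is carried forward even when the other methods rank it low, which is the property the abstract relies on when no single method works best in all scenarios.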
id pubmed-5907720
institution National Center for Biotechnology Information
record_format MEDLINE/PubMed
spelling pubmed-5907720 2018-04-30. BioData Min. Research. BioMed Central, 2018-04-19. Text, en. © The Author(s). 2018. Open Access. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
topic Research