
Improving machine learning reproducibility in genetic association studies with proportional instance cross validation (PICV)

BACKGROUND: Machine learning methods and conventions are increasingly employed for the analysis of large, complex biomedical data sets, including genome-wide association studies (GWAS). Reproducibility of machine learning analyses of GWAS can be hampered by biological and statistical factors, partic...


Bibliographic Details
Main Authors:	Piette, Elizabeth R., Moore, Jason H.
Format:	Online Article Text
Language:	English
Published:	BioMed Central 2018
Subjects:	Methodology
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5907739/ https://www.ncbi.nlm.nih.gov/pubmed/29713384 http://dx.doi.org/10.1186/s13040-018-0167-7

_version_	1783315596686917632
author	Piette, Elizabeth R. Moore, Jason H.
author_facet	Piette, Elizabeth R. Moore, Jason H.
author_sort	Piette, Elizabeth R.
collection	PubMed
description	BACKGROUND: Machine learning methods and conventions are increasingly employed for the analysis of large, complex biomedical data sets, including genome-wide association studies (GWAS). Reproducibility of machine learning analyses of GWAS can be hampered by biological and statistical factors, particularly so for the investigation of non-additive genetic interactions. Application of traditional cross validation to a GWAS data set may result in poor consistency between the training and testing data set splits due to an imbalance of the interaction genotypes relative to the data as a whole. We propose a new cross validation method, proportional instance cross validation (PICV), that preserves the original distribution of an independent variable when splitting the data set into training and testing partitions.
RESULTS: We apply PICV to simulated GWAS data with epistatic interactions of varying minor allele frequencies and prevalences and compare performance to that of a traditional cross validation procedure in which individuals are randomly allocated to training and testing partitions. Sensitivity and positive predictive value are significantly improved across all tested scenarios for PICV compared to traditional cross validation. We also apply PICV to GWAS data from a study of primary open-angle glaucoma to investigate a previously reported interaction, which fails to replicate significantly; PICV, however, improves the consistency of testing and training results.
CONCLUSIONS: Application of traditional machine learning procedures to biomedical data may require modifications to better suit intrinsic characteristics of the data, such as the potential for highly imbalanced genotype distributions in the case of epistasis detection. The reproducibility of genetic interaction findings can be improved by considering this variable imbalance in cross validation implementation, such as with PICV. This approach may be extended to problems in other domains in which imbalanced variable distributions are a concern.
ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (10.1186/s13040-018-0167-7) contains supplementary material, which is available to authorized users.
format	Online Article Text
id	pubmed-5907739
institution	National Center for Biotechnology Information
language	English
publishDate	2018
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-5907739 2018-04-30 Piette, Elizabeth R. Moore, Jason H. BioData Min Methodology BioMed Central 2018-04-19 /pmc/articles/PMC5907739/ /pubmed/29713384 http://dx.doi.org/10.1186/s13040-018-0167-7 Text en © The Author(s). 2018 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
title	Improving machine learning reproducibility in genetic association studies with proportional instance cross validation (PICV)
topic	Methodology
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5907739/ https://www.ncbi.nlm.nih.gov/pubmed/29713384 http://dx.doi.org/10.1186/s13040-018-0167-7
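
The abstract in this record describes PICV only at a high level: the split into training and testing partitions should preserve the original distribution of an independent variable, such as an interaction genotype. The following is a minimal sketch of that idea and not the authors' implementation. The function name `proportional_folds`, the round-robin assignment scheme, and the toy two-locus genotype counts are all assumptions made for illustration.

```python
import random
from collections import Counter, defaultdict


def proportional_folds(values, k, seed=0):
    """Assign sample indices to k folds so that each distinct value of the
    chosen independent variable is spread as evenly as possible.

    Hypothetical helper: an illustration of proportion-preserving splitting,
    not the procedure published by the paper's authors.
    """
    rng = random.Random(seed)

    # Group sample indices by the value of the independent variable.
    by_value = defaultdict(list)
    for idx, value in enumerate(values):
        by_value[value].append(idx)

    folds = [[] for _ in range(k)]
    offset = 0
    for value in sorted(by_value):
        members = by_value[value]
        rng.shuffle(members)
        # Deal the members round-robin, continuing where the previous
        # group stopped so that overall fold sizes also stay balanced.
        for j, idx in enumerate(members):
            folds[(offset + j) % k].append(idx)
        offset += len(members)
    return folds


# Toy data (assumed): two-locus genotypes coded as (SNP1, SNP2) minor-allele
# counts, with one rare combination.
genotypes = [(0, 0)] * 60 + [(0, 1)] * 25 + [(1, 1)] * 10 + [(2, 2)] * 5
folds = proportional_folds(genotypes, k=5)

for test_fold in folds:
    counts = Counter(genotypes[i] for i in test_fold)
    print(sorted(counts.items()))
```

In this toy case every fold receives the same genotype counts, so the rare (2, 2) combination appears in each testing partition, whereas a purely random allocation could leave it out of some partitions entirely.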


Similar Items