
Imbalance-Aware Machine Learning for Predicting Rare and Common Disease-Associated Non-Coding Variants

Disease and trait-associated variants represent a tiny minority of all known genetic variation, and therefore there is necessarily an imbalance between the small set of available disease-associated and the much larger set of non-deleterious genomic variation, especially in non-coding regulatory regi...


Bibliographic Details
Main authors: Schubach, Max; Re, Matteo; Robinson, Peter N.; Valentini, Giorgio
Format: Online Article Text
Language: English
Published: Nature Publishing Group UK 2017
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5462751/
https://www.ncbi.nlm.nih.gov/pubmed/28592878
http://dx.doi.org/10.1038/s41598-017-03011-5
author Schubach, Max
Re, Matteo
Robinson, Peter N.
Valentini, Giorgio
collection PubMed
description Disease- and trait-associated variants represent a tiny minority of all known genetic variation, and therefore there is necessarily an imbalance between the small set of available disease-associated variants and the much larger set of non-deleterious genomic variation, especially in non-coding regulatory regions of the human genome. Machine Learning (ML) methods for predicting disease-associated non-coding variants are faced with a chicken-and-egg problem: such variants cannot be easily found without ML, but ML cannot begin to be effective until a sufficient number of instances have been found. Most state-of-the-art ML-based methods do not adopt specific imbalance-aware learning techniques to deal with the imbalanced data that naturally arise in several genome-wide variant scoring problems, which results in a significant reduction of sensitivity and precision. We present a novel method that adopts imbalance-aware learning strategies based on resampling techniques and a hyper-ensemble approach, and that outperforms state-of-the-art methods in two different contexts: the prediction of non-coding variants associated with Mendelian diseases and with complex diseases. We show that imbalance-aware ML is a key issue for the design of robust and accurate prediction algorithms, and we provide a method and an easy-to-use software tool that can be effectively applied to this challenging prediction task.
format Online
Article
Text
id pubmed-5462751
institution National Center for Biotechnology Information
language English
publishDate 2017
publisher Nature Publishing Group UK
record_format MEDLINE/PubMed
spelling pubmed-5462751 2017-06-08 Imbalance-Aware Machine Learning for Predicting Rare and Common Disease-Associated Non-Coding Variants Schubach, Max; Re, Matteo; Robinson, Peter N.; Valentini, Giorgio Sci Rep Article
Nature Publishing Group UK 2017-06-07 /pmc/articles/PMC5462751/ /pubmed/28592878 http://dx.doi.org/10.1038/s41598-017-03011-5 Text en © The Author(s) 2017 Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.
title Imbalance-Aware Machine Learning for Predicting Rare and Common Disease-Associated Non-Coding Variants
topic Article