Pipeline design to identify key features and classify the chemotherapy response on lung cancer patients using large-scale genetic data
BACKGROUND: During the last decade, interest in applying machine learning algorithms to genomic data has increased in many bioinformatics applications. Analyzing this type of data entails difficulties for managing high-dimensional data, class imbalance for knowledge extraction, identifying importan...
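Main authors: | Valdés, María Gabriela; Galván-Femenía, Iván; Ripoll, Vicent Ribas; Duran, Xavier; Yokota, Jun; Gavaldà, Ricard; Rafael-Palou, Xavier; de Cid, Rafael |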
---|---|
Format: | Online Article Text |
Language: | English |
Published: | BioMed Central 2018 |
Subjects: | |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6245589/ https://www.ncbi.nlm.nih.gov/pubmed/30458782 http://dx.doi.org/10.1186/s12918-018-0615-5 |
_version_ | 1783372269906558976 |
---|---|
author | Valdés, María Gabriela; Galván-Femenía, Iván; Ripoll, Vicent Ribas; Duran, Xavier; Yokota, Jun; Gavaldà, Ricard; Rafael-Palou, Xavier; de Cid, Rafael |
author_sort | Valdés, María Gabriela |
collection | PubMed |
description | BACKGROUND: During the last decade, interest in applying machine learning algorithms to genomic data has increased in many bioinformatics applications. Analyzing this type of data entails several difficulties: managing high-dimensional data, handling class imbalance during knowledge extraction, identifying important features, and classifying individuals. In this study, we propose a general framework to tackle these challenges with different machine learning algorithms and techniques. We apply the configuration of this framework to lung cancer patients, identifying genetic signatures for classifying response to drug treatment. We intersect these relevant SNPs with the GWAS Catalog of the National Human Genome Research Institute and explore the RegulomeDB and GTEx databases for functional analysis purposes. RESULTS: The machine learning based solution proposed in this study is a scalable and flexible alternative to the classical univariate regression approach for analyzing large-scale data. From 36 experiments executed using the machine learning framework design, we obtain good classification performance from the top 5 models with the highest cross-validation score and the smallest standard deviation. A total of 1224 SNPs corresponding to the key features from the top 20 models (cross-validation F1 mean >= 0.65) were compared with the GWAS Catalog, finding no intersection with reported genome-wide significant hits. Among these, new genetic signatures in MAE, CEP104, PRKCZ and ADRB2 show relevant biological regulatory functionality related to lung physiology. CONCLUSIONS: We have defined a machine learning framework using a large, unbalanced data set of SNP-array and imputed genotyping data from a pharmacogenomics study in lung cancer patients subjected to first-line platinum-based treatment. This approach found genome signals without genome-wide significance in the univariate regression approach (GWAS Catalog) that are valuable for classifying patients, only a few of them with related biological function. The effects of these variants can be explained by the recently proposed omnigenic model hypothesis, which states that complex traits can be influenced not only by the “core genes”, mainly found through genome-wide significant SNPs, but also by the rest of the genes outside the “core pathways”, with apparently unrelated biological functionality. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (10.1186/s12918-018-0615-5) contains supplementary material, which is available to authorized users. |
format | Online Article Text |
id | pubmed-6245589 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2018 |
publisher | BioMed Central |
record_format | MEDLINE/PubMed |
spelling | pubmed-6245589 2018-11-26 Pipeline design to identify key features and classify the chemotherapy response on lung cancer patients using large-scale genetic data. Valdés, María Gabriela; Galván-Femenía, Iván; Ripoll, Vicent Ribas; Duran, Xavier; Yokota, Jun; Gavaldà, Ricard; Rafael-Palou, Xavier; de Cid, Rafael. BMC Syst Biol. Research. BioMed Central 2018-11-20 /pmc/articles/PMC6245589/ /pubmed/30458782 http://dx.doi.org/10.1186/s12918-018-0615-5 Text en © The Author(s) 2018 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated. |
title | Pipeline design to identify key features and classify the chemotherapy response on lung cancer patients using large-scale genetic data |
topic | Research |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6245589/ https://www.ncbi.nlm.nih.gov/pubmed/30458782 http://dx.doi.org/10.1186/s12918-018-0615-5 |