Pipeline design to identify key features and classify the chemotherapy response on lung cancer patients using large-scale genetic data

BACKGROUND: During the last decade, interest in applying machine learning algorithms to genomic data has increased in many bioinformatics applications. Analyzing this type of data entails difficulties in managing high-dimensional data, handling class imbalance for knowledge extraction, identifying importan...

Bibliographic Details
Main authors:	Valdés, María Gabriela; Galván-Femenía, Iván; Ripoll, Vicent Ribas; Duran, Xavier; Yokota, Jun; Gavaldà, Ricard; Rafael-Palou, Xavier; de Cid, Rafael
Format:	Online Article Text
Language:	English
Published:	BioMed Central 2018
Subjects:	Research
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6245589/ https://www.ncbi.nlm.nih.gov/pubmed/30458782 http://dx.doi.org/10.1186/s12918-018-0615-5

author	Valdés, María Gabriela; Galván-Femenía, Iván; Ripoll, Vicent Ribas; Duran, Xavier; Yokota, Jun; Gavaldà, Ricard; Rafael-Palou, Xavier; de Cid, Rafael
author_sort	Valdés, María Gabriela
collection	PubMed
description	BACKGROUND: During the last decade, interest in applying machine learning algorithms to genomic data has increased in many bioinformatics applications. Analyzing this type of data entails difficulties in managing high-dimensional data, handling class imbalance for knowledge extraction, identifying important features and classifying individuals. In this study, we propose a general framework to tackle these challenges with different machine learning algorithms and techniques. We apply the configuration of this framework to lung cancer patients, identifying genetic signatures for classifying response to drug treatment. We intersect these relevant SNPs with the GWAS Catalog of the National Human Genome Research Institute and explore the RegulomeDB and GTEx databases for functional analysis purposes. RESULTS: The machine learning based solution proposed in this study is a scalable and flexible alternative to the classical univariate regression approach for analyzing large-scale data. From 36 experiments executed using the machine learning framework design, we obtain good classification performance from the top 5 models with the highest cross-validation score and the smallest standard deviation. One thousand two hundred twenty-four SNPs corresponding to the key features from the top 20 models (cross-validation F1 mean >= 0.65) were compared with the GWAS Catalog, finding no intersection with reported genome-wide significant hits. Among these, new genetic signatures in MAE, CEP104, PRKCZ and ADRB2 show relevant biological regulatory functionality related to lung physiology. CONCLUSIONS: We have defined a machine learning framework using a large, unbalanced data set of SNP arrays and imputed genotyping data from a pharmacogenomics study in lung cancer patients subjected to first-line platinum-based treatment. This approach found genome signals with no genome-wide significance in the univariate regression approach (GWAS Catalog) that are valuable for classifying patients, only a few of them with related biological function. The effects of these variants can be explained by the recently proposed omnigenic model hypothesis, which states that complex traits can be influenced not only by the “core genes”, mainly found through genome-wide significant SNPs, but also by the remaining genes outside the “core pathways”, with apparently unrelated biological functionality. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (10.1186/s12918-018-0615-5) contains supplementary material, which is available to authorized users.
format	Online Article Text
id	pubmed-6245589
institution	National Center for Biotechnology Information
language	English
publishDate	2018
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-6245589 2018-11-26 BMC Syst Biol Research BioMed Central 2018-11-20 /pmc/articles/PMC6245589/ /pubmed/30458782 http://dx.doi.org/10.1186/s12918-018-0615-5 Text en © The Author(s) 2018 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
title	Pipeline design to identify key features and classify the chemotherapy response on lung cancer patients using large-scale genetic data
topic	Research
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6245589/ https://www.ncbi.nlm.nih.gov/pubmed/30458782 http://dx.doi.org/10.1186/s12918-018-0615-5
