
Breast cancer prediction using genome wide single nucleotide polymorphism data

BACKGROUND: This paper introduces and applies a genome wide predictive study to learn a model that predicts whether a new subject will develop breast cancer or not, based on her SNP profile. RESULTS: We first genotyped 696 female subjects (348 breast cancer cases and 348 apparently healthy controls)...


Bibliographic Details
Main Authors: Hajiloo, Mohsen; Damavandi, Babak; HooshSadat, Metanat; Sangi, Farzad; Mackey, John R; Cass, Carol E; Greiner, Russell; Damaraju, Sambasivarao
Format: Online Article Text
Language: English
Published: BioMed Central 2013
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3891310/
https://www.ncbi.nlm.nih.gov/pubmed/24266904
http://dx.doi.org/10.1186/1471-2105-14-S13-S3
_version_ 1782299364314578944
author Hajiloo, Mohsen
Damavandi, Babak
HooshSadat, Metanat
Sangi, Farzad
Mackey, John R
Cass, Carol E
Greiner, Russell
Damaraju, Sambasivarao
author_sort Hajiloo, Mohsen
collection PubMed
description BACKGROUND: This paper introduces and applies a genome wide predictive study to learn a model that predicts whether a new subject will develop breast cancer, based on her SNP profile. RESULTS: We first genotyped 696 female subjects (348 breast cancer cases and 348 apparently healthy controls), predominantly of Caucasian origin from Alberta, Canada, using Affymetrix Human SNP 6.0 arrays. Then, we applied the EIGENSTRAT population stratification correction method to remove 73 subjects not belonging to the Caucasian population. Next, we filtered out any SNP that had any missing calls, whose genotype frequencies deviated from Hardy-Weinberg equilibrium, or whose minor allele frequency was less than 5%. Finally, we applied a combination of the MeanDiff feature selection method and the KNN learning method to this filtered dataset to produce a breast cancer prediction model. The leave-one-out cross-validation (LOOCV) accuracy of this classifier is 59.55%. Random permutation tests show that this result is significantly better than the baseline accuracy of 51.52%. Sensitivity analysis shows that the classifier is fairly robust to the number of MeanDiff-selected SNPs. External validation on the CGEMS breast cancer dataset, the only other publicly available breast cancer dataset, shows that this combination of MeanDiff and KNN leads to a LOOCV accuracy of 60.25%, which is significantly better than its baseline of 50.06%. We then considered a dozen different combinations of feature selection and learning methods, but found that none of these combinations produces a better predictive model than our model.
We also considered various biological feature selection methods, such as selecting SNPs reported in recent genome wide association studies to be associated with breast cancer, selecting SNPs in genes associated with KEGG cancer pathways, or selecting SNPs associated with breast cancer in the F-SNP database, to produce predictive models, but again found that none of these models achieved accuracy better than baseline. CONCLUSIONS: We anticipate producing more accurate breast cancer prediction models by recruiting more study subjects, providing more accurate labelling of phenotypes (to accommodate the heterogeneity of breast cancer), measuring other genomic alterations such as point mutations and copy number variations, and incorporating non-genetic information about subjects such as environmental and lifestyle factors.
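The MeanDiff plus KNN step with LOOCV described in the abstract can be illustrated roughly as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: genotypes are assumed to be coded as minor-allele counts (0, 1, 2), MeanDiff is assumed to rank SNPs by the absolute difference between the case mean and the control mean, the distance is assumed to be L1, and the function names (`mean_diff_select`, `knn_predict`, `loocv_accuracy`) and the synthetic data are hypothetical. Feature selection is repeated inside each fold so the held-out subject never influences which SNPs are chosen.

```python
import numpy as np

def mean_diff_select(X, y, k):
    """Return indices of the k SNPs with the largest |mean(cases) - mean(controls)|.
    Assumed reading of 'MeanDiff'; the record does not define it."""
    diff = np.abs(X[y == 1].mean(axis=0) - X[y == 0].mean(axis=0))
    return np.argsort(diff)[::-1][:k]

def knn_predict(X_train, y_train, x, n_neighbors):
    """Majority vote among the n_neighbors closest training subjects (L1 distance)."""
    dist = np.abs(X_train - x).sum(axis=1)
    nearest = np.argsort(dist)[:n_neighbors]
    return int(y_train[nearest].sum() * 2 > n_neighbors)

def loocv_accuracy(X, y, n_snps, n_neighbors):
    """Leave-one-out accuracy, with SNP selection redone on each training fold."""
    n = len(y)
    correct = 0
    for i in range(n):
        mask = np.arange(n) != i          # hold out subject i
        X_tr, y_tr = X[mask], y[mask]
        cols = mean_diff_select(X_tr, y_tr, n_snps)
        pred = knn_predict(X_tr[:, cols], y_tr, X[i, cols], n_neighbors)
        correct += int(pred == y[i])
    return correct / n

# Synthetic toy data (NOT the study data): 60 subjects, 200 SNPs,
# with an artificial, perfectly separating signal planted in the first 5 SNPs.
rng = np.random.default_rng(0)
y = np.array([0, 1] * 30)                 # 0 = control, 1 = case
X = rng.integers(0, 3, size=(60, 200)).astype(float)
X[y == 1, :5] = 2.0
X[y == 0, :5] = 0.0

print(loocv_accuracy(X, y, n_snps=5, n_neighbors=3))
```

The planted signal makes the toy problem trivially easy; on real SNP data the accuracies reported in the abstract (about 60%) show how much weaker the signal is.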
format Online
Article
Text
id pubmed-3891310
institution National Center for Biotechnology Information
language English
publishDate 2013
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-3891310 2014-01-15 Breast cancer prediction using genome wide single nucleotide polymorphism data BMC Bioinformatics Research BioMed Central 2013-10-01 /pmc/articles/PMC3891310/ /pubmed/24266904 http://dx.doi.org/10.1186/1471-2105-14-S13-S3 Text en Copyright © 2013 Hajiloo et al; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
title Breast cancer prediction using genome wide single nucleotide polymorphism data
topic Research
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3891310/
https://www.ncbi.nlm.nih.gov/pubmed/24266904
http://dx.doi.org/10.1186/1471-2105-14-S13-S3