
Effective selection of informative SNPs and classification on the HapMap genotype data

BACKGROUND: Since single nucleotide polymorphisms (SNPs) are genetic variations that determine the differences between any two unrelated individuals, SNPs can be used to identify the correct source population of an individual. For efficient population identification with the HapMap genotype...


Bibliographic Details
Main Authors: Zhou, Nina; Wang, Lipo
Format: Text
Language: English
Published: BioMed Central 2007
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2245981/
https://www.ncbi.nlm.nih.gov/pubmed/18093342
http://dx.doi.org/10.1186/1471-2105-8-484
_version_ 1782150699845419008
author Zhou, Nina
Wang, Lipo
author_facet Zhou, Nina
Wang, Lipo
author_sort Zhou, Nina
collection PubMed
description BACKGROUND: Since single nucleotide polymorphisms (SNPs) are genetic variations that determine the differences between any two unrelated individuals, SNPs can be used to identify the correct source population of an individual. For efficient population identification with the HapMap genotype data, as few informative SNPs as possible should be selected from the original 4 million SNPs. Recently, Park et al. (2006) adopted the nearest shrunken centroid method to classify the three populations, i.e., Utah residents with ancestry from Northern and Western Europe (CEU), Yoruba in Ibadan, Nigeria in West Africa (YRI), and Han Chinese in Beijing together with Japanese in Tokyo (CHB+JPT), from which 100,736 SNPs were obtained and the top 82 SNPs could completely classify the three populations. RESULTS: In this paper, we propose to first rank each feature (SNP) using a ranking measure, i.e., a modified t-test or F-statistics. Then, from the ranking list, we form different feature subsets by sequentially choosing different numbers of top-ranked features (e.g., 1, 2, 3, ..., 100), train and test each subset with a classifier, e.g., the support vector machine (SVM), and thereby find the subset with the highest classification accuracy. Compared to the classification method of Park et al., we obtain a better result, i.e., good classification of the 3 populations using on average 64 SNPs. CONCLUSION: Experimental results show that both the modified t-test and the F-statistics methods are very effective in ranking SNPs by their classification capability. Combined with the SVM classifier, a desirable feature subset (of minimal size and maximal informativeness) can be found quickly in a greedy manner after ranking all SNPs. Our method is able to identify a very small number of important SNPs that can determine the populations of individuals.
format Text
id pubmed-2245981
institution National Center for Biotechnology Information
language English
publishDate 2007
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-2245981 2008-02-20 Effective selection of informative SNPs and classification on the HapMap genotype data Zhou, Nina Wang, Lipo BMC Bioinformatics Research Article BioMed Central 2007-12-20 /pmc/articles/PMC2245981/ /pubmed/18093342 http://dx.doi.org/10.1186/1471-2105-8-484 Text en Copyright © 2007 Zhou and Wang; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
spellingShingle Research Article
Zhou, Nina
Wang, Lipo
Effective selection of informative SNPs and classification on the HapMap genotype data
title Effective selection of informative SNPs and classification on the HapMap genotype data
title_full Effective selection of informative SNPs and classification on the HapMap genotype data
title_fullStr Effective selection of informative SNPs and classification on the HapMap genotype data
title_full_unstemmed Effective selection of informative SNPs and classification on the HapMap genotype data
title_short Effective selection of informative SNPs and classification on the HapMap genotype data
title_sort effective selection of informative snps and classification on the hapmap genotype data
topic Research Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2245981/
https://www.ncbi.nlm.nih.gov/pubmed/18093342
http://dx.doi.org/10.1186/1471-2105-8-484
work_keys_str_mv AT zhounina effectiveselectionofinformativesnpsandclassificationonthehapmapgenotypedata
AT wanglipo effectiveselectionofinformativesnpsandclassificationonthehapmapgenotypedata
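
The rank-then-search procedure summarized in the abstract (score every SNP, then evaluate the top-1, top-2, ... subsets with a classifier and keep the best one) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and not the authors' code: the function names are hypothetical, the score is the plain one-way F-statistic rather than the paper's modified variant, a nearest-centroid classifier stands in for the SVM so that the example needs only the Python standard library, and the genotype matrix is small synthetic data, not HapMap data.

```python
from statistics import mean


def f_statistic(values, labels):
    """Plain one-way ANOVA F-statistic of a single feature across classes."""
    classes = sorted(set(labels))
    n, k = len(values), len(classes)
    grand = mean(values)
    between = within = 0.0
    for c in classes:
        group = [v for v, lab in zip(values, labels) if lab == c]
        m = mean(group)
        between += len(group) * (m - grand) ** 2
        within += sum((v - m) ** 2 for v in group)
    # The small constant avoids division by zero for perfectly separating features.
    return (between / (k - 1)) / (within / (n - k) + 1e-12)


def rank_features(X, y):
    """Return feature indices sorted by decreasing F-statistic."""
    scores = [f_statistic([row[j] for row in X], y) for j in range(len(X[0]))]
    return sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)


def fit_centroids(X, y, feats):
    """Per-class mean of the selected features (stand-in for training an SVM)."""
    centroids = {}
    for c in sorted(set(y)):
        rows = [r for r, lab in zip(X, y) if lab == c]
        centroids[c] = [mean([r[j] for r in rows]) for j in feats]
    return centroids


def accuracy(X_train, y_train, X_test, y_test, feats):
    """Fraction of test samples assigned to the class with the nearest centroid."""
    centroids = fit_centroids(X_train, y_train, feats)

    def dist(x, c):
        return sum((x[j] - m) ** 2 for j, m in zip(feats, centroids[c]))

    hits = sum(
        min(sorted(centroids), key=lambda c: dist(x, c)) == target
        for x, target in zip(X_test, y_test)
    )
    return hits / len(y_test)


def select_subset(X_train, y_train, X_test, y_test, max_size):
    """Evaluate the top-1, top-2, ... ranked features; keep the best subset."""
    ranking = rank_features(X_train, y_train)
    best_feats, best_acc = [], -1.0
    for size in range(1, max_size + 1):
        feats = ranking[:size]
        acc = accuracy(X_train, y_train, X_test, y_test, feats)
        if acc > best_acc:  # strict ">" keeps the smallest subset among ties
            best_feats, best_acc = feats, acc
    return best_feats, best_acc


def make_toy_data():
    """Synthetic genotypes coded 0/1/2: columns 0 and 1 are informative, 2-4 are not."""
    X_train, y_train, X_test, y_test = [], [], [], []
    for p, pop in enumerate(["CEU", "YRI", "CHB+JPT"]):
        for i in range(8):
            row = [2 if p == 0 else 0, 2 if p == 1 else 0,
                   i % 3, (i // 2) % 3, (2 * i + 1) % 3]
            if i % 2 == 0:
                X_train.append(row)
                y_train.append(pop)
            else:
                X_test.append(row)
                y_test.append(pop)
    return X_train, y_train, X_test, y_test


if __name__ == "__main__":
    data = make_toy_data()
    feats, acc = select_subset(*data, max_size=5)
    print("selected features:", feats, "accuracy:", acc)
```

On this toy data the two informative columns are enough to separate the three populations, and the strict comparison in `select_subset` prevents the uninformative columns from being added once accuracy stops improving. A real analysis would replace the centroid classifier with an SVM and estimate accuracy by cross-validation rather than a single split.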