
Effective selection of informative SNPs and classification on the HapMap genotype data

BACKGROUND: Since the single nucleotide polymorphisms (SNPs) are genetic variations which determine the difference between any two unrelated individuals, the SNPs can be used to identify the correct source population of an individual. For efficient population identification with the HapMap genotype...


Bibliographic Details
Main Authors:	Zhou, Nina, Wang, Lipo
Format:	Text
Language:	English
Published:	BioMed Central 2007
Subjects:	Research Article
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2245981/ https://www.ncbi.nlm.nih.gov/pubmed/18093342 http://dx.doi.org/10.1186/1471-2105-8-484

_version_	1782150699845419008
author	Zhou, Nina Wang, Lipo
author_facet	Zhou, Nina Wang, Lipo
author_sort	Zhou, Nina
collection	PubMed
description	BACKGROUND: Because single nucleotide polymorphisms (SNPs) are genetic variations that determine the difference between any two unrelated individuals, they can be used to identify the correct source population of an individual. For efficient population identification with the HapMap genotype data, as few informative SNPs as possible are required from the original 4 million SNPs. Recently, Park et al. (2006) adopted the nearest shrunken centroid method to classify the three populations, i.e., Utah residents with ancestry from Northern and Western Europe (CEU), Yoruba in Ibadan, Nigeria in West Africa (YRI), and Han Chinese in Beijing together with Japanese in Tokyo (CHB+JPT), from which 100,736 SNPs were obtained, and the top 82 SNPs could completely classify the three populations. RESULTS: In this paper, we propose to first rank each feature (SNP) using a ranking measure, i.e., a modified t-test or F-statistics. Then, from the ranking list, we form different feature subsets by sequentially choosing different numbers of top-ranked features (e.g., 1, 2, 3, ..., 100), train and test each subset with a classifier, e.g., the support vector machine (SVM), and thereby find the subset with the highest classification accuracy. Compared to the classification method of Park et al., we obtain a better result, i.e., good classification of the 3 populations using on average 64 SNPs. CONCLUSION: Experimental results show that both the modified t-test and the F-statistics methods are very effective in ranking SNPs by their classification capability. Combined with the SVM classifier, a desirable feature subset (minimal in size and most informative) can be found quickly in a greedy manner after ranking all SNPs. Our method is able to identify a very small number of important SNPs that can determine the populations of individuals.
format	Text
id	pubmed-2245981
institution	National Center for Biotechnology Information
language	English
publishDate	2007
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-2245981 2008-02-20 Effective selection of informative SNPs and classification on the HapMap genotype data Zhou, Nina Wang, Lipo BMC Bioinformatics Research Article BACKGROUND: Because single nucleotide polymorphisms (SNPs) are genetic variations that determine the difference between any two unrelated individuals, they can be used to identify the correct source population of an individual. For efficient population identification with the HapMap genotype data, as few informative SNPs as possible are required from the original 4 million SNPs. Recently, Park et al. (2006) adopted the nearest shrunken centroid method to classify the three populations, i.e., Utah residents with ancestry from Northern and Western Europe (CEU), Yoruba in Ibadan, Nigeria in West Africa (YRI), and Han Chinese in Beijing together with Japanese in Tokyo (CHB+JPT), from which 100,736 SNPs were obtained, and the top 82 SNPs could completely classify the three populations. RESULTS: In this paper, we propose to first rank each feature (SNP) using a ranking measure, i.e., a modified t-test or F-statistics. Then, from the ranking list, we form different feature subsets by sequentially choosing different numbers of top-ranked features (e.g., 1, 2, 3, ..., 100), train and test each subset with a classifier, e.g., the support vector machine (SVM), and thereby find the subset with the highest classification accuracy. Compared to the classification method of Park et al., we obtain a better result, i.e., good classification of the 3 populations using on average 64 SNPs. CONCLUSION: Experimental results show that both the modified t-test and the F-statistics methods are very effective in ranking SNPs by their classification capability. Combined with the SVM classifier, a desirable feature subset (minimal in size and most informative) can be found quickly in a greedy manner after ranking all SNPs. Our method is able to identify a very small number of important SNPs that can determine the populations of individuals. BioMed Central 2007-12-20 /pmc/articles/PMC2245981/ /pubmed/18093342 http://dx.doi.org/10.1186/1471-2105-8-484 Text en Copyright © 2007 Zhou and Wang; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
spellingShingle	Research Article Zhou, Nina Wang, Lipo Effective selection of informative SNPs and classification on the HapMap genotype data
title	Effective selection of informative SNPs and classification on the HapMap genotype data
title_full	Effective selection of informative SNPs and classification on the HapMap genotype data
title_fullStr	Effective selection of informative SNPs and classification on the HapMap genotype data
title_full_unstemmed	Effective selection of informative SNPs and classification on the HapMap genotype data
title_short	Effective selection of informative SNPs and classification on the HapMap genotype data
title_sort	effective selection of informative snps and classification on the hapmap genotype data
topic	Research Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2245981/ https://www.ncbi.nlm.nih.gov/pubmed/18093342 http://dx.doi.org/10.1186/1471-2105-8-484
work_keys_str_mv	AT zhounina effectiveselectionofinformativesnpsandclassificationonthehapmapgenotypedata AT wanglipo effectiveselectionofinformativesnpsandclassificationonthehapmapgenotypedata
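
The abstract in this record describes a two-step procedure: rank every SNP with a statistic, then evaluate the top 1, 2, 3, ... ranked SNPs with a classifier and keep the subset with the highest accuracy. The following is a minimal Python sketch of that idea, not the authors' implementation. Several things here are assumptions: genotypes are coded 0/1/2, the ranking uses the standard one-way ANOVA F-statistic (the record does not give the paper's exact formula or its modified t-test), a nearest-centroid classifier stands in for the SVM so that the example needs only the standard library, and all function names and the toy data are hypothetical.

```python
import random
from statistics import mean

def f_statistic(values, labels):
    """Standard one-way ANOVA F-statistic of one SNP across populations."""
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(y, []).append(v)
    n, k = len(values), len(groups)
    grand = mean(values)
    means = {c: mean(g) for c, g in groups.items()}
    between = sum(len(g) * (means[c] - grand) ** 2 for c, g in groups.items()) / (k - 1)
    within = sum((v - means[c]) ** 2 for c, g in groups.items() for v in g) / (n - k)
    if within == 0:  # no variation inside any group
        return float("inf") if between > 0 else 0.0
    return between / within

def rank_snps(X, y):
    """Return SNP (column) indices sorted from most to least informative."""
    scores = [f_statistic(col, y) for col in zip(*X)]
    return sorted(range(len(scores)), key=lambda j: -scores[j])

def centroid_accuracy(X_tr, y_tr, X_te, y_te, feats):
    """Nearest-centroid accuracy on the chosen SNPs (a stand-in for the SVM)."""
    by_class = {}
    for row, label in zip(X_tr, y_tr):
        by_class.setdefault(label, []).append([row[j] for j in feats])
    centroids = {c: [mean(col) for col in zip(*rows)] for c, rows in by_class.items()}

    def predict(row):
        point = [row[j] for j in feats]
        return min(centroids,
                   key=lambda c: sum((a - b) ** 2 for a, b in zip(point, centroids[c])))

    hits = sum(predict(row) == label for row, label in zip(X_te, y_te))
    return hits / len(y_te)

def select_subset(X_tr, y_tr, X_te, y_te, max_k):
    """Try the top 1..max_k ranked SNPs; keep the smallest subset with the best accuracy."""
    ranking = rank_snps(X_tr, y_tr)
    best_k, best_acc = 0, -1.0
    for k in range(1, max_k + 1):
        acc = centroid_accuracy(X_tr, y_tr, X_te, y_te, ranking[:k])
        if acc > best_acc:  # strict '>' keeps the smaller subset on ties
            best_k, best_acc = k, acc
    return ranking[:best_k], best_acc

def toy_genotypes(rng, per_pop=20, n_noise=9):
    """Hypothetical data: SNP 0 separates the populations, the rest are noise."""
    X, y = [], []
    for pop, code in (("CEU", 0), ("YRI", 1), ("CHB+JPT", 2)):
        for _ in range(per_pop):
            X.append([code] + [rng.randint(0, 2) for _ in range(n_noise)])
            y.append(pop)
    return X, y

rng = random.Random(0)
X_train, y_train = toy_genotypes(rng)
X_test, y_test = toy_genotypes(rng)
selected, accuracy = select_subset(X_train, y_train, X_test, y_test, max_k=5)
print(selected, accuracy)
```

Ranking is computed once on the training data, so the search over subset sizes costs only one classifier fit per candidate size rather than a search over all SNP combinations. Swapping in an actual SVM would only require replacing `centroid_accuracy`.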


Similar items