Effective selection of informative SNPs and classification on the HapMap genotype data
BACKGROUND: Since the single nucleotide polymorphisms (SNPs) are genetic variations which determine the difference between any two unrelated individuals, the SNPs can be used to identify the correct source population of an individual. For efficient population identification with the HapMap genotype...
Main authors: | Zhou, Nina; Wang, Lipo |
---|---|
Format: | Text |
Language: | English |
Published: | BioMed Central 2007 |
Subjects: | |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2245981/ https://www.ncbi.nlm.nih.gov/pubmed/18093342 http://dx.doi.org/10.1186/1471-2105-8-484 |
_version_ | 1782150699845419008 |
---|---|
author | Zhou, Nina; Wang, Lipo |
author_facet | Zhou, Nina; Wang, Lipo |
author_sort | Zhou, Nina |
collection | PubMed |
description | BACKGROUND: Since single nucleotide polymorphisms (SNPs) are genetic variations that determine the differences between any two unrelated individuals, they can be used to identify the correct source population of an individual. For efficient population identification with the HapMap genotype data, as few informative SNPs as possible should be selected from the original 4 million SNPs. Recently, Park et al. (2006) adopted the nearest shrunken centroid method to classify the three populations, i.e., Utah residents with ancestry from Northern and Western Europe (CEU), Yoruba in Ibadan, Nigeria in West Africa (YRI), and Han Chinese in Beijing together with Japanese in Tokyo (CHB+JPT), from which 100,736 SNPs were obtained and the top 82 SNPs could completely classify the three populations. RESULTS: In this paper, we propose to first rank each feature (SNP) using a ranking measure, i.e., a modified t-test or F-statistics. Then, from the ranking list, we form different feature subsets by sequentially choosing different numbers of top-ranked features (e.g., 1, 2, 3, ..., 100), train and test each subset with a classifier, e.g., the support vector machine (SVM), and thereby find the subset with the highest classification accuracy. Compared to the classification method of Park et al., we obtain a better result, i.e., good classification of the 3 populations using on average 64 SNPs. CONCLUSION: Experimental results show that both the modified t-test and the F-statistics method are very effective in ranking SNPs by their classification capability. Combined with the SVM classifier, a desirable feature subset (with minimal size and maximal informativeness) can be found quickly in a greedy manner after ranking all SNPs. Our method is able to identify a very small number of important SNPs that can determine the populations of individuals. |
format | Text |
id | pubmed-2245981 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2007 |
publisher | BioMed Central |
record_format | MEDLINE/PubMed |
spelling | pubmed-2245981 2008-02-20 Effective selection of informative SNPs and classification on the HapMap genotype data Zhou, Nina; Wang, Lipo BMC Bioinformatics Research Article BioMed Central 2007-12-20 /pmc/articles/PMC2245981/ /pubmed/18093342 http://dx.doi.org/10.1186/1471-2105-8-484 Text en Copyright © 2007 Zhou and Wang; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. |
spellingShingle | Research Article Zhou, Nina Wang, Lipo Effective selection of informative SNPs and classification on the HapMap genotype data |
title | Effective selection of informative SNPs and classification on the HapMap genotype data |
title_full | Effective selection of informative SNPs and classification on the HapMap genotype data |
title_fullStr | Effective selection of informative SNPs and classification on the HapMap genotype data |
title_full_unstemmed | Effective selection of informative SNPs and classification on the HapMap genotype data |
title_short | Effective selection of informative SNPs and classification on the HapMap genotype data |
title_sort | effective selection of informative snps and classification on the hapmap genotype data |
topic | Research Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2245981/ https://www.ncbi.nlm.nih.gov/pubmed/18093342 http://dx.doi.org/10.1186/1471-2105-8-484 |
work_keys_str_mv | AT zhounina effectiveselectionofinformativesnpsandclassificationonthehapmapgenotypedata AT wanglipo effectiveselectionofinformativesnpsandclassificationonthehapmapgenotypedata |
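
The rank-then-select procedure summarized in the description field can be illustrated with a minimal sketch. It is not the authors' implementation: the ranking uses the standard one-way ANOVA F-statistic (the paper's modified t-test and its exact F-statistic variant are not given in this record), a nearest-centroid classifier stands in for the SVM so that the example needs only the Python standard library, and the 0/1/2 genotype coding, the synthetic data, and all function names are assumptions made for illustration.

```python
import random
from statistics import mean

POPULATIONS = ["CEU", "YRI", "CHB+JPT"]  # population labels from the abstract


def f_statistic(values, labels):
    """Standard one-way ANOVA F-statistic of one feature across the classes."""
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(y, []).append(v)
    n, k = len(values), len(groups)
    grand = mean(values)
    means = {y: mean(g) for y, g in groups.items()}
    between = sum(len(g) * (means[y] - grand) ** 2 for y, g in groups.items()) / (k - 1)
    within = sum((v - means[y]) ** 2 for y, g in groups.items() for v in g) / (n - k)
    if within == 0:
        # No within-class spread: perfectly separating or constant feature.
        return float("inf") if between > 0 else 0.0
    return between / within


def rank_features(X, y):
    """Return feature indices sorted by decreasing F-statistic."""
    scores = [f_statistic([row[j] for row in X], y) for j in range(len(X[0]))]
    return sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)


def centroid_accuracy(X_train, y_train, X_test, y_test, features):
    """Test accuracy of a nearest-centroid classifier (stand-in for the SVM)."""
    by_class = {}
    for row, label in zip(X_train, y_train):
        by_class.setdefault(label, []).append([row[j] for j in features])
    centroids = {c: [mean(col) for col in zip(*rows)] for c, rows in by_class.items()}
    correct = 0
    for row, label in zip(X_test, y_test):
        point = [row[j] for j in features]
        predicted = min(
            centroids,
            key=lambda c: sum((p - q) ** 2 for p, q in zip(point, centroids[c])),
        )
        correct += predicted == label
    return correct / len(y_test)


def select_subset(X_train, y_train, X_test, y_test, max_k):
    """Try the top-1, top-2, ..., top-max_k features; keep the smallest best subset."""
    ranking = rank_features(X_train, y_train)
    best_k, best_acc = 0, -1.0
    for k in range(1, max_k + 1):
        acc = centroid_accuracy(X_train, y_train, X_test, y_test, ranking[:k])
        if acc > best_acc:  # strict '>' keeps the smallest subset on ties
            best_k, best_acc = k, acc
    return best_k, best_acc, ranking[:best_k]


def make_samples(rng, per_class, n_noise):
    """Synthetic genotypes coded 0/1/2 (assumed coding): 2 markers plus noise."""
    X, y = [], []
    for i, label in enumerate(POPULATIONS):
        for _ in range(per_class):
            markers = [2 if i == 0 else 0, 2 if i == 1 else 0]
            X.append(markers + [rng.randint(0, 2) for _ in range(n_noise)])
            y.append(label)
    return X, y


rng = random.Random(0)
X_train, y_train = make_samples(rng, per_class=10, n_noise=20)
X_test, y_test = make_samples(rng, per_class=5, n_noise=20)
best_k, best_acc, selected = select_subset(X_train, y_train, X_test, y_test, max_k=10)
print(best_k, best_acc, selected)
```

In this toy data the two constructed marker columns separate the three populations on their own, so the search stops improving once both are included. On real HapMap genotypes the ranking measure, the classifier, and the evaluation protocol (for example cross-validation rather than a single split) would all need to follow the paper itself.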