Viability of in-house datamarting approaches for population genetics analysis of SNP genotypes
BACKGROUND: Databases containing very large amounts of SNP (Single Nucleotide Polymorphism) data are now freely available for researchers interested in medical and/or population genetics applications. While many of these SNP repositories have implemented data retrieval tools for general-purpose mini...
Main authors: | Amigo, Jorge; Phillips, Christopher; Salas, Antonio; Carracedo, Ángel |
---|---|
Format: | Text |
Language: | English |
Published: | BioMed Central 2009 |
Subjects: | Proceedings |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2665053/ https://www.ncbi.nlm.nih.gov/pubmed/19344481 http://dx.doi.org/10.1186/1471-2105-10-S3-S5 |
_version_ | 1782166016268173312 |
---|---|
author | Amigo, Jorge Phillips, Christopher Salas, Antonio Carracedo, Ángel |
author_sort | Amigo, Jorge |
collection | PubMed |
description | BACKGROUND: Databases containing very large amounts of SNP (Single Nucleotide Polymorphism) data are now freely available for researchers interested in medical and/or population genetics applications. While many of these SNP repositories have implemented data retrieval tools for general-purpose mining, these alone cannot cover the broad spectrum of needs of most medical and population genetics studies. RESULTS: To address this limitation, we have built in-house customized data marts from the raw data provided by the largest public databases. In particular, for population genetics analysis based on genotypes we have built a set of data processing scripts that deal with raw data coming from the major SNP variation databases (e.g. HapMap, Perlegen), splitting them into single genotypes, grouping these into populations, and then merging them with additional complementary descriptive information extracted from dbSNP. This allows not only in-house standardization and normalization of the genotyping data retrieved from different repositories, but also the calculation of statistical indices, from simple allele frequency estimates to more elaborate genetic differentiation tests within populations, together with the ability to combine population samples from different databases. CONCLUSION: The present study demonstrates the viability of implementing scripts for handling extensive datasets of SNP genotypes with low computational costs, dealing with certain complex issues that arise from the divergent nature and configuration of the most popular SNP repositories. The information contained in these databases can also be enriched with additional information obtained from other complementary databases, in order to build a dedicated data mart. Updating the data structure is straightforward, and the approach permits easy incorporation of new external data and the computation of supplementary statistical indices of interest. |
format | Text |
id | pubmed-2665053 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2009 |
publisher | BioMed Central |
record_format | MEDLINE/PubMed |
spelling | pubmed-2665053 2009-04-04 Viability of in-house datamarting approaches for population genetics analysis of SNP genotypes Amigo, Jorge Phillips, Christopher Salas, Antonio Carracedo, Ángel BMC Bioinformatics Proceedings BioMed Central 2009-03-19 /pmc/articles/PMC2665053/ /pubmed/19344481 http://dx.doi.org/10.1186/1471-2105-10-S3-S5 Text en Copyright © 2009 Amigo et al; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. |
spellingShingle | Proceedings Amigo, Jorge Phillips, Christopher Salas, Antonio Carracedo, Ángel Viability of in-house datamarting approaches for population genetics analysis of SNP genotypes |
title | Viability of in-house datamarting approaches for population genetics analysis of SNP genotypes |
title_sort | viability of in-house datamarting approaches for population genetics analysis of snp genotypes |
topic | Proceedings |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2665053/ https://www.ncbi.nlm.nih.gov/pubmed/19344481 http://dx.doi.org/10.1186/1471-2105-10-S3-S5 |
work_keys_str_mv | AT amigojorge viabilityofinhousedatamartingapproachesforpopulationgeneticsanalysisofsnpgenotypes AT phillipschristopher viabilityofinhousedatamartingapproachesforpopulationgeneticsanalysisofsnpgenotypes AT salasantonio viabilityofinhousedatamartingapproachesforpopulationgeneticsanalysisofsnpgenotypes AT carracedoangel viabilityofinhousedatamartingapproachesforpopulationgeneticsanalysisofsnpgenotypes |
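
The RESULTS part of the abstract describes splitting raw repository data into single genotypes, grouping them into populations, and computing indices such as allele frequencies. The following is only a minimal sketch of that last step, not the authors' scripts: the record layout `(snp_id, population, genotype)`, the SNP and population labels, the genotype values, and the function name `allele_frequencies` are all assumptions made up for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical flattened records: (snp_id, population, genotype).
# All identifiers and genotype values are invented for illustration.
RECORDS = [
    ("snp_1", "POP_A", "AA"),
    ("snp_1", "POP_A", "AG"),
    ("snp_1", "POP_A", "GG"),
    ("snp_1", "POP_B", "AA"),
    ("snp_1", "POP_B", "AA"),
    ("snp_1", "POP_B", "NN"),  # assumed missing-call code, skipped below
]


def allele_frequencies(records, missing="N"):
    """Return {(snp_id, population): {allele: frequency}}."""
    counts = defaultdict(Counter)
    for snp_id, population, genotype in records:
        # Each genotype string contributes one count per allele character.
        for allele in genotype:
            if allele != missing:
                counts[(snp_id, population)][allele] += 1

    freqs = {}
    for key, counter in counts.items():
        total = sum(counter.values())
        freqs[key] = {allele: n / total for allele, n in counter.items()}
    return freqs


FREQS = allele_frequencies(RECORDS)
for (snp_id, population), by_allele in sorted(FREQS.items()):
    print(snp_id, population, by_allele)
```

Keying the counts on `(snp_id, population)` means samples from different source databases could be pooled simply by assigning them the same population label before counting, which is one plausible way to realize the sample combination the abstract mentions.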