Viability of in-house datamarting approaches for population genetics analysis of SNP genotypes

Bibliographic Details
Main authors: Amigo, Jorge, Phillips, Christopher, Salas, Antonio, Carracedo, Ángel
Format: Text
Language: English
Published: BioMed Central 2009
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2665053/
https://www.ncbi.nlm.nih.gov/pubmed/19344481
http://dx.doi.org/10.1186/1471-2105-10-S3-S5
Description:
BACKGROUND: Databases containing very large amounts of SNP (Single Nucleotide Polymorphism) data are now freely available for researchers interested in medical and/or population genetics applications. While many of these SNP repositories have implemented data retrieval tools for general-purpose mining, these alone cannot cover the broad spectrum of needs of most medical and population genetics studies.
RESULTS: To address this limitation, we have built in-house customized data marts from the raw data provided by the largest public databases. In particular, for population genetics analysis based on genotypes, we have built a set of data processing scripts that take raw data from the major SNP variation databases (e.g. HapMap, Perlegen), split them into single genotypes, group the genotypes into populations, and merge them with complementary descriptive information extracted from dbSNP. This allows not only in-house standardization and normalization of the genotyping data retrieved from different repositories, but also the calculation of statistical indices, from simple allele frequency estimates to more elaborate genetic differentiation tests within populations, together with the ability to combine population samples from different databases.
CONCLUSION: The present study demonstrates the viability of implementing scripts that handle extensive datasets of SNP genotypes at low computational cost, while dealing with certain complex issues that arise from the divergent nature and configuration of the most popular SNP repositories. The information contained in these databases can also be enriched with additional information obtained from other complementary databases in order to build a dedicated data mart. Updating the data structure is straightforward, and the approach permits easy incorporation of new external data and the computation of supplementary statistical indices of interest.
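The simplest statistical index named in the Results paragraph above, a per-population allele frequency estimate, can be sketched as follows. This is only an illustration, not the authors' scripts: the record layout `(snp_id, population, genotype)`, the identifiers `rs0001`, `POP_A` and `POP_B`, the use of `N` for a missing call, and the function name `allele_frequencies` are all assumptions made for the example.

```python
from collections import Counter, defaultdict


def allele_frequencies(records):
    """Return {(snp_id, population): {allele: frequency}}.

    Each record is a hypothetical (snp_id, population, genotype) tuple,
    where genotype is a two-character string such as "AG". Genotypes that
    are not two characters long or that contain "N" are treated as
    missing and skipped (an assumption, not a repository convention).
    """
    counts = defaultdict(Counter)
    for snp_id, population, genotype in records:
        if len(genotype) != 2 or "N" in genotype:
            continue  # skip missing or malformed calls
        # Counter.update on a string counts each character, i.e. each allele.
        counts[(snp_id, population)].update(genotype)

    freqs = {}
    for key, counter in counts.items():
        total = sum(counter.values())
        freqs[key] = {allele: n / total for allele, n in counter.items()}
    return freqs


# Made-up genotypes, already split into single calls and grouped by population.
records = [
    ("rs0001", "POP_A", "AA"),
    ("rs0001", "POP_A", "AG"),
    ("rs0001", "POP_A", "NN"),
    ("rs0001", "POP_B", "GG"),
    ("rs0001", "POP_B", "AG"),
]
freqs = allele_frequencies(records)
```

Keying the result by `(snp_id, population)` means samples from different source databases can be pooled by mapping them to the same population label before calling the function.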
Journal: BMC Bioinformatics
Topic: Proceedings
Published: BioMed Central, 2009-03-19
Copyright © 2009 Amigo et al; licensee BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.