GEMINI: a computationally-efficient search engine for large gene expression datasets

BACKGROUND: Low-cost DNA sequencing allows organizations to accumulate massive amounts of genomic data and use that data to answer a diverse range of research questions. Presently, users must search for relevant genomic data using a keyword, accession number or meta-data tag. However, in this search...

Bibliographic Details
Main Authors:	DeFreitas, Timothy; Saddiki, Hachem; Flaherty, Patrick
Format:	Online Article Text
Language:	English
Published:	BioMed Central 2016
Subjects:	Software
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4765211/ https://www.ncbi.nlm.nih.gov/pubmed/26911289 http://dx.doi.org/10.1186/s12859-016-0934-8

_version_	1782417519956459520
author	DeFreitas, Timothy Saddiki, Hachem Flaherty, Patrick
author_facet	DeFreitas, Timothy Saddiki, Hachem Flaherty, Patrick
author_sort	DeFreitas, Timothy
collection	PubMed
description	BACKGROUND: Low-cost DNA sequencing allows organizations to accumulate massive amounts of genomic data and use that data to answer a diverse range of research questions. Presently, users must search for relevant genomic data using a keyword, accession number or meta-data tag. However, in this search paradigm the form of the query – a text-based string – is mismatched with the form of the target – a genomic profile. RESULTS: To improve access to massive genomic data resources, we have developed a fast search engine, GEMINI, that uses a genomic profile as a query to search for similar genomic profiles. GEMINI implements a nearest-neighbor search algorithm using a vantage-point tree to store a database of n profiles and in certain circumstances achieves an [Formula: see text] expected query time in the limit. We tested GEMINI on breast and ovarian cancer gene expression data from The Cancer Genome Atlas project and show that it achieves a query time that scales as the logarithm of the number of records in practice on genomic data. In a database with 10^5 samples, GEMINI identifies the nearest neighbor in 0.05 sec compared to a brute force search time of 0.6 sec. CONCLUSIONS: GEMINI is a fast search engine that uses a query genomic profile to search for similar profiles in a very large genomic database. It enables users to identify similar profiles independent of sample label, data origin or other meta-data information. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1186/s12859-016-0934-8) contains supplementary material, which is available to authorized users.
format	Online Article Text
id	pubmed-4765211
institution	National Center for Biotechnology Information
language	English
publishDate	2016
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-4765211 2016-02-25 GEMINI: a computationally-efficient search engine for large gene expression datasets DeFreitas, Timothy Saddiki, Hachem Flaherty, Patrick BMC Bioinformatics Software BACKGROUND: Low-cost DNA sequencing allows organizations to accumulate massive amounts of genomic data and use that data to answer a diverse range of research questions. Presently, users must search for relevant genomic data using a keyword, accession number or meta-data tag. However, in this search paradigm the form of the query – a text-based string – is mismatched with the form of the target – a genomic profile. RESULTS: To improve access to massive genomic data resources, we have developed a fast search engine, GEMINI, that uses a genomic profile as a query to search for similar genomic profiles. GEMINI implements a nearest-neighbor search algorithm using a vantage-point tree to store a database of n profiles and in certain circumstances achieves an [Formula: see text] expected query time in the limit. We tested GEMINI on breast and ovarian cancer gene expression data from The Cancer Genome Atlas project and show that it achieves a query time that scales as the logarithm of the number of records in practice on genomic data. In a database with 10^5 samples, GEMINI identifies the nearest neighbor in 0.05 sec compared to a brute force search time of 0.6 sec. CONCLUSIONS: GEMINI is a fast search engine that uses a query genomic profile to search for similar profiles in a very large genomic database. It enables users to identify similar profiles independent of sample label, data origin or other meta-data information. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1186/s12859-016-0934-8) contains supplementary material, which is available to authorized users. BioMed Central 2016-02-24 /pmc/articles/PMC4765211/ /pubmed/26911289 http://dx.doi.org/10.1186/s12859-016-0934-8 Text en © DeFreitas et al. 2016 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
spellingShingle	Software DeFreitas, Timothy Saddiki, Hachem Flaherty, Patrick GEMINI: a computationally-efficient search engine for large gene expression datasets
title	GEMINI: a computationally-efficient search engine for large gene expression datasets
title_full	GEMINI: a computationally-efficient search engine for large gene expression datasets
title_fullStr	GEMINI: a computationally-efficient search engine for large gene expression datasets
title_full_unstemmed	GEMINI: a computationally-efficient search engine for large gene expression datasets
title_short	GEMINI: a computationally-efficient search engine for large gene expression datasets
title_sort	gemini: a computationally-efficient search engine for large gene expression datasets
topic	Software
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4765211/ https://www.ncbi.nlm.nih.gov/pubmed/26911289 http://dx.doi.org/10.1186/s12859-016-0934-8
work_keys_str_mv	AT defreitastimothy geminiacomputationallyefficientsearchengineforlargegeneexpressiondatasets AT saddikihachem geminiacomputationallyefficientsearchengineforlargegeneexpressiondatasets AT flahertypatrick geminiacomputationallyefficientsearchengineforlargegeneexpressiondatasets
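
Illustrative note: the abstract in this record describes nearest-neighbor search over expression profiles stored in a vantage-point tree. The sketch below shows the general technique only; it is not GEMINI's code. The names `VPNode`, `build_vp_tree` and `nearest`, the random choice of vantage point, the median split, and the use of Euclidean distance are all assumptions made for illustration. Any distance that satisfies the triangle inequality could be substituted.

```python
import math
import random


def euclidean(a, b):
    """Euclidean distance between two equal-length profiles (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


class VPNode:
    """One tree node: a vantage point, a split radius, and two subtrees."""

    def __init__(self, point, radius, inside, outside):
        self.point = point      # the vantage point stored at this node
        self.radius = radius    # median distance from the vantage point
        self.inside = inside    # subtree of points with distance <= radius
        self.outside = outside  # subtree of points with distance >= radius


def build_vp_tree(points, dist=euclidean, rng=None):
    """Recursively build a vantage-point tree from a list of profiles."""
    rng = rng or random.Random(0)
    points = list(points)
    if not points:
        return None
    # Pick a random vantage point and remove it from the working set.
    i = rng.randrange(len(points))
    points[i], points[-1] = points[-1], points[i]
    vp = points.pop()
    if not points:
        return VPNode(vp, 0.0, None, None)
    # Sort the rest by distance to the vantage point and split at the median.
    ranked = sorted(((dist(vp, p), p) for p in points), key=lambda t: t[0])
    mid = len(ranked) // 2
    radius = ranked[mid][0]
    inside = [p for _, p in ranked[:mid]]
    outside = [p for _, p in ranked[mid:]]
    return VPNode(vp, radius,
                  build_vp_tree(inside, dist, rng),
                  build_vp_tree(outside, dist, rng))


def nearest(node, query, dist=euclidean, best=None):
    """Return (distance, point) of the stored profile closest to `query`."""
    if node is None:
        return best
    d = dist(query, node.point)
    if best is None or d < best[0]:
        best = (d, node.point)
    # Descend first into the side of the split that contains the query.
    if d < node.radius:
        near, far = node.inside, node.outside
    else:
        near, far = node.outside, node.inside
    best = nearest(near, query, dist, best)
    # By the triangle inequality, every point on the far side is at least
    # |d - radius| away from the query, so skip it when that bound is worse.
    if abs(d - node.radius) <= best[0]:
        best = nearest(far, query, dist, best)
    return best


# Toy usage with made-up three-gene "profiles".
profiles = [(0.0, 0.0, 0.0), (1.0, 1.0, 1.0), (5.0, 5.0, 5.0), (9.0, 8.0, 7.0)]
tree = build_vp_tree(profiles)
best_distance, best_profile = nearest(tree, (0.9, 1.2, 1.0))
```

The pruning test in `nearest` is what allows a query to skip whole subtrees; how often it succeeds depends on the data and the metric, which is why the record's claim of logarithmic scaling is stated for certain circumstances and for the tested datasets.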
