
START: a system for flexible analysis of hundreds of genomic signal tracks in few lines of SQL-like queries

BACKGROUND: A genomic signal track is a set of genomic intervals associated with values of various types, such as measurements from high-throughput experiments. Analysis of signal tracks requires complex computational methods, which often make the analysts focus too much on the detailed computationa...


Bibliographic Details
Main authors: Zhu, Xinjie, Zhang, Qiang, Ho, Eric Dun, Yu, Ken Hung-On, Liu, Chris, Huang, Tim H., Cheng, Alfred Sze-Lok, Kao, Ben, Lo, Eric, Yip, Kevin Y.
Format: Online Article Text
Language: English
Published: BioMed Central 2017
Subjects: Software
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5610441/
https://www.ncbi.nlm.nih.gov/pubmed/28938868
http://dx.doi.org/10.1186/s12864-017-4071-1
author Zhu, Xinjie
Zhang, Qiang
Ho, Eric Dun
Yu, Ken Hung-On
Liu, Chris
Huang, Tim H.
Cheng, Alfred Sze-Lok
Kao, Ben
Lo, Eric
Yip, Kevin Y.
author_sort Zhu, Xinjie
collection PubMed
description BACKGROUND: A genomic signal track is a set of genomic intervals associated with values of various types, such as measurements from high-throughput experiments. Analysis of signal tracks requires complex computational methods, which often make the analysts focus too much on the detailed computational steps rather than on their biological questions. RESULTS: Here we propose Signal Track Query Language (STQL) for simple analysis of signal tracks. It is a Structured Query Language (SQL)-like declarative language, which means one only specifies what computations need to be done but not how these computations are to be carried out. STQL provides a rich set of constructs for manipulating genomic intervals and their values. To run STQL queries, we have developed the Signal Track Analytical Research Tool (START, http://yiplab.cse.cuhk.edu.hk/start/), a system that includes a Web-based user interface and a back-end execution system. The user interface helps users select data from our database of around 10,000 commonly-used public signal tracks, manage their own tracks, and construct, store and share STQL queries. The back-end system automatically translates STQL queries into optimized low-level programs and runs them on a computer cluster in parallel. We use STQL to perform 14 representative analytical tasks. By repeating these analyses using bedtools, Galaxy and custom Python scripts, we show that the STQL solution is usually the simplest, and the parallel execution achieves significant speed-up with large data files. Finally, we describe how a biologist with minimal formal training in computer programming self-learned STQL to analyze DNA methylation data we produced from 60 pairs of hepatocellular carcinoma (HCC) samples. CONCLUSIONS: Overall, STQL and START provide a generic way for analyzing a large number of genomic signal tracks in parallel easily. 
ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1186/s12864-017-4071-1) contains supplementary material, which is available to authorized users.
format Online
Article
Text
id pubmed-5610441
institution National Center for Biotechnology Information
language English
publishDate 2017
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-5610441 2017-10-10 BMC Genomics Software BioMed Central 2017-09-22 /pmc/articles/PMC5610441/ /pubmed/28938868 http://dx.doi.org/10.1186/s12864-017-4071-1 Text en © The Author(s) 2017 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
title START: a system for flexible analysis of hundreds of genomic signal tracks in few lines of SQL-like queries
topic Software
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5610441/
https://www.ncbi.nlm.nih.gov/pubmed/28938868
http://dx.doi.org/10.1186/s12864-017-4071-1