MoTeX-II: structured MoTif eXtraction from large-scale datasets

BACKGROUND: Identifying repeated factors that occur in a string of letters or common factors that occur in a set of strings represents an important task in computer science and biology. Such patterns are called motifs, and the process of identifying them is called motif extraction. In biology, motif...

Bibliographic Details
Main Author:	Pissis, Solon P
Format:	Online Article Text
Language:	English
Published:	BioMed Central 2014
Subjects:	Methodology Article
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4227134/ https://www.ncbi.nlm.nih.gov/pubmed/25004797 http://dx.doi.org/10.1186/1471-2105-15-235

_version_	1782343743982010368
author	Pissis, Solon P
collection	PubMed
description	BACKGROUND: Identifying repeated factors that occur in a string of letters or common factors that occur in a set of strings represents an important task in computer science and biology. Such patterns are called motifs, and the process of identifying them is called motif extraction. In biology, motif extraction constitutes a fundamental step in understanding regulation of gene expression. State-of-the-art tools for motif extraction have their own constraints. Most of these tools are only designed for single motif extraction; structured motifs additionally allow for distance intervals between their single motif components. Moreover, motif extraction from large-scale datasets (for instance, large-scale ChIP-Seq datasets) cannot be performed by current tools. Other constraints include high time and/or space complexity for identifying long motifs with higher error thresholds. RESULTS: In this article, we introduce MoTeX-II, a word-based high-performance computing tool for structured MoTif eXtraction from large-scale datasets. Similar to its predecessor for single motif extraction, it uses state-of-the-art algorithms for solving the fixed-length approximate string matching problem. It produces similar and partially identical results to state-of-the-art tools for structured motif extraction with respect to accuracy as quantified by statistical significance measures. Moreover, we show that it matches or outperforms these tools in terms of runtime efficiency by merging single motif occurrences efficiently. MoTeX-II comes in three flavors: a standard CPU version; an OpenMP-based version; and an MPI-based version. For instance, the MPI-based version of MoTeX-II requires only a couple of hours to process all human genes for structured motif extraction on 1056 processors, while current sequential tools require more than a week for this task. Finally, we show that MoTeX-II is successful in extracting known composite transcription factor binding sites from real datasets. CONCLUSIONS: Use of MoTeX-II in biological frameworks may enable deriving reliable and important information since real full-length datasets can now be processed with almost any set of input parameters for both single and structured motif extraction in a reasonable amount of time. The open-source code of MoTeX-II is freely available at http://www.inf.kcl.ac.uk/research/projects/motex/.
format	Online Article Text
id	pubmed-4227134
institution	National Center for Biotechnology Information
language	English
publishDate	2014
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-4227134 2014-11-12 MoTeX-II: structured MoTif eXtraction from large-scale datasets Pissis, Solon P BMC Bioinformatics Methodology Article BioMed Central 2014-07-08 /pmc/articles/PMC4227134/ /pubmed/25004797 http://dx.doi.org/10.1186/1471-2105-15-235 Text en Copyright © 2014 Pissis; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
spellingShingle	Methodology Article Pissis, Solon P MoTeX-II: structured MoTif eXtraction from large-scale datasets
title	MoTeX-II: structured MoTif eXtraction from large-scale datasets
title_sort	motex-ii: structured motif extraction from large-scale datasets
topic	Methodology Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4227134/ https://www.ncbi.nlm.nih.gov/pubmed/25004797 http://dx.doi.org/10.1186/1471-2105-15-235
work_keys_str_mv	AT pississolonp motexiistructuredmotifextractionfromlargescaledatasets