
BigFiRSt: A Software Program Using Big Data Technique for Mining Simple Sequence Repeats From Large-Scale Sequencing Data

BACKGROUND: Simple Sequence Repeats (SSRs) are short tandem repeats of nucleotide sequences. It has been shown that SSRs are associated with human diseases and are of medical relevance. Accordingly, a variety of computational methods have been proposed to mine SSRs from genomes. Conventional methods...


Bibliographic Details
Main authors:	Chen, Jinxiang, Li, Fuyi, Wang, Miao, Li, Junlong, Marquez-Lago, Tatiana T., Leier, André, Revote, Jerico, Li, Shuqin, Liu, Quanzhong, Song, Jiangning
Format:	Online Article Text
Language:	English
Published:	Frontiers Media S.A. 2022
Subjects:	Big Data
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8805145/ https://www.ncbi.nlm.nih.gov/pubmed/35118375 http://dx.doi.org/10.3389/fdata.2021.727216

_version_	1784643183610691584
author	Chen, Jinxiang Li, Fuyi Wang, Miao Li, Junlong Marquez-Lago, Tatiana T. Leier, André Revote, Jerico Li, Shuqin Liu, Quanzhong Song, Jiangning
author_facet	Chen, Jinxiang Li, Fuyi Wang, Miao Li, Junlong Marquez-Lago, Tatiana T. Leier, André Revote, Jerico Li, Shuqin Liu, Quanzhong Song, Jiangning
author_sort	Chen, Jinxiang
collection	PubMed
description	BACKGROUND: Simple Sequence Repeats (SSRs) are short tandem repeats of nucleotide sequences. It has been shown that SSRs are associated with human diseases and are of medical relevance. Accordingly, a variety of computational methods have been proposed to mine SSRs from genomes. Conventional methods rely on a high-quality complete genome to identify SSRs. However, a sequenced genome often misses several highly repetitive regions. Moreover, many non-model species lack a complete genome. With recent advances in next-generation sequencing (NGS) techniques, large-scale sequence reads can be rapidly generated for any species. In this context, a number of methods have been proposed to identify thousands of SSR loci within large amounts of reads for non-model species. Because the most commonly used NGS platforms on the market (e.g., Illumina) generally provide short paired-end reads, merging overlapping paired-end reads has become a common step prior to the identification of SSR loci. Merging short read pairs and identifying SSRs from large-scale data poses a big data analysis challenge for traditional stand-alone tools. RESULTS: In this study, we present a new Hadoop-based software program, termed BigFiRSt, to address this problem using big data technology. BigFiRSt consists of two major modules, BigFLASH and BigPERF, built on two state-of-the-art stand-alone tools, FLASH and PERF, respectively. BigFLASH merges short read pairs and BigPERF mines SSRs, each in a big data manner. Comprehensive benchmarking experiments show that BigFiRSt can dramatically reduce the execution times of read-pair merging and SSR mining on very large-scale DNA sequence data. CONCLUSIONS: The performance of BigFiRSt stems mainly from its use of Big Data Hadoop technology to merge read pairs and mine SSRs in parallel on distributed computing clusters. We anticipate BigFiRSt will be a valuable tool in the coming biological Big Data era.
format	Online Article Text
id	pubmed-8805145
institution	National Center for Biotechnology Information
language	English
publishDate	2022
publisher	Frontiers Media S.A.
record_format	MEDLINE/PubMed
spelling	pubmed-8805145 2022-02-02 BigFiRSt: A Software Program Using Big Data Technique for Mining Simple Sequence Repeats From Large-Scale Sequencing Data Chen, Jinxiang Li, Fuyi Wang, Miao Li, Junlong Marquez-Lago, Tatiana T. Leier, André Revote, Jerico Li, Shuqin Liu, Quanzhong Song, Jiangning Front Big Data Big Data Frontiers Media S.A. 2022-01-18 /pmc/articles/PMC8805145/ /pubmed/35118375 http://dx.doi.org/10.3389/fdata.2021.727216 Text en Copyright © 2022 Chen, Li, Wang, Li, Marquez-Lago, Leier, Revote, Li, Liu and Song. https://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
title	BigFiRSt: A Software Program Using Big Data Technique for Mining Simple Sequence Repeats From Large-Scale Sequencing Data
topic	Big Data
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8805145/ https://www.ncbi.nlm.nih.gov/pubmed/35118375 http://dx.doi.org/10.3389/fdata.2021.727216
