BigFiRSt: A Software Program Using Big Data Technique for Mining Simple Sequence Repeats From Large-Scale Sequencing Data

BACKGROUND: Simple Sequence Repeats (SSRs) are short tandem repeats of nucleotide sequences. It has been shown that SSRs are associated with human diseases and are of medical relevance. Accordingly, a variety of computational methods have been proposed to mine SSRs from genomes. Conventional methods...

Bibliographic Details
Main authors: Chen, Jinxiang, Li, Fuyi, Wang, Miao, Li, Junlong, Marquez-Lago, Tatiana T., Leier, André, Revote, Jerico, Li, Shuqin, Liu, Quanzhong, Song, Jiangning
Format: Online Article Text
Language: English
Published: Frontiers Media S.A. 2022
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8805145/
https://www.ncbi.nlm.nih.gov/pubmed/35118375
http://dx.doi.org/10.3389/fdata.2021.727216
author Chen, Jinxiang
Li, Fuyi
Wang, Miao
Li, Junlong
Marquez-Lago, Tatiana T.
Leier, André
Revote, Jerico
Li, Shuqin
Liu, Quanzhong
Song, Jiangning
collection PubMed
description BACKGROUND: Simple Sequence Repeats (SSRs) are short tandem repeats of nucleotide sequences. SSRs have been shown to be associated with human diseases and are of medical relevance. Accordingly, a variety of computational methods have been proposed to mine SSRs from genomes. Conventional methods rely on a high-quality complete genome to identify SSRs. However, sequenced genomes often miss highly repetitive regions, and many non-model species have no complete genome at all. With recent advances in next-generation sequencing (NGS), large-scale sequence reads can be generated rapidly for any species. In this context, a number of methods have been proposed to identify thousands of SSR loci within large sets of reads from non-model species. Because the most commonly used NGS platforms (e.g., Illumina) generally produce short paired-end reads, merging overlapping paired-end reads has become a common step before identifying SSR loci. Merging short read pairs and identifying SSRs in large-scale data poses a big data analysis challenge for traditional stand-alone tools. RESULTS: In this study, we present a new Hadoop-based software program, termed BigFiRSt, to address this problem using big data technology. BigFiRSt consists of two major modules, BigFLASH and BigPERF, built on two state-of-the-art stand-alone tools, FLASH and PERF, respectively. BigFLASH merges short read pairs and BigPERF mines SSRs, each in a distributed, big data manner. Comprehensive benchmarking experiments show that BigFiRSt can dramatically reduce the execution time of read-pair merging and SSR mining on very large-scale DNA sequence data. CONCLUSIONS: The performance of BigFiRSt is mainly attributable to the use of Hadoop to merge read pairs and mine SSRs in parallel on distributed clusters. We anticipate that BigFiRSt will be a valuable tool in the coming biological Big Data era.
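The two steps named in the description (merging an overlapping read pair, then mining SSRs from the merged sequence) can be illustrated with a minimal Python sketch. This is not the FLASH or PERF algorithm and not BigFiRSt code: the function names, the exact-match overlap rule, and the thresholds (min_overlap, max_motif, min_length) are all assumptions chosen for illustration. FLASH, for instance, tolerates mismatches in the overlap, which this sketch does not.

```python
# Illustrative sketch only: a simplified stand-in for "merge read pair, then
# mine SSRs". Names and thresholds are hypothetical, not taken from BigFiRSt.

_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def merge_pair(read1, read2, min_overlap=10):
    """Merge a paired-end read pair on the longest exact overlap.

    read2 is assumed to come from the opposite strand, so it is
    reverse-complemented first. Returns None if no overlap is found.
    """
    r1 = read1.upper()
    r2 = reverse_complement(read2)
    for olap in range(min(len(r1), len(r2)), min_overlap - 1, -1):
        if r1.endswith(r2[:olap]):
            return r1 + r2[olap:]
    return None


def _is_primitive(motif):
    """True if motif is not itself a repetition of a shorter unit."""
    return motif not in (motif + motif)[1:-1]


def find_ssrs(seq, max_motif=6, min_length=12):
    """Find perfect tandem repeats with motif sizes 1..max_motif.

    Returns a list of (start, end, motif, copies) with end exclusive.
    Only whole motif copies are counted.
    """
    seq = seq.upper()
    hits = []
    for k in range(1, max_motif + 1):
        i = 0
        while i + k <= len(seq):
            motif = seq[i:i + k]
            if set(motif) <= set("ACGT") and _is_primitive(motif):
                j = i + k
                while seq[j:j + k] == motif:
                    j += k
                if j - i >= min_length:
                    hits.append((i, j, motif, (j - i) // k))
                    i = j
                    continue
            i += 1
    return hits


if __name__ == "__main__":
    fragment = "GGC" + "AT" * 8 + "CCGTTAGC"
    read1 = fragment[:18]
    read2 = reverse_complement(fragment[8:])
    merged = merge_pair(read1, read2)
    print(merged, find_ssrs(merged))
```

Each read pair is handled independently of every other pair, which is the property that lets this kind of work be split across the nodes of a cluster, for example as the map step of a Hadoop job.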
format Online
Article
Text
id pubmed-8805145
institution National Center for Biotechnology Information
language English
publishDate 2022
publisher Frontiers Media S.A.
record_format MEDLINE/PubMed
spelling pubmed-8805145 2022-02-02 Front Big Data Big Data Frontiers Media S.A. 2022-01-18 /pmc/articles/PMC8805145/ /pubmed/35118375 http://dx.doi.org/10.3389/fdata.2021.727216 Text en Copyright © 2022 Chen, Li, Wang, Li, Marquez-Lago, Leier, Revote, Li, Liu and Song. https://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
title BigFiRSt: A Software Program Using Big Data Technique for Mining Simple Sequence Repeats From Large-Scale Sequencing Data
topic Big Data