BigFiRSt: A Software Program Using Big Data Technique for Mining Simple Sequence Repeats From Large-Scale Sequencing Data
BACKGROUND: Simple Sequence Repeats (SSRs) are short tandem repeats of nucleotide sequences. It has been shown that SSRs are associated with human diseases and are of medical relevance. Accordingly, a variety of computational methods have been proposed to mine SSRs from genomes. Conventional methods...
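Main authors: | Chen, Jinxiang; Li, Fuyi; Wang, Miao; Li, Junlong; Marquez-Lago, Tatiana T.; Leier, André; Revote, Jerico; Li, Shuqin; Liu, Quanzhong; Song, Jiangning |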
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Frontiers Media S.A. 2022 |
Subjects: | Big Data |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8805145/ https://www.ncbi.nlm.nih.gov/pubmed/35118375 http://dx.doi.org/10.3389/fdata.2021.727216 |
_version_ | 1784643183610691584 |
---|---|
author | Chen, Jinxiang Li, Fuyi Wang, Miao Li, Junlong Marquez-Lago, Tatiana T. Leier, André Revote, Jerico Li, Shuqin Liu, Quanzhong Song, Jiangning |
author_facet | Chen, Jinxiang Li, Fuyi Wang, Miao Li, Junlong Marquez-Lago, Tatiana T. Leier, André Revote, Jerico Li, Shuqin Liu, Quanzhong Song, Jiangning |
author_sort | Chen, Jinxiang |
collection | PubMed |
description | BACKGROUND: Simple Sequence Repeats (SSRs) are short tandem repeats of nucleotide sequences. It has been shown that SSRs are associated with human diseases and are of medical relevance. Accordingly, a variety of computational methods have been proposed to mine SSRs from genomes. Conventional methods rely on a high-quality complete genome to identify SSRs. However, the sequenced genome often misses several highly repetitive regions. Moreover, many non-model species have no entire genomes. With the recent advances of next-generation sequencing (NGS) techniques, large-scale sequence reads for any species can be rapidly generated using NGS. In this context, a number of methods have been proposed to identify thousands of SSR loci within large amounts of reads for non-model species. While the most commonly used NGS platforms (e.g., Illumina platform) on the market generally provide short paired-end reads, merging overlapping paired-end reads has become a common way prior to the identification of SSR loci. This has posed a big data analysis challenge for traditional stand-alone tools to merge short read pairs and identify SSRs from large-scale data. RESULTS: In this study, we present a new Hadoop-based software program, termed BigFiRSt, to address this problem using cutting-edge big data technology. BigFiRSt consists of two major modules, BigFLASH and BigPERF, implemented based on two state-of-the-art stand-alone tools, FLASH and PERF, respectively. BigFLASH and BigPERF address the problem of merging short read pairs and mining SSRs in the big data manner, respectively. Comprehensive benchmarking experiments show that BigFiRSt can dramatically reduce the execution times of fast read pairs merging and SSRs mining from very large-scale DNA sequence data. CONCLUSIONS: The excellent performance of BigFiRSt mainly resorts to the Big Data Hadoop technology to merge read pairs and mine SSRs in parallel and distributed computing on clusters. We anticipate BigFiRSt will be a valuable tool in the coming biological Big Data era. |
format | Online Article Text |
id | pubmed-8805145 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2022 |
publisher | Frontiers Media S.A. |
record_format | MEDLINE/PubMed |
spelling | pubmed-88051452022-02-02 BigFiRSt: A Software Program Using Big Data Technique for Mining Simple Sequence Repeats From Large-Scale Sequencing Data Chen, Jinxiang Li, Fuyi Wang, Miao Li, Junlong Marquez-Lago, Tatiana T. Leier, André Revote, Jerico Li, Shuqin Liu, Quanzhong Song, Jiangning Front Big Data Big Data BACKGROUND: Simple Sequence Repeats (SSRs) are short tandem repeats of nucleotide sequences. It has been shown that SSRs are associated with human diseases and are of medical relevance. Accordingly, a variety of computational methods have been proposed to mine SSRs from genomes. Conventional methods rely on a high-quality complete genome to identify SSRs. However, the sequenced genome often misses several highly repetitive regions. Moreover, many non-model species have no entire genomes. With the recent advances of next-generation sequencing (NGS) techniques, large-scale sequence reads for any species can be rapidly generated using NGS. In this context, a number of methods have been proposed to identify thousands of SSR loci within large amounts of reads for non-model species. While the most commonly used NGS platforms (e.g., Illumina platform) on the market generally provide short paired-end reads, merging overlapping paired-end reads has become a common way prior to the identification of SSR loci. This has posed a big data analysis challenge for traditional stand-alone tools to merge short read pairs and identify SSRs from large-scale data. RESULTS: In this study, we present a new Hadoop-based software program, termed BigFiRSt, to address this problem using cutting-edge big data technology. BigFiRSt consists of two major modules, BigFLASH and BigPERF, implemented based on two state-of-the-art stand-alone tools, FLASH and PERF, respectively. BigFLASH and BigPERF address the problem of merging short read pairs and mining SSRs in the big data manner, respectively. Comprehensive benchmarking experiments show that BigFiRSt can dramatically reduce the execution times of fast read pairs merging and SSRs mining from very large-scale DNA sequence data. CONCLUSIONS: The excellent performance of BigFiRSt mainly resorts to the Big Data Hadoop technology to merge read pairs and mine SSRs in parallel and distributed computing on clusters. We anticipate BigFiRSt will be a valuable tool in the coming biological Big Data era. Frontiers Media S.A. 2022-01-18 /pmc/articles/PMC8805145/ /pubmed/35118375 http://dx.doi.org/10.3389/fdata.2021.727216 Text en Copyright © 2022 Chen, Li, Wang, Li, Marquez-Lago, Leier, Revote, Li, Liu and Song. https://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms. |
spellingShingle | Big Data Chen, Jinxiang Li, Fuyi Wang, Miao Li, Junlong Marquez-Lago, Tatiana T. Leier, André Revote, Jerico Li, Shuqin Liu, Quanzhong Song, Jiangning BigFiRSt: A Software Program Using Big Data Technique for Mining Simple Sequence Repeats From Large-Scale Sequencing Data |
title | BigFiRSt: A Software Program Using Big Data Technique for Mining Simple Sequence Repeats From Large-Scale Sequencing Data |
title_sort | bigfirst: a software program using big data technique for mining simple sequence repeats from large-scale sequencing data |
topic | Big Data |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8805145/ https://www.ncbi.nlm.nih.gov/pubmed/35118375 http://dx.doi.org/10.3389/fdata.2021.727216 |