ARSDA: A New Approach for Storing, Transmitting and Analyzing Transcriptomic Data
Two major stumbling blocks exist in high-throughput sequencing (HTS) data analysis. The first is the sheer file size, typically in gigabytes when uncompressed, causing problems in storage, transmission, and analysis. However, these files do not need to be so large, and can be reduced without loss of...
Main author: | Xia, Xuhua |
---|---|
Format: | Online Article Text |
Language: | English |
Published: |
Genetics Society of America
2017
|
Subjects: | Software and Data Resources |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5714481/ https://www.ncbi.nlm.nih.gov/pubmed/29079682 http://dx.doi.org/10.1534/g3.117.300271 |
_version_ | 1783283590518276096 |
---|---|
author | Xia, Xuhua |
collection | PubMed |
description | Two major stumbling blocks exist in high-throughput sequencing (HTS) data analysis. The first is the sheer file size, typically in gigabytes when uncompressed, causing problems in storage, transmission, and analysis. However, these files do not need to be so large, and can be reduced without loss of information. Each HTS file, either in compressed .SRA or plain text .fastq format, contains numerous identical reads stored as separate entries. For example, among 44,603,541 forward reads in the SRR4011234.sra file (from a Bacillus subtilis transcriptomic study) deposited at NCBI’s SRA database, one read has 497,027 identical copies. Instead of storing them as separate entries, one can and should store them as a single entry with the SeqID_NumCopy format (which I dub the FASTA+ format). The second is the proper allocation of reads that map equally well to paralogous genes. I illustrate in detail a new method for such allocation. I have developed ARSDA software that implements these new approaches. A number of HTS files for model species are being processed and deposited at http://coevol.rdc.uottawa.ca to demonstrate that this approach not only saves a huge amount of storage space and transmission bandwidth, but also dramatically reduces time in downstream data analysis. Instead of matching the 497,027 identical reads separately against the B. subtilis genome, one needs to match the read only once. ARSDA includes functions to take advantage of HTS data in the new sequence format for downstream data analysis such as gene expression characterization. I contrasted gene expression results between ARSDA and Cufflinks so readers can better appreciate the strength of ARSDA. ARSDA is freely available for Windows, Linux, and Macintosh computers at http://dambe.bio.uottawa.ca/ARSDA/ARSDA.aspx. |
format | Online Article Text |
id | pubmed-5714481 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2017 |
publisher | Genetics Society of America |
record_format | MEDLINE/PubMed |
spelling | pubmed-5714481 2017-12-05 ARSDA: A New Approach for Storing, Transmitting and Analyzing Transcriptomic Data Xia, Xuhua G3 (Bethesda) Software and Data Resources
Genetics Society of America 2017-10-27 /pmc/articles/PMC5714481/ /pubmed/29079682 http://dx.doi.org/10.1534/g3.117.300271 Text en Copyright © 2017 Xia http://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. |
title | ARSDA: A New Approach for Storing, Transmitting and Analyzing Transcriptomic Data |
topic | Software and Data Resources |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5714481/ https://www.ncbi.nlm.nih.gov/pubmed/29079682 http://dx.doi.org/10.1534/g3.117.300271 |
work_keys_str_mv | AT xiaxuhua arsdaanewapproachforstoringtransmittingandanalyzingtranscriptomicdata |
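
The description above proposes storing identical reads once, with the copy count appended to the sequence ID (SeqID_NumCopy). The following is a minimal sketch of that idea, not ARSDA's actual implementation: the function names and the `UniqSeq` ID prefix are hypothetical, and the sketch simply skips the FASTQ quality lines.

```python
from collections import Counter
from typing import Iterable, Iterator, List, Tuple


def read_fastq_sequences(lines: Iterable[str]) -> Iterator[str]:
    """Yield the sequence line (2nd of every 4 lines) of each FASTQ record."""
    for i, line in enumerate(lines):
        if i % 4 == 1:
            yield line.strip()


def collapse_reads(sequences: Iterable[str],
                   prefix: str = "UniqSeq") -> List[Tuple[str, str]]:
    """Collapse identical reads into (SeqID_NumCopy, sequence) pairs.

    The ID prefix is an assumption made for illustration; the source only
    specifies the general SeqID_NumCopy pattern.
    """
    counts = Counter(sequences)
    records = []
    # most_common() lists the most abundant read first.
    for n, (seq, copies) in enumerate(counts.most_common(), start=1):
        records.append((f"{prefix}{n}_{copies}", seq))
    return records


def to_fasta_plus(records: Iterable[Tuple[str, str]]) -> str:
    """Render collapsed records as FASTA text, one entry per unique read."""
    return "".join(f">{seq_id}\n{seq}\n" for seq_id, seq in records)


def copy_number(seq_id: str) -> int:
    """Recover the copy count from the trailing _NumCopy part of an ID."""
    return int(seq_id.rsplit("_", 1)[1])


# Toy input: four FASTQ records, three of which carry the same read.
fastq_text = (
    "@r1\nACGT\n+\nIIII\n"
    "@r2\nTTGA\n+\nIIII\n"
    "@r3\nACGT\n+\nIIII\n"
    "@r4\nACGT\n+\nIIII\n"
)
records = collapse_reads(read_fastq_sequences(fastq_text.splitlines()))
fasta_plus = to_fasta_plus(records)
total_reads = sum(copy_number(seq_id) for seq_id, _ in records)
```

In downstream analysis, each unique sequence would be matched against the genome once and its hit weighted by `copy_number(seq_id)`, which is where the time saving described above comes from.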