Optimization of de novo transcriptome assembly from high-throughput short read sequencing data improves functional annotation for non-model organisms
BACKGROUND: The k-mer hash length is a key factor affecting the output of de novo transcriptome assembly packages using de Bruijn graph algorithms. Assemblies constructed with varying single k-mer choices might result in the loss of unique contiguous sequences (contigs) and relevant biological infor...
Main Authors: | Haznedaroglu, Berat Z; Reeves, Darryl; Rismani-Yazdi, Hamid; Peccia, Jordan |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | BioMed Central 2012 |
Subjects: | |
Online Access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3489510/ https://www.ncbi.nlm.nih.gov/pubmed/22808927 http://dx.doi.org/10.1186/1471-2105-13-170 |
_version_ | 1782248729480265728 |
---|---|
author | Haznedaroglu, Berat Z; Reeves, Darryl; Rismani-Yazdi, Hamid; Peccia, Jordan |
author_facet | Haznedaroglu, Berat Z; Reeves, Darryl; Rismani-Yazdi, Hamid; Peccia, Jordan |
author_sort | Haznedaroglu, Berat Z |
collection | PubMed |
description | BACKGROUND: The k-mer hash length is a key factor affecting the output of de novo transcriptome assembly packages using de Bruijn graph algorithms. Assemblies constructed with varying single k-mer choices might result in the loss of unique contiguous sequences (contigs) and relevant biological information. A common solution to this problem is the clustering of single k-mer assemblies. Even though annotation is one of the primary goals of a transcriptome assembly, the success of assembly strategies does not consider the impact of k-mer selection on the annotation output. This study provides an in-depth k-mer selection analysis that is focused on the degree of functional annotation achieved for a non-model organism where no reference genome information is available. Individual k-mers and clustered assemblies (CA) were considered using three representative software packages. Pair-wise comparison analyses (between individual k-mers and CAs) were produced to reveal missing Kyoto Encyclopedia of Genes and Genomes (KEGG) ortholog identifiers (KOIs), and to determine a strategy that maximizes the recovery of biological information in a de novo transcriptome assembly. RESULTS: Analyses of single k-mer assemblies resulted in the generation of various quantities of contigs and functional annotations within the selection window of k-mers (k-19 to k-63). For each k-mer in this window, generated assemblies contained certain unique contigs and KOIs that were not present in the other k-mer assemblies. Producing a non-redundant CA of k-mers 19 to 63 resulted in a more complete functional annotation than any single k-mer assembly. However, a fraction of unique annotations remained (~0.19 to 0.27% of total KOIs) in the assemblies of individual k-mers (k-19 to k-63) that were not present in the non-redundant CA. A workflow to recover these unique annotations is presented. CONCLUSIONS: This study demonstrated that different k-mer choices result in various quantities of unique contigs per single k-mer assembly which affects biological information that is retrievable from the transcriptome. This undesirable effect can be minimized, but not eliminated, with clustering of multi-k assemblies with redundancy removal. The complete extraction of biological information in de novo transcriptomics studies requires both the production of a CA and efforts to identify unique contigs that are present in individual k-mer assemblies but not in the CA. |
format | Online Article Text |
id | pubmed-3489510 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2012 |
publisher | BioMed Central |
record_format | MEDLINE/PubMed |
spelling | pubmed-3489510 2012-11-06 Optimization of de novo transcriptome assembly from high-throughput short read sequencing data improves functional annotation for non-model organisms Haznedaroglu, Berat Z Reeves, Darryl Rismani-Yazdi, Hamid Peccia, Jordan BMC Bioinformatics Research Article BACKGROUND: The k-mer hash length is a key factor affecting the output of de novo transcriptome assembly packages using de Bruijn graph algorithms. Assemblies constructed with varying single k-mer choices might result in the loss of unique contiguous sequences (contigs) and relevant biological information. A common solution to this problem is the clustering of single k-mer assemblies. Even though annotation is one of the primary goals of a transcriptome assembly, the success of assembly strategies does not consider the impact of k-mer selection on the annotation output. This study provides an in-depth k-mer selection analysis that is focused on the degree of functional annotation achieved for a non-model organism where no reference genome information is available. Individual k-mers and clustered assemblies (CA) were considered using three representative software packages. Pair-wise comparison analyses (between individual k-mers and CAs) were produced to reveal missing Kyoto Encyclopedia of Genes and Genomes (KEGG) ortholog identifiers (KOIs), and to determine a strategy that maximizes the recovery of biological information in a de novo transcriptome assembly. RESULTS: Analyses of single k-mer assemblies resulted in the generation of various quantities of contigs and functional annotations within the selection window of k-mers (k-19 to k-63). For each k-mer in this window, generated assemblies contained certain unique contigs and KOIs that were not present in the other k-mer assemblies. Producing a non-redundant CA of k-mers 19 to 63 resulted in a more complete functional annotation than any single k-mer assembly. However, a fraction of unique annotations remained (~0.19 to 0.27% of total KOIs) in the assemblies of individual k-mers (k-19 to k-63) that were not present in the non-redundant CA. A workflow to recover these unique annotations is presented. CONCLUSIONS: This study demonstrated that different k-mer choices result in various quantities of unique contigs per single k-mer assembly which affects biological information that is retrievable from the transcriptome. This undesirable effect can be minimized, but not eliminated, with clustering of multi-k assemblies with redundancy removal. The complete extraction of biological information in de novo transcriptomics studies requires both the production of a CA and efforts to identify unique contigs that are present in individual k-mer assemblies but not in the CA. BioMed Central 2012-07-18 /pmc/articles/PMC3489510/ /pubmed/22808927 http://dx.doi.org/10.1186/1471-2105-13-170 Text en Copyright ©2012 Haznedaroglu et al.; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. |
spellingShingle | Research Article Haznedaroglu, Berat Z Reeves, Darryl Rismani-Yazdi, Hamid Peccia, Jordan Optimization of de novo transcriptome assembly from high-throughput short read sequencing data improves functional annotation for non-model organisms |
title | Optimization of de novo transcriptome assembly from high-throughput short read sequencing data improves functional annotation for non-model organisms |
title_sort | optimization of de novo transcriptome assembly from high-throughput short read sequencing data improves functional annotation for non-model organisms |
topic | Research Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3489510/ https://www.ncbi.nlm.nih.gov/pubmed/22808927 http://dx.doi.org/10.1186/1471-2105-13-170 |
work_keys_str_mv | AT haznedarogluberatz optimizationofdenovotranscriptomeassemblyfromhighthroughputshortreadsequencingdataimprovesfunctionalannotationfornonmodelorganisms AT reevesdarryl optimizationofdenovotranscriptomeassemblyfromhighthroughputshortreadsequencingdataimprovesfunctionalannotationfornonmodelorganisms AT rismaniyazdihamid optimizationofdenovotranscriptomeassemblyfromhighthroughputshortreadsequencingdataimprovesfunctionalannotationfornonmodelorganisms AT pecciajordan optimizationofdenovotranscriptomeassemblyfromhighthroughputshortreadsequencingdataimprovesfunctionalannotationfornonmodelorganisms |
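
The pair-wise comparison described in the abstract (individual k-mer assemblies versus the clustered assembly, CA) can be pictured as a set difference over KEGG ortholog identifiers (KOIs). The following is only a minimal sketch of that idea, not the authors' workflow: the function names `kois_missing_from_ca` and `recovered_annotation` are hypothetical, and the identifiers in the example are toy placeholders rather than data from the study. It assumes each assembly has already been annotated and reduced to a set of KOI strings.

```python
from typing import Dict, Set


def kois_missing_from_ca(
    single_k: Dict[int, Set[str]], clustered: Set[str]
) -> Dict[int, Set[str]]:
    """Map each k to the KOIs its assembly has that the CA lacks."""
    return {k: kois - clustered for k, kois in single_k.items()}


def recovered_annotation(
    single_k: Dict[int, Set[str]], clustered: Set[str]
) -> Set[str]:
    """Union of the CA's KOIs with every KOI found only in a single-k assembly."""
    missing = kois_missing_from_ca(single_k, clustered)
    return clustered.union(*missing.values())


if __name__ == "__main__":
    # Toy placeholder identifiers (assumption), not results from the paper.
    single_k = {
        19: {"K00001", "K00002"},
        31: {"K00002", "K00003"},
        63: {"K00004"},
    }
    clustered = {"K00001", "K00002", "K00004"}

    # Report, per k, the annotations that clustering did not retain.
    for k, kois in sorted(kois_missing_from_ca(single_k, clustered).items()):
        print(k, sorted(kois))
    print(sorted(recovered_annotation(single_k, clustered)))
```

Using plain sets keeps the comparison independent of which assembler or annotation tool produced the identifiers; a real pipeline would also need to trace each recovered KOI back to its contig.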