Challenges and advances for transcriptome assembly in non-model species


Bibliographic Details
Main authors: Ungaro, Arnaud, Pech, Nicolas, Martin, Jean-François, McCairns, R. J. Scott, Mévy, Jean-Philippe, Chappaz, Rémi, Gilles, André
Format: Online Article Text
Language: English
Published: Public Library of Science 2017
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5607178/
https://www.ncbi.nlm.nih.gov/pubmed/28931057
http://dx.doi.org/10.1371/journal.pone.0185020
collection PubMed
description Analyses of high-throughput transcriptome sequences of non-model organisms are based on two main approaches: de novo assembly and genome-guided assembly using mapping to assign reads prior to assembly. Given the limits of mapping reads to a reference when it is highly divergent, as is frequently the case for non-model species, we evaluate whether using blastn would outperform mapping methods for read assignment in such situations (>15% divergence). We demonstrate its high performance by using simulated reads of lengths corresponding to those generated by the most common sequencing platforms, and over a realistic range of genetic divergence (0% to 30% divergence). Here we focus on gene identification and not on resolving the whole set of transcripts (i.e. the complete transcriptome). For simulated datasets, the transcriptome-guided assembly based on blastn recovers 94.8% of genes irrespective of read length at 0% divergence; however, assignment rate of reads is negatively correlated with both increasing divergence level and reducing read lengths. Nevertheless, we still observe 92.6% of recovered genes at 30% divergence irrespective of read length. This analysis also produces a categorization of genes relative to their assignment, and suggests guidelines for data processing prior to analyses of comparative transcriptomics and gene expression to minimize potential inferential bias associated with incorrect transcript assignment. We also compare the performances of de novo assembly alone vs in combination with a transcriptome-guided assembly based on blastn both via simulation and empirically, using data from a cyprinid fish species and from an oak species. For any simulated scenario, the transcriptome-guided assembly using blastn outperforms the de novo approach alone, including when the divergence level is beyond the reach of traditional mapping methods. 
Combining de novo assembly and a related reference transcriptome for read assignment also addresses the bias/error in contigs caused by the dependence on a related reference alone. Empirical data corroborate these findings when assembling transcriptomes from the two non-model organisms: Parachondrostoma toxostoma (fish) and Quercus pubescens (plant). For the fish species, out of the 31,944 genes known from Danio rerio, the guided and de novo assemblies recover 20,605 and 20,032 genes, respectively, but the performance of the guided assembly approach is much higher for both the contiguity and completeness metrics. For the oak, out of the 29,971 genes known from Vitis vinifera, the transcriptome-guided and de novo assemblies display similar performance, but the new guided approach detects 16,326 genes where the de novo assembly only detects 9,385 genes.
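The empirical gene counts quoted in the description can be turned into recovery percentages for easier comparison. The following is a minimal sketch in Python, not code from the article: the dictionary layout and the names `REFERENCE_GENES`, `RECOVERED`, `recovery_fraction` and `summarize` are illustrative assumptions, and "D. rerio" is taken here to mean Danio rerio. Only the counts themselves come from the description.

```python
# Gene counts quoted in the description; all names below are illustrative.
REFERENCE_GENES = {
    "Danio rerio": 31_944,      # reference for the fish species
    "Vitis vinifera": 29_971,   # reference for the oak species
}

RECOVERED = {
    "Parachondrostoma toxostoma": {
        "reference": "Danio rerio",
        "guided": 20_605,
        "de_novo": 20_032,
    },
    "Quercus pubescens": {
        "reference": "Vitis vinifera",
        "guided": 16_326,
        "de_novo": 9_385,
    },
}


def recovery_fraction(recovered: int, total: int) -> float:
    """Return the fraction of reference genes that an assembly recovered."""
    if total <= 0:
        raise ValueError("total number of reference genes must be positive")
    return recovered / total


def summarize() -> list[tuple[str, float, float]]:
    """Return (species, guided %, de novo %) rounded to one decimal place."""
    rows = []
    for species, counts in RECOVERED.items():
        total = REFERENCE_GENES[counts["reference"]]
        guided_pct = round(100 * recovery_fraction(counts["guided"], total), 1)
        de_novo_pct = round(100 * recovery_fraction(counts["de_novo"], total), 1)
        rows.append((species, guided_pct, de_novo_pct))
    return rows


if __name__ == "__main__":
    for species, guided_pct, de_novo_pct in summarize():
        print(f"{species}: guided {guided_pct}%, de novo {de_novo_pct}%")
```

Under these assumptions, the guided and de novo assemblies recover about 64.5% and 62.7% of the reference genes for the fish, and about 54.5% and 31.3% for the oak, which makes the larger gap in the plant dataset easier to see than the raw counts do.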
id pubmed-5607178
institution National Center for Biotechnology Information
record_format MEDLINE/PubMed
spelling pubmed-5607178 2017-10-09 Challenges and advances for transcriptome assembly in non-model species Ungaro, Arnaud Pech, Nicolas Martin, Jean-François McCairns, R. J. Scott Mévy, Jean-Philippe Chappaz, Rémi Gilles, André PLoS One Research Article Public Library of Science 2017-09-20 /pmc/articles/PMC5607178/ /pubmed/28931057 http://dx.doi.org/10.1371/journal.pone.0185020 Text en © 2017 Ungaro et al http://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
topic Research Article