Merging short and stranded long reads improves transcript assembly

Bibliographic Details
Main authors: Kainth, Amoldeep S., Haddad, Gabriela A., Hall, Johnathon M., Ruthenburg, Alexander J.
Format: Online Article Text
Language: English
Published: Public Library of Science 2023
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10629667/
https://www.ncbi.nlm.nih.gov/pubmed/37883581
http://dx.doi.org/10.1371/journal.pcbi.1011576
author Kainth, Amoldeep S.
Haddad, Gabriela A.
Hall, Johnathon M.
Ruthenburg, Alexander J.
collection PubMed
description Long-read RNA sequencing has arisen as a counterpart to short-read sequencing, with the potential to capture full-length isoforms, albeit at the cost of lower depth. Yet this potential is not fully realized due to inherent limitations of current long-read assembly methods and underdeveloped approaches to integrate short-read data. Here, we critically compare the existing methods and develop a new integrative approach to characterize a particularly challenging pool of low-abundance long noncoding RNA (lncRNA) transcripts from short- and long-read sequencing in two distinct cell lines. Our analysis reveals severe limitations in each of the sequencing platforms. For short-read assemblies, coverage declines at transcript termini resulting in ambiguous ends, and uneven low coverage results in segmentation of a single transcript into multiple transcripts. Conversely, long-read sequencing libraries lack depth and strand-of-origin information in cDNA-based methods, culminating in erroneous assembly and quantitation of transcripts. We also discover a cDNA synthesis artifact in long-read datasets that markedly impacts the identity and quantitation of assembled transcripts. Towards remediating these problems, we develop a computational pipeline to “strand” long-read cDNA libraries that rectifies inaccurate mapping and assembly of long-read transcripts. Leveraging the strengths of each platform and our computational stranding, we also present and benchmark a hybrid assembly approach that drastically increases the sensitivity and accuracy of full-length transcript assembly on the correct strand and improves detection of biological features of the transcriptome. When applied to a challenging set of under-annotated and cell-type variable lncRNA, our method resolves the segmentation problem of short-read sequencing and the depth problem of long-read sequencing, resulting in the assembly of coherent transcripts with precise 5’ and 3’ ends. 
Our workflow can be applied to existing datasets for superior demarcation of transcript ends and refined isoform structure, which can enable better differential gene expression analyses and molecular manipulations of transcripts.
format Online
Article
Text
id pubmed-10629667
institution National Center for Biotechnology Information
language English
publishDate 2023
publisher Public Library of Science
record_format MEDLINE/PubMed
spelling pubmed-10629667 2023-11-08 Merging short and stranded long reads improves transcript assembly Kainth, Amoldeep S. Haddad, Gabriela A. Hall, Johnathon M. Ruthenburg, Alexander J. PLoS Comput Biol Research Article Public Library of Science 2023-10-26 /pmc/articles/PMC10629667/ /pubmed/37883581 http://dx.doi.org/10.1371/journal.pcbi.1011576 Text en © 2023 Kainth et al. This is an open access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
title Merging short and stranded long reads improves transcript assembly
topic Research Article