
De novo clustering of long reads by gene from transcriptomics data

Long-read sequencing currently provides sequences of several thousand base pairs. It is therefore possible to obtain complete transcripts, offering an unprecedented view of the cellular transcriptome. However, the literature lacks tools for de novo clustering of such data, in particular for Oxford...


Bibliographic Details
Main Authors: Marchet, Camille; Lecompte, Lolita; Silva, Corinne Da; Cruaud, Corinne; Aury, Jean-Marc; Nicolas, Jacques; Peterlongo, Pierre
Format: Online Article Text
Language: English
Published: Oxford University Press 2019
Subjects: Methods Online
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6326815/
https://www.ncbi.nlm.nih.gov/pubmed/30260405
http://dx.doi.org/10.1093/nar/gky834
author Marchet, Camille
Lecompte, Lolita
Silva, Corinne Da
Cruaud, Corinne
Aury, Jean-Marc
Nicolas, Jacques
Peterlongo, Pierre
collection PubMed
description Long-read sequencing currently provides sequences of several thousand base pairs. It is therefore possible to obtain complete transcripts, offering an unprecedented view of the cellular transcriptome. However, the literature lacks tools for de novo clustering of such data, in particular for Oxford Nanopore Technologies reads, because of their inherently high error rate compared to short reads. Our goal is to process reads from whole-transcriptome sequencing data accurately and without a reference genome, in order to reliably group reads coming from the same gene. This de novo approach is therefore particularly suitable for non-model species, but can also serve as a useful pre-processing step to improve read mapping. Our contribution is twofold: a new algorithm adapted to clustering reads by gene, and a practical, freely available tool that scales to the complete processing of eukaryotic transcriptomes. We sequenced a mouse RNA sample using the MinION device. This dataset is used to compare our solution to other algorithms used in the context of biological clustering. We demonstrate that it is the best of these approaches for transcriptomic long reads. When a reference is available to enable mapping, we show that it stands as an alternative method that predicts complementary clusters.
format Online
Article
Text
id pubmed-6326815
institution National Center for Biotechnology Information
language English
publishDate 2019
publisher Oxford University Press
record_format MEDLINE/PubMed
spelling pubmed-6326815 2019-01-15 De novo clustering of long reads by gene from transcriptomics data Nucleic Acids Res Methods Online Oxford University Press 2019-01-10 2018-09-27 /pmc/articles/PMC6326815/ /pubmed/30260405 http://dx.doi.org/10.1093/nar/gky834 Text en © The Author(s) 2018. Published by Oxford University Press on behalf of Nucleic Acids Research.
http://creativecommons.org/licenses/by/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
title De novo clustering of long reads by gene from transcriptomics data
topic Methods Online