
Finding a suitable library size to call variants in RNA-Seq

BACKGROUND: RNA sequencing allows the study of both gene expression changes and transcribed mutations, providing a highly effective way to gain insight into cancer biology. When planning the sequencing of a large cohort of samples, library size is a fundamental factor affecting both the overall cost...


Bibliographic Details
Main Authors:	Quaglieri, Anna, Flensburg, Christoffer, Speed, Terence P., Majewski, Ian J.
Format:	Online Article Text
Language:	English
Published:	BioMed Central 2020
Subjects:	Methodology Article
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7708150/ https://www.ncbi.nlm.nih.gov/pubmed/33261552 http://dx.doi.org/10.1186/s12859-020-03860-4

_version_	1783617505069105152
author	Quaglieri, Anna; Flensburg, Christoffer; Speed, Terence P.; Majewski, Ian J.
author_facet	Quaglieri, Anna; Flensburg, Christoffer; Speed, Terence P.; Majewski, Ian J.
author_sort	Quaglieri, Anna
collection	PubMed
description	BACKGROUND: RNA sequencing allows the study of both gene expression changes and transcribed mutations, providing a highly effective way to gain insight into cancer biology. When planning the sequencing of a large cohort of samples, library size is a fundamental factor affecting both the overall cost and the quality of the results. Here we specifically address how overall library size influences the detection of somatic mutations in RNA-seq data in two acute myeloid leukaemia datasets. RESULTS: We simulated shallower sequencing depths by downsampling 45 acute myeloid leukaemia samples (100 bp PE) that are part of the Leucegene project, which were originally sequenced at high depth. We compared the sensitivity of six methods of recovering validated mutations on the same samples. The methods compared are a combination of three popular callers (MuTect, VarScan, and VarDict) and two filtering strategies. We observed an incremental loss in sensitivity when simulating libraries of 80M, 50M, 40M, 30M and 20M fragments, with the largest loss detected with less than 30M fragments (below 90%, average loss of 7%). The sensitivity in recovering insertions and deletions varied markedly between callers, with VarDict showing the highest sensitivity (60%). Single nucleotide variant sensitivity is relatively consistent across methods, apart from MuTect, whose default filters need adjustment when using RNA-Seq. We also analysed 136 RNA-Seq samples from the TCGA-LAML cohort (50 bp PE) and assessed the change in sensitivity between the initial libraries (average 59M fragments) and after downsampling to 40M fragments. When considering single nucleotide variants in recurrently mutated myeloid genes we found a comparable performance, with a 6% average loss in sensitivity using 40M fragments. CONCLUSIONS: Between 30M and 40M 100 bp PE reads are needed to recover 90–95% of the initial variants on recurrently mutated myeloid genes. To extend this result to another cancer type, an exploration of the characteristics of its mutations and gene expression patterns is suggested.
format	Online Article Text
id	pubmed-7708150
institution	National Center for Biotechnology Information
language	English
publishDate	2020
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-7708150 2020-12-02 Finding a suitable library size to call variants in RNA-Seq Quaglieri, Anna Flensburg, Christoffer Speed, Terence P. Majewski, Ian J. BMC Bioinformatics Methodology Article BioMed Central 2020-12-01 /pmc/articles/PMC7708150/ /pubmed/33261552 http://dx.doi.org/10.1186/s12859-020-03860-4 Text en © The Author(s) 2020 Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
spellingShingle	Methodology Article Quaglieri, Anna Flensburg, Christoffer Speed, Terence P. Majewski, Ian J. Finding a suitable library size to call variants in RNA-Seq
title	Finding a suitable library size to call variants in RNA-Seq
title_sort	finding a suitable library size to call variants in rna-seq
topic	Methodology Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7708150/ https://www.ncbi.nlm.nih.gov/pubmed/33261552 http://dx.doi.org/10.1186/s12859-020-03860-4
