A Systematic Evaluation of High-Throughput Sequencing Approaches to Identify Low-Frequency Single Nucleotide Variants in Viral Populations

High-throughput sequencing approaches, such as those provided by Illumina, are an efficient way to understand sequence variation within viral populations. However, challenges exist in distinguishing process-introduced error from biological variance, which significantly impacts our ability to identify sub-consens...

Bibliographic Details
Main Authors: King, David J., Freimanis, Graham, Lasecka-Dykes, Lidia, Asfor, Amin, Ribeca, Paolo, Waters, Ryan, King, Donald P., Laing, Emma
Format: Online Article Text
Language: English
Published: MDPI 2020
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7594041/
https://www.ncbi.nlm.nih.gov/pubmed/33092085
http://dx.doi.org/10.3390/v12101187
author King, David J.
Freimanis, Graham
Lasecka-Dykes, Lidia
Asfor, Amin
Ribeca, Paolo
Waters, Ryan
King, Donald P.
Laing, Emma
collection PubMed
description High-throughput sequencing approaches, such as those provided by Illumina, are an efficient way to understand sequence variation within viral populations. However, challenges exist in distinguishing process-introduced error from biological variance, which significantly impacts our ability to identify sub-consensus single-nucleotide variants (SNVs). Here we have taken a systematic approach to evaluate laboratory and bioinformatic pipelines to accurately identify low-frequency SNVs in viral populations. Artificial DNA and RNA “populations” were created by introducing known SNVs at predetermined frequencies into template nucleic acid before being sequenced on an Illumina MiSeq platform. These were used to assess the effects of abundance and starting input material type, technical replicates, read length and quality, short-read aligner, and percentage frequency thresholds on the ability to accurately call variants. Analyses revealed that the abundance and type of input nucleic acid had the greatest impact on the accuracy of SNV calling as measured by a micro-averaged Matthews correlation coefficient score, with DNA and high RNA inputs (10^7 copies) allowing for variants to be called at a 0.2% frequency. Reduced input RNA (10^5 copies) required more technical replicates to maintain accuracy, while low RNA inputs (10^3 copies) suffered from consensus-level errors. Base errors at specific motifs, present in all technical replicates, were also identified; these can be excluded to further increase SNV calling accuracy. These findings indicate that samples with low RNA inputs should be excluded from SNV calling and reinforce the importance of optimising the technical and bioinformatics steps in pipelines that are used to accurately identify sequence variants.
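The abstract scores SNV calling with a micro-averaged Matthews correlation coefficient but this record does not give the calculation. The following is a minimal Python sketch of one common reading of that metric: confusion counts are pooled across technical replicates before a single MCC is computed. The function names, the per-site binary simplification (a site is either called or not, ignoring which alternate base), and the example positions and frequencies are all assumptions made for illustration, not the authors' pipeline.

```python
from math import sqrt
from typing import Dict, Iterable, Set, Tuple

Counts = Tuple[int, int, int, int]  # (tp, fp, tn, fn)


def confusion_counts(called: Dict[int, float], known: Set[int],
                     n_sites: int, threshold: float) -> Counts:
    """Compare called SNV sites with the known (spiked-in) SNV sites.

    `called` maps a genome position to its observed variant frequency (%).
    A site counts as called only if its frequency meets `threshold`.
    """
    called_sites = {pos for pos, freq in called.items() if freq >= threshold}
    tp = len(called_sites & known)   # known variants that were called
    fp = len(called_sites - known)   # called sites that are not known variants
    fn = len(known - called_sites)   # known variants that were missed
    tn = n_sites - tp - fp - fn      # every remaining site
    return tp, fp, tn, fn


def mcc(tp: int, fp: int, tn: int, fn: int) -> float:
    """Matthews correlation coefficient; 0.0 when the denominator is zero."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


def micro_averaged_mcc(tables: Iterable[Counts]) -> float:
    """Pool the counts from all replicates, then compute one MCC."""
    tp = fp = tn = fn = 0
    for t_tp, t_fp, t_tn, t_fn in tables:
        tp += t_tp
        fp += t_fp
        tn += t_tn
        fn += t_fn
    return mcc(tp, fp, tn, fn)


if __name__ == "__main__":
    known_sites = {10, 20}                       # hypothetical spiked-in SNVs
    replicate_1 = {10: 0.5, 20: 0.1, 30: 0.3}    # hypothetical calls (%)
    replicate_2 = {10: 0.4, 20: 0.25}
    tables = [confusion_counts(r, known_sites, n_sites=100, threshold=0.2)
              for r in (replicate_1, replicate_2)]
    print(micro_averaged_mcc(tables))
```

Pooling counts before scoring weights every site equally across replicates, whereas averaging per-replicate MCC values (a macro average) would weight every replicate equally; the two can differ when replicates have very different error counts.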
format Online
Article
Text
id pubmed-7594041
institution National Center for Biotechnology Information
language English
publishDate 2020
publisher MDPI
record_format MEDLINE/PubMed
spelling pubmed-7594041 2020-10-30 Viruses Article MDPI 2020-10-20 /pmc/articles/PMC7594041/ /pubmed/33092085 http://dx.doi.org/10.3390/v12101187 Text en
© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
title A Systematic Evaluation of High-Throughput Sequencing Approaches to Identify Low-Frequency Single Nucleotide Variants in Viral Populations
topic Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7594041/
https://www.ncbi.nlm.nih.gov/pubmed/33092085
http://dx.doi.org/10.3390/v12101187