
Sequencing artifacts in the type A influenza databases and attempts to correct them

BACKGROUND: There are over 276 000 influenza gene sequences in public databases, with the quality of the sequences determined by the contributor. OBJECTIVE: As part of a high school class project, influenza sequences with possible errors were identified in the public databases based on the size of t...


Bibliographic Details
Main Authors: Suarez, David L, Chester, Nikki, Hatfield, Jason
Format: Online Article Text
Language: English
Published: Blackwell Publishing Ltd 2014
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4181811/
https://www.ncbi.nlm.nih.gov/pubmed/24512607
http://dx.doi.org/10.1111/irv.12239
_version_ 1782337426195218432
author Suarez, David L
Chester, Nikki
Hatfield, Jason
author_facet Suarez, David L
Chester, Nikki
Hatfield, Jason
author_sort Suarez, David L
collection PubMed
description BACKGROUND: There are over 276 000 influenza gene sequences in public databases, with the quality of the sequences determined by the contributor. OBJECTIVE: As part of a high school class project, influenza sequences with possible errors were identified in the public databases based on the size of the gene being longer than expected, with the hypothesis that these sequences would have an error. Students contacted sequence submitters, alerting them to the possible sequence issue(s), and requested that the suspect sequence(s) be corrected as appropriate. METHODS: Type A influenza viruses were screened, and gene segments longer than the accepted size were identified for further analysis. Attention was placed on sequences with additional nucleotides upstream or downstream of the highly conserved non-coding ends of the viral segments. RESULTS AND CONCLUSIONS: A total of 1081 sequences were identified that met this criterion. Three types of errors were commonly observed: non-influenza primer sequence was not removed from the sequence; PCR product was cloned and plasmid sequence was included in the sequence; and Taq polymerase added an adenine at the end of the PCR product. Internal insertions of nucleotide sequence were also commonly observed, but in many cases it was unclear if the sequence was correct or actually contained an error. A total of 215 sequences, or 22.8% of the suspect sequences, were corrected in the public databases in the first year of the student project. Unfortunately, 138 additional sequences with possible errors were added to the databases in the second year. Greater awareness of the need for data integrity in sequences submitted to public databases is required to fully reap the benefits of these large data sets.
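The screening step described in the abstract (flag segments longer than the accepted size, then inspect what lies outside the conserved non-coding ends) could be sketched roughly as follows. This is only an illustration, not the authors' procedure: the function name `flag_suspect`, the `EXPECTED_LENGTH` table, and its two example lengths are assumptions made for the sketch, and the termini are matched exactly even though real influenza A termini show some variation.

```python
# Conserved segment termini in positive (cDNA) sense: the Uni12 primer
# sequence and the reverse complement of the Uni13 primer sequence.
CONSERVED_5P = "AGCAAAAGCAGG"
CONSERVED_3P = "CCTTGTTTCTACT"

# Hypothetical table of accepted full-segment lengths (illustrative values
# only; a real screen would need a length per segment and subtype).
EXPECTED_LENGTH = {"M": 1027, "NS": 890}


def flag_suspect(segment, sequence):
    """Return details for an over-length sequence, or None if not over-length."""
    seq = sequence.upper()
    expected = EXPECTED_LENGTH[segment]
    if len(seq) <= expected:
        return None

    # Locate the conserved ends; anything outside them is suspect.
    start = seq.find(CONSERVED_5P)
    end = seq.rfind(CONSERVED_3P)
    upstream = seq[:start] if start > 0 else ""
    downstream = seq[end + len(CONSERVED_3P):] if end != -1 else ""

    return {
        "excess": len(seq) - expected,
        "upstream": upstream,        # e.g. leftover primer or plasmid sequence
        "downstream": downstream,
        # A single trailing A is consistent with Taq polymerase A-addition.
        "possible_taq_a": downstream == "A",
    }
```

A sequence with nothing outside the conserved ends but still over length would point to an internal insertion, which the abstract notes often cannot be classified as an error without further evidence.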
format Online
Article
Text
id pubmed-4181811
institution National Center for Biotechnology Information
language English
publishDate 2014
publisher Blackwell Publishing Ltd
record_format MEDLINE/PubMed
spelling pubmed-4181811 2014-10-29 Sequencing artifacts in the type A influenza databases and attempts to correct them Suarez, David L Chester, Nikki Hatfield, Jason Influenza Other Respir Viruses Original Article BACKGROUND: There are over 276 000 influenza gene sequences in public databases, with the quality of the sequences determined by the contributor. OBJECTIVE: As part of a high school class project, influenza sequences with possible errors were identified in the public databases based on the size of the gene being longer than expected, with the hypothesis that these sequences would have an error. Students contacted sequence submitters, alerting them to the possible sequence issue(s), and requested that the suspect sequence(s) be corrected as appropriate. METHODS: Type A influenza viruses were screened, and gene segments longer than the accepted size were identified for further analysis. Attention was placed on sequences with additional nucleotides upstream or downstream of the highly conserved non-coding ends of the viral segments. RESULTS AND CONCLUSIONS: A total of 1081 sequences were identified that met this criterion. Three types of errors were commonly observed: non-influenza primer sequence was not removed from the sequence; PCR product was cloned and plasmid sequence was included in the sequence; and Taq polymerase added an adenine at the end of the PCR product. Internal insertions of nucleotide sequence were also commonly observed, but in many cases it was unclear if the sequence was correct or actually contained an error. A total of 215 sequences, or 22.8% of the suspect sequences, were corrected in the public databases in the first year of the student project. Unfortunately, 138 additional sequences with possible errors were added to the databases in the second year. Greater awareness of the need for data integrity in sequences submitted to public databases is required to fully reap the benefits of these large data sets.
Blackwell Publishing Ltd 2014-07 2014-02-07 /pmc/articles/PMC4181811/ /pubmed/24512607 http://dx.doi.org/10.1111/irv.12239 Text en © 2014 The Authors. Influenza and Other Respiratory Viruses Published by John Wiley & Sons Ltd. http://creativecommons.org/licenses/by/3.0/ This is an open access article under the terms of the Creative Commons Attribution License, which permits use, distribution and reproduction in any medium, provided the original work is properly cited.
spellingShingle Original Article
Suarez, David L
Chester, Nikki
Hatfield, Jason
Sequencing artifacts in the type A influenza databases and attempts to correct them
title Sequencing artifacts in the type A influenza databases and attempts to correct them
title_full Sequencing artifacts in the type A influenza databases and attempts to correct them
title_fullStr Sequencing artifacts in the type A influenza databases and attempts to correct them
title_full_unstemmed Sequencing artifacts in the type A influenza databases and attempts to correct them
title_short Sequencing artifacts in the type A influenza databases and attempts to correct them
title_sort sequencing artifacts in the type a influenza databases and attempts to correct them
topic Original Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4181811/
https://www.ncbi.nlm.nih.gov/pubmed/24512607
http://dx.doi.org/10.1111/irv.12239
work_keys_str_mv AT suarezdavidl sequencingartifactsinthetypeainfluenzadatabasesandattemptstocorrectthem
AT chesternikki sequencingartifactsinthetypeainfluenzadatabasesandattemptstocorrectthem
AT hatfieldjason sequencingartifactsinthetypeainfluenzadatabasesandattemptstocorrectthem