
Pattern analysis approach reveals restriction enzyme cutting abnormalities and other cDNA library construction artifacts using raw EST data

BACKGROUND: Expressed Sequence Tag (EST) sequences are widely used in applications such as genome annotation, gene discovery and gene expression studies. However, some GenBank dbEST sequences have proven to be “unclean”. Identification of cDNA termini/ends and their structures in raw ESTs not onl...


Bibliographic Details
Main authors: Zhou, Sun; Ji, Guoli; Liu, Xiaolin; Li, Pei; Moler, James; Karro, John E; Liang, Chun
Format: Online Article Text
Language: English
Published: BioMed Central 2012
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3424822/
https://www.ncbi.nlm.nih.gov/pubmed/22554190
http://dx.doi.org/10.1186/1472-6750-12-16
author Zhou, Sun
Ji, Guoli
Liu, Xiaolin
Li, Pei
Moler, James
Karro, John E
Liang, Chun
collection PubMed
description BACKGROUND: Expressed Sequence Tag (EST) sequences are widely used in applications such as genome annotation, gene discovery and gene expression studies. However, some GenBank dbEST sequences have proven to be “unclean”. Identification of cDNA termini/ends and their structures in raw ESTs not only facilitates data quality control and accurate delineation of transcription ends, but also furthers our understanding of the potential sources of data abnormalities/errors present in the wet-lab procedures for cDNA library construction.
RESULTS: After analyzing a total of 309,976 raw Pinus taeda ESTs, we uncovered many distinct variations of cDNA termini, some of which prove to be good indicators of wet-lab artifacts, and characterized each raw EST by its cDNA terminus structure patterns. In contrast to the expected patterns, many ESTs displayed complex and/or abnormal patterns that represent potential wet-lab errors such as: a failure of one or both of the restriction enzymes to cut the plasmid vector; a failure of the restriction enzymes to cut the vector at the correct positions; the insertion of two cDNA inserts into a single vector; the insertion of multiple and/or concatenated adapters/linkers; the presence of 3′-end terminal structures in designated 5′-end sequences or vice versa; and so on. With a close examination of these artifacts, many problematic ESTs that have been deposited into public databases by conventional bioinformatics pipelines or tools could be cleaned or filtered by our methodology. We developed a software tool for Abnormality Filtering and Sequence Trimming for ESTs (AFST, http://code.google.com/p/afst/) using a pattern analysis approach. To compare AFST with other pipelines that submitted ESTs into dbEST, we reprocessed 230,783 Pinus taeda and 38,709 Arachis hypogaea GenBank ESTs. We found that 7.4% of Pinus taeda and 29.2% of Arachis hypogaea GenBank ESTs are “unclean” or abnormal, all of which could be cleaned or filtered by AFST.
CONCLUSIONS: cDNA terminal pattern analysis, as implemented in the AFST software tool, can be utilized to reveal wet-lab errors such as restriction enzyme cutting abnormalities and chimeric EST sequences, detect various data abnormalities embedded in existing Sanger EST datasets, improve the accuracy of identifying and extracting bona fide cDNA inserts from raw ESTs, and therefore greatly benefit downstream EST-based applications.
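The abnormal terminus patterns listed in the abstract can be illustrated with a small sketch. This is not AFST's implementation, which the record does not describe. The adapter sequence, the poly(A) length threshold, the flag names, and the function `terminus_flags` are all hypothetical, chosen only to show how a raw EST might be labeled by the terminal structures it contains.

```python
import re

# Hypothetical values for illustration only; not taken from the paper or AFST.
ADAPTER = "GGCACGAGG"             # assumed adapter/linker sequence
POLY_A = re.compile(r"A{12,}")    # assumed minimum length of a poly(A) tail


def terminus_flags(seq: str, read_end: str) -> list[str]:
    """Label a raw EST with the abnormal terminus patterns it shows.

    read_end is "5" or "3", the end the read was designated to cover.
    An empty list means none of the checked abnormalities was found.
    """
    seq = seq.upper()
    flags = []

    # More than one adapter copy in a single read.
    if seq.count(ADAPTER) > 1:
        flags.append("multiple_adapters")

    # Two adapter copies joined directly, with no insert between them.
    if ADAPTER + ADAPTER in seq:
        flags.append("concatenated_adapters")

    # A 3'-end structure (here, a poly(A) run) in a designated 5' read.
    if read_end == "5" and POLY_A.search(seq):
        flags.append("3prime_structure_in_5prime_read")

    return flags


insert = "ATGGCTTCAGGTCCTAGGCATTCGA"  # made-up cDNA fragment
clean = terminus_flags(ADAPTER + insert, "5")
doubled = terminus_flags(ADAPTER + ADAPTER + insert, "5")
tailed = terminus_flags(ADAPTER + insert + "A" * 15, "5")
```

A real pipeline would also need to match vector sequence and restriction sites, tolerate sequencing errors through approximate matching, and handle reverse-complemented reads. Those steps are left out here to keep the sketch short.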
format Online
Article
Text
id pubmed-3424822
institution National Center for Biotechnology Information
language English
publishDate 2012
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-3424822 2012-08-23 BMC Biotechnol Methodology Article BioMed Central 2012-05-03 /pmc/articles/PMC3424822/ /pubmed/22554190 http://dx.doi.org/10.1186/1472-6750-12-16 Text en Copyright ©2012 Zhou et al.; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
title Pattern analysis approach reveals restriction enzyme cutting abnormalities and other cDNA library construction artifacts using raw EST data
topic Methodology Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3424822/
https://www.ncbi.nlm.nih.gov/pubmed/22554190
http://dx.doi.org/10.1186/1472-6750-12-16