ConiferEST: an integrated bioinformatics system for data reprocessing and mining of conifer expressed sequence tags (ESTs)

BACKGROUND: With the advent of low-cost, high-throughput sequencing, the amount of public-domain Expressed Sequence Tag (EST) sequence data available for both model and non-model organisms is growing exponentially. While these data are widely used for characterizing various genomes, they also present...

Bibliographic Details
Main authors: Liang, Chun; Wang, Gang; Liu, Lin; Ji, Guoli; Fang, Lin; Liu, Yuansheng; Carter, Kikia; Webb, Jason S; Dean, Jeffrey FD
Format: Text
Language: English
Published: BioMed Central 2007
Subjects: Database
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1894976/
https://www.ncbi.nlm.nih.gov/pubmed/17535431
http://dx.doi.org/10.1186/1471-2164-8-134
author Liang, Chun
Wang, Gang
Liu, Lin
Ji, Guoli
Fang, Lin
Liu, Yuansheng
Carter, Kikia
Webb, Jason S
Dean, Jeffrey FD
collection PubMed
description BACKGROUND: With the advent of low-cost, high-throughput sequencing, the amount of public-domain Expressed Sequence Tag (EST) sequence data available for both model and non-model organisms is growing exponentially. While these data are widely used for characterizing various genomes, they also present a serious challenge for data quality control and validation due to their inherent deficiencies, particularly for species without genome sequences. DESCRIPTION: ConiferEST is an integrated system for data reprocessing, visualization and mining of conifer ESTs. In its current release, Build 1.0, it houses 172,229 loblolly pine EST sequence reads, which were obtained by reprocessing raw DNA sequencer traces using our software, WebTraceMiner. The trace files were downloaded from the NCBI Trace Archive. ConiferEST provides biologists with unique, easy-to-use data visualization and mining tools for a variety of putative sequence features, including cloning vector segments, adapter sequences, restriction endonuclease recognition sites, polyA and polyT runs, and their corresponding Phred quality values. Based on these putative features, verified sequence features such as 3' and/or 5' termini of cDNA inserts in either the sense or the non-sense strand have been identified in silico. Interestingly, only 30.03% of the designated 3' ESTs were found to have an authenticated 5' terminus in the non-sense strand (i.e., polyT tails), while fewer than 5.34% of the designated 5' ESTs had a verified 5' terminus in the sense strand. Such previously ignored features provide valuable insight for data quality control and validation of error-prone ESTs, as well as the ability to identify novel functional motifs embedded in large EST datasets. We found that "double-termini adapters" were effective indicators of potential EST chimeras. For all sequences with in silico verified termini/terminus, we used InterProScan to assign protein domain signatures, the results of which are available for in-depth exploration using our biologist-friendly web interfaces. CONCLUSION: ConiferEST represents a unique and complementary public resource for EST data integration and mining in conifers by reprocessing raw DNA traces, identifying putative sequence features, and determining and annotating in silico verified features. Seamlessly integrated with other public resources, ConiferEST provides biologists with powerful tools to verify data, visualize abnormalities, including EST chimeras, and explore large EST datasets.
format Text
id pubmed-1894976
institution National Center for Biotechnology Information
language English
publishDate 2007
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-1894976 2007-06-21 BMC Genomics Database BioMed Central 2007-05-29 /pmc/articles/PMC1894976/ /pubmed/17535431 http://dx.doi.org/10.1186/1471-2164-8-134 Text en Copyright © 2007 Liang et al; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
title ConiferEST: an integrated bioinformatics system for data reprocessing and mining of conifer expressed sequence tags (ESTs)
topic Database