Querying the public databases for sequences using complex keywords contained in the feature lines

BACKGROUND: High-throughput technologies often require the retrieval of large data sets of sequences. Retrieving EMBL or GenBank entries by keyword is easy with tools such as ACNUC, Entrez or SRS, but these tools have limitations, in particular when querying with complex keywords. RESULTS: We show th...

Bibliographic Details
Main Authors:	Croce, Olivier, Lamarre, Michaël, Christen, Richard
Format:	Text
Language:	English
Published:	BioMed Central 2006
Subjects:	Software
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1403806/ https://www.ncbi.nlm.nih.gov/pubmed/16441875 http://dx.doi.org/10.1186/1471-2105-7-45

_version_	1782127040598638592
author	Croce, Olivier; Lamarre, Michaël; Christen, Richard
collection	PubMed
description	BACKGROUND: High-throughput technologies often require the retrieval of large data sets of sequences. Retrieving EMBL or GenBank entries by keyword is easy with tools such as ACNUC, Entrez or SRS, but these tools have limitations, in particular when querying with complex keywords. RESULTS: We show that Entrez has severe limitations with respect to retrieving subsequences. SRS works well with simple keywords but not with keywords composed of several terms, and has problems with complex queries. ACNUC works well, but does not allow precise queries in the Feature qualifiers. We developed specific Perl scripts to precisely retrieve subsequences as defined by complex descriptors in the Feature qualifiers of the EMBL entries. We improved parts of the BioPerl library to allow parsing of large data files, and we embedded these scripts in a user-friendly, OS-independent interface for easy use. CONCLUSION: Although not as fast as the public tools that use prebuilt indexes, parsing the complete entries with a script is often necessary to retrieve exactly the data searched for. Embedding the scripts in a user-friendly interface allows biologists to use them, and bioinformaticians can easily modify them, if necessary, for unforeseen needs.
format	Text
id	pubmed-1403806
institution	National Center for Biotechnology Information
language	English
publishDate	2006
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-1403806 2006-03-18 BMC Bioinformatics. BioMed Central 2006-01-27. Copyright © 2006 Croce et al; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
title	Querying the public databases for sequences using complex keywords contained in the feature lines
topic	Software
work_keys_str_mv	AT croceolivier queryingthepublicdatabasesforsequencesusingcomplexkeywordscontainedinthefeaturelines AT lamarremichael queryingthepublicdatabasesforsequencesusingcomplexkeywordscontainedinthefeaturelines AT christenrichard queryingthepublicdatabasesforsequencesusingcomplexkeywordscontainedinthefeaturelines
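
The description above mentions scripts that parse complete EMBL entries and retrieve subsequences selected by multi-term descriptors in the Feature qualifiers. The authors' scripts are written in Perl with BioPerl and are not reproduced here. What follows is only a minimal Python sketch of the general idea, under stated assumptions: the function names (`parse_embl_entry`, `extract`, `find_subsequences`) and the sample entry are hypothetical, and only simple `start..end` and `complement(start..end)` locations are handled.

```python
import re

# Assumption: only simple locations are supported; join(...), partial (< >)
# and remote locations are skipped by this sketch.
SIMPLE_LOC = re.compile(r"^(complement\()?(\d+)\.\.(\d+)\)?$")
COMPLEMENT = str.maketrans("acgtACGT", "tgcaTGCA")


def parse_embl_entry(text):
    """Return (features, sequence) for one EMBL flat-file entry."""
    features, seq_parts = [], []
    in_sequence, last_qualifier = False, None
    for line in text.splitlines():
        if line.startswith("//"):          # end of entry
            break
        if line.startswith("SQ"):          # sequence block starts after this line
            in_sequence = True
            continue
        if in_sequence:
            # Drop spaces and the trailing position counter, keep the bases.
            seq_parts.append(re.sub(r"[^A-Za-z]", "", line))
            continue
        if not line.startswith("FT"):
            continue
        if line[5:6].strip():              # feature key starts in column 6
            key, _, location = line[5:].strip().partition(" ")
            features.append({"key": key, "location": location.strip(),
                             "qualifiers": {}})
            last_qualifier = None
            continue
        value = line[2:].strip()
        if not features or not value:
            continue
        quals = features[-1]["qualifiers"]
        if value.startswith("/"):          # new qualifier, e.g. /product="..."
            name, _, raw = value[1:].partition("=")
            quals[name] = raw.strip('"')
            last_qualifier = name
        elif last_qualifier:               # continuation of a long qualifier value
            quals[last_qualifier] = (quals[last_qualifier] + " " + value).strip('"')
        else:                              # continuation of a long location
            features[-1]["location"] += value
    return features, "".join(seq_parts)


def extract(sequence, location):
    """Return the subsequence for a simple location, or None if unsupported."""
    match = SIMPLE_LOC.match(location)
    if match is None:
        return None
    start, end = int(match.group(2)), int(match.group(3))
    sub = sequence[start - 1:end]          # EMBL positions are 1-based, inclusive
    if match.group(1):                     # complement(...): reverse complement
        sub = sub.translate(COMPLEMENT)[::-1]
    return sub


def find_subsequences(entry_text, feature_key, qualifier, phrase):
    """Match a multi-word phrase inside one qualifier of one feature type."""
    features, sequence = parse_embl_entry(entry_text)
    wanted = phrase.lower()
    hits = []
    for feat in features:
        value = feat["qualifiers"].get(qualifier, "")
        if feat["key"] == feature_key and wanted in value.lower():
            sub = extract(sequence, feat["location"])
            if sub is not None:
                hits.append((feat["location"], sub))
    return hits


# Hypothetical, hand-made entry used only to exercise the sketch.
SAMPLE_ENTRY = """\
ID   X00001; SV 1; linear; genomic DNA; STD; PRO; 24 BP.
FT   source          1..24
FT                   /organism="Example organism"
FT   rRNA            5..12
FT                   /product="16S ribosomal RNA"
FT   rRNA            complement(15..20)
FT                   /product="23S ribosomal RNA"
SQ   Sequence 24 BP;
     aaaaccccgg ggttttacgt acgt                                          24
//
"""

print(find_subsequences(SAMPLE_ENTRY, "rRNA", "product", "16S ribosomal RNA"))
```

Because the sketch reads the whole entry rather than a prebuilt index, the phrase is matched against the exact qualifier chosen (here `product`) within the exact feature type chosen (here `rRNA`), which is the kind of precision the description attributes to script-based parsing.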
