
Approaching the taxonomic affiliation of unidentified sequences in public databases – an example from the mycorrhizal fungi

BACKGROUND: During the last few years, DNA sequence analysis has become one of the primary means of taxonomic identification of species, particularly so for species that are minute or otherwise lack distinct, readily obtainable morphological characters. Although the number of sequences available for...


Bibliographic Details
Main Authors: Nilsson, R Henrik; Kristiansson, Erik; Ryberg, Martin; Larsson, Karl-Henrik
Format: Text
Language: English
Published: BioMed Central 2005
Subjects: Software
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1186019/
https://www.ncbi.nlm.nih.gov/pubmed/16022740
http://dx.doi.org/10.1186/1471-2105-6-178
_version_ 1782124747684839424
author Nilsson, R Henrik
Kristiansson, Erik
Ryberg, Martin
Larsson, Karl-Henrik
author_sort Nilsson, R Henrik
collection PubMed
description BACKGROUND: During the last few years, DNA sequence analysis has become one of the primary means of taxonomic identification of species, particularly for species that are minute or otherwise lack distinct, readily obtainable morphological characters. Although the number of sequences available for comparison in public databases such as GenBank increases exponentially, only a minuscule fraction of all organisms have been sequenced, which leaves taxon sampling a major problem for sequence-based taxonomic identification. When GenBank is queried with a set of unidentified sequences, a considerable proportion typically lack fully identified matches. These form an ever-growing pile of sequences that the researcher has to monitor manually in the hope that other researchers have submitted new, clarifying sequences. To alleviate these concerns, a project was initiated to automatically monitor selected unidentified sequences in GenBank for taxonomic progress through repeated local BLAST searches. Mycorrhizal fungi, a field where species identification is often prohibitively complex, and the much-used ITS locus were chosen as a test bed.
RESULTS: A Perl script package called emerencia is presented. On a regular basis, it downloads selected sequences from GenBank, separates the identified sequences from those insufficiently identified, and performs BLAST searches between these two datasets, storing all results in an SQL database. On the accompanying web service, users can monitor the taxonomic progress of insufficiently identified sequences over time, either through active searches or by signing up for e-mail notification when better matches appear. Other search categories, such as listing all insufficiently identified sequences (and their present best fully identified matches) by publication, are also available.
DISCUSSION: The ever-increasing use of DNA sequences for identification purposes rests largely on the assumption that public sequence databases contain a thorough sampling of taxonomically well-annotated sequences. Taxonomy, held by some to be an old-fashioned trade, has accordingly never been more important. emerencia does not automate the taxonomic process, but it does free researchers from countless manual BLAST runs and the arduous sifting of BLAST hit lists. The emerencia system is available on an open source basis for local installation with any organism and gene group as targets.
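The workflow outlined under RESULTS (split the downloaded sequences into identified and insufficiently identified sets, then keep track of the best identified match for each query over repeated runs) can be sketched roughly as below. This is a minimal Python illustration and not the emerencia code, which the abstract says is written in Perl. The name-based rule for "insufficiently identified", the table layout, the use of SQLite, and every function name are assumptions made for the example. BLAST itself is not run here; the caller is assumed to supply hit scores from a separate search step.

```python
import re
import sqlite3

# Assumed (hypothetical) markers of an insufficiently identified organism name.
VAGUE = re.compile(r"\b(uncultured|unidentified|environmental|sp\.|cf\.|aff\.)",
                   re.IGNORECASE)


def is_fully_identified(organism: str) -> bool:
    """True if the name has at least two words and no vague qualifier."""
    if VAGUE.search(organism):
        return False
    return len(organism.split()) >= 2


def partition(records):
    """Split (accession, organism) pairs into identified / insufficient lists."""
    identified, insufficient = [], []
    for accession, organism in records:
        if is_fully_identified(organism):
            identified.append(accession)
        else:
            insufficient.append(accession)
    return identified, insufficient


def open_store(path=":memory:"):
    """Open an SQLite store with a hypothetical best_hit table."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS best_hit ("
        "query_acc TEXT, hit_acc TEXT, hit_organism TEXT, "
        "score REAL, run_date TEXT)"
    )
    return conn


def record_best_hit(conn, query_acc, hit_acc, hit_organism, score, run_date):
    """Store the hit only if it beats the best score seen so far for the query.

    Returns True when a better match was stored, which is the point at
    which a monitoring service could notify a subscribed user.
    """
    row = conn.execute(
        "SELECT MAX(score) FROM best_hit WHERE query_acc = ?", (query_acc,)
    ).fetchone()
    if row[0] is not None and score <= row[0]:
        return False
    conn.execute(
        "INSERT INTO best_hit VALUES (?, ?, ?, ?, ?)",
        (query_acc, hit_acc, hit_organism, score, run_date),
    )
    return True
```

In use, `partition` would be applied to each fresh download, the insufficiently identified set would be searched against the identified set, and each top hit would be passed to `record_best_hit`. Keeping every improvement as a new row, rather than overwriting, preserves the history of taxonomic progress for each query sequence.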
format Text
id pubmed-1186019
institution National Center for Biotechnology Information
language English
publishDate 2005
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-1186019 2005-08-16 Approaching the taxonomic affiliation of unidentified sequences in public databases – an example from the mycorrhizal fungi Nilsson, R Henrik Kristiansson, Erik Ryberg, Martin Larsson, Karl-Henrik BMC Bioinformatics Software BioMed Central 2005-07-18 /pmc/articles/PMC1186019/ /pubmed/16022740 http://dx.doi.org/10.1186/1471-2105-6-178 Text en Copyright © 2005 Nilsson et al; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
title Approaching the taxonomic affiliation of unidentified sequences in public databases – an example from the mycorrhizal fungi
topic Software
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1186019/
https://www.ncbi.nlm.nih.gov/pubmed/16022740
http://dx.doi.org/10.1186/1471-2105-6-178