
k-link EST clustering: evaluating error introduced by chimeric sequences under different degrees of linkage

Motivation: The clustering of expressed sequence tags (ESTs) is a crucial step in many sequence analysis studies that require a high level of redundancy. Chimeric sequences, while uncommon, can make achieving the optimal EST clustering a challenge. Single-linkage algorithms are particularly vulnerab...

Bibliographic Details
Main authors: Bragg, Lauren M.; Stone, Glenn
Format: Text
Language: English
Published: Oxford University Press 2009
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2735666/
https://www.ncbi.nlm.nih.gov/pubmed/19570806
http://dx.doi.org/10.1093/bioinformatics/btp410
author Bragg, Lauren M.
Stone, Glenn
author_facet Bragg, Lauren M.
Stone, Glenn
author_sort Bragg, Lauren M.
collection PubMed
description Motivation: The clustering of expressed sequence tags (ESTs) is a crucial step in many sequence analysis studies that require a high level of redundancy. Chimeric sequences, while uncommon, can make achieving the optimal EST clustering a challenge. Single-linkage algorithms are particularly vulnerable to the effects of chimeras. To avoid chimera-facilitated erroneous merges, researchers using single-linkage algorithms are forced to use stringent sequence-similarity thresholds. Such thresholds reduce the sensitivity of the clustering algorithm. Results: We introduce the concept of k-link clustering for EST data. We evaluate how clustering error rates vary over a range of linkage thresholds. Using k-link, we show that Type II error decreases in response to increasing the number of shared ESTs (i.e. links) required. We observe a base level of Type II error likely caused by the presence of unmasked low-complexity or repetitive sequence. We find that Type I error increases gradually with increased linkage. To minimize the Type I error introduced by increased linkage requirements, we propose an extension to k-link which modifies the required number of links with respect to the size of clusters being compared. Availability: k-link is written in C++, is licensed under the GNU General Public License (GPL), and can be downloaded from http://www.bioinformatics.csiro.au/products.shtml. Contact: lauren.bragg@csiro.au Supplementary information: Supplementary data are available at Bioinformatics online.
format Text
id pubmed-2735666
institution National Center for Biotechnology Information
language English
publishDate 2009
publisher Oxford University Press
record_format MEDLINE/PubMed
spelling pubmed-2735666 2009-09-02 k-link EST clustering: evaluating error introduced by chimeric sequences under different degrees of linkage Bragg, Lauren M. Stone, Glenn Bioinformatics Original Papers
Oxford University Press 2009-09-15 2009-07-01 /pmc/articles/PMC2735666/ /pubmed/19570806 http://dx.doi.org/10.1093/bioinformatics/btp410 Text en http://creativecommons.org/licenses/by-nc/2.0/uk/ This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
spellingShingle Original Papers
Bragg, Lauren M.
Stone, Glenn
k-link EST clustering: evaluating error introduced by chimeric sequences under different degrees of linkage
title k-link EST clustering: evaluating error introduced by chimeric sequences under different degrees of linkage
topic Original Papers
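
The abstract describes k-link as requiring a minimum number of links before clusters are merged, with an extension that adjusts that number to the sizes of the clusters being compared. The short Python sketch below illustrates only that merge decision. It is not the authors' C++ implementation. It assumes that a "link" is a pair of ESTs, one from each cluster, that passes the similarity threshold. The function names `count_links` and `should_merge` are hypothetical. The `size_aware` rule (capping the requirement at the size of the smaller cluster) is one possible reading of the extension, not a formula given in the record.

```python
def count_links(cluster_a, cluster_b, similar_pairs):
    """Count EST pairs (one member from each cluster) that are marked as similar.

    Assumption: a 'link' is a single pairwise similarity hit between clusters.
    """
    links = {frozenset(pair) for pair in similar_pairs}
    return sum(
        1
        for x in cluster_a
        for y in cluster_b
        if frozenset((x, y)) in links
    )


def should_merge(cluster_a, cluster_b, similar_pairs, k, size_aware=False):
    """Return True if the two clusters have enough links to be merged.

    With k=1 this reduces to the single-linkage criterion.
    size_aware=True applies a hypothetical size-dependent rule: never require
    more links than the smaller cluster has members.
    """
    needed = min(k, len(cluster_a), len(cluster_b)) if size_aware else k
    return count_links(cluster_a, cluster_b, similar_pairs) >= needed


# Two clusters joined only through one chimeric EST (hypothetical identifiers).
cluster_a = {"a1", "a2", "chimera"}
cluster_b = {"b1", "b2", "b3"}
similar_pairs = [("chimera", "b1")]

print(should_merge(cluster_a, cluster_b, similar_pairs, k=1))  # single linkage
print(should_merge(cluster_a, cluster_b, similar_pairs, k=2))  # 2-link
```

In this toy case the single chimeric hit is enough to merge the clusters under single linkage (k=1) but not when two links are required. That is the behaviour the abstract attributes to raising the linkage threshold.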