k-link EST clustering: evaluating error introduced by chimeric sequences under different degrees of linkage
Motivation: The clustering of expressed sequence tags (ESTs) is a crucial step in many sequence analysis studies that require a high level of redundancy. Chimeric sequences, while uncommon, can make achieving the optimal EST clustering a challenge. Single-linkage algorithms are particularly vulnerab...
Main authors: | Bragg, Lauren M., Stone, Glenn |
---|---|
Format: | Text |
Language: | English |
Published: | Oxford University Press, 2009 |
Subjects: | |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2735666/ https://www.ncbi.nlm.nih.gov/pubmed/19570806 http://dx.doi.org/10.1093/bioinformatics/btp410 |
_version_ | 1782171271613644800 |
---|---|
author | Bragg, Lauren M. Stone, Glenn |
author_facet | Bragg, Lauren M. Stone, Glenn |
author_sort | Bragg, Lauren M. |
collection | PubMed |
description | Motivation: The clustering of expressed sequence tags (ESTs) is a crucial step in many sequence analysis studies that require a high level of redundancy. Chimeric sequences, while uncommon, can make achieving the optimal EST clustering a challenge. Single-linkage algorithms are particularly vulnerable to the effects of chimeras. To avoid chimera-facilitated erroneous merges, researchers using single-linkage algorithms are forced to use stringent sequence-similarity thresholds. Such thresholds reduce the sensitivity of the clustering algorithm. Results: We introduce the concept of k-link clustering for EST data. We evaluate how clustering error rates vary over a range of linkage thresholds. Using k-link, we show that Type II error decreases in response to increasing the number of shared ESTs (i.e. links) required. We observe a base level of Type II error likely caused by the presence of unmasked low-complexity or repetitive sequence. We find that Type I error increases gradually with increased linkage. To minimize the Type I error introduced by increased linkage requirements, we propose an extension to k-link which modifies the required number of links with respect to the size of clusters being compared. Availability: k-link is written in C++, is licensed under the GNU General Public License (GPL), and can be downloaded from http://www.bioinformatics.csiro.au/products.shtml. Contact: lauren.bragg@csiro.au Supplementary information: Supplementary data are available at Bioinformatics online. |
format | Text |
id | pubmed-2735666 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2009 |
publisher | Oxford University Press |
record_format | MEDLINE/PubMed |
spelling | pubmed-2735666 2009-09-02 k-link EST clustering: evaluating error introduced by chimeric sequences under different degrees of linkage Bragg, Lauren M. Stone, Glenn Bioinformatics Original Papers Motivation: The clustering of expressed sequence tags (ESTs) is a crucial step in many sequence analysis studies that require a high level of redundancy. Chimeric sequences, while uncommon, can make achieving the optimal EST clustering a challenge. Single-linkage algorithms are particularly vulnerable to the effects of chimeras. To avoid chimera-facilitated erroneous merges, researchers using single-linkage algorithms are forced to use stringent sequence-similarity thresholds. Such thresholds reduce the sensitivity of the clustering algorithm. Results: We introduce the concept of k-link clustering for EST data. We evaluate how clustering error rates vary over a range of linkage thresholds. Using k-link, we show that Type II error decreases in response to increasing the number of shared ESTs (i.e. links) required. We observe a base level of Type II error likely caused by the presence of unmasked low-complexity or repetitive sequence. We find that Type I error increases gradually with increased linkage. To minimize the Type I error introduced by increased linkage requirements, we propose an extension to k-link which modifies the required number of links with respect to the size of clusters being compared. Availability: k-link is written in C++, is licensed under the GNU General Public License (GPL), and can be downloaded from http://www.bioinformatics.csiro.au/products.shtml. Contact: lauren.bragg@csiro.au Supplementary information: Supplementary data are available at Bioinformatics online. 
Oxford University Press 2009-09-15 2009-07-01 /pmc/articles/PMC2735666/ /pubmed/19570806 http://dx.doi.org/10.1093/bioinformatics/btp410 Text en http://creativecommons.org/licenses/by-nc/2.0/uk/ This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited. |
spellingShingle | Original Papers Bragg, Lauren M. Stone, Glenn k-link EST clustering: evaluating error introduced by chimeric sequences under different degrees of linkage |
title | k-link EST clustering: evaluating error introduced by chimeric sequences under different degrees of linkage |
title_full | k-link EST clustering: evaluating error introduced by chimeric sequences under different degrees of linkage |
title_fullStr | k-link EST clustering: evaluating error introduced by chimeric sequences under different degrees of linkage |
title_full_unstemmed | k-link EST clustering: evaluating error introduced by chimeric sequences under different degrees of linkage |
title_short | k-link EST clustering: evaluating error introduced by chimeric sequences under different degrees of linkage |
title_sort | k-link est clustering: evaluating error introduced by chimeric sequences under different degrees of linkage |
topic | Original Papers |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2735666/ https://www.ncbi.nlm.nih.gov/pubmed/19570806 http://dx.doi.org/10.1093/bioinformatics/btp410 |
work_keys_str_mv | AT bragglaurenm klinkestclusteringevaluatingerrorintroducedbychimericsequencesunderdifferentdegreesoflinkage AT stoneglenn klinkestclusteringevaluatingerrorintroducedbychimericsequencesunderdifferentdegreesoflinkage |
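
Illustrative note on the description field: the abstract contrasts single-linkage clustering with k-link clustering, in which a number of links is required before clusters are joined. The sketch below is a toy Python illustration of that idea, not the authors' C++ implementation. It rests on several assumptions that are not in the record: a "link" is taken to be a pair of ESTs that already passed some similarity threshold, two clusters merge when at least k such pairs connect them, and the requirement is capped at the number of possible pairs so that singletons can still merge. The cap is only a stand-in for the size-dependent extension the abstract mentions; the names `k_link_cluster`, `count_links`, `ESTS` and `PAIRS` are hypothetical.

```python
from itertools import combinations


def count_links(a, b, links):
    """Count linked EST pairs with one member in cluster a and one in cluster b."""
    return sum(1 for x in a for y in b if frozenset((x, y)) in links)


def k_link_cluster(ests, pairs, k):
    """Greedy agglomerative clustering that merges two clusters only when
    at least k links connect them (assumed reading of 'k-link').

    The requirement is capped at len(a) * len(b), the number of possible
    pairs, so small clusters are not blocked from ever merging. This cap
    is an assumption made for this sketch.
    """
    links = {frozenset(p) for p in pairs}
    clusters = [frozenset([e]) for e in ests]
    changed = True
    while changed:
        changed = False
        for a, b in combinations(clusters, 2):
            required = min(k, len(a) * len(b))
            if count_links(a, b, links) >= required:
                # Replace the two clusters with their union and rescan.
                clusters = [c for c in clusters if c not in (a, b)] + [a | b]
                changed = True
                break
    return sorted(sorted(c) for c in clusters)


# Toy data (hypothetical): three ESTs from gene A, three from gene B,
# and one chimeric EST "c" that is similar to one EST from each gene.
ESTS = ["a1", "a2", "a3", "b1", "b2", "b3", "c"]
PAIRS = [
    ("a1", "a2"), ("a1", "a3"), ("a2", "a3"),
    ("b1", "b2"), ("b1", "b3"), ("b2", "b3"),
    ("c", "a3"), ("c", "b1"),
]

if __name__ == "__main__":
    print(k_link_cluster(ESTS, PAIRS, 1))  # k = 1 behaves like single linkage
    print(k_link_cluster(ESTS, PAIRS, 2))  # k = 2 needs two links to merge
```

With k = 1 the chimeric EST bridges the two genes and everything collapses into one cluster; with k = 2 the single link from "c" to the gene B cluster is not enough, so the two genes stay apart. The result of a greedy scan like this can depend on the order in which pairs are visited, which is one reason to treat it only as a sketch.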