OD-seq: outlier detection in multiple sequence alignments

Bibliographic Details
Main Authors: Jehl, Peter; Sievers, Fabian; Higgins, Desmond G.
Format: Online Article Text
Language: English
Published: BioMed Central 2015
Subjects: Research Article
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4548304/
https://www.ncbi.nlm.nih.gov/pubmed/26303676
http://dx.doi.org/10.1186/s12859-015-0702-1
author Jehl, Peter
Sievers, Fabian
Higgins, Desmond G.
collection PubMed
description BACKGROUND: Multiple sequence alignments (MSAs) are widely used in sequence analysis for a variety of tasks. Outlier sequences can make downstream analyses unreliable or make the alignments less accurate while they are being constructed. This paper describes a simple method for automatically detecting outliers and accompanying software called OD-seq. It is based on finding sequences whose average distance to the rest of the sequences in a dataset is anomalous. RESULTS: The software can take an MSA, a distance matrix or a set of unaligned sequences as input. Outlier sequences are found by examining the average distance of each sequence to the rest. Anomalous average distances are then found using the interquartile range of the distribution of average distances or by bootstrapping them. The complexity of any analysis of a distance matrix is normally at least O(N^2) for N sequences. This is prohibitive for large N but is reduced here by using the mBed algorithm from Clustal Omega. This reduces the complexity to O(N log(N)), which makes even very large alignments easy to analyse on a single core. We tested the ability of OD-seq to detect outliers using artificial test cases of sequences from Pfam families, seeded with sequences from other Pfam families. Using an MSA as input, OD-seq is able to detect outliers with very high sensitivity and specificity. CONCLUSION: OD-seq is a practical and simple method to detect outliers in MSAs. It can also detect outliers in sets of unaligned sequences, but with reduced accuracy. For medium-sized alignments of a few thousand sequences, it can detect outliers in a few seconds. The software is available at http://www.bioinf.ucd.ie/download/od-seq.tar.gz.
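The interquartile-range step described in the abstract can be illustrated with a small sketch. This is not the OD-seq implementation: the function names, the toy distance matrix, the Tukey-style multiplier `k = 1.5`, and the choice to flag only unusually large average distances are all assumptions made for illustration. It also uses a full N x N matrix, so it does not reproduce the mBed-based O(N log(N)) speed-up.

```python
from statistics import quantiles


def mean_distances(dist):
    """Average distance from each sequence to all the others.

    `dist` is assumed to be a symmetric N x N matrix (list of lists)
    with zeros on the diagonal.
    """
    n = len(dist)
    return [
        sum(row[j] for j in range(n) if j != i) / (n - 1)
        for i, row in enumerate(dist)
    ]


def iqr_outliers(dist, k=1.5):
    """Indices whose average distance exceeds Q3 + k * IQR.

    k = 1.5 is the common Tukey convention, used here as an assumption;
    the source does not state which multiplier OD-seq applies.
    """
    avg = mean_distances(dist)
    q1, _, q3 = quantiles(avg, n=4, method="inclusive")
    threshold = q3 + k * (q3 - q1)
    return [i for i, a in enumerate(avg) if a > threshold]


# Toy data: five "sequences" placed close together on a line and one far away.
# The pairwise distance is simply the absolute difference of the positions.
points = [0.0, 1.0, 2.0, 3.0, 4.0, 50.0]
dist = [[abs(p - q) for q in points] for p in points]

outliers = iqr_outliers(dist)
print(outliers)
```

In this toy case only the isolated point at index 5 is flagged. A bootstrap-based threshold, the alternative the abstract mentions, would replace the `Q3 + k * IQR` cut-off with one estimated by resampling the average distances.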
format Online
Article
Text
id pubmed-4548304
institution National Center for Biotechnology Information
language English
publishDate 2015
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-4548304 2015-08-26 OD-seq: outlier detection in multiple sequence alignments. Jehl, Peter; Sievers, Fabian; Higgins, Desmond G. BMC Bioinformatics. Research Article.
BioMed Central 2015-08-25 /pmc/articles/PMC4548304/ /pubmed/26303676 http://dx.doi.org/10.1186/s12859-015-0702-1 Text en © Jehl et al. 2015 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
title OD-seq: outlier detection in multiple sequence alignments
topic Research Article