Cargando…

cath-resolve-hits: a new tool that resolves domain matches suspiciously quickly

MOTIVATION: Many bioinformatics areas require us to assign domain matches onto stretches of a query protein. Starting with a set of candidate matches, we want to identify the optimal subset that has limited/no overlap between matches. This may be further complicated by discontinuous domains in the i...

Descripción completa

Detalles Bibliográficos
Autores principales: Lewis, T E, Sillitoe, I, Lees, J G
Formato: Online Artículo Texto
Lenguaje:English
Publicado: Oxford University Press 2019
Materias:
Acceso en línea:https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6513158/
https://www.ncbi.nlm.nih.gov/pubmed/30295745
http://dx.doi.org/10.1093/bioinformatics/bty863
Descripción
Sumario:MOTIVATION: Many bioinformatics areas require us to assign domain matches onto stretches of a query protein. Starting with a set of candidate matches, we want to identify the optimal subset that has limited/no overlap between matches. This may be further complicated by discontinuous domains in the input data. Existing tools are increasingly facing very large data-sets for which they require prohibitive amounts of CPU-time and memory. RESULTS: We present cath-resolve-hits (CRH), a new tool that uses a dynamic-programming algorithm implemented in open-source C++ to handle large datasets quickly (up to ∼1 million hits/second) and in reasonable amounts of memory. It accepts multiple input formats and provides its output in plain text, JSON or graphical HTML. We describe a benchmark against an existing algorithm, which shows CRH delivers very similar or slightly improved results and very much improved CPU/memory performance on large datasets. AVAILABILITY AND IMPLEMENTATION: CRH is available at https://github.com/UCLOrengoGroup/cath-tools; documentation is available at http://cath-tools.readthedocs.io. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.