
Snekmer: a scalable pipeline for protein sequence fingerprinting based on amino acid recoding


Bibliographic Details
Main authors: Chang, Christine H, Nelson, William C, Jerger, Abby, Wright, Aaron T, Egbert, Robert G, McDermott, Jason E
Format: Online Article Text
Language: English
Published: Oxford University Press 2023
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9913046/
https://www.ncbi.nlm.nih.gov/pubmed/36789294
http://dx.doi.org/10.1093/bioadv/vbad005
author Chang, Christine H
Nelson, William C
Jerger, Abby
Wright, Aaron T
Egbert, Robert G
McDermott, Jason E
collection PubMed
description MOTIVATION: The vast expansion of sequence data generated from single organisms and microbiomes has precipitated the need for faster and more sensitive methods to assess evolutionary and functional relationships between proteins. Representing proteins as sets of short peptide sequences (kmers) has been used for rapid, accurate classification of proteins into functional categories; however, this approach employs an exact-match methodology and thus may be limited in terms of sensitivity and coverage. We have previously used similarity groupings, based on the chemical properties of amino acids, to form reduced character sets and recode proteins. This amino acid recoding (AAR) approach simplifies the construction of protein representations in the form of kmer vectors, which can link sequences with distant sequence similarity and provide accurate classification of problematic protein families. RESULTS: Here, we describe Snekmer, a software tool for recoding proteins into AAR kmer vectors and performing either (i) construction of supervised classification models trained on input protein families or (ii) clustering for de novo determination of protein families. We provide examples of the operation of the tool against a set of nitrogen cycling families originally collected using both standard hidden Markov models and a larger set of proteins from Uniprot and demonstrate that our method accurately differentiates these sequences in both operation modes. AVAILABILITY AND IMPLEMENTATION: Snekmer is written in Python using Snakemake. Code and data used in this article, along with tutorial notebooks, are available at http://github.com/PNNL-CompBio/Snekmer under an open-source BSD-3 license. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics Advances online.
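The recoding idea in the abstract can be illustrated with a minimal Python sketch. It is not Snekmer's API or code: the five-group alphabet in `GROUPS`, the helper names `recode` and `kmer_vector`, and the choice of k = 3 are all assumptions made for illustration. The sketch maps each amino acid to a reduced character based on a rough chemical grouping, then counts kmers over the recoded sequence to form a fixed-length vector.

```python
from collections import Counter
from itertools import product

# Hypothetical reduced alphabet, chosen for illustration only.
# These are NOT the groupings distributed with Snekmer.
GROUPS = {
    "H": "AVLIMFWC",  # hydrophobic
    "P": "STNQYG",    # polar / small
    "B": "KRH",       # basic
    "A": "DE",        # acidic
    "S": "P",         # proline kept as its own class
}

# Invert the grouping into a per-residue lookup table.
RECODE = {aa: code for code, members in GROUPS.items() for aa in members}


def recode(seq):
    """Map a protein sequence onto the reduced alphabet.

    Raises KeyError for residues outside the 20 standard amino acids.
    """
    return "".join(RECODE[aa] for aa in seq.upper())


def kmer_vector(seq, k=3):
    """Return (basis, counts): every possible recoded kmer and its count."""
    recoded = recode(seq)
    counts = Counter(recoded[i:i + k] for i in range(len(recoded) - k + 1))
    basis = ["".join(p) for p in product(sorted(GROUPS), repeat=k)]
    return basis, [counts[kmer] for kmer in basis]


if __name__ == "__main__":
    # Two sequences that differ at every position recode identically.
    print(recode("MKVLDE"))  # HBHHAA
    print(recode("LRIVED"))  # HBHHAA
    basis, vec = kmer_vector("MKVLDE")
    print(len(basis), sum(vec))  # 125 4
```

In this toy case the two peptides share no exact 3-mers in the 20-letter alphabet, yet their recoded kmer vectors are identical, which is the kind of distant similarity an exact-match kmer method would miss. The vector has 5^3 = 125 entries rather than 20^3 = 8000, which is why a reduced alphabet keeps the representation compact.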
format Online
Article
Text
id pubmed-9913046
institution National Center for Biotechnology Information
language English
publishDate 2023
publisher Oxford University Press
record_format MEDLINE/PubMed
spelling pubmed-9913046 2023-02-13 Bioinform Adv Original Article Oxford University Press 2023-02-02 /pmc/articles/PMC9913046/ /pubmed/36789294 http://dx.doi.org/10.1093/bioadv/vbad005 Text en © The Author(s) 2023. Published by Oxford University Press. This is an Open Access article distributed under the terms of the Creative Commons Attribution-NonCommercial License (https://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact journals.permissions@oup.com
title Snekmer: a scalable pipeline for protein sequence fingerprinting based on amino acid recoding
topic Original Article