Snekmer: a scalable pipeline for protein sequence fingerprinting based on amino acid recoding
MOTIVATION: The vast expansion of sequence data generated from single organisms and microbiomes has precipitated the need for faster and more sensitive methods to assess evolutionary and functional relationships between proteins. Representing proteins as sets of short peptide sequences (kmers) has b...
Main Authors: | Chang, Christine H; Nelson, William C; Jerger, Abby; Wright, Aaron T; Egbert, Robert G; McDermott, Jason E |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Oxford University Press 2023 |
Subjects: | |
Online Access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9913046/ https://www.ncbi.nlm.nih.gov/pubmed/36789294 http://dx.doi.org/10.1093/bioadv/vbad005 |
_version_ | 1784885333344649216 |
---|---|
author | Chang, Christine H; Nelson, William C; Jerger, Abby; Wright, Aaron T; Egbert, Robert G; McDermott, Jason E |
author_sort | Chang, Christine H |
collection | PubMed |
description | MOTIVATION: The vast expansion of sequence data generated from single organisms and microbiomes has precipitated the need for faster and more sensitive methods to assess evolutionary and functional relationships between proteins. Representing proteins as sets of short peptide sequences (kmers) has been used for rapid, accurate classification of proteins into functional categories; however, this approach employs an exact-match methodology and thus may be limited in terms of sensitivity and coverage. We have previously used similarity groupings, based on the chemical properties of amino acids, to form reduced character sets and recode proteins. This amino acid recoding (AAR) approach simplifies the construction of protein representations in the form of kmer vectors, which can link sequences with distant sequence similarity and provide accurate classification of problematic protein families. RESULTS: Here, we describe Snekmer, a software tool for recoding proteins into AAR kmer vectors and performing either (i) construction of supervised classification models trained on input protein families or (ii) clustering for de novo determination of protein families. We provide examples of the operation of the tool against a set of nitrogen cycling families originally collected using both standard hidden Markov models and a larger set of proteins from Uniprot and demonstrate that our method accurately differentiates these sequences in both operation modes. AVAILABILITY AND IMPLEMENTATION: Snekmer is written in Python using Snakemake. Code and data used in this article, along with tutorial notebooks, are available at http://github.com/PNNL-CompBio/Snekmer under an open-source BSD-3 license. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics Advances online. |
format | Online Article Text |
id | pubmed-9913046 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2023 |
publisher | Oxford University Press |
record_format | MEDLINE/PubMed |
spelling | pubmed-9913046 2023-02-13 Snekmer: a scalable pipeline for protein sequence fingerprinting based on amino acid recoding Chang, Christine H; Nelson, William C; Jerger, Abby; Wright, Aaron T; Egbert, Robert G; McDermott, Jason E Bioinform Adv Original Article Oxford University Press 2023-02-02 /pmc/articles/PMC9913046/ /pubmed/36789294 http://dx.doi.org/10.1093/bioadv/vbad005 Text en © The Author(s) 2023. Published by Oxford University Press. This is an Open Access article distributed under the terms of the Creative Commons Attribution-NonCommercial License (https://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact journals.permissions@oup.com |
title | Snekmer: a scalable pipeline for protein sequence fingerprinting based on amino acid recoding |
topic | Original Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9913046/ https://www.ncbi.nlm.nih.gov/pubmed/36789294 http://dx.doi.org/10.1093/bioadv/vbad005 |
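
The amino acid recoding (AAR) idea summarized in the description can be illustrated with a short sketch. This is not Snekmer's code or API: the three-group alphabet, the function names (`recode`, `kmer_counts`, `shared_kmers`) and the toy sequences are all assumptions made for illustration. It shows only the general principle that two peptides with no exact kmers in common may still share kmers once residues are mapped to a reduced character set.

```python
from collections import Counter

# Hypothetical three-letter reduced alphabet, chosen only for illustration.
# It is NOT claimed to be one of the recoding schemes used by Snekmer.
GROUPS = {
    "h": "AVLIMFWC",  # hydrophobic residues
    "p": "STNQYGPH",  # polar / small residues
    "c": "DEKR",      # charged residues
}
AA_TO_GROUP = {aa: label for label, aas in GROUPS.items() for aa in aas}


def recode(seq: str) -> str:
    """Map each residue to its group label; unknown residues become 'x'."""
    return "".join(AA_TO_GROUP.get(aa, "x") for aa in seq.upper())


def kmer_counts(seq: str, k: int) -> Counter:
    """Count every overlapping substring of length k in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def shared_kmers(a: str, b: str, k: int) -> set:
    """Return the set of kmers that occur in both sequences."""
    return set(kmer_counts(a, k)) & set(kmer_counts(b, k))


# Toy peptides (made up for this example, not real protein data).
seq_a = "MKVLA"
seq_b = "LRIVA"

exact = shared_kmers(seq_a, seq_b, 3)                     # exact-match kmers
recoded = shared_kmers(recode(seq_a), recode(seq_b), 3)   # AAR kmers

print("exact 3-mers in common:  ", sorted(exact))
print("recoded 3-mers in common:", sorted(recoded))
```

With these toy inputs the exact-match comparison finds nothing in common, while the recoded strings are identical, which is the sensitivity gain the abstract attributes to recoding. A real pipeline would turn the counts into fixed-length vectors over all possible reduced-alphabet kmers before classification or clustering.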