E-SNPs&GO: embedding of protein sequence and function improves the annotation of human pathogenic variants

MOTIVATION: The advent of massive DNA sequencing technologies is producing a huge number of human single-nucleotide polymorphisms occurring in protein-coding regions and possibly changing their sequences. Discriminating harmful protein variations from neutral ones is one of the crucial challenges in...

Bibliographic Details
Main Authors: Manfredi, Matteo; Savojardo, Castrense; Martelli, Pier Luigi; Casadio, Rita
Format: Online Article Text
Language: English
Published: Oxford University Press 2022
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9710551/
https://www.ncbi.nlm.nih.gov/pubmed/36227117
http://dx.doi.org/10.1093/bioinformatics/btac678
author Manfredi, Matteo
Savojardo, Castrense
Martelli, Pier Luigi
Casadio, Rita
collection PubMed
description MOTIVATION: Massive DNA sequencing technologies are producing a huge number of human single-nucleotide polymorphisms that occur in protein-coding regions and may change the encoded protein sequences. Discriminating harmful protein variations from neutral ones is one of the crucial challenges in precision medicine. Computational tools based on artificial intelligence provide models for protein sequence encoding, bypassing database searches for evolutionary information. We leverage these new encoding schemes for an efficient annotation of protein variants. RESULTS: E-SNPs&GO is a novel method that, given an input protein sequence and a single amino acid variation, predicts whether the variation is disease-related. The method adopts an input encoding based entirely on protein language models and embedding techniques, specifically devised to encode protein sequences and GO functional annotations. We trained our model on a newly generated dataset of 101 146 human protein single amino acid variants in 13 661 proteins, derived from public resources. When tested on a blind set comprising 10 266 variants, our method compares well with recent approaches reported in the literature for the same task, reaching a Matthews Correlation Coefficient score of 0.72. We propose E-SNPs&GO as a suitable, efficient and accurate large-scale annotator of protein variant datasets. AVAILABILITY AND IMPLEMENTATION: The method is available as a webserver at https://esnpsandgo.biocomp.unibo.it. Datasets and predictions are available at https://esnpsandgo.biocomp.unibo.it/datasets. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
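note The abstract reports performance as a Matthews Correlation Coefficient (MCC). As a reading aid, the following is a minimal Python sketch of how MCC is computed from the four cells of a binary confusion matrix. The function name and the example counts are hypothetical and chosen only for illustration; they are not taken from the paper or from its blind test set.

```python
import math


def matthews_corrcoef(tp: int, tn: int, fp: int, fn: int) -> float:
    """Return the Matthews Correlation Coefficient for a binary classifier.

    tp/tn/fp/fn are the confusion-matrix counts (here "positive" would mean
    "disease-related variant"). By the usual convention, 0.0 is returned
    when any marginal total is zero and the ratio is undefined.
    """
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return numerator / denominator if denominator else 0.0


if __name__ == "__main__":
    # Hypothetical counts for illustration only (NOT the paper's results).
    score = matthews_corrcoef(tp=90, tn=80, fp=20, fn=10)
    print(f"MCC = {score:.3f}")
```

MCC ranges from -1 (total disagreement) through 0 (no better than chance) to +1 (perfect prediction). It uses all four confusion-matrix cells, which is why it is often preferred over accuracy when the two classes are unbalanced.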
format Online
Article
Text
id pubmed-9710551
institution National Center for Biotechnology Information
language English
publishDate 2022
publisher Oxford University Press
record_format MEDLINE/PubMed
spelling pubmed-9710551 2022-12-01 E-SNPs&GO: embedding of protein sequence and function improves the annotation of human pathogenic variants. Manfredi, Matteo; Savojardo, Castrense; Martelli, Pier Luigi; Casadio, Rita. Bioinformatics. Original Paper.
Oxford University Press 2022-10-13 /pmc/articles/PMC9710551/ /pubmed/36227117 http://dx.doi.org/10.1093/bioinformatics/btac678 Text en © The Author(s) 2022. Published by Oxford University Press. https://creativecommons.org/licenses/by/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
title E-SNPs&GO: embedding of protein sequence and function improves the annotation of human pathogenic variants
topic Original Paper