CI-SpliceAI—Improving machine learning predictions of disease causing splicing variants using curated alternative splice sites

BACKGROUND: It is estimated that up to 50% of all disease causing variants disrupt splicing. Due to its complexity, our ability to predict which variants disrupt splicing is limited, meaning missed diagnoses for patients. The emergence of machine learning for targeted medicine holds great potential...

Bibliographic Details
Main Authors: Strauch, Yaron; Lord, Jenny; Niranjan, Mahesan; Baralle, Diana
Format: Online Article Text
Language: English
Published: Public Library of Science 2022
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9165884/
https://www.ncbi.nlm.nih.gov/pubmed/35657932
http://dx.doi.org/10.1371/journal.pone.0269159
author Strauch, Yaron
Lord, Jenny
Niranjan, Mahesan
Baralle, Diana
collection PubMed
description BACKGROUND: It is estimated that up to 50% of all disease-causing variants disrupt splicing. Because splicing is complex, our ability to predict which variants disrupt it is limited, which leads to missed diagnoses for patients. The emergence of machine learning for targeted medicine holds great potential to improve prediction of splice-disrupting variants. The recently published SpliceAI algorithm utilises deep neural networks and has been reported to have greater accuracy than other commonly used methods. METHODS AND FINDINGS: The original SpliceAI was trained on splice sites included in primary isoforms combined with novel junctions observed in GTEx data, which might introduce noise and de-correlate the machine learning input from its output. Limiting the training data to only validated and manually annotated primary and alternatively spliced GENCODE sites may improve predictive abilities. All of these gene isoforms were collapsed (aggregated into one pseudo-isoform) and the SpliceAI architecture was retrained (CI-SpliceAI). Predictive performance on a newly curated dataset of 1,316 functionally validated variants from the literature was compared with the original SpliceAI, alongside MMSplice, MaxEntScan, and SQUIRLS. Both SpliceAI algorithms outperformed the other methods, with the original SpliceAI achieving an accuracy of ∼91%, and CI-SpliceAI showing an improvement at ∼92% overall. Predictive accuracy increased in the majority of curated variants. CONCLUSIONS: We show that including only manually annotated alternatively spliced sites in training data improves prediction of clinically relevant variants, and highlight avenues for further performance improvements.
format Online
Article
Text
id pubmed-9165884
institution National Center for Biotechnology Information
language English
publishDate 2022
publisher Public Library of Science
record_format MEDLINE/PubMed
spelling pubmed-9165884 2022-06-05 CI-SpliceAI—Improving machine learning predictions of disease causing splicing variants using curated alternative splice sites Strauch, Yaron Lord, Jenny Niranjan, Mahesan Baralle, Diana PLoS One Research Article
Public Library of Science 2022-06-03 /pmc/articles/PMC9165884/ /pubmed/35657932 http://dx.doi.org/10.1371/journal.pone.0269159 Text en © 2022 Strauch et al https://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
title CI-SpliceAI—Improving machine learning predictions of disease causing splicing variants using curated alternative splice sites
topic Research Article