Gene filtering strategies for machine learning guided biomarker discovery using neonatal sepsis RNA-seq data

Machine learning (ML) algorithms are powerful tools that are increasingly being used for sepsis biomarker discovery in RNA-Seq data. RNA-Seq datasets contain multiple sources and types of noise (operator, technical and non-systematic) that may bias ML classification. Normalisation and independent ge...

Bibliographic Details
Main authors: Parkinson, Edward; Liberatore, Federico; Watkins, W. John; Andrews, Robert; Edkins, Sarah; Hibbert, Julie; Strunk, Tobias; Currie, Andrew; Ghazal, Peter
Format: Online Article Text
Language: English
Published: Frontiers Media S.A. 2023
Subjects: Genetics
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10126415/
https://www.ncbi.nlm.nih.gov/pubmed/37113992
http://dx.doi.org/10.3389/fgene.2023.1158352
_version_ 1785030238640537600
author Parkinson, Edward
Liberatore, Federico
Watkins, W. John
Andrews, Robert
Edkins, Sarah
Hibbert, Julie
Strunk, Tobias
Currie, Andrew
Ghazal, Peter
author_facet Parkinson, Edward
Liberatore, Federico
Watkins, W. John
Andrews, Robert
Edkins, Sarah
Hibbert, Julie
Strunk, Tobias
Currie, Andrew
Ghazal, Peter
author_sort Parkinson, Edward
collection PubMed
description Machine learning (ML) algorithms are powerful tools that are increasingly being used for sepsis biomarker discovery in RNA-Seq data. RNA-Seq datasets contain multiple sources and types of noise (operator, technical and non-systematic) that may bias ML classification. Normalisation and independent gene filtering approaches described in RNA-Seq workflows account for some of this variability and are typically only targeted at differential expression analysis rather than ML applications. Pre-processing normalisation steps significantly reduce the number of variables in the data and thereby increase the power of statistical testing, but can potentially discard valuable and insightful classification features. A systematic assessment of applying transcript-level filtering on the robustness and stability of ML-based RNA-seq classification remains to be fully explored. In this report we examine the impact of filtering out low count transcripts and those with influential outlier read counts on downstream ML analysis for sepsis biomarker discovery using elastic net regularised logistic regression, L1-regularised support vector machines and random forests. We demonstrate that applying a systematic objective strategy for removal of uninformative and potentially biasing biomarkers representing up to 60% of transcripts in different sample size datasets, including two illustrative neonatal sepsis cohorts, leads to substantial improvements in classification performance, higher stability of the resulting gene signatures, and better agreement with previously reported sepsis biomarkers. We also demonstrate that the performance uplift from gene filtering depends on the ML classifier chosen, with L1-regularised support vector machines showing the greatest performance improvements with our experimental data.
format Online
Article
Text
id pubmed-10126415
institution National Center for Biotechnology Information
language English
publishDate 2023
publisher Frontiers Media S.A.
record_format MEDLINE/PubMed
spelling pubmed-10126415 2023-04-26 Gene filtering strategies for machine learning guided biomarker discovery using neonatal sepsis RNA-seq data Parkinson, Edward Liberatore, Federico Watkins, W. John Andrews, Robert Edkins, Sarah Hibbert, Julie Strunk, Tobias Currie, Andrew Ghazal, Peter Front Genet Genetics Machine learning (ML) algorithms are powerful tools that are increasingly being used for sepsis biomarker discovery in RNA-Seq data. RNA-Seq datasets contain multiple sources and types of noise (operator, technical and non-systematic) that may bias ML classification. Normalisation and independent gene filtering approaches described in RNA-Seq workflows account for some of this variability and are typically only targeted at differential expression analysis rather than ML applications. Pre-processing normalisation steps significantly reduce the number of variables in the data and thereby increase the power of statistical testing, but can potentially discard valuable and insightful classification features. A systematic assessment of applying transcript-level filtering on the robustness and stability of ML-based RNA-seq classification remains to be fully explored. In this report we examine the impact of filtering out low count transcripts and those with influential outlier read counts on downstream ML analysis for sepsis biomarker discovery using elastic net regularised logistic regression, L1-regularised support vector machines and random forests. We demonstrate that applying a systematic objective strategy for removal of uninformative and potentially biasing biomarkers representing up to 60% of transcripts in different sample size datasets, including two illustrative neonatal sepsis cohorts, leads to substantial improvements in classification performance, higher stability of the resulting gene signatures, and better agreement with previously reported sepsis biomarkers.
We also demonstrate that the performance uplift from gene filtering depends on the ML classifier chosen, with L1-regularised support vector machines showing the greatest performance improvements with our experimental data. Frontiers Media S.A. 2023-04-11 /pmc/articles/PMC10126415/ /pubmed/37113992 http://dx.doi.org/10.3389/fgene.2023.1158352 Text en Copyright © 2023 Parkinson, Liberatore, Watkins, Andrews, Edkins, Hibbert, Strunk, Currie and Ghazal. https://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
title Gene filtering strategies for machine learning guided biomarker discovery using neonatal sepsis RNA-seq data
topic Genetics
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10126415/
https://www.ncbi.nlm.nih.gov/pubmed/37113992
http://dx.doi.org/10.3389/fgene.2023.1158352