Gene filtering strategies for machine learning guided biomarker discovery using neonatal sepsis RNA-seq data
Machine learning (ML) algorithms are powerful tools that are increasingly being used for sepsis biomarker discovery in RNA-Seq data. RNA-Seq datasets contain multiple sources and types of noise (operator, technical and non-systematic) that may bias ML classification. Normalisation and independent ge...
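Main authors: | Parkinson, Edward; Liberatore, Federico; Watkins, W. John; Andrews, Robert; Edkins, Sarah; Hibbert, Julie; Strunk, Tobias; Currie, Andrew; Ghazal, Peter |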
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Frontiers Media S.A. 2023 |
Subjects: | Genetics |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10126415/ https://www.ncbi.nlm.nih.gov/pubmed/37113992 http://dx.doi.org/10.3389/fgene.2023.1158352 |
_version_ | 1785030238640537600 |
---|---|
author | Parkinson, Edward; Liberatore, Federico; Watkins, W. John; Andrews, Robert; Edkins, Sarah; Hibbert, Julie; Strunk, Tobias; Currie, Andrew; Ghazal, Peter |
author_sort | Parkinson, Edward |
collection | PubMed |
description | Machine learning (ML) algorithms are powerful tools that are increasingly being used for sepsis biomarker discovery in RNA-Seq data. RNA-Seq datasets contain multiple sources and types of noise (operator, technical and non-systematic) that may bias ML classification. Normalisation and independent gene filtering approaches described in RNA-Seq workflows account for some of this variability, but are typically targeted only at differential expression analysis rather than ML applications. Pre-processing normalisation steps significantly reduce the number of variables in the data and thereby increase the power of statistical testing, but can potentially discard valuable and insightful classification features. A systematic assessment of the effect of transcript-level filtering on the robustness and stability of ML-based RNA-Seq classification remains to be fully explored. In this report we examine the impact of filtering out low-count transcripts and those with influential outlier read counts on downstream ML analysis for sepsis biomarker discovery using elastic net regularised logistic regression, L1-regularised support vector machines and random forests. We demonstrate that applying a systematic, objective strategy for removal of uninformative and potentially biasing biomarkers, representing up to 60% of transcripts in datasets of different sample sizes, including two illustrative neonatal sepsis cohorts, leads to substantial improvements in classification performance, higher stability of the resulting gene signatures, and better agreement with previously reported sepsis biomarkers. We also demonstrate that the performance uplift from gene filtering depends on the ML classifier chosen, with L1-regularised support vector machines showing the greatest performance improvements with our experimental data. |
format | Online Article Text |
id | pubmed-10126415 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2023 |
publisher | Frontiers Media S.A. |
record_format | MEDLINE/PubMed |
spelling | pubmed-10126415 2023-04-26 Front Genet Genetics Frontiers Media S.A. 2023-04-11 /pmc/articles/PMC10126415/ /pubmed/37113992 http://dx.doi.org/10.3389/fgene.2023.1158352 Text en Copyright © 2023 Parkinson, Liberatore, Watkins, Andrews, Edkins, Hibbert, Strunk, Currie and Ghazal. https://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms. |
title | Gene filtering strategies for machine learning guided biomarker discovery using neonatal sepsis RNA-seq data |
topic | Genetics |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10126415/ https://www.ncbi.nlm.nih.gov/pubmed/37113992 http://dx.doi.org/10.3389/fgene.2023.1158352 |
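
The description in this record mentions filtering out low-count transcripts before ML classification, but it gives no thresholds or implementation details. The following is a minimal sketch of one common style of low-count filter (keep a gene only if its counts per million reach a cutoff in enough samples). It is not the authors' method: the function names `counts_per_million` and `low_count_filter`, the parameters `min_cpm` and `min_samples`, and the toy matrix are all hypothetical and chosen only for illustration.

```python
def counts_per_million(counts):
    """Scale each sample (row) of raw read counts to counts per million (CPM)."""
    cpm = []
    for sample in counts:
        library_size = sum(sample)
        if library_size == 0:
            # An empty library has no meaningful CPM; treat every gene as 0.
            cpm.append([0.0] * len(sample))
        else:
            cpm.append([c * 1e6 / library_size for c in sample])
    return cpm


def low_count_filter(counts, min_cpm=1.0, min_samples=2):
    """Return column indices of genes with CPM >= min_cpm in >= min_samples samples.

    Both thresholds are illustrative assumptions, not values from the record.
    """
    cpm = counts_per_million(counts)
    n_genes = len(counts[0]) if counts else 0
    kept = []
    for g in range(n_genes):
        n_expressed = sum(1 for sample in cpm if sample[g] >= min_cpm)
        if n_expressed >= min_samples:
            kept.append(g)
    return kept


# Toy matrix: 4 samples (rows) x 5 genes (columns) of raw read counts.
# Library sizes are tiny here, so the CPM cutoff is set artificially high.
toy_counts = [
    [500, 0, 300, 1, 199],
    [450, 0, 350, 0, 200],
    [480, 1, 0, 0, 519],
    [510, 0, 320, 0, 170],
]
kept_genes = low_count_filter(toy_counts, min_cpm=5000.0, min_samples=3)

# Subset the matrix to the retained genes before passing it to a classifier.
filtered = [[sample[g] for g in kept_genes] for sample in toy_counts]
```

In a classification setting, a filter like this would normally be fitted on the training split only and then applied unchanged to held-out samples, so that the choice of retained genes does not leak information from the test data. With real data one would typically hold the counts in a NumPy array or pandas DataFrame rather than nested lists.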