The impact of different negative training data on regulatory sequence predictions

Regulatory regions, like promoters and enhancers, cover an estimated 5–15% of the human genome. Changes to these sequences are thought to underlie much of human phenotypic variation and a substantial proportion of genetic causes of disease. However, our understanding of their functional encoding in...

Bibliographic Details
Main authors: Krützfeldt, Louisa-Marie; Schubach, Max; Kircher, Martin
Format: Online Article Text
Language: English
Published: Public Library of Science 2020
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7707526/
https://www.ncbi.nlm.nih.gov/pubmed/33259518
http://dx.doi.org/10.1371/journal.pone.0237412
author Krützfeldt, Louisa-Marie
Schubach, Max
Kircher, Martin
collection PubMed
description Regulatory regions, like promoters and enhancers, cover an estimated 5–15% of the human genome. Changes to these sequences are thought to underlie much of human phenotypic variation and a substantial proportion of genetic causes of disease. However, our understanding of their functional encoding in DNA is still very limited. Applying machine or deep learning methods can shed light on this encoding and gapped k-mer support vector machines (gkm-SVMs) or convolutional neural networks (CNNs) are commonly trained on putative regulatory sequences. Here, we investigate the impact of negative sequence selection on model performance. By training gkm-SVM and CNN models on open chromatin data and corresponding negative training dataset, both learners and two approaches for negative training data are compared. Negative sets use either genomic background sequences or sequence shuffles of the positive sequences. Model performance was evaluated on three different tasks: predicting elements active in a cell-type, predicting cell-type specific elements, and predicting elements' relative activity as measured from independent experimental data. Our results indicate strong effects of the negative training data, with genomic backgrounds showing overall best results. Specifically, models trained on highly shuffled sequences perform worse on the complex tasks of tissue-specific activity and quantitative activity prediction, and seem to learn features of artificial sequences rather than regulatory activity. Further, we observe that insufficient matching of genomic background sequences results in model biases. While CNNs achieved and exceeded the performance of gkm-SVMs for larger training datasets, gkm-SVMs gave robust and best results for typical training dataset sizes without the need of hyperparameter optimization.
format Online
Article
Text
id pubmed-7707526
institution National Center for Biotechnology Information
language English
publishDate 2020
publisher Public Library of Science
record_format MEDLINE/PubMed
spelling pubmed-7707526
2020-12-08
The impact of different negative training data on regulatory sequence predictions
Krützfeldt, Louisa-Marie
Schubach, Max
Kircher, Martin
PLoS One
Research Article
Public Library of Science
2020-12-01
/pmc/articles/PMC7707526/
/pubmed/33259518
http://dx.doi.org/10.1371/journal.pone.0237412
Text
en
© 2020 Krützfeldt et al
http://creativecommons.org/licenses/by/4.0/
This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
title The impact of different negative training data on regulatory sequence predictions
topic Research Article
work_keys_str_mv AT krutzfeldtlouisamarie theimpactofdifferentnegativetrainingdataonregulatorysequencepredictions
AT schubachmax theimpactofdifferentnegativetrainingdataonregulatorysequencepredictions
AT kirchermartin theimpactofdifferentnegativetrainingdataonregulatorysequencepredictions
AT krutzfeldtlouisamarie impactofdifferentnegativetrainingdataonregulatorysequencepredictions
AT schubachmax impactofdifferentnegativetrainingdataonregulatorysequencepredictions
AT kirchermartin impactofdifferentnegativetrainingdataonregulatorysequencepredictions
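
The description in this record names two ways of building a negative training set: shuffling the positive sequences, or drawing genomic background sequences matched to the positives. The following is a minimal sketch of both ideas, not the authors' procedure. The record does not state how shuffling or matching was done, so the simple nucleotide shuffle, the matching on length and GC content, the tolerance value, the function names, and the toy sequences are all assumptions made for illustration.

```python
import random


def shuffle_negative(seq, rng):
    """Return a random permutation of seq (same nucleotide counts, new order).

    Assumption: a plain mononucleotide shuffle; the record does not say
    which shuffling scheme the study used.
    """
    bases = list(seq)
    rng.shuffle(bases)
    return "".join(bases)


def gc_fraction(seq):
    """Fraction of G and C bases in seq."""
    return sum(base in "GC" for base in seq) / len(seq)


def gc_matched_background(positive, candidates, tolerance=0.05):
    """Keep candidates with the same length as the positive sequence and a
    GC fraction within `tolerance` of it.

    Assumption: length and GC content are the matching criteria; the
    tolerance of 0.05 is an arbitrary illustrative value.
    """
    target = gc_fraction(positive)
    return [
        cand
        for cand in candidates
        if len(cand) == len(positive)
        and abs(gc_fraction(cand) - target) <= tolerance
    ]


# Toy data (hypothetical sequences, not taken from the study).
rng = random.Random(0)
positive = "ACGTGGCCATGCAAGT"
background_pool = [
    "ATATATATATATATAT",  # GC fraction 0.0, rejected
    "ACGGTCCAGTGCAATG",  # GC fraction equal to the positive, kept
    "GGGGCCCCGGGGCCCC",  # GC fraction 1.0, rejected
]

shuffled = shuffle_negative(positive, rng)
matched = gc_matched_background(positive, background_pool)
```

A shuffled negative shares the composition of its positive by construction, whereas a background negative is a real genomic sequence whose similarity to the positive depends on how strict the matching is. The description reports that insufficient matching of background sequences leads to model biases, which is why the matching criteria in a sketch like this would need careful choice in practice.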