The impact of different negative training data on regulatory sequence predictions
Regulatory regions, like promoters and enhancers, cover an estimated 5–15% of the human genome. Changes to these sequences are thought to underlie much of human phenotypic variation and a substantial proportion of genetic causes of disease. However, our understanding of their functional encoding in...
Main authors: | Krützfeldt, Louisa-Marie; Schubach, Max; Kircher, Martin |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Public Library of Science 2020 |
Subjects: | |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7707526/ https://www.ncbi.nlm.nih.gov/pubmed/33259518 http://dx.doi.org/10.1371/journal.pone.0237412 |
_version_ | 1783617367974084608 |
---|---|
author | Krützfeldt, Louisa-Marie Schubach, Max Kircher, Martin |
author_facet | Krützfeldt, Louisa-Marie Schubach, Max Kircher, Martin |
author_sort | Krützfeldt, Louisa-Marie |
collection | PubMed |
description | Regulatory regions, like promoters and enhancers, cover an estimated 5–15% of the human genome. Changes to these sequences are thought to underlie much of human phenotypic variation and a substantial proportion of genetic causes of disease. However, our understanding of their functional encoding in DNA is still very limited. Applying machine or deep learning methods can shed light on this encoding and gapped k-mer support vector machines (gkm-SVMs) or convolutional neural networks (CNNs) are commonly trained on putative regulatory sequences. Here, we investigate the impact of negative sequence selection on model performance. By training gkm-SVM and CNN models on open chromatin data and corresponding negative training dataset, both learners and two approaches for negative training data are compared. Negative sets use either genomic background sequences or sequence shuffles of the positive sequences. Model performance was evaluated on three different tasks: predicting elements active in a cell-type, predicting cell-type specific elements, and predicting elements' relative activity as measured from independent experimental data. Our results indicate strong effects of the negative training data, with genomic backgrounds showing overall best results. Specifically, models trained on highly shuffled sequences perform worse on the complex tasks of tissue-specific activity and quantitative activity prediction, and seem to learn features of artificial sequences rather than regulatory activity. Further, we observe that insufficient matching of genomic background sequences results in model biases. While CNNs achieved and exceeded the performance of gkm-SVMs for larger training datasets, gkm-SVMs gave robust and best results for typical training dataset sizes without the need of hyperparameter optimization. |
format | Online Article Text |
id | pubmed-7707526 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2020 |
publisher | Public Library of Science |
record_format | MEDLINE/PubMed |
spelling | pubmed-7707526 2020-12-08 The impact of different negative training data on regulatory sequence predictions Krützfeldt, Louisa-Marie Schubach, Max Kircher, Martin PLoS One Research Article Regulatory regions, like promoters and enhancers, cover an estimated 5–15% of the human genome. Changes to these sequences are thought to underlie much of human phenotypic variation and a substantial proportion of genetic causes of disease. However, our understanding of their functional encoding in DNA is still very limited. Applying machine or deep learning methods can shed light on this encoding and gapped k-mer support vector machines (gkm-SVMs) or convolutional neural networks (CNNs) are commonly trained on putative regulatory sequences. Here, we investigate the impact of negative sequence selection on model performance. By training gkm-SVM and CNN models on open chromatin data and corresponding negative training dataset, both learners and two approaches for negative training data are compared. Negative sets use either genomic background sequences or sequence shuffles of the positive sequences. Model performance was evaluated on three different tasks: predicting elements active in a cell-type, predicting cell-type specific elements, and predicting elements' relative activity as measured from independent experimental data. Our results indicate strong effects of the negative training data, with genomic backgrounds showing overall best results. Specifically, models trained on highly shuffled sequences perform worse on the complex tasks of tissue-specific activity and quantitative activity prediction, and seem to learn features of artificial sequences rather than regulatory activity. Further, we observe that insufficient matching of genomic background sequences results in model biases. While CNNs achieved and exceeded the performance of gkm-SVMs for larger training datasets, gkm-SVMs gave robust and best results for typical training dataset sizes without the need of hyperparameter optimization. Public Library of Science 2020-12-01 /pmc/articles/PMC7707526/ /pubmed/33259518 http://dx.doi.org/10.1371/journal.pone.0237412 Text en © 2020 Krützfeldt et al http://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. |
spellingShingle | Research Article Krützfeldt, Louisa-Marie Schubach, Max Kircher, Martin The impact of different negative training data on regulatory sequence predictions |
title | The impact of different negative training data on regulatory sequence predictions |
title_full | The impact of different negative training data on regulatory sequence predictions |
title_fullStr | The impact of different negative training data on regulatory sequence predictions |
title_full_unstemmed | The impact of different negative training data on regulatory sequence predictions |
title_short | The impact of different negative training data on regulatory sequence predictions |
title_sort | impact of different negative training data on regulatory sequence predictions |
topic | Research Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7707526/ https://www.ncbi.nlm.nih.gov/pubmed/33259518 http://dx.doi.org/10.1371/journal.pone.0237412 |
work_keys_str_mv | AT krutzfeldtlouisamarie theimpactofdifferentnegativetrainingdataonregulatorysequencepredictions AT schubachmax theimpactofdifferentnegativetrainingdataonregulatorysequencepredictions AT kirchermartin theimpactofdifferentnegativetrainingdataonregulatorysequencepredictions AT krutzfeldtlouisamarie impactofdifferentnegativetrainingdataonregulatorysequencepredictions AT schubachmax impactofdifferentnegativetrainingdataonregulatorysequencepredictions AT kirchermartin impactofdifferentnegativetrainingdataonregulatorysequencepredictions |
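
The description above notes that one kind of negative training set is built from sequence shuffles of the positive sequences. As a rough illustration only, the sketch below permutes the bases of each positive sequence, which keeps length and base composition but destroys the order. This record does not specify the authors' actual shuffling procedure, so the function names, the example sequences, and the choice of a plain single-base shuffle are all assumptions made for illustration.

```python
import random
from collections import Counter


def shuffle_sequence(seq: str, rng: random.Random) -> str:
    """Return a random permutation of the bases in seq.

    Length and base composition are preserved; base order is not.
    (Hypothetical helper, not taken from the article.)
    """
    bases = list(seq)
    rng.shuffle(bases)
    return "".join(bases)


def make_shuffled_negatives(positives, seed=0):
    """Build one shuffled negative sequence per positive sequence."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    return [shuffle_sequence(seq, rng) for seq in positives]


# Made-up example sequences standing in for open chromatin regions.
positives = ["ACGTGACCTGA", "TTGACGCGATA"]
negatives = make_shuffled_negatives(positives)

for pos, neg in zip(positives, negatives):
    # Each negative has the same base counts as its positive.
    print(pos, neg, Counter(pos) == Counter(neg))
```

A plain shuffle like this produces highly artificial sequences, which is the kind of negative set the description reports as performing worse on the more complex tasks than genomic background sequences.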