
What Do Neighbors Tell About You: The Local Context of Cis-Regulatory Modules Complicates Prediction of Regulatory Variants

Many problems of modern genetics and functional genomics require the assessment of functional effects of sequence variants, including gene expression changes. Machine learning is considered to be a promising approach for solving this task, but its practical applications remain a challenge due to the...


Bibliographic Details
Main authors: Penzar, Dmitry D., Zinkevich, Arsenii O., Vorontsov, Ilya E., Sitnik, Vasily V., Favorov, Alexander V., Makeev, Vsevolod J., Kulakovskiy, Ivan V.
Format: Online Article Text
Language: English
Published: Frontiers Media S.A. 2019
Subjects: Genetics
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6834773/
https://www.ncbi.nlm.nih.gov/pubmed/31737053
http://dx.doi.org/10.3389/fgene.2019.01078
author Penzar, Dmitry D.
Zinkevich, Arsenii O.
Vorontsov, Ilya E.
Sitnik, Vasily V.
Favorov, Alexander V.
Makeev, Vsevolod J.
Kulakovskiy, Ivan V.
collection PubMed
description Many problems of modern genetics and functional genomics require the assessment of functional effects of sequence variants, including gene expression changes. Machine learning is considered to be a promising approach for solving this task, but its practical applications remain a challenge due to the insufficient volume and diversity of training data. A promising source of valuable data is a saturation mutagenesis massively parallel reporter assay, which quantitatively measures changes in transcription activity caused by sequence variants. Here, we explore the computational predictions of the effects of individual single-nucleotide variants on gene transcription measured in the massively parallel reporter assays, based on the data from the recent “Regulation Saturation” Critical Assessment of Genome Interpretation challenge. We show that the estimated prediction quality strongly depends on the structure of the training and validation data. Particularly, training on the sequence segments located next to the validation data results in the “information leakage” caused by the local context. This information leakage allows reproducing the prediction quality of the best CAGI challenge submissions with a fairly simple machine learning approach, and even obtaining notably better-than-random predictions using irrelevant genomic regions. Validation scenarios preventing such information leakage dramatically reduce the measured prediction quality. The performance at independent regulatory regions entirely excluded from the training set appears to be much lower than needed for practical applications, and even the performance estimation will become reliable only in the future with richer data from multiple reporters. The source code and data are available at https://bitbucket.org/autosomeru_cagi2018/cagi2018_regsat and https://genomeinterpretation.org/content/expression-variants.
format Online
Article
Text
id pubmed-6834773
institution National Center for Biotechnology Information
language English
publishDate 2019
publisher Frontiers Media S.A.
record_format MEDLINE/PubMed
spelling pubmed-6834773 2019-11-15 Front Genet Genetics Frontiers Media S.A. 2019-10-31 /pmc/articles/PMC6834773/ /pubmed/31737053 http://dx.doi.org/10.3389/fgene.2019.01078 Text en Copyright © 2019 Penzar, Zinkevich, Vorontsov, Sitnik, Favorov, Makeev and Kulakovskiy http://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
title What Do Neighbors Tell About You: The Local Context of Cis-Regulatory Modules Complicates Prediction of Regulatory Variants
topic Genetics
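The description above contrasts validation on variants that sit next to the training data with validation on regulatory regions excluded from training entirely. The following is a minimal toy sketch of that contrast, not the authors' pipeline: the region names, region length, 10 bp window, and all function names are hypothetical and chosen only for illustration. It measures how many validation variants have a training variant nearby in the same region under each split.

```python
import random

def make_variants(regions, length):
    """Build toy variant records (region, position). Names and sizes are hypothetical."""
    return [(region, pos) for region in regions for pos in range(length)]

def random_split(variants, frac=0.2, seed=0):
    """Shuffle all variants and hold out a fraction, ignoring which region they come from."""
    rng = random.Random(seed)
    shuffled = variants[:]
    rng.shuffle(shuffled)
    n_val = int(len(shuffled) * frac)
    return shuffled[n_val:], shuffled[:n_val]

def region_holdout_split(variants, held_out):
    """Hold out one whole region so no training variant shares its local context."""
    train = [v for v in variants if v[0] != held_out]
    val = [v for v in variants if v[0] == held_out]
    return train, val

def leaked_fraction(train, val, window=10):
    """Fraction of validation variants with a training variant in the same region within `window` bp."""
    train_by_region = {}
    for region, pos in train:
        train_by_region.setdefault(region, []).append(pos)
    if not val:
        return 0.0
    leaked = sum(
        1 for region, pos in val
        if any(abs(pos - t) <= window for t in train_by_region.get(region, []))
    )
    return leaked / len(val)

# Hypothetical regions; assumed for the example, not taken from the challenge data.
variants = make_variants(["enhancer_A", "promoter_B", "promoter_C"], 200)

train_r, val_r = random_split(variants)
train_h, val_h = region_holdout_split(variants, "promoter_C")

print("random split, share of validation variants with training neighbors:",
      leaked_fraction(train_r, val_r))
print("region hold-out, share of validation variants with training neighbors:",
      leaked_fraction(train_h, val_h))
```

In this toy setup the hold-out split leaves no validation variant with a same-region training neighbor, whereas the random split leaves most of them with one. Which of these scenarios a reported score comes from is the kind of detail the description suggests checking before comparing prediction quality.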