Context dependency of nucleotide probabilities and variants in human DNA

BACKGROUND: Genomic DNA has been shaped by mutational processes through evolution. The cellular machinery for error correction and repair has left its marks in the nucleotide composition along with structural and functional constraints. Therefore, the probability of observing a base in a certain pos...

Bibliographic Details
Main authors: Liang, Yuhu; Grønbæk, Christian; Fariselli, Piero; Krogh, Anders
Format: Online Article Text
Language: English
Published: BioMed Central 2022
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8802520/
https://www.ncbi.nlm.nih.gov/pubmed/35100973
http://dx.doi.org/10.1186/s12864-021-08246-1
author Liang, Yuhu
Grønbæk, Christian
Fariselli, Piero
Krogh, Anders
collection PubMed
description BACKGROUND: Genomic DNA has been shaped by mutational processes through evolution. The cellular machinery for error correction and repair has left its marks in the nucleotide composition, along with structural and functional constraints. Therefore, the probability of observing a base in a certain position in the human genome is highly context-dependent.
RESULTS: Here we develop context-dependent nucleotide models. We first investigate models of nucleotides conditioned on sequence context. We develop a bidirectional Markov model that uses an average of the probability from a Markov model applied to both strands of the sequence and thus depends on up to 14 bases to each side of the nucleotide. We show how the genome predictability varies across different types of genomic regions. Surprisingly, this model can predict a base from its context with an average of more than 50% accuracy. For somatic variants we show a tendency towards higher probability for the variant base than for the reference base. Inspired by DNA substitution models, we develop a model of mutability that estimates a mutation matrix (called the alpha matrix) on top of the nucleotide distribution. The alpha matrix can be estimated from a much smaller context than the nucleotide model, but the final model will still depend on the full context of the nucleotide model. With the bidirectional Markov model of order 14 and an alpha matrix dependent on just one base to each side, we obtain a model that compares well with a model of mutability that estimates mutation probabilities directly conditioned on three nucleotides to each side. For somatic variants in particular, our model fits better than the simpler model. Interestingly, the model is not very sensitive to the size of the context for the alpha matrix.
CONCLUSIONS: Our study found strong context dependencies of nucleotides in the human genome. The best model uses a context of 14 nucleotides to each side. Based on these models, a substitution model was constructed that separates into the context model and a matrix dependent on a small context. The model fit somatic variants particularly well.
SUPPLEMENTARY INFORMATION: The online version contains supplementary material available at (10.1186/s12864-021-08246-1).
format Online
Article
Text
id pubmed-8802520
institution National Center for Biotechnology Information
language English
publishDate 2022
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-8802520 2022-02-02 Context dependency of nucleotide probabilities and variants in human DNA Liang, Yuhu Grønbæk, Christian Fariselli, Piero Krogh, Anders BMC Genomics Research Article BioMed Central 2022-01-31 /pmc/articles/PMC8802520/ /pubmed/35100973 http://dx.doi.org/10.1186/s12864-021-08246-1 Text en © The Author(s) 2022 https://creativecommons.org/licenses/by/4.0/
Open Access. This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
title Context dependency of nucleotide probabilities and variants in human DNA
topic Research Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8802520/
https://www.ncbi.nlm.nih.gov/pubmed/35100973
http://dx.doi.org/10.1186/s12864-021-08246-1