Big data analysis of human mitochondrial DNA substitution models: a regression approach
BACKGROUND: We study Phylotree, a comprehensive representation of the phylogeny of global human mitochondrial DNA (mtDNA) variations, to better understand the mtDNA substitution mechanism and its most influential factors. We consider a substitution model, where a set of genetic features may predict...
Main authors: | Levinstein Hallak, Keren; Tzur, Shay; Rosset, Saharon |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | BioMed Central 2018 |
Subjects: | |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6195736/ https://www.ncbi.nlm.nih.gov/pubmed/30340456 http://dx.doi.org/10.1186/s12864-018-5123-x |
_version_ | 1783364445991337984 |
---|---|
author | Levinstein Hallak, Keren Tzur, Shay Rosset, Saharon |
author_facet | Levinstein Hallak, Keren Tzur, Shay Rosset, Saharon |
author_sort | Levinstein Hallak, Keren |
collection | PubMed |
description | BACKGROUND: We study Phylotree, a comprehensive representation of the phylogeny of global human mitochondrial DNA (mtDNA) variations, to better understand the mtDNA substitution mechanism and its most influential factors. We consider a substitution model, where a set of genetic features may predict the rate at which mtDNA substitutions occur. To find an appropriate model, an exhaustive analysis on the effect of multiple factors on the substitution rate is performed through Negative Binomial and Poisson regressions. We examine three different inclusion options for each categorical factor: omission, inclusion as an explanatory variable, and by-value partitioning. The examined factors include genes, codon position, a CpG indicator, directionality, nucleotide, amino acid, codon, and context (neighboring nucleotides), in addition to other site based factors. Partitioning a model by a factor’s value results in several sub-models (one for each value), where the likelihoods of the sub-models can be combined to form a score for the entire model. Eventually, the leading models are considered as viable candidates for explaining mtDNA substitution rates. RESULTS: Initially, we introduce a novel clustering technique on genes, based on three similarity tests between pairs of genes, supporting previous results regarding gene functionalities in the mtDNA. These clusters are then used as a factor in our models. We present leading models for the protein coding genes, rRNA and tRNA genes and the control region, showing it is disadvantageous to separate the models of transitions/transversions, or synonymous/non-synonymous substitutions. We identify a context effect that cannot be attributed solely to protein level constraints or CpG pairs. For protein-coding genes, we show that the substitution model should be partitioned into sub-models according to the codon position and input codon; additionally we confirm that gene identity and cluster have no significant effect once the above factors are accounted for. CONCLUSIONS: We leverage the large, high-confidence Phylotree mtDNA phylogeny to develop a new statistical approach. We model the substitution rates using regressions, allowing consideration of many factors simultaneously. This admits the use of model selection tools helping to identify the set of factors best explaining the mutational dynamics when considered in tandem. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (10.1186/s12864-018-5123-x) contains supplementary material, which is available to authorized users. |
format | Online Article Text |
id | pubmed-6195736 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2018 |
publisher | BioMed Central |
record_format | MEDLINE/PubMed |
spelling | pubmed-61957362018-10-30 Big data analysis of human mitochondrial DNA substitution models: a regression approach Levinstein Hallak, Keren Tzur, Shay Rosset, Saharon BMC Genomics Methodology Article BACKGROUND: We study Phylotree, a comprehensive representation of the phylogeny of global human mitochondrial DNA (mtDNA) variations, to better understand the mtDNA substitution mechanism and its most influential factors. We consider a substitution model, where a set of genetic features may predict the rate at which mtDNA substitutions occur. To find an appropriate model, an exhaustive analysis on the effect of multiple factors on the substitution rate is performed through Negative Binomial and Poisson regressions. We examine three different inclusion options for each categorical factor: omission, inclusion as an explanatory variable, and by-value partitioning. The examined factors include genes, codon position, a CpG indicator, directionality, nucleotide, amino acid, codon, and context (neighboring nucleotides), in addition to other site based factors. Partitioning a model by a factor’s value results in several sub-models (one for each value), where the likelihoods of the sub-models can be combined to form a score for the entire model. Eventually, the leading models are considered as viable candidates for explaining mtDNA substitution rates. RESULTS: Initially, we introduce a novel clustering technique on genes, based on three similarity tests between pairs of genes, supporting previous results regarding gene functionalities in the mtDNA. These clusters are then used as a factor in our models. We present leading models for the protein coding genes, rRNA and tRNA genes and the control region, showing it is disadvantageous to separate the models of transitions/transversions, or synonymous/non-synonymous substitutions. We identify a context effect that cannot be attributed solely to protein level constraints or CpG pairs. For protein-coding genes, we show that the substitution model should be partitioned into sub-models according to the codon position and input codon; additionally we confirm that gene identity and cluster have no significant effect once the above factors are accounted for. CONCLUSIONS: We leverage the large, high-confidence Phylotree mtDNA phylogeny to develop a new statistical approach. We model the substitution rates using regressions, allowing consideration of many factors simultaneously. This admits the use of model selection tools helping to identify the set of factors best explaining the mutational dynamics when considered in tandem. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (10.1186/s12864-018-5123-x) contains supplementary material, which is available to authorized users. BioMed Central 2018-10-19 /pmc/articles/PMC6195736/ /pubmed/30340456 http://dx.doi.org/10.1186/s12864-018-5123-x Text en © The Author(s). 2018 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated. |
spellingShingle | Methodology Article Levinstein Hallak, Keren Tzur, Shay Rosset, Saharon Big data analysis of human mitochondrial DNA substitution models: a regression approach |
title | Big data analysis of human mitochondrial DNA substitution models: a regression approach |
title_full | Big data analysis of human mitochondrial DNA substitution models: a regression approach |
title_fullStr | Big data analysis of human mitochondrial DNA substitution models: a regression approach |
title_full_unstemmed | Big data analysis of human mitochondrial DNA substitution models: a regression approach |
title_short | Big data analysis of human mitochondrial DNA substitution models: a regression approach |
title_sort | big data analysis of human mitochondrial dna substitution models: a regression approach |
topic | Methodology Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6195736/ https://www.ncbi.nlm.nih.gov/pubmed/30340456 http://dx.doi.org/10.1186/s12864-018-5123-x |
work_keys_str_mv | AT levinsteinhallakkeren bigdataanalysisofhumanmitochondrialdnasubstitutionmodelsaregressionapproach AT tzurshay bigdataanalysisofhumanmitochondrialdnasubstitutionmodelsaregressionapproach AT rossetsaharon bigdataanalysisofhumanmitochondrialdnasubstitutionmodelsaregressionapproach |
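
Note on the description field: the abstract says that partitioning a model by a factor's value yields one sub-model per value, and that the sub-model likelihoods can be combined into a score for the whole model. The sketch below illustrates that idea only and is not the authors' implementation. It makes several assumptions that are not in the record: each sub-model is an intercept-only Poisson model (whose maximum-likelihood rate is the sample mean), the combined score is AIC, and the field names `codon_pos` and `count`, along with the toy counts, are invented for illustration.

```python
import math
from collections import defaultdict


def poisson_loglik(counts):
    """Maximised log-likelihood of an intercept-only Poisson model.

    For this model the maximum-likelihood rate is the sample mean.
    """
    rate = sum(counts) / len(counts)
    if rate == 0:
        return 0.0  # all counts are zero, so the likelihood is exactly 1
    return sum(k * math.log(rate) - rate - math.lgamma(k + 1) for k in counts)


def aic(loglik, n_params):
    """Akaike information criterion (assumed score; lower is better)."""
    return 2 * n_params - 2 * loglik


def partitioned_fit(sites, factor):
    """Fit one sub-model per value of `factor` and combine them.

    The sub-models use disjoint sets of sites, so their log-likelihoods
    add up. Returns (combined log-likelihood, number of parameters).
    """
    groups = defaultdict(list)
    for site in sites:
        groups[site[factor]].append(site["count"])
    loglik = sum(poisson_loglik(c) for c in groups.values())
    return loglik, len(groups)  # one rate parameter per sub-model


# Hypothetical toy data: substitution counts per site (not from the paper).
sites = [
    {"codon_pos": 1, "count": 1}, {"codon_pos": 1, "count": 0},
    {"codon_pos": 1, "count": 2}, {"codon_pos": 2, "count": 0},
    {"codon_pos": 2, "count": 0}, {"codon_pos": 2, "count": 1},
    {"codon_pos": 3, "count": 5}, {"codon_pos": 3, "count": 7},
    {"codon_pos": 3, "count": 6},
]

# Option 1: omit the factor (a single pooled rate for all sites).
pooled_ll = poisson_loglik([s["count"] for s in sites])
pooled_aic = aic(pooled_ll, 1)

# Option 2: partition by the factor's value and sum the sub-model likelihoods.
part_ll, n_params = partitioned_fit(sites, "codon_pos")
part_aic = aic(part_ll, n_params)

print(f"pooled:      loglik={pooled_ll:.2f}  AIC={pooled_aic:.2f}")
print(f"partitioned: loglik={part_ll:.2f}  AIC={part_aic:.2f}")
```

Partitioning can never lower the maximised log-likelihood, so a penalised score such as AIC is what lets the pooled and partitioned options be compared fairly. The abstract's third option, including the factor as an explanatory variable, and its Negative Binomial regressions would need a full regression fit and are not shown here.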