Big data analysis of human mitochondrial DNA substitution models: a regression approach

BACKGROUND: We study Phylotree, a comprehensive representation of the phylogeny of global human mitochondrial DNA (mtDNA) variations, to better understand the mtDNA substitution mechanism and its most influential factors. We consider a substitution model, where a set of genetic features may predict...

Bibliographic Details
Main authors: Levinstein Hallak, Keren; Tzur, Shay; Rosset, Saharon
Format: Online Article Text
Language: English
Published: BioMed Central 2018
Subjects: Methodology Article
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6195736/
https://www.ncbi.nlm.nih.gov/pubmed/30340456
http://dx.doi.org/10.1186/s12864-018-5123-x
author Levinstein Hallak, Keren
Tzur, Shay
Rosset, Saharon
collection PubMed
description BACKGROUND: We study Phylotree, a comprehensive representation of the phylogeny of global human mitochondrial DNA (mtDNA) variations, to better understand the mtDNA substitution mechanism and its most influential factors. We consider a substitution model, where a set of genetic features may predict the rate at which mtDNA substitutions occur. To find an appropriate model, an exhaustive analysis on the effect of multiple factors on the substitution rate is performed through Negative Binomial and Poisson regressions. We examine three different inclusion options for each categorical factor: omission, inclusion as an explanatory variable, and by-value partitioning. The examined factors include genes, codon position, a CpG indicator, directionality, nucleotide, amino acid, codon, and context (neighboring nucleotides), in addition to other site-based factors. Partitioning a model by a factor’s value results in several sub-models (one for each value), where the likelihoods of the sub-models can be combined to form a score for the entire model. Eventually, the leading models are considered as viable candidates for explaining mtDNA substitution rates.
RESULTS: Initially, we introduce a novel clustering technique on genes, based on three similarity tests between pairs of genes, supporting previous results regarding gene functionalities in the mtDNA. These clusters are then used as a factor in our models. We present leading models for the protein-coding genes, rRNA and tRNA genes and the control region, showing it is disadvantageous to separate the models of transitions/transversions, or synonymous/non-synonymous substitutions. We identify a context effect that cannot be attributed solely to protein-level constraints or CpG pairs. For protein-coding genes, we show that the substitution model should be partitioned into sub-models according to the codon position and input codon; additionally we confirm that gene identity and cluster have no significant effect once the above factors are accounted for.
CONCLUSIONS: We leverage the large, high-confidence Phylotree mtDNA phylogeny to develop a new statistical approach. We model the substitution rates using regressions, allowing consideration of many factors simultaneously. This admits the use of model selection tools helping to identify the set of factors best explaining the mutational dynamics when considered in tandem.
ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (10.1186/s12864-018-5123-x) contains supplementary material, which is available to authorized users.
format Online
Article
Text
id pubmed-6195736
institution National Center for Biotechnology Information
language English
publishDate 2018
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-6195736 2018-10-30 Big data analysis of human mitochondrial DNA substitution models: a regression approach Levinstein Hallak, Keren Tzur, Shay Rosset, Saharon BMC Genomics Methodology Article BioMed Central 2018-10-19 /pmc/articles/PMC6195736/ /pubmed/30340456 http://dx.doi.org/10.1186/s12864-018-5123-x Text en
© The Author(s). 2018 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
title Big data analysis of human mitochondrial DNA substitution models: a regression approach
topic Methodology Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6195736/
https://www.ncbi.nlm.nih.gov/pubmed/30340456
http://dx.doi.org/10.1186/s12864-018-5123-x
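The abstract names three inclusion options for a categorical factor (omission, inclusion as an explanatory variable, and by-value partitioning) and says the sub-model likelihoods of a partitioned model can be combined into one score. The sketch below illustrates that idea on simulated data only. The data, the function names (`fit_poisson`, `compare_inclusion_options`), the choice of codon position and a CpG indicator as the two inputs, and the use of AIC as the combined score are all assumptions made for illustration, not details taken from the article. Only the Poisson case is shown, and the log-likelihoods of the sub-models are combined by summation.

```python
import math
import numpy as np


def fit_poisson(X, y, n_iter=100, tol=1e-10):
    """Fit a log-link Poisson regression by Newton-Raphson.

    X must carry an intercept in column 0. Returns the maximised log-likelihood.
    """
    beta = np.zeros(X.shape[1])
    beta[0] = np.log(y.mean())  # start from the intercept-only fit
    for _ in range(n_iter):
        mu = np.exp(X @ beta)
        grad = X.T @ (y - mu)            # score vector
        hess = X.T @ (X * mu[:, None])   # Fisher information matrix
        step = np.linalg.solve(hess, grad)
        beta += step
        if np.max(np.abs(step)) < tol:
            break
    mu = np.exp(X @ beta)
    log_fact = np.array([math.lgamma(v + 1.0) for v in y])
    return float(np.sum(y * np.log(mu) - mu - log_fact))


def aic(loglik, n_params):
    """Akaike information criterion (an assumed choice of model score)."""
    return 2.0 * n_params - 2.0 * loglik


def compare_inclusion_options(factor, covariate, y):
    """Score the three ways of handling one categorical factor."""
    levels = np.unique(factor)
    base = np.column_stack([np.ones_like(covariate), covariate])

    # 1. Omission: the factor is ignored entirely.
    ll_omit = fit_poisson(base, y)

    # 2. Explanatory variable: one dummy column per non-reference level.
    dummies = [(factor == lv).astype(float) for lv in levels[1:]]
    X_expl = np.column_stack([base] + dummies)
    ll_expl = fit_poisson(X_expl, y)

    # 3. By-value partitioning: one sub-model per level; the sub-model
    #    log-likelihoods and parameter counts are summed.
    ll_part = 0.0
    for lv in levels:
        mask = factor == lv
        ll_part += fit_poisson(base[mask], y[mask])
    k_part = base.shape[1] * len(levels)

    return {
        "omitted": (ll_omit, aic(ll_omit, base.shape[1])),
        "explanatory": (ll_expl, aic(ll_expl, X_expl.shape[1])),
        "partitioned": (ll_part, aic(ll_part, k_part)),
    }


# Simulated data (hypothetical): the CpG effect differs by codon position,
# so a partitioned model has something real to capture.
rng = np.random.default_rng(0)
n_sites = 3000
codon_pos = rng.integers(1, 4, size=n_sites)
cpg = rng.integers(0, 2, size=n_sites).astype(float)
baseline = {1: -0.5, 2: -1.0, 3: 0.8}
cpg_effect = {1: 0.2, 2: 0.1, 3: 0.9}
log_rate = np.array([baseline[p] + cpg_effect[p] * c for p, c in zip(codon_pos, cpg)])
counts = rng.poisson(np.exp(log_rate)).astype(float)

scores = compare_inclusion_options(codon_pos, cpg, counts)
for name, (loglik, score) in scores.items():
    print(f"{name:12s} loglik={loglik:10.2f} AIC={score:10.2f}")
```

Because the three options are nested, the maximised log-likelihood can only increase from omission to explanatory variable to partitioning; a penalised score such as AIC is what lets the richer model lose when the extra parameters are not worth it.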