Statistical modelling of CG interdistance across multiple organisms

BACKGROUND: Statistical approaches to genetic sequences have proved helpful for gaining deeper insight into biological and structural functionalities, using ideas from information theory and stochastic modelling of symbolic sequences. In particular, previous analyses on CG dinucleotide position...

Bibliographic Details
Main authors: A., Merlotti; Valle I., Faria do; G., Castellani; D., Remondini
Format: Online Article Text
Language: English
Published: BioMed Central 2018
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6191944/
https://www.ncbi.nlm.nih.gov/pubmed/30367587
http://dx.doi.org/10.1186/s12859-018-2303-2
_version_ 1783363812995366912
author A., Merlotti
Valle I., Faria do
G., Castellani
D., Remondini
author_facet A., Merlotti
Valle I., Faria do
G., Castellani
D., Remondini
author_sort A., Merlotti
collection PubMed
description BACKGROUND: Statistical approaches to genetic sequences have proved helpful for gaining deeper insight into biological and structural functionalities, using ideas from information theory and stochastic modelling of symbolic sequences. In particular, previous analyses of CG dinucleotide positions along the genome highlighted the epigenetic role of CG in DNA methylation, showing a distribution tail different from that of other dinucleotides. In this paper we extend the analysis to the whole CG distance distribution over a selected set of higher-order organisms. We then apply the best-fitting probability density function to a large range of organisms (>4400) of different complexity (from bacteria to mammals) and characterize some emerging global features. RESULTS: We find that the Gamma distribution is optimal for the selected subset when compared with a group of several distributions, chosen for their physical meaning or because they were recently used in the literature for similar studies. The parameters of this distribution, when fitted to our larger set of organisms, highlight some biologically relevant features of the considered organism classes, which can also be useful for classification purposes. CONCLUSIONS: The quantification of statistical properties of CG dinucleotide positioning along the genome is confirmed as a useful tool to characterize broad classes of organisms, spanning the whole range of biological complexity. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (10.1186/s12859-018-2303-2) contains supplementary material, which is available to authorized users.
format Online
Article
Text
id pubmed-6191944
institution National Center for Biotechnology Information
language English
publishDate 2018
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-6191944 2018-10-23 Statistical modelling of CG interdistance across multiple organisms A., Merlotti Valle I., Faria do G., Castellani D., Remondini BMC Bioinformatics Research BACKGROUND: Statistical approaches to genetic sequences have proved helpful for gaining deeper insight into biological and structural functionalities, using ideas from information theory and stochastic modelling of symbolic sequences. In particular, previous analyses of CG dinucleotide positions along the genome highlighted the epigenetic role of CG in DNA methylation, showing a distribution tail different from that of other dinucleotides. In this paper we extend the analysis to the whole CG distance distribution over a selected set of higher-order organisms. We then apply the best-fitting probability density function to a large range of organisms (>4400) of different complexity (from bacteria to mammals) and characterize some emerging global features. RESULTS: We find that the Gamma distribution is optimal for the selected subset when compared with a group of several distributions, chosen for their physical meaning or because they were recently used in the literature for similar studies. The parameters of this distribution, when fitted to our larger set of organisms, highlight some biologically relevant features of the considered organism classes, which can also be useful for classification purposes. CONCLUSIONS: The quantification of statistical properties of CG dinucleotide positioning along the genome is confirmed as a useful tool to characterize broad classes of organisms, spanning the whole range of biological complexity. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (10.1186/s12859-018-2303-2) contains supplementary material, which is available to authorized users.
BioMed Central 2018-10-15 /pmc/articles/PMC6191944/ /pubmed/30367587 http://dx.doi.org/10.1186/s12859-018-2303-2 Text en © The Author(s) 2018 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
spellingShingle Research
A., Merlotti
Valle I., Faria do
G., Castellani
D., Remondini
Statistical modelling of CG interdistance across multiple organisms
title Statistical modelling of CG interdistance across multiple organisms
title_full Statistical modelling of CG interdistance across multiple organisms
title_fullStr Statistical modelling of CG interdistance across multiple organisms
title_full_unstemmed Statistical modelling of CG interdistance across multiple organisms
title_short Statistical modelling of CG interdistance across multiple organisms
title_sort statistical modelling of cg interdistance across multiple organisms
topic Research
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6191944/
https://www.ncbi.nlm.nih.gov/pubmed/30367587
http://dx.doi.org/10.1186/s12859-018-2303-2
work_keys_str_mv AT amerlotti statisticalmodellingofcginterdistanceacrossmultipleorganisms
AT valleifariado statisticalmodellingofcginterdistanceacrossmultipleorganisms
AT gcastellani statisticalmodellingofcginterdistanceacrossmultipleorganisms
AT dremondini statisticalmodellingofcginterdistanceacrossmultipleorganisms