A model of k-mer surprisal to quantify local sequence information content surrounding splice regions

Molecular sequences carry information. Analysis of sequence conservation between homologous loci is a proven approach with which to explore the information content of molecular sequences. This is often done using multiple sequence alignments to support comparisons between homologous loci. These meth...

Bibliographic Details
Main Authors:	Humphrey, Sam; Kerr, Alastair; Rattray, Magnus; Dive, Caroline; Miller, Crispin J.
Format:	Online Article Text
Language:	English
Published:	PeerJ Inc. 2020
Subjects:	Computational Biology
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7648452/ https://www.ncbi.nlm.nih.gov/pubmed/33194378 http://dx.doi.org/10.7717/peerj.10063

author	Humphrey, Sam; Kerr, Alastair; Rattray, Magnus; Dive, Caroline; Miller, Crispin J.
author_sort	Humphrey, Sam
collection	PubMed
description	Molecular sequences carry information. Analysis of sequence conservation between homologous loci is a proven approach with which to explore the information content of molecular sequences. This is often done using multiple sequence alignments to support comparisons between homologous loci. These methods therefore rely on sufficient underlying sequence similarity with which to construct a representative alignment. Here we describe a method using a formal metric of information, surprisal, to analyse biological sub-sequences without alignment constraints. We applied our model to the genomes of five different species to reveal similar patterns across a panel of eukaryotes. As the surprisal of a sub-sequence is inversely proportional to its occurrence within the genome, the optimal size of the sub-sequences was selected for each species under consideration. With the model optimized, we found a strong correlation between surprisal and CG dinucleotide usage. The utility of our model was tested by examining the sequences of genes known to undergo splicing. We demonstrate that our model can identify biological features of interest such as known donor and acceptor sites. Analysis across all annotated coding exon junctions in Homo sapiens reveals the information content of coding exons to be greater than the surrounding intron regions, a consequence of increased suppression of the CG dinucleotide in intronic space. Sequences within coding regions proximal to exon junctions exhibited novel patterns within DNA and coding mRNA that are not a function of the encoded amino acid sequence. Our findings are consistent with the presence of secondary information encoding features such as DNA and RNA binding sites, multiplexed through the coding sequence and independent of the information required to define the corresponding amino-acid sequence. We conclude that surprisal provides a complementary methodology with which to locate regions of interest in the genome, particularly in situations that lack an appropriate multiple sequence alignment.
format	Online Article Text
id	pubmed-7648452
institution	National Center for Biotechnology Information
language	English
publishDate	2020
publisher	PeerJ Inc.
record_format	MEDLINE/PubMed
spelling	pubmed-7648452 2020-11-12 A model of k-mer surprisal to quantify local sequence information content surrounding splice regions Humphrey, Sam; Kerr, Alastair; Rattray, Magnus; Dive, Caroline; Miller, Crispin J. PeerJ Computational Biology PeerJ Inc. 2020-11-04 /pmc/articles/PMC7648452/ /pubmed/33194378 http://dx.doi.org/10.7717/peerj.10063 Text en ©2020 Humphrey et al. https://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ) and either DOI or URL of the article must be cited.
title	A model of k-mer surprisal to quantify local sequence information content surrounding splice regions
topic	Computational Biology
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7648452/ https://www.ncbi.nlm.nih.gov/pubmed/33194378 http://dx.doi.org/10.7717/peerj.10063

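The abstract describes scoring sub-sequences (k-mers) by their surprisal, which rises as a k-mer becomes rarer in the genome. As a minimal sketch only, the following assumes the standard Shannon definition, surprisal = -log2(p), with p estimated as a k-mer's share of all overlapping k-mers in a reference sequence. The function names, the toy reference string, and the choice k = 2 are hypothetical and are not taken from the article, whose actual model may differ.

```python
import math
from collections import Counter


def kmer_counts(sequence, k):
    """Count every overlapping k-mer in `sequence` (sliding window, step 1)."""
    seq = sequence.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def kmer_surprisal(counts):
    """Map each observed k-mer to its surprisal in bits: -log2(count / total)."""
    total = sum(counts.values())
    return {kmer: -math.log2(n / total) for kmer, n in counts.items()}


def surprisal_profile(sequence, k, surprisal):
    """Surprisal at each window position; None where the k-mer was never
    observed in the reference (its surprisal is undefined here)."""
    seq = sequence.upper()
    return [surprisal.get(seq[i:i + k]) for i in range(len(seq) - k + 1)]


# Toy reference standing in for a genome (hypothetical data, not from the article).
reference = "ATATATATCGATATATAT"
counts = kmer_counts(reference, k=2)
table = kmer_surprisal(counts)

# Profile a short query: the rare CG dinucleotide scores higher than the common AT.
profile = surprisal_profile("ATCGAT", 2, table)
```

In this sketch a rare k-mer such as CG receives a higher score than a frequent one such as AT, which is the property the abstract relies on when it contrasts exons with surrounding introns. Choosing k per species, as the abstract mentions, is not attempted here.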