
Significant non-existence of sequences in genomes and proteomes

Minimal absent words (MAWs) are minimal-length oligomers absent from a genome or proteome. Although some artificially synthesized MAWs have deleterious effects, there is still a lack of a strategy for the classification of non-occurring sequences as potentially malicious or benign. In this work, by...

Bibliographic Details
Main authors: Koulouras, Grigorios, Frith, Martin C
Format: Online Article Text
Language: English
Published: Oxford University Press 2021
Subjects: Computational Biology
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8034619/
https://www.ncbi.nlm.nih.gov/pubmed/33693858
http://dx.doi.org/10.1093/nar/gkab139
author Koulouras, Grigorios
Frith, Martin C
collection PubMed
description Minimal absent words (MAWs) are minimal-length oligomers absent from a genome or proteome. Although some artificially synthesized MAWs have deleterious effects, there is still a lack of a strategy for the classification of non-occurring sequences as potentially malicious or benign. In this work, by using Markovian models with multiple-testing correction, we reveal significant absent oligomers, which are statistically expected to exist. This suggests that their absence is due to negative selection. We survey genomes and proteomes covering the diversity of life and find thousands of significant absent sequences. Common significant MAWs are often mono- or dinucleotide tracts, or palindromic. Significant viral MAWs are often restriction sites and may indicate unknown restriction motifs. Surprisingly, significant mammal genome MAWs are often present, but rare, in other mammals, suggesting that they are suppressed but not completely forbidden. Significant human MAWs are frequently present in prokaryotes, suggesting immune function, but rarely present in human viruses, indicating viral mimicry of the host. More than one-fourth of human proteins are one substitution away from containing a significant MAW, with the majority of replacements being predicted harmful. We provide a web-based, interactive database of significant MAWs across genomes and proteomes.
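The definition and the significance test described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract only states that Markovian models with multiple-testing correction are used, so the specific choices here (a maximal-order Markov estimate of the expected count, a Poisson approximation for the probability of zero occurrences, and a Bonferroni correction) are assumptions made for illustration, as are all function names. The brute-force enumeration is only practical for short words and small alphabets.

```python
from collections import Counter
from itertools import product
import math


def kmer_counts(seq, k):
    """Count every length-k substring of seq (overlapping windows)."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def minimal_absent_words(seq, k, alphabet="ACGT"):
    """Brute-force MAWs of length k (k >= 3): the word is absent from seq,
    but its longest proper prefix and longest proper suffix both occur."""
    present_k = set(kmer_counts(seq, k))
    present_sub = set(kmer_counts(seq, k - 1))
    maws = []
    for letters in product(alphabet, repeat=k):
        w = "".join(letters)
        if w not in present_k and w[:-1] in present_sub and w[1:] in present_sub:
            maws.append(w)
    return maws


def expected_count(seq, w):
    """Maximal-order Markov estimate (an assumed model choice):
    N(prefix) * N(suffix) / N(core), where core = w without its end letters."""
    k = len(w)
    sub = kmer_counts(seq, k - 1)
    core = kmer_counts(seq, k - 2)
    return sub[w[:-1]] * sub[w[1:]] / core[w[1:-1]]


def significant_maws(seq, k, alphabet="ACGT", alpha=0.05):
    """Keep MAWs whose absence is unlikely under the model.
    Poisson P(zero occurrences) = exp(-expected); Bonferroni over all MAWs
    tested. Both are illustrative assumptions."""
    maws = minimal_absent_words(seq, k, alphabet)
    hits = []
    for w in maws:
        p = math.exp(-expected_count(seq, w))
        if p * len(maws) <= alpha:
            hits.append((w, p))
    return hits
```

On a toy two-letter sequence such as `"AABAAB"`, `minimal_absent_words("AABAAB", 3, "AB")` finds the absent words whose shorter parts all occur, and `significant_maws` reports none of them because the sequence is far too short for any absence to be surprising.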
format Online
Article
Text
id pubmed-8034619
institution National Center for Biotechnology Information
language English
publishDate 2021
publisher Oxford University Press
record_format MEDLINE/PubMed
spelling pubmed-8034619 2021-04-14 Significant non-existence of sequences in genomes and proteomes Koulouras, Grigorios Frith, Martin C Nucleic Acids Res Computational Biology Oxford University Press 2021-03-10 /pmc/articles/PMC8034619/ /pubmed/33693858 http://dx.doi.org/10.1093/nar/gkab139 Text en © The Author(s) 2021. Published by Oxford University Press on behalf of Nucleic Acids Research.
http://creativecommons.org/licenses/by/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
title Significant non-existence of sequences in genomes and proteomes
topic Computational Biology