
NullSeq: A Tool for Generating Random Coding Sequences with Desired Amino Acid and GC Contents

The existence of over- and under-represented sequence motifs in genomes provides evidence of selective evolutionary pressures on biological mechanisms such as transcription, translation, ligand-substrate binding, and host immunity. In order to accurately identify motifs and other genome-scale patter...


Bibliographic Details
Main authors: Liu, Sophia S., Hockenberry, Adam J., Lancichinetti, Andrea, Jewett, Michael C., Amaral, Luís A. N.
Format: Online Article Text
Language: English
Published: Public Library of Science 2016
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5106001/
https://www.ncbi.nlm.nih.gov/pubmed/27835644
http://dx.doi.org/10.1371/journal.pcbi.1005184
author Liu, Sophia S.
Hockenberry, Adam J.
Lancichinetti, Andrea
Jewett, Michael C.
Amaral, Luís A. N.
collection PubMed
description The existence of over- and under-represented sequence motifs in genomes provides evidence of selective evolutionary pressures on biological mechanisms such as transcription, translation, ligand-substrate binding, and host immunity. In order to accurately identify motifs and other genome-scale patterns of interest, it is essential to be able to generate accurate null models that are appropriate for the sequences under study. While many tools have been developed to create random nucleotide sequences, protein coding sequences are subject to a unique set of constraints that complicates the process of generating appropriate null models. There are currently no tools available that allow users to create random coding sequences with specified amino acid composition and GC content for the purpose of hypothesis testing. Using the principle of maximum entropy, we developed a method that generates unbiased random sequences with pre-specified amino acid and GC content, which we have developed into a python package. Our method is the simplest way to obtain maximally unbiased random sequences that are subject to GC usage and primary amino acid sequence constraints. Furthermore, this approach can easily be expanded to create unbiased random sequences that incorporate more complicated constraints such as individual nucleotide usage or even di-nucleotide frequencies. The ability to generate correctly specified null models will allow researchers to accurately identify sequence motifs which will lead to a better understanding of biological processes as well as more effective engineering of biological systems.
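The maximum-entropy idea in the description above can be illustrated with a minimal sketch. This is not the NullSeq package or its API; every function name below (`codon_weights`, `expected_gc`, `solve_lambda`, `random_cds`) is hypothetical, and the sketch assumes the standard genetic code and a fixed amino acid sequence rather than an amino acid composition. Under a constraint on expected GC content, the maximum-entropy choice among synonymous codons takes an exponential form, with probability proportional to exp(lambda * GC count), and lambda can be found by bisection.

```python
import math
import random
from collections import defaultdict

# Standard genetic code (NCBI translation table 1), codons ordered over T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
SYNONYMS = defaultdict(list)  # amino acid -> list of synonymous codons
for i, aa in enumerate(AMINO):
    SYNONYMS[aa].append(BASES[i // 16] + BASES[(i // 4) % 4] + BASES[i % 4])


def gc_count(codon):
    """Number of G or C bases in a codon (0 to 3)."""
    return sum(base in "GC" for base in codon)


def codon_weights(aa, lam):
    """Codons for `aa` with probabilities proportional to exp(lam * GC count)."""
    codons = SYNONYMS[aa]
    weights = [math.exp(lam * gc_count(c)) for c in codons]
    total = sum(weights)
    return codons, [w / total for w in weights]


def expected_gc(protein, lam):
    """Expected GC fraction of a coding sequence for `protein` at a given lam."""
    gc = 0.0
    for aa in protein:
        codons, probs = codon_weights(aa, lam)
        gc += sum(p * gc_count(c) for c, p in zip(codons, probs))
    return gc / (3 * len(protein))


def solve_lambda(protein, target_gc, lo=-20.0, hi=20.0, iters=100):
    """Bisection on lam; expected GC is non-decreasing in lam.

    Assumption: target_gc is achievable for this protein. If it is not,
    the result saturates near one of the bounds.
    """
    for _ in range(iters):
        mid = (lo + hi) / 2
        if expected_gc(protein, mid) < target_gc:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def random_cds(protein, target_gc, rng=random):
    """Sample one coding sequence for `protein` with expected GC near target_gc."""
    lam = solve_lambda(protein, target_gc)
    out = []
    for aa in protein:
        codons, probs = codon_weights(aa, lam)
        out.append(rng.choices(codons, weights=probs)[0])
    return "".join(out)


if __name__ == "__main__":
    print(random_cds("MKTAYIAKQR", 0.5, random.Random(0)))
```

Each sampled sequence encodes the input protein exactly; the GC content of any single sample fluctuates around the target, and only the expectation is constrained. Extending the exponent with further terms (for example, per-nucleotide counts) would be one way to add the more complicated constraints the description mentions, though that is beyond this sketch.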
format Online
Article
Text
id pubmed-5106001
institution National Center for Biotechnology Information
language English
publishDate 2016
publisher Public Library of Science
record_format MEDLINE/PubMed
spelling pubmed-51060012016-12-08 NullSeq: A Tool for Generating Random Coding Sequences with Desired Amino Acid and GC Contents Liu, Sophia S. Hockenberry, Adam J. Lancichinetti, Andrea Jewett, Michael C. Amaral, Luís A. N. PLoS Comput Biol Research Article
Public Library of Science 2016-11-11 /pmc/articles/PMC5106001/ /pubmed/27835644 http://dx.doi.org/10.1371/journal.pcbi.1005184 Text en © 2016 Liu et al http://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/) , which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
title NullSeq: A Tool for Generating Random Coding Sequences with Desired Amino Acid and GC Contents
topic Research Article