
A k-mer grammar analysis to uncover maize regulatory architecture

BACKGROUND: Only a small percentage of the genome sequence is involved in regulation of gene expression, but to biochemically identify this portion is expensive and laborious. In species like maize, with diverse intergenic regions and lots of repetitive elements, this is an especially challenging pr...


Bibliographic Details
Main authors: Mejía-Guerra, María Katherine, Buckler, Edward S.
Format: Online Article Text
Language: English
Published: BioMed Central 2019
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6419808/
https://www.ncbi.nlm.nih.gov/pubmed/30876396
http://dx.doi.org/10.1186/s12870-019-1693-2
_version_ 1783404001140670464
author Mejía-Guerra, María Katherine
Buckler, Edward S.
author_facet Mejía-Guerra, María Katherine
Buckler, Edward S.
author_sort Mejía-Guerra, María Katherine
collection PubMed
description BACKGROUND: Only a small percentage of the genome sequence is involved in regulation of gene expression, but to biochemically identify this portion is expensive and laborious. In species like maize, with diverse intergenic regions and lots of repetitive elements, this is an especially challenging problem that limits the use of the data from one line to the other. While regulatory regions are rare, they do have characteristic chromatin contexts and sequence organization (the grammar) with which they can be identified. RESULTS: We developed a computational framework to exploit this sequence arrangement. The models learn to classify regulatory regions based on sequence features - k-mers. To do this, we borrowed two approaches from the field of natural language processing: (1) “bag-of-words” which is commonly used for differentially weighting key words in tasks like sentiment analyses, and (2) a vector-space model using word2vec (vector-k-mers), that captures semantic and linguistic relationships between words. We built “bag-of-k-mers” and “vector-k-mers” models that distinguish between regulatory and non-regulatory regions with an average accuracy above 90%. Our “bag-of-k-mers” achieved higher overall accuracy, while the “vector-k-mers” models were more useful in highlighting key groups of sequences within the regulatory regions. CONCLUSIONS: These models now provide powerful tools to annotate regulatory regions in other maize lines beyond the reference, at low cost and with high accuracy. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (10.1186/s12870-019-1693-2) contains supplementary material, which is available to authorized users.
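The "bag-of-k-mers" idea named in the description can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`kmer_tokens`, `bag_of_kmers`, `to_vector`), the choice of k = 3, and the two toy sequences are assumptions made only for illustration. The sketch shows how a DNA sequence is cut into overlapping k-mers ("words") and turned into an order-free count vector that a classifier could take as input.

```python
from collections import Counter


def kmer_tokens(seq, k=3):
    """Slide a window of width k over seq and return the overlapping k-mers."""
    seq = seq.upper()
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def bag_of_kmers(seq, k=3):
    """Count each k-mer, ignoring where it occurs (the 'bag' in bag-of-words)."""
    return Counter(kmer_tokens(seq, k))


def to_vector(bag, vocabulary):
    """Lay the counts out in a fixed k-mer order so sequences are comparable."""
    return [bag.get(kmer, 0) for kmer in vocabulary]


# Hypothetical toy sequences, not data from the article.
toy_regulatory = "TATAAATATA"
toy_background = "GGCGCCGGCG"

bags = [bag_of_kmers(s) for s in (toy_regulatory, toy_background)]
vocabulary = sorted(set().union(*bags))  # every k-mer seen in either sequence
vectors = [to_vector(b, vocabulary) for b in bags]
```

Each row of `vectors` is one sequence expressed as k-mer counts. A "vector-k-mers" model would instead feed the token lists from `kmer_tokens` to a word-embedding method such as word2vec, which is outside the scope of this sketch.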
format Online
Article
Text
id pubmed-6419808
institution National Center for Biotechnology Information
language English
publishDate 2019
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-6419808 2019-03-28 A k-mer grammar analysis to uncover maize regulatory architecture Mejía-Guerra, María Katherine Buckler, Edward S. BMC Plant Biol Research Article
BioMed Central 2019-03-15 /pmc/articles/PMC6419808/ /pubmed/30876396 http://dx.doi.org/10.1186/s12870-019-1693-2 Text en © The Author(s) 2019 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
spellingShingle Research Article
Mejía-Guerra, María Katherine
Buckler, Edward S.
A k-mer grammar analysis to uncover maize regulatory architecture
title A k-mer grammar analysis to uncover maize regulatory architecture
topic Research Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6419808/
https://www.ncbi.nlm.nih.gov/pubmed/30876396
http://dx.doi.org/10.1186/s12870-019-1693-2