
CMIC: predicting DNA methylation inheritance of CpG islands with embedding vectors of variable-length k-mers

BACKGROUND: Epigenetic modifications established in mammalian gametes are largely reprogrammed during early development but are partly inherited by the embryo to support its development. In this study, we examine CpG island (CGI) sequences to predict whether a mouse blastocyst CGI inherits ooc...


Bibliographic Details
Main authors: Maruyama, Osamu, Li, Yinuo, Narita, Hiroki, Toh, Hidehiro, Au Yeung, Wan Kin, Sasaki, Hiroyuki
Format: Online Article Text
Language: English
Published: BioMed Central 2022
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9469632/
https://www.ncbi.nlm.nih.gov/pubmed/36096737
http://dx.doi.org/10.1186/s12859-022-04916-3
author Maruyama, Osamu
Li, Yinuo
Narita, Hiroki
Toh, Hidehiro
Au Yeung, Wan Kin
Sasaki, Hiroyuki
collection PubMed
description BACKGROUND: Epigenetic modifications established in mammalian gametes are largely reprogrammed during early development but are partly inherited by the embryo to support its development. In this study, we examine CpG island (CGI) sequences to predict whether a mouse blastocyst CGI inherits oocyte-derived DNA methylation from the maternal genome. Recurrent neural networks (RNNs), including those based on gated recurrent units (GRUs), have recently been employed for variable-length inputs in classification and regression analyses. One advantage of this strategy is that RNNs automatically learn latent features embedded in the inputs as their model parameters are trained. However, the available CGI datasets for predicting oocyte-derived DNA methylation inheritance are not large enough to train such neural networks. RESULTS: We propose a GRU-based model called CMIC (CGI Methylation Inheritance Classifier) that augments a CGI sequence by converting it into sequences of variable-length k-mers, where the length k is randomly selected from the range [Formula: see text] to [Formula: see text]. The conversion is repeated N times, and the resulting sequences are used as neural network input. N is set to 1000 by default. In addition, we propose a new embedding vector generator for k-mers called splitDNA2vec, whose procedure involves more randomness than that of the previous work, dna2vec. CONCLUSIONS: We found that CMIC can predict the inheritance of oocyte-derived DNA methylation at CGIs in the maternal genome of blastocysts with a high F-measure (0.93). We also show that the F-measure can be improved by increasing the parameter N, that is, the number of sequences of variable-length k-mers derived from a single CGI sequence.
This suggests that augmenting input data by converting a DNA sequence into N sequences of variable-length k-mers is effective. The approach can be applied to other DNA sequence classification and regression analyses, particularly those with small amounts of data. SUPPLEMENTARY INFORMATION: The online version contains supplementary material available at 10.1186/s12859-022-04916-3.
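The augmentation step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The names `split_variable_kmers` and `augment` are hypothetical, the bounds `k_min=3` and `k_max=8` are placeholders because the record elides the actual range ("[Formula: see text]"), and splitting into non-overlapping k-mers from left to right is an assumption.

```python
import random


def split_variable_kmers(seq, k_min, k_max, rng):
    """Split seq left to right into consecutive, non-overlapping k-mers.

    Each length k is drawn uniformly from [k_min, k_max]. The final
    fragment may be shorter than k_min when the sequence runs out.
    (Non-overlapping splitting is an assumption of this sketch.)
    """
    kmers = []
    pos = 0
    while pos < len(seq):
        k = rng.randint(k_min, k_max)  # inclusive on both ends
        kmers.append(seq[pos:pos + k])
        pos += k
    return kmers


def augment(seq, n, k_min=3, k_max=8, seed=0):
    """Derive n random k-mer sequences from one DNA sequence.

    k_min and k_max are placeholder values, not taken from the record.
    """
    rng = random.Random(seed)
    return [split_variable_kmers(seq, k_min, k_max, rng) for _ in range(n)]


if __name__ == "__main__":
    # Toy sequence for illustration only; not data from the study.
    for kmers in augment("ACGTACGTACGGCGCGTTAACG", n=3):
        print(kmers)
```

Each of the n outputs is a different tokenization of the same sequence, so one CGI yields n training inputs. With the abstract's default, n would be 1000.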
format Online
Article
Text
id pubmed-9469632
institution National Center for Biotechnology Information
language English
publishDate 2022
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-9469632 2022-09-14 CMIC: predicting DNA methylation inheritance of CpG islands with embedding vectors of variable-length k-mers Maruyama, Osamu Li, Yinuo Narita, Hiroki Toh, Hidehiro Au Yeung, Wan Kin Sasaki, Hiroyuki BMC Bioinformatics Research BioMed Central 2022-09-12 /pmc/articles/PMC9469632/ /pubmed/36096737 http://dx.doi.org/10.1186/s12859-022-04916-3 Text en © The Author(s) 2022 https://creativecommons.org/licenses/by/4.0/ Open Access. This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
title CMIC: predicting DNA methylation inheritance of CpG islands with embedding vectors of variable-length k-mers
topic Research