CMIC: predicting DNA methylation inheritance of CpG islands with embedding vectors of variable-length k-mers
BACKGROUND: Epigenetic modifications established in mammalian gametes are largely reprogrammed during early development but are partly inherited by the embryo to support its development. In this study, we examine CpG island (CGI) sequences to predict whether a mouse blastocyst CGI inherits ooc...
Main authors: | Maruyama, Osamu; Li, Yinuo; Narita, Hiroki; Toh, Hidehiro; Au Yeung, Wan Kin; Sasaki, Hiroyuki |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | BioMed Central 2022 |
Subjects: | Research |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9469632/ https://www.ncbi.nlm.nih.gov/pubmed/36096737 http://dx.doi.org/10.1186/s12859-022-04916-3 |
_version_ | 1784788684534448128 |
---|---|
author | Maruyama, Osamu; Li, Yinuo; Narita, Hiroki; Toh, Hidehiro; Au Yeung, Wan Kin; Sasaki, Hiroyuki |
author_sort | Maruyama, Osamu |
collection | PubMed |
description | BACKGROUND: Epigenetic modifications established in mammalian gametes are largely reprogrammed during early development but are partly inherited by the embryo to support its development. In this study, we examine CpG island (CGI) sequences to predict whether a mouse blastocyst CGI inherits oocyte-derived DNA methylation from the maternal genome. Recurrent neural networks (RNNs), including those based on gated recurrent units (GRUs), have recently been employed for variable-length inputs in classification and regression analyses. One advantage of this strategy is the ability of RNNs to automatically learn latent features embedded in inputs by learning their model parameters. However, the available CGI datasets for predicting oocyte-derived DNA methylation inheritance are not large enough to train neural networks. RESULTS: We propose a GRU-based model called CMIC (CGI Methylation Inheritance Classifier), which augments a CGI sequence by converting it N times into a sequence of variable-length k-mers, where the length k is randomly selected from the range [Formula: see text] to [Formula: see text]; the resulting sequences are then used as neural network input. N was set to 1000 in the default setting. In addition, we propose a new embedding vector generator for k-mers called splitDNA2vec. The randomness of this procedure is higher than that of the previous work, dna2vec. CONCLUSIONS: We found that CMIC can predict the inheritance of oocyte-derived DNA methylation at CGIs in the maternal genome of blastocysts with a high F-measure (0.93). We also show that the F-measure can be improved by increasing the parameter N, that is, the number of sequences of variable-length k-mers derived from a single CGI sequence. This implies the effectiveness of augmenting input data by converting a DNA sequence to N sequences of variable-length k-mers. This approach can be applied to different DNA sequence classification and regression analyses, particularly those involving a small amount of data. SUPPLEMENTARY INFORMATION: The online version contains supplementary material available at 10.1186/s12859-022-04916-3. |
format | Online Article Text |
id | pubmed-9469632 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2022 |
publisher | BioMed Central |
record_format | MEDLINE/PubMed |
spelling | pubmed-9469632 2022-09-14 CMIC: predicting DNA methylation inheritance of CpG islands with embedding vectors of variable-length k-mers Maruyama, Osamu; Li, Yinuo; Narita, Hiroki; Toh, Hidehiro; Au Yeung, Wan Kin; Sasaki, Hiroyuki BMC Bioinformatics Research BioMed Central 2022-09-12 /pmc/articles/PMC9469632/ /pubmed/36096737 http://dx.doi.org/10.1186/s12859-022-04916-3 Text en © The Author(s) 2022 https://creativecommons.org/licenses/by/4.0/ Open Access: This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data. |
title | CMIC: predicting DNA methylation inheritance of CpG islands with embedding vectors of variable-length k-mers |
topic | Research |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9469632/ https://www.ncbi.nlm.nih.gov/pubmed/36096737 http://dx.doi.org/10.1186/s12859-022-04916-3 |
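
The augmentation step summarized in the abstract (converting one CGI sequence into N sequences of variable-length k-mers) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `split_variable_kmers` and `augment` are hypothetical, the splitting into consecutive non-overlapping pieces is one plausible reading of the abstract, and the bounds `k_min=3` and `k_max=8` are placeholder assumptions, because the actual range appears only as "[Formula: see text]" in this record.

```python
import random


def split_variable_kmers(seq, k_min, k_max, rng):
    """Split seq into consecutive, non-overlapping k-mers.

    Each k is drawn uniformly from [k_min, k_max] (inclusive).
    The final piece may be shorter than k_min if the sequence runs out.
    """
    kmers = []
    pos = 0
    while pos < len(seq):
        k = rng.randint(k_min, k_max)   # randint includes both endpoints
        kmers.append(seq[pos:pos + k])  # slicing truncates safely at the end
        pos += k
    return kmers


def augment(seq, n, k_min=3, k_max=8, seed=0):
    """Return n random k-mer splits of the same sequence.

    k_min and k_max are assumed values for illustration only.
    A fixed seed makes the output reproducible.
    """
    rng = random.Random(seed)
    return [split_variable_kmers(seq, k_min, k_max, rng) for _ in range(n)]


if __name__ == "__main__":
    cgi = "CGCGGGCTAGCGCGTACGCGCCGATCGCG"  # toy sequence, not real data
    for kmers in augment(cgi, n=3):
        print(kmers)
```

Every split concatenates back to the original sequence, so each of the N outputs is a different tokenization of the same CGI rather than new sequence data. In a pipeline like the one the abstract describes, each k-mer would then be mapped to an embedding vector before being passed to the recurrent network.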