K-mer-Based Motif Analysis in Insect Species across Anopheles, Drosophila, and Glossina Genera and Its Application to Species Classification

Short k-mer sequences from DNA are both conserved and diverged across species owing to their functional significance in speciation, which enables their use in many species classification algorithms. In the present study, we developed a methodology to analyze the DNA k-mers of whole genome, 5′ UTR, i...

Bibliographic Details
Main authors: Cserhati, Matyas, Xiao, Peng, Guda, Chittibabu
Format: Online Article Text
Language: English
Published: Hindawi 2019
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6881769/
https://www.ncbi.nlm.nih.gov/pubmed/31827584
http://dx.doi.org/10.1155/2019/4259479
author Cserhati, Matyas
Xiao, Peng
Guda, Chittibabu
collection PubMed
description Short k-mer sequences from DNA are both conserved and diverged across species owing to their functional significance in speciation, which enables their use in many species classification algorithms. In the present study, we developed a methodology to analyze the DNA k-mers of whole genome, 5′ UTR, intron, and 3′ UTR regions from 58 insect species belonging to three genera of Diptera that include Anopheles, Drosophila, and Glossina. We developed an improved algorithm to predict and score k-mers based on a scheme that normalizes k-mer scores in different genomic subregions. This algorithm takes advantage of the information content of the whole genome as opposed to other algorithms or studies that analyze only a small group of genes. Our algorithm uses k-mers of lengths 7–9 bp for the whole genome, 5′ and 3′ UTR regions as well as the intronic regions. Taxonomical relationships based on the whole-genome k-mer signatures showed that species of the three genera clustered together quite visibly. We also improved the scoring and filtering of these k-mers for accurate species identification. The whole-genome k-mer content correlation algorithm showed that species within a single genus correlated tightly with each other as compared to other genera. The genomes of two Aedes and one Culex species were also analyzed to demonstrate how newly sequenced species can be classified using the algorithm. Furthermore, working with several dozen species has enabled us to assign a whole-genome k-mer signature for each of the 58 Dipteran species by making all-to-all pairwise comparison of the k-mer content. These signatures were used to compare the similarity between species and to identify clusters of species displaying similar signatures.
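The description above mentions comparing species by the correlation of their whole-genome k-mer content. The following is a minimal sketch of that general idea, not the authors' published algorithm: it counts overlapping k-mers, converts the counts to relative frequencies, and computes an all-to-all Pearson correlation. The function names (`count_kmers`, `frequency_vector`, `pearson`, `correlation_matrix`), the choice of Pearson correlation, and the handling of ambiguous bases are assumptions made for illustration; the scoring, normalization, and filtering scheme described in the paper is not reproduced here.

```python
from collections import Counter
from itertools import product
from math import sqrt


def count_kmers(seq, k):
    """Count overlapping k-mers, skipping windows with non-ACGT characters."""
    seq = seq.upper()
    counts = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= set("ACGT"):
            counts[kmer] += 1
    return counts


def frequency_vector(seq, k):
    """Relative frequency of every possible k-mer, in a fixed lexicographic order."""
    counts = count_kmers(seq, k)
    total = sum(counts.values())
    all_kmers = ("".join(p) for p in product("ACGT", repeat=k))
    return [counts[km] / total if total else 0.0 for km in all_kmers]


def pearson(x, y):
    """Pearson correlation of two equal-length vectors (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def correlation_matrix(genomes, k):
    """All-to-all pairwise correlation of k-mer content, keyed by (name, name)."""
    names = sorted(genomes)
    vectors = {name: frequency_vector(genomes[name], k) for name in names}
    return {(a, b): pearson(vectors[a], vectors[b]) for a in names for b in names}
```

In this sketch, `genomes` is a hypothetical dictionary mapping a species name to its sequence string. With real genomes and k between 7 and 9, the resulting matrix could be passed to a hierarchical clustering routine to look for genus-level groupings, though dense Python lists become slow at that scale and an array library would be a more practical choice.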
format Online
Article
Text
id pubmed-6881769
institution National Center for Biotechnology Information
language English
publishDate 2019
publisher Hindawi
record_format MEDLINE/PubMed
spelling pubmed-6881769 2019-12-11 K-mer-Based Motif Analysis in Insect Species across Anopheles, Drosophila, and Glossina Genera and Its Application to Species Classification Cserhati, Matyas Xiao, Peng Guda, Chittibabu Comput Math Methods Med Research Article Hindawi 2019-11-15 /pmc/articles/PMC6881769/ /pubmed/31827584 http://dx.doi.org/10.1155/2019/4259479 Text en Copyright © 2019 Matyas Cserhati et al. http://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
title K-mer-Based Motif Analysis in Insect Species across Anopheles, Drosophila, and Glossina Genera and Its Application to Species Classification
topic Research Article