Classification of group A rotavirus VP7 and VP4 genotypes using random forest

Introduction: Group A rotaviruses are a major cause of severe diarrhea in young children and in neonates of many animal species worldwide, and group A rotavirus sequence data are becoming increasingly available. Different methods exist that allow for rotavirus genotypin...

Bibliographic Details
Main authors: Tran, Hoc; Friendship, Robert; Poljak, Zvonimir
Format: Online Article Text
Language: English
Published: Frontiers Media S.A. 2023
Subjects: Genetics
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10267748/
https://www.ncbi.nlm.nih.gov/pubmed/37323680
http://dx.doi.org/10.3389/fgene.2023.1029185
author Tran, Hoc
Friendship, Robert
Poljak, Zvonimir
collection PubMed
description Introduction: Group A rotaviruses are a major cause of severe diarrhea in young children and in neonates of many animal species worldwide, and group A rotavirus sequence data are becoming increasingly available. Several methods exist for rotavirus genotyping, but machine learning methods have yet to be explored. Using machine learning algorithms such as random forest alongside alignment-based methodology may allow efficient and accurate classification of circulating rotavirus genotypes under the dual classification system. Methods: Random forest models were trained on positional features obtained from pairwise and multiple sequence alignment and cross-validated using either 10-fold cross-validation repeated three times or leave-one-out cross-validation. Models were then validated on unseen data from the testing datasets to assess real-world performance. Results: All models performed strongly in classifying VP7 and VP4 genotypes, with high overall accuracy and kappa values during model training (0.975–0.992 and 0.970–0.989, respectively) and during model testing (0.972–0.996 and 0.969–0.996, respectively). Models trained on multiple sequence alignment generally had slightly higher overall accuracy and kappa values than models trained on pairwise sequence alignment. In contrast, pairwise sequence alignment models were generally faster than multiple sequence alignment models when the models did not need to be retrained. Models cross-validated with 10-fold cross-validation repeated three times were also much faster than models cross-validated with leave-one-out cross-validation, with no noticeable difference in overall accuracy or kappa values between the two cross-validation methods. Discussion: Overall, random forest models showed strong performance in classifying both group A rotavirus VP7 and VP4 genotypes.
Applying these models as classifiers will allow rapid and accurate classification of the increasing amounts of rotavirus sequence data that are becoming available.
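The two evaluation measures named in the abstract, overall accuracy and kappa, and the idea of positional features taken from an alignment can be illustrated with a small standard-library sketch. This is not the authors' code: the record does not say how positions were encoded or which software was used, so the one-hot nucleotide encoding, the function names, the kappa variant (Cohen's kappa, the usual choice for classifier agreement), and the toy genotype labels below are all assumptions made for illustration.

```python
from collections import Counter


def positional_features(aligned_seqs, alphabet="ACGT-"):
    """One-hot encode every alignment column (assumed encoding, not from the paper)."""
    length = len(aligned_seqs[0])
    if any(len(seq) != length for seq in aligned_seqs):
        raise ValueError("aligned sequences must all have the same length")
    rows = []
    for seq in aligned_seqs:
        row = []
        for base in seq.upper():
            # One indicator per alphabet symbol; unknown symbols encode as all zeros.
            row.extend(1 if base == symbol else 0 for symbol in alphabet)
        rows.append(row)
    return rows


def overall_accuracy(true_labels, predicted_labels):
    """Fraction of samples whose predicted genotype equals the true genotype."""
    matches = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return matches / len(true_labels)


def cohen_kappa(true_labels, predicted_labels):
    """Cohen's kappa: agreement beyond what the label frequencies alone would give."""
    n = len(true_labels)
    observed = overall_accuracy(true_labels, predicted_labels)
    true_counts = Counter(true_labels)
    pred_counts = Counter(predicted_labels)
    # Chance agreement: sum over classes of P(true = c) * P(predicted = c).
    expected = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    if expected == 1.0:
        return float("nan")  # undefined when only one class occurs on both sides
    return (observed - expected) / (1.0 - expected)


# Toy data only: hypothetical aligned fragments and made-up genotype calls.
toy_alignment = ["ACG-T", "ACGAT", "TCG-T"]
toy_true = ["G1", "G1", "G2", "G2", "G3", "G3"]
toy_pred = ["G1", "G1", "G2", "G3", "G3", "G3"]

features = positional_features(toy_alignment)
print(len(features), "sequences,", len(features[0]), "features each")
print("accuracy:", overall_accuracy(toy_true, toy_pred))
print("kappa:", cohen_kappa(toy_true, toy_pred))
```

Kappa is reported alongside accuracy because genotype classes are typically unbalanced; a classifier that always predicts the most common genotype can reach high accuracy while its kappa stays near zero. In practice the feature rows would be passed to a random forest implementation from an established library rather than evaluated by hand.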
format Online
Article
Text
id pubmed-10267748
institution National Center for Biotechnology Information
language English
publishDate 2023
publisher Frontiers Media S.A.
record_format MEDLINE/PubMed
spelling pubmed-10267748 2023-06-15 Classification of group A rotavirus VP7 and VP4 genotypes using random forest Tran, Hoc Friendship, Robert Poljak, Zvonimir Front Genet Genetics Frontiers Media S.A. 2023-05-30 /pmc/articles/PMC10267748/ /pubmed/37323680 http://dx.doi.org/10.3389/fgene.2023.1029185 Text en Copyright © 2023 Tran, Friendship and Poljak. https://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
title Classification of group A rotavirus VP7 and VP4 genotypes using random forest
topic Genetics