Classification of group A rotavirus VP7 and VP4 genotypes using random forest

Introduction: Group A rotaviruses are major pathogens in causing severe diarrhea in young children and neonates of many different species of animals worldwide and group A rotavirus sequence data are becoming increasingly available over time. Different methods exist that allow for rotavirus genotypin...

Bibliographic Details
Main Authors:	Tran, Hoc; Friendship, Robert; Poljak, Zvonimir
Format:	Online Article Text
Language:	English
Published:	Frontiers Media S.A. 2023
Subjects:	Genetics
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10267748/ https://www.ncbi.nlm.nih.gov/pubmed/37323680 http://dx.doi.org/10.3389/fgene.2023.1029185

_version_	1785058991267643392
author	Tran, Hoc Friendship, Robert Poljak, Zvonimir
author_facet	Tran, Hoc Friendship, Robert Poljak, Zvonimir
author_sort	Tran, Hoc
collection	PubMed
description	Introduction: Group A rotaviruses are major pathogens causing severe diarrhea in young children and in neonates of many animal species worldwide, and group A rotavirus sequence data are becoming increasingly available. Different methods exist for rotavirus genotyping, but machine learning methods have yet to be explored. Using machine learning algorithms such as random forest alongside alignment-based methodology may allow for both efficient and accurate classification of circulating rotavirus genotypes through the dual classification system. Methods: Random forest models were trained on positional features obtained from pairwise and multiple sequence alignment and cross-validated using 10-fold cross-validation repeated three times and leave-one-out cross-validation. Models were then validated on unseen data from the testing datasets to observe real-world performance. Results: All models performed strongly in classification of VP7 and VP4 genotypes, with high overall accuracy and kappa values during model training (0.975–0.992, 0.970–0.989) and during model testing (0.972–0.996, 0.969–0.996), respectively. Models trained on multiple sequence alignment generally had slightly higher overall accuracy and kappa values than models trained on pairwise sequence alignment. In contrast, pairwise sequence alignment models were generally faster than multiple sequence alignment models when models did not need to be retrained. Models that used 10-fold cross-validation repeated three times were also much faster than models that used leave-one-out cross-validation, with no noticeable difference in overall accuracy and kappa values between the cross-validation methods. Discussion: Overall, random forest models showed strong performance in the classification of both group A rotavirus VP7 and VP4 genotypes. Application of these models as classifiers will allow for rapid and accurate classification of the increasing amounts of rotavirus sequence data that are becoming available.
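The evaluation scheme named in the abstract (10-fold cross-validation repeated three times, leave-one-out cross-validation, overall accuracy, and kappa) can be illustrated with a minimal standard-library sketch. This is not the authors' code and it trains no random forest; the function names `repeated_kfold` and `accuracy_and_kappa`, the seed, and the toy genotype labels are assumptions made for illustration only. Kappa here is Cohen's kappa, (observed agreement minus chance agreement) divided by (1 minus chance agreement).

```python
import random
from collections import Counter


def repeated_kfold(n_samples, k=10, repeats=3, seed=0):
    """Yield (train_idx, test_idx) pairs for k-fold CV repeated `repeats` times.

    Leave-one-out CV is the special case k=n_samples, repeats=1.
    """
    rng = random.Random(seed)
    for _ in range(repeats):
        order = list(range(n_samples))
        rng.shuffle(order)  # a fresh shuffle for every repeat
        folds = [order[i::k] for i in range(k)]  # k near-equal folds
        for held_out in folds:
            held = set(held_out)
            train = [i for i in order if i not in held]
            yield train, held_out


def accuracy_and_kappa(y_true, y_pred):
    """Return (overall accuracy, Cohen's kappa) for two label sequences."""
    n = len(y_true)
    observed = sum(t == p for t, p in zip(y_true, y_pred)) / n
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    # Agreement expected by chance, from the marginal label frequencies.
    expected = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    if expected == 1.0:
        return observed, float("nan")  # kappa is undefined in this case
    return observed, (observed - expected) / (1.0 - expected)


if __name__ == "__main__":
    # Hypothetical genotype labels, used only to exercise the functions.
    truth = ["G1", "G1", "G2", "G2"]
    guess = ["G1", "G1", "G2", "G1"]
    print(accuracy_and_kappa(truth, guess))
    print(sum(1 for _ in repeated_kfold(25)))  # number of train/test splits
```

With n sequences, the repeated scheme fits 30 models while leave-one-out fits n, which is consistent with the speed difference the abstract reports between the two cross-validation methods.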
format	Online Article Text
id	pubmed-10267748
institution	National Center for Biotechnology Information
language	English
publishDate	2023
publisher	Frontiers Media S.A.
record_format	MEDLINE/PubMed
spelling	pubmed-10267748 2023-06-15 Classification of group A rotavirus VP7 and VP4 genotypes using random forest Tran, Hoc Friendship, Robert Poljak, Zvonimir Front Genet Genetics Frontiers Media S.A. 2023-05-30 /pmc/articles/PMC10267748/ /pubmed/37323680 http://dx.doi.org/10.3389/fgene.2023.1029185 Text en Copyright © 2023 Tran, Friendship and Poljak. https://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
spellingShingle	Genetics Tran, Hoc Friendship, Robert Poljak, Zvonimir Classification of group A rotavirus VP7 and VP4 genotypes using random forest
title	Classification of group A rotavirus VP7 and VP4 genotypes using random forest
topic	Genetics
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10267748/ https://www.ncbi.nlm.nih.gov/pubmed/37323680 http://dx.doi.org/10.3389/fgene.2023.1029185
work_keys_str_mv	AT tranhoc classificationofgrouparotavirusvp7andvp4genotypesusingrandomforest AT friendshiprobert classificationofgrouparotavirusvp7andvp4genotypesusingrandomforest AT poljakzvonimir classificationofgrouparotavirusvp7andvp4genotypesusingrandomforest