Classification of group A rotavirus VP7 and VP4 genotypes using random forest
Introduction: Group A rotaviruses are major causes of severe diarrhea in young children and in neonates of many animal species worldwide, and group A rotavirus sequence data are becoming increasingly available. Several methods exist for rotavirus genotypin...
Main authors: | Tran, Hoc; Friendship, Robert; Poljak, Zvonimir |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Frontiers Media S.A. 2023 |
Subjects: | Genetics |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10267748/ https://www.ncbi.nlm.nih.gov/pubmed/37323680 http://dx.doi.org/10.3389/fgene.2023.1029185 |
_version_ | 1785058991267643392 |
---|---|
author | Tran, Hoc Friendship, Robert Poljak, Zvonimir |
author_facet | Tran, Hoc Friendship, Robert Poljak, Zvonimir |
author_sort | Tran, Hoc |
collection | PubMed |
description | Introduction: Group A rotaviruses are major causes of severe diarrhea in young children and in neonates of many animal species worldwide, and group A rotavirus sequence data are becoming increasingly available. Several methods exist for rotavirus genotyping, but machine learning methods have yet to be explored. Using machine learning algorithms such as random forest alongside alignment-based methodology may allow efficient and accurate classification of circulating rotavirus genotypes under the dual classification system. Methods: Random forest models were trained on positional features obtained from pairwise and multiple sequence alignment and were cross-validated using either three repeats of 10-fold cross-validation or leave-one-out cross-validation. Models were then evaluated on unseen data from the testing datasets to assess real-world performance. Results: All models performed strongly in classifying VP7 and VP4 genotypes, with high overall accuracy and kappa values during model training (0.975–0.992, 0.970–0.989) and during model testing (0.972–0.996, 0.969–0.996), respectively. Models trained on multiple sequence alignment generally had slightly higher overall accuracy and kappa values than models trained on pairwise sequence alignment. In contrast, pairwise sequence alignment models were generally faster than multiple sequence alignment models when the models did not need to be retrained. Models that used three repeats of 10-fold cross-validation were also much faster than models that used leave-one-out cross-validation, with no noticeable difference in overall accuracy and kappa values between the cross-validation methods. Discussion: Overall, random forest models showed strong performance in classifying both group A rotavirus VP7 and VP4 genotypes. Applying these models as classifiers will allow rapid and accurate classification of the increasing amounts of rotavirus sequence data that are becoming available. |
format | Online Article Text |
id | pubmed-10267748 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2023 |
publisher | Frontiers Media S.A. |
record_format | MEDLINE/PubMed |
spelling | pubmed-10267748 2023-06-15 Classification of group A rotavirus VP7 and VP4 genotypes using random forest Tran, Hoc Friendship, Robert Poljak, Zvonimir Front Genet Genetics Introduction: Group A rotaviruses are major causes of severe diarrhea in young children and in neonates of many animal species worldwide, and group A rotavirus sequence data are becoming increasingly available. Several methods exist for rotavirus genotyping, but machine learning methods have yet to be explored. Using machine learning algorithms such as random forest alongside alignment-based methodology may allow efficient and accurate classification of circulating rotavirus genotypes under the dual classification system. Methods: Random forest models were trained on positional features obtained from pairwise and multiple sequence alignment and were cross-validated using either three repeats of 10-fold cross-validation or leave-one-out cross-validation. Models were then evaluated on unseen data from the testing datasets to assess real-world performance. Results: All models performed strongly in classifying VP7 and VP4 genotypes, with high overall accuracy and kappa values during model training (0.975–0.992, 0.970–0.989) and during model testing (0.972–0.996, 0.969–0.996), respectively. Models trained on multiple sequence alignment generally had slightly higher overall accuracy and kappa values than models trained on pairwise sequence alignment. In contrast, pairwise sequence alignment models were generally faster than multiple sequence alignment models when the models did not need to be retrained. Models that used three repeats of 10-fold cross-validation were also much faster than models that used leave-one-out cross-validation, with no noticeable difference in overall accuracy and kappa values between the cross-validation methods. Discussion: Overall, random forest models showed strong performance in classifying both group A rotavirus VP7 and VP4 genotypes. Applying these models as classifiers will allow rapid and accurate classification of the increasing amounts of rotavirus sequence data that are becoming available. Frontiers Media S.A. 2023-05-30 /pmc/articles/PMC10267748/ /pubmed/37323680 http://dx.doi.org/10.3389/fgene.2023.1029185 Text en Copyright © 2023 Tran, Friendship and Poljak. https://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms. |
spellingShingle | Genetics Tran, Hoc Friendship, Robert Poljak, Zvonimir Classification of group A rotavirus VP7 and VP4 genotypes using random forest |
title | Classification of group A rotavirus VP7 and VP4 genotypes using random forest |
topic | Genetics |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10267748/ https://www.ncbi.nlm.nih.gov/pubmed/37323680 http://dx.doi.org/10.3389/fgene.2023.1029185 |
work_keys_str_mv | AT tranhoc classificationofgrouparotavirusvp7andvp4genotypesusingrandomforest AT friendshiprobert classificationofgrouparotavirusvp7andvp4genotypesusingrandomforest AT poljakzvonimir classificationofgrouparotavirusvp7andvp4genotypesusingrandomforest |
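
The abstract reports model performance as overall accuracy and kappa values. As a rough illustration of how those two figures relate, the sketch below computes overall accuracy and Cohen's kappa from paired genotype calls using only the Python standard library. It is not the authors' code: the function name `accuracy_and_kappa` and the example genotype calls are hypothetical, and the record does not say which software the authors used for these metrics.

```python
from collections import Counter


def accuracy_and_kappa(true_labels, predicted_labels):
    """Return (overall accuracy, Cohen's kappa) for two equal-length label lists."""
    if len(true_labels) != len(predicted_labels) or not true_labels:
        raise ValueError("label lists must be non-empty and of equal length")

    n = len(true_labels)

    # Observed agreement: fraction of sequences whose predicted genotype is correct.
    observed = sum(t == p for t, p in zip(true_labels, predicted_labels)) / n

    # Chance agreement: for each genotype, the product of its share among the
    # true labels and its share among the predictions, summed over genotypes.
    true_counts = Counter(true_labels)
    pred_counts = Counter(predicted_labels)
    expected = sum(true_counts[g] * pred_counts[g] for g in true_counts) / (n * n)

    if expected == 1.0:
        # Degenerate case: only one genotype appears in both lists.
        return observed, 1.0

    kappa = (observed - expected) / (1.0 - expected)
    return observed, kappa


# Hypothetical VP7 genotype calls (illustrative only, not data from the study).
true_calls = ["G9", "G9", "G5", "G5", "G4", "G4", "G3", "G3"]
pred_calls = ["G9", "G9", "G5", "G4", "G4", "G4", "G3", "G3"]

acc, kappa = accuracy_and_kappa(true_calls, pred_calls)
print(f"accuracy={acc:.3f} kappa={kappa:.3f}")
```

Kappa discounts the agreement expected by chance from the class frequencies, which is why it is usually reported alongside accuracy when genotype classes are unevenly represented.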