Applying T-classifier, binary classifiers, upon high-throughput TCR sequencing output to identify cytomegalovirus exposure history
Advances in information technology and computing speed have produced ever-larger volumes of medical data. Solving unmet needs such as applying constantly developing artificial intelligence technology to medical...
Main authors: | Zhou, Kaiyue; Huo, Jiaxin; Gao, Caixia; Wang, Xu; Xu, Pengfei; Hou, Jiahuan; Guo, Wenying; Sun, Tao; Da, Lin |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Nature Publishing Group UK 2023 |
Subjects: | |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10043529/ https://www.ncbi.nlm.nih.gov/pubmed/36977685 http://dx.doi.org/10.1038/s41598-023-31013-z |
_version_ | 1784913175137746944 |
---|---|
author | Zhou, Kaiyue Huo, Jiaxin Gao, Caixia Wang, Xu Xu, Pengfei Hou, Jiahuan Guo, Wenying Sun, Tao Da, Lin |
author_sort | Zhou, Kaiyue |
collection | PubMed |
description | Advances in information technology and computing speed have produced ever-larger volumes of medical data, and applying artificial intelligence to these data in support of the medical industry is an active research topic. Cytomegalovirus (CMV) is a widespread virus with strict species specificity, and the infection rate among Chinese adults exceeds 95%. Detecting CMV is important because, apart from a few patients with clinical symptoms, the vast majority of infected individuals carry an inapparent infection. In this study, we present a new method to detect CMV infection status by analyzing high-throughput sequencing results of T cell receptor beta chains (TCRβ). Using the high-throughput sequencing data of 640 subjects from cohort 1, Fisher's exact test was performed to evaluate the association between TCRβ sequences and CMV status. The subjects in cohort 1 and cohort 2 carrying these associated sequences, at different significance thresholds, were then counted to build binary classifier models that identify whether a subject is CMV positive or negative. We selected four binary classification algorithms for side-by-side comparison: logistic regression (LR), support vector machine (SVM), random forest (RF), and linear discriminant analysis (LDA). Comparing the performance of each algorithm across thresholds yielded four optimal models, one per algorithm. The LR algorithm performs best at a Fisher's exact test threshold of 10⁻⁵, with a sensitivity of 87.5% and a specificity of 96.88%. The RF algorithm performs best at a threshold of 10⁻⁵, with a sensitivity of 87.5% and a specificity of 90.63%. The SVM algorithm also achieves high accuracy at a threshold of 10⁻⁵, with a sensitivity of 85.42% and a specificity of 96.88%. The LDA algorithm achieves high accuracy at a threshold of 10⁻⁴, with a sensitivity of 95.83% and a specificity of 90.63%. This is probably because the two-dimensional distribution of the CMV data samples is linearly separable, so linear models such as LDA are more effective, while nonlinear algorithms such as random forest separate the classes less accurately. This finding may offer a potential diagnostic method for CMV and may be applicable to other viruses, for example in detecting a history of infection with the new coronavirus. |
format | Online Article Text |
id | pubmed-10043529 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2023 |
publisher | Nature Publishing Group UK |
record_format | MEDLINE/PubMed |
spelling | pubmed-10043529 2023-03-28 Applying T-classifier, binary classifiers, upon high-throughput TCR sequencing output to identify cytomegalovirus exposure history Zhou, Kaiyue Huo, Jiaxin Gao, Caixia Wang, Xu Xu, Pengfei Hou, Jiahuan Guo, Wenying Sun, Tao Da, Lin Sci Rep Article Nature Publishing Group UK 2023-03-28 /pmc/articles/PMC10043529/ /pubmed/36977685 http://dx.doi.org/10.1038/s41598-023-31013-z Text en © The Author(s) 2023. Open Access: This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit https://creativecommons.org/licenses/by/4.0/. |
title | Applying T-classifier, binary classifiers, upon high-throughput TCR sequencing output to identify cytomegalovirus exposure history |
topic | Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10043529/ https://www.ncbi.nlm.nih.gov/pubmed/36977685 http://dx.doi.org/10.1038/s41598-023-31013-z |
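
The pipeline summarized in the description field (Fisher's exact test per TCRβ sequence, a significance threshold, then a binary classifier scored by sensitivity and specificity) can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the repertoires, the sequence names (`seqA`, `seqB`, `seqC`), the one-sided form of the test, and the 0.05 threshold are all assumptions chosen so that a tiny toy dataset works, and a simple "carries at least one associated sequence" rule stands in for the LR, SVM, RF, and LDA models named in the record.

```python
from math import comb


def fisher_one_sided(a, b, c, d):
    """One-sided Fisher's exact test: P(X >= a) under the hypergeometric null.

    2x2 table:          CMV+  CMV-
      sequence present   a     b
      sequence absent    c     d
    """
    n_pos, n_neg, n_present = a + c, b + d, a + b
    total = comb(n_pos + n_neg, n_present)
    upper = min(n_pos, n_present)
    tail = sum(comb(n_pos, k) * comb(n_neg, n_present - k)
               for k in range(a, upper + 1))
    return tail / total


def select_associated(repertoires, labels, threshold):
    """Return the sequences enriched in CMV+ subjects with p < threshold."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    counts = {}  # sequence -> [carriers among CMV+, carriers among CMV-]
    for seqs, is_pos in zip(repertoires, labels):
        for s in seqs:
            counts.setdefault(s, [0, 0])[0 if is_pos else 1] += 1
    return {s for s, (a, b) in counts.items()
            if fisher_one_sided(a, b, n_pos - a, n_neg - b) < threshold}


def sensitivity_specificity(y_true, y_pred):
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP)."""
    tp = sum(t and p for t, p in zip(y_true, y_pred))
    fn = sum(t and not p for t, p in zip(y_true, y_pred))
    tn = sum(not t and not p for t, p in zip(y_true, y_pred))
    fp = sum(not t and p for t, p in zip(y_true, y_pred))
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical training cohort: 5 CMV+ and 5 CMV- subjects (toy data).
train_reps = [{"seqA", "seqB", "seqC"}] + [{"seqA", "seqB"}] * 4 + [{"seqB"}] * 5
train_labels = [True] * 5 + [False] * 5

# Toy threshold; the record reports thresholds of 1e-5 and 1e-4 on real data.
associated = select_associated(train_reps, train_labels, threshold=0.05)

# Hypothetical held-out cohort: 4 CMV+ and 2 CMV- subjects.
test_reps = [{"seqA"}, {"seqA", "seqB"}, {"seqA"}, {"seqB"}, {"seqB"}, set()]
test_labels = [True, True, True, True, False, False]

# Stand-in classifier: positive if the subject carries any associated sequence.
preds = [len(seqs & associated) >= 1 for seqs in test_reps]
sens, spec = sensitivity_specificity(test_labels, preds)
print(sorted(associated), sens, spec)  # ['seqA'] 0.75 1.0
```

In practice the per-subject count of associated sequences would be passed as a feature to a trained classifier rather than to a fixed cutoff, and the test would be run over hundreds of subjects so that thresholds as strict as 10⁻⁵ become attainable.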