
Deep sequencing of HBV pre-S region reveals high heterogeneity of HBV genotypes and associations of word pattern frequencies with HCC

Hepatitis B virus (HBV) infection is a common problem in the world, especially in China. More than 60–80% of hepatocellular carcinoma (HCC) cases can be attributed to HBV infection in high HBV prevalent regions. Although traditional Sanger sequencing has been extensively used to investigate HBV sequ...


Bibliographic Details
Main Authors: Bai, Xin, Jia, Jian-an, Fang, Meng, Chen, Shipeng, Liang, Xiaotao, Zhu, Shanfeng, Zhang, Shuqin, Feng, Jianfeng, Sun, Fengzhu, Gao, Chunfang
Format: Online Article Text
Language: English
Published: Public Library of Science 2018
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5841821/
https://www.ncbi.nlm.nih.gov/pubmed/29474353
http://dx.doi.org/10.1371/journal.pgen.1007206
author Bai, Xin
Jia, Jian-an
Fang, Meng
Chen, Shipeng
Liang, Xiaotao
Zhu, Shanfeng
Zhang, Shuqin
Feng, Jianfeng
Sun, Fengzhu
Gao, Chunfang
author_sort Bai, Xin
collection PubMed
description Hepatitis B virus (HBV) infection is a common problem worldwide, especially in China. More than 60–80% of hepatocellular carcinoma (HCC) cases can be attributed to HBV infection in regions with high HBV prevalence. Although traditional Sanger sequencing has been extensively used to investigate HBV sequences, next-generation sequencing (NGS) is becoming more commonly used. It remains unknown, however, whether word pattern frequencies of HBV reads obtained by NGS can be used to investigate HBV genotypes and predict HCC status. In this study, we used NGS to sequence the pre-S region of HBV from 94 HCC patients and 45 individuals with chronic HBV (CHB) infection. Word pattern frequencies in the sequence data of each individual were calculated and compared between individuals using the Manhattan distance. The individuals were grouped using principal coordinate analysis (PCoA) and hierarchical clustering. Word pattern frequencies were also used to build prediction models for HCC status using both K-nearest neighbors (KNN) and support vector machine (SVM). We showed that analyzing HBV sequences using word patterns has extremely high power. Our key findings include that the first principal coordinate of the PCoA was highly associated with the fraction of genotype B (or C) sequences and the second principal coordinate was significantly associated with the probability of having HCC. Hierarchical clustering first grouped the individuals according to their major genotypes and then according to their HCC status. Using cross-validation, high areas under the receiver operating characteristic curve (AUC) of around 0.88 for KNN and 0.92 for SVM were obtained. In an independent data set of 46 HCC patients and 31 CHB individuals, an AUC of 0.77 was obtained using SVM. It was further shown that 3000 reads per individual can yield stable prediction results for SVM. Thus, another key finding is that word patterns can be used to predict HCC status with high accuracy. Therefore, our study clearly shows that word pattern frequencies of HBV sequences contain much information about the composition of different HBV genotypes and the HCC status of an individual.
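The word-pattern approach summarized in the description can be illustrated with a minimal sketch. It is not the authors' pipeline: the word length `k`, the normalization, the handling of ambiguous bases, the function names, and the toy sequences used in testing are all assumptions made for illustration, since the abstract does not state them. The sketch computes relative word (k-mer) frequencies per individual, compares individuals with the Manhattan distance, and applies a simple K-nearest-neighbors vote. The paper's SVM, PCoA, and hierarchical clustering steps are not reproduced here.

```python
from collections import Counter
from itertools import product


def word_frequencies(reads, k=3):
    """Relative frequency of every length-k DNA word over one individual's reads.

    The value of k is an assumption; the abstract does not give the word length.
    """
    counts = Counter()
    for read in reads:
        read = read.upper()
        for i in range(len(read) - k + 1):
            word = read[i:i + k]
            if set(word) <= set("ACGT"):  # skip words containing ambiguous bases
                counts[word] += 1
    total = sum(counts.values())
    # Fixed word order (AAA, AAC, ...) so vectors are comparable across individuals.
    words = ["".join(p) for p in product("ACGT", repeat=k)]
    return [counts[w] / total if total else 0.0 for w in words]


def manhattan(u, v):
    """Manhattan (L1) distance between two frequency vectors of equal length."""
    return sum(abs(a - b) for a, b in zip(u, v))


def knn_predict(query, train_vectors, train_labels, n_neighbors=3):
    """Majority label among the n_neighbors training vectors closest to query."""
    ranked = sorted(
        zip(train_vectors, train_labels),
        key=lambda pair: manhattan(query, pair[0]),
    )
    votes = Counter(label for _, label in ranked[:n_neighbors])
    return votes.most_common(1)[0][0]
```

In practice, each individual's reads would be turned into one frequency vector with `word_frequencies`, and the labels passed to `knn_predict` would be the known HCC or CHB status of the training individuals. Performance would then be assessed by cross-validation, as the description reports.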
format Online
Article
Text
id pubmed-5841821
institution National Center for Biotechnology Information
language English
publishDate 2018
publisher Public Library of Science
record_format MEDLINE/PubMed
spelling pubmed-5841821 2018-03-23 PLoS Genet Research Article Public Library of Science 2018-02-23 /pmc/articles/PMC5841821/ /pubmed/29474353 http://dx.doi.org/10.1371/journal.pgen.1007206 Text en © 2018 Bai et al http://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
title Deep sequencing of HBV pre-S region reveals high heterogeneity of HBV genotypes and associations of word pattern frequencies with HCC
topic Research Article