n-Gram-Based Text Compression
We propose an efficient method for compressing Vietnamese text using n-gram dictionaries. It has a significant compression ratio in comparison with those of state-of-the-art methods on the same dataset. Given a text, first, the proposed method splits it into n-grams and then encodes them based on n-...
Main Authors: | Nguyen, Vu H., Nguyen, Hien T., Duong, Hieu N., Snasel, Vaclav |
---|---|
Format: | Online Article Text |
Language: | English |
Published: |
Hindawi Publishing Corporation
2016
|
Subjects: | |
Online Access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5124757/ https://www.ncbi.nlm.nih.gov/pubmed/27965708 http://dx.doi.org/10.1155/2016/9483646 |
_version_ | 1782469894827147264 |
---|---|
author | Nguyen, Vu H. Nguyen, Hien T. Duong, Hieu N. Snasel, Vaclav |
author_facet | Nguyen, Vu H. Nguyen, Hien T. Duong, Hieu N. Snasel, Vaclav |
author_sort | Nguyen, Vu H. |
collection | PubMed |
description | We propose an efficient method for compressing Vietnamese text using n-gram dictionaries. It achieves a significant compression ratio compared with state-of-the-art methods on the same dataset. Given a text, the proposed method first splits it into n-grams and then encodes them based on n-gram dictionaries. In the encoding phase, we use a sliding window whose size ranges from bigrams to five-grams to obtain the best encoding stream. Each n-gram is encoded in two to four bytes, depending on its corresponding n-gram dictionary. We collected a 2.5 GB text corpus from several Vietnamese news agencies to build n-gram dictionaries from unigrams to five-grams, obtaining dictionaries with a total size of 12 GB. To evaluate our method, we collected a test set of 10 text files of different sizes. The experimental results indicate that our method achieves a compression ratio of around 90% and outperforms state-of-the-art methods. |
format | Online Article Text |
id | pubmed-5124757 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2016 |
publisher | Hindawi Publishing Corporation |
record_format | MEDLINE/PubMed |
spelling | pubmed-51247572016-12-13 n-Gram-Based Text Compression Nguyen, Vu H. Nguyen, Hien T. Duong, Hieu N. Snasel, Vaclav Comput Intell Neurosci Research Article We propose an efficient method for compressing Vietnamese text using n-gram dictionaries. It has a significant compression ratio in comparison with those of state-of-the-art methods on the same dataset. Given a text, first, the proposed method splits it into n-grams and then encodes them based on n-gram dictionaries. In the encoding phase, we use a sliding window with a size that ranges from bigram to five grams to obtain the best encoding stream. Each n-gram is encoded by two to four bytes accordingly based on its corresponding n-gram dictionary. We collected 2.5 GB text corpus from some Vietnamese news agencies to build n-gram dictionaries from unigram to five grams and achieve dictionaries with a size of 12 GB in total. In order to evaluate our method, we collected a testing set of 10 different text files with different sizes. The experimental results indicate that our method achieves compression ratio around 90% and outperforms state-of-the-art methods. Hindawi Publishing Corporation 2016 2016-11-14 /pmc/articles/PMC5124757/ /pubmed/27965708 http://dx.doi.org/10.1155/2016/9483646 Text en Copyright © 2016 Vu H. Nguyen et al. https://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. |
spellingShingle | Research Article Nguyen, Vu H. Nguyen, Hien T. Duong, Hieu N. Snasel, Vaclav n-Gram-Based Text Compression |
title | n-Gram-Based Text Compression |
title_full | n-Gram-Based Text Compression |
title_fullStr | n-Gram-Based Text Compression |
title_full_unstemmed | n-Gram-Based Text Compression |
title_short | n-Gram-Based Text Compression |
title_sort | n-gram-based text compression |
topic | Research Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5124757/ https://www.ncbi.nlm.nih.gov/pubmed/27965708 http://dx.doi.org/10.1155/2016/9483646 |
work_keys_str_mv | AT nguyenvuh ngrambasedtextcompression AT nguyenhient ngrambasedtextcompression AT duonghieun ngrambasedtextcompression AT snaselvaclav ngrambasedtextcompression |
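
The encoding scheme summarized in the description field (split the text into n-grams, slide a window over it, and replace each n-gram with a short dictionary code) can be illustrated with a minimal Python sketch. This is not the authors' implementation. The greedy longest-match strategy, the byte widths per n-gram order in `CODE_BYTES`, the 3-bit order tag, and all function names are assumptions made for illustration; the record only states that codes take two to four bytes. The sketch also falls back to unigrams and raises an error on unknown words, which the record does not describe.

```python
from typing import Dict, List, Tuple

Gram = Tuple[str, ...]
# Hypothetical layout: n-gram order -> {n-gram: integer index}.
Dictionaries = Dict[int, Dict[Gram, int]]

# Assumed code width in bytes for each n-gram order (illustrative only).
CODE_BYTES = {1: 2, 2: 3, 3: 3, 4: 4, 5: 4}


def build_dictionaries(corpus: List[str], max_n: int = 5) -> Dictionaries:
    """Collect every 1..max_n-gram in the corpus and give it an index."""
    dicts: Dictionaries = {n: {} for n in range(1, max_n + 1)}
    for sentence in corpus:
        words = sentence.split()
        for n in range(1, max_n + 1):
            for i in range(len(words) - n + 1):
                gram = tuple(words[i:i + n])
                dicts[n].setdefault(gram, len(dicts[n]))
    return dicts


def encode(text: str, dicts: Dictionaries, max_n: int = 5) -> List[Tuple[int, int]]:
    """Greedy longest match: try the widest window first, then shrink it."""
    words = text.split()
    tokens: List[Tuple[int, int]] = []
    i = 0
    while i < len(words):
        for n in range(min(max_n, len(words) - i), 0, -1):
            gram = tuple(words[i:i + n])
            if gram in dicts[n]:
                tokens.append((n, dicts[n][gram]))
                i += n
                break
        else:
            # Assumption: out-of-dictionary words are simply rejected here.
            raise KeyError(f"word not in any dictionary: {words[i]!r}")
    return tokens


def to_bytes(tokens: List[Tuple[int, int]]) -> bytes:
    """Pack each token: top 3 bits hold the order n, the rest hold the index."""
    buf = bytearray()
    for n, idx in tokens:
        width = CODE_BYTES[n]
        index_bits = 8 * width - 3
        if idx >= 1 << index_bits:
            raise ValueError("index does not fit in the assumed code width")
        buf += ((n << index_bits) | idx).to_bytes(width, "big")
    return bytes(buf)


def from_bytes(data: bytes) -> List[Tuple[int, int]]:
    """Inverse of to_bytes: read the order tag, then the index."""
    tokens: List[Tuple[int, int]] = []
    pos = 0
    while pos < len(data):
        n = data[pos] >> 5
        width = CODE_BYTES[n]
        value = int.from_bytes(data[pos:pos + width], "big")
        tokens.append((n, value & ((1 << (8 * width - 3)) - 1)))
        pos += width
    return tokens


def decode(tokens: List[Tuple[int, int]], dicts: Dictionaries) -> str:
    """Look each (order, index) pair up in the inverted dictionaries."""
    inverse = {n: {idx: gram for gram, idx in d.items()} for n, d in dicts.items()}
    return " ".join(" ".join(inverse[n][idx]) for n, idx in tokens)
```

With toy dictionaries built from the text itself, a six-word sentence is covered by one five-gram code and one unigram code, so longer dictionary matches translate directly into fewer output bytes. A real system would need a strategy for unseen words and would load prebuilt dictionaries rather than build them from the input.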