n-Gram-Based Text Compression

We propose an efficient method for compressing Vietnamese text using n-gram dictionaries. It has a significant compression ratio in comparison with those of state-of-the-art methods on the same dataset. Given a text, first, the proposed method splits it into n-grams and then encodes them based on n-...

Bibliographic Details
Main authors: Nguyen, Vu H., Nguyen, Hien T., Duong, Hieu N., Snasel, Vaclav
Format: Online Article Text
Language: English
Published: Hindawi Publishing Corporation 2016
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5124757/
https://www.ncbi.nlm.nih.gov/pubmed/27965708
http://dx.doi.org/10.1155/2016/9483646
_version_ 1782469894827147264
author Nguyen, Vu H.
Nguyen, Hien T.
Duong, Hieu N.
Snasel, Vaclav
author_facet Nguyen, Vu H.
Nguyen, Hien T.
Duong, Hieu N.
Snasel, Vaclav
author_sort Nguyen, Vu H.
collection PubMed
description We propose an efficient method for compressing Vietnamese text using n-gram dictionaries. It has a significant compression ratio in comparison with those of state-of-the-art methods on the same dataset. Given a text, first, the proposed method splits it into n-grams and then encodes them based on n-gram dictionaries. In the encoding phase, we use a sliding window with a size that ranges from bigram to five grams to obtain the best encoding stream. Each n-gram is encoded by two to four bytes accordingly based on its corresponding n-gram dictionary. We collected a 2.5 GB text corpus from some Vietnamese news agencies to build n-gram dictionaries from unigram to five grams and obtained dictionaries with a size of 12 GB in total. In order to evaluate our method, we collected a testing set of 10 different text files with different sizes. The experimental results indicate that our method achieves a compression ratio of around 90% and outperforms state-of-the-art methods.
format Online
Article
Text
id pubmed-5124757
institution National Center for Biotechnology Information
language English
publishDate 2016
publisher Hindawi Publishing Corporation
record_format MEDLINE/PubMed
spelling pubmed-51247572016-12-13 n-Gram-Based Text Compression Nguyen, Vu H. Nguyen, Hien T. Duong, Hieu N. Snasel, Vaclav Comput Intell Neurosci Research Article We propose an efficient method for compressing Vietnamese text using n-gram dictionaries. It has a significant compression ratio in comparison with those of state-of-the-art methods on the same dataset. Given a text, first, the proposed method splits it into n-grams and then encodes them based on n-gram dictionaries. In the encoding phase, we use a sliding window with a size that ranges from bigram to five grams to obtain the best encoding stream. Each n-gram is encoded by two to four bytes accordingly based on its corresponding n-gram dictionary. We collected 2.5 GB text corpus from some Vietnamese news agencies to build n-gram dictionaries from unigram to five grams and achieve dictionaries with a size of 12 GB in total. In order to evaluate our method, we collected a testing set of 10 different text files with different sizes. The experimental results indicate that our method achieves compression ratio around 90% and outperforms state-of-the-art methods. Hindawi Publishing Corporation 2016 2016-11-14 /pmc/articles/PMC5124757/ /pubmed/27965708 http://dx.doi.org/10.1155/2016/9483646 Text en Copyright © 2016 Vu H. Nguyen et al. https://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
spellingShingle Research Article
Nguyen, Vu H.
Nguyen, Hien T.
Duong, Hieu N.
Snasel, Vaclav
n-Gram-Based Text Compression
title n-Gram-Based Text Compression
title_full n-Gram-Based Text Compression
title_fullStr n-Gram-Based Text Compression
title_full_unstemmed n-Gram-Based Text Compression
title_short n-Gram-Based Text Compression
title_sort n-gram-based text compression
topic Research Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5124757/
https://www.ncbi.nlm.nih.gov/pubmed/27965708
http://dx.doi.org/10.1155/2016/9483646
work_keys_str_mv AT nguyenvuh ngrambasedtextcompression
AT nguyenhient ngrambasedtextcompression
AT duonghieun ngrambasedtextcompression
AT snaselvaclav ngrambasedtextcompression
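
The encoding phase summarized in the description field above (split the text into n-grams, then replace each n-gram with a reference into the matching n-gram dictionary) can be illustrated with a small sketch. This is not the authors' implementation. It assumes a greedy longest-match strategy that tries five-grams first and falls back to shorter n-grams, tiny in-memory dictionaries built from a toy corpus, and a literal token for words missing from every dictionary. The record does not specify any of these details, nor the byte layout of the two-to-four-byte codes, so the sketch emits `(n, index)` pairs instead of packed bytes. The names `build_dictionaries`, `encode`, and `decode` are hypothetical.

```python
from typing import Dict, List, Tuple

NGramDicts = Dict[int, Dict[Tuple[str, ...], int]]


def build_dictionaries(corpus: List[str], max_n: int = 5) -> NGramDicts:
    """Build one dictionary per n-gram order, mapping each n-gram to an integer id."""
    dicts: NGramDicts = {n: {} for n in range(1, max_n + 1)}
    for sentence in corpus:
        words = sentence.split()
        for n in range(1, max_n + 1):
            for i in range(len(words) - n + 1):
                gram = tuple(words[i:i + n])
                # Assign the next free id the first time an n-gram is seen.
                dicts[n].setdefault(gram, len(dicts[n]))
    return dicts


def encode(text: str, dicts: NGramDicts, max_n: int = 5) -> List[tuple]:
    """Greedy longest-match encoding (an assumption, not the paper's exact rule)."""
    words = text.split()
    tokens: List[tuple] = []
    i = 0
    while i < len(words):
        # Try the longest window first, shrinking down to a single word.
        for n in range(min(max_n, len(words) - i), 0, -1):
            index = dicts[n].get(tuple(words[i:i + n]))
            if index is not None:
                tokens.append((n, index))
                i += n
                break
        else:
            # No dictionary contains this word: keep it as a literal.
            tokens.append(("lit", words[i]))
            i += 1
    return tokens


def decode(tokens: List[tuple], dicts: NGramDicts) -> str:
    """Invert encode() by looking each (n, index) pair up in the n-gram dictionary."""
    inverse = {n: {idx: gram for gram, idx in d.items()} for n, d in dicts.items()}
    words: List[str] = []
    for kind, value in tokens:
        if kind == "lit":
            words.append(value)
        else:
            words.extend(inverse[kind][value])
    return " ".join(words)


if __name__ == "__main__":
    dicts = build_dictionaries(["a b c d e f"])
    tokens = encode("a b c d e x f", dicts)
    print(tokens)
    print(decode(tokens, dicts))
```

In this toy form, one token stands for up to five words, which is where the savings come from once each token is packed into a few bytes. A real system would also need to handle casing, punctuation, and whitespace, which this sketch normalizes away.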