
Statistical Analysis of the Indus Script Using n-Grams

The Indus script is one of the major undeciphered scripts of the ancient world. The small size of the corpus, the absence of bilingual texts, and the lack of definite knowledge of the underlying language have frustrated efforts at decipherment since the discovery of the remains of the Indus civilizat...


Bibliographic Details
Main Authors: Yadav, Nisha; Joglekar, Hrishikesh; Rao, Rajesh P. N.; Vahia, Mayank N.; Adhikari, Ronojoy; Mahadevan, Iravatham
Format: Text
Language: English
Published: Public Library of Science 2010
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2841631/
https://www.ncbi.nlm.nih.gov/pubmed/20333254
http://dx.doi.org/10.1371/journal.pone.0009506
_version_ 1782179142177914880
author Yadav, Nisha
Joglekar, Hrishikesh
Rao, Rajesh P. N.
Vahia, Mayank N.
Adhikari, Ronojoy
Mahadevan, Iravatham
author_facet Yadav, Nisha
Joglekar, Hrishikesh
Rao, Rajesh P. N.
Vahia, Mayank N.
Adhikari, Ronojoy
Mahadevan, Iravatham
author_sort Yadav, Nisha
collection PubMed
description The Indus script is one of the major undeciphered scripts of the ancient world. The small size of the corpus, the absence of bilingual texts, and the lack of definite knowledge of the underlying language have frustrated efforts at decipherment since the discovery of the remains of the Indus civilization. Building on previous statistical approaches, we apply the tools of statistical language processing, specifically n-gram Markov chains, to analyze the syntax of the Indus script. We find that unigrams follow a Zipf-Mandelbrot distribution. Text beginner and ender distributions are unequal, providing internal evidence for syntax. We see clear evidence of strong bigram correlations and extract significant pairs and triplets using a log-likelihood measure of association. Highly frequent pairs and triplets are not always highly significant. The model performance is evaluated using information-theoretic measures and cross-validation. The model can restore doubtfully read texts with an accuracy of about 75%. We find that a quadrigram Markov chain saturates information-theoretic measures against a held-out corpus. Our work forms the basis for the development of a stochastic grammar which may be used to explore the syntax of the Indus script in greater detail.
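The restoration step mentioned in the description can be illustrated with a minimal sketch of a bigram Markov chain. This is not the authors' implementation: the sign labels, the toy corpus, the add-one smoothing, and the function names (train_bigram, bigram_prob, restore) are all assumptions made for illustration only.

```python
from collections import Counter, defaultdict

START, END = "<s>", "</s>"  # hypothetical text-boundary markers


def train_bigram(texts):
    """Count unigrams and bigrams over texts padded with boundary markers."""
    unigrams = Counter()
    bigrams = defaultdict(Counter)
    for text in texts:
        padded = [START] + list(text) + [END]
        unigrams.update(padded)
        for prev, cur in zip(padded, padded[1:]):
            bigrams[prev][cur] += 1
    return unigrams, bigrams


def bigram_prob(prev, cur, bigrams, vocab_size):
    """P(cur | prev) with add-one smoothing (an assumed smoothing choice)."""
    return (bigrams[prev][cur] + 1) / (sum(bigrams[prev].values()) + vocab_size)


def restore(text, position, unigrams, bigrams):
    """Guess the sign at `position` from its left and right neighbours.

    Picks the candidate c that maximises P(c | previous) * P(next | c).
    """
    padded = [START] + list(text) + [END]
    i = position + 1  # shift by one because of the start marker
    prev, nxt = padded[i - 1], padded[i + 1]
    vocab_size = len(unigrams)
    candidates = [s for s in unigrams if s not in (START, END)]
    return max(
        candidates,
        key=lambda c: bigram_prob(prev, c, bigrams, vocab_size)
        * bigram_prob(c, nxt, bigrams, vocab_size),
    )


# Toy corpus of made-up sign labels; not real Indus data.
toy_texts = [["A", "B", "C"], ["A", "B", "D"], ["A", "B", "C"], ["E", "B", "C"]]
unigrams, bigrams = train_bigram(toy_texts)
print(restore(["A", None, "C"], 1, unigrams, bigrams))
```

A higher-order chain (trigram or quadrigram) would condition on more neighbouring signs in the same way, at the cost of sparser counts on a small corpus.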
format Text
id pubmed-2841631
institution National Center for Biotechnology Information
language English
publishDate 2010
publisher Public Library of Science
record_format MEDLINE/PubMed
spelling pubmed-2841631 2010-03-24 Statistical Analysis of the Indus Script Using n-Grams Yadav, Nisha Joglekar, Hrishikesh Rao, Rajesh P. N. Vahia, Mayank N. Adhikari, Ronojoy Mahadevan, Iravatham PLoS One Research Article The Indus script is one of the major undeciphered scripts of the ancient world. The small size of the corpus, the absence of bilingual texts, and the lack of definite knowledge of the underlying language have frustrated efforts at decipherment since the discovery of the remains of the Indus civilization. Building on previous statistical approaches, we apply the tools of statistical language processing, specifically n-gram Markov chains, to analyze the syntax of the Indus script. We find that unigrams follow a Zipf-Mandelbrot distribution. Text beginner and ender distributions are unequal, providing internal evidence for syntax. We see clear evidence of strong bigram correlations and extract significant pairs and triplets using a log-likelihood measure of association. Highly frequent pairs and triplets are not always highly significant. The model performance is evaluated using information-theoretic measures and cross-validation. The model can restore doubtfully read texts with an accuracy of about 75%. We find that a quadrigram Markov chain saturates information-theoretic measures against a held-out corpus. Our work forms the basis for the development of a stochastic grammar which may be used to explore the syntax of the Indus script in greater detail. Public Library of Science 2010-03-19 /pmc/articles/PMC2841631/ /pubmed/20333254 http://dx.doi.org/10.1371/journal.pone.0009506 Text en Yadav et al. http://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are properly credited.
spellingShingle Research Article
Yadav, Nisha
Joglekar, Hrishikesh
Rao, Rajesh P. N.
Vahia, Mayank N.
Adhikari, Ronojoy
Mahadevan, Iravatham
Statistical Analysis of the Indus Script Using n-Grams
title Statistical Analysis of the Indus Script Using n-Grams
title_full Statistical Analysis of the Indus Script Using n-Grams
title_fullStr Statistical Analysis of the Indus Script Using n-Grams
title_full_unstemmed Statistical Analysis of the Indus Script Using n-Grams
title_short Statistical Analysis of the Indus Script Using n-Grams
title_sort statistical analysis of the indus script using n-grams
topic Research Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2841631/
https://www.ncbi.nlm.nih.gov/pubmed/20333254
http://dx.doi.org/10.1371/journal.pone.0009506
work_keys_str_mv AT yadavnisha statisticalanalysisoftheindusscriptusingngrams
AT joglekarhrishikesh statisticalanalysisoftheindusscriptusingngrams
AT raorajeshpn statisticalanalysisoftheindusscriptusingngrams
AT vahiamayankn statisticalanalysisoftheindusscriptusingngrams
AT adhikarironojoy statisticalanalysisoftheindusscriptusingngrams
AT mahadevaniravatham statisticalanalysisoftheindusscriptusingngrams