Statistical Analysis of the Indus Script Using n-Grams
The Indus script is one of the major undeciphered scripts of the ancient world. The small size of the corpus, the absence of bilingual texts, and the lack of definite knowledge of the underlying language has frustrated efforts at decipherment since the discovery of the remains of the Indus civilizat...
Main Authors: | Yadav, Nisha; Joglekar, Hrishikesh; Rao, Rajesh P. N.; Vahia, Mayank N.; Adhikari, Ronojoy; Mahadevan, Iravatham |
---|---|
Format: | Text |
Language: | English |
Published: | Public Library of Science, 2010 |
Subjects: | |
Online Access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2841631/ https://www.ncbi.nlm.nih.gov/pubmed/20333254 http://dx.doi.org/10.1371/journal.pone.0009506 |
_version_ | 1782179142177914880 |
---|---|
author | Yadav, Nisha; Joglekar, Hrishikesh; Rao, Rajesh P. N.; Vahia, Mayank N.; Adhikari, Ronojoy; Mahadevan, Iravatham |
author_facet | Yadav, Nisha; Joglekar, Hrishikesh; Rao, Rajesh P. N.; Vahia, Mayank N.; Adhikari, Ronojoy; Mahadevan, Iravatham |
author_sort | Yadav, Nisha |
collection | PubMed |
description | The Indus script is one of the major undeciphered scripts of the ancient world. The small size of the corpus, the absence of bilingual texts, and the lack of definite knowledge of the underlying language has frustrated efforts at decipherment since the discovery of the remains of the Indus civilization. Building on previous statistical approaches, we apply the tools of statistical language processing, specifically n-gram Markov chains, to analyze the syntax of the Indus script. We find that unigrams follow a Zipf-Mandelbrot distribution. Text beginner and ender distributions are unequal, providing internal evidence for syntax. We see clear evidence of strong bigram correlations and extract significant pairs and triplets using a log-likelihood measure of association. Highly frequent pairs and triplets are not always highly significant. The model performance is evaluated using information-theoretic measures and cross-validation. The model can restore doubtfully read texts with an accuracy of about 75%. We find that a quadrigram Markov chain saturates information theoretic measures against a held-out corpus. Our work forms the basis for the development of a stochastic grammar which may be used to explore the syntax of the Indus script in greater detail. |
format | Text |
id | pubmed-2841631 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2010 |
publisher | Public Library of Science |
record_format | MEDLINE/PubMed |
spelling | pubmed-28416312010-03-24 Statistical Analysis of the Indus Script Using n-Grams Yadav, Nisha Joglekar, Hrishikesh Rao, Rajesh P. N. Vahia, Mayank N. Adhikari, Ronojoy Mahadevan, Iravatham PLoS One Research Article The Indus script is one of the major undeciphered scripts of the ancient world. The small size of the corpus, the absence of bilingual texts, and the lack of definite knowledge of the underlying language has frustrated efforts at decipherment since the discovery of the remains of the Indus civilization. Building on previous statistical approaches, we apply the tools of statistical language processing, specifically n-gram Markov chains, to analyze the syntax of the Indus script. We find that unigrams follow a Zipf-Mandelbrot distribution. Text beginner and ender distributions are unequal, providing internal evidence for syntax. We see clear evidence of strong bigram correlations and extract significant pairs and triplets using a log-likelihood measure of association. Highly frequent pairs and triplets are not always highly significant. The model performance is evaluated using information-theoretic measures and cross-validation. The model can restore doubtfully read texts with an accuracy of about 75%. We find that a quadrigram Markov chain saturates information theoretic measures against a held-out corpus. Our work forms the basis for the development of a stochastic grammar which may be used to explore the syntax of the Indus script in greater detail. Public Library of Science 2010-03-19 /pmc/articles/PMC2841631/ /pubmed/20333254 http://dx.doi.org/10.1371/journal.pone.0009506 Text en Yadav et al. http://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are properly credited. |
spellingShingle | Research Article Yadav, Nisha Joglekar, Hrishikesh Rao, Rajesh P. N. Vahia, Mayank N. Adhikari, Ronojoy Mahadevan, Iravatham Statistical Analysis of the Indus Script Using n-Grams |
title | Statistical Analysis of the Indus Script Using n-Grams |
title_full | Statistical Analysis of the Indus Script Using n-Grams |
title_fullStr | Statistical Analysis of the Indus Script Using n-Grams |
title_full_unstemmed | Statistical Analysis of the Indus Script Using n-Grams |
title_short | Statistical Analysis of the Indus Script Using n-Grams |
title_sort | statistical analysis of the indus script using n-grams |
topic | Research Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2841631/ https://www.ncbi.nlm.nih.gov/pubmed/20333254 http://dx.doi.org/10.1371/journal.pone.0009506 |
work_keys_str_mv | AT yadavnisha statisticalanalysisoftheindusscriptusingngrams AT joglekarhrishikesh statisticalanalysisoftheindusscriptusingngrams AT raorajeshpn statisticalanalysisoftheindusscriptusingngrams AT vahiamayankn statisticalanalysisoftheindusscriptusingngrams AT adhikarironojoy statisticalanalysisoftheindusscriptusingngrams AT mahadevaniravatham statisticalanalysisoftheindusscriptusingngrams |
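
The description above mentions extracting significant sign pairs with a log-likelihood measure of association and notes that highly frequent pairs are not always highly significant. The record does not say which statistic the authors used, so the following is only a minimal sketch that assumes Dunning's G² statistic computed from a 2x2 contingency table of bigram counts. The function names and the toy sign labels (`"A"`, `"B"`, ...) are hypothetical and are not taken from the article or from the Indus corpus.

```python
import math
from collections import Counter


def bigram_counts(texts):
    """Count adjacent sign pairs over a list of sign sequences."""
    bigrams = Counter()
    for text in texts:
        bigrams.update(zip(text, text[1:]))
    return bigrams


def log_likelihood_ratio(a, b, bigrams):
    """Dunning's G^2 for the ordered pair (a, b).

    Assumption: this particular statistic is illustrative; the catalog
    record only says "a log-likelihood measure of association".
    """
    n = sum(bigrams.values())
    k11 = bigrams[(a, b)]                                   # a followed by b
    first_a = sum(c for (x, _), c in bigrams.items() if x == a)
    second_b = sum(c for (_, y), c in bigrams.items() if y == b)
    k12 = first_a - k11                                     # a followed by not-b
    k21 = second_b - k11                                    # not-a followed by b
    k22 = n - k11 - k12 - k21                               # neither
    rows = (k11 + k12, k21 + k22)
    cols = (k11 + k21, k12 + k22)
    observed = (k11, k12, k21, k22)
    # Expected counts if the first and second signs were independent.
    expected = (
        rows[0] * cols[0] / n,
        rows[0] * cols[1] / n,
        rows[1] * cols[0] / n,
        rows[1] * cols[1] / n,
    )
    return 2.0 * sum(
        o * math.log(o / e) for o, e in zip(observed, expected) if o > 0
    )


if __name__ == "__main__":
    # Toy corpus of hypothetical sign sequences (not real Indus texts).
    toy_texts = [["A", "B", "C"], ["A", "B", "D"], ["C", "D", "A"], ["B", "C", "A"]]
    counts = bigram_counts(toy_texts)
    ranked = sorted(
        counts, key=lambda p: log_likelihood_ratio(p[0], p[1], counts), reverse=True
    )
    for pair in ranked:
        print(pair, counts[pair], round(log_likelihood_ratio(pair[0], pair[1], counts), 3))
```

Sorting by G² rather than by raw count gives one way to see how a ranking by frequency and a ranking by significance can differ; the score is zero when a pair occurs exactly as often as independence would predict.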