
21.2.12

Indus script & computational linguistics - Nisha Yadav, Mayank N Vahia (21 Feb. 2012)


Indus Script & Computational Linguistics

[Article posted on 21-February-2012]
Nisha Yadav, Mayank N Vahia
Restoration of missing signs using a bigram model of Indus script
Writing epitomises the intellectual creation of a civilisation. It involves both the comprehension and the abstraction of symbols, and stands as a distinctive achievement of human creativity and communication. Renfrew points out that "The practice of writing, and the development of a coherent system of signs, a script, is something which is seen only in complex societies... Writing, in other words, is a feature of civilisations". When a civilisation leaves behind written records, they are invaluable for understanding not only its society but also the basic thinking processes that moulded it.
Decipherment of any script is a challenging task. At times it is aided by the discovery of a multilingual text where the same text is written in an undeciphered script as well as known script(s). Both Egyptian hieroglyphs and Mesopotamian cuneiform texts were deciphered with the help of multilingual texts. In some cases, continuing linguistic traditions provide significant clues and at times interlocking phonetic values are used as a proof of decipherment. In the absence of these, statistical studies can provide important insights into the structure of the script and can be used to define a syntactic framework for the script.
The Indus script is a product of one of the largest Bronze Age civilisations, often referred to as the Harappan civilisation. At its peak, from 2500 BC to 1900 BC, the civilisation was spread over an area of more than a million square kilometres across most of present-day Pakistan, Afghanistan and north-western India. It was distinguished by its highly utilitarian and standardised lifestyle, excellent water management system and architecture. The civilisation had flourishing trade links with West Asia, and artefacts of the Harappan civilisation have been found several thousand kilometres away in West Asia.
 
A large unicorn seal from Harappa
The Indus script is found predominantly on objects such as seals, sealings (made of terracotta or steatite), copper tablets, ivory sticks, bronze implements and pottery, from almost all sites of this civilisation and from some West Asian sites too. The objects on which the script was written are typically a few square centimetres in size (with the exception of a sign board in Dholavira) and often have multiple components, with highly decorated unicorn and other animal motifs, with or without a feeding trough. Many of these objects also carry geometric designs with multiple folds of symmetry and depictions of scenes involving humans. One of the excavators of Mohenjo Daro, Sir Mortimer Wheeler, says: "At their best, it would be no exaggeration to describe them as little masterpieces of controlled realism, with a monumental strength - in one sense out of all proportion to their size and in another entirely related to it."
The Indus script has defied decipherment in spite of several serious attempts. This is primarily because no multilingual texts have been found, the underlying language(s) is unknown and the script occurs in very short texts. The average length of an Indus text is five signs and the longest text in a single line has only 14 signs.
Through a series of systematic studies (see table below), the TIFR group, in collaboration with colleagues from India and abroad, has been working to understand the structure of Indus writing. Adopting a novel methodology based on statistical and computational techniques, the group approaches the problem in a manner that makes no assumptions about the script's underlying content, language or connection to later writing. The studies explore the structure of the Indus script in unprecedented detail using developments in machine learning, data mining and information theory, and apply techniques from computational linguistics and pattern recognition such as Markov models and n-grams.
Using these methods, the group first established that Indus writing has definite rules, or a grammatical structure. Having established that the writing is neither random nor disordered, the group is now working on revealing the subtleties of its structure. They have identified specific signs that begin and end the texts. Frequently occurring sign combinations (pairs and triplets) tend to appear at specific locations in the texts. A bigram model of the Indus script can restore illegible or incomplete texts found on broken or damaged objects with about 75% accuracy. Equally interestingly, the flexibility of sign usage in Indus texts, as measured by conditional entropy, falls within the range of linguistic systems and is distinct from non-linguistic systems such as protein or DNA sequences or Fortran code.
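To make the idea of restoring a sign with a bigram model concrete, here is a minimal sketch in Python. It is not the group's implementation: the sign labels and texts are invented placeholders, the function names are hypothetical, and add-one smoothing is a simplifying assumption. A candidate for the gap is scored by how likely it is to follow the sign on one side and to precede the sign on the other.

```python
from collections import Counter

START, END = "<s>", "</s>"  # markers for the beginning and end of a text

# Toy corpus of hypothetical sign labels. These are NOT real Indus sign
# numbers or real inscriptions; they only illustrate the mechanics.
TOY_TEXTS = [
    ["A", "B", "C"],
    ["A", "B", "D"],
    ["E", "B", "C"],
    ["A", "F", "C"],
    ["A", "B", "C"],
]


def train_bigrams(texts):
    """Count adjacent sign pairs, padding each text with START/END."""
    pair_counts = Counter()
    context_counts = Counter()
    vocab = set()
    for text in texts:
        padded = [START, *text, END]
        vocab.update(text)
        for prev, cur in zip(padded, padded[1:]):
            pair_counts[(prev, cur)] += 1
            context_counts[prev] += 1
    return pair_counts, context_counts, sorted(vocab)


def restore_missing(text, model):
    """Rank candidates for the single missing sign, marked as None."""
    pair_counts, context_counts, vocab = model
    n_outcomes = len(vocab) + 1  # every sign plus the END marker
    padded = [START, *text, END]
    gap = padded.index(None)
    prev_sign, next_sign = padded[gap - 1], padded[gap + 1]

    def prob(prev, cur):
        # P(cur | prev) with add-one smoothing (an assumption of this sketch)
        return (pair_counts[(prev, cur)] + 1) / (context_counts[prev] + n_outcomes)

    # Score each candidate by P(candidate | left) * P(right | candidate).
    scores = {c: prob(prev_sign, c) * prob(c, next_sign) for c in vocab}
    total = sum(scores.values())
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return [(sign, score / total) for sign, score in ranked]


model = train_bigrams(TOY_TEXTS)
for sign, p in restore_missing(["A", None, "C"], model):
    print(f"{sign}: {p:.3f}")
```

The output is a ranked list rather than a single answer, so a restoration can be reported as the most probable sign together with its alternatives.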
Conditional entropy of Indus inscription compared to linguistic and non linguistic systems
The difference in the pattern of sign sequencing between texts from Indus sites and those from West Asian sites suggests that the script was probably also used to write West Asian content. The group has also shown that signs which appear to be composites of other signs occur in completely different contexts from the sequences of their constituent signs. This indicates that shorthand was not the purpose of merging signs; rather, the merger changed the signs' context and presumably their meaning.
These studies will eventually help in defining a syntactic framework of the Indus script against which different hypotheses about its content can be tested.
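The conditional entropy comparison described above rests on a simple quantity: how uncertain the next sign is once the previous sign is known. The sketch below estimates it from raw pair counts. The function name and the toy sequences are assumptions made for illustration, and the published analyses may well use more careful estimators than this plain frequency-based one.

```python
import math
from collections import Counter


def conditional_entropy(sequences):
    """Estimate H(next | previous) in bits from adjacent pairs."""
    pair_counts = Counter()
    prev_counts = Counter()
    for seq in sequences:
        for prev, cur in zip(seq, seq[1:]):
            pair_counts[(prev, cur)] += 1
            prev_counts[prev] += 1
    total = sum(pair_counts.values())
    if total == 0:
        return 0.0  # no adjacent pairs, so nothing to measure
    entropy = 0.0
    for (prev, _cur), n in pair_counts.items():
        p_pair = n / total             # joint probability P(prev, cur)
        p_cond = n / prev_counts[prev]  # conditional probability P(cur | prev)
        entropy -= p_pair * math.log2(p_cond)
    return entropy


# Hypothetical toy data: a rigid system where each token fixes the next one,
# and a more flexible one where "A" is followed by "B" or "C" equally often.
rigid = [["A", "B", "C"]] * 4
flexible = [["A", "B"], ["A", "C"], ["A", "B"], ["A", "C"]]

print(conditional_entropy(rigid))
print(conditional_entropy(flexible))
```

A completely rigid system scores zero and a fully random one scores the maximum for its vocabulary; the claim in the text is that Indus texts fall between these extremes, in the range occupied by linguistic systems.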

Major Conclusions

| Sl. No. | Test/Measure | Results | Conclusions |
|---|---|---|---|
| 1 | Zipf-Mandelbrot law | Best fit for a = 15.4, b = 2.6, c = 44.5 (95% confidence interval) | A small number of signs account for the bulk of the data, while a large number of signs contribute to a long tail. |
| 2 | Cumulative frequency distribution | 69 signs: 80% of EBUDS; 23 signs: 80% of text enders; 82 signs: 80% of text beginners | Indicates asymmetry in the usage of 417 distinct signs. Suggests logic and structure in the writing. |
| 3 | Bigram probability | Conditional probability matrix is strikingly different from the matrix assuming no correlations. | Indicates the presence of significant correlations between signs. |
| 4 | Conditional probabilities of text beginners and text enders | A restricted number of signs follow frequent text beginners, whereas a large number of signs precede frequent text enders. | Indicates the presence of signs having similar syntactic functions. |
| 5 | Log-likelihood significance test | Significant sign pairs and triplets extracted. | The most significant sign pairs and triplets are not always the most frequent ones. |
| 6 | Entropy | Random: 8.70; EBUDS: 6.68 | Indicates the presence of correlations. |
| 7 | Mutual information | Random: 0; EBUDS: 2.24 | Indicates flexibility in sign usage. |
| 8 | Perplexity | Monotonic reduction as n increases from 1 to 5. | Indicates the presence of long-range correlations. |
| 9 | Sign restoration | Restoration of missing and illegible signs. | Bigram model can restore illegible signs according to probability. |
| 10 | Cross validation | Sensitivity of the bigram model = 74% | Bigram model can predict signs with 74% accuracy. |
| 11 | Conditional entropy | Closer to linguistic systems than to non-linguistic systems. | The flexibility of sign usage in Indus texts is closer to that of linguistic systems. |
| 12 | Comparison of compound signs with constituent sign sequences | The environments in which compound signs appear are very different from those of their constituent sign sequences, which rarely appear together. | Compound signs were not created for shorthand but seem to have a different function. |
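For readers unfamiliar with the measures in the table, the block below gives their standard textbook definitions. It is a reference sketch, not a transcription from the cited papers: the symbols f_r (frequency of the sign of rank r), P (empirical probabilities of signs and adjacent sign pairs) and H_n (per-sign entropy under an n-gram model) are notation chosen here, and the assumption that the fitted a, b and c in row 1 correspond to this logarithmic form of the Zipf-Mandelbrot law should be checked against the original publication. As a sanity check on row 6, 417 equally likely signs would give an entropy of log2 417 ≈ 8.70 bits, which matches the "Random" value.

```latex
% Zipf-Mandelbrot law (one common logarithmic form; assumed here)
\log f_r = a - b \,\log(r + c)

% Entropy of single signs, in bits
H = -\sum_{x} P(x)\,\log_2 P(x)

% Mutual information between adjacent signs x and y
I = \sum_{x,y} P(x,y)\,\log_2 \frac{P(x,y)}{P(x)\,P(y)}

% Conditional entropy of the next sign y given the previous sign x
H(Y \mid X) = -\sum_{x,y} P(x,y)\,\log_2 P(y \mid x)

% Perplexity of an n-gram model with per-sign entropy H_n
\mathrm{Perplexity}_n = 2^{H_n}
```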

Further reading:

A statistical approach for pattern search in Indus writing

Nisha Yadav, M N Vahia, Iravatham Mahadevan and H. Joglekar
International Journal of Dravidian Linguistics, 37, 39 - 52, January 2008

Segmentation of Indus text

Nisha Yadav, M N Vahia, Iravatham Mahadevan and H. Joglekar
International Journal of Dravidian Linguistics, 37, 53 - 72, January 2008

Statistical analysis of the Indus script using n-grams

Nisha Yadav, Hrishikesh Joglekar, Rajesh P.N. Rao, M. N. Vahia, Iravatham Mahadevan, R. Adhikari
PLoS ONE 5(3): e9506, doi:10.1371/journal.pone.0009506, March 2010

A probabilistic model for analyzing undeciphered scripts and its application to the 4500-year-old Indus script

Rajesh P. N. Rao, Nisha Yadav, Mayank N. Vahia, Hrishikesh Joglekar, R. Adhikari, Iravatham Mahadevan
Proceedings of the National Academy of Sciences (PNAS), Dec. 2009, 106: 13685-13690; published online before print August 5, 2009, doi:10.1073/pnas.0906237106

Evidence for linguistic structure in the Indus script

Rajesh P. N. Rao, Nisha Yadav, Mayank N. Vahia, Hrishikesh Joglekar, R. Adhikari, Iravatham Mahadevan
Science, 324, 1165, 2009

Network analysis reveals structure indicative of syntax in the corpus of undeciphered Indus civilisation inscriptions

Sitabhra Sinha, Raj Kumar Pan, Nisha Yadav, Mayank Vahia and Iravatham Mahadevan
Proceedings of the 2009 Workshop on Graph-based Methods for Natural Language Processing, ACL-IJCNLP 2009, pages 5-13, Suntec, Singapore

Entropy, the Indus script and language: A reply to R. Sproat

Rajesh Rao, Nisha Yadav, M N Vahia, H Joglekar, R Adhikari and I Mahadevan
Computational Linguistics 36(4), 2010

Harappan geometry and symmetry: A study of geometrical patterns on Indus objects

M N Vahia and Nisha Yadav
Indian Journal of History of Science, 45, 343, 2010

Classification of patterns on Indus objects

Nisha Yadav and M. N. Vahia
International Journal of Dravidian Linguistics, Vol. 40: No. 2, June 2011

Indus script: A study of its sign design

Nisha Yadav and M N Vahia
Scripta, Vol. 3, pp. 133-172, September 2011


http://www.tifr.res.in/newsite/dynamic/TSN/faq.php?telid=25
