Bringing ancient Indian semantic and syntactic theory face to face with contemporary computational linguistics


Formal and computational linguistics was dominated by English at its inception and developed in subsequent decades primarily in the environment of European languages. More recently there has been a concerted effort to undertake formal linguistic analysis of a wide variety of languages, with particular interest in those with dramatically different features, and to enrich linguistic theory to account for linguistic variety. In spite of this effort, the analytic structures and procedures utilized in formal linguistics remain dominated by those invented for, and most suitable for, English and other European languages. Linguistic theory remains unduly weighted in favor of European languages, even as the extension of these structures and procedures to the variety of the world’s languages involves undue complication, thereby revealing their inadequacy in representing language universally. In developing a universally adequate linguistic theory, it would prove particularly useful to investigate sophisticated linguistic theories, structures, and procedures developed to describe languages of a very different character from English.

India developed an extraordinarily rich linguistic tradition over more than three millennia that remains under-appreciated and under-investigated. A cursory glance at the long tradition of discussion and argumentation within and between the Indian sciences of phonetics (śikṣā), grammar (vyākaraṇa), logic (nyāya), ritual exegesis (karmamīmāṃsā), and literary theory (alaṅkāraśāstra) reveals that Indian linguistic traditions have much to offer contemporary linguistic theory in the areas of phonetics, morphology, syntax, and semantics. The proposed project builds a bridge between the ancient and the modern, between a difficult-to-penetrate humanistic discipline and a rigorous formal science. The project investigates ways in which Indian linguistics might contribute useful insights to contemporary formal linguistics, and designs ways in which Indian linguistic theories can be formalized and implemented computationally. The project focuses on Indian semantic and syntactic theory and on the semantics-syntax interface, where computational linguistic work is flourishing. The investigator will draw upon selected major semantic and syntactic treatises in the Indian grammatical tradition and upon contemporary techniques of formalization and computational implementation to bring ancient Indian theories face to face with contemporary computational linguistic work in a series of ten lectures. On the one hand, the lectures will articulate Indian theories in contemporary terms and offer a critique and insights useful to contemporary linguists. On the other hand, the lectures will suggest ways of modeling ancient Indian theories computationally. Such computational modeling will help to clarify those ancient theories and assist in answering difficult questions regarding their principles and historicity. The project will culminate in a seminar on Sanskrit computational linguistics.

Major activities

Lecture series

Construction of a morphologically and syntactically tagged corpus of Sanskrit texts

Syntactic research on Sanskrit is hindered by the fact that there does not exist a morphologically and syntactically tagged corpus of Sanskrit texts. Despite the large number of digitized texts now available at various websites, and the significant number that have been partially or fully sandhi-analyzed, only relatively small portions of a small number of texts have been morphologically tagged. In June, Bunker, Huet, and Scharf collaborated to create an interface that allows software-assisted, human-validated tagging. Sentences in digital texts in the Sanskrit Library (SL) are fed to the Sanskrit Heritage (SH) parser. The possible solutions it returns are summarized in a user-friendly single-page interface that allows a Sanskrit scholar to select among the presented words, stems, and morphological tags, or to edit them. The interface also allows the scholar to edit and resubmit the sentence for re-analysis by the SH parser, or to tag the sentence manually. Results are saved in XML files that can be reviewed with the same interface. The project contracted IIT Bombay to engage two post-doctoral Sanskrit researchers to utilize the interface to tag digital texts.
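The select-and-save step of this workflow can be illustrated with a minimal sketch in Python. It is an illustration under stated assumptions, not the project's implementation: the actual format of the project's XML files and the output of the SH parser are not described here, so the element names, attribute names, tag labels, and helper functions below are all hypothetical. The candidate analyses are written by hand rather than obtained from a parser.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass


@dataclass(frozen=True)
class Analysis:
    """One candidate analysis of a word (hypothetical structure)."""
    form: str  # sandhi-analyzed word form
    stem: str  # nominal stem or verbal root
    tag: str   # morphological tag (labels are illustrative only)


def tag_sentence(text, candidates, choices):
    """Build an XML element from the analyses a scholar has selected.

    candidates: one list of Analysis objects per word
    choices:    index of the selected analysis for each word
    """
    if len(candidates) != len(choices):
        raise ValueError("need exactly one choice per word")
    sentence = ET.Element("sentence", {"text": text})
    for options, picked in zip(candidates, choices):
        selected = options[picked]
        word = ET.SubElement(
            sentence, "word", {"stem": selected.stem, "tag": selected.tag}
        )
        word.text = selected.form
    return sentence


def to_xml(sentence):
    """Serialize the tagged sentence to an XML string."""
    return ET.tostring(sentence, encoding="unicode")


# Hand-written candidates for "rāmo vanaṃ gacchati" (Rāma goes to the forest).
# Two of the three words are morphologically ambiguous out of context.
candidates = [
    [Analysis("rāmaḥ", "rāma", "m.sg.nom")],
    [Analysis("vanam", "vana", "n.sg.nom"),
     Analysis("vanam", "vana", "n.sg.acc")],
    [Analysis("gacchati", "gam", "pres.act.3sg"),
     Analysis("gacchati", "gacchat", "m.sg.loc")],
]

# The scholar selects the accusative noun and the finite verb.
sentence = tag_sentence("rāmo vanaṃ gacchati", candidates, [0, 1, 0])
print(to_xml(sentence))
```

Keeping every candidate alongside the selected index, rather than storing only the final choice, is one way to keep a saved sentence open to later review; whether the project's files record rejected candidates is not stated here.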

Scholars interested in contributing to the project by tagging a text of their own choosing are warmly invited to contact the project director.

Project personnel

  • Peter M. Scharf, Chaire Internationale de Recherche Blaise Pascal (UP7, INRIA)
  • Gérard Huet, Directeur de Recherche (INRIA)
  • Ralph E. Bunker, Assistant Professor (MUMRI)
  • Pawan Goyal, Post-doctoral Research Associate (INRIA, IITB)
  • Sharon Ben-Dor, Post-doctoral Research Associate (IITB)
  • Anuja Ajotikar, Post-doctoral Research Associate (IITB)
  • Amba Kulkarni, Head, Department of Sanskrit Studies, University of Hyderabad
  • Oliver Hellwig, Mitarbeiter, Universität Düsseldorf

Grant details

  • period: 1 February 2012 -- 30 June 2013
  • funding agency: Chaire Internationale de Recherche Blaise Pascal financée par l’Etat et la Région d'Ile-de-France, gérée par la Fondation de l’Ecole Normale Supérieure
  • funding: €200,000
  • location1: Laboratoire d'Histoire des Théories Linguistiques, Université Paris Diderot (UP7)
  • location2: Institut National de Recherche en Informatique et Automatique (INRIA)
  • location3: Indian Institute of Technology, Bombay (IITB)