Sanskrit Library History: 2001–2010

Whitney's Roots and inflection software

In 2003–2004, Scharf and Hyman collaborated to produce a digital edition of William Dwight Whitney’s The Roots, Verb Forms and Primary Derivatives of the Sanskrit Language. They also created inflection software for both nominals and verbs that allows users to enter a stem and view a table of its inflected forms. The inflection and sandhi software model the rules of the ancient Indian linguist Pāṇini. These projects were undertaken with assistance totaling $4,607 from the Consortium for Language Teaching and Learning.

International Digital Sanskrit Library Integration

Scharf led the three-year International Digital Sanskrit Library Integration project in the Classics Department at Brown University, funded by grants from the National Science Foundation for 2006–2009 ($247,350). The project created a digital Sanskrit library through collaboration with Jost Gippert, director of the Thesaurus Indogermanischer Text- und Sprachmaterialien at Johann Wolfgang Goethe Universität, Frankfurt am Main (TITUS), and Thomas Malten, director of the Cologne Digital Sanskrit Lexicon project at Universität zu Köln (CDSL). The aim was to produce an integrated educational and research environment for Sanskrit analogous to what the Perseus Digital Library provides for classical Greek and Latin literature. Peter Freund, librarian of the Vedic Reserve at Maharishi University of Management, joined the Sanskrit Library in 2010 and agreed to contribute his major archive of digital texts. The library has now acquired about 300 digital texts, 131 of which are on-line. The Sanskrit Library contributed validated character data to Venugopal Govindaraju at the Center of Excellence in Document Analysis and Recognition at the University at Buffalo (CEDAR) to provide a test-bed for Sanskrit OCR research.

International Sanskrit Computational Linguistics Consortium

Scharf, Huet, and Kulkarni collaborated with colleagues in planning a series of symposia to encourage research in the natural language processing of Sanskrit. Gérard Huet hosted the First International Sanskrit Computational Linguistics Symposium (ISCLS) at the Institut National de Recherche en Informatique et Automatique (INRIA) in October 2007. Scharf convened the Second ISCLS in May 2008, where participants agreed to establish the Sanskrit Computational Linguistics Consortium under the auspices of the Sanskrit Library. Amba Kulkarni hosted the Third ISCLS at the University of Hyderabad in January 2009, Girish Nath Jha hosted the Fourth ISCLS at Jawaharlal Nehru University in New Delhi in December 2010, and Malhar Kulkarni will host the Fifth ISCLS at IIT Bombay in January 2013. Kulkarni, Huet, and Scharf co-edited the papers of the first two symposia, which Springer published in its Lecture Notes in Artificial Intelligence series in 2009. Springer published the proceedings of the Third and Fourth ISCLS as well.

Sanskrit Library Phonetic encoding and Vedic Unicode

Scharf and the late Malcolm Hyman designed the Sanskrit Library Phonetic basic encoding (SLP1) after a thorough investigation of ancient Indian linguistic treatises. SLP1 associates each basic Sanskrit sound with a single character and adds modifiers that allow all sounds found in Vedic texts to be represented digitally. The encoding is described in Appendix B of their book Linguistic Issues in Encoding Sanskrit, published in hard copy by Motilal Banarsidass and in PDF by the Sanskrit Library. After an investigation of Sanskrit paleography, Scharf initiated a worldwide collaboration to extend the Unicode Standard to include 68 additional characters required for the proper display of the ancient Vedic heritage texts of India. The partners included the Department of Information Technology of the Government of India's Ministry of Communications & Information Technology, the Government of India’s Centre for Development of Advanced Computing (C-DAC) in Mumbai, and the Script Encoding Initiative at Berkeley. See the Vedic Unicode project under Projects. Unicode Standard version 5.2 incorporated the characters in two code blocks, Devanagari Extended and Vedic Extensions, which are listed under South Asian Scripts on the Unicode Character Code Charts page.
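
The one-character-per-sound principle can be illustrated with a small sketch. The table below covers only the basic SLP1 letters (none of the Vedic modifiers), and the function name slp1_to_iast is hypothetical; it is an illustration under those assumptions, not the Sanskrit Library's own transcoder. Because every sound is exactly one SLP1 character, conversion to the standard Roman transliteration (IAST) needs no lookahead.

```python
# Illustrative sketch only: basic SLP1 letters mapped to IAST.
# Vedic modifiers and accents are deliberately left out.
SLP1_TO_IAST = {
    "a": "a", "A": "ā", "i": "i", "I": "ī", "u": "u", "U": "ū",
    "f": "ṛ", "F": "ṝ", "x": "ḷ", "X": "ḹ",
    "e": "e", "E": "ai", "o": "o", "O": "au",
    "M": "ṃ", "H": "ḥ",
    "k": "k", "K": "kh", "g": "g", "G": "gh", "N": "ṅ",
    "c": "c", "C": "ch", "j": "j", "J": "jh", "Y": "ñ",
    "w": "ṭ", "W": "ṭh", "q": "ḍ", "Q": "ḍh", "R": "ṇ",
    "t": "t", "T": "th", "d": "d", "D": "dh", "n": "n",
    "p": "p", "P": "ph", "b": "b", "B": "bh", "m": "m",
    "y": "y", "r": "r", "l": "l", "v": "v",
    "S": "ś", "z": "ṣ", "s": "s", "h": "h",
}


def slp1_to_iast(text):
    """Convert SLP1 to IAST one character at a time.

    Characters outside the table (spaces, punctuation) pass through unchanged.
    """
    return "".join(SLP1_TO_IAST.get(ch, ch) for ch in text)


print(slp1_to_iast("saMskfta"))
print(slp1_to_iast("Darma"))
```

Note that aspirates such as dh, which take two letters in IAST, are single characters (D) in SLP1, so upper and lower case carry a phonetic distinction.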

Inflectional morphology analysis and digital dictionary revision

By running our inflection software on the 170,000 nominal and verbal headwords in Monier-Williams’ (1899) A Sanskrit-English Dictionary (MW), the most complete English-language dictionary of Sanskrit, we created a full-form lexicon of eleven million entries that associates each inflected form with its inflectional identifier and headword. The full-form lexicon allowed us to build a morphological analyzer, which displays all possible analyses of the inflected nominal form entered in its input field. Each analysis consists of the inflectional identifier and the stem; the stem is a link to the Sanskrit Library multidictionary interface.

With our collaborators Malten and Funderburk, we revised and extended the tagging of CDSL’s edition of MW. Prior to our work, CDSL's original search interface did not distinguish between upper and lower case, even though the Harvard-Kyoto encoding it used does. Our interface uses the full accuracy of the phonetic encoding of Sanskrit and allows both user entry and display in Roman or Devanāgarī Unicode as well as in other standard encodings. In addition, our improved and extended markup allows software to distinguish the types of information in an entry by associating differently tagged elements with different text styles, such as color and italics, in the HTML used to create the display. Our current interface displays lexical tags in italics, Sanskrit forms in blue, class names and proper nouns in aqua, works cited underlined, and citations in green, making the entry easier to read. The display can be altered simply by changing parameters in the display software, without altering the markup of the data.

A new display, still under development and to be deployed on the web and on hand-held devices, shows a navigable list of dictionary headwords on the left, scrolled to the current headword, which is highlighted. The preference panel allows the user to choose between two formats: headwords in strictly alphabetical order, or in a hierarchy that shows derivational dependence. The hierarchical list shows, for instance, compound words formed by combining a subsequent element with the sought headword, indented beneath that headword.
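
The relationship between the inflection software, the full-form lexicon, and the analyzer can be pictured with a minimal sketch. Everything named here is hypothetical: the toy paradigm covers only three singular cases of a-stems, the identifier strings are invented, and the real software models Pāṇini's rules rather than a fixed ending table. Forms are in SLP1.

```python
from collections import defaultdict

# Hypothetical toy paradigm (assumption): a few singular endings for a-stems.
TOY_ENDINGS = {
    "m": [("nom. sg.", "H"), ("acc. sg.", "m"), ("voc. sg.", "")],
    "n": [("nom. sg.", "m"), ("acc. sg.", "m"), ("voc. sg.", "")],
}


def toy_inflect(stem, gender):
    """Yield (inflectional identifier, inflected form) pairs for an a-stem."""
    for label, ending in TOY_ENDINGS[gender]:
        yield f"{gender}. {label}", stem + ending


def build_full_form_lexicon(headwords):
    """Map every generated form to all (identifier, headword) pairs yielding it."""
    lexicon = defaultdict(list)
    for stem, gender in headwords:
        for identifier, form in toy_inflect(stem, gender):
            lexicon[form].append((identifier, stem))
    return dict(lexicon)


def analyze(lexicon, form):
    """Return every analysis of an inflected form; empty list if unknown."""
    return lexicon.get(form, [])


lexicon = build_full_form_lexicon([("rAma", "m"), ("Pala", "n")])
print(analyze(lexicon, "rAmaH"))  # one analysis
print(analyze(lexicon, "Palam"))  # ambiguous: nominative or accusative
```

Generating all forms in advance turns analysis into a simple lookup, and ambiguous forms naturally return more than one analysis.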

The revised Sanskrit lexicon displays contain a number beneath the headword that is a link to the digital image file of the page in MW. Clicking it brings up the image of the entire page. The advanced search form allows one to search not only for exact matches but also for all headwords that contain a string at the beginning, in the interior, or at the end. The advanced search form also allows English access to Sanskrit terms: it displays all headwords that contain a specific English word in their definition.
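
The search modes just described can be sketched as follows. The sample entries, their short glosses (invented, not quoted from MW), and the function names are all assumptions for illustration; the "contains" mode here matches the string anywhere in the headword, which is one reading of an interior match.

```python
import re

# Hypothetical sample data: SLP1 headwords with invented short glosses.
ENTRIES = [
    ("deva", "a god, deity"),
    ("devatA", "godhead, divinity"),
    ("mahAdeva", "the great god, a name of Siva"),
    ("agni", "fire, the god of fire"),
]


def search_headwords(entries, query, mode="exact"):
    """Return headwords matching query; comparison is case-sensitive,
    because case is phonetically significant in SLP1."""
    tests = {
        "exact": lambda h: h == query,
        "prefix": lambda h: h.startswith(query),
        "suffix": lambda h: h.endswith(query),
        "contains": lambda h: query in h,
    }
    match = tests[mode]
    return [h for h, _ in entries if match(h)]


def search_definitions(entries, english_word):
    """Return headwords whose gloss contains english_word as a whole word."""
    pattern = re.compile(r"\b" + re.escape(english_word) + r"\b", re.IGNORECASE)
    return [h for h, gloss in entries if pattern.search(gloss)]


print(search_headwords(ENTRIES, "deva", mode="suffix"))
print(search_definitions(ENTRIES, "god"))
```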

Integration of digital texts with linguistic software and lexical resources

The Sanskrit Library obtained three hundred digital editions of texts from TITUS, the Vedic Reserve, the NEH-funded grammatical databank project headed by George Cardona at the University of Pennsylvania in the early 1990s, and other sources, and displays them in a reader page that Scharf and Bunker designed. Each word in sandhi-analyzed texts dynamically links to a morphological analysis window. Clicking a word in the Rāmopākhyāna brings up the analysis done by hand by Scharf; clicking a word in other sandhi-analyzed texts, such as the padapāṭha sections of Kauṭilya's Arthaśāstra, searches for matches in our full-form lexicon. Direct access to the full-form lexicon is provided by our morphological analyzer. Clicking the stem of the word fills the multidictionary interface's search field with it. Clicking a dictionary looks the word up in the selected lexical source. Clicking the preferences dialogue box in the upper left corner of the Sanskrit Library page allows one to select various input methods and display formats, and to select from numerous lexical sources to include in word lookup. The system accommodates a variety of encodings, including Roman and Devanāgarī Unicode, and transcoders have been developed to cover the Unicode of eight major Indic scripts used for Sanskrit texts (Devanagari, Bengali, Gujarati, Gurmukhi, Kannada, Malayalam, Oriya, and Telugu). The complete set of transcoding software is accessible under Tools.
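
To give a sense of what transcoding from a phonetic encoding to an Indic script involves, here is a minimal sketch that converts basic SLP1 to Devanāgarī Unicode. It is not the Sanskrit Library's transcoder: the function name is hypothetical, and accents, Vedic signs, avagraha, and numerals are omitted. The point it illustrates is that a script with an inherent vowel needs context (vowel signs after consonants, virāma between consonants), whereas the phonetic encoding does not.

```python
# Illustrative sketch only (assumption): basic SLP1 to Devanagari.
VIRAMA = "\u094d"

CONSONANTS = {
    "k": "क", "K": "ख", "g": "ग", "G": "घ", "N": "ङ",
    "c": "च", "C": "छ", "j": "ज", "J": "झ", "Y": "ञ",
    "w": "ट", "W": "ठ", "q": "ड", "Q": "ढ", "R": "ण",
    "t": "त", "T": "थ", "d": "द", "D": "ध", "n": "न",
    "p": "प", "P": "फ", "b": "ब", "B": "भ", "m": "म",
    "y": "य", "r": "र", "l": "ल", "v": "व",
    "S": "श", "z": "ष", "s": "स", "h": "ह",
}
INDEPENDENT_VOWELS = {
    "a": "अ", "A": "आ", "i": "इ", "I": "ई", "u": "उ", "U": "ऊ",
    "f": "ऋ", "F": "ॠ", "x": "ऌ", "X": "ॡ",
    "e": "ए", "E": "ऐ", "o": "ओ", "O": "औ",
}
VOWEL_SIGNS = {  # "a" is inherent in a consonant letter, so it adds nothing
    "a": "", "A": "ा", "i": "ि", "I": "ी", "u": "ु", "U": "ू",
    "f": "ृ", "F": "ॄ", "x": "ॢ", "X": "ॣ",
    "e": "े", "E": "ै", "o": "ो", "O": "ौ",
}
OTHER = {"M": "ं", "H": "ः"}


def slp1_to_devanagari(text):
    out = []
    after_consonant = False  # True while a consonant still awaits its vowel
    for ch in text:
        if ch in CONSONANTS:
            if after_consonant:
                out.append(VIRAMA)  # consonant cluster: suppress inherent a
            out.append(CONSONANTS[ch])
            after_consonant = True
        elif ch in VOWEL_SIGNS:
            table = VOWEL_SIGNS if after_consonant else INDEPENDENT_VOWELS
            out.append(table[ch])
            after_consonant = False
        else:
            if after_consonant:
                out.append(VIRAMA)  # word-final or pre-punctuation consonant
            out.append(OTHER.get(ch, ch))
            after_consonant = False
    if after_consonant:
        out.append(VIRAMA)
    return "".join(out)


print(slp1_to_devanagari("rAmaH"))
print(slp1_to_devanagari("Darma"))
```

A transcoder for another Indic script could reuse the same loop with different tables, though script-specific conventions would still need separate handling.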

In texts in which interword phonetic alteration (sandhi) has not been analyzed, such as the Pañcatantra or the continuous text sections of the Arthaśāstra, additional analysis is required. In such texts each sentence is dynamically linked to Gérard Huet’s parser at the Sanskrit Heritage Site at INRIA Paris-Rocquencourt through a dialogue box that displays the editable sentence in SLP1 encoding. Huet’s parser analyzes the sentence using various syntactic criteria and his Sanskrit-French lexicon of about 25,000 words. Unpenalized solutions are selected and displayed. The site allows one to examine penalized solutions and to re-edit the sentence and resubmit it for further analysis. Huet's site additionally allows one to submit compounds for further analysis by the compound analyzer built by Amba Kulkarni at the University of Hyderabad and to submit analyzed sentences for syntactic analysis by her dependency tree parser.

Projects

Gérard Huet and Amba Kulkarni joined the Sanskrit Library board of directors this year, and we are intensifying the coordination of our resources. Please consult the descriptions of our projects for details.