Character and higher-level encoding.

  • UT102.
  • This course begins by examining methods of character encoding for Sanskrit: the Sanskrit Library Phonetic (SLP) encodings, ASCII meta-encodings, and Unicode. The course then introduces the extensible markup language (XML) and the Text-Encoding Initiative (TEI). Students will learn how to mark up text and bibliography, and how to formalize morphological (lexical and inflectional) tagging in TEI. The course concludes with training in the use of regular expressions, a powerful and widely applicable tool for searching and transforming text. This is the second technical course in The Sanskrit Library's digital humanities programs.
  • Instructors: Tanuja P. Ajotikar and Peter M. Scharf.
  • Schedule: Saturdays, 20 January – 4 May 2024, except 23 March.
  • Course meeting times: 9:00–11:00am U.S. Central Time.
  • Prerequisites: advanced competency in Sanskrit, fluency in reading Devanagari script, regular access to a computer, and basic computer skills.
  • Course fee: $1,000.
  • Course fee (residents of India): ₹10,000.
  • Register.
  • Register (residents of India).
  • Reference material:
  • Course materials: Scharf, Peter M. and Hyman, Malcolm D. Linguistic issues in encoding Sanskrit. Providence: The Sanskrit Library, 2010. Available via a link on the Sanskrit Library publications page.


Sequence Topic
1 Sanskrit Library Phonetic (SLP) encoding
2 Optical character recognition (OCR)
3 Extensible markup language (XML)
4 Text-Encoding Initiative (TEI)
5 Regular expressions and replacement expressions
6 Semantic markup with TEI


Week Date Topic, reading and assignment
1 20 January Media transition. Reading: Scharf and Hyman 2011: Ch. 1; Scharf 2014.
2 27 January Unsuitable character encoding: WX, Velthuis, KH, ITrans, IAST, ISO 15919, Unicode Devanagari, obtaining text by OCR. Reading: Scharf and Hyman 2011: Chs. 2–3. Homework: convert Devanagari in a PDF to character data using OCR.
3 3 February The basis for encoding (1 hr). Reading: Scharf and Hyman 2011: Ch. 4. Sanskrit phonology (2 hrs). Reading: Scharf and Hyman 2011: Ch. 5; Appendices A1–7.
4 10 February Principles of contrastive phonology, ideal character encoding: SLP. Reading: Scharf and Hyman 2011: Ch. 6; Appendices B–C.
5 17 February Principles of text-encoding, and introduction to TEI. Reading: Scharf 2016, 2018b. Homework: prose and verse text markup.
6 24 February TEI text-encoding of prose and verse
7 2 March TEI bibliography markup and introduction to the teiHeader
8 9 March TEI inflectional tagging
9 16 March TEI lexical tagging
10 30 March Advanced aspects of the teiHeader: subject classification, tag declaration, rendition and usage
11 6 April Introduction to regular expressions
12 13 April Using regular expressions
13 20 April Using TEITagger
14 27 April Using TEITagger
15 4 May Review