Character and higher-level encoding.
- UT102.
- This course begins by examining methods of character encoding for Sanskrit: the Sanskrit Library Phonetic (SLP) encodings, ASCII meta-encodings, and Unicode. The course then introduces the Extensible Markup Language (XML) and the Text-Encoding Initiative (TEI). Students will learn how to mark up text and bibliography, and how to formalize morphological (lexical and inflectional) tagging in TEI. The course concludes with training in regular expressions, a powerful and widely applicable tool. This is the second technical course in The Sanskrit Library's digital humanities programs.
- Instructors: Tanuja P. Ajotikar and Peter M. Scharf.
- Schedule: Saturdays, 20 January – 4 May 2024, except 23 March.
- Course meeting times: 9:00–11:00am U.S. Central Time.
- Prerequisites: Advanced competency in Sanskrit, fluency in reading Devanagari script, regular access to a computer, and basic computer skills.
- Course fee: $1,000.
- Course fee (residents of India): ₹10,000.
- Register.
- Register (residents of India).
- Reference material:
- Course materials: Scharf, Peter M. and Hyman, Malcolm D. Linguistic issues in encoding Sanskrit. Providence: The Sanskrit Library, 2010. Available via a link on the Sanskrit Library publications page.
Topics
Sequence | Topic |
---|---|
1 | Sanskrit Library Phonetic (SLP) encoding |
2 | Optical character recognition (OCR) |
3 | Extensible markup language (XML) |
4 | Text-Encoding Initiative (TEI) |
5 | Regular expressions and replacement expressions |
6 | Semantic markup with TEI |
Schedule
Week | Date | Topic, reading and assignment |
---|---|---|
1 | 20 January | Media transition. Reading: Scharf and Hyman 2011: Ch. 1; Scharf 2014. |
2 | 27 January | Unsuitable character encodings: WX, Velthuis, KH, ITrans, IAST, ISO 15919, Unicode Devanagari; obtaining text by OCR. Reading: Scharf and Hyman 2011: Chs. 2–3. Homework: convert Devanagari in a PDF to character data using OCR. |
3 | 3 February | The basis for encoding (1 hr). Reading: Scharf and Hyman 2011: Ch. 4. Sanskrit phonology (2 hrs). Reading: Scharf and Hyman 2011: Ch. 5; Appendices A1–7. |
4 | 10 February | Principles of contrastive phonology; ideal character encoding: SLP. Reading: Scharf and Hyman 2011: Ch. 6; Appendices B–C. |
5 | 17 February | Principles of text encoding and introduction to TEI. Reading: Scharf 2016, 2018b. Homework: prose and verse text markup. |
6 | 24 February | TEI text-encoding of prose and verse |
7 | 2 March | TEI bibliography markup and introduction to the teiHeader |
8 | 9 March | TEI inflectional tagging. Reading: \cite{huet.cm} |
9 | 16 March | TEI lexical tagging |
10 | 30 March | Advanced aspects of the teiHeader: subject classification, tag declaration, rendition and usage |
11 | 6 April | Introduction to regular expressions |
12 | 13 April | Using regular expressions |
13 | 20 April | Using TEITagger |
14 | 27 April | Using TEITagger |
15 | 4 May | Review |