The Plan
This program knows nothing about Sanskrit or Devanagari and needs to be trained. Training usually requires a transcription of the pages used. We want to avoid this if at all possible, because transcribing is labor-intensive and is most likely slower than manual alignment, especially if Ocropus needs to be trained for each manuscript to produce decent results.
Instead of using a transcription, we are trying the following approach, which we demonstrate using the Bhagavad Gita.
- Convert each line of the critical edition of the Bhagavad Gita to a .png file and save it to a directory along with a text file containing the corresponding line.
- This results in a directory containing 1455 .txt files with their corresponding .png files.
- This directory is used to train Ocropus.
- After training, we ran Ocropus on the .png files and compared its output with the .txt files. The OCR results corresponded very closely to the lines used to generate the .png files.
- Our next step is to use the model trained above to see whether it can be used to transcribe handwritten Devanagari. This requires preprocessing the handwritten images to clean them up and break them into lines. We have not succeeded in doing this yet.
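The comparison between OCR output and the original lines can be sketched roughly as follows. This is only an illustration, not the procedure actually used: it assumes the ground-truth .txt files and the OCR output .txt files sit in two separate directories under the same file names, and the function name `compare_lines` and both directory arguments are hypothetical. It counts exact matches only, so it gives a lower bound on how closely the results correspond.

```shell
#!/bin/bash
# Sketch: count OCR output files that match their ground-truth file exactly.
# Assumption (not from the notes): gt_dir and ocr_dir hold .txt files with
# the same base names, one file per line image.
compare_lines() {
    local gt_dir=$1 ocr_dir=$2
    local total=0 same=0 gt ocr
    for gt in "$gt_dir"/*.txt; do
        [ -e "$gt" ] || continue              # no .txt files in gt_dir
        ocr="$ocr_dir/$(basename "$gt")"
        total=$((total + 1))
        if [ -f "$ocr" ] && cmp -s "$gt" "$ocr"; then
            same=$((same + 1))
        else
            echo "differs: $(basename "$gt")"  # missing or not byte-identical
        fi
    done
    echo "$same of $total lines match exactly"
}
```

A call such as `compare_lines groundtruth ocr-output` prints one line per mismatching file followed by a summary. A character-level measure (for example an edit distance) would be needed to quantify near misses.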
Notes
https://github.com/tmbdev/ocropy
https://www.mindmeister.com/192257150/ocropus-overview-published (documentation map)
http://nbviewer.ipython.org/url/ocropy.ocropus.googlecode.com/hg/Notebooks/ocropus-steps.ipynb
~/OCR/Feb1515/Ocropus_sans
#!/bin/bash -e
rm -rf temp
#ocropus-nlbin tests/testpage.png -o temp (doesn't work)
lines preceded by a *** are my modifications to sample code
ocropus-sauvola tests/testpage.png -o temp
# sauvola produces the following:
# 1. temp/0001.nrm.png
# 2. temp/0001.bin.png
# Both look like testpage.png
ocropus-gpageseg 'temp/????.bin.png'
# gpageseg produces the following:
# 1. temp/0001.pseg.png (looks just like testpage.png)
# 2. temp/0001 directory with one .png for each line of testpage.png (e.g., 010042.bin.png)
***ocropus-rtrain BhGsmall/*.png -o TrainK -F 10000 --ntrain 50000 >t.xxx (note: this was not done in sample code, existing model used)
# -o is the name of the model (supplied as the -m parameter in ocropus-rpred)
# -F is how often to save the model. Each model will be saved as TrainK-nnnnnnnn.pyrnn.gz, where nnnnnnnn is a multiple of -F with leading zeros if necessary
# --ntrain is the total number of iterations
ocropus-rpred 'temp/????/??????.bin.png' >result.txt (this must use existing training data)
# ocropus-rpred does the OCR on files in temp/0001, apparently using an existing model, and produces one .txt file for each .png
***ocropus-rpred -m TrainK-00050000.pyrnn.gz BhGsmall/line?.png
ocropus-hocr 'temp/????.bin.png' -o temp.html
# ocropus-hocr collects all the .txt files created by ocropus-rpred into temp.html
ocropus-visualize-results temp
# creates index.html in directory temp.
ocropus-gtedit html temp/????/??????.bin.png -o temp-correction.html
# creates temp-correction.html (diagnostic information?)
echo "to see recognition results, type: firefox temp.html"
echo "to see correction page, type: firefox temp-correction.html"
echo "to see details on the recognition process, type: firefox temp/index.html"
UPenn Bhagavad Gitas
492 0006 recto
555 0001 (barely readable)
559 0004 (red in middle of recto)
569 0002 (does not include first verse)
773 0007 (kim akurvata saṃjaya)
906 0008 (starts at beginning)
2199 0007 (starts at beginning, first line of verso, easy to read)
2233 0006 (starts at beginning, first line of verso, easy to read)
2241 0003 (starts at beginning, first line of verso, easy to read)
2260 0005 (starts at row 24, first line of verso, easy to read)
2336 0006 (starts at beginning, second line of recto, easy to read)
2339 0004 (starts at beginning, fourth line of recto, easy to read)
2340 0006 (starts at row 11, first line of verso, easy to read)
2366 0003 (starts at beginning, waist of verso, easy to read)
2367 0007 (starts at row 255, recto (marked), not so easy to read)
2368 0002 (starts at row 128, verso end of third line from bottom, very small)
2369 0002 (starts at beginning, recto waist, easy to read)
2370 0003 (starts at beginning, recto waist, easy to read)
2390 0003 (starts at beginning, recto top, easy to read)
ocropus-sauvola tests/bhg2233_0007.jpg -o tempBhG
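The checkpoint naming described in the ocropus-rtrain notes above (model name, a hyphen, then the iteration count padded to eight digits) can be illustrated with a small sketch. The function name `checkpoint_names` is hypothetical, and the sketch only reproduces the naming pattern as the notes describe it; it does not call Ocropus or check which files were actually written.

```shell
#!/bin/bash
# Sketch: print the checkpoint names implied by the pattern in the notes,
# PREFIX-nnnnnnnn.pyrnn.gz, where nnnnnnnn is a multiple of the save
# interval (-F), zero-padded to 8 digits, up to the iteration total (--ntrain).
checkpoint_names() {
    local prefix=$1 every=$2 ntrain=$3 n
    [ "$every" -gt 0 ] || return 1            # avoid an endless loop
    for ((n = every; n <= ntrain; n += every)); do
        printf '%s-%08d.pyrnn.gz\n' "$prefix" "$n"
    done
}

# Values taken from the ocropus-rtrain call in the notes.
checkpoint_names TrainK 10000 50000
```

With the values from the notes, the last name printed is TrainK-00050000.pyrnn.gz, the model passed to ocropus-rpred with -m.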