Set up Optical Character Recognition

The New Gorhambury Project will identify the most suitable open-source libraries for Optical Character Recognition (OCR).

Tesseract – an open-source OCR engine that is popular among OCR developers. Although it can be difficult to implement and modify, for a long time there were few free and powerful OCR alternatives. Tesseract began as a Ph.D. research project at HP Labs, Bristol. It gained popularity and was developed by HP between 1984 and 1994. In 2005 HP released Tesseract as open-source software. From 2006 to 2018 it was developed by Google.
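As a rough illustration of what a first Tesseract setup might look like, the sketch below calls the Tesseract command-line program from Python. It assumes the tesseract executable is installed and on the PATH, and that an English language pack is available. The function names build_tesseract_command and run_ocr and the file names are hypothetical, chosen only for this example; this is not a recommendation for how the project should integrate Tesseract.

```python
import shutil
import subprocess
from pathlib import Path


def build_tesseract_command(image_path, output_base, lang="eng"):
    """Return the argument list for a basic Tesseract CLI call.

    The basic CLI form is: tesseract <image> <output base> -l <language>.
    Tesseract writes the recognised text to <output base>.txt.
    """
    return ["tesseract", str(image_path), str(output_base), "-l", lang]


def run_ocr(image_path, output_base, lang="eng"):
    """Run Tesseract on one image and return the recognised text.

    Assumption: the 'tesseract' executable is installed and on the PATH.
    """
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    # check=True raises CalledProcessError if Tesseract exits with an error.
    subprocess.run(build_tesseract_command(image_path, output_base, lang), check=True)
    return Path(str(output_base) + ".txt").read_text(encoding="utf-8")
```

For a scanned page saved as page.png (a placeholder name), run_ocr("page.png", "page") would be expected to write page.txt and return its contents.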

Ocular – a state-of-the-art historical OCR system that works best on documents printed using a hand press, including those written in multiple languages. It is run from the command line. Its primary features are:

  • Unsupervised learning of unknown fonts: requires only document images and a corpus of text.
  • Ability to handle noisy documents: inconsistent inking, spacing, and vertical alignment.
  • Support for multilingual documents, including those that have considerable word-level code-switching.
  • Unsupervised learning of orthographic variation patterns, including archaic spellings and printer shorthand.
  • Simultaneous, joint transcription into both diplomatic (literal) and normalized forms.
