Resource: Using Kraken to Train your own OCR Models

From the resource:

Over the summer of 2019, inspired by the promising results in articles like Romanov et al. 2017, I set out to use the Kraken OCR software on a variety of texts. Kraken (see the project's website or its repository) is open-source command-line software that can reach accuracy rates in the high nineties (percent) for printed Arabic and Persian text.

Kraken is not equipped to handle every text. I recommend using it only on works for which you have clear PDFs or page images (300 DPI is the usual recommendation) and in which the text is laid out in a single column. If you are starting from a PDF, use your tool of choice to separate the pages into individual image files. I use pdftoppm or ImageMagick’s convert tool, and I have been able to use Kraken with PNG, TIFF, and JPG files.
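The page-splitting step can be scripted. What follows is only a minimal sketch, assuming pdftoppm (part of Poppler) is installed and on your PATH. The function names, the "page" file prefix, and the file names in the usage note are hypothetical choices for illustration, not part of Kraken or of the workflow described above.

```python
import subprocess
from pathlib import Path

# Image formats that pdftoppm can write directly, mapped to their flags.
FORMAT_FLAGS = {"png": "-png", "jpeg": "-jpeg", "tiff": "-tiff"}


def build_pdftoppm_command(pdf_path, out_prefix, dpi=300, fmt="png"):
    """Return the pdftoppm argument list for splitting a PDF into page images.

    dpi defaults to 300, the resolution usually recommended for OCR.
    """
    if fmt not in FORMAT_FLAGS:
        raise ValueError(f"unsupported format: {fmt!r}")
    return [
        "pdftoppm",
        "-r", str(dpi),        # rendering resolution in DPI
        FORMAT_FLAGS[fmt],     # output image format
        str(pdf_path),         # input PDF
        str(out_prefix),       # pdftoppm appends "-<page number>.<ext>"
    ]


def split_pdf(pdf_path, out_dir, dpi=300, fmt="png"):
    """Write one image per page into out_dir and return the created files."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    prefix = out_dir / "page"  # hypothetical prefix chosen for this sketch
    # check=True raises CalledProcessError if pdftoppm exits with an error.
    subprocess.run(build_pdftoppm_command(pdf_path, prefix, dpi, fmt), check=True)
    return sorted(out_dir.glob("page-*"))
```

Calling split_pdf("book.pdf", "pages") on a hypothetical book.pdf should leave one PNG per page in the pages directory, numbered by pdftoppm. Keeping the command construction in its own function makes it easy to inspect the exact command before running it on a large scan.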

