How to install OcrMyPdf on Windows

2023-04-06 - Instructions for installing the open source OCR program OcrMyPdf and its dependencies on Windows

1. About OcrMyPdf

OcrMyPdf is an excellent OCR program that turns image-only PDF files (usually the product of scanning) into PDF files with readable text, using the equally excellent open source Tesseract library.

While installing it on Linux is quite easy, since most of the major distros provide it as a package, installing it on Windows requires a few additional steps, since there's no installer available, either for OcrMyPdf or for all of its dependencies.

2. Prerequisites

Since I personally use scoop as a Windows package manager, I won't cover other options. If you don't already have scoop installed, installing it is as simple as opening Windows PowerShell and running the following:

> Set-ExecutionPolicy RemoteSigned -Scope CurrentUser # Optional: Needed to run a remote script the first time
> irm get.scoop.sh | iex

3. Installing OcrMyPdf

Once we have scoop installed, let's first install Python:

scoop install python

Then, we'll install OcrMyPdf using Python's pip package manager:

pip install ocrmypdf

This will install the ocrmypdf Python script, but you still won't be able to run it until you install its other dependencies:

scoop install tesseract
scoop install pngquant
scoop install ghostscript
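
To confirm that these tools actually ended up on your PATH, a small Python check can help. This is only a sketch: the executable names below (in particular gswin64c for Ghostscript on Windows) are assumptions on my part, and find_missing is a hypothetical helper, not something ocrmypdf provides.

```python
import shutil

# Executable names are assumptions; adjust them to match your installation.
REQUIRED_TOOLS = ["tesseract", "gswin64c", "pngquant"]


def find_missing(tools, which=shutil.which):
    """Return the tools that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    missing = find_missing(REQUIRED_TOOLS)
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All required tools were found.")
```

If something is reported as missing right after installing it, opening a new PowerShell window so that PATH is reloaded is usually the first thing to try.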

Now, you will be able to use ocrmypdf to OCR an existing file:

ocrmypdf file1.pdf output.pdf

However, if you try to OCR a document in a language other than English, it will fail, because you need the language definitions for that language.

4. Install tesseract support for other languages

In an ideal world, it would be enough to just install the tesseract-languages package:

scoop install tesseract-languages

However, at the time of writing this, the tesseract-languages scoop package is broken, so we will need to install the language files manually.

To install them manually, go to the Tesseract Fast (tessdata_fast) GitHub page, download the language data files for the languages you need, for example deu.traineddata for German or fra.traineddata for French, and put those files in the tessdata folder of your tesseract installation, usually ~/scoop/apps/tesseract/current/tessdata/
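
To double-check that the downloaded files landed in the right place, something like the following Python sketch can be used. It assumes the default scoop path mentioned above, and missing_languages is a hypothetical helper name of my own.

```python
from pathlib import Path


def missing_languages(tessdata_dir, languages):
    """Return the language codes that have no .traineddata file in tessdata_dir."""
    folder = Path(tessdata_dir)
    return [
        lang for lang in languages
        if not (folder / f"{lang}.traineddata").is_file()
    ]


if __name__ == "__main__":
    # Path taken from the text above; change it if your tesseract lives elsewhere.
    tessdata = Path.home() / "scoop" / "apps" / "tesseract" / "current" / "tessdata"
    print(missing_languages(tessdata, ["eng", "deu", "fra"]))
```

An empty list means every requested language file is present.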

Now, you should be able to specify the languages of the file when converting, like this:

ocrmypdf -l eng+deu+fra file1.pdf outfile.pdf

This command will try to recognize text in English, German or French and create a corresponding PDF file.
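
If you want to run the same command from a Python script, for example to process many files, one possible approach is to build the command line and hand it to subprocess. This is a sketch only: build_ocr_command and run_ocr are hypothetical helper names, and it assumes that ocrmypdf is on your PATH.

```python
import subprocess


def build_ocr_command(input_pdf, output_pdf, languages=("eng",)):
    """Build the ocrmypdf command line for the given files and languages."""
    # Languages are joined with "+", the same way as in "-l eng+deu+fra".
    return ["ocrmypdf", "-l", "+".join(languages), str(input_pdf), str(output_pdf)]


def run_ocr(input_pdf, output_pdf, languages=("eng",)):
    """Run ocrmypdf and raise CalledProcessError if it exits with an error."""
    subprocess.run(build_ocr_command(input_pdf, output_pdf, languages), check=True)


if __name__ == "__main__":
    # Only prints the command; call run_ocr(...) to actually start the OCR.
    print(build_ocr_command("file1.pdf", "outfile.pdf", ["eng", "deu", "fra"]))
```

Keeping the command construction separate from the call that runs it makes it easy to inspect what will be executed before starting a long OCR job.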

5. Misc

When running ocrmypdf, you may encounter an error saying that a file was not found. To find out exactly which file it is, run it with the -v (verbose) option, and it will report that jbig2.exe can't be found.

This is yet another dependency, which is required in some cases when optimization settings are specified. Unfortunately, it's not available in any package manager, so you either need to compile it from source yourself, or download the precompiled Windows version (that also works on Windows 11) from here.

6. Next steps

Now you should be able to use ocrmypdf to OCR PDF files directly on Windows. I hope that at some point it will be available in the scoop package manager, but until then, these instructions will do.

Keywords: ocr ocrmypdf pdf windows scoop tesseract