How to install OcrMyPdf on Windows

2023-04-06 - Instructions for installing the open source OCR program OcrMyPdf and its dependencies on Windows

1. About OcrMyPdf

OcrMyPdf is an excellent OCR program that turns image-only PDF files (usually the product of scanning) into PDF files with searchable text, using the excellent open source Tesseract library.

Installing it on Linux is quite easy, since most major distros provide it as a package. Installing it on Windows requires a few additional steps, because there is no single installer that covers both OcrMyPdf and all of its dependencies.

2. Prerequisites

Since I personally use scoop as my Windows package manager, I won't cover other options. If you don't already have scoop installed, installing it is as simple as opening Windows PowerShell and running the following:

Set-ExecutionPolicy RemoteSigned -Scope CurrentUser # Optional: Needed to run a remote script the first time
irm get.scoop.sh | iex

3. Installing OcrMyPdf

Once we have scoop installed, let's first install Python:

scoop install python

Then, we'll install OcrMyPdf using Python's pip package manager:

pip install ocrmypdf

This will install the ocrmypdf Python script, but you still won't be able to run it until you install other dependencies:

scoop install tesseract
scoop install pngquant
scoop install ghostscript
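
Before moving on, it can help to confirm that the external tools really ended up on your PATH. The following is a minimal sketch, not part of OcrMyPdf itself; the executable names (in particular gswin64c and gswin32c for Ghostscript, and pngquant as the image optimizer) are my assumptions about what the packages install, so adjust them if your setup differs.

```python
import shutil

# Assumed executable names; Ghostscript's name varies between builds.
REQUIRED_TOOLS = {
    "tesseract": ["tesseract"],
    "pngquant": ["pngquant"],
    "ghostscript": ["gswin64c", "gswin32c", "gs"],
}


def find_missing_tools(required=REQUIRED_TOOLS, which=shutil.which):
    """Return the names of tools for which no candidate executable is on PATH."""
    missing = []
    for name, candidates in required.items():
        # A tool counts as present if any one of its candidate names resolves.
        if not any(which(exe) for exe in candidates):
            missing.append(name)
    return missing


if __name__ == "__main__":
    missing = find_missing_tools()
    if missing:
        print("Missing from PATH:", ", ".join(missing))
    else:
        print("All external tools were found on PATH.")
```

If a tool is reported as missing right after installing it, opening a new PowerShell window so that PATH is reloaded is usually worth trying first.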

Now, you will be able to use ocrmypdf to OCR an existing file:

ocrmypdf file1.pdf output.pdf

However, if you try to use it with a language other than English, it will fail, because you need language definitions.
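
To see which languages your Tesseract installation currently knows about, you can ask Tesseract itself with its --list-langs option. Here is a small sketch that wraps that call; the helper names are mine, and the header line format it skips ("List of available languages ...") is an assumption based on recent Tesseract versions.

```python
import subprocess


def parse_list_langs(output):
    """Extract language codes from the text printed by `tesseract --list-langs`."""
    lines = [line.strip() for line in output.splitlines() if line.strip()]
    # Skip the header line, e.g. 'List of available languages in "..." (2):'.
    return [line for line in lines if not line.startswith("List of")]


def installed_languages():
    """Run tesseract and return the language codes it reports."""
    result = subprocess.run(
        ["tesseract", "--list-langs"],
        capture_output=True,
        text=True,
        check=True,
    )
    # Some versions print the list to stderr, so both streams are scanned.
    return parse_list_langs(result.stdout + result.stderr)


if __name__ == "__main__":
    print("Installed languages:", ", ".join(installed_languages()))
```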

4. Install tesseract support for other languages

In an ideal world, it would be enough to just install the tesseract-languages package:

scoop install tesseract-languages

However, at the time of writing this, the tesseract-languages scoop package is broken, so we will need to manually install it.

To install it manually, go to the Tesseract tessdata_fast repository on GitHub and download the language data files for the languages you need, for example deu.traineddata for German or fra.traineddata for French. Then put those files in the tessdata folder of your Tesseract installation, usually ~/scoop/apps/tesseract/current/tessdata/
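
If you need several languages, the manual download can be scripted. The sketch below is only an illustration: the raw-download URL pattern (including the main branch name) is my assumption about how the tessdata_fast repository is laid out, and the target folder assumes the default scoop location mentioned above.

```python
import urllib.request
from pathlib import Path

# Assumed raw-download URL pattern for the tessdata_fast repository.
BASE_URL = "https://raw.githubusercontent.com/tesseract-ocr/tessdata_fast/main"

# Assumed default scoop location; change this if scoop lives elsewhere.
TESSDATA_DIR = Path.home() / "scoop" / "apps" / "tesseract" / "current" / "tessdata"


def traineddata_url(lang):
    """Build the download URL for one language code, e.g. 'deu'."""
    return f"{BASE_URL}/{lang}.traineddata"


def download_language(lang, target_dir=TESSDATA_DIR):
    """Download <lang>.traineddata into target_dir and return the saved path."""
    target = Path(target_dir) / f"{lang}.traineddata"
    urllib.request.urlretrieve(traineddata_url(lang), target)
    return target


if __name__ == "__main__":
    for lang in ("deu", "fra"):
        print("Saved", download_language(lang))
```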

Now, you should be able to specify the language of the file when converting, like this:

ocrmypdf -l eng+deu+fra file1.pdf outfile.pdf

This command will recognize text in English, German, and French and write the resulting searchable PDF to outfile.pdf.
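
If you call ocrmypdf from your own scripts, the language list is easy to assemble programmatically, since the -l option simply takes the codes joined with a plus sign. A minimal sketch, where build_ocr_command is a hypothetical helper name of mine and not something OcrMyPdf provides:

```python
import subprocess


def build_ocr_command(input_pdf, output_pdf, languages=("eng",)):
    """Build the ocrmypdf argument list; language codes are joined with '+'."""
    if not languages:
        raise ValueError("at least one language code is required")
    return ["ocrmypdf", "-l", "+".join(languages), str(input_pdf), str(output_pdf)]


if __name__ == "__main__":
    cmd = build_ocr_command("file1.pdf", "outfile.pdf", ["eng", "deu", "fra"])
    # check=True raises CalledProcessError if ocrmypdf exits with an error.
    subprocess.run(cmd, check=True)
```

Passing the arguments as a list rather than one long string avoids quoting problems with file names that contain spaces.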

5. Misc

When running ocrmypdf, you may encounter an error saying "file not found." To find out which file is missing, you can run it with the -v (verbose) option, and it will report that the jbig2.exe file can't be found.

This is yet another dependency, required in some cases when optimization settings are specified. Unfortunately, it isn't available in any package manager, so you need to either compile it from source yourself or download the precompiled Windows version (which also works on Windows 11) from here.
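
Until jbig2 is installed, one possible workaround is to check for it and switch optimization off when it is missing. This is a sketch under an assumption I have not verified for every setup: that passing --optimize 0 to ocrmypdf avoids the jbig2 lookup. The helper name optimization_args is my own.

```python
import shutil


def optimization_args(which=shutil.which):
    """Return extra ocrmypdf arguments depending on whether jbig2 is on PATH."""
    if which("jbig2"):
        # jbig2 was found, so ocrmypdf's default optimization can stay on.
        return []
    # Assumption: turning optimization off avoids the jbig2 lookup.
    return ["--optimize", "0"]


if __name__ == "__main__":
    cmd = ["ocrmypdf", *optimization_args(), "file1.pdf", "output.pdf"]
    print(" ".join(cmd))
```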

6. Next steps

Now, you should be able to use ocrmypdf to OCR-convert PDF files in Windows directly. I hope at some point it will be available in the scoop package manager, but until then, these instructions will do.

Keywords: ocr ocrmypdf pdf windows scoop tesseract