How to install OcrMyPdf on Windows
1. About OcrMyPdf
OcrMyPdf is an excellent OCR program that turns image-only PDF files (usually the product of scanning) into PDF files with searchable text, using the open source Tesseract OCR engine.
While installing it on Linux is quite easy, since most major distros provide it as a package, installing it on Windows requires a few additional steps: there is no installer available for OcrMyPdf or for all of its dependencies.
2. Prerequisites
Since I personally use scoop as my Windows package manager, I won't cover other options. If you don't already have scoop installed, installing it is as simple as opening Windows PowerShell and running the following:
Set-ExecutionPolicy RemoteSigned -Scope CurrentUser # Optional: Needed to run a remote script the first time
irm get.scoop.sh | iex
3. Installing OcrMyPdf
Once we have scoop installed, let's first install Python:
scoop install python
Then, we'll install OcrMyPdf using Python's pip package manager:
pip install ocrmypdf
This will install the ocrmypdf Python script, but you still won't be able to run it until you install the other dependencies:
scoop install tesseract
scoop install pngquant
scoop install ghostscript
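Before running ocrmypdf for the first time, it can help to confirm that the external tools are actually reachable from your PATH. The following is a minimal sketch, not part of OcrMyPdf itself. The executable names are assumptions: pngquant is the image optimizer OcrMyPdf looks for, and Ghostscript's console binary on Windows is usually called gswin64c rather than gs.

```python
import shutil

# Tool name -> executable names to try (assumed; adjust for your setup).
REQUIRED_TOOLS = {
    "tesseract": ["tesseract"],
    "pngquant": ["pngquant"],
    "ghostscript": ["gswin64c", "gswin32c", "gs"],
}


def missing_tools(which=shutil.which):
    """Return the tools for which no candidate executable is found on PATH."""
    missing = []
    for name, candidates in REQUIRED_TOOLS.items():
        if not any(which(exe) for exe in candidates):
            missing.append(name)
    return missing


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All external tools found.")
```

If a tool is reported as missing right after installing it, opening a new PowerShell window so that the updated PATH is picked up is usually the first thing to try.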
Now, you will be able to use ocrmypdf to OCR an existing file:
ocrmypdf file1.pdf output.pdf
However, if you try to use it with a language other than English, it will fail, because Tesseract needs the language data files for that language.
4. Install tesseract support for other languages
In an ideal world, it would be enough to just install the tesseract-languages package:
scoop install tesseract-languages
However, at the time of writing this, the tesseract-languages scoop package is broken, so we will need to install the language files manually.
To do that, you can go to the Tesseract Fast GitHub page, download the language data files for the languages you need (for example, deu.traineddata for German or fra.traineddata for French), and put those files in the tessdata folder of your Tesseract installation, usually ~/scoop/apps/tesseract/current/tessdata/
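If you need several languages, the manual download can be scripted. The sketch below is only an illustration under a few assumptions: the raw-download URL pattern for the tessdata_fast repository (including the main branch name) is my guess at the current layout, the target folder is the scoop path mentioned above, and the function names are made up for this example.

```python
from pathlib import Path
from urllib.request import urlretrieve

# Assumed raw-download URL pattern for the tessdata_fast repository.
BASE_URL = "https://github.com/tesseract-ocr/tessdata_fast/raw/main"
# Folder from the text above; adjust if Tesseract lives elsewhere.
TESSDATA_DIR = Path.home() / "scoop" / "apps" / "tesseract" / "current" / "tessdata"


def traineddata_url(lang):
    """Build the download URL for one language code, e.g. 'deu'."""
    return f"{BASE_URL}/{lang}.traineddata"


def download_languages(langs, dest=TESSDATA_DIR, fetch=urlretrieve):
    """Download each missing <lang>.traineddata into dest.

    Files that already exist are skipped. Returns the paths that were fetched.
    """
    dest = Path(dest)
    dest.mkdir(parents=True, exist_ok=True)
    fetched = []
    for lang in langs:
        target = dest / f"{lang}.traineddata"
        if target.exists():
            continue
        fetch(traineddata_url(lang), str(target))
        fetched.append(target)
    return fetched
```

Calling download_languages(["deu", "fra"]) would then fetch the German and French files into the tessdata folder, skipping any that are already there.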
Now, you should be able to specify the language of the file when converting, like this:
ocrmypdf -l eng+deu+fra file1.pdf outfile.pdf
This command will now try to detect text in English, German, or French and create a corresponding PDF file.
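If you convert files often, a small wrapper can check that the language files are in place before ocrmypdf is started, which gives a clearer error than a failed run. This is a rough sketch: it assumes all language data sits in the scoop tessdata folder mentioned earlier, and the helper names are hypothetical. The only ocrmypdf option it uses is -l, exactly as in the command above.

```python
import subprocess
from pathlib import Path

# Assumed location of the language files (scoop default from the text above).
TESSDATA_DIR = Path.home() / "scoop" / "apps" / "tesseract" / "current" / "tessdata"


def missing_languages(langs, tessdata=TESSDATA_DIR):
    """Return the language codes that have no .traineddata file in tessdata."""
    return [
        lang for lang in langs
        if not (Path(tessdata) / f"{lang}.traineddata").is_file()
    ]


def build_command(input_pdf, output_pdf, langs):
    """Build the ocrmypdf command line, joining language codes with '+'."""
    return ["ocrmypdf", "-l", "+".join(langs), str(input_pdf), str(output_pdf)]


def run_ocr(input_pdf, output_pdf, langs, tessdata=TESSDATA_DIR):
    """Run ocrmypdf, but only if every requested language file is present."""
    missing = missing_languages(langs, tessdata)
    if missing:
        raise FileNotFoundError("No language data for: " + ", ".join(missing))
    return subprocess.run(build_command(input_pdf, output_pdf, langs), check=True)
```

For example, run_ocr("file1.pdf", "outfile.pdf", ["eng", "deu", "fra"]) runs the same command as shown above once all three language files are found.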
5. Misc
When running ocrmypdf, you may encounter an error saying "file not found." To find out exactly what it is about, you can run it with the -v (verbose) option, and it will report that the jbig2.exe file can't be found.
This is yet another dependency, required in some cases when optimization settings are specified. Unfortunately, it's not available in any package manager, so you either need to compile it from source yourself or download the precompiled Windows version (that also works on Windows 11) from here.
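Until jbig2 is installed, one cautious workaround is to switch optimization off whenever the executable can't be found. The sketch below assumes, as described above, that the error only shows up when optimization is enabled. The function name is made up for illustration, and --optimize is ocrmypdf's option for choosing the optimization level, with 0 meaning no optimization.

```python
import shutil


def optimize_args(requested_level, which=shutil.which):
    """Return ocrmypdf arguments for the optimization level.

    Falls back to '--optimize 0' when optimization was requested
    but no jbig2 executable is found on PATH.
    """
    if requested_level > 0 and which("jbig2") is None:
        return ["--optimize", "0"]
    return ["--optimize", str(requested_level)]


if __name__ == "__main__":
    # Print the arguments that would be passed on this machine.
    print(" ".join(optimize_args(2)))
```

The returned list can be appended to the ocrmypdf command line in front of the input and output file names.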
6. Next steps
Now, you should be able to use ocrmypdf to OCR-convert PDF files in Windows directly. I hope at some point it will be available in the scoop package manager, but until then, these instructions will do.