How to install OcrMyPdf on Windows
1. About OcrMyPdf
OcrMyPdf is an excellent OCR program that creates PDF files with readable text out of image-only PDF files (usually the product of scanning), using the excellent open source Tesseract library.
While installing it on Linux is quite easy, since most of the major distros provide it as a package, installing it on Windows requires a few additional steps, since there's no installer available for OcrMyPdf or for all of its dependencies.
2. Prerequisites
Since I personally use scoop as a Windows package manager, I won't cover other options. If you don't already have scoop installed, installing it is as simple as opening Windows PowerShell and running the following:
> Set-ExecutionPolicy RemoteSigned -Scope CurrentUser # Optional: Needed to run a remote script the first time
> irm get.scoop.sh | iex
3. Installing OcrMyPdf
Once we have scoop installed, let's first install Python:
scoop install python
Then, we'll install OcrMyPdf using Python's pip package manager:
pip install ocrmypdf
This will install the ocrmypdf Python script, but you still won't be able to run it until you install the other dependencies:
scoop install tesseract
scoop install pngquant
scoop install ghostscript
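If you want to confirm that the dependencies actually ended up on your PATH, a small Python check can help. This is only a minimal sketch: the helper name missing_tools is my own, and the executable names you pass in are assumptions about what scoop exposes (for example, the Ghostscript console binary on Windows is usually called gswin64c).

```python
import shutil
from typing import Callable, Iterable, List, Optional


def missing_tools(
    tools: Iterable[str],
    which: Callable[[str], Optional[str]] = shutil.which,
) -> List[str]:
    """Return the names from `tools` that cannot be found on PATH.

    `which` is injectable so the lookup can be replaced in tests;
    by default it is the standard library's shutil.which.
    """
    return [tool for tool in tools if which(tool) is None]
```

Calling missing_tools(["ocrmypdf", "tesseract", "pngquant", "gswin64c"]) returns the names that were not found; an empty list means every listed executable is reachable.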
Now you will be able to use ocrmypdf to OCR an existing file:
ocrmypdf file1.pdf output.pdf
However, if you try to use it with a language other than English, it will fail, because you need the language definitions for that language.
4. Install tesseract support for other languages
In an ideal world, it would be enough to just install the tesseract-languages package:
scoop install tesseract-languages
However, at the time of writing, the tesseract-languages scoop package is broken, so we will need to install the language data manually.
To install it manually, go to the Tesseract Fast GitHub page, download the language data files for the languages you need, for example deu.traineddata for German or fra.traineddata for French, and put those files in your tesseract installation folder, usually ~/scoop/apps/tesseract/current/tessdata/
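If you need several languages, the manual download can be scripted. The following is a rough sketch, not a tested installer: BASE_URL is my assumption about where the raw files live in the tessdata_fast repository, and the function names are hypothetical. The default folder is the scoop path mentioned above.

```python
from pathlib import Path
from urllib.request import urlretrieve

# Assumption: raw .traineddata files are served from this location.
BASE_URL = "https://github.com/tesseract-ocr/tessdata_fast/raw/main"


def traineddata_url(lang: str) -> str:
    """Build the (assumed) download URL for one language code, e.g. 'deu'."""
    return f"{BASE_URL}/{lang}.traineddata"


def default_tessdata_dir() -> Path:
    """The usual tessdata folder of a scoop-installed tesseract."""
    return Path.home() / "scoop" / "apps" / "tesseract" / "current" / "tessdata"


def download_language(lang: str, tessdata_dir: Path) -> Path:
    """Download one language file into tessdata_dir unless it already exists."""
    target = tessdata_dir / f"{lang}.traineddata"
    if not target.exists():
        urlretrieve(traineddata_url(lang), str(target))
    return target
```

For German and French you would call download_language("deu", default_tessdata_dir()) and download_language("fra", default_tessdata_dir()). Files that are already present are left alone.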
Now you should be able to specify the languages of the file when converting, like this:
ocrmypdf -l eng+deu+fra file1.pdf outfile.pdf
This command will now try to detect text in English, German or French and create a corresponding PDF file.
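If you run this often, it may be worth checking that every requested language file is really in the tessdata folder before starting a conversion. Here is a minimal sketch of that idea; both helper names are my own, and the check simply assumes that each language code maps to a file named after it with the .traineddata extension.

```python
from pathlib import Path
from typing import List, Sequence


def missing_languages(languages: Sequence[str], tessdata_dir: Path) -> List[str]:
    """Return the language codes that have no .traineddata file in tessdata_dir."""
    return [
        lang
        for lang in languages
        if not (tessdata_dir / f"{lang}.traineddata").is_file()
    ]


def build_command(input_pdf: str, output_pdf: str, languages: Sequence[str]) -> List[str]:
    """Build the ocrmypdf command line, joining language codes with '+'."""
    return ["ocrmypdf", "-l", "+".join(languages), input_pdf, output_pdf]
```

The list returned by build_command can be passed to subprocess.run(..., check=True) once missing_languages returns an empty list for your tessdata folder.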
5. Misc
When running ocrmypdf you may encounter an error saying that a file was not found. To find out exactly what it is about, you can run it with the -v (verbose) option, and it will report that the jbig2.exe file can't be found.
This is yet another dependency, which is required in some cases if optimization settings are specified. Unfortunately, it's not available in any package manager, so you either need to compile it from source yourself, or download the precompiled Windows version (that also works on Windows 11) from here.
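Until you have jbig2.exe in place, one possible workaround is to detect whether it is on PATH and turn optimization off when it is not. This is only a sketch under an assumption I have not verified against this exact error: that running ocrmypdf with --optimize 0 (its "no optimization" level) avoids the jbig2 lookup. The helper name optimization_args is hypothetical.

```python
import shutil
from typing import Callable, List, Optional


def optimization_args(
    which: Callable[[str], Optional[str]] = shutil.which,
) -> List[str]:
    """Return extra ocrmypdf arguments depending on whether jbig2 is on PATH.

    If jbig2 is missing, ask ocrmypdf to skip optimization entirely;
    otherwise add nothing and keep its default behaviour.
    """
    if which("jbig2") is None:
        return ["--optimize", "0"]
    return []
```

The returned list can be appended to the rest of the ocrmypdf arguments, so the same script works both before and after jbig2 is installed.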
6. Next steps
Now you should be able to use ocrmypdf to OCR PDF files directly in Windows. I hope that at some point it will be available in the scoop package manager, but until then, these instructions will do.