OCR with Tesseract

If both tesseract and pdftoppm (generally from the poppler-utils package) are installed, the PDF handler may attempt OCR on PDF files with no text content. This is controlled by the pdfocr configuration variable, which is false by default because OCR is very slow.
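
The precondition above can be checked outside Recoll before enabling pdfocr. The following is a minimal sketch, not Recoll's own code: the function name ocr_tools_available and its which parameter are hypothetical, and it only tests whether the two commands can be found on the PATH.

```python
import shutil


def ocr_tools_available(which=shutil.which):
    """Return True only if both external commands are found on the PATH.

    The `which` parameter is a hypothetical hook for testing; by default
    the standard library lookup is used.
    """
    return all(which(tool) is not None for tool in ("tesseract", "pdftoppm"))


if __name__ == "__main__":
    if ocr_tools_available():
        print("tesseract and pdftoppm found: PDF OCR can be enabled")
    else:
        print("tesseract or pdftoppm missing: PDF OCR cannot run")
```

Finding both commands is necessary but not sufficient: OCR still only happens when the pdfocr variable is set to true in the configuration.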

The choice of language is very important for successful OCR. Recoll currently has no way to determine the language from the document itself. You can set it through the contents of a .ocrpdflang text file in the same directory as the PDF document, through the RECOLL_TESSERACT_LANG environment variable, or through the contents of an ocrpdf text file inside the configuration directory. If none of these is set, Recoll will try to guess the language from the NLS environment.
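
The lookup described above can be sketched as follows. This is an illustration under stated assumptions, not Recoll's implementation: it assumes the sources are consulted in the order they are listed, that the NLS guess is based on the LANG variable, and that a small locale-to-Tesseract table (LOCALE_TO_TESSERACT, invented here) is enough for the guess. The names resolve_ocr_lang and _read_lang_file are hypothetical.

```python
import os
from pathlib import Path

# Illustrative mapping from two-letter locale codes to Tesseract language
# codes. This table is an assumption made for the example.
LOCALE_TO_TESSERACT = {"en": "eng", "fr": "fra", "de": "deu"}


def _read_lang_file(path):
    """Return the stripped contents of path, or "" if it cannot be read."""
    try:
        return Path(path).read_text().strip()
    except OSError:
        return ""


def resolve_ocr_lang(pdf_path, config_dir, environ=None):
    """Pick a Tesseract language code; "" means nothing could be determined."""
    environ = os.environ if environ is None else environ

    # 1. .ocrpdflang file in the same directory as the PDF document.
    lang = _read_lang_file(Path(pdf_path).parent / ".ocrpdflang")
    if lang:
        return lang

    # 2. RECOLL_TESSERACT_LANG environment variable.
    lang = environ.get("RECOLL_TESSERACT_LANG", "").strip()
    if lang:
        return lang

    # 3. ocrpdf file inside the configuration directory.
    lang = _read_lang_file(Path(config_dir) / "ocrpdf")
    if lang:
        return lang

    # 4. Guess from the NLS environment (assumed: LANG, e.g. "fr_FR.UTF-8").
    locale_code = environ.get("LANG", "").split("_")[0].split(".")[0].lower()
    return LOCALE_TO_TESSERACT.get(locale_code, "")
```

For example, placing a .ocrpdflang file containing the single word deu next to a set of German scans would select German for those documents only, whatever the environment or the configuration directory says.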