Recoll and OCR

This is new in Recoll 1.26.5. Older versions had a more limited, non-caching capability to execute an external OCR program in the PDF handler. The new function has the following features:

  • The OCR output is cached, stored as separate files. The caching is ultimately based on a hash value of the original file contents, so that it is immune to file renames. A first path-based layer ensures fast operation for unchanged (unmoved) files, and the data hash (which is still orders of magnitude faster than OCR) is only recomputed if the file has moved. OCR is only performed if the file was not previously processed or if it has changed.

  • The support for a specific program is implemented in a simple Python module. It should be straightforward to add support for any OCR engine with a capability to run from the command line.

  • Modules initially exist for tesseract (Linux and Windows), and ABBYY FineReader (Linux, tested with version 11). ABBYY FineReader is a commercial closed-source program, but it sometimes performs better than tesseract.

  • The OCR is currently only called from the PDF handler, but there should be no problem using it for other image types.
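
The two-layer caching described in the first item above can be illustrated with a minimal Python sketch. This is not Recoll's actual code: the OcrCache class, its method names and the ocr_func callable are hypothetical, the cache lives in in-memory dictionaries rather than in separate files, and SHA-256 and the size/modification-time signature are assumed choices (the text does not say which hash or which change test Recoll uses).

```python
import hashlib
import os


class OcrCache:
    """Illustrative two-layer OCR cache (hypothetical, not Recoll's code).

    Layer 1 maps a path to a cheap file signature and a content digest.
    Layer 2 maps the content digest to the OCR text.
    """

    def __init__(self, ocr_func):
        self.ocr_func = ocr_func  # assumed callable: path -> recognized text
        self.by_path = {}         # path -> (size, mtime_ns, digest)
        self.by_hash = {}         # digest -> OCR text
        self.ocr_runs = 0         # counts real OCR invocations

    @staticmethod
    def _digest(path):
        # Hash the file contents in blocks, so large files are not
        # loaded into memory at once.
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for block in iter(lambda: f.read(1 << 20), b""):
                h.update(block)
        return h.hexdigest()

    def get_text(self, path):
        st = os.stat(path)
        sig = (st.st_size, st.st_mtime_ns)
        entry = self.by_path.get(path)
        if entry is not None and entry[:2] == sig:
            # Fast path: same path, same signature, so reuse the digest.
            digest = entry[2]
        else:
            # Unknown path (new or moved file) or changed signature:
            # recompute the content hash.
            digest = self._digest(path)
            self.by_path[path] = sig + (digest,)
        if digest not in self.by_hash:
            # Contents never seen before: this is the only place OCR runs.
            self.by_hash[digest] = self.ocr_func(path)
            self.ocr_runs += 1
        return self.by_hash[digest]
```

With this structure, asking for a file under a new name after a rename costs one hash computation but no OCR run, while rewriting the file contents produces a new digest and triggers OCR again.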

To enable this feature, you need to install one of the supported OCR applications (tesseract or ABBYY FineReader), enable OCR in the PDF handler, and tell Recoll where the appropriate command resides. The last two steps are done by setting configuration variables. See the relevant section. All parameters can be localized in subdirectories through the usual main configuration mechanism (path sections).