Parameters for the PDF input script


Attempt OCR on PDF files that have no text content. This requires both tesseract and pdftoppm to be installed. The default is off because OCR is very slow.


Language to assume for PDF OCR. Setting this correctly is very important for keeping the tesseract error rate reasonable. This can also be set through a configuration variable or directory-local parameters. See the PDF input script for details.


Enable PDF attachment extraction by executing pdftk (if available). This is normally disabled, because it slows down PDF indexing a bit even if no attachment is ever found.


Extract text from selected XMP metadata tags. This is a space-separated list of qualified XMP tag names. Each element can also include a translation to a Recoll field name, separated from the tag name by a '|' character. If the translation is absent, the tag name is used as the Recoll field name. You will also need to add specifications to the "fields" file to direct processing of the extracted data.
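
As an illustration of the list format only, the following Python sketch shows how such a value could be split into tag-to-field pairs. It is not the code used by the PDF input script. The function name parse_xmp_tag_list and the sample tag and field names are hypothetical.

```python
def parse_xmp_tag_list(value):
    """Map each qualified XMP tag name to the field name it should feed.

    Hypothetical helper: it only mirrors the format described above
    (space-separated elements, optional '|' translation).
    """
    mapping = {}
    for element in value.split():
        # "tag|field" carries a translation; a bare "tag" does not.
        tag, sep, field = element.partition("|")
        # Without a translation, the tag name doubles as the field name.
        mapping[tag] = field if (sep and field) else tag
    return mapping


if __name__ == "__main__":
    # Sample value with one untranslated and one translated tag (made up).
    print(parse_xmp_tag_list("bibtex:pages dc:identifier|url"))
```

With the sample value, the first tag keeps its own name as the field name, while the second is stored under the translated name.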


Define name of XMP field editing script. This defines the name of a script to be loaded for editing XMP field values. The script should define a 'MetaFixer' class with a metafix() method, which will be called with the qualified tag name and value of each selected field, for editing or erasing. A new instance is created for each document, so that the object can keep state, for example to eliminate duplicate values.
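
A minimal sketch of such a script follows. It uses the per-document instance state to drop repeated values. Two points are assumptions rather than facts taken from the description above: that metafix() returns the edited value, and that returning an empty string is how a value gets erased. Check the PDF input script for the actual convention before relying on this.

```python
class MetaFixer(object):
    """Sketch of an XMP field editing class (see assumptions above)."""

    def __init__(self):
        # One instance is created per document, so this set only holds
        # the (tag, value) pairs already seen in the current document.
        self.seen = set()

    def metafix(self, tagname, value):
        # Normalize surrounding whitespace before comparing values.
        value = value.strip()
        key = (tagname, value)
        if key in self.seen:
            # Assumption: an empty return value erases the duplicate.
            return ""
        self.seen.add(key)
        # Assumption: the returned string replaces the original value.
        return value
```

Because the set of seen values lives on the instance, duplicates are only detected within a single document and nothing carries over from one document to the next.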