Recoll input handlers

In the end, Recoll indexes plain UTF-8 text, remembering where it came from.

But of course, this is not what the source data looks like. The text content of the original documents is encoded in many formats (e.g. pdf, ms-word, html, etc.), and it can also be stored in quite involved ways (inside archives, email attachments …).

To get to the data and convert it to plain text, Recoll uses a set of modules which it calls input handlers (or filters). These operate either on the storage structure (e.g. a zip handler), on the storage format (e.g. a pdf to text translator), or on both. In addition, there is a tentative notion of a higher-level storage backend, which we will ignore for now (for reference, there are currently two of those: the file system and the web history cache).

The basic task of a filter is to take a document as input and produce a series of subdocuments as output. The format of the subdocuments is defined either dynamically (as part of the output data) or statically (in the filter definition).

Simple filters

These are executed by the mh_exec Recoll module. They are the vast majority.

These filters are very simple. They are designed to perform a simple task with a minimal interface: they mostly don’t know anything about each other, and they don’t know much about their context. This makes writing a filter quite easy, as there is not much to learn about the environment.

Only one output document is produced and the format is fixed.

In practice, the filter, which is most often a shell script (but could be any executable program), takes a file name on the command line, outputs an html or plain text document on standard output, then exits.

For example, the pdf filter takes one pdf file name as input on the command line and produces one html document on stdout. The fact that the output is html is statically defined in a configuration file.
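The calling convention described above can be sketched as follows. This is a minimal illustration and not one of the filters shipped with Recoll: the names to_html and main are hypothetical, and the sketch assumes the input is already UTF-8 text, where a real filter would run a format-specific converter.

```python
#!/usr/bin/env python3
"""Hypothetical simple filter: one file name in, one html document out."""
import html
import sys


def to_html(text):
    """Wrap already-decoded text in a minimal html document."""
    body = html.escape(text)
    return (
        "<html><head>"
        '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">'
        "</head><body><pre>" + body + "</pre></body></html>\n"
    )


def main(path, out=sys.stdout):
    """Convert the file at `path` and write the single result to `out`."""
    # Assumption: the input is UTF-8 text. A real filter would call a
    # format-specific converter here instead of reading the file as-is.
    with open(path, encoding="utf-8", errors="replace") as f:
        out.write(to_html(f.read()))
    return 0


# Run only when invoked as a program with exactly one file name argument.
if __name__ == "__main__" and len(sys.argv) == 2:
    sys.exit(main(sys.argv[1]))
```

The program writes exactly one document and exits, so it carries no state from one file to the next.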

For filters which produce plain text, the output character set is in general defined in the configuration file. Otherwise it is obtained from the locale (hoping that it makes sense).

Filters that output html can produce metadata in the html header (e.g. author). Filters that output plain text can only output the main text, with no metadata fields.
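As an illustration of metadata carried in the html header, the sketch below builds a document whose head holds a title and meta fields. The function name html_with_metadata is hypothetical, and which field names the indexer actually recognizes is not stated here; "author" is the one mentioned above, the rest is an assumption.

```python
import html


def html_with_metadata(body_text, metadata):
    """Return an html document whose <head> carries the given metadata."""
    head = ['<meta http-equiv="Content-Type" content="text/html; charset=utf-8">']
    title = metadata.get("title")
    if title:
        head.append("<title>%s</title>" % html.escape(title))
    # Every other non-empty field becomes a <meta name=... content=...> tag.
    # Assumption: field names are passed through unchanged.
    for name, value in sorted(metadata.items()):
        if name == "title" or not value:
            continue
        head.append(
            '<meta name="%s" content="%s">' % (html.escape(name), html.escape(value))
        )
    return "<html><head>%s</head><body><pre>%s</pre></body></html>\n" % (
        "".join(head),
        html.escape(body_text),
    )
```

A plain text filter has no equivalent place to put such fields, which is why it can only return the main text.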

Besides the file name, there is one other piece of input information, passed as an environment variable, which can be safely ignored: RECOLL_FILTER_FORPREVIEW. It indicates whether the filter is being used for previewing or for indexing. Some filters elect to suppress repetitive parts of the output text when indexing, to avoid distorting the term statistics. For example, the man filter suppresses the section headers (NAME, SYNOPSIS…) when indexing.
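A filter that wants to honour this variable could do something like the sketch below. It assumes that the value "yes" means previewing, which is not stated above and should be checked against an existing filter. The helper names are hypothetical, and only the two section headers named above are listed.

```python
import os
import re

# Only the two headers named in the text; a real filter would list more.
SECTION_HEADER = re.compile(r"^(NAME|SYNOPSIS)\s*$")


def for_preview(environ=os.environ):
    """True if the filter is being run to preview a document."""
    # Assumption: the variable is set to "yes" when previewing.
    return environ.get("RECOLL_FILTER_FORPREVIEW", "").lower() == "yes"


def filter_lines(lines, preview):
    """Drop repetitive section headers when indexing, keep them for preview."""
    if preview:
        return list(lines)
    return [line for line in lines if not SECTION_HEADER.match(line)]
```

A filter that ignores the variable still works; it simply emits the same text in both cases.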

Multiple input filters

These filters are more complex, but still quite easy to write, especially if you can use Python, because they can then use a common module which manages the communication with the indexer.

Newer Recoll versions have converted many previously 'simple' filters to this kind as part of the port to Windows.

These filters are executed by the mh_execm Recoll module.

They are persistent (one instance persists through a whole indexing pass) and process multiple input files in succession, the point being to avoid the startup performance penalty. They can also produce multiple documents per input file if this makes sense for the input format (e.g. zip archive, chm help file).

They use a simple communication protocol over a pipe with the main recoll or recollindex process, with file names and a few other parameters being sent as input, and decoded data and attributes being sent in return.
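The general shape of such a persistent filter is sketched below. The wire format used here (name and length on one line, then the data, with an empty line ending each message) is an illustrative assumption and not a specification of Recoll's actual protocol. The field names Filename, Document and Mimetype, and the functions read_message, write_message and serve, are likewise hypothetical. A real filter would rely on the shared Python module instead of reimplementing this.

```python
import os


def read_message(stream):
    """Read one message made of 'Name: <length>' lines, each followed by data.

    An empty line ends the message. Returns a dict keyed by lower-cased
    name, or None when the other end has closed the pipe.
    """
    params = {}
    while True:
        line = stream.readline()
        if not line:
            return None
        if line == b"\n":
            return params
        name, _, length = line.decode("ascii").partition(":")
        params[name.strip().lower()] = stream.read(int(length))


def write_message(stream, params):
    """Write one message in the same hypothetical framing."""
    for name, data in params.items():
        stream.write(("%s: %d\n" % (name, len(data))).encode("ascii"))
        stream.write(data)
    stream.write(b"\n")
    stream.flush()


def serve(inp, out, convert):
    """Answer requests until the input closes: one process, many files."""
    while True:
        request = read_message(inp)
        if request is None:
            break
        if "filename" not in request:
            continue  # nothing to convert in this message
        text = convert(os.fsdecode(request["filename"]))
        write_message(
            out,
            {"Document": text.encode("utf-8"), "Mimetype": b"text/plain"},
        )
```

In a real process, inp and out would be the binary standard input and output (sys.stdin.buffer and sys.stdout.buffer), and the loop would end when the indexer closes the pipe.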

The shared Python module is 'filters/rclexecm.py'. You can look at 'rclzip' or 'rclaudio' for reasonably straightforward examples.