Creating Python external indexers

The update API can be used to create an index from data which is not accessible to the regular Recoll indexer, or structured in a way which presents difficulties to the Recoll input handlers.

An indexer created using this API has to do the same work as the Recoll file system indexer: look for modified documents, extract their text, call the API to index it, and purge the index of documents which no longer exist.
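The four duties listed above can be sketched as a single indexing pass. This is only an illustration: FakeIndex is a hypothetical in-memory stand-in for a writable index, and its method names (need_update, add_or_update, purge), the content-hash signature and the "note:N" identifiers are assumptions made for the example, not the recoll module's actual interface.

```python
import hashlib


class FakeIndex:
    """Hypothetical in-memory stand-in for a writable index (not the recoll API)."""

    def __init__(self):
        self.docs = {}     # unique document id -> (signature, text)
        self.seen = set()  # ids confirmed to exist during the current pass

    def need_update(self, udi, sig):
        # Record that the document still exists, then report whether it changed.
        self.seen.add(udi)
        return udi not in self.docs or self.docs[udi][0] != sig

    def add_or_update(self, udi, sig, text):
        self.docs[udi] = (sig, text)

    def purge(self):
        # Drop every entry that was not seen during this pass.
        stale = [udi for udi in self.docs if udi not in self.seen]
        for udi in stale:
            del self.docs[udi]
        self.seen.clear()
        return stale


def signature(raw):
    # Any cheap change detector would do; a content hash is used here.
    return hashlib.sha256(raw).hexdigest()


def run_pass(index, source):
    """Run one indexing pass over `source`, a dict of document id -> raw bytes."""
    updated = []
    for udi, raw in source.items():
        sig = signature(raw)
        if index.need_update(udi, sig):                   # look for modified documents
            text = raw.decode("utf-8", errors="replace")  # stands in for text extraction
            index.add_or_update(udi, sig, text)           # hand the text to the index
            updated.append(udi)
    purged = index.purge()                                # remove vanished documents
    return updated, purged


index = FakeIndex()
source = {"note:1": b"first note", "note:2": b"second note"}
first = run_pass(index, source)

# Second pass: one document was edited and one was deleted at the source.
source["note:1"] = b"first note, edited"
del source["note:2"]
second = run_pass(index, source)
```

In this sketch the first pass indexes both documents, and the second pass re-indexes only the edited one and purges the deleted one. A real indexer would replace FakeIndex with the index object provided by the update API.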

The index data from such an external indexer should be stored in an index separate from any used by the Recoll internal file system indexer. The reason is that the purge pass of the file system indexer (removal of deleted documents) would also remove all the documents belonging to the external indexer, as they were not seen during the file system walk. Conversely, the external indexer purge pass would delete all the regular document entries.
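The purge conflict can be shown with a toy model. The dictionaries below merely stand in for indexes, and the purge function and the document identifiers are hypothetical; the only assumption carried over from the text is that a purge pass removes every entry not seen during the current pass.

```python
def purge(index, seen):
    """Remove every entry whose id was not seen during this pass."""
    removed = sorted(udi for udi in index if udi not in seen)
    for udi in removed:
        del index[udi]
    return removed


# Shared index: the file system walk only sees its own documents,
# so its purge pass also removes the external indexer's entry.
shared = {"/home/me/report.txt": "file system", "mail:42": "external"}
removed_by_fs = purge(shared, seen={"/home/me/report.txt"})

# Separate indexes: each purge pass only ever touches its own entries.
separate_fs = {"/home/me/report.txt": "file system"}
separate_ext = {"mail:42": "external"}
purge(separate_fs, seen={"/home/me/report.txt"})
purge(separate_ext, seen={"mail:42"})
```

With the shared index, the external entry "mail:42" is lost after the file system purge; with separate indexes, both entries survive their respective purge passes.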

While there would be ways to enable multiple foreign indexers to cooperate on a single index, it is simpler to use separate indexes and, if needed, to query them together using the multiple index access capabilities of the query interface.

There are two parts in the update interface:

  • Methods inside the recoll module allow the foreign indexer to insert data into the index, to make it accessible by the normal query interface.

  • An interface based on script execution is defined to allow either the GUI or the rclextract module to access original document data for previewing or editing.