Creating Python external indexers

The update API can be used to create an index from data which is not accessible to the regular Recoll indexer, or which is structured in a way that presents difficulties to the Recoll input handlers.

An indexer created using this API has the same work to do as the Recoll file system indexer: look for modified documents, extract their text, call the API to index it, and purge the index of data from documents which no longer exist in the document store.
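The four tasks listed above can be sketched as a single indexing pass. This is only an illustration under stated assumptions: ToyIndex is a hypothetical in-memory stand-in and not the Recoll API, and the names need_update, add_or_update, purge, signature, extract_text and run_pass are all invented for the example. A real external indexer would call the corresponding methods of the recoll module instead.

```python
import hashlib


class ToyIndex:
    """Hypothetical in-memory stand-in for an index (NOT the Recoll API)."""

    def __init__(self):
        self.docs = {}     # document id -> (signature, extracted text)
        self.seen = set()  # ids encountered during the current pass

    def need_update(self, doc_id, sig):
        # Record that the document still exists, then report whether it
        # is new or its stored signature differs from the current one.
        self.seen.add(doc_id)
        stored = self.docs.get(doc_id)
        return stored is None or stored[0] != sig

    def add_or_update(self, doc_id, sig, text):
        self.docs[doc_id] = (sig, text)

    def purge(self):
        # Remove every indexed document that was not seen in this pass.
        stale = sorted(d for d in self.docs if d not in self.seen)
        for doc_id in stale:
            del self.docs[doc_id]
        self.seen.clear()
        return stale


def signature(raw):
    # Assumption: a content hash is used to detect modifications.
    return hashlib.sha256(raw).hexdigest()


def extract_text(raw):
    # Placeholder extraction: the stored bytes are taken to be UTF-8 text.
    return raw.decode("utf-8", errors="replace")


def run_pass(index, store):
    """One pass: find modified documents, index them, purge the rest."""
    updated = []
    for doc_id, raw in store.items():
        sig = signature(raw)
        if index.need_update(doc_id, sig):
            index.add_or_update(doc_id, sig, extract_text(raw))
            updated.append(doc_id)
    removed = index.purge()
    return sorted(updated), removed


# Usage: a first pass indexes everything; after one edit and one
# deletion, the second pass updates "a" and purges "b".
index = ToyIndex()
store = {"a": b"first", "b": b"second"}
first = run_pass(index, store)
store["a"] = b"first, edited"
del store["b"]
second = run_pass(index, store)
```

The purge step removes anything not seen during the pass, which is why two indexers that each see only their own documents cannot safely share one index without extra coordination.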

The data for such an external indexer should be stored in an index separate from any used by the Recoll internal file system indexer. The reason is that the main document indexer's purge pass (removal of deleted documents) would also remove all the documents belonging to the external indexer, as they were not seen during the file system walk. Conversely, the main indexer's documents would probably be a problem for the external indexer's own purge operation.

While there would be ways to enable multiple foreign indexers to cooperate on a single index, it is simpler to use separate indexes, and to use the multiple index access capabilities of the query interface if needed.

There are two parts in the update interface:

  • Methods inside the recoll module allow inserting data into the index, to make it accessible by the normal query interface.

  • An interface based on script execution is defined to allow either the GUI or the rclextract module to access the original document data for previewing or editing.