The rclextract module

Prior to Recoll 1.25, index queries could not return document content because it was not stored. More recent versions usually store the document text, which can optionally be retrieved when running a query (see query.execute() above; the result is always plain text).

The rclextract module gives access to the original document and to the document text content (if it is not stored by the index, or to obtain an HTML version of the text). Accessing the original document is particularly useful if it is embedded (e.g. an email attachment).

You need to import the recoll module before the rclextract module.

The Extractor class

An Extractor object is built from a Doc object, output from a query.

Extractor.textextract(ipath)

Extracts the document defined by ipath and returns a Doc object. The doc.text field holds the document text, converted to either text/plain or text/html, as indicated by doc.mimetype. Typical use:

from recoll import recoll, rclextract

qdoc = query.fetchone()
extractor = rclextract.Extractor(qdoc)
doc = extractor.textextract(qdoc.ipath)
# use doc.text, e.g. for previewing

Passing qdoc.ipath to textextract() looks redundant, but it reflects the fact that the Extractor object can also access the other entries of a compound document.
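Because doc.text may be either plain text or HTML, code that displays a preview usually has to branch on doc.mimetype. The following is a minimal sketch of one way to do that, not part of the Recoll API: preview_text() and its limit parameter are hypothetical names, it assumes doc.text is a str, and it uses a SimpleNamespace as a stand-in for the Doc object that textextract() would return.

```python
from html.parser import HTMLParser
from types import SimpleNamespace


class _TextCollector(HTMLParser):
    """Collect the character data of an HTML document, ignoring tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def preview_text(doc, limit=200):
    """Return a short one-line preview of an extracted Doc-like object.

    Assumption: doc.text is a str and doc.mimetype is "text/plain"
    or "text/html".
    """
    text = doc.text
    if doc.mimetype == "text/html":
        collector = _TextCollector()
        collector.feed(text)
        collector.close()
        text = "".join(collector.chunks)
    # Collapse runs of whitespace so the preview fits on one line.
    return " ".join(text.split())[:limit]


# Stand-in for the Doc returned by extractor.textextract(); a real one
# would come from Recoll.
sample = SimpleNamespace(mimetype="text/html", text="<p>Hello <b>world</b></p>")
print(preview_text(sample))
```

With a real index, the sample object would be replaced by the result of extractor.textextract(qdoc.ipath).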

Extractor.idoctofile(ipath, targetmtype, outfile='')

Extracts the document into an output file. The file name can be given explicitly; otherwise a temporary file is created, which the caller is responsible for deleting. Typical use:

from recoll import recoll, rclextract

qdoc = query.fetchone()
extractor = rclextract.Extractor(qdoc)
filename = extractor.idoctofile(qdoc.ipath, qdoc.mimetype)
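Since the caller owns the temporary file, it is easy to leak it when an exception occurs between extraction and cleanup. One possible way to guard against that is a small context manager around the call above. This is only a sketch: extracted_copy() is a hypothetical helper, and it assumes that idoctofile() returns the path of the file it created when no outfile is given.

```python
import contextlib
import os


@contextlib.contextmanager
def extracted_copy(extractor, qdoc):
    """Yield the path of a temporary copy of qdoc, then delete it.

    Assumption: extractor.idoctofile() returns the path of the
    temporary file when no outfile argument is passed.
    """
    filename = extractor.idoctofile(qdoc.ipath, qdoc.mimetype)
    try:
        yield filename
    finally:
        # The temporary file belongs to the caller, so remove it here.
        # Ignore the case where the body already deleted or moved it.
        with contextlib.suppress(FileNotFoundError):
            os.remove(filename)
```

It would be used as "with extracted_copy(extractor, qdoc) as path:", reading or copying the file inside the with block. The file is removed when the block exits, whether normally or through an exception.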

In all cases the output is a copy, even if the requested document is a regular system file, which may be wasteful. To avoid this, you can test for a simple file document as follows:

not doc.ipath and ("rclbes" not in doc.keys() or doc["rclbes"] == "FS")
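The test above can be wrapped in a small predicate so that calling code reads more clearly. The sketch below makes a few assumptions: is_simple_file() is a hypothetical helper name, FakeDoc is a stand-in used only for illustration (a real Doc comes from a Recoll query), and "WEB" is just a placeholder for any backend value other than "FS".

```python
def is_simple_file(doc):
    """Return True if doc looks like a regular file system document.

    doc is expected to behave like a Recoll Doc: an ipath attribute,
    a keys() method, and item access by field name.
    """
    if doc.ipath:
        # A non-empty ipath designates an entry inside a compound document.
        return False
    # A missing "rclbes" field is treated the same as the "FS" backend.
    return "rclbes" not in doc.keys() or doc["rclbes"] == "FS"


class FakeDoc(dict):
    """Hypothetical stand-in for a Recoll Doc, for demonstration only."""

    def __init__(self, ipath="", **fields):
        super().__init__(**fields)
        self.ipath = ipath


print(is_simple_file(FakeDoc()))                # no ipath, no rclbes field
print(is_simple_file(FakeDoc(rclbes="WEB")))    # placeholder non-FS backend
print(is_simple_file(FakeDoc(ipath="1")))       # entry in a compound document
```

When the predicate is true, the original file can be used in place and the call to idoctofile() skipped.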