The rclextract module

Index queries do not provide document content (only a partial and unprecise reconstruction is performed to show the snippets text). In order to access the actual document data, the data extraction part of the indexing process must be performed (subdocument access and format translation). This is not trivial in the case of embedded documents. The rclextract module provides a single class which can be used to access the data content for result documents.

The Extractor class

An Extractor object is built from a Doc object, output from a query.


Extract document defined by ipath and return a Doc object. The doc.text field has the document text converted to either text/plain or text/html according to doc.mimetype. The typical use would be as follows:

qdoc = query.fetchone()
extractor = recoll.Extractor(qdoc)
doc = extractor.textextract(qdoc.ipath)
# use doc.text, e.g. for previewing

Passing qdoc.ipath to textextract() is redundant, but reflects the fact that the Extractor object actually has the capability to access the other entries in a compound document.

Extractor.idoctofile(ipath, targetmtype, outfile='')

Extracts document into an output file, which can be given explicitly or will be created as a temporary file to be deleted by the caller. Typical use:

qdoc = query.fetchone()
extractor = recoll.Extractor(qdoc)
filename = extractor.idoctofile(qdoc.ipath, qdoc.mimetype)

In all cases the output is a copy, even if the requested document is a regular system file, which may be wasteful in some cases. If you want to avoid this, you can test for a simple file document as follows:

not doc.ipath and (not "rclbes" in doc.keys() or doc["rclbes"] == "FS")