The rclextract module

Index queries do not provide document content (only a partial and imprecise reconstruction is performed to show the snippet text). To access the actual document data, the data extraction part of the indexing process must be performed again (subdocument access and format translation). This is not trivial in general. The rclextract module currently provides a single class which can be used to access the data content of result documents.

The Extractor class
An Extractor object is built from a Doc object, output from a query.
Extractor.textextract(ipath)
Extracts the document defined by ipath and returns a Doc object. The doc.text field holds the document text, converted to either text/plain or text/html according to doc.mimetype. Typical use:
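from recoll import rclextract

# query is a Query object on which a search has already been executed
qdoc = query.fetchone()
extractor = rclextract.Extractor(qdoc)
doc = extractor.textextract(qdoc.ipath)
# use doc.text, e.g. for previewing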
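Because doc.text may be either plain text or HTML, a previewer usually needs to check doc.mimetype before displaying it. The following is a minimal sketch of that check, not part of the module itself. The extract_text helper and its extractor_factory parameter are hypothetical names introduced here for illustration; the factory argument exists only so that the function can be tried without a Recoll index.

```python
def extract_text(qdoc, extractor_factory=None):
    """Return (text, is_html) for the document behind a query result.

    qdoc is a Doc object returned by a query. extractor_factory is a
    hypothetical hook for this sketch; it defaults to rclextract.Extractor.
    """
    if extractor_factory is None:
        # Imported lazily so that this sketch loads even without Recoll.
        from recoll import rclextract
        extractor_factory = rclextract.Extractor
    extractor = extractor_factory(qdoc)
    doc = extractor.textextract(qdoc.ipath)
    # doc.mimetype tells which of the two output formats was produced.
    return doc.text, doc.mimetype == "text/html"
```

A caller could then route the text to an HTML widget when the second value is true, and to a plain text view otherwise.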
Extractor.idoctofile(ipath, targetmtype, outfile='')
Extracts the document into an output file. If outfile is given, the data is written to that file; otherwise a temporary file is created, which the caller is responsible for deleting. Typical use:
from recoll import rclextract

qdoc = query.fetchone()
extractor = rclextract.Extractor(qdoc)
filename = extractor.idoctofile(qdoc.ipath, qdoc.mimetype)
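Since a temporary file created by idoctofile must be deleted by the caller, it can be convenient to tie the cleanup to a with block. The following is a minimal sketch under that assumption. The extracted_file helper and its extractor_factory parameter are hypothetical names used for illustration only and are not part of rclextract.

```python
import contextlib
import os


@contextlib.contextmanager
def extracted_file(qdoc, extractor_factory=None):
    """Yield the name of a temporary copy of the document, then delete it.

    extractor_factory is a hypothetical hook for this sketch; it defaults
    to rclextract.Extractor.
    """
    if extractor_factory is None:
        # Imported lazily so that this sketch loads even without Recoll.
        from recoll import rclextract
        extractor_factory = rclextract.Extractor
    extractor = extractor_factory(qdoc)
    # No outfile argument: a temporary file is created for us.
    filename = extractor.idoctofile(qdoc.ipath, qdoc.mimetype)
    try:
        yield filename
    finally:
        # The caller owns the temporary file, so remove it when done.
        if os.path.exists(filename):
            os.remove(filename)
```

It could be used as `with extracted_file(qdoc) as filename: ...`, with the file removed when the block exits, even if an exception is raised inside it.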