Prior to Recoll 1.25, index queries could not provide document content because it was never stored. Recoll 1.25 and later usually store the document text, which can optionally be retrieved when running a query (see query.execute() above; the result is always plain text).
The rclextract module gives access to the original document and to the document text content. The latter is useful if the text is not stored by the index, or if you need an HTML version of the text. Accessing the original document is particularly useful if it is embedded (e.g. an email attachment).
You need to import the recoll module before the rclextract module.
- Extractor(doc)
An Extractor object is built from a Doc object, output from a query.
- Extractor.textextract(ipath)
Extract the document defined by ipath and return a Doc object. The doc.text field holds the document text, converted to either text/plain or text/html according to doc.mimetype. The typical use would be as follows:
    from recoll import recoll, rclextract
    # 'query' is a Query object on which execute() has already been called
    qdoc = query.fetchone()
    extractor = rclextract.Extractor(qdoc)
    doc = extractor.textextract(qdoc.ipath)
    # use doc.text, e.g. for previewing
Passing qdoc.ipath to textextract() is redundant here, since the Extractor was built from qdoc, but it reflects the fact that the Extractor object can actually access the other entries in a compound document.
- Extractor.idoctofile(ipath, targetmtype, outfile='')
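Extract the document into an output file. The file name can be given explicitly with outfile. Otherwise a temporary file is created, and the caller is responsible for deleting it. Typical use:
    from recoll import recoll, rclextract
    # 'query' is a Query object on which execute() has already been called
    qdoc = query.fetchone()
    extractor = rclextract.Extractor(qdoc)
    # No outfile given: a temporary file is created, to be deleted by the caller
    filename = extractor.idoctofile(qdoc.ipath, qdoc.mimetype)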
In all cases the output is a copy, even if the requested document is a regular file in the file system, which may be wasteful. To avoid this, you can test for a simple file document as follows:
    not doc.ipath and ("rclbes" not in doc.keys() or doc["rclbes"] == "FS")