Unix-like systems: indexing visited Web pages

With the help of a Firefox extension, Recoll can index the Internet pages that you visit. The extension has a long history: it was initially designed for the Beagle indexer, then adapted to Recoll and the Firefox XUL API. The current version of the extension is located in the Mozilla add-ons repository uses the WebExtensions API, and works with current Firefox versions.

The extension works by copying visited Web pages to an indexing queue directory, which Recoll then processes, storing the data into a local cache, then indexing it, then removing the file from the queue.

The visited Web pages indexing feature can be enabled on the Recoll side from the GUI Index configuration panel, or by editing the configuration file (set processwebqueue to 1).

The Recoll GUI has a tool to list and edit the contents of the Web cache. (ToolsWebcache editor)

You can find more details on Web indexing, its usage and configuration in a Recoll 'Howto' entry.

The cache is not an archive

A copy of the indexed Web pages is retained by Recoll in a local cache (from which data is fetched for previews, or when resetting the index). The cache has a maximum size, which can be adjusted from the Index configuration / Web history panel (webcachemaxmbs parameter in recoll.conf). Once the maximum size is reached, old pages are erased to make room for new ones. The pages which you want to keep indefinitely need to be explicitly archived elsewhere. Using a very high value for the cache size can avoid data erasure, but see the above 'Howto' page for more details and gotchas.