orbisvicis writes
It isn’t possible to delete indexed web pages*:
- Scripts only apply to local file-system documents and not cached web pages
- Even if they did, recollindex -e cannot operate if the real-time indexer is running
Also, I’m not sure if scripts should be used to manage indexed web pages. I can see why you’d hesitate to add that functionality for local documents, as recoll is intended to mirror a filesystem tree, though actually I’m not sure if recoll removes files from the index when they’ve been deleted (does it?). But indexed web pages don’t exist on the filesystem and can’t be managed that way, so recoll needs to act as the frontend.
* It’s very easy to unintentionally index an incorrect web page.
medoc writes
There is a command-line utility which can do this. It’s not very friendly, but I guess that some kind of front-end could be built over it. It’s not delivered with the Recoll package though, so you’d need to build from source (you can use configure --disable-qtgui to make things simpler and reduce dependencies). After a normal build of recoll 1.21, go to the utils directory and type make trcircache
Then trcircache -d ~/.recoll/webcache will list the contents, and trcircache -d ~/.recoll/webcache -e someudi would erase an entry.
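As a starting point for the kind of front-end mentioned above, here is a minimal Python sketch that only assembles the two trcircache command lines exactly as they are quoted in this thread. The function names, the DEFAULT_WEBCACHE constant, and the run helper are hypothetical, and the sketch assumes a trcircache binary built from the utils directory is on the PATH.

```python
import os
import subprocess

# Assumption: the web cache location quoted in the thread.
DEFAULT_WEBCACHE = os.path.expanduser("~/.recoll/webcache")


def list_command(cache_dir=DEFAULT_WEBCACHE):
    """Build the argv that lists the cache contents (as quoted in the thread)."""
    return ["trcircache", "-d", cache_dir]


def erase_command(udi, cache_dir=DEFAULT_WEBCACHE):
    """Build the argv that erases one entry by its udi (as quoted in the thread)."""
    return ["trcircache", "-d", cache_dir, "-e", udi]


def run(argv):
    """Run a command and return its standard output as text.

    Hypothetical helper: it raises CalledProcessError on a non-zero exit
    status and FileNotFoundError if trcircache is not installed.
    """
    result = subprocess.run(argv, capture_output=True, text=True, check=True)
    return result.stdout
```

Keeping the command construction separate from the execution makes it possible to show the exact command to the user for confirmation before anything is erased.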
orbisvicis writes
What are your plans for this?
medoc writes
I have no plans to do anything more at the moment; this is the first time this has ever been asked for.
Did you try to build and run trcircache and check that you could do what you needed with it?
orbisvicis writes
trcircache does delete content; however, it leaves behind empty entries (dicsize and datasize both 0):
Scan: offs 22221 dicsize 0 datasize 0 padsize 1389 flags 0 udi []
I assume this is because of the nature of the circular buffer, and that this space will eventually be reclaimed?
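To spot such emptied slots in a listing, a small Python sketch can parse lines of the form shown above. The regular expression is inferred from that single sample line only, so the real output may contain variations it does not handle; parse_scan_line and is_emptied are hypothetical names.

```python
import re

# Pattern inferred from the one sample line quoted in the thread.
SCAN_RE = re.compile(
    r"Scan: offs (?P<offs>\d+) dicsize (?P<dicsize>\d+) "
    r"datasize (?P<datasize>\d+) padsize (?P<padsize>\d+) "
    r"flags (?P<flags>\d+) udi \[(?P<udi>.*)\]"
)


def parse_scan_line(line):
    """Return the fields of a 'Scan:' line as a dict, or None if it does not match."""
    match = SCAN_RE.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    udi = fields.pop("udi")
    entry = {name: int(value) for name, value in fields.items()}
    entry["udi"] = udi
    return entry


def is_emptied(entry):
    """True when the entry carries no dictionary, no data, and no udi."""
    return entry["dicsize"] == 0 and entry["datasize"] == 0 and entry["udi"] == ""


sample = "Scan: offs 22221 dicsize 0 datasize 0 padsize 1389 flags 0 udi []"
entry = parse_scan_line(sample)
print(entry["offs"], entry["padsize"], is_emptied(entry))
```

This only reports which slots look empty; it says nothing about whether or when the cache reuses that space.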
orbisvicis writes
What I’d like to be able to do, however, is hook into the functionality of trcircache from the GUI, like scripts for cached web pages. I’d also like to be able to keep the index synchronized with the web-page cache, either by notifying recoll from the script that the circache changed and the index should be rescanned, or by manually deleting entries from the index (via the script) while the real-time indexer is running. One of the command-line xapian management tools might have this functionality.
medoc writes
The cache is only reindexed by the initial incremental pass which the monitor performs before it actually starts monitoring. It’s possible to re-trigger this pass by touching the configuration file (the indexer will re-exec itself).
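A script that has just changed the web cache could perform that touch itself. The sketch below is a minimal Python version; touch_config is a hypothetical name, and the configuration file path (for example ~/.recoll/recoll.conf) is an assumption that depends on the local setup.

```python
import os
from pathlib import Path


def touch_config(conf_path):
    """Update the access and modification times of an existing configuration file.

    Returns the new modification time. The file content is left untouched,
    and a missing file raises an error instead of being created.
    """
    path = Path(conf_path).expanduser()
    if not path.is_file():
        raise FileNotFoundError(path)
    os.utime(path, None)  # None sets both timestamps to the current time
    return path.stat().st_mtime


# Example call, assuming the configuration lives at this path:
# touch_config("~/.recoll/recoll.conf")
```

Refusing to create a missing file avoids silently producing an empty configuration when the path is wrong.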
medoc writes
Recoll 1.22 will have an editor to manage the webcache contents from the GUI. This is implemented in the following commits and a few others before them.