Unknown reporter writes
I have indexed the same page twice via the Firefox extension, and per
$ strings webcache/circache.crch | grep -i url
that page is now stored twice in the web cache. The page has changed, but not in content: only script ids and similar inconsequential details differ. However, only one page is indexed, as seen with the query dir:/
(only one result for the duplicated URL), and the index references only the latest content, as verified by saving a copy via the recoll GUI (the copy matches the file last saved by the Firefox extension in .recollweb/ToIndex).
medoc writes
The cache is a circular buffer. The older entry stays in the file until newly written data wraps around and overwrites it.
I now think that designing the cache this way was a very bad idea (and the implementation was vastly more complicated than expected), but this is not a bug: changing this behaviour would require redesigning and recoding the web cache.
There is a utility somewhere which could be used to compact the file; ask me for details if you need it.
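To illustrate why two copies of the same URL can coexist for a while, here is a minimal toy model of a fixed-size circular cache. It is only a sketch under stated assumptions: the class `CircularCache` and its methods are hypothetical names, and it models neither the real circache.crch file format nor the actual recoll code. It only shows the general behaviour: a new copy is appended, the older copy survives until new data needs its space, and a lookup returns the most recent copy.

```python
from collections import deque


class CircularCache:
    """Toy model of a fixed-size circular cache (hypothetical, not recoll's format)."""

    def __init__(self, capacity):
        self.capacity = capacity  # total number of bytes the cache may hold
        self.used = 0             # bytes currently occupied
        self.entries = deque()    # (url, data) pairs, oldest on the left

    def put(self, url, data):
        """Append an entry, dropping the oldest ones when space runs out."""
        if len(data) > self.capacity:
            raise ValueError("entry is larger than the whole cache")
        # A wrapping writer overwrites the oldest data first; model that
        # by removing entries from the old end until the new one fits.
        while self.used + len(data) > self.capacity:
            _, old_data = self.entries.popleft()
            self.used -= len(old_data)
        self.entries.append((url, data))
        self.used += len(data)

    def copies(self, url):
        """Count how many stored entries carry this URL."""
        return sum(1 for u, _ in self.entries if u == url)

    def latest(self, url):
        """Return the most recently stored data for this URL, or None."""
        for u, data in reversed(self.entries):
            if u == url:
                return data
        return None


cache = CircularCache(capacity=10)
cache.put("http://example.com/page", b"v1__")   # first save of the page
cache.put("http://example.com/page", b"v2__")   # second save, same URL
print(cache.copies("http://example.com/page"))  # both copies are still stored
print(cache.latest("http://example.com/page"))  # lookups see the newest copy
cache.put("http://example.com/other", b"xxxx")  # needs room: oldest copy goes
print(cache.copies("http://example.com/page"))  # only the newer copy remains
```

In this model the duplicate is harmless apart from the space it occupies, which matches what was reported: the file holds two entries for the URL, while queries see only the latest one.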