manuelbuser writes

Hello

First of all, recoll is fabulous. I am using large zip containers that hold archived data in a tree structure. It seems that all files inside the zip are indexed, even if there is a "skipped name" or "skipped path" rule that would match a name / path inside the zip. Am I doing something wrong? It would be great to have skipping rules also work inside containers.

Recoll 1.19.3 + xapian 1.2.8

medoc writes

Hi,

You are right that skipped paths and names are not respected inside zip files (or other archives); they only apply to real file system files.

I agree that it would be nice to have the possibility to use the file selection configuration inside archives. This is not a simple issue, because the code which walks the file system and the one which walks zip archives are totally separate (not even the same language…), and filters currently can’t access the configuration.

Identifiers inside compound documents are not necessarily file-like paths (e.g.: email folder files have message numbers), so there was no real reason initially to extend the path selection mechanism to filters.

I am putting this on the todo list, but it will take some time.

Meanwhile, if you can write a little Python, it would probably be quite simple to modify the zip filter for skipping some paths or names (which you could read from some kind of configuration file, or just hard-code inside the modified filter).

You can then tell recoll to use your own filter by having the following inside ~/.recoll/mimeconf:

[index]
application/zip = execm /path/to/my/rclzip;charset=default
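
As a rough illustration of the kind of modification suggested above, here is a minimal standalone sketch of skipping zip members by name. It is not the actual rclzip code: SKIP_PATTERNS and wanted_members are hypothetical names, the patterns are hard-coded, and fnmatch.fnmatchcase is used only so that the result does not depend on the platform's case rules.

```python
import fnmatch
import io
import zipfile

# Hypothetical hard-coded skip list; a modified filter could also read
# these patterns from some configuration file.
SKIP_PATTERNS = ["*.tmp", "backup/*"]


def wanted_members(zf, patterns=SKIP_PATTERNS):
    """Return the member names of an open ZipFile that match no skip pattern."""
    return [
        name
        for name in zf.namelist()
        if not any(fnmatch.fnmatchcase(name, pat) for pat in patterns)
    ]


# Build a small archive in memory so the sketch runs without any files on disk.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("docs/readme.txt", "hello")
    zf.writestr("docs/scratch.tmp", "junk")
    zf.writestr("backup/old.txt", "old")

# Reopen the archive for reading and keep only the members worth indexing.
with zipfile.ZipFile(buf) as zf:
    kept = wanted_members(zf)

print(kept)
```

In a modified filter, the same test would be applied wherever the filter enumerates the archive members, so that skipped entries are never handed to the indexer.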

medoc writes

Not a bug actually, but a quite desirable enhancement.

medoc writes

Fixed by the new rclconfig.py module and a modification of the rclzip code. To use it before the next release:

  • Fetch python/recoll/recoll/rclconfig.py and filters/rclzip from the source tree

  • Copy both to /usr/share/recoll/filters, make rclzip executable

Set a variable named zipSkippedNames inside recoll.conf:

  • This is a space-separated list of patterns which will be passed to Python fnmatch. The / characters are not special: they are matched like any other character, so * can match across directory separators.

  • You can’t use embedded spaces in patterns (no double-quote quoting for now)

  • This can be redefined for file system directories using the usual section indicators

Example:

zipSkippedNames = *.txt
[/path/to/the/dir]
zipSkippedNames = somedir/*/*.html
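
To make the matching rules above concrete, here is a small sketch of how such values behave under Python fnmatch. The helper names parse_skipped and is_skipped are hypothetical and are not taken from rclzip. fnmatch.fnmatchcase is used to keep the result independent of the platform, and the real filter may call a different fnmatch function.

```python
import fnmatch


def parse_skipped(value):
    """Split a zipSkippedNames value on whitespace (no quoting support)."""
    return value.split()


def is_skipped(member, patterns):
    """True if the archive member path matches any of the patterns."""
    return any(fnmatch.fnmatchcase(member, pat) for pat in patterns)


# The two values from the example configuration.
top_level = parse_skipped("*.txt")
per_dir = parse_skipped("somedir/*/*.html")

# '*' also matches '/' characters, so '*.txt' catches files at any depth.
print(is_skipped("a/b/notes.txt", top_level))

# For the same reason, the first '*' here can span several path components.
print(is_skipped("somedir/x/page.html", per_dir))
print(is_skipped("somedir/x/y/page.html", per_dir))

# The pattern still needs a '/' after 'somedir/', so this one is kept.
print(is_skipped("somedir/page.html", per_dir))
```

Because the split is on whitespace, a pattern containing a space would be cut into two separate patterns, which is why embedded spaces cannot be used.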