piater writes

Xournal .xoj files are gzip-compressed XML files. A bare-bones recoll installation, thanks to "file -i", calls the rcluncomp filter with gzip on them, but gzip quits with an "unknown suffix" error. I fixed this by adding

{{{
*.xoj)
    $uncomp < "$infile" > "$outdir/$sinfile.xml" || exit 1
    uncompressed="$outdir/$sinfile.xml"
    ;;
}}}

as a new first "case" in rcluncomp, a fix that I hereby submit for your consideration for inclusion. (The XML is still not indexed, but this is a separate issue.)

However, there are quite a few similar compressed formats out there. Short of writing a specialized filter for each of them, generic baseline support might be provided by rewriting rcluncomp so that it treats all compressed files with unknown suffixes the same way. That is, instead of the current solution, which lists known non-compressor extensions (.xoj) as special cases and then expects all other extensions to be compressor extensions (.gz, …), one might handle all known compressor extensions explicitly, and then fall back on the "*)" case.

medoc writes

Does "file" know about uncompressed xoj files ?

If it doesn’t, and if we don’t change the extension while uncompressing, as you propose, we’d have to add an association in mimemap for .xoj, which we can’t do because an xoj file could be either compressed or uncompressed data.

Except if you have an idea how to handle this, I’m afraid that you’ll have to have a special filter for Xournal files, in which you’d test if the file is compressed, uncompress it if needed, then call the xml filter (or a specific xslt program) on the uncompressed data.
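To illustrate the "test if the file is compressed, uncompress it if needed" step, here is a minimal sketch. It is not an existing recoll filter: the function names `is_gzip` and `xoj_to_xml` are hypothetical, and the check simply looks for the two gzip magic bytes (0x1f 0x8b) at the start of the file. The output would then be handed to whatever XML processing is chosen.

```shell
# Hypothetical helpers, not part of recoll.

# is_gzip FILE: succeeds if FILE starts with the gzip magic bytes 0x1f 0x8b.
is_gzip() {
    magic=$(od -An -tx1 -N2 "$1" | tr -d ' \n')
    [ "$magic" = "1f8b" ]
}

# xoj_to_xml FILE: writes the XML content of FILE to stdout, whether the
# file is gzip-compressed or plain XML.
xoj_to_xml() {
    if is_gzip "$1"; then
        # Reading from stdin sidesteps any file name suffix check.
        gzip -dc < "$1"
    else
        cat "$1"
    fi
}
```

Because the decision is made on the file contents, the same code path would serve both compressed and uncompressed .xoj files.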

Another, more generic, approach would be to modify the xml filter to handle compressed data, and have it handle .xoj files, like the following:

mimemap:
{{{
.xoj = application/x-xournal
}}}

mimeconf:
{{{
application/x-xournal = exec rclxml
}}}

I did not test this, but I see no reason why this shouldn’t work.

piater writes

"file" does not know about any xoj files, compressed or uncompressed. But since "file" looks at file contents (and not names), "file -i compressed.xoj" correctly yields "application/x-gzip", and "file -i uncompressed.xoj" correctly yields "application/xml". Thus, I think my proposed generic fallback solution should work on xoj files, as well as any other unknown file extensions, as long as "file" gets the file type right. (Unless I misunderstand how recoll guesses file types.)

medoc writes

No, you are right, this was the sense of my initial question about //file// and uncompressed //xoj//. The answer is that //file// does know about them (as XML), so things work ok.
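For readers who want to check the content-based detection discussed here on their own machine, a small sketch follows. The wrapper name `mime_of` is hypothetical; it only calls `file` with the `-b` and `--mime-type` options. The exact MIME strings reported depend on the installed version of `file`, so none are shown.

```shell
# Hypothetical wrapper: print the MIME type that "file" derives from the
# contents of the given file. The file name suffix plays no role.
mime_of() {
    file -b --mime-type "$1"
}

# Example use (file names are placeholders):
#   mime_of compressed.xoj
#   mime_of uncompressed.xoj
```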

The only problem I have with this approach is that it forces us to list all known compression suffixes in rcluncomp (e.g., did you know that gunzip supported //myfile-gz//, //myfile-z// and //myfile_z//? I just discovered it in the gzip source…). This would not be a major issue, except that missing a suffix (possibly for some gzip variant) could introduce a regression for data that is currently handled.

I am a risk-averse person :)

Still I like the idea of handling all compressed xml formats in one small change.

Just a little more thinking needed…

piater writes

No, I did not know about the messy world of known compressor extensions; I assumed there were relatively few. One way to provide a fallback would be to detect gunzip’s failure when run on the file, and then retry using gunzip as a filter. However, this would cause gunzip to be run twice on such files, silently and indefinitely. Not a good solution.

How about leaving things as they are, without a fallback, and adding unrecognized extensions like .xoj by hand as they are identified? This is easy enough after all. If this is the right way to proceed, then how about catching the decompressor’s failure and reporting this to the user, analogously to the way missing handlers are reported? In fact, one might argue that the current behavior, a filter silently failing, is a bug. If this bug cannot be fixed by avoiding the failure, it can perhaps be fixed by reporting the failure in a user-friendly way.

medoc writes

Editing this, as special-casing the known suffixes is not needed at all. So:

Actually the error is not silent: it leaves an error message in the log file (if one is defined). It’s impossible to explicitly report all errors: there is too much garbage on most significantly sized file systems, and users would be overwhelmed. But the errors are there to see in the log if you look for them (egrep ^:2 somelogfile)

I think your first idea gives the solution. I’ll modify rcluncomp so that:

* Decompression will be tried for all suffixes. This will catch any compressed file with a compression suffix. If decompression fails at this step, we retry in filter mode. An error run is fast, as the uncompressor just needs to look at the file name. Actually this is more than an order of magnitude faster than a //file// call.
* For suffix-less data we do as currently.
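A minimal sketch of the plan above, for illustration only; it is not the actual rcluncomp code. The function name `uncompress_into` is hypothetical, and the sketch assumes that the output directory exists and is empty, and that the uncompressor (gunzip, for instance) both accepts a file name and works as a stdin/stdout filter.

```shell
# Hypothetical sketch, not the code committed to recoll.
# uncompress_into UNCOMP INFILE OUTDIR
# Prints the path of the uncompressed file on success.
uncompress_into() {
    uncomp=$1
    infile=$2
    outdir=$3
    sinfile=$(basename "$infile")

    case "$sinfile" in
    *.*)
        # Name with a suffix: work on a copy and let the uncompressor
        # strip the suffix. It rejects an unknown suffix after looking
        # only at the file name, so a failed attempt is cheap.
        cp "$infile" "$outdir/$sinfile" || return 1
        if $uncomp "$outdir/$sinfile" 2>/dev/null; then
            # Known suffix: the only file left in OUTDIR is the result.
            set -- "$outdir"/*
            uncompressed=$1
        else
            # Unknown suffix: retry in filter mode, keeping the name.
            rm -f "$outdir/$sinfile"
            $uncomp < "$infile" > "$outdir/$sinfile" || return 1
            uncompressed="$outdir/$sinfile"
        fi
        ;;
    *)
        # No suffix: filter mode, as before.
        $uncomp < "$infile" > "$outdir/$sinfile" || return 1
        uncompressed="$outdir/$sinfile"
        ;;
    esac
    printf '%s\n' "$uncompressed"
}
```

Called as `uncompress_into gunzip notes.xoj "$tmpdir"` (placeholder names), the first attempt fails on the unknown suffix and the filter-mode retry produces the uncompressed data under the original file name.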

Consequences:

* Compressed files with known compression suffix: no change except maybe a line of shell code.
* Compressed files with unknown suffix: need one failed execution of the uncompressor in excess of what optimal processing would do. If this is a problem, just write a specific filter. They are currently not indexed at all anyway.
* Compressed files with no suffix: practically no change.

Except if you see an issue with this approach (it really took two to get to a solution here), I’ll go with it.

piater writes

I think you nailed it. Looking forward to seeing this implemented.

medoc writes

Handle non-standard file name suffixes during decompression. Recoll should now index arbitrary compressed XML formats. Closes issue #93

→ <<cset 9a53eacb0a17>>

medoc writes

Thanks for working with me on this. I think that you should be able to apply the following three changes to 1.17 to obtain the desired behaviour for compressed xml:

* https://bitbucket.org/medoc/recoll/changeset/9a53eacb0a17
* https://bitbucket.org/medoc/recoll/changeset/f3ccfe3772dc
* https://bitbucket.org/medoc/recoll/changeset/ff5641c87e3c