scasier writes

Hi again,

First of all, sorry for the barrage of bug reports. These are all smaller "papercut" type issues that have been bothering me for a while. I was always too lazy to report them, but now that I have a Bitbucket account I figured it was time to get this over with. Please feel free to take all the time you need to go through these.

Anyway, this report concerns the mimetype detection for files that aren’t listed in mimemap. Right now, Recoll utilizes file -i to get the mimetype information for these files. Unfortunately file isn’t very accurate on my system (Ubuntu 12.04).

One case where it consistently fails is JavaScript files, which it loves to detect as HTML:

file -i 'stream.js'
stream.js: text/html; charset=us-ascii

OTOH, if I perform the same query with xdg-mime I always get the correct result:

xdg-mime query filetype 'stream.js'
application/javascript
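
For anyone who wants to check how often the two tools disagree on their own files, here is a minimal Python sketch. It only assumes that `file -i` prints lines shaped like the one above and that `xdg-mime query filetype` prints a bare MIME type; the function names are made up for illustration and none of this is part of Recoll.

```python
import subprocess
import sys
from pathlib import Path


def parse_file_output(line):
    """Extract the MIME type from a `file -i` line such as
    'stream.js: text/html; charset=us-ascii'."""
    # Split on the last ': ' so that a colon in the file name is less likely to confuse us.
    _, _, rest = line.rpartition(": ")
    # Drop the '; charset=...' part.
    return rest.split(";", 1)[0].strip()


def run(cmd):
    """Run a command and return its stripped stdout, or '' if it fails or is missing."""
    try:
        result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    except (OSError, subprocess.CalledProcessError):
        return ""
    return result.stdout.strip()


def compare(path):
    """Return (type according to file, type according to xdg-mime) for one path."""
    file_type = parse_file_output(run(["file", "-i", str(path)]))
    xdg_type = run(["xdg-mime", "query", "filetype", str(path)])
    return file_type, xdg_type


if __name__ == "__main__":
    # Walk every directory given on the command line and print the disagreements.
    for arg in sys.argv[1:]:
        for path in sorted(Path(arg).rglob("*")):
            if path.is_file():
                file_type, xdg_type = compare(path)
                if file_type != xdg_type:
                    print(f"{path}: file={file_type!r} xdg-mime={xdg_type!r}")
```

Saved under a name of your choosing and run with a directory as its argument, it lists only the files on which the two tools differ.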

Is there any way to make Recoll use xdg-mime instead of file? I did find the --with-file-command configuration option in the documentation but I assume that this will only work for variants of the file utility.

Thank you very much in advance. SC.

medoc writes

I’ve added a parameter to change the command to use as a last resort for mime type identification. It’s named systemfilecommand, in recoll.conf, e.g.:

systemfilecommand = xdg-mime query filetype

There may be unexpected consequences to using xdg-mime though: some files which were previously indexed as text/plain may not be indexed at all because their new mime type is unknown (this would be the case for javascript by the way, unless you add something to mimeconf).
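
To make that consequence concrete, here is a rough Python sketch of the decision as described in this thread: suffix lookup first, the external command as a last resort, and no indexing when the resulting type has no handler. This is not Recoll's actual code. The two dictionaries and the fake detectors are hypothetical stand-ins, and the internal classifier mentioned below is left out entirely.

```python
import os

# Hypothetical stand-ins for mimemap (suffix -> type) and the [index] section of mimeconf.
MIMEMAP = {".txt": "text/plain", ".html": "text/html"}
INDEX_HANDLERS = {"text/plain": "internal", "text/html": "internal"}


def identify(path, system_command):
    """Look the suffix up first; fall back to the external command as a last resort."""
    suffix = os.path.splitext(path)[1].lower()
    if suffix in MIMEMAP:
        return MIMEMAP[suffix]
    return system_command(path)


def will_be_indexed(path, system_command):
    """A file is only indexed if its MIME type has a handler configured."""
    return identify(path, system_command) in INDEX_HANDLERS


# Fake detectors standing in for `file -i` and `xdg-mime query filetype`.
def fake_file(path):
    return "text/html"  # the wrong answer reported above, but it does have a handler


def fake_xdg(path):
    return "application/javascript"  # correct, but unknown until a handler is added


if __name__ == "__main__":
    print(will_be_indexed("stream.js", fake_file))
    print(will_be_indexed("stream.js", fake_xdg))
    # Adding a handler for the new type restores indexing.
    INDEX_HANDLERS["application/javascript"] = "internal text/plain"
    print(will_be_indexed("stream.js", fake_xdg))
```

The point of the sketch is only that a more accurate type can mean less indexing until the matching handler entry exists.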

I was not aware that xdg-mime and file used a different code base, which is why I never experimented with this. The main reason for having an internal classifier was that file was unable to reliably identify mbox files, which xdg-mime seems to do correctly. Maybe I’ll be able to dump the internal code in time…

scasier writes

Thank you very much for the new option and the pointers to using it correctly! Everything is working perfectly.

Here’s how I modified mimemap and mimeconf to accommodate the new mimetype detection method:

mimemap

Removed assignment of .php to text/html in favor of detection by xdg-mime:

-.php = text/html

Removed assignment of .mp4 to audio/mp4 in favor of detection by xdg-mime:

-.mp4 = audio/mp4

This was another issue I would often run into with the indexer: it would identify .mp4 files as audio and then invoke rclaudio, which would fail. In the result list this would show up as a video file with a generic icon.

mimeconf

I added the PHP and JavaScript MIME types to the [index] and [icons] sections:

[index]
+application/x-php = internal text/plain
+application/javascript = internal text/plain
[icons]
+application/x-php = source
+application/javascript = source
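
In case it helps anyone making similar edits: a small Python sketch that checks which detected MIME types still lack an [index] entry. The parser is a simplification that only understands `[section]` headers, `name = value` lines and `#` comments, so it is an assumption about the file layout rather than a faithful mimeconf reader, and the sample text and function names are made up.

```python
def parse_sections(text):
    """Parse 'name = value' lines grouped under '[section]' headers."""
    sections = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if line.startswith("[") and line.endswith("]"):
            current = sections.setdefault(line[1:-1], {})
        elif "=" in line and current is not None:
            key, _, value = line.partition("=")
            current[key.strip()] = value.strip()
    return sections


def missing_handlers(sections, detected_types):
    """Return the detected types that have no entry in the [index] section."""
    index = sections.get("index", {})
    return sorted(t for t in detected_types if t not in index)


# Sample text mirroring the additions above (without the leading '+').
SAMPLE = """\
[index]
application/x-php = internal text/plain
application/javascript = internal text/plain
[icons]
application/x-php = source
application/javascript = source
"""

if __name__ == "__main__":
    sections = parse_sections(SAMPLE)
    print(missing_handlers(sections, {"application/javascript", "video/mp4"}))
```

Feeding it the types that xdg-mime reports for a directory gives a quick list of what would silently drop out of the index.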

The new mimetype detection also has the added benefit of making the "include/exclude by mimetype" options far more reliable!

So, once again, thank you very much for implementing systemfilecommand! It has improved my experience with Recoll by quite a bit.

I really think it would be worth looking into using xdg-mime by default and reserving mimemap for the corner cases where a more specific mimetype assignment is needed.

Edit: Just a thought, but, with a more accurate mimetype detection in place, wouldn’t that also make it possible to use the default system icons for the mimetypes in question? I would imagine that this could save quite a lot of work with choosing and configuring the right icons to ship with a new supported mimetype.

medoc writes

It’s quite tempting to use the freedesktop shared mime database as the primary method of identification. However there are still a few issues:

  • It’s still not too good at identifying mail types (e.g. an mbox with empty lines at the top will be identified as text/plain, and message/rfc822 files from more "exotic" provenances will not be too well recognized either). Wrong identification as text/plain is a major no-no: it does not count as a failure, so no further identification is attempted.

  • There does not seem to be a widely packaged standalone library implementing an API over the shared database: glib has one (based on the xdg code), qt has one (based on?), but apparently there is no toolkit-agnostic API. So the choice is between always executing xdg-mime, with possibly inconsistent results (because the actual backend used will depend on the desktop environment) and possible performance issues, or using the freedesktop C code or the one from Gnome and maintaining it, which is not too appealing.
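
To illustrate the mbox case from the first point: a minimal Python sketch of a check that tolerates empty lines at the top. It relies only on the fact that an mbox message begins with a separator line starting with "From ". It is not Recoll's internal classifier nor the rule used by the shared mime database, and the function name is hypothetical.

```python
def looks_like_mbox(data: bytes) -> bool:
    """Heuristic: the first non-blank line starts with the 'From ' separator."""
    for line in data.splitlines():
        if line.strip():
            # Only the first non-blank line decides; leading empty lines are skipped.
            return line.startswith(b"From ")
    return False  # empty or all-blank input


if __name__ == "__main__":
    sample = b"\n\nFrom alice@example.com Sat Jan  1 00:00:00 2000\nSubject: hi\n\nbody\n"
    print(looks_like_mbox(sample))
```

A detector that only inspects the very first line would see an empty line here and fall back to text/plain, which is the failure described above.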

About the icons: you are right, and as this is only used in the Qt GUI, there is no problem in depending on Qt. I’ll look at what Qt interfaces exist for this, if any.