rdzidlic writes
On some distributions (e.g. Fedora), /tmp is mounted as tmpfs with a 1GB size limit.
Many files (compressed tarballs, mboxes) will easily require much more than that to be indexed.
Not only will the indexing fail, but system stability can also be affected, and system responsiveness surely will be.
Either Recoll should detect whether /tmp has enough space, or /var/tmp should be used by default.
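A minimal sketch of the "detect whether /tmp has enough space" idea, using Python's standard `shutil.disk_usage`. The function name `has_room` and its parameters are hypothetical and not part of Recoll; the caller has to supply its own estimate of the space needed.

```python
import shutil


def has_room(tmpdir, needed_bytes, reserve_bytes=0):
    """Return True if tmpdir reports enough free space.

    needed_bytes is the caller's own estimate of the temporary space
    required; reserve_bytes is extra headroom to leave for other programs.
    Both are illustrative parameters, not Recoll settings.
    """
    usage = shutil.disk_usage(tmpdir)  # named tuple: total, used, free
    return usage.free >= needed_bytes + reserve_bytes
```

A caller could test `has_room("/tmp", estimate)` before unpacking and skip the file, or fall back to another directory, when it returns False.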
medoc writes
As far as I know, Recoll uses $RECOLL_TMPDIR, $TMPDIR, /tmp as the location for temporary files and directories (and also for its run-time chdir), in this order of preference.
If you can see cases of hard-wired use of /tmp, I’ll try to fix them; if not, use the environment variables.
Otherwise, I see no reason to use /var/tmp instead of /tmp as an ultimate default. The official difference between the two is that /tmp is wiped more diligently, which is appropriate for recoll temp files.
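The order of preference described above ($RECOLL_TMPDIR, then $TMPDIR, then /tmp) can be sketched as follows. This is an illustration in Python, not Recoll's own code (which is C++); the function name `pick_tmpdir` is hypothetical, and treating an empty variable as unset is an assumption.

```python
import os


def pick_tmpdir(environ=None):
    """Return the temporary directory using the lookup order from the thread."""
    env = os.environ if environ is None else environ
    for name in ("RECOLL_TMPDIR", "TMPDIR"):
        value = env.get(name)
        if value:  # assumption: an empty value counts as "not set"
            return value
    return "/tmp"  # final default
```

With this order, setting RECOLL_TMPDIR redirects only Recoll, while TMPDIR also affects any other program that honours it.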
rdzidlic writes
Good to know about RECOLL_TMPDIR. It does not seem to be documented? Is it settable in recoll.conf, or only in .bashrc?
I do not know what you consider the official difference between /tmp and /var/tmp, but size limits are also part of the difference on Fedora, Solaris and probably Debian. I thought the limit was 1GB, but it was pointed out to me that it is RAM/2, which may still be easily exceeded when several tmp-hungry programs or recoll threads max it out.
IMHO it would still be better to use /var/tmp by default: you do not want users to hit random failures when they are easy to prevent. /tmp is slightly faster than /var/tmp on these systems, but only when it is guaranteed not to cause swapping or system failure, which is only the case when it is used for pretty small tasks.
The mis-feature caused quite a lot of discussion when it was introduced in Fedora, but unfortunately it is here to stay, and packages need to deal with it.
medoc writes
TMPDIR and RECOLL_TMPDIR are both environment variables. You are right that RECOLL_TMPDIR was not documented, thanks for pointing it out, I’ll do it.
The only thing which the Linux Filesystem Hierarchy Standard says about /var/tmp is that its content should be preserved across system reboots, which does not make sense for Recoll temporary files. http://refspecs.linuxfoundation.org/FHS_2.3/fhs-2.3.html
I really feel that changing the default to /var/tmp would be wrong as /tmp is the traditional place for fully discardable temporary files.
I don’t think that compressed tar files should trigger the problem: the Python filter should process them without using a temporary uncompressed file. Compressed mboxes could certainly be an issue. Email attachments too, but gigabyte attachments are rather rare.
How did you actually hit the problem?
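To illustrate how a compressed tar file can be read without writing an uncompressed copy to disk, here is a sketch using the stream mode of Python's standard `tarfile` module. It is not Recoll's actual filter; the function name `iter_tar_members` and the `max_bytes` cap are hypothetical.

```python
import tarfile


def iter_tar_members(path, max_bytes=1 << 20):
    """Yield (name, data) for each regular file in a possibly compressed tar.

    Mode "r|*" reads the archive as a forward-only stream with transparent
    decompression, so no uncompressed temporary file is created.
    At most max_bytes of each member are read into memory.
    """
    with tarfile.open(path, mode="r|*") as archive:
        for member in archive:
            if not member.isfile():
                continue  # skip directories, links, devices
            fobj = archive.extractfile(member)
            yield member.name, fobj.read(max_bytes)
```

In stream mode, members have to be consumed in archive order, which is why each one is read inside the loop before moving on to the next.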
rdzidlic writes
It is odd that http://refspecs.linuxfoundation.org/FHS_2.3/fhs-2.3.html#TMPTEMPORARYFILES doesn’t say anything about file size restrictions, but they are a reality.
Both places are cleared regularly (at different intervals), and recoll ideally should not leave any temporary files around, or should delete them at the next opportunity.
I am not sure how I hit the problem. Fedora 21 currently still ships recoll-1.20.2, which might not be as good as the latest release?
I certainly see compressed files being bunzipped in /tmp, and I also have mboxes exceeding gigabytes. One of these happened and caused a plethora of hard-to-trace problems for recoll and for totally unrelated programs.
medoc writes
Recoll does clean up its temporary files. Nothing changed in this regard between releases (maybe the new release indexes more stuff which could have made the problem apparent). But some of the filters are not so good at cleaning up, so fast turnaround is better.
Indexing mbox files doesn’t need temp space (attachments to individual messages do), unless the mbox is compressed, but this is very uncommon.
Sorry, I’m not changing this, the risk of unwanted consequences is too high. If your /tmp is too small, resize it or use the environment variables.
Alternatively, if Fedora has special handling of temporary directories, it will be easy enough to add a patch to the package (one line to change in utils/pathut.cpp). Talk with the packager; he can ask me for the patch if he agrees with the change of default.
I’ll add a configuration variable to the next release (in addition to the env variables) to make things even easier.
rdzidlic writes
Right now I am watching what is happening:
- recollindex found "samsung23g.image.bz2", which is the compressed image of a 32GB MicroSDHC card
- it unpacks the image in /tmp, which will obviously result in a 32+ GB file
I do not think that this is very unusual, so recoll needs to take precautions.
medoc writes
I added .image, .image.gz, .image.bz2, .img.xxx etc. to the skipped suffixes list; this will be in the next release. There is not much more I can do: there will always be files which exceed the temporary space limit once uncompressed (and it’s not even possible to determine the uncompressed size from the compressed file).
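As an illustration of suffix-based skipping, the check can be as small as the sketch below. The suffix tuple is only an example built from the names mentioned above (the ".img.gz" and ".img.bz2" entries are assumptions standing in for ".img.xxx"), and case-insensitive matching is also an assumption; this is not Recoll's actual list or matching code.

```python
# Hypothetical suffix list for illustration only.
SKIPPED_SUFFIXES = (
    ".image", ".image.gz", ".image.bz2",
    ".img", ".img.gz", ".img.bz2",
)


def is_skipped(filename, suffixes=SKIPPED_SUFFIXES):
    """Return True if filename ends with one of the skipped suffixes."""
    # Assumption: compare case-insensitively.
    return filename.lower().endswith(tuple(suffixes))
```

With this list, `is_skipped("samsung23g.image.bz2")` is true, so such a file would never be handed to a decompressor.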
rdzidlic writes
Yes, there is not much that can be done. Some things that come to my mind:
- report the errors when they happen, so people notice easily
- raise awareness, perhaps by adding a GUI option for setting tmpdir which would also mention possible issues
- uncompress the first few blocks to see if there is anything at all that could be indexed?
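The last suggestion could be sketched like this: decompress only a bounded prefix of a .bz2 file into memory and apply a crude test for text before committing to a full unpack. Both function names and the printable-byte threshold are hypothetical, and the heuristic is deliberately simplistic; it is not something Recoll is known to implement.

```python
import bz2


def peek_bz2(path, max_out=64 * 1024):
    """Return at most max_out decompressed bytes from the start of a .bz2 file."""
    with bz2.open(path, "rb") as f:
        # Decompression is incremental, so only the start of the file is read.
        return f.read(max_out)


def looks_like_text(sample, threshold=0.9):
    """Crude heuristic: is the sample mostly printable ASCII or whitespace?"""
    if not sample:
        return False
    printable = sum(1 for b in sample if 32 <= b < 127 or b in (9, 10, 13))
    return printable / len(sample) >= threshold
```

A disk image would usually fail such a test on its first kilobytes, while a compressed mbox or log file would pass, so the expensive unpack into the temporary directory could be avoided for the former.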
medoc writes
I agree that the current code should be improved. I am not reopening the issue because the title is wrong, but improving this is on my todo list.