shankargopal writes

I am using recoll on Puppy Linux (version Dpup 484 beta 4 at the moment, but have used it on Puppy 4.1 and 4.1.2 as well). As I am a Linux user but have to work with all Windows users, I have a very large number of document files that are OpenOffice generated but in Word format. This means that antiword is not a useful indexer for me, as it skips many of my files because of the "this text stream is too small" issue. Therefore, I have had to switch to wvWare. wvWare, on the other hand, is much too slow, and recollindex with wvWare ends up taking up most of my CPU time and memory when indexing. Would it be possible therefore to introduce some code into the indexer to use antiword and to fall back to wvWare if it fails?

Many thanks!

medoc writes

By default, recoll now executes antiword directly. It used to do so through the rcldoc script, which is slightly slower (additional execs). The rcldoc script does try to use wvWare if antiword fails, and it is still delivered with the recoll installation. So the only thing you need to do is to customize the mimeconf file.

Edit ~/.recoll/mimeconf, and add the following:

[index]
application/msword = exec rcldoc

This should solve your problem. Contact me or add a comment if it doesn’t.

shankargopal writes

I have enabled rcldoc and installed wvWare on my system, so it is now being used. But though antiword is also available, it seems to be using wvWare for all documents, and hence the slowdown. Does rcldoc try antiword first? For instance I know wvWare is running even on some very large documents (created in Microsoft Word), which antiword presumably should not have a problem with.

medoc writes

rcldoc does try to use antiword first if it is available, then wvWare only if the exit status from antiword is non-zero. You can check this by executing rcldoc by hand (add traces in there if necessary, it is a trivial shell script). I just checked and, on my system, it does what it is supposed to. jf
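
For readers who want to see the shape of that logic, here is a minimal sketch of a "try antiword, fall back to wvWare on a non-zero exit status" function. It is not the actual rcldoc code: the function name convert_doc and the PRIMARY and FALLBACK variables are made up for illustration, and the real converter options are left out.

```shell
#!/bin/sh
# Illustrative sketch only, not the real rcldoc script.
# PRIMARY and FALLBACK are hypothetical variables naming the two converters.
PRIMARY=${PRIMARY:-antiword}
FALLBACK=${FALLBACK:-wvWare}

convert_doc() {
    infile=$1
    # Try the fast converter first and keep its output only if it exits with 0.
    # A missing converter also yields a non-zero status, so the fallback runs.
    if out=$("$PRIMARY" "$infile" 2>/dev/null); then
        printf '%s\n' "$out"
        return 0
    fi
    # Non-zero exit status: hand the same file to the slower converter.
    "$FALLBACK" "$infile"
}
```

Calling convert_doc "some file.doc" would then run the slower converter only for the files the first one rejects.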

shankargopal writes

Hi,

I’m back on this again, and sorry for again changing it to an open issue, but I think I’ve found the problem. Am now using recoll on a persistent install of Debian Squeeze Live. This is recoll version 1.13.04.

I noticed rcldoc was not falling back to wvWare on many of my OO-generated Word files. Changing this line (line 150):

wvWare --nographics --charset=utf-8 $infile

to

wvWare --nographics --charset=utf-8 "$infile"

did the trick - i.e. the problem was because the file name had spaces in it. I am a somewhat inexperienced Bash scripter, so hope this change would not bring any problems. Attaching a sample file for testing.
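
The effect of the missing quotes can be reproduced with a few lines of shell. This is only a demonstration of word splitting, assuming a POSIX shell with the default IFS. The count_args helper and the file name are invented for the example.

```shell
#!/bin/sh
# count_args is a hypothetical helper: it prints how many arguments it received.
count_args() { echo "$#"; }

infile="my report.doc"

# Unquoted: the shell splits the value on the space, so the command sees two arguments.
unquoted=$(count_args $infile)
# Quoted: the value is passed through as a single argument.
quoted=$(count_args "$infile")

echo "unquoted: $unquoted, quoted: $quoted"
```

With the unquoted form, a converter would be asked to open "my" and "report.doc" instead of the actual file.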

medoc writes

Thanks a lot for spending time on studying this issue and reopening it. Your fix is correct and is already included in recoll 1.14 (all file names were quoted in filter scripts after finding a similar bug in another filter). It was very bad of me not to quote those file names, I should know better.

So you can keep your current fix for 1.13, and things should keep working when/if you upgrade to 1.14.

Regards, jf