837183 writes

Hi Medoc, what’s up?..

For some reason, PDF documents that were indexed in the past are no longer indexed when I rebuild the index.

The only thing I did was run apt-get update ; apt-get upgrade on this PC (Debian Jessie).

As far as I’m aware, this installed a newer version of recoll (1.23.2), because I had the recoll repository listed, and it also installed Python 3.

python --version

reads Python 2.7.9

and

python3 --version

reads Python 3.4.2

It seems related to not recognizing the correct MIME type for PDFs?

medoc writes

There is a new PDF handler in 1.23.2, but it’s not obvious from the log that this is the cause.

What happens if you try to execute the handler on the command line?

/usr/share/recoll/filters/rclpdf.py -d /path/to/doc.pdf | more

837183 writes

Well, the handler works: without the pipe to "more", the whole book is displayed.

medoc writes

Looking at the log there are several very weird things in there, which could probably be explained by a weird config.

Are you using an old configuration directory ? If this is the case, could you please restart the test on an empty one ?

837183 writes

Well, one of these was the offender:

thrQSizes = -1 -1 -1
noaspell = 1
indexedmimetypes = application/pdf
indexallfilenames = 0
indexstemminglanguages =

I started a new recoll.conf without these and all is well! (everything is indexed)

medoc writes

I see. The issue would be indexedmimetypes = application/pdf

This is a bug: the setting prevents processing of the internal text/html version of the PDF (which is then turned into the text/plain that is finally indexed). The code needs to discriminate between text/html found on disk (which you may not want to index) and text/html used as an internal intermediate format. The previous version did not have this issue; I need to see why, and fix the new version.

Meanwhile, I hope that you can live without the indexedmimetypes setting. Congratulations, you found (one more) recoll bug :)
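To make the mechanism concrete, here is a small Python model of the behaviour described above. It is only an illustration and not Recoll's actual code: the HANDLER_OUTPUT table, the is_indexed function, and the filter_intermediate flag are all hypothetical names invented for this sketch. It assumes the PDF handler produces text/html, which is then converted to text/plain.

```python
# Hypothetical model of the conversion chain; NOT Recoll's real implementation.
# Assumption: PDF -> text/html (handler output) -> text/plain (what gets indexed).
HANDLER_OUTPUT = {
    "application/pdf": "text/html",
    "text/html": "text/plain",
}


def is_indexed(disk_mime, indexed_mimetypes, filter_intermediate):
    """Return True if a file with MIME type disk_mime ends up indexed.

    indexed_mimetypes: the restriction list (empty means "no restriction").
    filter_intermediate: True models the buggy behaviour, where the list is
    also applied to internal intermediate formats.
    """
    allowed = set(indexed_mimetypes)

    def permitted(mime):
        return not allowed or mime in allowed

    # The restriction legitimately applies to the type found on disk.
    if not permitted(disk_mime):
        return False

    mime = disk_mime
    while mime != "text/plain":
        nxt = HANDLER_OUTPUT.get(mime)
        if nxt is None:
            return False  # no handler known for this type in the model
        # Buggy variant: an intermediate format that still needs conversion
        # is checked against the list too, and the chain stops here.
        if filter_intermediate and nxt != "text/plain" and not permitted(nxt):
            return False
        mime = nxt
    return True


if __name__ == "__main__":
    only_pdf = ["application/pdf"]
    print("buggy:", is_indexed("application/pdf", only_pdf, True))
    print("fixed:", is_indexed("application/pdf", only_pdf, False))
```

In this model, an on-disk text/html file is still rejected when only application/pdf is listed, while the intermediate text/html produced from a PDF passes once the check is limited to on-disk types.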

medoc writes

This is now fixed in development code and will be in the next release.

A relatively good workaround would be to just add text/html to indexedmimetypes (and possibly use suffix exclusion if you really do not want to index any HTML files).
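As a rough sketch, the workaround could look like this in recoll.conf. The first line follows directly from the suggestion above. The commented-out line is an assumption about what "suffix exclusion" might mean here: skippedNames is a real Recoll setting, but whether the "+" form is available depends on your version, so check the manual before relying on it.

```
# Let the internal text/html intermediate format through (workaround)
indexedmimetypes = application/pdf text/html

# Assumption: optionally keep on-disk HTML files out of the index.
# skippedNames+ = *.html *.htm
```

After changing the configuration, the index needs to be rebuilt for the PDFs to be picked up again.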

837183 writes

Huh, that’s interesting :) I will use the workaround in the meantime. Thanks for both the explanation and the workaround!