Unknown reporter writes

Hi!

Thanks for the great software. I’m trying it out and I’ve run into a problem: PDF indexing works great on a "normal" PDF, e.g. one exported from an Office app … however, not on a file scanned with OCR (from a Canon imageRUNNER). I can open the file and highlight/copy text (and paste it elsewhere) … so I think the scan/OCR is successful … however, the file does not get found if I search for words within the PDF. It does get found based on the title. Attached is a log of a command-line index run on the problem file.

Any thoughts?

Thanks! John

medoc writes

Hi,

What probably happens is that the OCR stores the text in a PDF field which is not extracted by the Recoll converter for PDF files. Could you please attach a sample file so that I can take a look at how to get to the OCR’d text?

Cheers,

jf

medoc writes

no feedback

ms007 writes

Same problem here, so jumping in.

Please look at the attached document.

Cheers, Monika

ms007 writes

demo.pdf

medoc writes

Hi,

What recoll and poppler versions are you using? This works for me using either recoll 1.19 or recoll 1.20, and poppler-utils 0.24.5.

ms007 writes

Hello,

I am using poppler-utils 0.24.5-2ubuntu4 and recoll 1.17.3-2 on Xubuntu 14.04 LTS.

ms007 writes

Hi again,

Short answer: It works!

Longer answer: I upgraded to recoll 1.19.14p1-1ppa2trusty1. That did not fix the problem. Then I realized that I had put the file demo.pdf in the directory ~/docs/ms/tmp/. Moving the file out of this temporary directory solved the problem.

Thank you.

medoc writes

You are welcome. It’s not clear that having tmp in skippedNames by default is such a great idea: it’s a good thing in some cases, and in others … less so.
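
For anyone hitting the same symptom, here is a minimal Python sketch of why a file under a tmp directory can silently stay out of the index. It is not Recoll's actual code: it only assumes that skippedNames is a list of simple wildcard patterns matched against individual file and directory names rather than whole paths. The pattern list and the function name `skipping_component` are hypothetical, chosen for illustration; the real defaults depend on the Recoll version and configuration.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath

# Hypothetical pattern list for illustration only; check your own
# configuration for the actual skippedNames value.
SKIPPED_NAMES = ["tmp", "*.bak"]


def skipping_component(path, patterns=SKIPPED_NAMES):
    """Return the first path component that matches a skip pattern, or None.

    Each directory or file name is tested on its own, so a single
    matching directory name hides everything beneath it.
    """
    for part in PurePosixPath(path).parts:
        for pattern in patterns:
            if fnmatchcase(part, pattern):
                return part
    return None


print(skipping_component("~/docs/ms/tmp/demo.pdf"))  # tmp
print(skipping_component("~/docs/ms/demo.pdf"))      # None
```

Under these assumptions, demo.pdf is skipped in the first location because of the directory name alone, regardless of the PDF's contents, which matches what was observed in this thread.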