jedick writes

I’m using Recoll 1.21.0 and find it to be working great. I recently noticed a change in the behavior of the recollq command: it stopped matching terms in the title that contain an apostrophe ('), while the GUI search still gave results. I traced the change (possibly) to a new version of poppler installed on my system (poppler-0.32).

Does the recent version of poppler fix the bug concerning character entities in HTML?

Now when I run pdftotext -htmlmeta on a test document I see the character entities in the title field, ", ', etc.
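To illustrate why undecoded entities in a title field matter for term matching, here is a minimal sketch. It assumes the title arrives with named entities such as `&apos;` and `&quot;` (the exact entities emitted by a given poppler version are an assumption here), and `decode_title` is a hypothetical helper, not part of Recoll.

```python
import html


def decode_title(raw_title: str) -> str:
    """Turn HTML character entities in a title back into plain characters."""
    return html.unescape(raw_title)


# Hypothetical title text, as it might appear inside the <title> element
# of `pdftotext -htmlmeta` output. The entity names are an assumption.
raw = "The author&apos;s &quot;notes&quot;"
decoded = decode_title(raw)

# Without decoding, a search term containing a real apostrophe cannot
# match, because the stored text holds "&apos;" instead of "'".
print("author's" in raw)      # False
print("author's" in decoded)  # True
```

If the indexed title keeps the raw entity, a query term with a literal apostrophe has nothing to match against, which would be consistent with the symptom described above.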

medoc writes

poppler is only involved in producing index data, not at query time, so it is unlikely that it would be involved in this.

Are you sure of your command line quoting?

Please supply the Xapian queries produced by both approaches (recollq prints the query before the results, and you can use the show query link in the GUI).

jedick writes

I should have mentioned that I did perform a re-index after upgrading Recoll to 1.21.0; that was some time after applying the poppler update.

To test, I have created a fresh index for a single document with the title containing the word Boltzmann’s. Actually, it looks like the searches in the GUI or CLI are giving the same results, and do not find the word with the apostrophe.

GUI search title:Boltzmann’s [results: 0 query: (Sboltzmann’s:(wqf=11))]

GUI search title:boltzmann’s [results: 1 query: Sboltzmann OR Sboltzmann’s:(wqf=10)]

The title shown in the result list contains the text Boltzmann's

At the command line


$ recollq title:Boltzmann\'s
Recoll query: (Sboltzmann's:(wqf=11))
0 results
$ recollq title:boltzmann\'s
Recoll query: ((Sboltzmann OR Sboltzmann's:(wqf=10)))
1 results

(and Boltzmann's is in the text of the results)
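On the quoting question, one way to check what argument recollq receives is to tokenize the command line with Python's `shlex`. This is only a sketch: `shlex` follows POSIX-like rules and approximates, rather than reproduces, the behavior of any particular shell.

```python
import shlex

# The command as typed at the prompt, with the apostrophe backslash-escaped.
command = r"recollq title:Boltzmann\'s"

# shlex.split applies POSIX-style quoting rules: a backslash outside
# quotes escapes the next character and is then dropped.
argv = shlex.split(command)
print(argv)  # ['recollq', "title:Boltzmann's"]
```

Under these rules the backslash is consumed and the program sees a plain ASCII apostrophe, which agrees with the `Sboltzmann's` term in the query printed by recollq.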

EDITED: clean up formatting GUI and CLI searches

EDIT 2: format Boltzmann's as code, not HTML (that shows Boltzmann’s, which is not what I see on computer screen)

medoc writes

Do you think that it would be possible for you to send the test document to me: jf at ?

jedick writes

OK, just sent it.

medoc writes

Solved by a new rclpdf filter, which detects the poppler version and acts accordingly.
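The thread does not show how the new filter does this. As a rough sketch of the general idea only, the following parses a version banner and decides whether title entities should be decoded. The banner format, the function names, and above all the `ASSUMED_ESCAPING_SINCE` threshold are assumptions for illustration, not taken from rclpdf.

```python
import html
import re
from typing import Optional, Tuple


def parse_poppler_version(banner: str) -> Optional[Tuple[int, ...]]:
    """Extract a numeric version tuple from banner text such as the
    output of `pdftotext -v` (format assumed: '... version X.Y.Z')."""
    match = re.search(r"version\s+(\d+(?:\.\d+)*)", banner)
    if match is None:
        return None
    return tuple(int(part) for part in match.group(1).split("."))


# HYPOTHETICAL threshold: the thread only says the behavior was seen
# with poppler 0.32, not which release introduced it.
ASSUMED_ESCAPING_SINCE = (0, 32)


def title_needs_unescape(version: Optional[Tuple[int, ...]],
                         threshold: Tuple[int, ...] = ASSUMED_ESCAPING_SINCE) -> bool:
    """Return True when the detected version is at or past the threshold."""
    if version is None:
        return False  # unknown version: leave the title untouched
    return version >= threshold


def clean_title(raw_title: str, banner: str) -> str:
    """Decode entities in the title only for versions assumed to emit them."""
    if title_needs_unescape(parse_poppler_version(banner)):
        return html.unescape(raw_title)
    return raw_title


print(clean_title("Boltzmann&apos;s", "pdftotext version 0.32.0"))
```

Keeping the parsing and the decision in pure functions means they can be checked without poppler installed. A real filter would obtain the banner by running the tool itself.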