jedick writes

I’ve encountered an indexing problem with some PDFs that appears in the 1.20.0 preview version of Recoll, running with Xapian 1.2.18. It seems to be a rare occurrence, in about 1 out of every 200 PDF files that I have tested.

Here are the DOI links for two journal articles with PDFs that trigger this problem (these are open-access articles and can be downloaded for free from the journal websites):

Both of these PDFs are indexed successfully in Recoll 1.19.14 using the default rclpdf filter.

With Recoll version 1.20.0 preview (compiled from the source tar.gz downloaded from the website), building a fresh index using the default rclpdf, only the first PDF [1] above is indexed successfully (shown as docid 2) and indexing fails on the second PDF [2], with an error:

:3:recollindex.cpp:404:recollindex: changing current directory to [/tmp]
:3:recollindex.cpp:425:recollindex: starting up
:3:../rcldb/rcldb.cpp:606:Db::add: docid 1 added [/home/jedick/pdf/test|]
:3:../rcldb/rcldb.cpp:606:Db::add: docid 2 added [/home/jedick/pdf/test/journal.pone.0019538.pdf|]
terminate called after throwing an instance of 'std::out_of_range'
  what():  basic_string::substr

In addition, if I uncomment the "optionraw" option in the rclpdf filter, the error is produced for either PDF [1] or [2] using 1.20.0 preview. However, in 1.19.14, the indexing is still successful if the "optionraw" option is uncommented.
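For context on the message itself: `std::string::substr(pos, len)` throws `std::out_of_range` when `pos` is greater than the string's length, and an exception that nobody catches ends the program through `std::terminate`, which is where the "terminate called after throwing an instance of" line comes from. The sketch below only illustrates that standard library behaviour. The helper names `substr_throws` and `safe_substr` are made up for this example and are not part of Recoll.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical helper: reports whether s.substr(pos, len) throws
// std::out_of_range. The standard says it does when pos > s.size().
bool substr_throws(const std::string& s, std::string::size_type pos,
                   std::string::size_type len) {
    try {
        (void)s.substr(pos, len);
        return false;
    } catch (const std::out_of_range&) {
        return true;
    }
}

// Hypothetical defensive variant: returns an empty string instead of
// throwing when the start position is past the end of the string.
std::string safe_substr(const std::string& s, std::string::size_type pos,
                        std::string::size_type len) {
    if (pos > s.size()) {
        return std::string();
    }
    // A len that runs past the end is clamped by substr itself.
    return s.substr(pos, len);
}
```

With a three-character string, a start position of 3 is still valid and gives an empty result, while a start position of 4 throws. An over-long `len` never throws on its own.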

medoc writes

Hi,

Thank you for reporting this. I tried to reproduce the issue, but I failed. Maybe this is dependent on the exact system and utility versions (I tried on Ubuntu 14.04, which has Xapian 1.2.16). On what system are you running this? I’ll try to install the same in a VM.

Cheers,

jf

jedick writes

Thanks for the response.

I am using Slackware64-current, with Xapian and Recoll compiled using the slackbuilds.org scripts - edited to use later versions (Xapian 1.2.18 and Recoll 1.20.0). Slackware64-current has pdftotext from poppler 0.24.3. I just compiled and installed poppler 0.26.3 on my machine and now the problem with the above two PDFs [1] and [2] has disappeared - on building a fresh index using the default rclpdf filter they are indexed successfully.

However, my current setup still fails to process some PDFs. Here is an example:

recollindex produces the same error:

terminate called after throwing an instance of 'std::out_of_range'
  what():  basic_string::substr

I also tried this with a lower version of Xapian (1.2.16) and get the same results. So the problem seems to be sensitive to the version of pdftotext, but the upgrade to poppler 0.26.3 does not completely fix the problem.

medoc writes

Hi,

I tried to switch to poppler 0.24.3, and still no luck.

Could you please attach the output from rclpdf on one of the affected files (with the poppler version which causes the problem)? Maybe I will see something weird in there.

Otherwise, we’ll have to look at a recollindex stack trace. There are some instructions on how to get one here:

jedick writes

I reverted to poppler 0.24.3 because that version is needed by other programs I use. In the first PDF above [1], I found that the problem is caused by something on page 2. The rclpdf output for that page is attached as rclpdf.txt. The other file (rclpdf-raw.txt) is the rclpdf output with the "optionraw" option uncommented — in this case, recollindex crashes. The gdb stack trace is also attached.

medoc writes

I installed Slackware 14.1 64-current in a VM, then built and installed xapian 1.2.18 and recoll 1.20.0 freshly downloaded from the web sites. I could not reproduce the problem on any of the above files, with or without optionraw.

I am beginning to wonder if I might have made a mistake at some point and uploaded 2 different 1.20.0 files over time (this would be very bad of me), because there was a similar bug fixed in the main branch a few months ago. The VERSION file would have contained 1.20.0 in both cases, so this would not be detectable from recollindex -v.

This would be detectable from the tar file checksum though, so you could download http://www.lesbonscomptes.com/recoll/recoll-1.20.0.tar.gz and check this.

Also make sure that you do not have two versions installed, for example in /usr and /usr/local.

And lastly, although this does not make much sense because I did test with 1.20.0, you could try to download and build the current 1.20.0p1. At least it will report a different version number, and we will be sure of what is running.

I am a bit short of ideas at this point, which is why I suggest the desperation checks…

jedick writes

Thank you very much for trying the installation on the VM. Slackware current has glibc 2.19 and gcc 4.8.3; just to check, I installed the older versions from Slackware 14.1 (glibc 2.17 and gcc 4.8.2), recompiled Xapian and Recoll, and found no difference. I also downloaded Recoll 1.20.0p1 and still have the problem. I have checked to make sure that Recoll is installed in /usr and that any previous version in /usr/local has been removed.

Previously I accidentally left an -O2 in the compilation (coming from the Slackbuild script for Recoll); this time I made a gdb stack trace on Recoll compiled without the -O2 (gdb-noO2.txt). I turned on the "cerr" code at the top of TextSplit::words_from_span, and the error output is copied in the attached file (recollindex.txt). Maybe this will help; at least it shows where in the PDF the error happens, at the end of equation 1. This is still using page 2 from PDF [1] above, processed with optionraw. I can also generate the error just by indexing the output from rclpdf (rclpdf-raw.txt).

medoc writes

The missing detail: you are building with --enable-camelcase… I can reproduce the issue now, I’ll look into it.

jedick writes

Sorry about the missing information; my use of --enable-camelcase was quite unintentional and unnoticed by me. The SlackBuild script for Recoll from slackbuilds.org has a faulty test so that --enable-camelcase is used even if the ENABLE_CAMELCASE option for the script is set to "NO" (the default, which I did not change). I’ve sent a message about it to the Slackbuild-users list.

I built Recoll without --enable-camelcase and found that it successfully indexes the PDFs that previously gave errors. I’ll go back to --enable-camelcase to further test this issue.

medoc writes

Yeah, I think that splitting camelcase is mostly a bad idea, which is why it’s only a compile-time option. Don’t bother going back to it unless you actually need it. I have enough data for testing: the following string will crash the splitter:

fATP~YLfLzYMfMzYFAfFA: ð1Þ

and I still have my collection of pdfs for further testing :)
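For readers unfamiliar with the option, camelCase splitting means cutting a span of text wherever a lowercase letter is directly followed by an uppercase one, and the substring arithmetic involved is the kind of place where an out-of-range `substr` call can creep in. The thread does not show the faulty code, so the following is only a minimal sketch under stated assumptions: `split_camelcase` is a hypothetical function, not Recoll's `TextSplit` code, and it looks at ASCII letters only, treating any byte of 0x80 or above (such as the UTF-8 bytes of "ð" and "Þ") as a non-boundary.

```cpp
#include <cctype>
#include <string>
#include <vector>

// Hypothetical ASCII-only camelCase splitter (not Recoll's implementation).
// Cuts the span before every uppercase letter that directly follows a
// lowercase letter. Every substr call uses start <= i <= span.size(),
// so the start position can never be out of range.
std::vector<std::string> split_camelcase(const std::string& span) {
    std::vector<std::string> words;
    std::string::size_type start = 0;
    for (std::string::size_type i = 1; i < span.size(); ++i) {
        unsigned char prev = static_cast<unsigned char>(span[i - 1]);
        unsigned char cur = static_cast<unsigned char>(span[i]);
        // Skip non-ASCII bytes so multi-byte UTF-8 characters stay intact.
        if (prev < 0x80 && cur < 0x80 &&
            std::islower(prev) && std::isupper(cur)) {
            words.push_back(span.substr(start, i - start));
            start = i;
        }
    }
    // Emit the trailing piece, if any (an empty span yields no words).
    if (start < span.size()) {
        words.push_back(span.substr(start));
    }
    return words;
}
```

Called on "camelCase" this returns "camel" and "Case". Called on the ASCII part of the string above, "fATP~YLfLzYMfMzYFAfFA", it returns seven pieces, starting with "f" and ending with "FA", without throwing.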

medoc writes