Unknown reporter writes
Recoll 1.14.2, from source (initially from the opensuse RPM from the recoll website, with the same result); Xapian 1.2.3 from opensuse buildservice. Linux x86 (opensuse 11.3).
Starting with an empty .recoll, attempting to index crashes with:
[1] 19295 illegal hardware instruction (core dumped) ./index/recollindex -z
(gdb) bt
#0 0x08107f14 in next (this=0xbffe6aa0, internal_=0x82df968) at ./common/postlist.h:187
#1 Xapian::PostingIterator::PostingIterator (this=0xbffe6aa0, internal_=0x82df968) at api/ompostlistiterator.cc:36
#2 0x080f8d1a in Xapian::Database::postlist_begin (this=0x82b8528, tname="Q/home/test/tmp/rt|") at api/omdatabase.cc:147
#3 0x0806ad5b in Rcl::Db::needUpdate (this=0x82b4f98, udi="/home/test/tmp/rt|", sig="581288084046") at ../rcldb/rcldb.cpp:1161
#4 0x080a6820 in FsIndexer::processone (this=0x82df1c0, fn="/home/test/tmp/rt", stp=0xbffe6edc, flg=FsTreeWalker::FtwDirEnter) at ../index/fsindexer.cpp:359
#5 0x0808fc06 in FsTreeWalker::iwalk (this=0x82df1c4, top="/home/test/tmp/rt", stp=0xbffe6edc, cb=…) at ../utils/fstreewalk.cpp:284
#6 0x08090bfb in FsTreeWalker::walk (this=0x82df1c4, _top="/home/test/tmp/rt", cb=…) at ../utils/fstreewalk.cpp:198
#7 0x080a5594 in FsIndexer::index (this=0x82df1c0) at ../index/fsindexer.cpp:119
#8 0x08065e46 in ConfIndexer::index (this=0x82b4f90, resetbefore=true, typestorun=ConfIndexer::IxTAll) at ../index/indexer.cpp:67
#9 0x0804eaad in main (argc=0, argv=<value optimized out>) at recollindex.cpp:360
It looks like Xapian may be holding an invalid iterator, although "illegal hardware instruction" isn’t the usual result of that.
medoc writes
Hello, is there anything special about the "tmp/tests/rt" file? If so, can I get a copy? This is going to be difficult to cure if I can’t reproduce it… The most probable reason for the "illegal instruction" thing is that Recoll has been trashing the stack. Also, exactly which xapian-core package are you using? There seem to be myriads of them on the build service.
Unknown User writes
Oddly, it doesn’t matter what’s in the to-be-indexed directory. I started with a collection of various files, switched to a single PDF, then an empty directory. Same result each time.
The Xapian packages I used were the ones from: http://download.opensuse.org/repositories/KDE:/Extra/openSUSE_11.3/i586/
though, as noted above, the opensuse 11.2 RPM from the Recoll download page (http://www.lesbonscomptes.com/recoll/suse11.2/recoll-1.14.2-0.i586.rpm) exhibits the same behaviour and is built statically (presumably with a good Xapian version). I only installed Xapian afterwards, in order to get a meaningful stack trace.
I understand this is probably not an easy bug to find, especially as it isn’t even triggering in the Recoll code. If you can’t reproduce it, let me know if there are any additional tests I can do here.
Sorry about the formatting on the initial report; I obviously missed a trick there.
medoc writes
Thanks for adding more detail. After rereading your message last night, I noticed that my idea of Recoll trashing the stack was not too bright given that you included a stack trace… and also that the stack trace indicated that the mystery file was a directory, so that I did not need a copy.
So if I understand correctly this time, you get an illegal instruction abort on every recollindex run, starting with an empty .recoll (except for recoll.conf, I guess) and independent of what you are trying to index? And you can reproduce this either with the 11.2 rpm or with a locally compiled recoll 1.14.2 and xapian 1.2.3?
I’ve tried to reproduce the issue on a vanilla, up-to-date openSUSE 11.3, using only the standard default repositories, either with the downloaded 11.2 rpm or with a local build, and I can’t.
This is really mysterious, especially because recollindex does not depend on too much external software: I ran ldd on the recollindex from the rpm (statically linked with xapian as you noticed): {{{
ldd recollindex
linux-gate.so.1 => (0xffffe000)
libSM.so.6 => /usr/lib/libSM.so.6 (0xb77cb000)
libICE.so.6 => /usr/lib/libICE.so.6 (0xb77b0000)
libX11.so.6 => /usr/lib/libX11.so.6 (0xb7675000)
libpthread.so.0 => /lib/libpthread.so.0 (0xb765a000)
libdl.so.2 => /lib/libdl.so.2 (0xb7654000)
libstdc++.so.6 => /usr/lib/libstdc++.so.6 (0xb7564000)
libm.so.6 => /lib/libm.so.6 (0xb753a000)
libgcc_s.so.1 => /lib/libgcc_s.so.1 (0xb751c000)
libc.so.6 => /lib/libc.so.6 (0xb73b1000)
libuuid.so.1 => /lib/libuuid.so.1 (0xb73ab000)
libxcb.so.1 => /usr/lib/libxcb.so.1 (0xb738a000)
/lib/ld-linux.so.2 (0xb77f2000)
libXau.so.6 => /usr/lib/libXau.so.6 (0xb7386000)
}}}
Another point: when you recompiled xapian and recoll, did you make sure that you had only one version on the system, either by configuring with --prefix=/usr or by removing the rpm? By default you’d end up with two versions, one in /usr/local and one in /usr.
I must say that I am a bit at a loss here…
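The "two versions on the system" question above can be checked mechanically. The following is only a minimal sketch: the directory list and the helper names (`find_library_copies`, `installed_in_several_prefixes`) are assumptions made for illustration, not part of Recoll or Xapian, and the check only matters for dynamically linked builds (the rpm discussed here links Xapian statically).

```python
from pathlib import Path

# Directories searched by default. This list is an assumption; adjust it
# for your distribution and architecture.
DEFAULT_LIB_DIRS = ("/usr/lib", "/usr/lib64", "/usr/local/lib", "/usr/local/lib64")


def find_library_copies(stem, lib_dirs=DEFAULT_LIB_DIRS):
    """Return a sorted list of files named like '<stem>.so*' found in lib_dirs."""
    found = []
    for directory in lib_dirs:
        path = Path(directory)
        if path.is_dir():
            found.extend(str(p) for p in path.glob(stem + ".so*"))
    return sorted(found)


def installed_in_several_prefixes(stem, lib_dirs=DEFAULT_LIB_DIRS):
    """Return True when copies of the library live in more than one directory."""
    parents = {str(Path(p).parent) for p in find_library_copies(stem, lib_dirs)}
    return len(parents) > 1


if __name__ == "__main__":
    # List every libxapian shared object found in the default directories.
    for copy in find_library_copies("libxapian"):
        print(copy)
    if installed_in_several_prefixes("libxapian"):
        print("warning: libxapian is installed under more than one prefix")
```

If the script reports copies under both /usr and /usr/local, removing one of them (or rebuilding with a single prefix) rules out a mixed install.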
Unknown User writes
Re paragraph 2: correct. Additionally, running the recoll GUI works until it tries to do something with the database. The stack traces are slightly different in a recoll GUI crash (I can post one if you like), but still look like they’re occurring in Xapian container/iterator code. Updates crash in "next", while a search crashes in Xapian::MSet::MSet.
Re compilation: built Recoll as ./configure --prefix=/usr && make static. Did not build Xapian, just used the binary RPMs from OBS. My ldd output differs from yours only in the load addresses.
Is there a reasonably easy way to exercise Xapian outside of Recoll? I notice that the xapian-core RPM has a bunch of binaries, but I’ve no idea what they do; perhaps I can convince one of them to issue an illegal hardware instruction.
If you’ve no other ideas, my plan is to try a 1.0.x release of Xapian, then some other machines, then perhaps an older version of Recoll. I guess I can also try stepping through the Xapian code, but as I’ve no idea what any of it should be doing, it would probably take a while to get anywhere with it.
medoc writes
What is your hardware architecture?
I’ve really no idea what’s happening here, but I think that at this point, you should try to remove all xapian and recoll rpms and try to build everything from scratch (inside clean tar extracts) to see what happens. Same procedure for xapian and recoll:
{{{
./configure --prefix=/usr
make
sudo make install
}}}
I also downloaded the xapian library rpm from the link above (KDE) and rebuilt recoll against it: works OK, no problem!
Woah, this is a weird one! Xapian is usually rock-solid, so there must be something obvious that we are missing. It has to be something with the build or the environment.
Unknown User writes
Arch is x86-32, specifically an AMD K7.
Rebuilt Xapian, then Recoll from tarballs: same result. Since both OBS RPMs and locally compiled copies have the same problem, it’s probably not a build environment problem. It could be runtime, of course. As you noted, Xapian and Recoll don’t have any exotic dependencies, so if libz or libstdc++ were corrupted on disk, I probably would have noticed by now.
Still, I should try to reproduce this on another machine; unfortunately I won’t be able to do so for a few days, and there may not be any OS 11.3 32 bit machines around at this time.
Also tried Xapian 1.0.21, with the same result (well, I didn’t actually look at the stack trace). Also the same with Recoll 1.13.04 with either Xapian.
Unknown User writes
I was able to find a couple of machines to test on, after all. Both are OS 11.2. On one, indexing completed, but any search causes SIGILL. This is another K7, though the software environment is obviously somewhat different.
The other 11.2 machine worked fine. It’s some sort of K8, but is running 32-bit SuSE anyway. I think needing an x86-64 chip running 32-bit linux is not a likely requirement for running Xapian, and I doubt we’ve found a new erratum for Athlon XPs after ten years, so….
These were all from the 11.2 RPM, since building it from scratch does not seem to matter. I am somewhat out of ideas that take fewer than some hours.
medoc writes
Hi, I think I know what’s happening: recent versions of Xapian are compiled with a flag enabling SSE2 instructions, which do not exist on older x86 processors (see http://trac.xapian.org/wiki/PackagingXapian for slightly more detailed explanations).
Please try to configure xapian with the --disable-sse flag and rebuild everything. I tested this on an older machine (a Duron), and I do get the SIGILL problem, and it does go away with --disable-sse.
Sorry I did not think of this sooner: I’d seen an email about the issue some time ago and should have acted on it. I am going to either recompile the packages on the web site or at least add a warning.
Thanks for your patience in spending your time with me to clear up this issue (hopefully :))
Cheers, jf
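Whether a given machine is affected can be checked before rebuilding anything. The sketch below assumes Linux on x86, where /proc/cpuinfo carries a "flags" line listing CPU features. The function names `cpu_flags` and `supports_sse2` are hypothetical helpers written for this example.

```python
def cpu_flags(cpuinfo_text):
    """Return the set of feature flags from the first 'flags' line of cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    # No 'flags' line found (for example on a non-x86 system).
    return set()


def supports_sse2(cpuinfo_text):
    """Return True when 'sse2' appears as a whole word in the CPU flags."""
    return "sse2" in cpu_flags(cpuinfo_text)


if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as handle:
            text = handle.read()
    except OSError:
        # /proc/cpuinfo is Linux-specific; treat its absence as "unknown".
        text = ""
    print("sse2 supported:", supports_sse2(text))
```

The flags are compared as whole words, so "sse" or "ssse3" alone does not count as SSE2. A CPU that reports no sse2 flag needs a Xapian configured with --disable-sse, as described above.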
Unknown User writes
That did it (--disable-sse on Xapian). I’m glad you remembered, because I doubt I would have thought of it. I’d totally forgotten that the K7s didn’t implement SSE2, and I would never have expected an indexing/search library to require it by default.
IMO this is a very poor decision on Xapian’s part: if they really want SSE2 enough to enable it by default (thereby breaking a lot of extant machines in a very mysterious way), they need to write an autoconf test for it or something.
In any case, thanks for your help. I now have a much less ancient version of Recoll working on my (apparently) ancient CPU.
medoc writes
Glad it’s working now.
While I agree that the great Xapian people may have been a bit aggressive on this one, it’s mainly my responsibility as builder/packager to create compatible binaries or put a warning (both of which I am going to do). This problem is described on the Xapian packaging page, so, my bad.
And yes, the xapian configure should probably print a warning when run without --disable-sse on an incompatible CPU, but this would not solve the cross-compile issue anyway.
Thanks again for helping me clear up and fix this!
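The warning discussed in the last two posts could follow logic along these lines. This is a hypothetical sketch, not Xapian’s actual configure code: the function name `sse_warning`, its parameters, and the message texts are all invented for illustration. Only the --disable-sse flag comes from the thread.

```python
def sse_warning(configure_args, build_cpu_flags, cross_compiling=False):
    """Return a warning string, or None when no warning is needed.

    configure_args   -- list of arguments passed to configure
    build_cpu_flags  -- set of CPU feature flags of the build machine
    cross_compiling  -- True when the target CPU differs from the build CPU
    """
    if "--disable-sse" in configure_args:
        # SSE was explicitly disabled, so the result runs on older CPUs.
        return None
    if cross_compiling:
        # The build machine's flags say nothing about the target CPU.
        return ("cross-compiling: cannot tell whether the target CPU has "
                "SSE2; consider --disable-sse")
    if "sse2" not in build_cpu_flags:
        return ("this CPU lacks SSE2; binaries built without --disable-sse "
                "may die with SIGILL")
    return None


if __name__ == "__main__":
    # Example: a K7-class machine (SSE but no SSE2) building with defaults.
    print(sse_warning(["--prefix=/usr"], {"mmx", "sse", "3dnow"}))
```

As noted above, such a check only helps when the build machine is also the machine that will run the code. A packager building on an SSE2-capable host for older CPUs would see no warning, which is why the compatible binaries and the warning on the download page are still needed.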