aperezmendez writes

Hi,

I’ve found a bug in happens when you look for a file by filename, and there is a Document not found. It might happen in other case, though, I haven’t tested.

At this moment, when I look for any document containing "pdf" in the filename, I get like lots of results. However, if I try to sort them by date, it gives me no results.

Log output says that :2:../rcldb/rclquery.cpp:358:xenquire- >get_mset: exception: Document 41429 not found

I guess when it tries to obtain the date of a missing file, it fails. Not sure why the file is missing, but anyway, it should either have the date saved in the recoll index, or omit the file if the document is not found.

What do you think?

For me, this is critical, as I get no results at all!

medoc writes

Hi,

To help me reproduce this, could you please tell me:

  • The recoll version that you use.

  • What search mode you are using: is this a "filename" search, or a filename specified inside a "Query Language" search ?

aperezmendez writes

Hi,

version 1.20.4-1 (Arch Linux)

It happens with "filename" search, and also with the python binding (this is the way I discovered it):

from recoll import recoll
connection=recoll.connect()
search_data = recoll.SearchData()
search_data.addclause(type="filename",  qstring="pdf")
recoll_query = connection.query()
recoll_query.sortby("fmtime", ascending=False)
recoll_query.executesd(search_data)

I’m sure that if I rebuild the database, it will work. But this should be somehow handled. What if I upgrade the index once a week and some files have been deleted in the meanwhile?

medoc writes

I don’t think that this is related to a file disappearing. My best guess at this point is that the document data record is missing inside the Xapian index (meaning it’s corrupted). We can check this with the following command (delve is a low level Xapian command). It should print the term list and data field values for the document number 41429:

delve -r 41429  ~/.recoll/xapiandb -d

Of course, you may need to adjust the path to the index.

In case this does generate an error, I think that the only option is to rebuild the index.

If this works fine, I need more thinking because I have no idea about what could cause the error at this point (the date used for sorting does not come from the file but from data stored in the index).

I do see Xapian index corruption happening from time to time. Not too sure about the reason, one possibility would be that recollindex is not responsive enough during system shutdown, and a hard kill then prevents proper index syncing.

aperezmendez writes

Data for record #41429:

Error: DocNotFoundError: Document 41429 not found.

So I guess you are right. I’m of those that hardy re-start the laaptop. It is always suspended or running. How do I fix the corruption? Do I need to re-index everything?

medoc writes

I’m not sure of the cause of corruption. Possible candidates are recollindex crash, laptop reboot or other system error, or Xapian bug (yes it happens too…), and maybe others I can’t think of at the moment.

You could try to use xapian-compact to copy the index, maybe this could fix it. Or delete and rebuild (I’d prefer rm -rf xapiandb to recollindex -z in this case).

The problem here is that we are still not sure that the index is corrupted. We would get the same error for a genuinely non existing doc. I was hoping for a more explicit message. What does it say if you omit the -d ?

aperezmendez writes

Ups. I was too impatient, and rebuilt the index from the recoll GUI. But I made lots of tests.

  • I did ----delve -r 41429 ~/.recoll/xapiandb ----, I got a similar error (without mentioning the document, I think)

  • I did ----xapian-check .recoll/xapiandb ----, and it said there were an error in the B-trees.

  • I did ----delve -t pdf ~/.recoll/xapiandb ----, and I got a list of documents matching the term (including 41429), then…(next line)

  • I did ----delve -t pdf ~/.recoll/xapiandb -d ----, and I got all the details of the documents, until it reached 41429, and gave an error. I could not detect which file was for sure, but I think I got a glimpse. So… (next line)

  • I tried to re-index the file, and then the folder where the file was, but got no luck. Then… (next line)

  • I installed python2-xapian, and built a miniprogram to try to delete just that document from the DB (http://getting-started-with-xapian.readthedocs.org/en/latest/practical_example/indexing/updating_the_database.html). I was not able to do it though:(.

  • So, I just rebuilt the index. It happily works now :)

I just didn’t want to re-write 500MB on my SSD HD, and "waste" 10m of my life :). But I think it was worthy.

As you mentioned, it seems that xapiandb corruptions are rare. However, I’ve seen this failure at least three-four times in a two-years lapse. It is not so much, but it happens. Nevertheless, I think it is worthless to dig more into it if there is no one else getting this error. If you remember (you probably don’t :)), I use recoll though my pyneedle using the python interface. Besides, I have indexed some files encrypted with encfs, that can be updated by dropbox, and recoll will never realize it (because inotify will not tell you :D). All this together can be dangerous, I guess.

Thanks anyway!

medoc writes

Yes, I do hear about index corruptions quite regularly, it’s not only you. I don’t think that you are doing anything specially dangerous. The 2 possible causes of index corruption are a hard kill of the process while it updates the index (might happen because the computer stops or because recollindex crashes), or a Xapian bug. My feeling is that the first cause is the most probable, but it’s very difficult to make sure that it is the only one, and there is little I can do about it in any case (apart from fixing recollindex bugs…).

Maybe it would be actually worth, in some cases, to periodically stop recollindex (if you are running the real time version), and run xapian-check to verify the index.

I should probably add a section to the doc about this.

aperezmendez writes

I sometimes "killall" recollindex, but I guess that is not a hard kill (I don’t think I have ever used -9 for it). But I might have rebooted the machine the hard way at some moment.

The next time it fails I will make sure xapian-check fails, so at least you can tell. I don’t think rebuilding the index once or twice a year is something so hard. And you make sure everythig is up-to-date to the latest version.

medoc writes

Left on hold, no resolution for the moment, but nothing which can be done either.

BE_Bob writes

I have a similar problem that recently cropped up.

Recoll 1.21.3 + Xapian 1.2.16 with two external indices (on a network server) and one local index. The setup has been running fine for weeks and the external indices update on a cron job overnight.

Today, I searched for a term that I know is in many documents and can be found (my last name). No documents were found in the gui Recoll.

I then searched in one external index on the command line <code >recoll -c . -t -q term</code > and received the following response:

Recoll query: term… -1 results :2:../rcldb/rclquery.cpp:392:enquire- >get_mset: exception: Document 6 not found

I then went through and checked as above (after installing xapian-tools). I ran the <code >xapian-check</code > on both external indices and both had no problems.

I then ran <code >delve -t term ./xapiandb</code > and found a long list of IDs, one of which was 6. I then ran <code >delve -r 6 ./xapiandb</code > and got a long list of terms, which included term and seemed to be reasonable for a document I then ran <code >delve -r 6 ./xapiandb -d</code > and got the following:

Data for record #6:

Error: DocNotFoundError: Document 6 not found

I did a number of searches with the gui recoll interface and most of them failed with no results. a few were successful and seemed to be newer files/projects. Not sure what is going on, but I wanted to report this behavior before I deleted the indices and re-indexed.

I’ll wait until I hear back in case you want more information. Sorry about the poor formatting.

medoc writes

Hi,

Thanks for reporting this. I think that the only possible approach is to run xapian-check on each of the indices and remove/rebuild those which are corrupted.

As written above, there are several possible causes for this issue. The Xapian index is not journalled, and an interruption of the indexing process at a bad moment will lead to file corruption.

I do have an outstanding indexer crash report (an indexer crash would be a prime candidate as cause for index corruption), but I never got enough data to analyze and fix it. Can’t reproduce it either.

So for now, we just have to live with having to rebuild the index from time to time.

As it seems that xapian-check is always able to detect these issues, running it at regular intervals would be a good way to detect the problem soon after it occurs, which may help to find a cause.

BE_Bob writes

Thank you for your fast response. The strange thing about my situation is that xapian-check did not find any errors in the index. Here are the results of xapian-check on the database:

#!

record:
baseB blocksize=8K items=84507 lastblock=3379 revision=157 levels=2 root=12
B-tree checked okay
record table structure checked OK

termlist:
baseB blocksize=8K items=169014 lastblock=24090 revision=157 levels=2 root=5
B-tree checked okay
termlist table structure checked OK

postlist:
baseB blocksize=8K items=8727966 lastblock=66596 revision=157 levels=3 root=113
B-tree checked okay
postlist table structure checked OK

position:
baseB blocksize=8K items=34905667 lastblock=109114 revision=157 levels=2 root=11167
B-tree checked okay
position table structure checked OK

spelling:
Lazily created, and not yet used.

synonym:
baseB blocksize=8K items=255128 lastblock=4844 revision=157 levels=2 root=2
B-tree checked okay
synonym table: Don't know how to check structure

No errors found

The output of the cron job "00 03 * * * recollindex -c /media/bob/CBE_Tech/.recoll-CBE_Tech > >/media/bob/CBE_Tech/.recoll-CBE_Tech/recolltrace 2 >&1" showed numerous instances of multiple documents needing to be updated (which they shouldn’t since they’ve not been touched in years) and not being able to be found (which they shouldn’t be since they’ve not been touched in years). Then they would be found and updated. I’m not sure if this is normal or if there is some issue with timing relative to the network share. Here is an example from the output of the cron job:

#!

...
:2:../rcldb/rcldb.cpp:1818:Db::needUpdate: get_document error: Document 2 not found
:2:../rcldb/rcldb.cpp:1818:Db::needUpdate: get_document error: Document 3 not found
:2:../rcldb/rcldb.cpp:1818:Db::needUpdate: get_document error: Document 4 not found
:2:../rcldb/rcldb.cpp:1818:Db::needUpdate: get_document error: Document 6 not found
:3:../rcldb/rcldb.cpp:604:Db::add: docid 1 updated [/media/bob/CBE_Tech|]
:3:../rcldb/rcldb.cpp:604:Db::add: docid 2 updated [/media/bob/CBE_Tech/Archived cases|]
:3:../rcldb/rcldb.cpp:604:Db::add: docid 3 updated [/media/bob/CBE_Tech/Archived cases/Dir 1|]
:3:../rcldb/rcldb.cpp:604:Db::add: docid 4 updated [/media/bob/CBE_Tech/Archived cases/Dir 1/analysis|]
:2:../rcldb/rcldb.cpp:1818:Db::needUpdate: get_document error: Document 5 not found
:2:../rcldb/rcldb.cpp:1818:Db::needUpdate: get_document error: Document 7 not found
:2:../rcldb/rcldb.cpp:1818:Db::needUpdate: get_document error: Document 8 not found
:2:../rcldb/rcldb.cpp:1818:Db::needUpdate: get_document error: Document 10 not found
:2:../rcldb/rcldb.cpp:1818:Db::needUpdate: get_document error: Document 9 not found
:3:../rcldb/rcldb.cpp:604:Db::add: docid 5 updated [/media/bob/CBE_Tech/Archived cases/Dir 1/analysis/Case analysis.odt|]
:2:../rcldb/rcldb.cpp:1818:Db::needUpdate: get_document error: Document 16 not found
:3:../rcldb/rcldb.cpp:604:Db::add: docid 6 updated [/media/bob/CBE_Tech/Archived cases/Dir 1/analysis/Case notes.odt|]
:3:../rcldb/rcldb.cpp:604:Db::add: docid 7 updated [/media/bob/CBE_Tech/Archived cases/Dir 1/analysis/Case report.odt|]
:2:../rcldb/rcldb.cpp:1818:Db::needUpdate: get_document error: Document 13 not found
:3:../rcldb/rcldb.cpp:604:Db::add: docid 9 updated [/media/bob/CBE_Tech/Archived cases/Dir 1/analysis/Doc 1.pdf|]
...

Note that this happened repeatedly over multiple runs of the job. The output log has been appended to for approximately one month. For example, running "grep -c "Document 6" recolltrace" results in 34 occurrences. This seems to happen on every run. And "grep -c "not found" recolltrace" results in 5407 occurrences.

The stranger thing is that each time that any document is not found initially, it is ultimately found and updated. So the index should be fine and it seemed to check out OK. Perhaps recoll is not updating the index when the document is ultimately found?

medoc writes

The weird thing is that xapian-check does not see a problem. From the log, what seems to happen is that Xapian returns a document number (2,3,4…) matching the term identifying a specific document, and then, when recoll asks for the actual document, Xapian says "not found".

I do think that this signals an inconsistency in the index, but it’s strange that xapian-check does not see it.

Then recoll decides to update the document (as there is no index data for it), causing the "docid x updated" messages, but as far as I can see, the data falls in a black hole, because this repeats next time.

If this at all possible for you (I don’t know how big the index is), it would be great to make a back-up of the index files, and then erase and reindex (so that you can work). I’ll contact the Xapian mailing list with the problem, maybe there is something more which xapian-check should be checking.

Just to make completely sure: the index on which you ran xapian-check and the one on which you ran delve are the same directory ?

BE_Bob writes

Sorry for the delay in responding. I set the loglevel to 4 and ran recollindex with the output redirected to a new file. Here’s the first 30 lines in case it will help:

#!

:4:../common/rclconfig.cpp:459:RclConfig::initThrConf: autoconf requested
:4:../utils/execmd.cpp:400:ExecCmd::startExec: (0|1) sh {-c} {egrep ^processor /proc/cpuinfo | wc -l}
:4:../utils/execmd.cpp:829:ExecCmd::wait: got status 0x0
:4:../common/rclconfig.cpp:511:RclConfig::initThrConf: chosen config (ql,nt): (2, 4) (2, 2) (2, 1)
:3:recollindex.cpp:517:recollindex: changing current directory to [/tmp]
:4:../utils/execmd.cpp:400:ExecCmd::startExec: (0|0) /usr/share/recoll/filters/rclcheckneedretry.sh
:4:../utils/execmd.cpp:829:ExecCmd::wait: got status 0x100
:3:recollindex.cpp:550:recollindex: starting up
:4:../utils/execmd.cpp:400:ExecCmd::startExec: (0|0) /usr/bin/ionice {-c} {3} {-p} {21822}
:4:../utils/execmd.cpp:829:ExecCmd::wait: got status 0x0
:4:../rcldb/rcldb.cpp:756:Db::open: m_isopen 0 m_iswritable 0 mode 1
:4:../rcldb/rcldb.cpp:214:RclDb:: threads: haveWriteQ 1, wqlen 2 wqts 1
:4:../rcldb/rcldb.cpp:794:Db::open: lastdocid: 84523
:4:../index/fsindexer.cpp:137:FsIndexer: threads: haveIQ 1 iql 2 iqts 4 haveSQ 1 sql 2 sqts 2
:4:../index/fsindexer.cpp:201:FsIndexer::index: Indexing /media/bob/CBE_Tech into /media/bob/CBE_Tech/.recoll-CBE_Tech/xapiandb
:2:../rcldb/rcldb.cpp:1818:Db::needUpdate: get_document error: Document 1 not found
:4:../index/fsindexer.cpp:674:processone: needupdate 1 noretry 1 existing 0 oldsig []
:4:../internfile/internfile.cpp:160:FileInterner::init fn [/media/bob/CBE_Tech] mime [(null)] preview 0
:4:../internfile/mimehandler.cpp:249:getMimeHandler: mtype [inode/directory] filtertypes 1
:4:../internfile/mimehandler.cpp:64:getMimeHandlerFromCache: b5b4fc86b06b0e2268fb247d01976123 cache size 0
:4:../internfile/mimehandler.cpp:80:getMimeHandlerFromCache: b5b4fc86b06b0e2268fb247d01976123 not found
:4:../internfile/internfile.cpp:222:FileInterner:: unprocessed mime: [inode/directory] [/media/bob/CBE_Tech]
:2:../rcldb/rcldb.cpp:1818:Db::needUpdate: get_document error: Document 2 not found
:4:../index/fsindexer.cpp:674:processone: needupdate 1 noretry 1 existing 0 oldsig []
:4:../internfile/internfile.cpp:160:FileInterner::init fn [/media/bob/CBE_Tech/Archived cases] mime [(null)] preview 0
:4:../internfile/mimehandler.cpp:249:getMimeHandler: mtype [inode/directory] filtertypes 1
:4:../internfile/mimehandler.cpp:64:getMimeHandlerFromCache: b5b4fc86b06b0e2268fb247d01976123 cache size 0
:4:../internfile/mimehandler.cpp:80:getMimeHandlerFromCache: b5b4fc86b06b0e2268fb247d01976123 not found
:4:../internfile/internfile.cpp:222:FileInterner:: unprocessed mime: [inode/directory] [/media/bob/CBE_Tech/Archived cases]
:4:../internfile/internfile.cpp:247:FileInterner:: init ok inode/directory [/media/bob/CBE_Tech]

In answer to your question about the directory that I ran delve and xapian-check on: Yes, I ran them both on the same directory (one of the two server indices, but the same one every time).

I moved all of the files in the recoll directory I’ve been focusing on (/media/bob/CBE_Tech/.recoll-CBE_Tech) to a new folder and copied the recoll.conf back. I made some changes and that config file now reads:

#!

topdirs = /media/bob/CBE_Tech
# loglevel = 4
skippedNames = #* bin CVS  Cache cache* .cache caughtspam tmp \
     .thumbnails .svn \
     *~ .beagle .git .hg .bzr loop.ps .xsession-errors \
     .recoll* xapiandb recollrc recoll.conf _SYNCAPP
unac_except_trans = ßss œoe Œoe æae Æae ffff fifi flfl ſtft stst

I am letting recollindex run via the regular cron job tonight. I’ll let you know what I find tomorrow. If you’d like me to look for anything in the old recollindex output file, let me know.

medoc writes

Thanks a lot, I am going to send a message to the Xapian devs, they may have some inquiries about the problematic index files, maybe run some other tests, we’ll see.

BE_Bob writes

The clean indexing worked well. No instances of a document not being found. xapian-check on the new database showed no errors.

Seems like this is not a timing issue with my server. I’m happy to have a working indexing system back again! Now a search on cargill produces 773 results.

medoc writes

I’m glad it’s back to work. Please keep the old index files around for the moment, if you can. I have posted to the Xapian mailing list, and maybe a developer will want to run some more tests on the corrupted files.

BE_Bob writes

OK. Will do.

medoc writes

Olly Betts suggests to use:

delve -t '' ./xapiandb

Which should output a long list, to check if the problem documents are listed.

medoc writes

Hi Bob,

I guess that you had the message from xapian-discuss about the updated xapian-check. I think that it would be quite useful that you could run it.

I can easily build it for you if it makes things easier (and your system is not too exotic…). What are you running this on ?

BE_Bob writes

Jean-Francois,

Thank you! I did see the message on xapian-discuss and I’d be happy to run the updated xapian-check.

I am running on Ubuntu 14.04 desktop, 64 bit. I’m not sure what other information you need (I don’t typically build from source and I haven’t programmed in C in over 20 years).

If you build it, I’ll run it!

medoc writes

To add the Xapian backports PPA (it’s maintained by Olly Betts):

sudo add-apt-repository ppa:xapian-backports/ppa
sudo apt-get update
sudo apt-get install libxapian22

This will get you xapian 1.2.21 (the 22 in the package name refers to the lib ABI, it’s not related to the package version).

I am going to send the patched xapian-check binary through email (built on Ubuntu Trusty), it will run with either 1.2.16 or 1.2.21, same library interface and index format.

BE_Bob writes

I added the Xapian backports PPA and updated to 1.2.21 without issue. Thank you. Here is the output of recoll:

#!

Recoll 1.21.4 + Xapian 1.2.21

There was no problem with last night’s recoll run.

I downloaded and ran the newest xapian-check (thanks to Olly for the changes and thank you for the build) and it did detect errors in the old xapian index. Here is the full output on the old (corrupted) index:

#!
record:
baseB blocksize=8K items=84542 lastblock=3379 revision=163 levels=2 root=27
B-tree checked okay
record table structure checked OK

termlist:
baseB blocksize=8K items=169084 lastblock=24091 revision=163 levels=2 root=25
B-tree checked okay
termlist table structure checked OK

postlist:
baseB blocksize=8K items=8728420 lastblock=66596 revision=163 levels=3 root=1078
B-tree checked okay
Document length list has 84188 entries, should be 84542
postlist table errors found: 1

position:
baseB blocksize=8K items=34912211 lastblock=109127 revision=163 levels=2 root=11204
B-tree checked okay
position table structure checked OK

spelling:
Lazily created, and not yet used.

synonym:
baseB blocksize=8K items=255143 lastblock=4729 revision=163 levels=2 root=2307
B-tree checked okay
synonym table: Don't know how to check structure

Total errors found: 1

It did detect that the index was 354 entries short!

I ran the new xapian-check on the current xapian index and had no errors. Although, it had fewer documents (84469 versus 84542). I think this was due to excluding _SYNCAPP from the index. This had previously produced XML errors in the logs. Here’s an example of one of the errors from the old log files if it helps (with the directory name removed):

#!

:3:../rcldb/rcldb.cpp:604:Db::add: docid 139 updated [/media/bob/CBE_Tech/Archived cases/dirname/_SYNCAPP|]
/media/bob/CBE_Tech/Archived cases/dirname/_SYNCAPP/metadata.xml:1: parser error : Start tag expected, '<' not found
[SyncApp packed XML www.allwaysync.com]<{zZ,I;H!N ? 2wzqv,[Id\~^-Y8V
^
unable to parse /media/bob/CBE_Tech/Archived cases/dirname/_SYNCAPP/metadata.xml

I looked through the old recoll output logs (at verbosity level 3) and couldn’t detect anything that would explain the xapian corruption. At that time, I was attempting to setup the cron job and logging, so the log did not include the original indexing. The first run of recollindex that I have logged showed that it tried to reindex those ~350 files and had no "document not found" errors. The very next run (the second that I have in the logs) showed recollindex adding those same documents to the index again and that’s when it began to give the "document not found" errors. I am not sure what was happening or what caused the reindexing of those files in the first place.

medoc writes

Thanks for the xapian-check test, I’ll post it on the xapian list to wrap-up the case. As Olly Betts expressed no interest in seeing the files at all, I think that you can get rid of them.

I’ll also signal that we have no particular data about the issue onset.

About _SYNCAPP, a quick google search seems to indicate that this is a technical db from a file synchronization application (http://allwaysync.com/faq.html), so excluding them is appropriate. metadata.xml is no XML at all as far as I can see from the error output, so no surprise there.

Thanks for your cooperation ! I hope that your index stays up this time, but maybe a pass of xapian-check once in a while might not hurt :)

medoc writes

Nothing more can be done at the moment.