837183 writes

Hi, I have 110,000 PDFs I’d like to index; each is an eBook. I indexed ~50,000 and then clicked "stop indexing". I got a prompt that said "indexing failed", clicked OK, then restarted the PC.

How can I verify that Recoll continues to index the remaining 50,000 files? There’s no indication of that in the status bar.

medoc writes

If you click start indexing, you should see the file names in the status bar after a moment.

You can also use the recollindex command line directly. At level 3, the log will mostly list the file paths being indexed.

Otherwise, you can get an idea of what files are currently inside the index with the following command:

delve -1 -a ~/.recoll/xapiandb/ | egrep '^Q'

delve is a Xapian tool; you may have to install an additional package to get it, depending on your distribution.

837183 writes

In the GUI there’s rebuild/update index, but no start indexing.

[This is some of the output](http://pastebin.com/y9jeLquU) I get with delve, but it’s certainly not files?

delve -1 -a /home/arled/.recoll/xapiandb | egrep '^Q'

medoc writes

Oops, you are using a case-sensitive index, I’m glad that there is at least one user for this :)

The corrected command would be:

delve -1 -a ~/.recoll/xapiandb/ | egrep '^:Q:'
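Since the prefix differs between the two kinds of index, a small filter can count the entries without knowing in advance which one applies. This is only a sketch: the function name `count_q_terms` is made up for illustration, and it assumes, as the two commands above suggest, that a case-sensitive index writes these terms as `:Q:...` and any other index as `Q...`. It reads `delve -1 -a` output on standard input.

```shell
# Hypothetical helper (not part of Recoll or Xapian).
# Counts lines starting with ':Q:' if any exist (case-sensitive index),
# otherwise counts lines starting with 'Q' (assumed default index layout).
count_q_terms() {
    awk '
        /^:Q:/ { colon++ }
        /^Q/   { plain++ }
        END    { print (colon > 0 ? colon : plain) + 0 }
    '
}

# Usage (path as in the commands above):
# delve -1 -a ~/.recoll/xapiandb/ | count_q_terms
```

Preferring the `:Q:` count when it is non-zero avoids miscounting ordinary capitalized words that start with Q in a case-sensitive index.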

About the recoll GUI indications: so that I can best answer you, what version are you using?

In general though, if you have rebuild index and update index entries, this means that the indexing daemon is not running.

To start indexing from the GUI, use update index.

837183 writes

Huh, I thought that update index rebuilds the entire index while letting you use what’s indexed so far.

Recoll 1.19.14 + Xapian 1.2.12

Yeah, you’re right :-) , when I execute recollindex -m and then open the GUI I have an option to stop indexing instead of the other two.

So using

delve -1 -a ~/.recoll/xapiandb/ | egrep '^:Q:' | wc -l

It seems that all of the files have been indexed.

Thank you.

medoc writes

looks ok

humble_user writes

I found that delve command unhelpful. It seemed to list files in the index. There are a lot of files in the index. So it was going to take forever and was having a good crunch of my hard disk while I was at it. So I killed it.

delve -v … gave good summary info (in particular, the number of documents in the index), but that also went off on one and had to be killed. At least it reported the useful summary info before it went off.

But all this is stuff a GUI user shouldn’t have to do: shouldn’t have to get basic info from the command line, shouldn’t have to consult the manual on how to find it, and certainly shouldn’t have to come here for help.

You know this, of course, because the user documentation does say that you are looking in particular for user feedback on how the GUI might be improved.

The GUI would be improved greatly with two features that would resolve this problem:

(1). A pause / continue indexing <button>

I understand from what you say here that update index continues from where it left off if it is restarted after being stopped. But I had to come here to find out. And although then update and resume/continue indexing are the same functionally as far as the program is concerned, they are very different operations as far as the user is concerned - i.e. if the user has manually paused indexing, then from the user point-of-view update is not the same as resume. This is especially so since the user has no way of knowing that update index also resumes.

Update does not, however, resume indexing. It seems to check over what it has indexed already before continuing where it left off. With a large index, this can take an inordinate amount of time. I presume it’s checking for modified files / directories. That would be good if I wanted to update the index, but not if I merely wanted to resume where I had interrupted it. I can see how this might apply more to build than update: i.e. after interruption, it would be very helpful if Recoll would allow me to resume building the index where it was stopped; on resuming an interrupted update, it would be more efficient for the update to pick up from the point where it was interrupted (perhaps if an interruption created a big time difference between portions of the index, where one part had been updated at a different time to another, the user might simply be advised as much).

I’m doing an update/resume index now. It is updating the index for an awful lot of files when it was interrupted only a few hours ago and not that many files have been changed.

A proper resume function would actually help a lot since updating / building the index takes a very long time. I attempted to rebuild the index this weekend. I started it Saturday morning. I had to stop it Monday morning so I could go back to work. I tried resuming but it’s been half an hour and it’s still only half-way back to where it was before I interrupted it. If it simply had a <resume> button, I could give it a whole hour of indexing at lunchtime. Instead, it would take my whole lunch hour to get to where it was.

(2). Index info

It would be helpful if summary info could be presented in the GUI. It’s too much to ask any but an admin or tech to search the help forums and then to read the man pages and then to try command line instructions to get summary info. The indexer shows a count of documents when it’s processing. It would be helpful, particularly when dealing with an incomplete indexing operation, if the GUI even just delivered that one number.

bradleybradley writes

Thank you Mark Ballard!

Those were EXACTLY my thoughts after trying to use Recoll and struggling to grasp the indexing process in detail.

I signed up only to say "thanks" and am wondering how this thread can be considered "resolved" without any polishing of the GUI. I do understand that there is a workaround; however, the GUI is just not up to the standards of Recoll in general (which is great imo).

Anyhow thanks to the developers of Recoll and thanks to Mark for his remarks.

medoc writes

Reopening this by popular demand and because there are things which can be improved.

medoc writes

About the index summary info: until I can put a direct menu entry to reach it, you can get some from the Term Explorer tool by choosing "Show index statistics" in the tool combobox.

I think that it would also be interesting to have a count of documents added/updated/deleted during the last indexing pass. This will be in the next version if it’s not too hard to do.

About resuming indexing: the part where recoll just walks the file tree checking file dates takes a negligible time. It would be very complicated to resume a file tree walk, and there is absolutely no way that I can think of to do it reliably.

So there will be no resume indexing; update index is all we can do.

One issue has arisen lately though, the cause of which is not fully known, but with a suspicion that it arises from a Xapian bug. This damages the index in such a way that some search results are missing, and that many document up-to-date checks can fail. Of course, this makes resuming from the start very expensive because of the re-indexing (for nothing, the index data is lost anyway).

You can see a partial description in issue #257, but part of the discussion happened on the Xapian mailing list.

The problem would signal itself by the following kind of message in the indexer log:

:2:../rcldb/rcldb.cpp:1818:Db::needUpdate: get_document error: Document XX not found

If you get this, the index is damaged; only deleting it and reindexing will do.
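A quick way to look for that message is to grep the indexer log. The sketch below is an illustration only: the function name `index_looks_damaged` is hypothetical, and the log file location is passed as an argument because it depends on your configuration. The pattern matches the function name and error text quoted above, not the source line number, which may differ between versions.

```shell
# Hypothetical helper (not part of Recoll).
# Succeeds (exit status 0) if the given log file contains the
# "get_document error: Document ... not found" message from Db::needUpdate.
index_looks_damaged() {
    grep -q 'Db::needUpdate: get_document error: Document .* not found' "$1"
}

# Usage (replace the path with your own indexer log file):
# if index_looks_damaged /path/to/indexer.log; then
#     echo "index may be damaged: delete it and reindex"
# fi
```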

Olly Betts, the Xapian developer, thinks that the origin may be a problem which was fixed in Xapian 1.2.21, so it would be a good idea to update your version. All Xapian 1.2.x versions are binary-compatible (you can just drop them on recoll), and there are backport repositories for several common Linux versions; get in touch if you have a problem.
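To check whether an installed Xapian is at or past 1.2.21, the version strings can be compared in the shell. This is a minimal sketch: `version_at_least` is a made-up name, it relies on the `-V` (version sort) option of GNU `sort`, and you supply the installed version yourself (for example the "Xapian 1.2.12" reported earlier in this thread).

```shell
# Hypothetical helper: true when version $1 is greater than or equal to $2.
# Relies on GNU sort's -V (version sort) option.
version_at_least() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Usage, with the version reported earlier in the thread:
# if version_at_least 1.2.12 1.2.21; then
#     echo "has the fix"
# else
#     echo "older than 1.2.21, consider updating"
# fi
```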

I’m going to put all this on the web site, but if you are experiencing long reindexing times when nothing has changed, this might be the cause, so you get the early notice.

medoc writes

The updated stats tools and menu entries in the 1.22 GUI should solve most of this. It remains that Recoll is a complex beast, and that a quick look at the appropriate manual section will always make understanding easier. I definitely agree that suggesting the use of delve was a bad idea.