humble_user writes

It would be very helpful if the GUI let me pause an index update or build, and resume the indexing from the point where it stopped.

I read elsewhere in these pages that indexing can be paused / resumed by stopping and updating. But this appears not to be so… unless the three values Recoll shows when updating the index are wrong, or unless I have misunderstood them. (I don’t quite know what they mean in any case).

For example, I told Recoll to rebuild the index on Saturday morning. By Monday morning I had to tell it to stop so I could get back to work. I’ve no idea how close it got to the finish. I don’t know if the rebuild erased the prior index first, or if it has kept for reference what it has not replaced, so I don’t know how much confidence I can have in the index while this rebuild is incomplete. And at the rate it is going, it might be a long time before it is complete.

I attempted to resume indexing at lunchtime by telling Recoll to update the index (even though the earlier build had not completed, yah?), but it seemed to take the indexer too long to recover the place it last was when I stopped it. I had to stop indexing again before it was apparent that it had recovered its position.

This evening I started update index again. Five hours later, it appears still not to have regained its stopped position. It shows the number of files that were in the index by the time I stopped the build on Monday morning. It shows two other values, which have steadily increased as the update has progressed: one now about half the size of the built index, and one about 4% the size of the built index and growing slowly relative to the total amount of time it has spent indexing.

I prefer to think that I have misread this. But it does seem like the indexer might take three days just to recover its position, and then God knows how long to actually complete the index.

If the GUI could give me some assurance, or a function that will genuinely pause/resume indexing if it does not exist already, I might have more hope of reaching the end. It would also give me greater confidence in what there is of an index thus far, so I can continue work in the meantime with some degree of confidence or other in what Recoll tells me is in the index.

medoc writes


Maybe that’s what you do already, but you need to use "update index", not "rebuild index" if you want to do it in multiple partial runs. "Rebuild Index" zeroes out the index before starting, so you always restart from the beginning. From the body of your text, I understand that you do use "update index", but please confirm.

Then, it is really strange that indexing is so slow for you. Could you please describe your document set and the hardware you are running the indexing on?

Implementing a real pause/resume operation is much more complicated than it sounds. There is a lot of state in a file tree walk, and there is no guarantee that the tree has not changed when indexing resumes.

So Recoll always restarts the tree walk from scratch when indexing. However, when using "update", files which were already indexed are just checked for modification time and size, and this is quite fast, normally processing many hundreds or even thousands of files per second.
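
To illustrate why this check is cheap, here is a minimal Python sketch of the general idea. It is not Recoll's actual code: the function name and the in-memory `known` table are hypothetical stand-ins for whatever the index really records about each file.

```python
import os

def files_needing_index(top, known):
    """Walk `top` and yield (path, signature) for files that look new or changed.

    `known` maps path -> (size, mtime_ns) as recorded by an earlier run.
    It is a hypothetical stand-in for the data an index would store.
    """
    for dirpath, _dirnames, filenames in os.walk(top):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.stat(path)
            except OSError:
                continue  # file vanished or is unreadable; skip it
            sig = (st.st_size, st.st_mtime_ns)
            # Only a stat() call is needed here; the file is never opened.
            if known.get(path) != sig:
                yield path, sig
```

Unchanged files cost one `stat()` call each, which is why the early part of an update run should go quickly. Note that in a scheme like this one, a file whose path changes (for example after renaming a parent folder) looks like a new file.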

The one case where restarting may be slow is if many files cause indexing errors, typically because of a missing filter helper.

Recoll 1.20 always retries failed files, and this can slow down the walk, especially if some files take a long time to generate an error. For this reason, Recoll 1.21 does not retry failed files if nothing seems to have changed in the bin directories (no new programs installed).

So we need to check your log file to see what recoll is doing during the initial part of the indexing, when it should just be checking on already indexed files.

About monitoring indexing progress: as recoll does not pre-walk the file tree before actual indexing, it has no way to know how much work remains. This is usually not an issue, because a typical index update will complete in a matter of minutes, not hours. However, for people with huge document sets and/or a slow machine, this is inconvenient.

In this case, I would recommend setting the debug verbosity at level 3, and sending the log to some file (you can do this from the indexing preferences). At this verbosity, the log will mostly have one line for each processed file, which makes it reasonably easy to see what is happening, by running "tail -f" on the file in a terminal window.

I think that this is what you should do, because it will also show any errors which occur during the part of the walk where already indexed files are supposed to be processed.
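
If "tail -f" is not convenient, the same kind of monitoring can be sketched in a few lines of Python. This is only an illustration: the function names are made up, the log path is whatever file you chose in the indexing preferences, and nothing here assumes a particular log line format.

```python
import time

def read_new_lines(path, offset=0):
    """Return (lines, new_offset): complete lines appended since `offset`."""
    with open(path, "rb") as f:
        f.seek(offset)
        data = f.read()
    # Keep only complete lines; a partial last line is left for the next call.
    end = data.rfind(b"\n") + 1
    lines = data[:end].decode("utf-8", errors="replace").splitlines()
    return lines, offset + end

def follow(path, interval=1.0):
    """Yield lines as they are appended to `path`, polling every `interval` seconds."""
    offset = 0
    while True:
        lines, offset = read_new_lines(path, offset)
        for line in lines:
            yield line
        time.sleep(interval)
```

For example, `for line in follow("/path/to/your/logfile"): print(line)` prints each new log line until interrupted with Ctrl-C.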

humble_user writes

It is indeed old hardware: P4 3.0GHz. 3.5Gb. The document set is large.

The Recoll process in this instance was initially a rebuild. That had to be interrupted. I resumed it as an update, because I thought it would pick up where it left off, as you suggest, after skipping those files that had been indexed and not since modified. It seems now to have nearly regained its position, only after considerable time (about as much time as it took to get there in the first place - but I can only guess at the meaning of the three values Recoll GUI gives when it’s performing an update).

The errors appear to be sent to stderr and null by default, though at level 3. Perhaps error reporting would be helpful as a GUI option at the point where an index function is initiated, and if given a logfile by default? (An indexing dialogue?).

humble_user writes

I’ve spotted that my GUI preferences menu has "All Languages" checked, as opposed to just "English".

What difference would this make to the indexing?

medoc writes

No difference: these are the languages used at query time. You can set the languages used when indexing in the index configuration section.

medoc writes

A few data points about this:

  • An incremental pass on an "almost up to date" tree should be very fast, not very much slower than a find. On my home directory, find needs 0.3 s to explore 64237 files and directories. Initial recoll indexing needs around 15 min and results in 105K documents (many subdocs). An immediately subsequent incremental pass takes 9 seconds, almost all of which is spent creating the stemming and aspell dictionaries. This is on an Intel Core i5 750, which is quite obsolete and never was a high-end processor, and using an old spinning disk. Recoll is set not to retry failed files, a function available in version 1.21.

  • In other words, the initial part of resuming an interrupted indexing, which is mostly walking a file tree and doing nothing, is almost free. On the other hand, resuming a suspended file walk by restarting at the same point is a very difficult problem. There is a lot of state in a file walk, and checking that it is still valid when resuming is definitely not a trivial problem.
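
To get a comparable baseline figure on another machine, a bare tree walk can be timed with a short Python sketch like the one below. It is a hypothetical helper, not part of Recoll. It reads no file contents, so it measures roughly what a find over the same tree would cost.

```python
import os
import time

def time_tree_walk(top):
    """Return (entries, seconds) for a bare walk of `top`, reading no file contents."""
    start = time.perf_counter()
    entries = 0
    for _dirpath, dirnames, filenames in os.walk(top):
        # Count every directory and file name seen below `top`.
        entries += len(dirnames) + len(filenames)
    return entries, time.perf_counter() - start
```

If this walk alone takes a long time on the indexed directories, the disk or the tree size is the bottleneck. If it is fast while the start of an update run is slow, the time is being spent elsewhere, and the indexer log is the place to look.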

My conclusion is that the current approach of killing the indexer and restarting from scratch is satisfactory in most configurations, and we should investigate why it does not work for you (which I’m quite willing to help with), instead of designing a solution for a general problem which does not exist.

humble_user writes

Okay, thank you. I suspect it’s a large number of files and a tendency to rename/restructure top-level folders.