Unknown reporter writes

Both recollindex and thunderbird are pegging one processor each, and this has gone on for several days. According to the scroll at the bottom of Recoll, it is indexing /home/xxx/.thunderbird/xxxxxx during this whole time. There is slow progress, with the indexer moving from one file to the next every five seconds or so. I should mention that the email archives and active files are rather large.

If I shut down Thunderbird, recollindex still uses a lot of CPU.

If I turn off global indexing in thunderbird, both recollindex and thunderbird continue to use high CPU. I can find nothing that says either program is throwing errors.

I shut down Thunderbird, turned off the indexer from Recoll, and turned it back on. It started over in another directory and again began pegging one processor, even before reaching the Thunderbird directory where all my email is stored. To be fair, it runs at low priority and there is not much competition, so this behavior might be normal. While recollindex was indexing directories other than the Thunderbird one, I reopened Thunderbird. Thunderbird remained CPU intensive, while recollindex dropped to a few percent CPU. However, recollindex soon returned to high CPU usage, even though it was not indexing the Thunderbird directory. As a result, I am not sure that the two problems are related.

I am now rebuilding the whole index to see what happens. Both rclimg and recollindex are using somewhat more reasonable CPU during the rebuild, but I will wait until tomorrow to see how things are going.

medoc writes

If we end up looking for a bug in Recoll, it would be useful to know what version you are using and on what platform. Without knowing the version, some of the ideas further down are going to be a little imprecise.

We need more data about what exactly the indexer is doing. Setting the log verbosity to 5 would probably be appropriate; that way we’ll see the file update events as they come. Also use a real file for the log, as the messages from the monitoring daemon would be lost otherwise. Put the log outside the indexed area…
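
For illustration, the logging settings described above might look like the fragment below. This is only a sketch: it assumes the loglevel and logfilename parameters of the main Recoll configuration file (normally ~/.recoll/recoll.conf), and the log path is a placeholder that should be replaced with any location outside the indexed directories.

```
# Sketch of the logging settings, assuming the standard recoll.conf parameters.
# Verbosity 5 should show the file update events as they arrive.
loglevel = 5
# Placeholder path: choose any real file outside the indexed area.
logfilename = /tmp/recollindex.log
```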

I am assuming that you are / have been running recollindex -z to reset the index.

Once the index is rebuilt, launch the indexer in monitor mode (the exact method depends on the version), and wait until it completes its initial pass and goes into monitoring mode. This should manifest itself by messages like the following:

:5 :rclmonprc.cpp:492:Monitor: Modify/Check on /y/home/dockes/tmp/t/index.html

Once the initialization is done, on a quiescent or relatively quiescent system these messages should become few, and CPU usage should fall to almost 0.

If such messages remain frequent or keep repeating, check which files they concern, and either exclude those files from indexing (some log files are just not interesting) or set an individual reindexing timeout for them in the config file. I can provide the details if this turns out to be the right approach.
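
To find out which files the repeated messages are for, something like the following Python sketch could tally them from the log. It assumes that every monitor event line contains the text "Monitor: Modify/Check on " followed by the file path, as in the sample message quoted earlier; the function names are made up for this example and are not part of Recoll.

```python
import re
from collections import Counter

# Marker text taken from the sample log line quoted in this thread.
# Assumption: the path runs from the marker to the end of the line.
EVENT_RE = re.compile(r"Monitor: Modify/Check on (?P<path>.+)$")


def count_monitor_events(lines):
    """Count Modify/Check monitor events per file path."""
    counts = Counter()
    for line in lines:
        match = EVENT_RE.search(line.strip())
        if match:
            counts[match.group("path")] += 1
    return counts


def busiest_paths(log_path, top=10):
    """Return the `top` most frequently reported paths in a log file."""
    with open(log_path, encoding="utf-8", errors="replace") as handle:
        return count_monitor_events(handle).most_common(top)
```

Calling busiest_paths() with the path of the indexer log (for example the placeholder /tmp/recollindex.log) lists the files that trigger the most events, which are the first candidates for exclusion or for an individual timeout.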

Once this is resolved, we can go on to the thunderbird issue. If for some reason some big thunderbird folders are updated frequently, maybe you need to set a timeout for them too.


jschieber writes

After reindexing overnight, Recoll (1.18.1 + Xapian 1.2.8 on Kubuntu 12.04) is now using very little CPU. However, pdftotext did crash at some point, presumably while indexing. Thunderbird is still running high on CPU, but this is probably unrelated to Recoll.

medoc writes

Ok then. If it does it again, instead of reindexing from scratch, please try to just stop it and restart it with a higher log verbosity, so that we know what the process is doing.