Parameters affecting indexing performance and resource usage

idxflushmb

Threshold (megabytes of new data) at which we flush from memory to the disk index. Setting this allows some control over memory usage by the indexer process. A value of 0 means no explicit flushing, which lets Xapian apply its own policy, that is, flushing every $XAPIAN_FLUSH_THRESHOLD documents created, modified or deleted. As memory usage depends on average document size, not only on document count, the Xapian approach is not very useful, and you should let Recoll manage the flushes. The program compiled value is 0. The configured default value (from this file) is now 50 MB, and should be ok in many cases. You can set it as low as 10 to conserve memory, but if you are looking for maximum speed, you may want to experiment with values between 20 and 200. In my experience, values beyond this are always counterproductive. If you find otherwise, please drop me a note.
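The difference between size-based and count-based flushing can be illustrated with a small sketch. This is not Recoll's code: the FlushTracker class, its method names, and the choice of 1024*1024 bytes per megabyte are assumptions made for the example.

```python
MB = 1024 * 1024  # assumption: the threshold is counted in binary megabytes


class FlushTracker:
    """Hypothetical helper: counts pending bytes and flushes at a threshold."""

    def __init__(self, flush_mb):
        # A threshold of 0 means "never flush explicitly" (left to Xapian).
        self.threshold = flush_mb * MB
        self.pending = 0
        self.flushes = 0

    def add_document(self, nbytes):
        """Account for one document's data and flush if the threshold is hit."""
        self.pending += nbytes
        if self.threshold > 0 and self.pending >= self.threshold:
            self.flush()

    def flush(self):
        """Stand-in for writing the in-memory data to the disk index."""
        self.flushes += 1
        self.pending = 0


if __name__ == "__main__":
    tracker = FlushTracker(flush_mb=10)
    for _ in range(25):
        tracker.add_document(1 * MB)
    print(tracker.flushes, tracker.pending // MB)
```

With a size-based threshold, many small documents and a few large ones trigger flushes at the same memory cost, which a pure document count cannot do.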

filtermaxseconds

Maximum external filter execution time in seconds. Default 1200 (20 minutes). Set to 0 for no limit. This is mainly to avoid infinite loops in PostScript files (loop.ps).
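As an illustration of the semantics (a value in seconds, with 0 meaning no limit), here is a minimal sketch using the Python standard library. The run_filter function is hypothetical and does not reflect how Recoll actually supervises its filters.

```python
import subprocess
import sys


def run_filter(cmd, max_seconds):
    """Run an external command, giving up after max_seconds (0 = no limit).

    Returns the command's standard output as bytes, or None on timeout.
    """
    timeout = max_seconds if max_seconds > 0 else None
    try:
        # subprocess.run() kills the child and waits for it when the
        # timeout expires, then raises TimeoutExpired.
        proc = subprocess.run(cmd, capture_output=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        return None
    return proc.stdout


if __name__ == "__main__":
    # A well-behaved "filter": prints and exits at once.
    print(run_filter([sys.executable, "-c", "print('ok')"], 1200))
```

A filter stuck in an endless loop would make run_filter return None once the limit is reached, so that indexing can move on to the next file.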

filtermaxmbytes

Maximum virtual memory space for filter processes (setrlimit(RLIMIT_AS)), in megabytes. Note that this includes any mapped libraries (there is no reliable way on Linux to limit the data space only), so we need to be a bit generous here. Anything over 2000 will be ignored on 32-bit machines. The high default value is needed because of Java-based handlers (pdftk), which need a lot of virtual memory (most of it text), especially when pdftk is executed from the Python rclpdf.py. You can use a much lower value if you don't need Java.
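The mechanism named above, setrlimit(RLIMIT_AS) applied to the child process, can be sketched with Python's resource module on Unix-like systems. The function names are hypothetical, the megabyte is assumed to be 1024*1024 bytes, and the clamping to the hard limit is a choice made for this example, not something Recoll is known to do.

```python
import resource
import subprocess
import sys

MB = 1024 * 1024  # assumption: binary megabytes


def as_limit_bytes(max_mbytes):
    """Soft RLIMIT_AS value in bytes, clamped to the current hard limit."""
    wanted = max_mbytes * MB
    _, hard = resource.getrlimit(resource.RLIMIT_AS)
    if hard != resource.RLIM_INFINITY and wanted > hard:
        wanted = hard
    return wanted


def run_filter_limited(cmd, max_mbytes):
    """Run cmd with its address space limited to about max_mbytes."""

    def set_limit():
        # Runs in the child, after fork() and before exec(), so only the
        # filter process is restricted, not the caller.
        _, hard = resource.getrlimit(resource.RLIMIT_AS)
        resource.setrlimit(resource.RLIMIT_AS, (as_limit_bytes(max_mbytes), hard))

    return subprocess.run(cmd, capture_output=True, preexec_fn=set_limit)


if __name__ == "__main__":
    probe = "import resource; print(resource.getrlimit(resource.RLIMIT_AS)[0])"
    print(run_filter_limited([sys.executable, "-c", probe], 4000).stdout)
```

Because the limit covers the whole address space, including mapped libraries, a value that looks large compared to the data actually processed can still be needed.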

thrQSizes

Task queue depths for each stage, and threading configuration control. There are three internal queues in the indexing pipeline, one for each stage (file data extraction, term generation, index update). This parameter defines the queue depth for each stage (three integer values). In practice, deep queues have not been shown to increase performance. The first value is also used to request threading autoconfiguration or to disable multithreading. If the first queue depth is set to 0, Recoll will set the queue depths and thread counts based on the detected number of CPUs. The arbitrarily chosen values are as follows, as (depth, nthread) pairs for the three stages. 1 CPU: no threading. Fewer than 4 CPUs: (2, 2), (2, 2), (2, 1). Fewer than 6 CPUs: (2, 4), (2, 2), (2, 1). Otherwise: (2, 5), (2, 3), (2, 1). If the first queue depth is set to -1, multithreading will be disabled entirely. The second and third values are ignored in both these cases.

thrTCounts

Number of threads used for each indexing stage. If the first entry in thrQSizes is not 0 or -1, these three values define the number of threads used for each stage (file data extraction, term generation, index update). It makes no sense to use a value other than 1 for the last stage because updating the Xapian index is necessarily single-threaded (and protected by a mutex).
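The way thrQSizes and thrTCounts combine can be summarized in a short sketch. The function names are hypothetical and this is only a restatement of the rules described above, not Recoll's implementation; None stands for "no multithreading".

```python
def auto_config(ncpus):
    """(depth, nthread) pairs for the three stages, chosen from the CPU count."""
    if ncpus <= 1:
        return None  # 1 CPU: no threading
    if ncpus < 4:
        return [(2, 2), (2, 2), (2, 1)]
    if ncpus < 6:
        return [(2, 4), (2, 2), (2, 1)]
    return [(2, 5), (2, 3), (2, 1)]


def resolve_threading(thr_q_sizes, thr_t_counts, ncpus):
    """Apply the thrQSizes / thrTCounts rules for a machine with ncpus CPUs.

    Both arguments are lists of three integers, in stage order:
    file data extraction, term generation, index update.
    """
    first = thr_q_sizes[0]
    if first == -1:
        return None  # multithreading disabled entirely
    if first == 0:
        return auto_config(ncpus)  # remaining values are ignored
    # Explicit configuration: pair each queue depth with its thread count.
    return list(zip(thr_q_sizes, thr_t_counts))


if __name__ == "__main__":
    print(resolve_threading([0, 0, 0], [1, 1, 1], ncpus=8))
    print(resolve_threading([2, 2, 2], [4, 2, 1], ncpus=8))
```

In the explicit case, the last thread count would normally be 1, since the index update stage is single-threaded.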

thrTmpDbCnt

Number of temporary indexes used during incremental or full indexing. If not set to zero, this many temporary indexes are used during indexing, then merged into the main one at the end of the operation. Using multiple indexes and a final merge can significantly improve indexing performance when the single-threaded Xapian index updates become a bottleneck. How useful this is depends on the type of input and on the CPU. See the manual for more details.