The Recoll indexer, recollindex, is a big process which executes many others, mostly for extracting text from documents. Some of the executed processes are quite short-lived, and the time used by the process execution machinery can actually dominate the time used to translate data. This document explores possible approaches to improving performance without adding excessive complexity or damaging reliability.
Studying fork/exec performance is not exactly a new venture, and there are many texts which address the subject. While researching, though, I found that few of them were accurate and that many questions were left as an exercise to the reader.
Issues with fork
The traditional way for a Unix process to start another is the
fork()/exec() system call pair.
fork() duplicates the process address space and resources (open files
etc.), then duplicates the thread of execution, ending up with 2 mostly
identical processes.
exec() then replaces part of the newly executing process with an address
space initialized from an executable file, inheriting some of the resources
under various conditions.
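As an illustration of the sequence just described, here is a minimal sketch in C. The function name run_command() is hypothetical and is not taken from the Recoll sources; it assumes the caller passes a full path and a NULL-terminated argument vector.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: run the program at 'path' with arguments 'argv'
 * (NULL-terminated) and wait for it to finish.
 * Returns the child's exit status, or -1 on error or abnormal exit. */
int run_command(const char *path, char *const argv[])
{
    pid_t pid = fork();          /* duplicate the calling process */
    if (pid < 0)
        return -1;               /* fork() failed, no child was created */
    if (pid == 0) {
        execv(path, argv);       /* replace the child's program image */
        _exit(127);              /* only reached if execv() failed */
    }
    int status;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

The value 127 for a failed exec() follows the shell convention; it is a choice made for this sketch.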
This was all fine with the small processes of the first Unix systems, but as time progressed, processes became bigger and the copy-before-discard operation was found to waste significant resources. It was optimized using two methods (at very different points in time):
The first approach was to supplement fork() with the
vfork() call, which is similar but does not duplicate the address space: the new process thread executes in the old address space. The old thread is blocked until the new one calls
exec() and frees up access to the memory space. Any modification performed by the child thread persists when the old one resumes.
The more modern approach, which coexists with
vfork(), was to replace the full duplication of the memory space with duplication of the page descriptors only. The pages in the new process are marked copy-on-write so that the new process has write access to its memory without disturbing its parent. This approach was supposed to make
vfork() obsolete, but the operation can still be a significant resource consumer for big processes mapping a lot of memory, so that
vfork() is still around. Programs can have big memory spaces not only because they have huge data segments (rare), but just because they are linked to many shared libraries (more common).
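The vfork() variant can be sketched as follows. This is an illustration under stated assumptions, not the Recoll code: run_command_vfork() is a hypothetical name, and the child is restricted to execve() and _exit() because it runs inside the parent's address space until one of those calls succeeds.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

extern char **environ;

/* Hypothetical helper: same contract as a fork()-based runner, but the
 * child borrows the parent's address space instead of copying it. */
int run_command_vfork(const char *path, char *const argv[])
{
    pid_t pid = vfork();         /* parent thread is suspended here */
    if (pid < 0)
        return -1;
    if (pid == 0) {
        /* Child: touch nothing but exec and _exit. Any store to
         * memory would be visible to the parent when it resumes. */
        execve(path, argv, environ);
        _exit(127);              /* never return from this function */
    }
    /* Parent resumes once the child has called execve() or _exit(). */
    int status;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```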
Orders of magnitude: a recollindex process will easily grow to a
few hundred megabytes of virtual space. It executes the small and
efficient antiword command to extract text from ms-word files. While
indexing multiple such files, recollindex can spend 60% of its CPU time
on fork()/exec() overhead.
Apart from the performance cost, another issue with
fork() is that a big
process can fail to execute a small command because of the temporary need to
allocate twice its address space. This is a much discussed subject which we
will leave aside because it generally does not concern recollindex, which
in typical conditions uses a small portion of the machine's virtual memory,
so that a temporary doubling is not an issue.
The Recoll indexer is multithreaded, which may introduce other issues. Here
is what happens to threads during the fork()/vfork() calls.
With fork():
The parent process threads all go on their merry way.
The child process is created with only one thread active, duplicated from the one which called fork().
With vfork():
The parent process thread calling
vfork() is suspended, the others are unaffected.
The child is created with only one thread, as for
fork(). This thread shares the memory space with the parent ones, without having any means to synchronize with them (pthread locks are not supposed to work across processes): caution needed!
Note: for a multithreaded program using the classical pipe method to
communicate with children, the sequence between the
For multithreaded programs, both fork() and
vfork() introduce possibilities
of deadlock: a resource held by a non-forking thread in the
parent process can't be released in the child, because that thread is not
duplicated. This used to happen from time to time in recollindex because
of an error logging call performed if the
exec() failed after the fork()
(e.g. command not found).
With vfork() it is also possible to trigger a deadlock in the parent by
(inadvertently) modifying data in the child. This could happen just because
of dynamic linker operation (which, seriously, should be considered a bug).
In general, the state of program data in the child process is a semi-random
snapshot of what it was in the parent, and the official word about what you
can do is that you can only call async-signal-safe functions between fork() and
exec(). These are functions which are
safe to call from a signal handler because they are either reentrant or
can't be interrupted by a signal. A notable missing entry in the list is malloc().
These are normally not issues for programs which only fork to execute another program (but the devil is in the details as demonstrated by the logging call issue…).
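One common way to report an exec() failure without calling a logger in the child is to pass errno back to the parent over a close-on-exec pipe. The sketch below illustrates that technique; it is not what recollindex does, start_command() is a hypothetical name, and it ignores the descriptor-inheritance races that can arise when several threads start children concurrently.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: start 'path' with 'argv'. Returns the child pid,
 * or -1 with errno set, including when exec() failed in the child. */
pid_t start_command(const char *path, char *const argv[])
{
    int errpipe[2];
    if (pipe(errpipe) < 0)
        return -1;
    /* A successful exec() closes the write end automatically, so the
     * parent's read() then returns 0 (end of file). */
    if (fcntl(errpipe[1], F_SETFD, FD_CLOEXEC) < 0) {
        int saved = errno;
        close(errpipe[0]);
        close(errpipe[1]);
        errno = saved;
        return -1;
    }
    pid_t pid = fork();
    if (pid < 0) {
        int saved = errno;
        close(errpipe[0]);
        close(errpipe[1]);
        errno = saved;
        return -1;
    }
    if (pid == 0) {
        /* Child: only async-signal-safe calls from here on. */
        close(errpipe[0]);
        execv(path, argv);
        int err = errno;
        if (write(errpipe[1], &err, sizeof err) < 0) {
            /* nothing more can safely be done here */
        }
        _exit(127);
    }
    close(errpipe[1]);
    int err = 0;
    ssize_t n;
    do {
        n = read(errpipe[0], &err, sizeof err);
    } while (n < 0 && errno == EINTR);
    close(errpipe[0]);
    if (n == (ssize_t)sizeof err) {
        waitpid(pid, NULL, 0);   /* reap the child whose exec() failed */
        errno = err;
        return -1;
    }
    return pid;                  /* exec() succeeded, child is running */
}
```

The parent can then log the failure itself, in its own intact process context, where taking the logger's locks is safe.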
One of the approaches often proposed for working around this minefield is
to use a small auxiliary process to execute any command needed by the main
one. The small process can just use fork()/
exec() with no performance
issues. The drawback is that this complicates communication a lot if
data needs to be transferred one way or another.
The posix_spawn() Linux non-event
Given the performance issues of
fork() and the tricky behaviour of vfork(),
a "simpler" method for starting a child process was introduced by Posix.
The posix_spawn() function is a black box, externally equivalent to a
fork()/exec() sequence, and has parameters to specify the usual
house-keeping performed at this time (file descriptors and signals
management etc.). Hiding the internals gives the system a chance to
optimize the performance and avoid
vfork() pitfalls like the
lockup described in the Oracle article.
The Linux posix_spawn() is implemented by a fork()/
exec() pair by default.
vfork() is used either if requested by an input flag or if no
signal, scheduler or process group changes are requested. There must be a
reason why signal handling changes would preclude
vfork() usage, but I
could not find it (signal handling data is stored in the kernel task_struct).
The Linux glibc
posix_spawn() currently does nothing that user code could
not do. Using it would still probably be a good future-proofing idea, except
for one significant problem: there is no way to specify closing all open
descriptors bigger than a specified value (closefrom() equivalent). This is
available on Solaris and quite necessary in fact, because we have no way to
be sure that all open descriptors have the CLOEXEC flag set.
So, no posix_spawn() for us (support was implemented inside
recollindex, but the code is normally not used).
The chosen solution
The previous version of
recollindex used
vfork() if it was running
a single thread, and
fork() if it ran multiple ones.
After another careful look at the code, I could see few issues with
vfork() in the multithreaded indexer, so this was committed.
The only change necessary was to get rid of an implementation of the
closefrom() call (used to close all open descriptors above a
given value). The previous Recoll implementation listed a
directory to look for open descriptors, but this was unsafe because of
possible memory allocations in the directory-reading calls.
No surprise here: given the implementation of
posix_spawn(), it gets the
same times as the fork()/vfork() options.
The tests were performed on an Intel Core i5 750 (4 cores, 4 threads).
It would be painful to play it safe and discard the 60% reduction in
execution time offered by using
vfork(), so this was adopted for Recoll
1.21. To this day, no problems have been discovered, but, still crossing
fingers…
The last line in the table is just for fun: recollindex 1.18 (single-threaded) needed almost 6 times as long to process the same files…