Recoll features
General features
- Easy installation, few dependencies. No database daemon, web server, desktop environment or exotic language necessary.
- Will run on most Unix-based systems, and on MS-Windows too.
- Qt GUI, plus command line, Unity Lens, KIO and krunner interfaces.
- Searches most common document types, emails and their attachments. Transparently handles decompression (gzip, bzip2).
- Powerful query facilities, with boolean searches, phrases, proximity, wildcards, filter on file types and directory tree.
- Multi-language and multi-character set with Unicode based internals.
- Extensive documentation, with a complete user manual and manual pages for each command.
Supported systems
Recoll has been compiled and tested on Linux, MS-Windows 7-10, MacOS X and Solaris (initial versions Redhat 7, Fedora Core 5, Suse 10, Gentoo, Debian 3.1, Solaris 8). It should compile and run on all subsequent releases of these systems and probably a few others too.
Qt versions 4.7 and later.
Document types
Recoll can index many document types (along with their compressed versions). Some types are handled internally (no external application needed). Other types need a separate application to be installed to extract the text. Types that only need very common utilities (awk/sed/groff/iconv, Python etc.) are listed in the native section.
The MS-Windows installer includes the supporting applications. The only additional package you will need is the Python language installation.
Many formats are processed by Python scripts. The Python dependency will not always be mentioned. In general, Recoll up to 1.24 expects Python 2.x to be available. Recoll 1.25 and later rely on Python3 (most scripts are actually compatible with both versions). Formats which are processed using Python and its standard library only are listed in the native section.
Some Python scripts need the Python2 'future' module (smoothing the transition to Python3). This is the case, e.g. for the Excel sheet handler.
File types indexed natively
- text.
- html.
- maildir, mh, and mailbox (Mozilla, Thunderbird and Evolution mail ok). Evolution note: be sure to remove .cache from the skippedNames list in the GUI Indexing preferences/Local Parameters/ pane if you want to index local copies of IMAP mail.
- gaim and purple log files.
- Scribus files.
- Man pages (needs groff).
- Mimehtml web archive format (support based on the mail filter, which introduces some mild weirdness, but still usable).
- All the following need Python2 or Python3:
- Dia diagrams.
- Excel and Powerpoint (pre-open-xml).
- Tar archives. Tar file indexing is disabled by default (because tar archives don't typically contain the kind of documents that people search for). You will need to enable it explicitly, for example with the following in your $HOME/.recoll/mimeconf file:
    [index]
    application/x-tar = execm rcltar
- Zip archives.
- Konqueror webarchive format (uses the tarfile Python standard library module).
File types indexed with external helpers
The XML ones
The following types need xsltproc from the libxslt package:
- Abiword files.
- Fb2 ebooks.
- Kword files.
- Microsoft Office Open XML files.
- OpenOffice files.
- SVG files.
- Gnumeric files.
- Okular annotations files.
Other formats
The following need miscellaneous helper programs to decode the internal formats.
- pdf with the pdftotext command, which comes with poppler (the package name is quite often poppler-utils). Note: the older pdftotext command which comes with xpdf is not compatible with Recoll.
New in 1.21: if the tesseract OCR application, and the pdftoppm command are available on the system, the rclpdf filter has the capability to run OCR. See the comments at the top of rclpdf (usually found in /usr/share/recoll/filters) for how to enable this and configuration details.
Opening PDFs at the right page: the default configuration uses evince, which has options for direct page access and pre-setting the search strings (hits will be highlighted). There is an example line in the default mimeview for doing the same thing with qpdfview (qpdfview --search %s %f#%p). Okular does not have a search string option (but it does have a page number one).
- msword with antiword. There is a very slightly improved antiword version on the opensourceprojects.eu site, it can extract a little extra data in some cases. It is also useful to have wvWare installed as the handler may use it as a fallback for some files which antiword does not handle.
- Wordperfect with the wpd2html command from libwpd. On some distributions, the command may come with a package named libwpd-tools or such, not the base libwpd package.
- Lyx files (needs Lyx to be installed).
- Powerpoint and Excel with the catdoc utilities up to recoll 1.19.12. Recoll 1.19.12 and later use internal Python filters for Excel and Powerpoint, and catdoc is not needed at all (catdoc did not work on many semi-recent Excel and Powerpoint files).
- CHM (Microsoft help) files with Python, pychm and chmlib. Recoll 1.25 and later embed a Python3 version of the CHM package (this is necessary because the original package was not ported to Python3).
- GNU info files with Python and the info command.
- EPUB files with Python and this Python epub decoding module, which is packaged on Fedora, but not Debian. The packaged version by the original author (0.5.2) is old and suffers from a lot of bitrot, so Recoll now bundles an unpackaged version, updated by Arthur Darcet.
- Rar archives (needs Python), the rarfile Python module and the unrar utility. The Python module is packaged as python3-rarfile by both Fedora and Debian. Note that the free version of unrar ("unrar-free") fails for many files with the message "Failed the read enough data".
- 7zip archives (needs Python and the pylzma module). This is a relatively recent addition, and you need to download the filter from the filters pages for all Recoll versions prior to 1.21.
- iCalendar (.ics) files (needs Python, icalendar).
- Mozilla calendar data. See the Howto about this.
- postscript with ghostscript, ps2pdf (part of ghostscript), and pdftotext (from poppler).
- RTF files with unrtf. Please note that up to version 0.21.3, unrtf mostly does not work with non western-european character sets. Many serious problems (crashes with serious security implications and infinite loops) were fixed in unrtf 0.21.8, so you really want to use this or a newer release. Building Unrtf from source is quick and easy.
- TeX with untex. If there is no untex package for your distribution, a source package is stored on this site (as untex has no obvious home). Will also work with detex if this is installed.
- dvi with catdvi.
- djvu with DjVuLibre.
- Audio file tags. Recoll releases 1.14 and later use a Python filter based on mutagen for all audio types.
- Image file tags with exiftool. This is a perl program, so you also need perl on the system. This works with about any possible image file and tag format (jpg, png, tiff, gif etc.).
- Midi karaoke files. Recoll versions up to and including 1.23 use the midi module, and some help from chardet. There is probably a python-chardet package for your distribution, but you will quite probably need to build the midi package. This is easy but see the notes here. Recoll 1.24 and later have incorporated the midi decoding module (modified and ported to python3), and just need the standard Python 'six' module and chardet.
- MediaWiki dump files: Thomas Levine has written a handler for these, you will find it here: rclmwdump.
Other features
- Can use a Firefox extension to index visited Web pages history. See the Howto for more detail.
- Processes all email attachments, and more generally any realistic level of container nesting (the "msword attachment to a message inside a mailbox in a zip" thingy...).
- Multiple selectable databases.
- Powerful query facilities, with boolean searches, phrases, filter on file types and directory tree.
- Xesam-compatible query language.
- Wildcard searches (with a specific and faster function for file names).
- Support for multiple charsets. Internal processing and storage uses Unicode UTF-8.
- Stemming performed at query time (can switch stemming language after indexing).
- Easy installation. No database daemon, web server or exotic language necessary.
- An indexer which runs either as a batch, cron'able program, or as a real-time indexing daemon, depending on preference.
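
The container nesting mentioned in the list above can be pictured as a recursive walk. The following is only a minimal sketch using the Python standard library, not Recoll's actual handler code: the walk_container function, the ".eml" naming convention and the "partN" fallback names are assumptions made for the illustration.

```python
import io
import zipfile
from email import message_from_bytes


def walk_container(name, data, path=()):
    """Yield (path, bytes) for every leaf document found in nested containers.

    Hypothetical helper: handles only zip archives and single messages
    stored as ".eml" files, recursing into whatever they contain.
    """
    here = path + (name,)
    if zipfile.is_zipfile(io.BytesIO(data)):
        # A zip archive: recurse into each member file.
        with zipfile.ZipFile(io.BytesIO(data)) as zf:
            for info in zf.infolist():
                if not info.is_dir():
                    yield from walk_container(info.filename, zf.read(info), here)
    elif name.endswith(".eml"):
        # An email message: recurse into the body and each attachment.
        msg = message_from_bytes(data)
        for i, part in enumerate(msg.walk()):
            if part.is_multipart():
                continue  # multipart nodes only group other parts
            payload = part.get_payload(decode=True) or b""
            part_name = part.get_filename() or f"part{i}"
            yield from walk_container(part_name, payload, here)
    else:
        # Anything else is treated as a leaf document to be indexed.
        yield here, data
```

Each leaf comes back with the chain of containers it was found in, which is the kind of information needed to point a search result back at an attachment inside a message inside an archive.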
Desktop and web integration
The Recoll GUI has many features that help to specify an efficient search and to manage the results. However, it may sometimes be preferable to use a simpler tool with better integration with your desktop interfaces. Several solutions exist:
- The Recoll Web UI lets you query a Recoll index from a web browser. The one linked here, from opensourceprojects.eu, is a bit more up to date than the one on GitHub (koniu), and is the one to use.
- The Recoll Gnome Shell Search Provider allows searching from the Gnome Shell.
- The Recoll KIO module allows starting queries and viewing results from the Konqueror browser or KDE applications Open dialogs.
- The recollrunner krunner module allows integrating Recoll search results into a krunner query.
- The Ubuntu Unity Recoll Lens (or Scope for newer Unity versions) lets you access Recoll search from the Unity Dash. More slightly obsolete information here.
Recoll also has Python and PHP modules which allow easy integration with web or other applications.
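
As a rough idea of what using the Python module can look like, here is a minimal sketch. It assumes the Recoll Python package is installed and that an index already exists for the current user. The build_query and search helpers are hypothetical names made up for this example, and the calls on the recoll module (connect, query, execute, fetchmany, and the url and title document attributes) should be checked against the Python API chapter of the manual for your Recoll version.

```python
def build_query(terms, ext=None, directory=None):
    """Assemble a query language string (hypothetical helper).

    Assumes the ext: and dir: field specifiers of the Recoll query language.
    """
    parts = list(terms)
    if ext:
        parts.append(f"ext:{ext}")
    if directory:
        parts.append(f"dir:{directory}")
    return " ".join(parts)


def search(query_string, limit=10):
    """Run a query against the default index and return (url, title) pairs."""
    # Imported here so that build_query stays usable without Recoll installed.
    from recoll import recoll

    db = recoll.connect()
    query = db.query()
    query.execute(query_string)
    return [(doc.url, doc.title) for doc in query.fetchmany(limit)]
```

A call such as search(build_query(["floor", "plan"], ext="pdf")) would then return up to ten matching PDF documents, assuming the index contains any.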
Stemming
Stemming is a process which transforms inflected words into their most basic form. For example, flooring, floors, floored would probably all be transformed to floor by a stemmer for the English language.
In many search engines, the stemming process occurs during indexing. The index will only contain the stemmed form of words, with exceptions for terms which are detected as being probably proper nouns (i.e. capitalized). At query time, the terms entered by the user are stemmed, then matched against the index.
This process results in a smaller index, but it has the serious drawback of irrevocably losing information during indexing.
Recoll works in a different way. No stemming is performed at indexing time, so that all information gets into the index. The resulting index is bigger, but most people probably don't care much about this nowadays, because they have a 100 GB disk 95% full of binary data which does not get indexed.
At the end of an indexing pass, Recoll builds one or several stemming dictionaries, where all word stems are listed in correspondence to the list of their derivatives.
At query time, by default, user-entered terms are stemmed, then matched against the stem database, and the query is expanded to include all derivatives. This will yield search results analogous to those obtained by a classical engine. The benefit of this approach is that stem expansion can be controlled instantly at query time in several ways:
- It can be selectively turned off for any query term by capitalizing it (Floor).
- The stemming language (e.g. english, french...) can be selected (this supposes that several stemming databases have been built, which can be configured as part of the indexing, or done later, in a reasonably fast way).
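
The query-time expansion described above can be sketched in a few lines of Python. This is an illustration of the idea only, not Recoll's implementation: toy_stem is a crude suffix-stripping stand-in for a real stemmer, and the function names and the lowercasing of capitalized terms are assumptions made for the example.

```python
from collections import defaultdict

# Toy suffix rules, standing in for a real stemming algorithm.
SUFFIXES = ("ing", "ed", "s")


def toy_stem(word):
    """Strip one known suffix, keeping at least three characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def build_stem_db(index_terms):
    """Map each stem to the set of unstemmed terms present in the index."""
    stem_db = defaultdict(set)
    for term in index_terms:
        stem_db[toy_stem(term)].add(term)
    return stem_db


def expand_query_term(term, stem_db):
    """Expand a query term to its derivatives, unless it is capitalized."""
    if term[:1].isupper():
        # Capitalized: stem expansion is turned off for this term.
        return {term.lower()}
    return stem_db.get(toy_stem(term), set()) | {term.lower()}


# The index keeps every word as written; the stem database is built afterwards.
stem_db = build_stem_db(["floor", "floors", "flooring", "floored", "flood"])
print(sorted(expand_query_term("floors", stem_db)))
print(sorted(expand_query_term("Floor", stem_db)))
```

Because the index itself holds the unstemmed words, switching to another stemming language only means building another stem database, not reindexing the documents.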