Document types

Recoll knows about quite a few different document types. The parameters for document type recognition and processing are set in configuration files.

Most file types, like HTML or word processing files, only hold one document. Some file types, like email folders or zip archives, can hold many individually indexed documents, which may themselves be compound ones. Such hierarchies can go quite deep, and Recoll can process, for example, a LibreOffice document stored as an attachment to an email message inside an email folder archived in a zip file...

The Recoll indexer processes plain text, HTML, OpenDocument (OpenOffice/LibreOffice), email formats, and a few others internally.

Other file types (e.g. PostScript, PDF, MS Word, RTF ...) need external applications for preprocessing. The list is in the installation section. After every indexing operation, Recoll updates a list of the commands that would be needed to index existing file types. This list can be displayed by selecting the menu option File → Show Missing Helpers in the recoll GUI. It is stored in the missing text file inside the configuration directory.

By default, Recoll will try to index any file type that it has a way to read. This is sometimes not desirable, and there are ways to either exclude some types, or on the contrary define a positive list of types to be indexed. In the latter case, any type not in the list will be ignored.

Excluding file types can be done by adding wildcard name patterns to the skippedNames list, which can be done from the GUI Index configuration menu. For versions 1.20 and later, you can alternatively set the excludedmimetypes list in the configuration file. This can be redefined for subdirectories.
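
As an illustration, the two exclusion mechanisms could look like this in recoll.conf. The patterns, MIME types, and directory path are placeholders chosen for the example. Assuming that assigning skippedNames replaces the list rather than extending it, a real configuration would also need to repeat any default patterns it wants to keep:

      # Skip files by name pattern (illustrative patterns)
      skippedNames = *.bak *.tmp *~

      # Skip documents by MIME type, Recoll 1.20 and later (illustrative types)
      excludedmimetypes = image/jpeg application/x-tar

      [/path/to/my/dir]
      # Redefinition for one subdirectory (hypothetical path)
      excludedmimetypes = image/jpeg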

You can also define an exclusive list of MIME types to be indexed (no others will be indexed) by setting the indexedmimetypes configuration variable. Example:

      indexedmimetypes = text/html application/pdf

It is possible to redefine this parameter for subdirectories. Example:

      [/path/to/my/dir]
      indexedmimetypes = application/pdf

(When using sections like this, don't forget that they remain in effect until the end of the file or the next section indicator.)
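
A minimal sketch of how section scoping plays out, assuming two hypothetical directories. Settings meant to apply everywhere are placed before the first section indicator:

      # Applies everywhere unless a section overrides it
      indexedmimetypes = text/html application/pdf

      [/path/to/my/dir]
      # In effect from here until the next section indicator
      indexedmimetypes = application/pdf

      [/path/to/other/dir]
      # The previous section ends here; this one runs to the end of the file
      indexedmimetypes = text/html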

The excludedmimetypes and indexedmimetypes variables can be set either by editing the configuration file (recoll.conf) for the index, or by using the GUI index configuration tool.

Note about MIME types

When editing the indexedmimetypes or excludedmimetypes lists, you should use the MIME values listed in the mimemap file or in Recoll result lists in preference to file -i output, because the two differ in a number of cases. The file -i output should only be used for files without extensions, or for which the extension is not listed in mimemap.
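
For orientation, mimemap entries associate a file name suffix with a MIME type, and the right-hand value is what belongs in indexedmimetypes or excludedmimetypes. The lines below are a sketch of what such entries typically look like, not a verbatim copy of the distributed file, so check your own mimemap for the exact values:

      .html = text/html
      .pdf = application/pdf
      .txt = text/plain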