Character case and diacritic marks (1), issues with stemming

Case and diacritics in Recoll

Recoll versions up to 1.17 almost fully ignore character case and diacritic marks.

All terms are converted to lower case and unaccented before they are written to the index. There are only two exceptions:

  • File paths (as used in dir: clauses) are not converted. This might be a bug or a feature, but the main reason is that we don’t know how they are encoded.

  • It is possible to specify that some characters will keep their diacritic marks, because the entity formed by the character and the diacritic mark is considered to be a different letter, not a modified one. This is highly dependent on the language. For example, in Swedish, å should be preserved, not turned into a.

As a necessary consequence, the same transformations are applied to search terms, and it is impossible to search for a specific capitalization of a word (US is looked for as us), or a specific accented form (café will be looked for as cafe).
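
The folding described above can be sketched in a few lines of Python. This is only an illustration, not Recoll's actual code: the fold and unaccent function names and the PRESERVED set are assumptions made for the example.

```python
import unicodedata

# Hypothetical set of characters that keep their diacritics (Swedish example).
PRESERVED = frozenset("åÅ")

def unaccent(term, preserved=PRESERVED):
    """Remove diacritic marks from term, except on preserved characters."""
    out = []
    for ch in unicodedata.normalize("NFC", term):
        if ch in preserved:
            out.append(ch)
            continue
        # Decompose the character and drop the combining marks.
        decomposed = unicodedata.normalize("NFD", ch)
        out.append("".join(c for c in decomposed if not unicodedata.combining(c)))
    return "".join(out)

def fold(term, preserved=PRESERVED):
    """Unaccent then lower-case: applied to index terms and search terms alike."""
    return unaccent(term, preserved).lower()

print(fold("Café"), fold("US"), fold("Århus"))  # cafe us århus
```

Because the same fold is applied on both sides, US and us, or café and cafe, can no longer be told apart at search time.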

However, there are some cases where you would like to be more specific:

  • Searching for US or us should probably return different results.

  • Diacritics are seldom significant in English, but we can find a few examples anyway: sake and saké, mate and maté. Of course, there are many more cases in languages which use more diacritics.

On the other hand, accents are often mistyped or forgotten (résumé, résume, resume?), and capitalization is most often insignificant. It is therefore very important to retain the capability to ignore accent and character case differences, and to make the discrimination easy to switch on or off for each search (or even for specific terms).

This text and the pages that follow discuss issues in adding character case and diacritics sensitivity to Recoll, under the assumption that the main index will contain the raw source terms instead of case-folded and unaccented ones.

The following uses the neologism “unaccent” to mean “remove diacritic marks” (and not only accents).

English examples are used when possible, but given the limited use of diacritics in English, some French will probably creep in.

Diacritics and stemming

Stemming is the process by which we extend a search to terms related by grammatical inflexion, such as singular/plural forms or verb tenses. For example, a search for floor is normally expanded by Recoll to floors, floored, flooring, …

In practice, Recoll has a separate data structure that has stemmed terms (stems) as keys, each pointing to a list of expansion terms: {{{floor → (floor,floors,floorings,…)}}}
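
As a rough sketch of such a structure, the stem expansion table can be modelled as a dictionary built from the index terms. The toy_stem function below is a deliberately naive stand-in for a real stemmer, and build_stem_db and expand are names invented for this example.

```python
from collections import defaultdict

def toy_stem(term):
    """Toy English stemmer: strips a few suffixes. A stand-in, not a real stemmer."""
    for suffix in ("ings", "ing", "ed", "s"):
        if term.endswith(suffix) and len(term) - len(suffix) >= 3:
            return term[: -len(suffix)]
    return term

def build_stem_db(index_terms, stem=toy_stem):
    """Map each stem to the sorted list of index terms that share it."""
    db = defaultdict(set)
    for term in index_terms:
        db[stem(term)].add(term)
    return {s: sorted(terms) for s, terms in db.items()}

def expand(query_term, stem_db, stem=toy_stem):
    """Return the stemming siblings of query_term, or the term alone if none."""
    return stem_db.get(stem(query_term), [query_term])

stem_db = build_stem_db(["floor", "floors", "floored", "flooring", "door"])
print(expand("floors", stem_db))  # ['floor', 'floored', 'flooring', 'floors']
```

Only terms actually present in the index appear in the expansion lists, which is what makes the cases discussed below possible.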

Stemming should be applied to terms before they are stripped of diacritics. Accents may have a grammatical significance, and the accent may change how the term is stemmed. For example, in French the âmes suffix generally marks a past conjugation but ames does not. The standard Xapian French stemmer will turn évitâmes (avoided) into an évit stem, but évitames will be turned into évitam (stripping plural and feminine suffixes).

When the search is set to ignore diacritics, this poses a specific problem: if the user enters the search term without accents (which is correct because the system is supposed to ignore them), there is no guarantee that the term will be correctly expanded by stemming.

The diacritic mismatch breaks the family relationship between the stem siblings, and this is independent of the type of index: it will happen with an index where diacritics are stripped just as with a raw one.

The simpler case, where diacritics in the original term only affect diacritics in the stem, also requires specific processing, but it is easier to work around.

Two examples illustrating these issues follow.

The simple case: diacritics in the term only affect diacritics in the stem

Let’s imagine that the document set contains the term éviter (infinitive of to avoid), but not évite (present). The only term in the actual index is then éviter.

The user enters an unaccented evite, counting on the diacritics-insensitive search mode to deal with the accents. As évite is not present in the index, we have no way to guess that evite is really évite.

The stemmer will turn evite into evit. There is no way that this can be related to éviter, and this legitimate result can’t be found.

There is a way around this: we can compute a separate stem expansion dictionary for unaccented terms. This dictionary, to be used with diacritics-insensitive searches only, contains the relationship between evit and eviter (as éviter is in the index). We can then relate eviter and éviter because they differ only by accents, and the search will find the document with éviter.
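
The workaround can be sketched as follows. This is a minimal illustration under assumptions: toy_stem is a crude suffix stripper that only mimics the stemmer results quoted in this text, and the reverse map from unaccented terms to raw index terms (unac_to_raw) is one possible way to relate eviter and éviter, not necessarily how Recoll would do it.

```python
import unicodedata
from collections import defaultdict

def unaccent(term):
    """Remove all diacritic marks from term."""
    decomposed = unicodedata.normalize("NFD", term)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

# Toy French suffixes, chosen only to reproduce the examples in the text.
SUFFIXES = tuple(unicodedata.normalize("NFC", s) for s in ("âmes", "er", "es", "e", "s"))

def toy_stem(term):
    """Toy stand-in for a French stemmer (not the Xapian one)."""
    term = unicodedata.normalize("NFC", term)
    for suffix in SUFFIXES:
        if term.endswith(suffix) and len(term) - len(suffix) >= 3:
            return term[: -len(suffix)]
    return term

index_terms = ["éviter"]  # raw terms, as stored in the index

raw_stem_db = defaultdict(set)   # stem of raw term -> raw terms
unac_stem_db = defaultdict(set)  # stem of unaccented term -> unaccented terms
unac_to_raw = defaultdict(set)   # unaccented term -> raw index terms
for term in index_terms:
    bare = unaccent(term)
    raw_stem_db[toy_stem(term)].add(term)
    unac_stem_db[toy_stem(bare)].add(bare)
    unac_to_raw[bare].add(term)

def expand_insensitive(user_term):
    """Stem expansion for a diacritics-insensitive search."""
    bare = unaccent(user_term)
    siblings = unac_stem_db.get(toy_stem(bare), {bare})
    found = set()
    for sibling in siblings:
        found |= unac_to_raw.get(sibling, set())
    return sorted(found)

print(toy_stem("evite") in raw_stem_db)  # False: evit is unknown to the raw table
print(expand_insensitive("evite"))       # ['éviter']
```

The raw table only knows the évit stem, so the lookup has to go through the unaccented table and then back to the raw index terms.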

The bad case: diacritics in the term change the stem beyond diacritics

Some grammatically significant accents will cause unexpectedly missing search results when using a supposedly diacritics-insensitive search mode.

Let’s imagine that the document set contains the term éviter (infinitive of to avoid), but not évitâmes (past). So the stemming expansion table has an entry for évit → éviter.

If the user enters an unaccented evitames, she would expect to find the documents containing éviter in the results, because the latter term is a stemming sibling of évitâmes and the search is supposedly not influenced by diacritics, so that evitames and évitâmes should be equivalent.

However, our search is now in trouble, because évitâmes is not in any document, so that there is no data in the index which would inform us about how to transform the input term into something that differs only by accents but would yield a correct input for the stemmer.

If we try to feed the raw user input to the stemmer, it will propose an evitam stem, which will not work, because the stem that actually exists is évit, and evitam cannot be related to éviter.

The only palliative approach I can think of would be a spelling correction of the input, performed independently of the actual index contents, which would notice that evitames is not a French word and propose a change or an expansion to évitâmes, which would correctly stem to évit and allow us to find éviter.
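
To make the idea concrete, here is a small sketch of the failure and of the palliative. Everything in it is hypothetical: FRENCH_LEXICON stands for a word list that exists independently of the index, toy_stem is a crude suffix stripper that only mimics the stemmer results quoted above, and the "spelling correction" is reduced to looking up lexicon words that differ from the input only by accents.

```python
import unicodedata

def unaccent(term):
    """Remove all diacritic marks from term."""
    decomposed = unicodedata.normalize("NFD", term)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

# Toy French suffixes, chosen only to reproduce the examples in the text.
SUFFIXES = tuple(unicodedata.normalize("NFC", s) for s in ("âmes", "er", "es", "e", "s"))

def toy_stem(term):
    """Toy stand-in for a French stemmer (not the Xapian one)."""
    term = unicodedata.normalize("NFC", term)
    for suffix in SUFFIXES:
        if term.endswith(suffix) and len(term) - len(suffix) >= 3:
            return term[: -len(suffix)]
    return term

# Hypothetical language lexicon, independent of the index contents.
FRENCH_LEXICON = {"éviter", "évite", "évitâmes"}

# Stem expansion table built from the only term in the index.
raw_stem_db = {}
for term in ["éviter"]:
    raw_stem_db.setdefault(toy_stem(term), []).append(term)

def accent_candidates(user_term, lexicon=FRENCH_LEXICON):
    """Lexicon words that differ from the user input only by diacritics."""
    bare = unaccent(user_term)
    return sorted(w for w in lexicon if unaccent(w) == bare)

def expand_with_correction(user_term):
    """Re-accent the input from the lexicon, then stem and expand."""
    results = set()
    for candidate in accent_candidates(user_term) or [user_term]:
        results.update(raw_stem_db.get(toy_stem(candidate), []))
    return sorted(results)

print(toy_stem("evitames"))                # evitam: leads nowhere
print(accent_candidates("evitames"))       # ['évitâmes']
print(expand_with_correction("evitames"))  # ['éviter']
```

The quality of the result then depends entirely on the lexicon: a word missing from it leaves us with the raw input and the wrong stem.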

This issue is not specific to Recoll, or indeed to whether the index retains accents or not. As far as I can see, it is an intrinsic bad interaction between diacritics insensitivity and stemming.

It is also interesting to note that this case becomes less probable when the data set becomes bigger, because more term inflexions will then be present in the index.

We’ll next think about an appropriate interface.