biomimetics writes

In my results list the same document is listed three times (and quite often twice). This happens in both the Qt GUI and on the command line:

mp.vk.files_0286VKs_03574Docs_GZS09.201-09.2013_ZS01.1999-09.2013_0058RVs_0075VVs_0153NAs.csv,

mp.vk.files_0286VKs_03574Docs_GZS09.201-09.2013_ZS01.1999-09.2013_0058RVs_0075VVs_0153NAs.csv,

mp.vk.files_0286VKs_03574Docs_GZS09.201-09.2013_ZS01.1999-09.2013_0058RVs_0075VVs_0153NAs.csv

Any idea how this might have happened, and how I can check that each document in the database is referenced only once?

medoc writes

How big is the document? If it’s a big text doc, it’s indexed in slices and may appear several times in the results, each time with a different ipath value. This is so we can display the right section of a big text file without having to load the whole thing (useful for big logs, for example). To check, either look at the ipath (normally displayed after the url in the result list), or hit preview and see if a different section is displayed for each result.

Or maybe it’s some other problem, but let’s check this first.
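
To make the slicing described above concrete, here is a minimal Python sketch of how one big text file can turn into several result entries. It is not Recoll's actual code: the function name split_into_pages, the use of byte offsets as ipath values, and the choice to cut at line boundaries are all assumptions made for illustration.

```python
def split_into_pages(data: bytes, page_kbs: int):
    """Split a text file into (ipath, chunk) pairs of roughly page_kbs KB each.

    Hypothetical illustration only: here the ipath is simply the byte
    offset of the chunk, and -1 means "do not split at all".
    """
    if page_kbs == -1:
        return [("", data)]  # whole file indexed as one document
    if page_kbs <= 0:
        raise ValueError("page_kbs must be positive, or -1 to disable paging")

    page_size = page_kbs * 1024
    pages = []
    offset = 0
    while offset < len(data):
        end = min(offset + page_size, len(data))
        if end < len(data):
            # Prefer to cut just after the last newline inside the page,
            # so that no line is split across two chunks.
            newline = data.rfind(b"\n", offset, end)
            if newline > offset:
                end = newline + 1
        pages.append((str(offset), data[offset:end]))
        offset = end
    return pages


if __name__ == "__main__":
    sample = b"line\n" * 1000  # 5000 bytes of text
    for ipath, chunk in split_into_pages(sample, page_kbs=1):
        print(f"ipath={ipath!r} size={len(chunk)} bytes")
```

Each pair would show up as its own entry in a result list: same file name, different ipath, different preview text.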

biomimetics writes

It is as you guessed: it is the same document, which apparently was split into parts when it was indexed.

For a search where one document is shown twice in the results list, the first entry is 1 MB and the other is 236 KB. If I click on preview, I get two different sections of the document.

The whole thing is a bit strange, though, as the relevance weighting takes into account the document length and the number of terms found. So this result means the weighting is quite off.

medoc writes

You can disable this by setting the "textfilepagekbs" parameter to -1 in the configuration file. As far as I can see in the code, you can probably crash recoll by setting it to 0 :)

I find it quite useful to divide very big text files into manageable sections, but if this is a problem for ranking, you can disable the behaviour. By the way, there is also a maximum size for text files, set to 20 MB by default: textfilemaxmbs.
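
As a rough sketch, the two settings mentioned above could look like this in the configuration file. The path in the comment assumes the default configuration directory, and the affected files will probably need to be reindexed before the result list changes.

```
# ~/.recoll/recoll.conf (assumed default location)

# Do not split big text files into pages (-1 disables paging; avoid 0)
textfilepagekbs = -1

# Maximum size of text files to index, in MB (20 is the default quoted above)
textfilemaxmbs = 20
```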

medoc writes

Normal Recoll behaviour, adjustable by config edit