Unknown reporter writes

pasted from recoll search results window:

1% 1 MB Preview Open ppucd.txt

text/plain 2015-12-14 18:36:01 -0500 file:///i/p2/intl/icu/source/data/unidata/ppucd.txt 0

The file does exist, and its path is: /i/p2/intl/icu/source/data/unidata/ppucd.txt

It is a 1.6 MB (or so) "text file"; AFAIK its character encoding is UTF-8 (without BOM). Its file permissions are -rw-r--r-- and recoll is running "as root".

My indexed local file copy is identical to (as far as I can tell) the version of the file archived online here:

FWIW, the search term in this case (when I FINALLY was able to pin down what the heck is going on) was the string "preparsed". However, the issue I’m describing has occurred when various other search terms have been used. Also, the specific file I’m reporting is just an example; other files have triggered this behavior.

When the "Open" link in the recoll search results is clicked for any such affected file (refer to the "ppucd.txt 0" portion of the search result pasted above), recoll seems to be "recognizing" the file-on-disk as an archive file and "extracting" it: it creates a rcltmpXXXXXX.py or rcltmpXXXXXX.srt file in /tmp and passes the path of that tempfile on the command line to the opener/editor app (in this case, geany, FWIW).

Examining these "rcltmp……" files, which have been accumulating in my /tmp directory over the past weeks, today I opened one in an editor, guessed that "preparsed" would be a seldom-occurring word, and used it to find the original file, as reported above. Rebuilding the recoll index has not resolved the problem.

I now understand (or believe) that the dozens of rcltmp*.srt and rcltmp*.py files are repeated copies of the same few (5-6) files, which have been opened/viewed numerous times via the recoll search results page. As far as I can tell, all the others are the "legitimate" result of recoll extracting indexed files which reside, on disk, within archive file(s). In a separate ticket, I’ll explain a related usability issue regarding "opening" these from the recoll search results pane.
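One way to check that the accumulated temp files really are repeated copies of a few originals is to group them by content hash. The following is only a minimal sketch: the function name `group_duplicates` is made up for illustration, and the only things it assumes are that the files sit in one directory and match the rcltmp* pattern described above.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def group_duplicates(directory: str, pattern: str = "rcltmp*") -> dict[str, list[Path]]:
    """Group files matching `pattern` in `directory` by the SHA-256 of their content.

    Hypothetical helper, not part of recoll: files that end up under the
    same digest have byte-identical content.
    """
    groups: dict[str, list[Path]] = defaultdict(list)
    for path in sorted(Path(directory).glob(pattern)):
        if path.is_file():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            groups[digest].append(path)
    return dict(groups)
```

Calling `group_duplicates("/tmp")` and counting the keys would show how many distinct contents are behind the pile of temp files.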

medoc writes

Recoll processes very big text files in "chunks". This was the easiest way to handle them, especially for previewing. You can adjust the size of the chunks with the configuration parameter textfilepagekbs.
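To make the idea concrete, here is a rough sketch of splitting a text into pages of approximately a given size, breaking only at line ends. This is not Recoll's actual code: the function name `split_into_pages` and its `page_kbs` argument are invented for the illustration, and the real indexer may pick its boundaries differently.

```python
def split_into_pages(text: str, page_kbs: int = 1000) -> list[str]:
    """Split `text` into pages of at most roughly `page_kbs` kilobytes each.

    Illustrative only: pages are cut at line boundaries, so a single line
    longer than the limit still ends up whole in a page of its own.
    """
    limit = page_kbs * 1024  # size limit in bytes of UTF-8 encoded text
    pages: list[str] = []
    current: list[str] = []
    size = 0
    for line in text.splitlines(keepends=True):
        n = len(line.encode("utf-8"))
        # Start a new page when adding this line would exceed the limit.
        if current and size + n > limit:
            pages.append("".join(current))
            current, size = [], 0
        current.append(line)
        size += n
    if current:
        pages.append("".join(current))
    return pages
```

With a page size of about 1 MB, a text file of about 1.6 MB comes out of this sketch as two pages, and each page would then be handled as a document of its own.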

There would probably be other ways to address the indexing memory usage and preview performance problems that big text files pose, but this approach really made my life much easier, and this behaviour is quite useful on "monster" files (e.g. logs).

Maybe 1 MB is a bit low as a default nowadays, but the main problem is the one you noted in issue #286: the absence of a warning when you "Open" a chunk.

You can use "Open parent" from the right-click menu to edit the actual document, so I think that there is no other problem here than the one which is going to be addressed for issue #286, and I’m closing this one.