humble_user writes

(Just installed Recoll. I must say it looks excellent, and I’m most excited about its potential to enhance my research work / files).

I’ve filed this as a bug because the documentation says Recoll should search for strings that include punctuation. But perhaps there is something I can do to circumvent this behaviour?

My search, using the advanced search GUI, was to search in proximity of an "inconsequential string" for a specific string: "t." - that is a "t" followed by a full-stop ".".

The search failed, though the user manual says Recoll should be able to search for strings with punctuation in them.

The punctuation in my search string is at the end. This particular search string is one I am going to have to use, unless I change the semantics of my research notes to work with the software. That’s possible, but not ideal. (I could also get round this problem by reading the manual to see how to design a sophisticated search to incorporate the variable options that would usually appear in my research notes after a "t.").

The search described above returns every instance where "t" appears standalone to Recoll. This includes my instances of "t.", but without the ".". It also includes instances where "t" ends a word with an apostrophe: i.e. "wasn’t", "hasn’t" and so on. Recoll also displays these instances without the punctuation. It displays them as " t", with a blank space where the apostrophe is in the text. So the results for my search included a jumble of t’s disconnected from the adjacent punctuation that defines them.

So the punctuation is not seen. What Recoll is finding are instances where "t" appears to be standalone, but it only appears that way because the adjacent punctuation isn’t seen.

To confirm that Recoll is overlooking punctuation at the end of a search string, I also tried "clear." followed by a double quote mark. Wrapped in single quotes, the search string would thus be: 'clear."' - and without any wrapping quotes at all, just to be clear: clear."

Recoll found instances where the word "clear" (and derivatives: e.g. "clearly") appeared within proximity of "inconsequential string". It did not find what was intended, which was only those instances where the word "clear" appears at the end of a quote in a text.

Hope this is of help.

BTW, I searched for the Recoll log file (using Gnome find), which I did assign a name when I set the initial preferences, but I cannot find it. :) Would you please tell me where I should look to retrieve it, should you need it?

(I have not, incidentally, found a word in the priority list that seemed suitable for this bug report - it’s crucial to me, so I can’t call it "trivial". But I appreciate it’s not "major". So "Blocker"?).

medoc writes

Hi,

I’ll begin with the end: to name the log file, go to the "Preferences->Indexing configuration" pane, where you can set the log file name. Give it an absolute path (like "/tmp/logrecoll" or such), so you’ll have no trouble finding it.

About punctuation: Recoll almost never keeps final punctuation. I think that there are a few exceptions for # and + (c# and c++).

The general rule is that the text splitter generates compound terms with internal punctuation when the compound makes sense as a search term. This avoids generating a phrase search in this situation and yields better performance.

The typical case is an email address: jf@dockes.org will generate both [jf, dockes, org] terms and a compound [jf@dockes.org], so that searching for jf@dockes.org can be done with the compound term and be very fast, as opposed to searching for a phrase: "jf dockes org" which would be slow.

In this context, final punctuation almost never makes sense, so it is not kept. I am sorry that this conflicts with your use case.

Also, only one compound term is generated for such a sequence (smith@ic.ac.uk but not ic.ac.uk or ac.uk or smith@ic).
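To make the rule above concrete, here is a minimal toy sketch in Python. It is not Recoll's actual splitter (which is written in C++); the function name `split_terms` and the exact punctuation sets are assumptions made for illustration, and the `#` / `+` exceptions are not modelled. It only shows the three behaviours described: component terms are emitted, a single compound is emitted when there is internal punctuation, and final punctuation is dropped.

```python
import re

# Hypothetical toy splitter, for illustration only; NOT Recoll's real code.
# Assumption: these are the characters treated as leading/trailing punctuation.
EDGE_PUNCTUATION = ".,;:!?\"'()[]"

def split_terms(text):
    """Return index terms for text: components, plus one compound per chunk."""
    terms = []
    for chunk in text.split():
        # Final (and leading) punctuation is not kept.
        core = chunk.strip(EDGE_PUNCTUATION)
        if not core:
            continue
        # Component terms: split on anything that is not a letter or digit.
        parts = [p.lower() for p in re.split(r"[^A-Za-z0-9]+", core) if p]
        terms.extend(parts)
        # Exactly one compound term for the whole sequence, and only
        # when internal punctuation actually joined several components.
        if len(parts) > 1:
            terms.append(core.lower())
    return terms

print(split_terms("jf@dockes.org"))
print(split_terms("t."))
```

In this sketch "t." is reduced to the bare term "t" before anything is indexed, which is why a search for "t." cannot be told apart from a search for "t".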

I need to precisely document the whole thing somewhere.

I checked that it would be quite easy to change the text splitter to keep final full stops though. If you can consider working with a locally built recoll, the patch would probably be small enough to be easily reapplied on future releases with practically no knowledge of C++.

Have you considered using something based on "grep" to look for this kind of thing? It might actually be easier.
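As a rough idea of what a grep-like approach could look like, here is a small Python sketch that searches text literally, punctuation included. The function name `find_literal` and the sample text are made up for illustration; the proximity condition from the original search is not handled. The needle must start at the beginning of a line or right after whitespace, so the "t." inside a word such as "wasn’t." is not reported.

```python
import re

def find_literal(text, needle):
    """Return (line_number, line) pairs where needle occurs literally,
    starting at the beginning of a line or right after whitespace."""
    # re.escape keeps "." and '"' literal instead of regex metacharacters.
    pattern = re.compile(r"(?<!\S)" + re.escape(needle))
    return [
        (number, line)
        for number, line in enumerate(text.splitlines(), 1)
        if pattern.search(line)
    ]

# Made-up sample notes, for illustration only.
sample = 'see t. 12 here\nit wasn’t.\nhe said "all is clear."\nclearly not'

print(find_literal(sample, "t."))
print(find_literal(sample, 'clear."'))
```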

jf

medoc writes

Thought about the final full stop again (especially low mind from mid-December to mid-January). Can’t change this, as it would affect the final word of most sentences.