scasier writes

I was going through the 1.20 changelog when I found the following passage:

> When indexing, we no longer add the top container file-name as a term for the contained sub-documents (if any). This made no sense at all in most cases. However, this was sometimes useful when searching email folders. Complain if you do not like this change, and I'll make it configurable.

One use case where top-container filenames are very important is zipped HTML documents such as .epub files:

With 1.20 the %(filename) result list substitution and the %T substitution for epub results without title metadata have stopped working.

That’s why I think it would be important to make top-container filename indexing configurable.

Thank you for considering this. SC

medoc writes

Hi,

I’ve just pushed a change where I set the container file name on the subdocuments after indexing, which means that it is available for display purposes, but will not cause search matches (I think that the latter is detrimental in a majority of cases).

Could you please try to see if this works for you? I think that this may be actually better than making the change configurable (which could still be done).

scasier writes

Hi,

Thanks again for the swift response!

I just tested the latest release and everything looks to be working very well so far. As far as the display-only change is concerned I think we will have to consider use cases where the user wants to do a filename-specific query. E.g.: You used to be able to search filename:this that and it would look through all HTML subdocuments belonging to epub files with this in their name. That’s not possible anymore.

With this in mind I would suggest reconsidering setting the container file name before indexing.

medoc writes

The problem with indexing the container filename for all subdocs is that you get hits on all the chm, zip or epub pages/chapters/whatever, when what you usually want in this situation is to only get the epub/zip/chm. You could get the result by adding an ext: clause though, and I agree that it may be interesting to be able to filter the results on the container file name in some cases too.

I wonder if the right approach would be to separately index a containerfilename field.

scasier writes

> The problem with indexing the container filename for all subdocs is that you get hits on all the chm, zip or epub pages/chapters/whatever, when what you usually want in this situation is to only get the epub/zip/chm

You’re right. That would certainly be a problem if you were searching for the parent document only.

> I wonder if the right approach would be to separately index a containerfilename field.

This would certainly solve the issue of being able to narrow down results by the name of the container. However, it might also complicate queries quite a bit in cases where the user wants to do a filename-specific search.

Correct me if I’m wrong, but, you would have to use constructs like filename:this containerfilename:this that to narrow down the results, right?

medoc writes

To make any sense this also supposes that I add a class of fields for which the data is not indexed without the field prefix (the current case is that field data is found both with prefixed and unprefixed searches, which is the case for the file name). containerfilename would only be matched for an explicit field search.

Only the embedded documents would get a containerfilename field, and they would also get a filename if the data allows (e.g. if an attachment has a filename attribute, or using the final elements of the path for a zip archive entry).

So, in general, you would not need to use containerfilename at all. The only use case would be when you want to limit a search to embedded documents stored in containers with a matching file name.

So the "normal" search would still be filename:this that. If you want to limit to e.g. chm pages inside a chm file matching this, you could use containerfilename:this that

If this is only found inside the container file name, searching for this would match nothing (different from the 1.19 situation with filename where the this search would return all the chm pages).
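To make the proposed semantics concrete, here is a minimal toy model in Python. It is only a sketch of the idea described above, not Recoll's actual index code: the `ToyIndex` class, the `PREFIX_ONLY_FIELDS` set, the tokenizer, and the sample document names are all hypothetical. Terms from ordinary fields are stored both with and without the field prefix, while `containerfilename` terms are stored with the prefix only.

```python
import re
from collections import defaultdict

# Assumption for illustration: fields in this set are searchable only
# through an explicit "field:term" query, never as bare terms.
PREFIX_ONLY_FIELDS = {"containerfilename"}


def tokenize(text):
    """Split text into lowercase alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


class ToyIndex:
    def __init__(self):
        # Maps "term" or "field:term" to the set of matching document ids.
        self.postings = defaultdict(set)

    def add(self, doc_id, field, text):
        for term in tokenize(text):
            self.postings[f"{field}:{term}"].add(doc_id)
            if field not in PREFIX_ONLY_FIELDS:
                # Ordinary fields also match unprefixed searches.
                self.postings[term].add(doc_id)

    def lookup(self, term, field=None):
        key = f"{field}:{term}" if field else term
        return set(self.postings.get(key, ()))


index = ToyIndex()
# The container itself only has a filename.
index.add("this_guide.chm", "filename", "this_guide.chm")
# An embedded page gets its own filename plus the container's name.
index.add("this_guide.chm|page1", "filename", "page1.html")
index.add("this_guide.chm|page1", "containerfilename", "this_guide.chm")
index.add("this_guide.chm|page1", "body", "that topic")

# A bare search for "this" finds the container, not its pages.
print(index.lookup("this"))
# "containerfilename:this that" narrows to pages inside a matching container.
print(index.lookup("this", field="containerfilename") & index.lookup("that"))
```

In this model the page is reachable through `containerfilename:this` but never through a bare `this`, which is the difference from the 1.19 behaviour.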

scasier writes

Thank you for the explanation.

I have to admit that I am still not quite sure of the benefits of the containerfilename method. It’s probably because I am a bit slow this morning but please bear with me for a moment, just to see if I understood everything correctly. Take the following example:

I have a dozen or so manuals and ebooks on web design, some of them in PDF format, others as epubs. They are all part of a larger document library stretching across a variety of topics.

I am interested in CSS styling so I decide to search through all the documents pertaining to web design. Now, with Recoll 1.19 and prior I would compose my query as follows:

filename:design css

This would yield results both in regular type documents (e.g. PDF) and embedded documents (e.g. HTML pages in an epub file).

But if I understand this passage correctly… :

> Only the embedded documents would get a containerfilename field, and they would also get a filename if the data allows (e.g. if an attachment has a filename attribute, or using the final elements of the path for a zip archive entry).

…with containerfilename in Recoll 1.20 I would have to perform two queries:

  1. filename:design css – for PDF files

  2. containerfilename:design css – for epub-embedded HTML files

Am I correct in this assumption?

medoc writes

OK, you are right, I am the one who was not awake yet.

Then what if I set and index containerfilename also on "simple" files ?

containerfilename would work like a dir: filtering clause: works mostly for filtering, and also as a positive term, but only when prefixed (the path components don’t get indexed with the content terms).

Except if I’m still missing something, which is quite possible, this would seem to work, with the issue that you have to remember to use containerfilename (or a shorter alias) when you want what you describe above.

The alternative would be to go back to the 1.19 situation by default, and add a configuration switch. Any thoughts ?
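As a rough illustration of the first option (setting containerfilename on "simple" files too), the following Python sketch filters a handful of made-up documents. The document names, the term sets, and the `search` helper are hypothetical and stand in for the real index; the point is only that a single prefixed query can then cover both standalone files and embedded pages.

```python
# Hypothetical documents: body terms plus per-field terms.
DOCS = {
    "web_design.pdf": {
        "body": {"css", "layout"},
        "filename": {"web", "design", "pdf"},
        # Assumption: simple files carry their own name here as well.
        "containerfilename": {"web", "design", "pdf"},
    },
    "css_design.epub|chapter1.html": {
        "body": {"css", "selectors"},
        "filename": {"chapter1", "html"},
        "containerfilename": {"css", "design", "epub"},
    },
    "recipes.epub|intro.html": {
        "body": {"soup"},
        "filename": {"intro", "html"},
        "containerfilename": {"recipes", "epub"},
    },
}


def search(field, field_term, body_term):
    """Return documents whose `field` has field_term and whose body has body_term."""
    return sorted(
        name
        for name, doc in DOCS.items()
        if field_term in doc[field] and body_term in doc["body"]
    )


# Equivalent of "filename:design css": only the standalone PDF matches.
print(search("filename", "design", "css"))
# Equivalent of "containerfilename:design css": the PDF and the epub chapter.
print(search("containerfilename", "design", "css"))
```

With this arrangement the two-query workaround is not needed: the prefixed containerfilename search alone returns both kinds of documents.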

scasier writes

> Then what if I set and index containerfilename also on "simple" files ?

I like that idea. It’s a good compromise. I think we should give it a try. Though, due to the length of the tag, we should probably alias it to something like cfn by default.

medoc writes

If you want to give it a try, I just pushed the update. The latest code also has the "query only" aliases. They are not set by default, because of the (slight) risk of conflict with existing user fields. I have the following in my local fields file:

[queryaliases]
filename = fn
containerfilename = cfn

You will need a full reindex.

Usage will hopefully tell us whether this is finally the right approach!
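For readers who want to picture what such query-only aliases do, here is a small Python sketch that rewrites alias prefixes back to their full field names before a query is run. It is an assumption-laden illustration, not Recoll's parser: the `QUERY_ALIASES` mapping simply mirrors the two entries quoted above (one alias per field), and `expand_aliases` is a hypothetical helper.

```python
import re

# Mirrors the [queryaliases] entries quoted above: field name -> alias.
QUERY_ALIASES = {"filename": "fn", "containerfilename": "cfn"}


def expand_aliases(query, aliases=QUERY_ALIASES):
    """Replace alias prefixes such as "cfn:" with the full field name."""
    reverse = {alias: field for field, alias in aliases.items()}

    def replace_prefix(match):
        prefix = match.group(1)
        # Unknown prefixes (e.g. "ext") are left untouched.
        return reverse.get(prefix, prefix) + ":"

    # A prefix is a word followed by ":" at the start of a
    # whitespace-separated token.
    return re.sub(r"(?<!\S)(\w+):", replace_prefix, query)


print(expand_aliases("cfn:design css"))
print(expand_aliases("ext:pdf fn:this that"))
```

Because the rewrite happens only at query time in this sketch, the aliases never need to exist as separate fields in the index.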

scasier writes

Thanks again for the swift implementation.

I’ve just tried the latest commit out on a small sample index. Unfortunately I have to report that the filename search seems to have stopped working, after a full reindex that is. The containerfilename tag doesn’t seem to be working, either. I can still search by filename if I switch to the dedicated filename search.

Just to be sure I also went through all of my configuration files and reset them to default, which didn’t help. To confirm that this was actually an issue with the latest commit, I switched back to the previous build I was using and reindexed the sample files. After that everything was working again.

Edit: After resetting recoll.conf as well, filename and containerfilename seem to work again. Still trying to track down what went wrong the first time around.

scasier writes

OK, I think I identified the regression: I use indexStripChars = 0 to preserve case sensitivity. When I set it to 1, the filename and containerfilename searches start working again.

Edit: Just to be clear, none of the possible query variations worked in my tests (e.g. filename:this filename:This filename=*this*)

medoc writes

Thanks for finding this out! I need to widen the "test suite". I had broken case-sensitive indexing. It should be repaired now.

scasier writes

The fix is working great so far on my sample index. Good job!

I will do some more thorough testing with my regular index tomorrow.

scasier writes

After a few hours of testing I can say that everything is working perfectly fine. I really like the queryaliases addition. Has saved me quite a lot of typing.

I think it was the right decision to move away from how Recoll handled file containers in 1.19 and below. Having all those extra hits on the subdocuments when all you wanted was to search for the container file was pretty annoying.

Now we have the luxury of being able to choose between searching for the containers only (fn) or traversing their subdocs (cfn) for hits.

So, all in all, a great addition to Recoll!

As far as I am concerned we can close this now.

medoc writes

Thanks for your help with this, I just need to write a little doc now :)