dnschneid writes

Recoll 1.17.0 + Xapian 1.2.8

This is a bit of a grey area, as malformed HTML is, well, malformed, but:

Some instant messaging programs are broken and send bad HTML in their messages. Pidgin/Purple will happily drop the bad HTML into the log files (which are now just standard HTML files, so they don’t go through any Recoll filters), causing Recoll to silently fail to index the body of those log files. I’ve seen cases ranging from a stray "<body>" tag in the message to a broken HTML tag, like "<HTML>".

While it’s clearly the fault of the sending instant messaging program for sending bad HTML, and it’s partially the fault of libpurple for not fully sanitizing the message text in the logs, I’m sure there are other cases where a program or filter may not generate fully-valid HTML (HTML mail is the first that comes to mind), confusing the user when Recoll doesn’t return it in results.

I’m not too familiar with the internal workings of Recoll, so I’m not sure if this case could be handled by strengthening the HTML parsing against bad files, or if a new filter for Pidgin logs (or more generically, bad HTML files) is necessary as a workaround, in which case, recollindex should at least spit out a message if it fails to parse an HTML file. Regardless, it would make sense for one to expect Recoll to automatically and reliably handle the fairly standard case of indexing instant messaging logs.

I’ve attached a tarball of three simplified log files. One is totally valid, and the other two have messages that came from malformed clients. All three should return a result for the word "Keyword", but only the fully-valid one does. Also attached is the recollindex log when indexing the three files; nothing appears to be wrong.
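For anyone wanting to check their own logs for the same problem, here is a rough sketch in Python (not part of Recoll or Pidgin) that flags a file containing more than one opening or closing html/body tag. The class and function names are hypothetical, and it assumes that a repeated structural tag is what trips up the indexer, as in the cases described above.

```python
from collections import Counter
from html.parser import HTMLParser


class StructuralTagCounter(HTMLParser):
    """Counts opening and closing <html>/<body> tags in a document."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names, so <HTML> is counted as html.
        if tag in ("html", "body"):
            self.counts["<%s>" % tag] += 1

    def handle_endtag(self, tag):
        if tag in ("html", "body"):
            self.counts["</%s>" % tag] += 1


def has_stray_structural_tags(markup):
    """Return True if any html/body tag (open or close) appears more than once."""
    counter = StructuralTagCounter()
    counter.feed(markup)
    counter.close()
    return any(n > 1 for n in counter.counts.values())
```

Running it over the text of each log file should single out those where a message body carried its own structural tags.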

medoc writes

Html: Just ignore opening and closing <body> and <html> tags. Current browsers show text before or after the body and ignore multiple body tags. Not pushed to 1.17 maint because of possible disruption. Closes issue #92

→ <<cset 66db481f34b8>>
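The idea in the commit message above can be illustrated with a minimal Python sketch. This is not the actual Recoll change (Recoll's HTML handling is C++ and is not shown in this thread); it only demonstrates the approach of treating html and body tags as no-ops so that text before, after, or between them is still collected. The names `LenientTextExtractor` and `extract_text` are hypothetical.

```python
from html.parser import HTMLParser


class LenientTextExtractor(HTMLParser):
    """Collects text, ignoring <html>/<body> tags wherever they appear."""

    STRUCTURAL = {"html", "body"}   # tags treated as no-ops
    SKIPPED = {"script", "style"}   # tags whose content is not indexable text

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.STRUCTURAL:
            return  # a stray or repeated <body>/<html> changes nothing
        if tag in self.SKIPPED:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.STRUCTURAL:
            return  # a premature </body> does not end text collection
        if tag in self.SKIPPED and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Keep non-empty text that is not inside script/style.
        if self.skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def extract_text(markup):
    """Return the visible text of markup as a single space-joined string."""
    parser = LenientTextExtractor()
    parser.feed(markup)
    parser.close()
    return " ".join(parser.chunks)
```

With this approach, a log line such as `<html><body>hi</body> Keyword <body>more</body></html>` still yields the word "Keyword", which is the behaviour the test files in this issue expect.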

medoc writes

Thanks for the detailed report, rationale and test files!

jf

dnschneid writes

Thanks for addressing it so quickly!

dnschneid writes

I just tried the new 1.17.2 release (from the PPA), and it still seems to fail my test case. Is the fix not included in that build?

dnschneid writes

Just looked at this again and realized what you meant by not pushing it to 1.17; I confused the "known bugs" page with "bugs resolved."

Looking forward to the fix being applied in 1.18(?).