jedick writes

Hi. I use Recoll to index some PDFs that have titles such as: An article about <i>Homo sapiens</i>

(actually a custom rclpdf filter converts \emph{…} saved in the PDF metadata to <i>…</i>)

In Recoll 1.17.3, using %(title) in the result paragraph format string would produce the italicized version (An article about Homo sapiens), whereas %T would not: the HTML tags were displayed verbatim.

Testing Recoll 1.18.0 pre-release, both %(title) and %T display the html tags verbatim.

The verbatim rendering appears to be the intended result, based on the resolution of issue #99.

I was fond of using the %(title) trick (or bug) in 1.17.3 to add formatting to the titles in the result paragraph. Is it possible to generate similar behavior in the newer version?

medoc writes

Hi and thanks a lot for testing the pre-release, I was not quite sure that anybody did that :)

I guess that I could add a flags part to the % conversion, something like %(title|h), to say that the content is in HTML format and should not be escaped. Hopefully nobody has defined field names with an embedded | … Any thoughts?

jedick writes

Thanks for the quick response! I haven’t defined field names with |, so a flag like that would suit me.

medoc writes

Thought again about this, and I think that I’ll go with a slightly cleaner and equivalent approach: use another attribute for the meta tag, which I arbitrarily chose to be named "markup". The value "html" for this attribute will deactivate HTML escaping for the value, i.e.:

<meta name="title" markup="html" content="this is <i>italics</i>">

I’m guessing here that your generated documents have no <title> element but a <meta name="title"… one. As far as I can see, any HTML tag inside an HTML <title> element is stripped directly by the HTML parser.

The only gotcha is that Recoll will concatenate multiple occurrences of a field, and will consider the whole as HTML if any of them is marked.

If this looks ok to you, it will be in the next beta package.


jedick writes

It looks very good.

I have just updated my rclpdf to use the <meta …> syntax. It was producing a <title> element via the awk program inherited from the default rclpdf. So I changed this line,

mid = "<title>" mid "</title>"

to

mid = "<meta name=\"title\" markup=\"html\" content=\"" mid "\">"

and it doesn’t change the results as far as I can tell, either in 1.17.3 or the current beta; and now it’s primed for using the "markup" attribute.

medoc writes

I have uploaded 1.18.002 to the web site and the PPA. This has this modification and a few other small tweaks. The packages should be available in a few hours on the PPA.

Don’t forget to check for double-quotes inside "mid" :)

jedick writes

Super! With the new package and a rebuild of the index the italicized words in the titles are back. A gsub(/"/, "\\\&quot;", mid) before the meta definition seems to take care of titles with double-quotes, so they show up completely rather than get truncated.
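
For anyone adapting their own filter, here is a minimal standalone sketch of the quoting step discussed above. It is not the actual rclpdf code: the helper name make_title_meta is hypothetical, and it assumes the title arrives as a single line of text. In awk's gsub, a bare & in the replacement stands for the matched text, so the ampersand of &quot; has to be written as "\\&" to come out literally.

```shell
#!/bin/sh
# Hypothetical helper (not part of rclpdf): wrap a one-line title in a
# <meta> element carrying markup="html", escaping double quotes so the
# content attribute is not cut short at the first quote.
make_title_meta() {
    printf '%s\n' "$1" | awk '{
        mid = $0
        # "\\&" yields a literal ampersand; a bare & would re-insert the match
        gsub(/"/, "\\&quot;", mid)
        print "<meta name=\"title\" markup=\"html\" content=\"" mid "\">"
    }'
}

make_title_meta 'The "big" <i>Homo sapiens</i> paper'
# prints: <meta name="title" markup="html" content="The &quot;big&quot; <i>Homo sapiens</i> paper">
```

Only double quotes are escaped here; the <i> tags are passed through untouched on purpose, since the point of markup="html" is that the value is treated as HTML.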

medoc writes

Great, closing the issue then.

medoc writes

fixed in current beta