## jedick writes

Hi. I use Recoll to index some PDFs that have titles such as: An article about <i >Homo sapiens</i >

(actually a custom rclpdf filter converts \emph{…} saved in the PDF metadata to <i >…</i >)

In Recoll 1.17.3, using %(title) in the result paragraph format string would produce the italicized version (An article about *Homo sapiens*), and %T would not: with %T the html tags were displayed verbatim.

Testing Recoll 1.18.0 pre-release, both %(title) and %T display the html tags verbatim.

The verbatim rendering appears to be the intended result, based on the resolution of issue #99.

I was fond of using the %(title) trick (or bug) in 1.17.3 to add formatting to the titles in the result paragraph. Is it possible to generate similar behavior in the newer version?

## medoc writes

Hi and thanks a lot for testing the pre-release, I was not quite sure that anybody did that :)

I guess that I could add a flags part to the % conversion, something like %(title|h), to say that the content is in html format and should not be escaped. Hopefully nobody has defined field names with an embedded | … Any thoughts?

## jedick writes

Thanks for the quick response! I haven’t defined field names with |, so a flag like that would suit me.

## medoc writes

Thought again about this, and I think that I’ll go with a slightly cleaner and equivalent approach: use another attribute for the meta tag, which I arbitrarily chose to be named "markup". The value "html" for this attribute will deactivate html escaping for the value, i.e.:

<meta name="title" markup="html" content="this is <i >italics</i >" >

I’m guessing here that your generated documents have no <title > element but a <meta name="title"… one. As far as I can see, any HTML tag inside an HTML <title > element is stripped directly by the HTML parser.

The only gotcha is that Recoll will concatenate multiple occurrences of a field, and will consider the whole as HTML if any of them is marked.

If this looks ok to you, it will be in the next beta package.

jf

## jedick writes

It looks very good.

I have just updated my rclpdf to use the <meta … > syntax. It was producing a <title > element via the awk program inherited from the default rclpdf. So I changed this line,

mid = "<title >" mid "</title >"

to

mid = "<meta name=\"title\" markup=\"html\" content=\"" mid "\" >"

and it doesn’t change the results as far as I can tell, either in 1.17.3 or the current beta; and now it’s primed for using the "markup" attribute.

## medoc writes

I have uploaded 1.18.002 to the web site and the PPA. This has this modification and a few other small tweaks. The packages should be available in a few hours on the PPA.

Don’t forget to check for double-quotes inside "mid" :)

## jedick writes

Super! With the new package and a rebuild of the index the italicized words in the titles are back. A gsub(/"/, "\\\&quot;", mid) before the meta definition seems to take care of titles with double-quotes, so they show up completely rather than get truncated.
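
For reference, here is a minimal sketch of the quote escaping combined with the meta line, wrapped in a shell function so it can be run on its own. The function name `make_title_meta` and the sample title are made up for illustration and are not part of rclpdf. The replacement string is written here as `"\\&quot;"`, which awk's gsub turns into a literal `&quot;`; the exact escaping in a real filter may differ depending on how the awk program is quoted in the script.

```shell
#!/bin/sh
# Hypothetical helper (not part of rclpdf): build the <meta> title line
# from a raw title string passed as the first argument.
make_title_meta() {
    printf '%s\n' "$1" | awk '{
        mid = $0
        # Escape double quotes so they cannot end the content="..." attribute early.
        # "\\&" makes gsub emit a literal ampersand instead of the matched text.
        gsub(/"/, "\\&quot;", mid)
        # Same meta line as in the thread, with markup="html" to keep the <i> tags.
        mid = "<meta name=\"title\" markup=\"html\" content=\"" mid "\">"
        print mid
    }'
}

make_title_meta 'The "missing" <i>Homo sapiens</i> paper'
```

With the sample title above this should print `<meta name="title" markup="html" content="The &quot;missing&quot; <i>Homo sapiens</i> paper">`, so the quoted word survives instead of cutting the attribute short.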

## medoc writes

Great, closing the issue then.

## medoc writes

fixed in current beta