rpremuz writes

Here is the output I get:

$ /usr/share/recoll/filters/rclchm putty.chm
Traceback (most recent call last):
  File "/usr/share/recoll/filters/rclchm", line 288, in <module>
    rclexecm.main(proto, extract)
  File "/usr/share/recoll/filters/rclexecm.py", line 163, in main
    if not extract.openfile(params):
  File "/usr/share/recoll/filters/rclchm", line 244, in openfile
    self.tp.feed(self.topics)
  File "/usr/lib/python2.7/HTMLParser.py", line 114, in feed
    self.goahead(0)
  File "/usr/lib/python2.7/HTMLParser.py", line 158, in goahead
    k = self.parse_starttag(i)
  File "/usr/lib/python2.7/HTMLParser.py", line 305, in parse_starttag
    attrvalue = self.unescape(attrvalue)
  File "/usr/lib/python2.7/HTMLParser.py", line 472, in unescape
    return re.sub(r"&(#?[xX]?(?:[0-9a-fA-F]+|\w{1,8}));", replaceEntities, s)
  File "/usr/lib/python2.7/re.py", line 151, in sub
    return _compile(pattern, flags).sub(repl, string, count)
UnicodeDecodeError: 'ascii' codec can't decode byte 0x91 in position 0: ordinal not in range(128)

Test environment:
Ubuntu 12.04 x64
recoll 1.17.3-1~ppa1~precise1
python-chm 0.8.4-1build2

The putty.chm (attached) is the help file from a well-known SSH client for Windows.

medoc writes

Hi,

I had a look at the Topics node inside the chm file (I would attach it but it seems I can’t). The HTML header declares an ASCII encoding, but it contains 8-bit characters (\222 \223 quotes), so Python’s HTMLParser chokes on it, and there is little I can do.

Either the topics file should declare an appropriate character set (i.e. windows-1252), or HTMLParser should be more lenient and always decode ASCII as Windows-1252 (which is probably what CHM viewers do), but I’ve no idea what Recoll could do in this situation (beyond possibly catching the exception for cleaner error handling).
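For illustration, here is a minimal sketch of the "be more lenient" idea: try the declared encoding first, and fall back to Windows-1252 if that fails. It is written for Python 3 (the traceback above comes from Python 2.7), the function name decode_leniently is made up for this example, and it is not the code used in the rclchm filter.

```python
def decode_leniently(raw, declared="ascii", fallback="cp1252"):
    """Decode bytes with the declared charset, else fall back to Windows-1252.

    Hypothetical helper, for illustration only.
    """
    try:
        return raw.decode(declared)
    except UnicodeDecodeError:
        # A few bytes (e.g. 0x81, 0x90) are undefined in cp1252, so use
        # errors="replace" to avoid raising a second time.
        return raw.decode(fallback, errors="replace")


# Bytes 0x91 and 0x92 are the Windows "smart" single quotes.
sample = b"<param name=\"Name\" value=\"\x91quoted\x92\">"
text = decode_leniently(sample)
print(repr(text))
```

Pure ASCII input takes the first branch and is returned unchanged, so valid documents should not be affected by the fallback.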

Sorry, I’m powerless here.

Cheers, jf

medoc writes

But thanks for reporting this anyway!

rpremuz writes

I’ve contacted the PuTTY development team and got a reply referring to the HTML 4.01 specification, section 5 (http://www.w3.org/TR/html401/charset.html). The charset parameter in the HTML header actually specifies the character encoding of the HTML document, not the character set of the document.

So, an HTML document in US-ASCII encoding can contain characters outside of the US-ASCII character set. In that case these additional characters are encoded as numeric character references (e.g. &#8216;) or character entity references (e.g. &lsquo;).
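As a small illustration of that point, the following Python 3 sketch decodes a pure US-ASCII byte string and then resolves its character references with the standard library's html.unescape. The sample string is invented for this example and is not taken from putty.chm.

```python
import html

# Every byte here is 7-bit ASCII, yet the text refers to non-ASCII characters.
source = b"&#8216;numeric&#8217; and &lsquo;named&rsquo;"

text = source.decode("ascii")   # succeeds: no byte is above 0x7F
decoded = html.unescape(text)   # resolves numeric and named references

print(repr(decoded))
```

The result contains U+2018 and U+2019, which is the UCS text a parser is expected to produce from such a document.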

The document character set of HTML documents is actually the Universal Character Set (UCS), defined in ISO 10646. When a parser extracts text from an HTML document, it generally gets UCS text.

Is it possible that Python’s HTMLParser implements these things incorrectly?

medoc writes

Sorry about the absurdly long delay. I did not get an email when the issue was updated, and I did not check the issues for a long time.

The PuTTY people are right on the theory of course, but what happens is that I get non-ASCII characters (not character references, actual byte values) inside the topics file, which is an HTML document out of the chm file, with a declared character encoding of US-ASCII.

I think that there are 3 possible causes:

  • These characters are actually present inside the CHM file, meaning that the PuTTY people are good on theory, a little less on practice. The offending characters are Windows-specific single quotes (character code 0x91 in Windows-1252), which are extremely common in supposedly ASCII or ISO-8859-1 files, so I’d put my money on this one.

  • The characters are present in the PuTTY files as character references, but are decoded by libchm, and libchm does not change the charset declaration, as it probably should.

  • Same idea with python-chm

It’s not easy for me to verify which idea is the right one, because my only way to look at the chm file is to use libchm, so I’d have to get into the code and basically debug the issue.

Here follow my ideas about how to make progress with this:

  • Ask the PuTTY team to carefully check that they have no Windows single quotes (8-bit characters 0x91 and 0x92) inside their files. The easiest way to check this is iconv -f ascii -t ascii < mysupposedlyasciifile.txt > /dev/null; iconv will complain if there are non-ASCII characters.

  • Ask the libCHM people about the issue, they should know if they do any entity-decoding (they should not, except if they are also willing to change the declared charset).
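If iconv is not at hand, the same check can be sketched in a few lines of Python 3. The helper name find_non_ascii is made up for this example; it simply reports the offset and value of every byte above 0x7F.

```python
def find_non_ascii(data):
    """Return (offset, byte value) pairs for every byte above 0x7F."""
    return [(offset, byte) for offset, byte in enumerate(data) if byte > 0x7F]


# Example input with the two Windows single quotes around a word.
sample = b"ab\x91c\x92"
for offset, byte in find_non_ascii(sample):
    print("offset %d: 0x%02x" % (offset, byte))
```

To check a real file, read it in binary mode (open(path, "rb").read()) and pass the bytes to the function. An empty result means the file is pure ASCII.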

I implemented a workaround to the above issues in the Recoll filter: http://www.lesbonscomptes.com/recoll/filters/rclchm

Hopefully this workaround, which is designed to fix incorrect input, will not introduce problems with valid input…

If you are still around, I’d be glad to hear if it fixes your problem (I’m going to hook up the email thing this time, so I won’t take a month to answer).

medoc writes

Lacking further input, closing this. Workaround hopefully ok.