beaneater.org.uk Nicholas Wolverson scribbles on his screen

Charsets

Character set madness


07 April 2003
(19:19)

Seem to be having lots of discussions about character sets recently. Which is good, as it's an oft-misunderstood issue. Personally, I have not succeeded in setting irssi up to do UTF-8 IRC (at least through screen and PuTTY, if that makes a difference), and hence I believe that it is right and proper that everybody use ISO-8859-1.

A digression on character sets

As a review for those who know, and a view for those who don't, the traditional character set is ASCII. This takes up 7 bits in each byte, and there have been many extensions (in DOS, Windows, other things, and standardised) to use the last bit. ISO-8859-1 uses this for characters useful to Western Europeans, and isn't so different from the standard Windows character set (though Windows actually generally works with Unicode, I believe). So, Unicode: this is a character set encompassing a huge range of languages and scripts, including ideographic scripts like those used for Chinese, and UTF-8 is an encoding of it which represents characters with varying numbers of bytes. In particular, it coincides with ASCII for the first 128 characters, but must use the 8th bit to indicate that more bytes are being used for other characters, and hence conflicts with ISO-8859-1.
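To make the conflict concrete, here is a small sketch in Python (my choice of language for illustration, not something this site runs on). It encodes the same two characters under each charset and shows that the ISO-8859-1 bytes are not valid UTF-8. The variable names are made up for the example.

```python
# Encode the same text under each charset and compare the raw bytes.
text = "o\u00f6"  # 'o' followed by o umlaut

ascii_bytes = "o".encode("ascii")    # one byte, top bit clear
latin1 = text.encode("iso-8859-1")   # one byte per character
utf8 = text.encode("utf-8")          # o umlaut takes two bytes here

# Plain ASCII text is byte-for-byte identical in all three encodings.
assert "o".encode("iso-8859-1") == "o".encode("utf-8") == ascii_bytes

# Outside ASCII the encodings disagree: the lone ISO-8859-1 byte 0xF6
# is not a legal UTF-8 sequence, so decoding it as UTF-8 fails.
try:
    latin1.decode("utf-8")
    latin1_is_valid_utf8 = True
except UnicodeDecodeError:
    latin1_is_valid_utf8 = False

print(latin1, utf8, latin1_is_valid_utf8)
```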

So, on IRC I see the benefits of 8859-1 and UTF-8, but have managed to persuade people that other weird charsets shouldn't abound. And I think I now know the areas in which ISO-8859-1, ISO-8859-15, and Windows' codepage 1252 differ enough to avoid them in the appropriate circumstances. I have in the past thought I should move from ISO-8859-1 to ISO-8859-15, for the Euro sign, but then I lose ¬.
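One way to see where the three charsets part company is to decode the same byte under each of them. The following is a rough sketch using Python's built-in codecs; `decode_everywhere` is a hypothetical helper name, and the two bytes probed are just the ones relevant to the Euro sign.

```python
CHARSETS = ("iso-8859-1", "iso-8859-15", "cp1252")

def decode_everywhere(byte_value):
    """Decode a single byte under each charset.

    errors="replace" substitutes U+FFFD for bytes a charset leaves undefined.
    """
    raw = bytes([byte_value])
    return {name: raw.decode(name, errors="replace") for name in CHARSETS}

# 0xA4 is the currency sign in ISO-8859-1 and cp1252, but the Euro sign
# in ISO-8859-15. 0x80 is a control character in both ISO charsets, but
# the Euro sign in cp1252.
for b in (0xA4, 0x80):
    decoded = decode_everywhere(b)
    print(hex(b), {name: ascii(ch) for name, ch in decoded.items()})
```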

The web, XML, and managed content

I have been writing, as is natural, in ISO-8859-1 on this website. This works fine and dandy. Except that when CGI.pm generates XHTML, it always specifies UTF-8 as the charset in the XML declaration. But that's fine because my HTTP header gives ISO-8859-1, which overrides the XML declaration. Also, as I send XHTML under the text/html MIME type, I'm not even sure if Mozilla is rendering it as XHTML (references appreciated). I plan to go back and link to the relevant discussions and controversy on that last point.
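The precedence described above can be sketched roughly as follows, in Python. This is a simplification, not how any particular browser decides: `effective_charset` is a hypothetical function, and it only looks at the HTTP Content-Type header and the XML declaration, ignoring meta tags and byte-order marks.

```python
import re
from email.message import Message

def effective_charset(content_type, body):
    """Return the charset a client would use: HTTP header first, then XML declaration."""
    msg = Message()
    msg["Content-Type"] = content_type
    header_charset = msg.get_content_charset()  # lowercased, or None if absent
    if header_charset:
        return header_charset
    # Fall back to the encoding named in the XML declaration, if any.
    match = re.match(r'<\?xml[^>]*\bencoding=["\']([A-Za-z0-9._-]+)["\']', body)
    if match:
        return match.group(1).lower()
    return None

doc = '<?xml version="1.0" encoding="UTF-8"?><html></html>'
print(effective_charset("text/html; charset=ISO-8859-1", doc))
print(effective_charset("text/html", doc))
```

With the header present, the UTF-8 claim in the declaration is never consulted, which is why the mismatch goes unnoticed on the HTML pages.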

So that works fine, but then I generate RSS. And again I'm pretending it's UTF-8 in the XML declaration, but sending out ISO-8859-1. Recently I've used the character ö, that is o umlaut. Which breaks things, as its single ISO-8859-1 byte isn't valid UTF-8. Oops. So eventually I shall write a script to go through the database converting to UTF-8, but for now I'll give RSS in ISO-8859-1. This should hopefully work; please poke me if my RSS is broken for you (email/comment/IRC).
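The conversion script might look something like this sketch, written in Python purely for illustration. `to_utf8` is a hypothetical helper and the `entries` list stands in for rows read from the database. It assumes anything that does not already decode as UTF-8 is ISO-8859-1; ISO-8859-1 text whose bytes happen to form valid UTF-8 would be left untouched, so that guess is not foolproof.

```python
def to_utf8(raw: bytes) -> bytes:
    """Return raw as UTF-8, assuming non-UTF-8 input is ISO-8859-1."""
    try:
        raw.decode("utf-8")
        return raw  # already valid UTF-8 (this includes plain ASCII)
    except UnicodeDecodeError:
        # Every byte is a valid ISO-8859-1 character, so this cannot fail.
        return raw.decode("iso-8859-1").encode("utf-8")

# Stand-in data: ASCII, ISO-8859-1 with o umlaut, and existing UTF-8.
entries = [b"plain ascii", b"K\xf6ln", "K\u00f6ln".encode("utf-8")]
converted = [to_utf8(e) for e in entries]
print(converted)
```

Checking for valid UTF-8 first means the script can be run twice without double-encoding entries that were already converted.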

Also todo: validate entries with an XHTML validator before posting, rejig comments, take over the world.

Plea

Anybody know just how text input works in browsers with respect to character sets and HTML entities?

In categories: Log, Meta, Charsets
David Oftedal

All these apps have special options to enable Unicode support. Try the following:

  1. Select UTF-8 from your PuTTY settings and start PuTTY

  2. screen -U

  3. irssi

  4. /set term_type utf8