Comments on "a CONS is an object which cares: Character encoding is about algorithms, not datastructures"

---

Vladimir Sedach (2010-11-19 18:39):

Actually, in practice using UCS-2 was perfectly fine without surrogates until the Chinese government decided to mandate this standard: http://en.wikipedia.org/wiki/GB_18030.

What I meant about 3 bytes per character is packing characters so that half of them would fall on 32-bit word boundaries (1/4 in the case of 64-bit words). The Unicode standards people are pretty confident that they're not going to need any code points above 2^21.

This makes it slightly more complicated to get at characters, but not unduly so, and I think the extra bit-banging cost might be offset by having more of the string in cache, at least for some algorithms. And of course you get a 25% space saving over UTF-32 while retaining all the nice "one character, one string index" properties.

Right now, for example, SBCL and CCL do what you describe: use a full 4 bytes for every character, which wastes a byte.

---

Anonymous (2010-11-18 23:33):

Indeed.
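The 3-bytes-per-character layout Vladimir describes above can be sketched roughly as follows. This is a hypothetical Python illustration of the idea, not code from any Lisp implementation; the function names are made up for the example.

```python
# Sketch of a string stored as 3 bytes (24 bits) per code point.
# Unicode code points never exceed 0x10FFFF (< 2^21), so 24 bits is enough;
# the layout saves 25% over UTF-32 while keeping O(1) character indexing.

def pack_string(text):
    """Pack each code point into 3 little-endian bytes."""
    buf = bytearray()
    for ch in text:
        buf += ord(ch).to_bytes(3, "little")
    return bytes(buf)

def char_at(buf, i):
    """O(1) indexing: character i lives at byte offset 3*i."""
    return chr(int.from_bytes(buf[3 * i : 3 * i + 3], "little"))

s = "a\u00e9\U0001D11E"                    # ASCII, Latin-1, and an astral character
packed = pack_string(s)
assert len(packed) == 3 * len(s)           # 9 bytes here, versus 12 for UTF-32
assert char_at(packed, 2) == "\U0001D11E"  # still one index per character
```

The extra cost is the unaligned 3-byte load on every access, which is the "bit-banging" trade-off the comment mentions.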
External-format functions do have keyword parameters to specify decode-error and encode-error behavior, but the default behavior is still as above.

What is worse, surrogates *make all the sequence operations unusable*, because there's no longer a 1:1 correspondence between code points and characters.

I don't know why surrogates are necessary, but I certainly would like that 1:1 correspondence back again. It would simplify things a lot, not only for compiler writers but *also for users* of any kind.

3 bytes per code point/character? I wouldn't mind at all, although I can easily imagine many folks reacting against the "wasted" bytes.

---

Vladimir Sedach (2010-11-18 13:30):

"CMUCL 20b does support surrogates: http://common-lisp.net/project/cmucl/doc/cmu-user/unicode.html#toc321"

In the sense that it punts on the issue (20b):

    (setf foo (coerce '(#\U+D834 #\U+DD1E) 'string)) => "?"

    (length foo) => 2

    (with-input-from-string (s foo) (princ (read-char s)) (princ (read-char s))) =>
    ?
    #\U+DD1E

    (setf (aref foo 0) #\b)
    foo => "b?"

This will probably break most existing code that doesn't expect to deal with multi-byte characters.

One thing I'm surprised hasn't been tried is using 3 bytes to encode each character in a string.
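For reference, the two characters #\U+D834 and #\U+DD1E in the transcript above are not independent characters at all: they are the UTF-16 surrogate pair encoding the single code point U+1D11E (MUSICAL SYMBOL G CLEF), which is why treating them as two string elements causes the breakage shown. The standard decoding arithmetic, sketched in Python (the function name is invented for illustration):

```python
# UTF-16 surrogate pair decoding, per the Unicode standard:
#   cp = 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)
# where high is in D800..DBFF and low is in DC00..DFFF.

def decode_surrogate_pair(high, low):
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

cp = decode_surrogate_pair(0xD834, 0xDD1E)
assert cp == 0x1D11E            # U+1D11E, MUSICAL SYMBOL G CLEF

# The 1:1 complaint in concrete terms: one character, two UTF-16 code units.
utf16 = "\U0001D11E".encode("utf-16-le")
assert len(utf16) // 2 == 2     # 2 code units (4 bytes) for 1 character
```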
Right now only code points up to 0x10FFFF are used, and there's no requirement that an implementation has to use UTF-32 internally.

---

Anonymous (2010-11-18 09:48):

CMUCL 20b does support surrogates: http://common-lisp.net/project/cmucl/doc/cmu-user/unicode.html#toc321

UTF-16 (UCS-2) may be a good choice if you want efficient processing of Asian languages as well. Even with European languages, or just something as simple as the vowel e with an accent, é, single-byte-per-character UTF-8 is out the window, so it doesn't really seem realistic, even in the USA.

Optimizing for the ASCII subset of UTF-8 is not really an option for web applications if you are aiming for international users. From that point of view, UTF-8 is an odd choice as a standard encoding on the web; in fact, many Asian countries continue to ignore it for good reasons. With UTF-16 you can at least support all the known languages in a simple way, except for a few, mainly historic, cases. This is just another trade-off. None of the choices seems to be as simple and efficient as ASCII was, or is.

---

Vladimir Sedach (2010-11-17 17:45):

Good point. It's actually more correct to say that CLISP and CMUCL use UCS-2 instead of UTF-16, since they don't support surrogates.

---

Anonymous (2010-11-17 16:16):

And there is UCS-2. LispWorks uses UCS-2.
And I prefer it over UTF-8/UTF-16 for the internals of the string representation (I'm a LispWorks user).

Best,
Art
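A footnote on the UTF-8 trade-off raised earlier in the thread: the byte counts below, computed with Python's standard codecs, show why the one-byte-per-character property of UTF-8 only survives for pure ASCII, while UTF-16 spends a flat two bytes on everything in the Basic Multilingual Plane.

```python
# Bytes per character under UTF-8 and UTF-16 (little-endian, no BOM).
samples = {
    "e": "plain ASCII vowel",
    "\u00e9": "e with acute accent (Latin-1 range)",
    "\u4e2d": "a CJK character",
}
for ch, desc in samples.items():
    print(f"{desc}: UTF-8 = {len(ch.encode('utf-8'))} bytes, "
          f"UTF-16 = {len(ch.encode('utf-16-le'))} bytes")
# ASCII costs 1 byte in UTF-8, é costs 2, and the CJK character costs 3,
# while UTF-16 uses 2 bytes for all three.
```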