November 16, 2010

Character encoding is about algorithms, not data structures

One thing you might be aware of is that both SBCL and Clozure represent characters using 4 bytes. There's been significant discussion about this already, but I hope I can offer more insight into how you can apply this to your Lisp applications.

First, the one thing most people seem to agree on is that UTF-16 is evil (it should be noted that both CLISP and CMUCL use UTF-16 as their internal character representation).

The important thing about UTF-32 versus UTF-8 and UTF-16 is that the choice is not primarily a question of string size, but of algorithms.

Variable-length encodings work perfectly fine for stream algorithms. But string algorithms are written on the assumption of constant-time random access and of being able to set parts of a string to new values without consing. Variable-length encodings can't satisfy these assumptions, and most string algorithms would not run over them with any acceptable level of performance.
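
To make the random access point concrete, here's a purely illustrative sketch (not from any library): finding the Nth character in a vector of UTF-8 octets means scanning from the beginning, whereas aref on a fixed-width string is a single array access.

(defun utf8-nth-char-offset (octets n)
  "Return the byte offset of the Nth (zero-based) character in OCTETS,
a vector of UTF-8 encoded bytes. O(n): every lookup scans from the start."
  (let ((pos 0))
    (loop repeat n
          do (let ((byte (aref octets pos)))
               (incf pos (cond ((< byte #x80) 1)  ; 1-byte (ASCII)
                               ((< byte #xE0) 2)  ; 2-byte sequence
                               ((< byte #xF0) 3)  ; 3-byte sequence
                               (t 4)))))          ; 4-byte sequence
    pos))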

What about immutable strings? Random access is still not constant-time under a variable-length encoding, and all the mutator operations are gone. In effect, most immutable string algorithms end up being stream algorithms that cons a lot.

Chances are, you're already using stream algorithms on your strings even if you're not aware of it. Any kind of search over a string really treats that string as a stream. If you're doing string concatenation, you're really treating your strings as though they were immutable - consider using with-output-to-string to cut down on consing and to simplify your code.
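
As a minimal sketch of the difference, repeated concatenate calls cons a fresh intermediate string on every step, while the stream version builds the result in one pass:

(defun join-with-commas/concatenate (strings)
  ;; Conses a new intermediate string on every iteration.
  (let ((result ""))
    (dolist (s strings result)
      (setf result (concatenate 'string result s ",")))))

(defun join-with-commas/stream (strings)
  ;; Writes into a single string output stream instead.
  (with-output-to-string (out)
    (dolist (s strings)
      (write-string s out)
      (write-char #\, out))))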

One thing that is special about UTF-8 is that it is the de facto character encoding standard of the web. When you're reading UTF-8 into a Lisp string, what you're really doing is decoding and copying each character.
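
Here's a sketch of what that looks like (assuming the babel library for the decoding step): the file is read as raw octets, and then every character is decoded from its UTF-8 byte sequence and copied into a freshly allocated Lisp string.

(defun read-utf8-file (pathname)
  ;; Sketch, assuming babel is loaded. The decode step allocates a
  ;; brand new string and copies every character into it.
  (with-open-file (in pathname :element-type '(unsigned-byte 8))
    (let ((octets (make-array (file-length in)
                              :element-type '(unsigned-byte 8))))
      (read-sequence octets in)
      (babel:octets-to-string octets :encoding :utf-8))))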

Most web application patterns are based around searching for and extracting some string from the HTTP request, and either using that string as a retrieval key in a database or splicing it together with some other strings to form the reply. All these actions can be modeled in terms of streams.

One of the things that makes John Fremlin's tpd2 fast is that it dispenses with the step of decoding and copying the incoming UTF-8 data into a Lisp string (this is also what Antiweb does). Using compiler macros and the cl-irregsexp library, all the searching and templating is done on byte arrays that hold the UTF-8 encoded data; Lisp string literals are converted to UTF-8 byte arrays at compile time. The result is a UTF-8 byte array that can be sent directly back to the web browser, bypassing another character encoding and copying step.
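
The idea looks roughly like this (an illustration only, not tpd2's actual code; it assumes babel for the one-time encoding step): literals are turned into octet vectors once, and all the searching and splicing happens on raw bytes.

;; Illustration only -- the literal is encoded to UTF-8 octets once,
;; at load time, instead of on every request.
(defparameter +greeting+
  (babel:string-to-octets "<h1>Hello, 世界</h1>" :encoding :utf-8))

(defparameter +crlf-crlf+
  (make-array 4 :element-type '(unsigned-byte 8)
                :initial-contents '(13 10 13 10)))

(defun find-header-end (request-octets)
  ;; Search the raw request bytes for the blank line that ends the
  ;; headers, without ever decoding them into a Lisp string.
  (search +crlf-crlf+ request-octets))

(defun build-reply (body-octets)
  ;; Splice pre-encoded fragments into one octet vector that can be
  ;; written straight back to the socket.
  (concatenate '(vector (unsigned-byte 8)) +greeting+ body-octets))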

I read somewhere that the Exokernel operating system permitted extending this concept further down the software stack by allowing applications to send back pre-formatted TCP/IP packets, although I don't know if that's actually possible or how much of a speedup it would give.

In addition to skipping the overhead of copying and re-encoding characters, working on UTF-8 byte sequences directly means you can use algorithms that depend on a small alphabet to achieve greater performance (an example is the Boyer–Moore–Horspool string searching algorithm).
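
For instance, here's a rough Boyer–Moore–Horspool sketch specialized to (unsigned-byte 8) vectors; the 256-entry skip table is exactly the small-alphabet advantage you keep by staying on raw UTF-8 bytes (add type declarations for real use):

(defun bmh-search (needle haystack)
  "Boyer-Moore-Horspool search over (unsigned-byte 8) vectors.
Returns the index of the first match, or NIL."
  (let* ((nlen (length needle))
         (hlen (length haystack))
         (skip (make-array 256 :initial-element nlen)))
    (when (zerop nlen)
      (return-from bmh-search 0))
    ;; Bad-character skip table over the 256-byte alphabet.
    (loop for i from 0 below (1- nlen)
          do (setf (aref skip (aref needle i)) (- nlen 1 i)))
    (loop with pos = 0
          while (<= (+ pos nlen) hlen)
          do (if (loop for i from (1- nlen) downto 0
                       always (= (aref needle i) (aref haystack (+ pos i))))
                 (return pos)
                 (incf pos (aref skip (aref haystack (+ pos nlen -1))))))))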

6 comments:

Anonymous said...

And there is UCS-2. LispWorks uses UCS-2, and I prefer it over UTF-8/UTF-16 for the internals of the string representation (I'm a LispWorks user).

Best,
Art

Vladimir Sedach said...

Good point. It's actually more correct to say that CLISP and CMUCL use UCS-2 instead of UTF-16 since they don't support surrogates.

Anonymous said...

CMUCL 20b does support surrogates: http://common-lisp.net/project/cmucl/doc/cmu-user/unicode.html#toc321

UTF-16 (UCS-2) may be a good choice if you want efficient processing of Asian languages as well. Even with European languages, or something as simple as the vowel e with an accent, é, single-byte-per-character UTF-8 is out the window, so it doesn't really seem realistic, even in the USA.

Optimizing for the ASCII subset of UTF-8 is not really an option for web applications if you are aiming for international users. From that point of view, UTF-8 is an odd choice as a standard encoding for the web; in fact many Asian countries continue to ignore it for good reasons. With UTF-16 you can at least support all the known languages in a simple way, except for a few, mainly historic, cases. This is just another trade-off. None of the choices seems to be as simple and efficient as ASCII was or is.

Vladimir Sedach said...

"CMUCL 20b does support surrogates: http://common-lisp.net/project/cmucl/doc/cmu-user/unicode.html#toc321"

In the sense that it punts on the issue (20b):

(setf foo (coerce '(#\U+D834 #\U+DD1E) 'string)) => "?"

(length foo) => 2

(with-input-from-string (s foo) (princ (read-char s)) (princ (read-char s))) =>
?
#\U+DD1E

(setf (aref foo 0) #\b)
foo => "b?"

This will probably break most existing code that doesn't expect to deal with multi-byte characters.

One thing I'm surprised hasn't been tried is using 3 bytes to encode each character in a string. Right now only code points up to 0x10FFFF are used, and there's no requirement that an implementation has to use UTF-32 internally.

Anonymous said...

Indeed. External-format functions do have keyword parameters for specifying decode-error and encode-error behavior, but the default behavior is still as above.

What is worse, surrogates make all the sequence operations unusable, because there's no longer a 1:1 correspondence between code points and characters.

I don't know why surrogates are necessary, but I certainly would like that 1:1 correspondence back again. It would simplify things a lot, not only for compiler writers, but also users of any kind.

3 bytes per code point/character? I wouldn't mind at all, although I can easily imagine many folks reacting against the "wasted" bytes.

Vladimir Sedach said...

Actually, in practice using UCS-2 was perfectly fine without surrogates until the Chinese government decided to mandate this standard: http://en.wikipedia.org/wiki/GB_18030.

What I meant about 3 bytes per character is packing characters so that half of them would fall on 32-bit word boundaries (1/4 in case of 64-bit words). The Unicode standards people are pretty confident that they're not going to need any code points above 2^21.

This makes it slightly more complicated to get at characters, but not unduly so, and I think the extra bit-banging cost might be offset by having more of the string in cache, at least for some algorithms. And of course you get a 25% space saving over UTF-32 while retaining all the nice "one character one string index" properties.
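
Something like this sketch (purely illustrative, not what any implementation actually does) is all the extra bit-banging it takes:

(defun packed-char (s i)
  ;; S is an (unsigned-byte 8) vector holding 3 bytes per code point.
  ;; Still constant-time: character I starts at byte offset 3*I.
  (let ((base (* 3 i)))
    (logior (aref s base)
            (ash (aref s (+ base 1)) 8)
            (ash (aref s (+ base 2)) 16))))

(defun (setf packed-char) (code s i)
  ;; Store code point CODE as the Ith character of S.
  (let ((base (* 3 i)))
    (setf (aref s base)       (ldb (byte 8 0) code)
          (aref s (+ base 1)) (ldb (byte 8 8) code)
          (aref s (+ base 2)) (ldb (byte 8 16) code))
    code))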

Right now, for example, SBCL and CCL do what you describe: use a full 4 bytes for every character, which wastes a byte.