Comments on "a CONS is an object which cares: Character encoding is about algorithms, not datastructures"

---

Vladimir Sedach (2010-11-19 18:39):

Actually, in practice using UCS-2 was perfectly fine without surrogates until the Chinese government decided to mandate this standard: http://en.wikipedia.org/wiki/GB_18030.

What I meant about 3 bytes per character is packing characters so that half of them would fall on 32-bit word boundaries (1/4 in the case of 64-bit words). The Unicode standards people are pretty confident that they're not going to need any code points above 2^21.

This makes it slightly more complicated to get at characters, but not unduly so, and I think the extra bit-banging cost might be offset by having more of the string in cache, at least for some algorithms. And of course you get a 25% space saving over UTF-32 while retaining all the nice "one character, one string index" properties.

Right now, for example, SBCL and CCL do what you describe: use a full 4 bytes for every character, which wastes a byte.

---

Anonymous (2010-11-18 23:33):

Indeed.
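The 3-bytes-per-character layout Vladimir describes above can be sketched roughly as follows. This is a hypothetical Python illustration of the idea, not code from any Lisp implementation; the function names are made up for the example.

```python
# Sketch of a string stored as 3 bytes (24 bits) per code point.
# Unicode code points never exceed 0x10FFFF (< 2^21), so 24 bits is enough;
# the layout saves 25% over UTF-32 while keeping O(1) character indexing.

def pack_string(text):
    """Pack each code point into 3 little-endian bytes."""
    buf = bytearray()
    for ch in text:
        buf += ord(ch).to_bytes(3, "little")
    return bytes(buf)

def char_at(buf, i):
    """O(1) indexing: character i lives at byte offset 3*i."""
    return chr(int.from_bytes(buf[3 * i : 3 * i + 3], "little"))

s = "a\u00e9\U0001D11E"                    # ASCII, Latin-1, and an astral character
packed = pack_string(s)
assert len(packed) == 3 * len(s)           # 9 bytes here, versus 12 for UTF-32
assert char_at(packed, 2) == "\U0001D11E"  # still one index per character
```

The extra cost is the unaligned 3-byte load on every access, which is the "bit-banging" trade-off the comment mentions.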
External-format functions do have keyword parameters to specify decode-error and encode-error behavior, but the default behavior is still as above.

What is worse, surrogates *make all the sequence operations unusable*, because there's no longer a 1:1 correspondence between code points and characters.

I don't know why surrogates are necessary, but I certainly would like that 1:1 correspondence back again. It would simplify things a lot, not only for compiler writers but *also for users* of any kind.

3 bytes per code point/character? I wouldn't mind at all, although I can easily imagine many folks reacting against the "wasted" bytes.

---

Vladimir Sedach (2010-11-18 13:30):

"CMUCL 20b does support surrogates: http://common-lisp.net/project/cmucl/doc/cmu-user/unicode.html#toc321"

In the sense that it punts on the issue (20b):

    (setf foo (coerce '(#\U+D834 #\U+DD1E) 'string)) => "?"

    (length foo) => 2

    (with-input-from-string (s foo) (princ (read-char s)) (princ (read-char s))) =>
    ?
    #\U+DD1E

    (setf (aref foo 0) #\b)
    foo => "b?"

This will probably break most existing code that doesn't expect to deal with multi-byte characters.

One thing I'm surprised hasn't been tried is using 3 bytes to encode each character in a string.
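For reference, the two characters #\U+D834 and #\U+DD1E in the transcript above are not independent characters at all: they are the UTF-16 surrogate pair encoding the single code point U+1D11E (MUSICAL SYMBOL G CLEF), which is why treating them as two string elements causes the breakage shown. The standard decoding arithmetic, sketched in Python (the function name is invented for illustration):

```python
# UTF-16 surrogate pair decoding, per the Unicode standard:
#   cp = 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)
# where high is in D800..DBFF and low is in DC00..DFFF.

def decode_surrogate_pair(high, low):
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

cp = decode_surrogate_pair(0xD834, 0xDD1E)
assert cp == 0x1D11E            # U+1D11E, MUSICAL SYMBOL G CLEF

# The 1:1 complaint in concrete terms: one character, two UTF-16 code units.
utf16 = "\U0001D11E".encode("utf-16-le")
assert len(utf16) // 2 == 2     # 2 code units (4 bytes) for 1 character
```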
Right now only code points up to 0x10FFFF are used, and there's no requirement that an implementation has to use UTF-32 internally.

---

Anonymous (2010-11-18 09:48):

CMUCL 20b does support surrogates: http://common-lisp.net/project/cmucl/doc/cmu-user/unicode.html#toc321

UTF-16 (UCS-2) may be a good choice if you want efficient processing of Asian languages as well. Even with European languages, or just something as simple as the vowel e with an accent, é, single-byte-per-character UTF-8 is out the window, so it doesn't really seem realistic, even in the USA.

Optimizing for the ASCII subset of UTF-8 is not really an option for web applications if you are aiming for international users. From that point of view, UTF-8 is an odd choice as a standard encoding on the web; in fact, many Asian countries continue to ignore it for good reasons. With UTF-16 you can at least support all the known languages in a simple way, except for a few, mainly historic, cases. This is just another trade-off. None of the choices seems to be as simple and efficient as ASCII was, or is.

---

Vladimir Sedach (2010-11-17 17:45):

Good point. It's actually more correct to say that CLISP and CMUCL use UCS-2 instead of UTF-16, since they don't support surrogates.

---

Anonymous (2010-11-17 16:16):

And there is UCS-2. LispWorks uses UCS-2.
And I prefer it over UTF-8/UTF-16 for the internals of the string representation (I'm a LispWorks user).

Best,
Art
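A footnote on the UTF-8 trade-off raised earlier in the thread: the byte counts below, computed with Python's standard codecs, show why the one-byte-per-character property of UTF-8 only survives for pure ASCII, while UTF-16 spends a flat two bytes on everything in the Basic Multilingual Plane.

```python
# Bytes per character under UTF-8 and UTF-16 (little-endian, no BOM).
samples = {
    "e": "plain ASCII vowel",
    "\u00e9": "e with acute accent (Latin-1 range)",
    "\u4e2d": "a CJK character",
}
for ch, desc in samples.items():
    print(f"{desc}: UTF-8 = {len(ch.encode('utf-8'))} bytes, "
          f"UTF-16 = {len(ch.encode('utf-16-le'))} bytes")
# ASCII costs 1 byte in UTF-8, é costs 2, and the CJK character costs 3,
# while UTF-16 uses 2 bytes for all three.
```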