
Why Unicode stinks

Happy New Year! Cliché, but awesome nonetheless. I hereby resolve to blog more. Nah. Scratch that. Who keeps their New Year's resolutions anyway? My family recently discovered that my grandmother, for instance, had a diary that she started every January in the early 1940s, and the farthest she ever got was June (she kept trying until about the mid-1940s, after which it looks like she gave up for good).

Anyway, onto the actual topic of discussion. What follows is a summary - it isn't accurate, but that won't really matter too much.

A byte, for all intents and purposes of this discussion, is 8 "bits". A "bit" is a 0 or a 1. 8 bits offers 256 combinations (2^8 = 256). The smallest addressable unit a computer program can work with is a byte.

A computer screen is made up of a whole bunch of pixels, typically arranged in a 4:3 ratio (e.g. 800x600).

With that knowledge and assuming you were designing a computer, how would you display a letter of the alphabet to the user?

The first computer engineers didn't think very far ahead. In fact, I'm pretty sure they were fairly brain-dead when they created what is now known as the ASCII character set. The ASCII character set was supposed to be temporary until something better came along. Guess what? ASCII is unfortunately still in extensive use. English fits easily within the 256 characters a single byte can hold. Many Latin-based languages do as well.

Instead of fixing the problem, it was exacerbated when some idiot came up with the concept of "code pages". As I just said, most Latin-based languages fit comfortably in 256 characters. The problem is that, taken together, the Latin-based languages need far more than 256 distinct characters. ASCII, as it ended up being used, is actually two parts - the Basic character set and the Extended character set. The first 128 characters make up the former and the last 128 make up the latter. So, when you booted up the computer, you would load the code page for your language, which swapped a different set of characters into that Extended half.

Why was this a problem? Well, let's say you wrote a document using the Spanish code page and sent it to someone whose computer uses the English code page. When that person opens the document, anything outside of the Basic character set appears as gibberish. To read (and edit!) the document properly, they would first have to load the Spanish code page.

After about a dozen code pages had been cooked up for various languages, someone finally got the idea to create multibyte character sets (MBCS). A MBCS requires more than one byte to represent a single character. The first MBCS implementations were poorly designed and got a bad rap for being difficult to manage and hard to program for. However, for non-Latin (in particular, Eastern) languages, MBCS was the only way to actually encode characters.

After many years of struggling, some people formed a committee and Unicode was born. Unicode was supposed to be the magic bullet for all problems ever involving the representation of all known languages. I say "supposed" because Unicode stinks.

For instance, you can't represent Klingon in Unicode. This is a silly case, but what happens when we encounter intelligent alien life and want to send them a document? Unicode isn't cut out for that. Or what about the zillions of dead languages that no one uses anymore? Or how about the characters of the Chinese language that can't be included because they aren't "common enough"?

And don't get me started on programming Unicode stuff. Writing programs that don't chop up a grapheme cluster (I'm still trying to wrap my head around that one) into little pieces is a nightmare. There are several different Unicode Transformation Formats (UTFs), and strings may or may not start with a Byte Order Mark (BOM). Joel Spolsky, while I respect most of his views, is just plain wrong about Unicode.

Of everything I've seen in Unicode, only one positive stands out: UTF-8. UTF-8 is on the right track but still misses the mark by a long shot. It is still difficult to program for since it is a MBCS (but then so is the rest of Unicode if you don't want to destroy grapheme clusters), but its code unit is a single byte - no big/little-endian mess - and that makes it a zillion times better than any other incarnation of Unicode. But even UTF-8 is limited to Unicode's code space, so it can only ever represent the characters Unicode chooses to define.
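To show what I mean about UTF-8's byte-oriented design, here's a rough C sketch (a throwaway function of my own, not anything standard) that counts code points - not grapheme clusters - simply by skipping the 10xxxxxx continuation bytes:

#include <stddef.h>

/* Count Unicode code points in a NUL-terminated UTF-8 string.
 * UTF-8 continuation bytes always look like 10xxxxxx, so any byte
 * that does NOT match that pattern starts a new code point.  Note
 * that this counts code points, not grapheme clusters. */
size_t utf8_codepoint_count(const unsigned char *str)
{
    size_t count = 0;

    for (; *str; str++)
    {
        if ((*str & 0xC0) != 0x80) count++;
    }

    return count;
}

No byte-order marks, no surrogate pairs to worry about - that's the part UTF-8 got right.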

Here is what I want to see: Unlimited character set definitions. So if I want to display Klingon on my computer screen, I can. And the implementation is blindingly simple:

Bit 7 - Grapheme cluster
Bit 6 - Continuation
Bits 0-5 - Type and/or Data

That's a byte broken down into bit groups. All a programmer will be interested in is Bit 7. If Bit 7 is a 1, they move on to the next byte(s) until they find a byte with Bit 7 set to 0. That is a complete grapheme cluster. A grapheme cluster is a complete character - the boundaries of which you can slice and dice to your heart's content. You could, for instance, copy and paste it, concatenate two of them, or delete one from a string entirely. LOTS of possibilities. With Unicode, finding the boundaries of a grapheme cluster in a string of bytes is complex.
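Here's a minimal sketch of what that boundary scan could look like under my scheme (the function name is made up, and it assumes well-formed input):

/* Given a pointer to the first byte of a grapheme cluster in the
 * proposed encoding, return a pointer just past its last byte.
 * Bit 7 set means more bytes follow; Bit 7 clear marks the final
 * byte of the cluster. */
const unsigned char *next_grapheme(const unsigned char *p)
{
    while (*p & 0x80) p++;   /* skip bytes that continue the cluster */
    return p + 1;            /* step past the terminating byte */
}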

Bit 6 is a little more complex to understand. Basically, it indicates whether the current byte continues the previous byte's Type or starts a new one:

0 = New Type
1 = Continue current Type

For the Type, two bits means there are four possible values. I'd like to see them defined as such:

0 (00) = Code Point
1 (01) = World
2 (10) = Metadata
3 (11) = Custom

The Type is an implementation-specific view. A grapheme cluster without a World type is assumed to be the current World default (e.g. Unicode). A Code Point is implementation defined (e.g. a Unicode Code Point). The Metadata type is implementation-defined metadata that gets applied to the grapheme cluster as a whole or the following Code Point. The Custom type will probably never get used but it is implementation-dependent. It could be useful as an application callback to display some custom character that is very specific to the application.
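Here's a rough sketch of how a single byte would break down, assuming the two Type bits sit just below the Continuation bit (bits 4 and 5) whenever a byte starts a new Type - the struct and function names are mine, not part of any standard:

/* Decode one byte of the proposed encoding into its fields.  In a
 * byte that starts a new Type, bits 4-5 carry the Type and bits 0-3
 * carry data; in a continuation byte, bits 0-5 are all data. */
struct pbyte {
    int cluster_continues;  /* Bit 7: more bytes follow in this grapheme cluster */
    int new_type;           /* Bit 6 clear: this byte starts a new Type */
    int type;               /* Bits 4-5 when starting a new Type, else -1 */
    int data;               /* 4 data bits (new Type) or 6 data bits (continuation) */
};

struct pbyte decode_byte(unsigned char b)
{
    struct pbyte f;

    f.cluster_continues = (b >> 7) & 1;
    f.new_type          = !((b >> 6) & 1);

    if (f.new_type)
    {
        f.type = (b >> 4) & 0x3;
        f.data = b & 0x0F;
    }
    else
    {
        f.type = -1;          /* continuation byte: no Type field */
        f.data = b & 0x3F;    /* all six low bits are data */
    }

    return f;
}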

The rest of the byte is data: a byte that starts a new Type has four data bits left after the Type, while a continuation byte carries six. Need more space for data? Well, then continue with the next byte. An example is in order. Let's say Unicode is World 0 and Klingon is World 1, and Code Points 154 and 22500 together represent a single grapheme. This could be encoded as such:

10010001 (new Type: World; select World 1, Klingon)
10001010 (new Type: Code Point; low 4 bits of Code Point 154)
11001001 (continuation; high 4 bits of Code Point 154)
10000100 (new Type: Code Point; low 4 bits of Code Point 22500)
11111110 (continuation; next 6 bits of Code Point 22500)
01010101 (continuation, and the last byte of the cluster; high 5 bits of Code Point 22500)

This is one possible encoding. Code Point encoding is entirely up to the World designers.
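To convince myself the bytes above really decode back to the intended values, here's a small test program. The low-bits-first ordering and all the names are my own choices for this sketch:

#include <stdio.h>

/* Walk one grapheme cluster in the proposed encoding and print the
 * World selections and Code Points it contains.  Data bits arrive
 * low-order-first: the byte that starts a value carries its lowest
 * bits and each continuation byte supplies the next, higher bits. */
static void dump_cluster(const unsigned char *p)
{
    unsigned long value = 0;
    int shift = 0;
    int type = -1;
    int done = 0;

    while (!done)
    {
        unsigned char b = *p++;

        done = !(b & 0x80);            /* Bit 7 clear: last byte of the cluster */

        if (b & 0x40)                  /* Bit 6 set: continue the current value */
        {
            value |= (unsigned long)(b & 0x3F) << shift;
            shift += 6;
        }
        else                           /* Bit 6 clear: start a new Type/value */
        {
            if (type != -1)
                printf("Type %d, value %lu\n", type, value);
            type  = (b >> 4) & 0x3;
            value = b & 0x0F;
            shift = 4;
        }
    }

    if (type != -1)
        printf("Type %d, value %lu\n", type, value);
}

int main(void)
{
    /* The example bytes from above: select World 1 (Klingon), then
     * Code Points 154 and 22500 forming one grapheme cluster. */
    const unsigned char cluster[] = { 0x91, 0x8A, 0xC9, 0x84, 0xFE, 0x55 };

    dump_cluster(cluster);
    return 0;
}

Running it prints Type 1 (World) with value 1, then Type 0 (Code Point) with values 154 and 22500 - exactly what the example intended.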

This approach allows the Unicode World to coexist alongside any other World. From my perspective, programming becomes really easy. I don't have to care about the actual data I'm manipulating. Here's an example of what I mean:

#include <stddef.h>

size_t my_strlen(const unsigned char *str)
{
    size_t x = 0;

    for (; *str; str++)
    {
        /* Bit 7 clear marks the last byte of a grapheme cluster. */
        if (!(*str & 0x80)) x++;
    }

    return x;
}

For those who don't read C code, strlen() determines the length of a string of characters. The code above does the same but on a grapheme cluster level using the definitions I've provided. Doing the same thing in Unicode is about 50 times more complex.
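To tie it back to the earlier example, here's a quick usage sketch (assuming the my_strlen above is in scope) showing that those six bytes count as a single grapheme cluster:

#include <stdio.h>

int main(void)
{
    /* The six example bytes from earlier (World 1 plus Code Points
     * 154 and 22500), NUL-terminated so my_strlen can walk it. */
    const unsigned char cluster[] = { 0x91, 0x8A, 0xC9, 0x84, 0xFE, 0x55, 0x00 };

    printf("%zu\n", my_strlen(cluster));   /* prints 1 */
    return 0;
}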

Let's say, for instance, you want to convert a grapheme to "lowercase". When you start messing with anything below the grapheme level (as a programmer), the waters get muddy really fast. Converting a string to lowercase is not clear-cut and has all sorts of unintended consequences. For this reason, World designers MUST declare all transformations as clear-cut, dead-simple transformation sequences. Such transformations could, for instance, come in the form of a standard set of APIs or libraries that must be used for conformance.
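I don't know exactly what those declared transformations should look like, but as a sketch (every name here is hypothetical), each World might register a table of callbacks along these lines:

#include <stddef.h>

/* Hypothetical shape for the "declared transformations" idea: each
 * World registers a table of transformation callbacks, and programs
 * only ever go through the generic lookup. */
typedef struct {
    /* Writes the transformed grapheme cluster into 'out' (at most
     * 'out_size' bytes) and returns the number of bytes written,
     * or 0 if the transformation does not apply. */
    size_t (*to_lowercase)(const unsigned char *grapheme, size_t len,
                           unsigned char *out, size_t out_size);
    size_t (*to_uppercase)(const unsigned char *grapheme, size_t len,
                           unsigned char *out, size_t out_size);
} world_transforms;

/* Maps a World number to the transformation table it declared. */
const world_transforms *lookup_world_transforms(unsigned int world_id);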

The approach above is superior to UTF-8. It uses a little more storage space to represent the same characters, but let me introduce you to this wonderful thing called "compression". My recommended approach will still compress quite nicely, and you can do things with this that you will never be able to do in Unicode. But, as I've worked with Unicode, I've come to the realization that it doesn't matter how many bytes there are as long as the text gets displayed properly in the end.

Just a thought. It may turn out that this is a bad idea too. 20/20 hindsight is funny like that.
