Bytes, Code Points and Grapheme Clusters

Let’s talk about text. Specifically, text encoded using utf8.

How many characters would you consider to be in the string 'a'? What about 'é'?

$ node
      > 'a'.length
      1
      > 'é'.length
      1

I’d say that this lines up pretty closely with what I expect. There’s some trickery going on here though, which can be revealed if we use the Buffer object in node, which is populated with each of the bytes in a string:

$ node
      > Buffer.from('a').length
      1
      > Buffer.from('é').length
      2

While é may be a single character, it’s not represented using a single byte, like a is. It certainly could be – all you have to do is declare some byte to be the byte that represents é, as is the case for a and 0x61. This wouldn’t be practical though – a single byte only has 256 possible values to choose from. This is serviceable if you speak English & don’t care about symbols outside of the Latin letters, the numbers and a few symbols. This however excludes a whole lot of people who don’t speak English from being able to use your software in their native language. This is what you’re doing if you use ascii as your encoding (and ascii only uses 128 of those 256 values).
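
To make the "just declare a byte" idea concrete, here is a small sketch using latin1, a real single-byte encoding that node's Buffer happens to support. Picking latin1 is my choice for illustration; the point is only that one byte per character works until you run out of slots.

```javascript
// 'a' really is a single byte, and that byte is 0x61.
const aByte = Buffer.from('a')[0];
console.log(aByte.toString(16)); // '61'

// latin1 declares that the byte 0xe9 means é, so é fits in one byte.
const latin1 = Buffer.from('é', 'latin1');
console.log(latin1);        // <Buffer e9>
console.log(latin1.length); // 1

// The catch: a character outside the 256 slots can't survive a round trip.
const roundTrip = Buffer.from('€', 'latin1').toString('latin1');
console.log(roundTrip === '€'); // false
```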

So what next? Use more bytes? To cut a long story short, utf8 is a de facto solution to this problem. It uses a variable number of bytes to represent characters – 1 through 4 bytes. That means any “character” could be represented using 1, 2, 3, or 4 bytes. Conveniently, the designers of utf8 set it up in such a way that ascii is a valid subset of utf8. Nice one.
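
A quick way to see the variable width is to measure a few characters. The four samples below are ones I picked because they happen to need 1, 2, 3 and 4 bytes respectively:

```javascript
// Sample characters chosen to need 1, 2, 3 and 4 bytes in utf8.
const samples = ['a', 'é', '€', '👧'];

// Buffer.byteLength counts the bytes a string needs in a given encoding.
const byteCounts = samples.map(ch => Buffer.byteLength(ch, 'utf8'));

console.log(byteCounts); // [ 1, 2, 3, 4 ]
```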

This is why we got 2 as the length of Buffer.from('é'). Internally, in a utf8 encoded string, it’s stored using two bytes.

> [...Buffer.from('é')].map(cc => `b${cc.toString(2)}`)
      [ 'b11000011', 'b10101001' ]
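
Those two bytes follow the utf8 layout for a two-byte character: 110xxxxx 10xxxxxx, where the x bits carry the actual value. As a rough sketch (it only handles the two-byte case, nothing else), you can pull the payload back out by hand:

```javascript
const [first, second] = Buffer.from('é'); // 0b11000011, 0b10101001

// Strip the 110 marker from the first byte and the 10 marker from the second,
// then glue the remaining 5 + 6 payload bits together.
const codePoint = ((first & 0b00011111) << 6) | (second & 0b00111111);

console.log(codePoint.toString(16));          // 'e9'
console.log(String.fromCodePoint(codePoint)); // 'é'
```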

This works as we’d expect a utf8 string to behave, despite the fact that Javascript stores strings using the utf16 encoding. This is because Buffer.from takes an encoding as a second argument, the default value for which is utf8, so the string is converted from its utf16 form into utf8 bytes when the buffer is filled. I’ll be honest, I still don’t really have my head wrapped around how this works, but onwards we go.
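
One way to poke at this is to ask Buffer for both encodings of the same string and compare. This is just a sketch of what I see from the outside, not a description of node's internals:

```javascript
const s = 'é';

// .length counts utf16 code units, and é fits in a single one.
console.log(s.length); // 1

const asUtf16 = Buffer.from(s, 'utf16le'); // the code unit as two little-endian bytes
const asUtf8 = Buffer.from(s);             // 'utf8' is the default encoding

console.log(asUtf16); // <Buffer e9 00>
console.log(asUtf8);  // <Buffer c3 a9>

// Decoding the utf8 bytes gives back the same string.
const decoded = asUtf8.toString('utf8');
console.log(decoded === s); // true
```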

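With utf8 we have up to four bytes with 8 bits each, which is 2^32 = 4,294,967,296 (4.3 billion) possible values. Some of those bits are spent marking where each character starts and how long it is, and Unicode caps the range at 1,114,112 code points, but that is still plenty. So, end of story right? Surely this is enough room for every single character humans could conceive for the rest of history? Not quite. Look at this nonsense: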

> '👨‍👩‍👧‍👧'.length
      11
      > Buffer.from('👨‍👩‍👧‍👧').length
      25

WTF (paste that line in your terminal – it will probably mess it up)

11 characters is wild.
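
Here is where the 11 comes from, as far as I can tell. The emoji is really four separate emoji glued together with an invisible "zero width joiner" (U+200D). I've written the string with escapes below so the joiners are visible in the source:

```javascript
// The family emoji from above, written with escapes so the joiners show up.
const family = '\u{1F468}\u200D\u{1F469}\u200D\u{1F467}\u200D\u{1F467}';

// Spreading a string iterates by code point rather than by utf16 code unit.
const codePoints = [...family].map(
  cp => 'U+' + cp.codePointAt(0).toString(16).toUpperCase()
);

console.log(codePoints);
// [ 'U+1F468', 'U+200D', 'U+1F469', 'U+200D', 'U+1F467', 'U+200D', 'U+1F467' ]

console.log(family.length);              // 11 = 4 emoji * 2 code units + 3 joiners * 1
console.log(Buffer.from(family).length); // 25 = 4 emoji * 4 bytes + 3 joiners * 3
```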

This is surprising to me because I have an English-centric understanding of text. Some Hindi words, for example, are constructed of several characters to form a single “ligature”:

> 'अनुच्छेद'.length
      8

These “ligatures” are referred to as grapheme clusters. There’s an algorithm that you can use to figure out where the boundary between two grapheme clusters is (see reading material).
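
You don't have to implement that algorithm yourself to play with it. Recent versions of node ship Intl.Segmenter, which can split a string on grapheme cluster boundaries. A minimal sketch, assuming your node is new enough to have it (the graphemes helper is a name I made up, not a library function):

```javascript
// 'grapheme' granularity splits a string on grapheme cluster boundaries.
const segmenter = new Intl.Segmenter('en', { granularity: 'grapheme' });

// Hypothetical helper: returns the grapheme clusters of a string as an array.
function graphemes(str) {
  return [...segmenter.segment(str)].map(part => part.segment);
}

// The family emoji, written with escapes so the invisible joiners survive copy and paste.
const family = '\u{1F468}\u200D\u{1F469}\u200D\u{1F467}\u200D\u{1F467}';

console.log(graphemes('a').length);    // 1
console.log(graphemes('é').length);    // 1
console.log(graphemes(family).length); // 1

// Fewer than the 8 that .length reports; the exact count can vary with the
// Unicode version your node was built against.
console.log(graphemes('अनुच्छेद').length);
```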


Reading material