Unicode examples

This post is mostly inspired by facts summarised at utf8everywhere.org.

Unicode is somewhat complicated. Two points in particular are commonly misunderstood:

  • there is no simple one-to-one mapping between graphemes (the user-perceived characters) and code points (the “Unicode” characters)
  • UTF-16 is not a fixed-width encoding - some code points are encoded into 2 bytes and some into 4 bytes

Here are some examples:

| Example | Graphemes | Code points | UTF-8 bytes | UTF-16 bytes | UTF-32 bytes |
|---|---|---|---|---|---|
| 1. Simple | a | 61 | 61 | 00 61 | 00 00 00 61 |
| 2. Fullwidth | ａ | FF41 | EF BD 81 | FF 41 | 00 00 FF 41 |
| 3. Diacritic | č | 10D | C4 8D | 01 0D | 00 00 01 0D |
| 4. Diacritic - separate | č | 63 30C | 63 CC 8C | 00 63 03 0C | 00 00 00 63 00 00 03 0C |
| 5. Ligature | ĳ | 133 | C4 B3 | 01 33 | 00 00 01 33 |
| 6. Separate | ij | 69 6A | 69 6A | 00 69 00 6A | 00 00 00 69 00 00 00 6A |
| 7. Same grapheme | Ω | 3A9 | CE A9 | 03 A9 | 00 00 03 A9 |
| 8. Same grapheme | Ω | 2126 | E2 84 A6 | 21 26 | 00 00 21 26 |
| 9. Non-BMP | 𝓃 | 1D4C3 | F0 9D 93 83 | D8 35 DC C3 | 00 01 D4 C3 |

I looked up the information about the individual characters using unicode-table.com. The code points in the table link to the information about the character on that site.

Each code unit is underlined separately. Each code point is encoded as:

  • UTF-8 - 1 to 4 code units of 1 byte each (the original design allowed up to 6, but the current standard stops at 4)
  • UTF-16 - 1 or 2 code units of 2 bytes each
  • UTF-32 - exactly 1 code unit of 4 bytes

UTF-16 and UTF-32 can be encoded as big or little endian. The table shows only the big endian variant (the most significant byte first).
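The byte columns of the table can be reproduced with a few lines of Python. This is only a small sketch: the helper name hex_bytes is made up for this post, and it relies on Python's built-in "utf-8", "utf-16-be" and "utf-32-be" codecs (the "-be" variants write big endian without a byte order mark).

```python
def hex_bytes(text: str, encoding: str) -> str:
    """Encode text and return its bytes as space-separated uppercase hex."""
    return " ".join(f"{b:02X}" for b in text.encode(encoding))


# Rows 1, 3, 4 and 9 of the table, written as escapes to avoid ambiguity.
samples = ["a", "\u010d", "c\u030c", "\U0001d4c3"]

for text in samples:
    code_points = " ".join(f"{ord(ch):X}" for ch in text)
    print(
        code_points,
        "|", hex_bytes(text, "utf-8"),
        "|", hex_bytes(text, "utf-16-be"),
        "|", hex_bytes(text, "utf-32-be"),
    )

# The little endian variant swaps the bytes inside each code unit.
print(hex_bytes("\u010d", "utf-16-le"))  # 0D 01
```

Using the plain "utf-16" or "utf-32" codec names instead would prepend a byte order mark, so the output would not match the table.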

All numbers are shown in hexadecimal form, including the code point numbers.

1. Simple

One grapheme is one code point, encoded as 2 bytes in UTF-16, as we would expect.

2. Fullwidth

The same letter, but it looks slightly different (it is wider) and it is a different code point. Fullwidth forms are used alongside East Asian characters so that Latin letters occupy the same width in fixed-width fonts.

3. Diacritic

The same as the simple case, but for a less common and more complicated grapheme (with a caron). We would expect this, too.

4. Diacritic - separate

Alternatively, we can use 2 separate code points to encode a character with a diacritic:

  • one for the base character (c) - blue color
  • one for the diacritical mark (the caron) - red color

Note that this is the same grapheme (as users perceive it), but encoded in a completely different way. This can cause problems when comparing texts - the user would perceive them as equal, but a comparison of code points would report them as different. A process called normalization converts equivalent sequences of code points to the same form, so that they can be compared correctly.
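The comparison problem and its fix can be sketched with Python's standard unicodedata module. The helper equal_normalized is a name invented for this illustration, and choosing NFC (the composed form) is an assumption - NFD would work just as well, as long as both sides use the same form.

```python
import unicodedata

precomposed = "\u010d"   # row 3: one code point (c with caron)
decomposed = "c\u030c"   # row 4: base letter c + combining caron

# A plain comparison looks at code points, so the strings differ.
print(precomposed == decomposed)            # False
print(len(precomposed), len(decomposed))    # 1 2


def equal_normalized(a: str, b: str, form: str = "NFC") -> bool:
    """Compare two strings after converting both to the same normalization form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


print(equal_normalized(precomposed, decomposed))  # True
```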

5. Ligature

Here 2 graphemes correspond to only one code point.

6. Separate

Of course, these graphemes (i and j) can be encoded separately, too - as two code points (blue and red). The separate characters look almost the same as the ligature.

7./8. Same grapheme

These look exactly the same, but are different code points, with different meanings:

  • Greek letter Omega
  • Ohm sign
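To see that these really are two distinct code points, we can ask Python's unicodedata module for their official names. The sketch below also shows one detail that goes slightly beyond the table: Unicode defines the Ohm sign as canonically equivalent to the Greek letter, so normalization to NFC replaces it with Omega.

```python
import unicodedata

omega = "\u03a9"  # row 7
ohm = "\u2126"    # row 8

for ch in (omega, ohm):
    print(f"U+{ord(ch):04X}", unicodedata.name(ch))
# U+03A9 GREEK CAPITAL LETTER OMEGA
# U+2126 OHM SIGN

# Different code points, so a plain comparison fails...
print(omega == ohm)                                  # False
# ...but after normalization the Ohm sign becomes the Greek letter.
print(unicodedata.normalize("NFC", ohm) == omega)    # True
```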

9. Non-BMP

This is the case where one code point is encoded as 4 UTF-16 bytes (instead of 2), that is, as two 2-byte code units known as a surrogate pair.

All characters which do not belong to the Basic Multilingual Plane (BMP) are encoded like this.
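The two UTF-16 code units of a non-BMP code point can be computed by hand. The following is a minimal sketch of that arithmetic; the function name utf16_code_units is made up for this post, and the function does no validation (for example, it does not reject the surrogate range D800-DFFF as input).

```python
def utf16_code_units(code_point: int) -> list:
    """Return the UTF-16 code units (as integers) for one code point."""
    if code_point < 0x10000:
        # BMP: the code point is stored directly in one 2-byte code unit.
        return [code_point]
    # Non-BMP: subtract 0x10000, which leaves a 20-bit number,
    # then split it into two 10-bit halves.
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)     # high (leading) surrogate
    low = 0xDC00 + (offset & 0x3FF)    # low (trailing) surrogate
    return [high, low]


units = utf16_code_units(0x1D4C3)        # row 9 of the table
print(" ".join(f"{u:04X}" for u in units))  # D835 DCC3

# Cross-check against Python's own big endian UTF-16 encoder.
print("\U0001d4c3".encode("utf-16-be").hex(" ").upper())  # D8 35 DC C3
```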

You can find other commonly used non-BMP characters in this post at Stack Overflow - see the “trans-BMP code points” part.

Written on July 21, 2015