Vishnu's Pages

UTF-8 is a Brilliant Design

The first time I learned about UTF-8 encoding, I was fascinated by how thoughtfully it was designed: it can represent more than a million characters from different languages and scripts and still be backward compatible with ASCII.

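Basically, UTF-8 uses up to 32 bits (four bytes) per character and the old ASCII uses 7 bits, but UTF-8 is designed in such a way that:

  • Every ASCII-encoded file is a valid UTF-8 file.
  • Every UTF-8 file that contains only ASCII characters is a valid ASCII file.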

Designing a system that scales to more than a million characters while staying compatible with old systems that use just 128 characters is brilliant.

Note: If you are already aware of the UTF-8 encoding, you can explore the UTF-8 Playground utility that I built to visualize UTF-8 encoding.

How Does UTF-8 Do It?

UTF-8 is a variable-width character encoding designed to represent every character in the Unicode character set, encompassing characters from most of the world's writing systems.

It encodes characters using one to four bytes.
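To make the "one to four bytes" claim concrete, here is a small sketch in Python. The sample characters and the helper name `utf8_length` are my own choices for illustration, not part of any standard API.

```python
# Sample characters picked for illustration; each one needs a different
# number of bytes in UTF-8.
samples = ["A", "é", "अ", "👋"]


def utf8_length(ch: str) -> int:
    """Return how many bytes the UTF-8 encoding of one character takes."""
    return len(ch.encode("utf-8"))


for ch in samples:
    print(f"U+{ord(ch):04X} {ch!r}: {utf8_length(ch)} byte(s)")
```

The lower the code point, the fewer bytes it needs, which is why plain English text stays as compact as it was in ASCII.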

The first 128 characters (U+0000 to U+007F) are encoded with a single byte, ensuring backward compatibility with ASCII, and this is the reason why a file with only ASCII characters is a valid UTF-8 file.

Other characters require two, three, or four bytes. The leading bits of the first byte determine the total number of bytes that represent the current character. These bits follow one of four specific patterns, which indicate how many continuation bytes follow.

1st byte pattern   # of bytes used   Full byte sequence pattern
0xxxxxxx           1                 0xxxxxxx (this is basically a regular ASCII-encoded byte)
110xxxxx           2                 110xxxxx 10xxxxxx
1110xxxx           3                 1110xxxx 10xxxxxx 10xxxxxx
11110xxx           4                 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

Notice that the second, third, and fourth bytes in a multi-byte sequence always start with 10. This indicates that these bytes are continuation bytes, following the main byte.
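The four leading-bit patterns and the 10 prefix of continuation bytes can be checked with a few bit shifts. This is a minimal sketch with hypothetical helper names (`sequence_length`, `is_continuation`); it only looks at the prefixes described above and does not reject every byte that a strict UTF-8 validator would.

```python
def sequence_length(first_byte: int) -> int:
    """Return the total number of bytes in the sequence starting with first_byte.

    Raises ValueError if the byte matches none of the four leading patterns
    (for example, if it is a continuation byte).
    """
    if first_byte >> 7 == 0b0:        # 0xxxxxxx
        return 1
    if first_byte >> 5 == 0b110:      # 110xxxxx
        return 2
    if first_byte >> 4 == 0b1110:     # 1110xxxx
        return 3
    if first_byte >> 3 == 0b11110:    # 11110xxx
        return 4
    raise ValueError(f"not a leading byte: {first_byte:08b}")


def is_continuation(byte: int) -> bool:
    """Return True if the byte has the form 10xxxxxx."""
    return byte >> 6 == 0b10


print(sequence_length(0b11100000))   # 3
print(is_continuation(0b10100100))   # True
```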

The remaining bits in the main byte, along with the bits in the continuation bytes, are combined to form the character's code point. A code point serves as a unique identifier for a character in the Unicode character set. A code point is typically represented in hexadecimal format, prefixed with "U+". For example, the code point for the character "A" is U+0041.
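In Python, the built-in `ord` and `chr` functions convert between a character and its code point, so the "U+" notation is easy to reproduce. The helper name `code_point_label` is just an illustrative choice.

```python
def code_point_label(ch: str) -> str:
    """Format the code point of a single character in U+XXXX notation."""
    return f"U+{ord(ch):04X}"


print(code_point_label("A"))   # U+0041
print(chr(0x0041))             # A
```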

So here is how software determines the character from the UTF-8 encoded bytes:

  1. Read a byte. If it starts with 0, it's a single-byte character (ASCII). Show the character represented by the remaining 7 bits on the screen. Continue with the next byte.
  2. If the byte didn't start with a 0, then:
    • If it starts with 110, it's a two-byte character, so read the next byte as well.
    • If it starts with 1110, it's a three-byte character, so read the next two bytes.
    • If it starts with 11110, it's a four-byte character, so read the next three bytes.
  3. Once the number of bytes is determined, read all the remaining bits except the leading bits, and find the binary value (a.k.a. the code point) of the character.
  4. Look up the code point in the Unicode character set to find the corresponding character and display it on the screen.
  5. Read the next byte and repeat the process.
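The steps above can be sketched as a small decoder. This is a teaching sketch under simplifying assumptions, not a replacement for `bytes.decode("utf-8")`: the function name `decode_utf8` is hypothetical, and it does not reject overlong encodings or surrogate code points the way a strict decoder must.

```python
def decode_utf8(data: bytes) -> str:
    """Decode UTF-8 bytes by following the leading-bit rules step by step."""
    chars = []
    i = 0
    while i < len(data):
        first = data[i]
        # Steps 1 and 2: inspect the leading bits to find the sequence length,
        # and keep only the payload bits of the first byte.
        if first >> 7 == 0b0:
            length, code_point = 1, first
        elif first >> 5 == 0b110:
            length, code_point = 2, first & 0b00011111
        elif first >> 4 == 0b1110:
            length, code_point = 3, first & 0b00001111
        elif first >> 3 == 0b11110:
            length, code_point = 4, first & 0b00000111
        else:
            raise ValueError(f"unexpected byte {first:08b} at offset {i}")

        if i + length > len(data):
            raise ValueError(f"truncated sequence at offset {i}")

        # Step 3: append the low 6 bits of every continuation byte.
        for byte in data[i + 1 : i + length]:
            if byte >> 6 != 0b10:
                raise ValueError(f"expected a continuation byte, got {byte:08b}")
            code_point = (code_point << 6) | (byte & 0b00111111)

        # Step 4: look up the character for this code point.
        chars.append(chr(code_point))
        # Step 5: move on to the next sequence.
        i += length
    return "".join(chars)


print(decode_utf8(bytes([0b11100000, 0b10100100, 0b10000101])))
```

In real programs the built-in decoder is the right tool; the point here is only that the whole procedure fits in a short loop.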

Example: Hindi Letter "अ" (open in UTF-8 Playground)

The Hindi letter "अ" (officially "Devanagari Letter A") is represented in UTF-8 as:

11100000 10100100 10000101

Here:

The first byte 11100000 indicates that the character is encoded using 3 bytes.

The remaining bits of the three bytes: xxxx0000 xx100100 xx000101 are combined to form the binary sequence 00001001 00000101 (0x0905 in hexadecimal). This is the code point of the character, represented as U+0905.

The code point U+0905 (see official chart) represents the Hindi letter "अ" in the Unicode character set.
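The same bit arithmetic can be written out by hand in Python. This sketch hard-codes the three bytes from the example and masks off the leading bits of each one; the variable names are mine.

```python
# The three bytes from the example above.
encoded = bytes([0b11100000, 0b10100100, 0b10000101])
b1, b2, b3 = encoded

# Drop the 1110 prefix of the first byte and the 10 prefix of the others,
# then join the remaining 4 + 6 + 6 bits.
code_point = ((b1 & 0b00001111) << 12) | ((b2 & 0b00111111) << 6) | (b3 & 0b00111111)

print(f"{code_point:016b}")    # 0000100100000101
print(f"U+{code_point:04X}")   # U+0905
print(chr(code_point))         # अ
```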

Example Text Files

Now that we understand the design of UTF-8, let's look at a file that contains the following text:

1. Text file contains: Hey👋 Buddy

The text Hey👋 Buddy has both English characters and an emoji character in it. The text file with this text saved on disk will contain the following 13 bytes:

01001000 01100101 01111001 11110000 10011111 10010001 10001011 00100000 01000010 01110101 01100100 01100100 01111001
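If you want to reproduce this byte dump yourself, a short Python sketch is enough. It encodes the string in memory rather than reading an actual file, which I assume gives the same bytes as a file saved as UTF-8 without a byte order mark.

```python
text = "Hey👋 Buddy"
encoded = text.encode("utf-8")

# Render every byte as eight binary digits, separated by spaces.
bits = " ".join(f"{byte:08b}" for byte in encoded)

print(len(encoded))   # 13
print(bits)
```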

Let's evaluate this file byte-by-byte following the UTF-8 decoding rules:

Byte Explanation
01001000 Starts with 0, so it's a single-byte ASCII character. The remaining bits 1001000 represent the letter 'H'. (open in playground)
01100101 Starts with 0, so it's a single-byte ASCII character. The remaining bits 1100101 represent the letter 'e'. (open in playground)
01111001 Starts with 0, so it's a single-byte ASCII character. The remaining bits 1111001 represent the letter 'y'. (open in playground)
11110000 Starts with 11110, indicating it's the first byte of a four-byte character.
10011111 Starts with 10, indicating it's a continuation byte.
10010001 Starts with 10, indicating it's a continuation byte.
10001011 Starts with 10, indicating it's a continuation byte. The bits from these four bytes (excluding the leading bits) combine to form the binary sequence 00001 11110100 01001011, which is 1F44B in hexadecimal and corresponds to the code point U+1F44B. This code point represents the waving hand emoji "👋" in the Unicode character set. (open in playground)
00100000 Starts with 0, so it's a single-byte ASCII character. The remaining bits 0100000 represent a whitespace character. (open in playground)
01000010 Starts with 0, so it's a single-byte ASCII character. The remaining bits 1000010 represent the letter 'B'. (open in playground)
01110101 Starts with 0, so it's a single-byte ASCII character. The remaining bits 1110101 represent the letter 'u'. (open in playground)
01100100 Starts with 0, so it's a single-byte ASCII character. The remaining bits 1100100 represent the letter 'd'. (open in playground)
01100100 Starts with 0, so it's a single-byte ASCII character. The remaining bits 1100100 represent the letter 'd'. (open in playground)
01111001 Starts with 0, so it's a single-byte ASCII character. The remaining bits 1111001 represent the letter 'y'. (open in playground)

This is a valid UTF-8 file, but it is not a valid ASCII file, because it contains a non-ASCII character (the emoji). Next, let's create a file that contains only ASCII characters.
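As a quick check that the emoji is what breaks ASCII compatibility here, the sketch below tries to decode the same bytes with Python's strict ASCII codec. The variable names are illustrative.

```python
data = "Hey👋 Buddy".encode("utf-8")

try:
    data.decode("ascii")
    ascii_ok = True
except UnicodeDecodeError as err:
    ascii_ok = False
    # err.start is the offset of the first byte that is not valid ASCII.
    print(f"not ASCII: byte {data[err.start]:08b} at offset {err.start}")

print(ascii_ok)               # False
print(data.decode("utf-8"))   # Hey👋 Buddy
```

The failure is reported at offset 3, which is the 11110000 byte that starts the emoji.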

2. Text file contains: Hey Buddy

The text file doesn't have any non-ASCII characters. The file saved on the disk has the following 9 bytes in it:

01001000 01100101 01111001 00100000 01000010 01110101 01100100 01100100 01111001

Let's evaluate this file byte-by-byte following the UTF-8 decoding rules:

Byte Explanation
01001000 Starts with 0, so it's a single-byte ASCII character. The remaining bits 1001000 represent the letter 'H'. (open in playground)
01100101 Starts with 0, so it's a single-byte ASCII character. The remaining bits 1100101 represent the letter 'e'. (open in playground)
01111001 Starts with 0, so it's a single-byte ASCII character. The remaining bits 1111001 represent the letter 'y'. (open in playground)
00100000 Starts with 0, so it's a single-byte ASCII character. The remaining bits 0100000 represent a whitespace character. (open in playground)
01000010 Starts with 0, so it's a single-byte ASCII character. The remaining bits 1000010 represent the letter 'B'. (open in playground)
01110101 Starts with 0, so it's a single-byte ASCII character. The remaining bits 1110101 represent the letter 'u'. (open in playground)
01100100 Starts with 0, so it's a single-byte ASCII character. The remaining bits 1100100 represent the letter 'd'. (open in playground)
01100100 Starts with 0, so it's a single-byte ASCII character. The remaining bits 1100100 represent the letter 'd'. (open in playground)
01111001 Starts with 0, so it's a single-byte ASCII character. The remaining bits 1111001 represent the letter 'y'. (open in playground)

So this is a valid UTF-8 file, and it is also a valid ASCII file. The bytes in this file follow both the UTF-8 and ASCII encoding rules. This is how UTF-8 is designed to be backward compatible with ASCII.
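This two-way compatibility is easy to confirm in Python: the nine bytes listed above decode to the same text with either codec. The sketch rebuilds the bytes from the binary strings so nothing is taken on faith.

```python
binary = "01001000 01100101 01111001 00100000 01000010 01110101 01100100 01100100 01111001"
data = bytes(int(chunk, 2) for chunk in binary.split())

as_ascii = data.decode("ascii")
as_utf8 = data.decode("utf-8")

print(as_ascii == as_utf8)              # True
print(as_utf8)                          # Hey Buddy
print(all(byte < 0x80 for byte in data))  # True: every byte starts with 0
```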

Other Encodings

I did some quick research into other encodings that are backward compatible with ASCII. There are a few, but they are not as popular as UTF-8. One example is GB 18030 (a Chinese government standard). Another is the ISO/IEC 8859 family: single-byte encodings that extend ASCII with additional characters, but each one is limited to 256 characters.
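Python ships codecs for both of these, so the ASCII compatibility can be checked directly. This is a small sketch using the codec names `gb18030` and `iso-8859-1` (one member of the ISO/IEC 8859 family) from Python's standard library.

```python
text = "Hey Buddy"

# ASCII-only text encodes to the same bytes in all of these encodings.
same_bytes = {
    name: text.encode(name) == text.encode("ascii")
    for name in ("utf-8", "gb18030", "iso-8859-1")
}
print(same_bytes)

# ISO 8859-1 has only 256 slots, so most of Unicode does not fit.
try:
    "अ".encode("iso-8859-1")
    fits_in_latin1 = True
except UnicodeEncodeError:
    fits_in_latin1 = False
print(fits_in_latin1)   # False
```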

The siblings of UTF-8, like UTF-16 and UTF-32, are not backward compatible with ASCII. For example, the letter 'A' in UTF-16 is represented as 00 41 (two bytes, shown here in big-endian order), while in UTF-32 it is represented as 00 00 00 41 (four bytes).
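The byte sequences above can be reproduced in Python. This sketch uses the big-endian codec variants (`utf-16-be`, `utf-32-be`) so that no byte order mark is added; `bytes.hex` with a separator needs Python 3.8 or later.

```python
letter = "A"

utf8_hex = letter.encode("utf-8").hex(" ")
utf16_hex = letter.encode("utf-16-be").hex(" ")
utf32_hex = letter.encode("utf-32-be").hex(" ")

print(utf8_hex)    # 41
print(utf16_hex)   # 00 41
print(utf32_hex)   # 00 00 00 41
```

Only the UTF-8 result is the single byte that an ASCII-only program expects.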

Bonus: UTF-8 Playground

When I was exploring the UTF-8 encoding, I couldn't find any good tool to interactively visualize how UTF-8 encoding works. So I built UTF-8 Playground to visualize and play around with UTF-8 encoding. Give it a try!