UTF-8 is a Brilliant Design

2025-09-12

The first time I learned about UTF-8 encoding, I was fascinated by how thoughtfully it was designed: it can represent more than a million characters from different languages and scripts and still be backward compatible with ASCII.

Basically, UTF-8 uses up to 32 bits (four bytes) per character, while the old ASCII uses 7 bits, but UTF-8 is designed in such a way that:

Every ASCII encoded file is a valid UTF-8 file.
Every UTF-8 encoded file that has only ASCII characters is a valid ASCII file.

Designing a system that scales to more than a million characters and is still compatible with old systems that use just 128 characters is a brilliant feat.

Note: If you are already aware of the UTF-8 encoding, you can explore the UTF-8 Playground utility that I built to visualize UTF-8 encoding.

How Does UTF-8 Do It?

UTF-8 is a variable-width character encoding designed to represent every character in the Unicode character set, encompassing characters from most of the world's writing systems.

It encodes characters using one to four bytes.

The first 128 characters (U+0000 to U+007F) are encoded with a single byte, ensuring backward compatibility with ASCII, and this is the reason why a file with only ASCII characters is a valid UTF-8 file.
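As a quick check of the one-to-four-byte claim, here is a small Python sketch that encodes a few sample characters with the built-in `str.encode` and counts the resulting bytes. The sample characters and the helper name `utf8_length` are picked purely for illustration.

```python
# Sample characters chosen for illustration: one from each length class.
samples = ["A", "£", "अ", "👋"]

def utf8_length(ch: str) -> int:
    """Return how many bytes UTF-8 needs for a single character."""
    return len(ch.encode("utf-8"))

for ch in samples:
    # ord() gives the code point; format it in the usual U+XXXX style.
    print(f"U+{ord(ch):04X} {ch!r}: {utf8_length(ch)} byte(s)")
```

The ASCII letter takes one byte, and the byte count grows as the code point gets larger.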

Other characters require two, three, or four bytes. The leading bits of the first byte determine the total number of bytes that represent the current character. These bits follow one of four specific patterns, and each pattern indicates how many continuation bytes follow.

1st byte Pattern	# of bytes used	Full byte sequence pattern
0xxxxxxx	1	0xxxxxxx (This is basically a regular ASCII encoded byte)
110xxxxx	2	110xxxxx 10xxxxxx
1110xxxx	3	1110xxxx 10xxxxxx 10xxxxxx
11110xxx	4	11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

Notice that the second, third, and fourth bytes in a multi-byte sequence always start with 10. This indicates that these bytes are continuation bytes, following the main byte.
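The table above can be turned into a few bit tests. The following is a minimal Python sketch, with `sequence_length` and `is_continuation` as hypothetical helper names. It only inspects the leading bits as described here and does not reject lead bytes that a strict decoder would refuse for other reasons.

```python
def sequence_length(first_byte: int) -> int:
    """Return the total byte count of a sequence, judged from its first byte."""
    if first_byte >> 7 == 0b0:        # 0xxxxxxx
        return 1
    if first_byte >> 5 == 0b110:      # 110xxxxx
        return 2
    if first_byte >> 4 == 0b1110:     # 1110xxxx
        return 3
    if first_byte >> 3 == 0b11110:    # 11110xxx
        return 4
    raise ValueError(f"{first_byte:08b} is not a valid first byte")

def is_continuation(byte: int) -> bool:
    """Return True if the byte has the 10xxxxxx continuation pattern."""
    return byte >> 6 == 0b10
```

For example, `sequence_length(0b11100000)` gives 3, while passing a continuation byte such as `0b10100100` raises `ValueError`, because a sequence can never start with `10`.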

The remaining bits in the main byte, along with the bits in the continuation bytes, are combined to form the character's code point. A code point serves as a unique identifier for a character in the Unicode character set. A code point is typically represented in hexadecimal format, prefixed with "U+". For example, the code point for the character "A" is U+0041.
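In Python, the built-in `ord` returns a character's code point and `chr` goes the other way, so the "U+" notation is easy to reproduce. The helper name `code_point_label` below is just an illustrative choice.

```python
def code_point_label(ch: str) -> str:
    """Format a single character's code point in U+XXXX notation."""
    # At least four hex digits, upper case, as is conventional for Unicode.
    return f"U+{ord(ch):04X}"

print(code_point_label("A"))   # the letter from the paragraph above
print(chr(0x0041))             # back from code point to character
```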

So here is how software determines the character from the UTF-8 encoded bytes:

Read a byte. If it starts with 0, it's a single-byte character (ASCII). Show the character represented by the remaining 7 bits on the screen. Continue with the next byte.
If the byte didn't start with a 0, then:
- If it starts with 110, it's a two-byte character, so read the next byte as well.
- If it starts with 1110, it's a three-byte character, so read the next two bytes.
- If it starts with 11110, it's a four-byte character, so read the next three bytes.
Once the number of bytes is determined, take the bits of every byte in the sequence except the leading marker bits, join them together, and read the result as a binary value (a.k.a. the code point) of the character.
Look up the code point in the Unicode character set to find the corresponding character and display it on the screen.
Read the next byte and repeat the process.
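The steps above can be sketched as a small Python decoder. This is a simplified illustration, not a production decoder: the function name `decode_utf8` is made up for this example, and it skips checks a strict decoder would perform (overlong encodings and surrogate code points, for instance). In real code, `bytes.decode("utf-8")` does all of this for you.

```python
def decode_utf8(data: bytes) -> str:
    """Decode UTF-8 bytes by following the leading-bit rules step by step."""
    chars = []
    i = 0
    while i < len(data):
        first = data[i]
        # Work out the sequence length and keep the payload bits of the first byte.
        if first >> 7 == 0b0:
            length, code_point = 1, first
        elif first >> 5 == 0b110:
            length, code_point = 2, first & 0b00011111
        elif first >> 4 == 0b1110:
            length, code_point = 3, first & 0b00001111
        elif first >> 3 == 0b11110:
            length, code_point = 4, first & 0b00000111
        else:
            raise ValueError(f"invalid first byte {first:08b} at offset {i}")

        tail = data[i + 1 : i + length]
        if len(tail) != length - 1:
            raise ValueError(f"truncated sequence at offset {i}")
        for byte in tail:
            if byte >> 6 != 0b10:
                raise ValueError(f"expected a continuation byte, got {byte:08b}")
            # Append the six payload bits of each continuation byte.
            code_point = (code_point << 6) | (byte & 0b00111111)

        chars.append(chr(code_point))  # look up the character for this code point
        i += length
    return "".join(chars)
```

Calling `decode_utf8(bytes([0b11100000, 0b10100100, 0b10000101]))` walks the three-byte branch and returns a one-character string.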

Example: Hindi Letter "अ" (open in UTF-8 Playground)

The Hindi letter "अ" (officially "Devanagari Letter A") is represented in UTF-8 as:

11100000 10100100 10000101

Here:

The first byte 11100000 indicates that the character is encoded using 3 bytes.

The remaining bits of the three bytes: xxxx0000 xx100100 xx000101 are combined to form the binary sequence 00001001 00000101 (0x0905 in hexadecimal). This is the code point of the character, represented as U+0905.

The code point U+0905 (see official chart) represents the Hindi letter "अ" in the Unicode character set.
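The same bit-combining can be written out with masks and shifts. This Python sketch hard-codes the three bytes from the example; the variable names are arbitrary.

```python
# The three bytes of the example, written in binary.
encoded = bytes([0b11100000, 0b10100100, 0b10000101])

# Mask off the marker bits and shift each payload into place:
# 4 bits from the first byte, then 6 bits from each continuation byte.
code_point = (
    ((encoded[0] & 0b00001111) << 12)
    | ((encoded[1] & 0b00111111) << 6)
    | (encoded[2] & 0b00111111)
)

print(f"U+{code_point:04X}")            # U+0905
print(chr(code_point))                  # the character for that code point
print(encoded.decode("utf-8"))          # Python's own decoder agrees
```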

Example Text Files

Now that we understand the design of UTF-8, let's look at a file that contains the following text:

1. Text file contains: `Hey👋 Buddy`

The text Hey👋 Buddy has both English characters and an emoji character in it. A text file with this text saved to disk will have the following 13 bytes in it:

01001000 01100101 01111001 11110000 10011111 10010001 10001011 00100000 01000010 01110101 01100100 01100100 01111001
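If you want to reproduce this byte listing yourself, a short Python sketch like the following should do it. The helper name `to_binary` is just for illustration.

```python
def to_binary(text: str) -> str:
    """Encode text as UTF-8 and render each byte as eight binary digits."""
    return " ".join(f"{byte:08b}" for byte in text.encode("utf-8"))

encoded = "Hey👋 Buddy".encode("utf-8")
print(len(encoded))               # number of bytes in the file
print(to_binary("Hey👋 Buddy"))   # the bytes, space-separated, in binary
```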

Let's evaluate this file byte-by-byte following the UTF-8 decoding rules:

Byte	Explanation
01001000	Starts with `0`, so it's a single-byte ASCII character. The remaining bits `1001000` represent the letter 'H'. (open in playground)
01100101	Starts with `0`, so it's a single-byte ASCII character. The remaining bits `1100101` represent the letter 'e'. (open in playground)
01111001	Starts with `0`, so it's a single-byte ASCII character. The remaining bits `1111001` represent the letter 'y'. (open in playground)
11110000	Starts with `11110`, indicating it's the first byte of a four-byte character.
10011111	Starts with `10`, indicating it's a continuation byte.
10010001	Starts with `10`, indicating it's a continuation byte.
10001011	Starts with `10`, indicating it's a continuation byte. The bits from these four bytes (excluding the leading bits) combine to form the binary sequence 00001 11110100 01001011, which is `1F44B` in hexadecimal and corresponds to the code point `U+1F44B`. This code point represents the waving hand emoji "👋" in the Unicode character set (open in playground).
00100000	Starts with `0`, so it's a single-byte ASCII character. The remaining bits `0100000` represent a whitespace character. (open in playground)
01000010	Starts with `0`, so it's a single-byte ASCII character. The remaining bits `1000010` represent the letter 'B'. (open in playground)
01110101	Starts with `0`, so it's a single-byte ASCII character. The remaining bits `1110101` represent the letter 'u'. (open in playground)
01100100	Starts with `0`, so it's a single-byte ASCII character. The remaining bits `1100100` represent the letter 'd'. (open in playground)
01100100	Starts with `0`, so it's a single-byte ASCII character. The remaining bits `1100100` represent the letter 'd'. (open in playground)
01111001	Starts with `0`, so it's a single-byte ASCII character. The remaining bits `1111001` represent the letter 'y'. (open in playground)

Now this is a valid UTF-8 file, but it is not a valid ASCII file, because it contains a non-ASCII character (the emoji). Backward compatibility only applies to files made up entirely of ASCII characters, so next let's create a file that contains only ASCII characters.

2. Text file contains: `Hey Buddy`

The text file doesn't have any non-ASCII characters. The file saved on the disk has the following 9 bytes in it:

01001000 01100101 01111001 00100000 01000010 01110101 01100100 01100100 01111001

Let's evaluate this file byte-by-byte following the UTF-8 decoding rules:

Byte	Explanation
01001000	Starts with `0`, so it's a single-byte ASCII character. The remaining bits `1001000` represent the letter 'H'. (open in playground)
01100101	Starts with `0`, so it's a single-byte ASCII character. The remaining bits `1100101` represent the letter 'e'. (open in playground)
01111001	Starts with `0`, so it's a single-byte ASCII character. The remaining bits `1111001` represent the letter 'y'. (open in playground)
00100000	Starts with `0`, so it's a single-byte ASCII character. The remaining bits `0100000` represent a whitespace character. (open in playground)
01000010	Starts with `0`, so it's a single-byte ASCII character. The remaining bits `1000010` represent the letter 'B'. (open in playground)
01110101	Starts with `0`, so it's a single-byte ASCII character. The remaining bits `1110101` represent the letter 'u'. (open in playground)
01100100	Starts with `0`, so it's a single-byte ASCII character. The remaining bits `1100100` represent the letter 'd'. (open in playground)
01100100	Starts with `0`, so it's a single-byte ASCII character. The remaining bits `1100100` represent the letter 'd'. (open in playground)
01111001	Starts with `0`, so it's a single-byte ASCII character. The remaining bits `1111001` represent the letter 'y'. (open in playground)

So this is a valid UTF-8 file, and it is also a valid ASCII file. The bytes in this file follow both the UTF-8 and ASCII encoding rules. This is how UTF-8 is designed to be backward compatible with ASCII.
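Both examples can be checked with Python's strict `ascii` codec, which refuses any byte above 127. The helper name `is_valid_ascii` is a made-up name for this sketch.

```python
def is_valid_ascii(data: bytes) -> bool:
    """Return True if every byte decodes under the strict ASCII codec."""
    try:
        data.decode("ascii")
    except UnicodeDecodeError:
        return False
    return True

ascii_only = "Hey Buddy".encode("utf-8")
with_emoji = "Hey👋 Buddy".encode("utf-8")

print(is_valid_ascii(ascii_only))   # True: every byte starts with 0
print(is_valid_ascii(with_emoji))   # False: the emoji bytes are above 127
# The ASCII-only file reads the same under either decoder.
print(ascii_only.decode("ascii") == ascii_only.decode("utf-8"))
```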

Other Encodings

I did some quick research on other encodings that are backward compatible with ASCII. There are a few, but they are not as popular as UTF-8; one example is GB 18030 (a Chinese government standard). Others are the ISO/IEC 8859 encodings, which are single-byte encodings that extend ASCII with additional characters, but each of them is limited to 256 characters.

The siblings of UTF-8, like UTF-16 and UTF-32, are not backward compatible with ASCII. For example, the letter 'A' in UTF-16 is represented as 00 41 (two bytes), while in UTF-32 it is represented as 00 00 00 41 (four bytes). Both are shown here in big-endian byte order.
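A small Python sketch makes the difference visible. It assumes big-endian byte order (the `utf-16-be` and `utf-32-be` codecs) to match the byte sequences quoted above; the helper name `hex_bytes` is illustrative, and `bytes.hex` with a separator needs Python 3.8 or later.

```python
def hex_bytes(text: str, encoding: str) -> str:
    """Encode text and return the bytes as space-separated hex pairs."""
    return text.encode(encoding).hex(" ")

for encoding in ("ascii", "utf-8", "utf-16-be", "utf-32-be"):
    print(f"{encoding:>9}: {hex_bytes('A', encoding)}")
```

Only the UTF-8 output matches the ASCII output byte for byte; the other two pad the letter with zero bytes that an ASCII-only program would not expect.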

Bonus: UTF-8 Playground

When I was exploring the UTF-8 encoding, I couldn't find any good tool to interactively visualize how UTF-8 encoding works. So I built UTF-8 Playground to visualize and play around with UTF-8 encoding. Give it a try!

For an ocean of knowledge and references that extend this post, read the discussion on Hacker News.

You can also find discussions on OSnews, lobste.rs, and Hackaday.

Some excellent references on UTF-8:

Joel Spolsky's famous 2003 article (still relevant): The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
"UTF-8 was designed, in front of my eyes, on a placemat in a New Jersey diner one night in September or so 1992." - Rob Pike on designing UTF-8 with Ken Thompson
An excellent explainer by Russ Cox: UTF-8: Bits, Bytes, and Benefits

#tech #history #programming