A Brief Introduction to Unicode for Everybody

Maciej Głowacki
Published in Daftcode Blog
Apr 19, 2017 · 10 min read

Illustration by Sonia Budner

Have you heard of either Unicode or UTF? I can tell you that by reading this article you’re using UTF-8 right now (well, unless you have this text printed on paper). You may even have a subconscious feeling that this whole UTF is somehow related to Unicode. Even if nothing rings a bell, don’t worry, I’ll explain everything soon. So what is this whole Unicode?

Spoiler alert: we’ll be talking about letters, so here is a photo of a typewriter, a clever device for making letters appear on paper

A look at the past

To have a good understanding of what Unicode is and why it exists, we have to look back at the history of PCs. Actually, we have to look back even further than the first computer you’ve ever used or seen. Let’s begin in the telegraph era.

For the sake of simplicity, let’s say that the telegraph works by sending letters from one machine to another (quite like the Internet). And to do this, it uses electricity. But how do you “send a letter” using something as simple as an electric signal, which is either “ON” or “OFF”? That’s what the Morse Code was for.

The very first encoding

I’m sure you know what the Morse Code is. A series of dots and dashes, or beeps and boops, somehow making sense to people who know it.

-.. .- ..-. - -.-. --- -.. .

For example, if you know the Morse Code, you can read that the sequence above is “daftcode”. Otherwise you may be wondering what’s happening here. Well, the Morse Code is nothing more than a set of rules, or a code if you want:

a = .-
b = -...
c = -.-.
d = -..
e = .
and so on

With these rules, you can change a text written with letters into text written with symbols — a process called encoding. And the good thing about the Morse Code is that it works both ways. You use the same rules (but the other way around) for decoding.
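
If you like seeing ideas as code, here’s a minimal sketch of that process in Python (an illustration only, not part of any standard), using just the handful of letters needed for the “daftcode” example above:

# A tiny piece of the Morse Code rules, written as a Python dictionary.
MORSE = {
    "a": ".-", "b": "-...", "c": "-.-.", "d": "-..",
    "e": ".", "f": "..-.", "o": "---", "t": "-",
}

def encode(text):
    # Letters become dots and dashes; the spaces between codes play the role
    # of "waiting between letters".
    return " ".join(MORSE[letter] for letter in text)

def decode(signals):
    # The same rules, just applied the other way around.
    reverse = {code: letter for letter, code in MORSE.items()}
    return "".join(reverse[code] for code in signals.split())

print(encode("daftcode"))                      # -.. .- ..-. - -.-. --- -.. .
print(decode("-.. .- ..-. - -.-. --- -.. ."))  # daftcode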

Now it becomes clear how to send text with a telegraph. First, you use the Morse Code to encode your message. Then you “beep” it with short and long signals, waiting between letters. This way the person on the other end can decode it using the very same Morse Code.

How do you show the Morse Code on a photo if not with a radio? (they didn’t actually do “beeps” using the microphone)

Fast-forward 100 years

We’re moving forward to the year 1960. But in the meantime, people figured out how to encode telegraph messages automatically. Now, all you need to communicate through electricity is a special device called the teleprinter. Think of it as a typewriter that can receive and type text sent by other people.

But it was hard to make it work using the Morse Code. You see, letters encoded with the Morse Code have different lengths. For example, the letter “e” only uses 1 signal, while “b” needs 4 of them.

So people invented a new code in which all the letters have the same length (5 signals in this case).

Now all the letters are the same length, so you don’t have to wait between them. Count the signals and you know when a new letter begins.

So we’re left with a constant stream of long and short signals. Instead of sending long signals, let’s send nothing (an “OFF” state) for the same duration as a short one. Now we don’t need any kind of silence in between. What we have is a stream of “ONs” and “OFFs”, or 1s and 0s, or “bits” as they’re usually called.
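
Here’s a toy version of that idea in Python. The 5-bit patterns below are made up for illustration; the real teleprinter code (Baudot) assigned different ones:

# A made-up fixed-length code: every letter gets exactly 5 bits.
CODE = {"a": "00001", "b": "00010", "c": "00011", "d": "00100"}

def encode(text):
    # No separators needed, because every letter is exactly 5 bits long.
    return "".join(CODE[letter] for letter in text)

def decode(bits):
    reverse = {code: letter for letter, code in CODE.items()}
    # Just count: every 5 bits a new letter begins.
    return "".join(reverse[bits[i:i + 5]] for i in range(0, len(bits), 5))

stream = encode("dab")
print(stream)          # 001000000100010
print(decode(stream))  # dab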

And that’s more or less what had happened by 1960.

Say hello to ASCII

Here we are: the year 1960. People are using teleprinters, but are often dissatisfied with the limited set of available characters. At the time you could only send capital letters, numbers and some symbols. The problem is that there are only 5 bits available. Since a single bit can be either one or zero, there are only 2⁵ = 32 possible characters you can send. Well, actually people used a clever trick (which is not important for this article) to send a little over 50 characters. But the English alphabet has 26 letters and 10 digits, so there aren’t many characters left out of those 50.

That’s the reason why the American Standard Code for Information Interchange, better known as ASCII, got created. It used the same concept as previous encodings but was 7 bits long. 7 bits means 2⁷ = 128 available characters. With so many characters available, they were able to fit both lowercase and capital letters, numbers, and all kinds of special characters in there.
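
You can still see those 7-bit numbers today. In Python, for example, ord() gives you the number assigned to a character (which, for these characters, is simply the ASCII code), and you can print it in binary to see that 7 bits are enough:

# Every ASCII character maps to a number that fits in 7 bits (0-127).
for ch in "Aa1!":
    print(ch, ord(ch), format(ord(ch), "07b"))

# A 65 1000001
# a 97 1100001
# 1 49 0110001
# ! 33 0100001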

Just look at all those brand-new letters available

It was a great standard and became a foundation of everything that was about to come.

Going international

Despite having such an extensive character range, there was one serious issue with ASCII. It was designed with the English alphabet in mind. And what about German ä, ö and the rest of the umlauts? Or Polish ź, ł, ę and all the others? Or any other European language which definitely uses some special characters?

That’s when ISO joins the game

By the way, did you know that ISO stands for International Organization for Standardization, which in no way seems to shorten to ISO? Pretty ironic, as far as dealing with standards goes.

Anyway, ISO issued a group of standards named ISO/IEC 646. These standards explained how to adapt ASCII to regional needs. They indicated 10 ASCII characters that could be replaced with special, regional characters. There were also two characters dedicated to currency symbols. Now you could use your local currency symbol instead of “$”. But good luck trying to fit all 18 Polish letters (9 special letters, each in two cases) in there. Or imagine you are a software developer trying to write some code in C, which was a prevailing programming language back then.

That’s what C code looks like. Quite a lot of unsupported characters here…

The thing is, your regional variant doesn’t support any of the following characters:

# { } [ ]

And it only got worse from here…

Welcome to the 80s

Do you know what happened in the eighties? Well, probably a lot of things, but one of them was 8-bit Personal Computers getting more and more popular. From now on we can call 8 bits a byte. Anyway, this one extra bit made PCs quite different from earlier 7-bit devices. And it was pretty convenient as far as text encoding goes. Every extra bit doubles the number of available characters, so there were now 128 new empty spaces to fill. And computer manufacturers didn’t wait for any standard to arise. Each of them took the initiative and invented their own encoding system. We saw (well, not me personally, but I’m sure some of us did) the rise of encodings such as HP Roman, Mac OS Roman, and Code Page 437, used by IBM and in DOS.
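
Python still ships codecs for some of those legacy code pages, so you can see the mess for yourself. In the little sketch below, the very same byte above 127 turns into a different character depending on whose encoding you assume:

# One byte, two machines, two different characters.
byte = bytes([0x82])
print(byte.decode("cp437"))      # é  (IBM PC and DOS, Code Page 437)
print(byte.decode("mac_roman"))  # Ç  (Mac OS Roman)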

The situation desperately called for help.

ISO once again

The call for help was once again heard by the guys from ISO. And what they did was (most likely) say:

Enough of this! Here is the one and ultimate standard for character encoding called ISO-8859-1.

And it was great as long as you were using Western European languages. If you were more into Central European languages, there was a standard for this as well, called ISO-8859-2. Actually, ISO-8859 was a whole group of standards, going all the way up to number 16.

Some people say there’s a relevant XKCD comic for everything. #927, “Standards”, is the relevant one here.

It looks like this whole thing with creating new standards was quite fun. And Microsoft didn’t want to miss out on that, so they joined the bandwagon as well. They created Windows-1250, which was almost like ISO-8859-2, but not quite. And Windows-1252, which resembled ISO-8859-1, but with some additions here and there. And it didn’t stop there. More and more encodings were created, often changing and remixing existing ones.

That looks like fun. It’s probably why Microsoft joined the bandwagon.

But let’s stop for a moment and ask ourselves one important question: “why?”

I mean, there is an encoding which supports Western European languages, and there is also one for Central European languages. Why would you create strange hybrids which support a few characters from here, a few from there, and throw in a few entirely new ones? Imagine you have an encoding which supports French letters. You also have another one which supports the Greek alphabet. Using only one of them (you can’t use two encodings in one file), try to write a text in French about the Greek language.
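
You can check this for yourself with Python’s built-in codecs. A French sentence about a Greek letter fails in both the Western European encoding and the Greek one, each time on a different character:

# Neither ISO-8859-1 (Western European) nor ISO-8859-7 (Greek) can hold
# this whole sentence: the first has no α, the second has no é.
text = "La lettre grecque α est appelée « alpha »."

for encoding in ("iso-8859-1", "iso-8859-7"):
    try:
        text.encode(encoding)
        print(encoding, "works")
    except UnicodeEncodeError as error:
        print(encoding, "fails:", error)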

Now you see why.

And the fun didn’t quite stop here

Up to this point, we’ve been mostly working with languages using Latin letters with some regional modifications. But as you may know, there is a bunch of languages that don’t. People from China, Japan and Korea were all using computers as well. But how were they supposed to encode tens of thousands of Chinese characters with only 256 characters available in an 8-bit encoding?

They all agreed (that’s something new as far as encodings go) that it’s necessary to use 16 bits (otherwise known as 2 bytes) to encode all those characters. It might not look like a big difference, but keep in mind that instead of 2⁸ = 256 possibilities, you now get 2¹⁶ = 65536 possible characters. 65 thousand characters were just enough for encoding Asian languages. But don’t be fooled. Agreeing on code length didn’t mean agreeing on one encoding standard. The same encoding frenzy that was happening in Europe and the USA was happening here too.
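
The arithmetic is easy to check. The sketch below uses a modern 2-byte encoding (UTF-16, which didn’t exist yet at that time) purely to illustrate the “two bytes per character” idea; the actual Asian encodings of that era differed in the details:

# One byte holds at most 2**8 = 256 values; two bytes hold 2**16 = 65536.
print(2 ** 8, 2 ** 16)                  # 256 65536

# In a plain 2-byte encoding every such character takes exactly two bytes:
print("中".encode("utf-16-be"))          # b'N-'  (the two bytes 0x4E 0x2D)
print(len("中文".encode("utf-16-be")))   # 4: two characters, two bytes each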

Can we all agree to never repeat a “war of standards” like that one?

Unicode to the rescue

It was at this moment people saw the ultimate solution to all their problems… Or did they?

To resolve all encoding issues, there was an idea to create one global, universal character map. Unlike previous ISO encodings, it would cover all existing languages. This idea resulted in the creation of the Universal Character Set (published as a standard called ISO/IEC 10646). But as you might have guessed, it was not the only implementation of such an idea. Anyway, the first thing they did was create the Basic Multilingual Plane. It was a map of the most common Chinese characters and the letters of all other languages.

But they didn’t only create a list of all those characters. They also suggested a way to encode them, which they called UCS-2. The Chinese language needs the most characters of all the mapped languages, and it fit fine into 2 bytes. For this reason (among many others) UCS-2 also used 2 bytes, a 16-bit code length (hence the 2 in the name). But it quickly turned out that the Basic Multilingual Plane was actually quite incomplete.
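
It’s easy to see today what “incomplete” means. Everything in the Basic Multilingual Plane has a number that fits in 16 bits (at most 0xFFFF); plenty of later additions, emoji being the most visible modern example, got numbers above that:

# Which characters fit into the 16-bit Basic Multilingual Plane?
for ch in ("A", "中", "😀"):
    number = ord(ch)
    print(ch, hex(number), "inside the BMP" if number <= 0xFFFF else "outside the BMP")

# A 0x41 inside the BMP
# 中 0x4e2d inside the BMP
# 😀 0x1f600 outside the BMP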

And it all started with sending electrical “beeps” two centuries ago

If 2 bytes are not enough for you to encode all the characters, what do you do? That’s right. You create a new encoding with a longer code. And so the idea for UCS-4 was born.

Birth and rule of the Unicode

To turn this idea into something real, IEEE suggested a character encoding which met the UCS-4 requirements. But for reasons that are out of the scope of this article, their encoding was not accepted as a standard. That’s when another player took the initiative.

This other player was Unicode. Unicode is an organization (the Unicode Consortium) creating the Unicode Standard. The Unicode Standard is more or less the same thing as the Universal Character Set, with some extra rules. Unicode had its own 16-bit encoding as well, and was working on a 32-bit one. It turned out that the Unicode Standard was easier to use than the Universal Character Set, and more and more companies were leaning towards the simpler alternative. At this point, ISO noticed what was happening. Once again, many competing standards were starting to emerge. To prevent this, ISO negotiated with Unicode. And miraculously, they both agreed on one common solution.

This way Unicode’s encodings, known as UTF (short for Unicode Transformation Format), became the standard way to encode text in the modern world. There are a few variants, namely UTF-8, UTF-16 and UTF-32 (yup, those numbers are code lengths in bits), which are all used nowadays. Each of them has its advantages and disadvantages, but all of them can support millions of characters, which should be enough for many years to come…
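
As a quick taste of the differences (the details are for the next part), here’s the same short text encoded with each variant in Python. The byte counts differ; note that Python prepends a byte order mark for utf-16 and utf-32:

# The same text, three UTF variants, three different sizes.
text = "Zażółć 中文"
for encoding in ("utf-8", "utf-16", "utf-32"):
    print(encoding, len(text.encode(encoding)), "bytes")

# utf-8 is the most compact for mostly-Latin text, while utf-32 always
# spends 4 bytes per character (plus the byte order mark).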

One Encoding to rule them all, One Encoding to find them, One Encoding to bring them all and in the darkness bind them all

That was a long way humanity had to come just to decide on one universal way to send text using electricity. I had to simplify a lot of things to make it fit in one article, but the main story remained intact. The Internet nowadays is a much happier place, with almost 90% of all websites using a common standard for character encoding: UTF-8.

Woah, wait! But how do they fit millions of characters into an 8-bit UTF-8? Didn’t 8-bit encodings only support 256 characters?

Well, that’s a different story, which happens to be the second part of the one you’ve just read. We’ll get into the technical details of Unicode there.


If you enjoyed this post, please don’t forget to tap ❤! You can also follow us on Facebook, Twitter and LinkedIn.

