
Characters, Symbols and the Unicode Miracle - Computerphile

May 02, 2020
UTF-8 is perhaps the best trick, the best thing you can write on the back of a napkin, and that's how it was created. The first draft of UTF-8 was written on the back of a napkin in a restaurant, and it's such a neat trick that solved so many problems, and I love it. In the 1960s we had teleprinters, simple devices where you type a key, it sends some numbers, and the same letter comes out the other side. But there needs to be a standard, so by the mid-1960s in the United States, at least, they settled on ASCII, the American Standard Code for Information Interchange. It's a 7-bit binary system, so each letter you type is converted into 7 binary digits and sent down the wire.
That means you can have numbers from 0 to 127. They set aside the first 32 for control codes, things that aren't really meant to be typed, like "go down a line" or "go back". And then they made the rest of them characters: they added some numbers and some punctuation marks. They did something really clever: they made 'A' 65, which in binary (count it: 1, 2, 4, 8, 16, 32, 64) is 1000001. That means 'B' is 66, so you have 2 in binary right there; 'C' is 67, 3 in binary. So you can look at a 7-bit binary character, simply remove the first two digits, and know what its position is in the alphabet.
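As a quick illustration, here's a minimal Python sketch of that trick (not from the video): mask off the top two of the seven bits and what's left is the letter's position in the alphabet.

```python
# Minimal sketch of the ASCII trick: drop the top two of the seven bits
# and what remains is the letter's position in the alphabet.
for letter in "ABC":
    code = ord(letter)                       # 'A' -> 65, 'B' -> 66, 'C' -> 67
    print(letter, format(code, "07b"), code & 0b11111)
# A 1000001 1
# B 1000010 2
# C 1000011 3
```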
Even smarter than that, they started the lowercase letters 32 later, meaning lowercase 'a' is 97, which is 1100001. Anything that doesn't fit that pattern is probably a space, which conveniently comes out as almost all zeros, or some kind of punctuation mark. A brilliant, clever, wonderful, elegant way of doing things, and that became the standard, at least in the English-speaking world. As for the rest of the world, some of them made versions of that, but you start getting into other alphabets, into languages that don't really use alphabets at all. Everyone came up with their own encoding, which is fine. And then computers come along and, over time, things change.
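Again purely as an illustrative sketch (assuming nothing beyond standard ASCII), that offset of 32 means a single bit flips a letter between cases:

```python
# Sketch of the "32 later" trick: lowercase letters sit exactly 32 above
# uppercase, so one bit (value 32) switches the case.
print(ord("a"), format(ord("a"), "07b"))   # 97 1100001
print(chr(ord("A") + 32))                  # 'a'
print(chr(ord("a") & ~32))                 # 'A' -- clearing the 32 bit upper-cases it
```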
We moved on to 8-bit computers, so now we have an extra bit at the start, just to confuse things, which means we can go up to 256! We can have twice as many characters! And of course everyone settled on the same standard for this, because that would be the sensible thing to do. No. None of them did. The Nordic countries start including Norwegian and Finnish characters. Japan just doesn't use ASCII at all; Japan creates its own multibyte encodings with more letters, more characters, and more binary numbers for each individual character. All of these things are hugely incompatible. Japan actually ends up with three or four different encodings, all of which are completely incompatible with each other.
If you send a document from one old-school Japanese computer to another, it can come out so garbled that there's even a word in Japanese for "garbled characters"; it's (I'm probably pronouncing this wrong) "mojibake". It's a bit of a nightmare, but it's not that bad, because how often does someone in London have to send a document to a completely incompatible and unknown computer at some other company in Japan? In those days, rarely. You printed it and sent it by fax. And then the World Wide Web came along and we had a problem, because suddenly documents were being sent all around the world all the time.
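As a rough sketch of what mojibake looks like in practice (the particular pair of encodings here is just an assumed example), write bytes under one encoding and read them back under another:

```python
# Sketch of mojibake: bytes written under one encoding, read back under another.
text = "文字化け"                     # the Japanese word "mojibake" itself
raw = text.encode("shift_jis")        # an older Japanese encoding
print(raw.decode("latin-1"))          # interpreted as Western European: garbage
```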
Then something called the Unicode Consortium was created. In what I can only describe as a miracle, over the last couple of decades they have hammered out a standard. Unicode now has a list of over a hundred thousand characters that covers everything you could possibly want to write in any language: the English alphabet, the Cyrillic alphabet, the Arabic alphabet, Japanese, Chinese and Korean characters. What you end up with is the Unicode Consortium mapping more than 100,000 characters to more than 100,000 numbers. They haven't chosen binary digits. They haven't chosen how any of it should be represented. All they've said is: THAT Arabic character over there, that's number 5,700-something, and this linguistic symbol here, that's 10,000-something.
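For illustration, here's a small Python sketch of that mapping from characters to numbers (code points), saying nothing about bytes; the sample characters are my own picks:

```python
# Sketch: Unicode assigns each character a number (a code point),
# saying nothing yet about how that number is stored as bits.
for ch in "Aяبあ中한":    # Latin, Cyrillic, Arabic, Japanese, Chinese, Korean
    print(ch, ord(ch), hex(ord(ch)))
```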
I have to simplify a lot here, because of course there are five or six incompatible ways to do this, but what the web has more or less settled on is something called "UTF-8". There are a couple of problems with doing the obvious thing, which is saying, "OK, we need to go to 100,000-something. That's going to take, what... to be safe, it's going to take 32 binary digits to encode it." They encoded the English alphabet exactly the same way ASCII did; 'A' is still 65. So if you have just an English text string and you encode it at 32 bits per character, you're going to have about 20-something... 26?
Yes, 26, 27 zeros and then a few ones for each character. That is an incredible waste. Suddenly every English text file takes four times as much disk space. So, problem 1: you need to get rid of all the zeros in English text. Problem 2: there are a lot of older computer systems that interpret 8 zeros in a row, a NULL, as "this is the end of the string of characters". So if you ever send 8 zeros in a row, they just stop listening; they assume the string has ended there and it gets cut off. So there can't be 8 zeros in a row anywhere. OK.
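To put a rough number on the waste (a hedged sketch; UTF-32 stands in here for the naive fixed 32-bit idea):

```python
# Sketch of problem 1: a fixed 32-bits-per-character encoding makes plain English
# text four times bigger, and most of the extra bytes are zeros (problem 2).
text = "Hello, world"
print(len(text.encode("utf-8")))       # 12 bytes: one per character
print(len(text.encode("utf-32-be")))   # 48 bytes: four per character
print(text.encode("utf-32-be")[:8])    # b'\x00\x00\x00H\x00\x00\x00e'
```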
Problem number 3: it has to be backwards compatible. You have to be able to take this Unicode text, put it into something that only understands basic ASCII, and have it more or less work for English text. UTF-8 solves all these problems and it's just a wonderful hack. It starts by just taking ASCII: if you have something under 128, which can be expressed in 7 digits, you put down a zero and then the same numbers you would otherwise use. So let's have that 'A' again. Here we go! That's still 'A'. That's still 65. It's still valid UTF-8 and it's still valid ASCII.
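A tiny sketch of that backwards compatibility, using Python's built-in codecs:

```python
# Sketch of problem 3 solved: for plain ASCII text, the UTF-8 bytes
# are exactly the ASCII bytes.
print("A".encode("utf-8"))                              # b'A'
print("A".encode("utf-8")[0])                           # 65
print("ABC".encode("utf-8") == "ABC".encode("ascii"))   # True
```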
Brilliant. OK. Now let's say we go above that. Now you need something that will, more or less, work with ASCII, or at least not break things, but still be understood. So what you do is start by writing "110". That means: this is the start of a new character, and this character is going to be 2 bytes long. Two ones, two bytes; one byte being 8 bits. And you say the next byte starts with "10", which means it's a continuation, and in all these blank spaces, of which you have 5 here and 6 here, you fill in the other numbers. Then when you do the calculation, you just remove those headers and it's understood as whatever number it turns out to be.
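As a concrete sketch (the character 'é' is just an assumed example), stripping the "110" and "10" headers recovers the number:

```python
# Sketch of a two-byte UTF-8 sequence: 110xxxxx 10xxxxxx, 5 + 6 = 11 payload bits.
ch = "é"                                       # U+00E9, code point 233
b = ch.encode("utf-8")
print([format(byte, "08b") for byte in b])     # ['11000011', '10101001']
payload = ((b[0] & 0b00011111) << 6) | (b[1] & 0b00111111)
print(payload, ord(ch))                        # 233 233: headers stripped, number recovered
```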
That gives you 11 spare bits, which is enough for the first 2,048 characters. What happens above that? Well, above that you go "1110", which means there are three bytes in this one (three ones, three bytes), with two continuation bytes. So now you have 4, plus 6, plus 6: 16 spaces. Want to go beyond that? You can. The specification goes up to "1111110x", with a string of continuation bytes after it. It's a neat trick that you can explain on the back of a napkin or a piece of paper. It's backwards compatible. It avoids waste. It never sends 8 zeros in a row. And, really the most important thing, the thing that made it beat all the other systems: you can move back and forth through it very easily.
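To illustrate how the length grows with the code point (these characters are just assumed examples), each first byte announces its own length:

```python
# Sketch: the leading ones in the first byte say how many bytes the character
# takes, and every continuation byte starts with 10.
for ch in ("A", "é", "€", "😀"):        # 1-, 2-, 3- and 4-byte characters
    b = ch.encode("utf-8")
    print(ch, hex(ord(ch)), len(b), [format(byte, "08b") for byte in b])
```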
You don't need an index of where each character starts. If you're halfway through a string and want to go back one character, you just look for the previous header. And that's it, and it works, and, as of a couple of years ago, UTF-8 overtook ASCII and everything else to become, for the first time, the dominant character encoding on the web. We don't have the mojibake the Japanese had. We have something that just about works, and that's why it's the most beautiful hack I can think of, one that's used around the world every second of every day.
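A final sketch of that back-and-forth property (the helper name and the example string are my own assumptions, not from the video): continuation bytes always start with "10", so stepping backwards just means skipping over them.

```python
# Sketch: to find where the previous character starts, walk backwards past
# continuation bytes (bytes whose top two bits are 10).
def previous_char_start(data: bytes, index: int) -> int:
    """Return the index of the character boundary just before byte `index`."""
    index -= 1
    while index > 0 and (data[index] & 0b11000000) == 0b10000000:
        index -= 1                     # still inside a multi-byte character
    return index

data = "naïve".encode("utf-8")                # the 'ï' takes two bytes
print(previous_char_start(data, len(data)))   # index of the final 'e'
```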
(BRADY HARAN) We'd like to thank Audible.com for their support of this Computerphile video, and if you sign up for Audible and visit audible.com/computerphile, you can download a free audiobook.
They have a huge range of books on Audible. I'd like to recommend "The Last Man on the Moon" by Eugene Cernan, who is the eleventh of the twelve men to set foot on the Moon, but he was the last man to leave it, so I'm not sure whether he is "the last man on the Moon" or not. It kind of depends on how you define it. But his book is really good, and what I really like is that Cernan reads it himself, which I think is great. Again, thanks to Audible. Go to audible.com/computerphile and get a free audiobook. (TOM SCOTT) "...an old system that hasn't been programmed well will take those nice quotes that Microsoft Word has put in as Unicode, look at them, and say, 'That's three separate characters...'"
