Unicode Encoding! UTF-32, UCS-2, UTF-16, & UTF-8!

Jun 07, 2021

Hi, this is something I've been working on for a while. I think Unicode encodings are much simpler than people think. I've been putting together an educational diagram that explains the whole concept, and I'm going to post that separately, but I wanted to do a video because it can be a little more informative.

So, Unicode encodings. First of all, what is Unicode? Unicode is a collection of symbols: the number three, a semicolon, a caret, this Hangul character, a right arrow, the poop emoji. Basically all the symbols you see online are part of the Unicode standard, at least for the most part. So far it includes about 144,000 characters, but the standard itself allows for about 1.1 million code points, so there are roughly a million empty slots that we haven't used yet and could use in the future.

Now, every symbol in Unicode has an associated code point. We're going to talk a lot about code points here. A code point is the symbol's index, its ID. They increase linearly, so you have 62, 63, 64, 65, and each one corresponds to a certain symbol. I can tell you that uppercase A has an ID of 65, that's its code point, and lowercase a is 97. I haven't memorized all of them; these are just some important ones.

Now, computers use binary. We use the decimal system, some people use hexadecimal, which is usually written with a leading 0x just for display, and binary is sometimes written with a leading 0b, but usually you just have the digits themselves. The computer deals with the column furthest to the right, the binary one, which is just a bunch of ones and zeros. You can represent the number 62 as one one one one one zero, and that's how the computer handles these code points.

So let's say you're texting a friend, and the friend makes some very funny, self-deprecating joke, and you reply "Ha" and a poop emoji: H, a, space, poop emoji. Each of those has a code point: 72, 97, 32, and then the poop emoji, which was added much more recently, so its ID, 128,169, is a lot bigger than the others. The others were added much earlier, so they have much smaller code points.
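To make the idea of a code point concrete, here is a minimal sketch in Python (not the JavaScript console used in the video). It only relies on the built-in `ord`, `hex`, and `bin`; the variable names are my own.

```python
# Look up the code point (the numeric ID) of each symbol in the message.
# "\U0001F4A9" is the poop emoji, written as an escape so the source stays ASCII.
message = "Ha \U0001F4A9"

for symbol in message:
    code_point = ord(symbol)  # decimal ID of the symbol
    # Show the same number in decimal, hexadecimal (0x...) and binary (0b...).
    print(ascii(symbol), code_point, hex(code_point), bin(code_point))

code_points = [ord(symbol) for symbol in message]
```

The three early characters come out as small numbers, while the emoji lands far higher.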

Uppercase H is different from lowercase h, and uppercase letters come before lowercase letters, so that's 72 and 97. Space comes before all of them, so it's code point 32, and then the poop emoji. So how do we store this text message in binary?

Well, if we only had binary and nothing to separate the pieces, we could store it as one blob, something like this. I've color coded it so that H is orange, a is pink, space is green, and so on. This is probably the most efficient way you could pack it, if you knew exactly where each number started and ended. If you could color code them in real life, this would be the best way to store them. But a computer can't color code anything. It just sees a bunch of ones and zeros. Where do you start and where do you stop? You can't look at this message and say, okay, that's a letter, that's a letter, that's a letter, because they're not all the same size, and if you start and stop at different places you get completely different messages. Cut it up one way and we get a dollar sign, zero, zero, a question mark, and whatever that symbol is. Cut it up another way and we get a different set of symbols, ending in a copyright sign. So how do we know where to start and stop?

The way we solve this is with a fixed chunk size. That means every letter gets a very specific number of bits. These ones and zeros are called bits, by the way; we don't call them integers or digits, just bits. So if we say every letter will always have, I don't know, 50 bits, then we know that after 50 bits we move on to the next letter, again and again and again. And the Unicode standard allows for code points up to 21 bits.

Just to show you what that means very quickly: let's take a one and repeat it 21 times, and then parseInt that with a radix of two. That's the largest number that fits in 21 bits, about 2 million. That's not the number I was after, because the largest code point is mostly zeros after the leading one. So let's do a one followed by a zero repeated 20 times. There we go: about a million, which is roughly the number I was looking for. So if the largest code point is about 21 bits, why not just use chunks of 21 bits?
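The console experiment can be reproduced roughly like this. This is a Python sketch rather than the JavaScript from the video, and the variable names are mine; `0x10FFFF` is the highest code point Unicode defines.

```python
# 21 one-bits: the largest number that fits in 21 bits at all.
all_ones = int("1" * 21, 2)

# A one followed by 20 zeros: the smallest number that needs 21 bits.
one_then_zeros = int("1" + "0" * 20, 2)

# The actual upper limit of Unicode sits between the two.
largest_code_point = 0x10FFFF

print(all_ones)            # 2097151
print(one_then_zeros)      # 1048576
print(largest_code_point)  # 1114111
print(largest_code_point.bit_length())  # 21
```

So 21 bits is the smallest chunk size that can hold every code point.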

Enter UCS-4, or UTF-32. The reason we use 32 and not 21, which is a bit larger, is that computers are generally faster with powers of two, so if we're using fairly large chunks anyway, we might as well use 32 bits. UCS-4 is the 4-byte Unicode standard: 4 bytes, each byte is eight bits, four times eight is 32. That's why it's called UCS-4, also known as UTF-32.

So let's take the exact same message and encode it using 32 bits each. The H, the a, the space, and the poop emoji are all 32 bits now. Everything is spaced evenly, so it's easy for computers to read, but we're wasting a ton of space. If we take all these padding zeros and make them red, look how much space we're wasting: more than the space taken by the characters we're actually encoding. This is absurd, and 21 bits wouldn't really be much better. If a normal poem comes out absolutely huge, we need to save space; we can't encode text like this. It may be easy to read, but it's too big. So if we want to encode text efficiently, we need a chunk size smaller than 32, smaller even than 21.
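Here is a small sketch of that waste, using Python's built-in `utf-32-be` codec (big-endian, no byte order mark). The bookkeeping variables are my own, just for illustration.

```python
message = "Ha \U0001F4A9"  # H, a, space, poop emoji

# Every character becomes exactly 4 bytes (32 bits).
encoded = message.encode("utf-32-be")

# Split the bytes into 4-byte chunks and show each one as 32 binary digits.
chunks = [encoded[i:i + 4] for i in range(0, len(encoded), 4)]
for chunk in chunks:
    print(format(int.from_bytes(chunk, "big"), "032b"))

# Bits that actually carry information vs. bits spent on leading zeros.
used_bits = sum(ord(symbol).bit_length() for symbol in message)
total_bits = len(encoded) * 8
padding_bits = total_bits - used_bits
print(used_bits, padding_bits, total_bits)
```

For this message the padding outweighs the real data by more than two to one.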

Let's go to 16 bits by introducing UTF-16 and UCS-2. These are not the same thing, unlike last time, so let's talk about UCS-2 first. Before the year 2000, only the first 65,000 or so code points were in use, because of course we've added more over time. So UCS-2 says: hey, two bytes, let's use two bytes, just 16 bits for each code point. If you only go up to about 65,000, you only need 16 bits, so that was perfectly fine; everything used 16 bits. But notice the poop emoji, which is quite far past the first 65,000. With a code point around 128,000, it doesn't fit into 16 bits. It didn't exist back then, but it does now.

Nowadays we can't just use UCS-2, but UTF-16 lets us keep working in 16-bit units. Again, UTF-16 is not exactly the same as UCS-2, although they both try to use 16 bits. To handle the bigger code points, UTF-16 came up with a plan called surrogate pairs. Very fancy name, but it's not extremely difficult, and I'll go over it in a second.

UTF-16 said, let's do it this way. Imagine this big purple block contains the first 65,000 code points. Inside it, this orange stripe right here is the surrogates, which are roughly the numbers 55,000 to 57,000. Again, this is just a collection of numbers; when we say code points we really mean numbers. This block is the first 65,000 numbers, and these specific numbers we'll call surrogates. Now we can divide these roughly 2,000 code points into two groups of about a thousand each, called high surrogates and low surrogates: high surrogates because they come first in a pair, low surrogates because they come second, I guess. So again, two different ranges of numbers, roughly 55,000 to 56,000 and then 56,000 to 57,000. Why do we care so much?

These are reserved for UTF-16. In fact you can't use these code points for anything else; they don't represent any real letters. What we do is, when we find something that needs more than 16 bits, we represent it with a surrogate pair. We say, well, we can't fit this 17-bit number into one 16-bit unit, but we can represent it with two separate 16-bit values, both specifically within the surrogate range, one high and one low.

Let's see how we do that with the poop emoji. First we get its code point, which is 128,169, or 0x1F4A9 in hexadecimal. Then we subtract 65,536, the number of code points that already fit in 16 bits, to find how much is left over. That gives us 62,633.

Alright, nothing too crazy so far. Let's move on to binary. We take this number, exactly the same value but written in binary and padded out to 20 bits, and we split it into a left half and a right half of ten bits each, which gives us 61 and 169. Then we add each of those to the start of its surrogate range. Again, we have high and low surrogates, so we take the first chunk and add it to the start of the high surrogate range, then we take the second chunk and add it to the start of the low surrogate range, and that pair is our representation. We can reverse this whole process to get back the exact same code point we started with.
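The steps above can be sketched with bitwise operators. This is an illustrative Python version, not the code from the video; the function and constant names are hypothetical, while the range starts `0xD800` and `0xDC00` and the offset `0x10000` are the real UTF-16 values.

```python
HIGH_START = 0xD800  # 55,296: first high surrogate
LOW_START = 0xDC00   # 56,320: first low surrogate


def to_surrogate_pair(code_point):
    """Split a code point above 0xFFFF into a (high, low) surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above 0xFFFF need a surrogate pair")
    offset = code_point - 0x10000              # what is left after the first 65,536
    high = HIGH_START + (offset >> 10)         # top ten bits of the offset
    low = LOW_START + (offset & 0x3FF)         # bottom ten bits of the offset
    return high, low


def from_surrogate_pair(high, low):
    """Reverse the process and recover the original code point."""
    return 0x10000 + ((high - HIGH_START) << 10) + (low - LOW_START)


pair = to_surrogate_pair(128169)
print(pair)                        # (55357, 56489)
print(from_surrogate_pair(*pair))  # 128169
```

The shift by ten and the `0x3FF` mask are just the fast way of taking the left and right ten-bit halves.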

It's a bit of a laborious process, but it goes a lot faster if we use certain programming concepts, like bitwise operators, and you get the high surrogate and low surrogate numbers very quickly.

I want to stop here and make a point I didn't really cover in the diagram, which is why we have separate high surrogates and low surrogates. If you think about it, there's no obvious reason we need two separate ranges. Why not set aside just one range of about a thousand and add both halves to that? Why does the second half get added to a separate low surrogate range?

The reason shows up if we ever land in the middle of a text file, and this matters for basically every Unicode encoding you're going to see. You'll look at these encodings and think, oh, I could optimize that, why doesn't it work like this instead? There's a very good reason. Say you're given a massive text file and you're dropped onto one unit in the middle of it, so a certain 16-bit integer if you're using UTF-16, or a certain 32-bit integer if you're using UTF-32. You should be able to tell whether you're currently in the middle of a character.

With UTF-32 or UCS-2, you will never be in the middle of a character: if you're at the beginning of a chunk, you're at the beginning of a character. But with surrogate pairs, if we start, let's say, right here, we could be in the middle of a character. So we look at this unit and say, wait a second, this is a low surrogate, we're in the middle of a character. If instead we see that we're on a high surrogate, we know we're at the beginning of a character. That's the power of using different ranges for high and low surrogates. If both halves used the same range, we couldn't tell whether we were on the first half or the second half without looking at the units around it. With separate ranges, if you're on a high surrogate, you know the next unit will be a low surrogate, and you can combine the current one and the next one to form a character. It's a pretty good system; I think it's quite interesting.
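Here is a minimal sketch of that "where am I?" check in Python. The helper names are hypothetical, and it assumes the input is a list of well-formed UTF-16 code units (plain integers).

```python
def classify_unit(unit):
    """Say what kind of 16-bit unit this is, based only on its value."""
    if 0xD800 <= unit <= 0xDBFF:
        return "high"  # first half of a pair: start of a character
    if 0xDC00 <= unit <= 0xDFFF:
        return "low"   # second half of a pair: middle of a character
    return "bmp"       # an ordinary one-unit character


def character_start(units, index):
    """Return the index where the character containing units[index] begins."""
    if classify_unit(units[index]) == "low" and index > 0:
        return index - 1  # step back onto the high surrogate
    return index


# "Ha " followed by the poop emoji, as UTF-16 code units.
units = [72, 97, 32, 0xD83D, 0xDCA9]
print([classify_unit(unit) for unit in units])
print(character_start(units, 4))  # landed on the low surrogate, so back up to 3
```

Because the two ranges never overlap, a single comparison is enough; no scanning back to the start of the file is needed.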

JavaScript actually uses this for its strings, so a lot of the time it's quite tricky to index into something. I'll use my console here. Let's take this poop emoji: if we try to index into it, we get two strange symbols. Those are actually the high and low surrogates. If I call codePointAt(0) on the first one and then the same thing on the second one, codePointAt(0), we get two separate numbers, the high and low surrogates. Remember I said they're around 55,000 to 57,000? Look, exactly that. Beautiful. This is the representation JavaScript uses, UTF-16, for this poop emoji. But of course for characters below 65,000 you don't need any of that, so if you index them, you just get the character back.
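Python strings index by code point rather than by UTF-16 unit, so to mimic what the JavaScript console shows, this sketch encodes the text as UTF-16 first and then unpacks the 16-bit units. The helper name `utf16_code_units` is my own.

```python
import struct


def utf16_code_units(text):
    """Return the 16-bit units a UTF-16 based language would see for this text."""
    raw = text.encode("utf-16-le")  # little-endian, no byte order mark
    return list(struct.unpack("<%dH" % (len(raw) // 2), raw))


emoji_units = utf16_code_units("\U0001F4A9")
letter_units = utf16_code_units("H")

print(emoji_units)       # [55357, 56489]: a high and a low surrogate
print(len(emoji_units))  # 2, which is what a UTF-16 length count reports
print(letter_units)      # [72]: one unit, nothing special
```

This is also why a length measured in UTF-16 units says 2 for a single emoji.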

No weird symbols, no question marks. But again, UTF-16 is still a bit inefficient. JavaScript uses it for whatever reason. If you're writing something like an article, or poems, or normal text, you'll usually be using characters within what's called the BMP, the Basic Multilingual Plane, the first 65,000 or so characters, so you could probably get away with UCS-2 as well. But UCS-2 has no idea how to represent characters past 65,000.

So how do we build a system that allows very large code points but conserves space, one that's optimized for the smaller characters, because that's generally what you're dealing with? If you're programming, most of your code is probably going to be well below that. Look at these three symbols here: they don't need all 16 bits. Look at all those red zeros. For things like the Latin alphabet in particular, we can do better, and that's UTF-8. This is a really cool one. What if we were limited to just eight bits?

What would happen if your boss said, hey, sorry, new restrictions on the project, you can only use 8 bits per chunk? What do you do? Take exactly the same message. Here's the odd part: the poop emoji is 17 bits long, so how do we represent that in eight bits? We can fit the first few characters, sure. But say we try the surrogate pair approach and reserve a certain range of values. What range would you reserve? We only have eight bits, which is 256 values. What range could we possibly set aside?

So what do we do? UTF-8 works a little differently. This time we don't reserve ranges of characters; we use a system based on patterns, a series of rules and cases. Let's go through them, and later I'll make the system a bit simpler.

To start, take characters with a code point of seven bits or fewer, so less than eight bits. Again, UTF-8 works in eight-bit chunks, but say the character we're dealing with has seven bits or fewer, which covers most English text: H, a, b, c, even digits and spaces. If you have seven bits or fewer, add zeros at the beginning until it's a full byte, eight bits, and go about your day. So with the question mark, which is 63, we just add two zeros to the start, padding it until it's a byte, and that's done. Lowercase g, which is 103, is a slightly bigger number, but it still fits in seven bits. Anything less than 128, which is where seven bits runs out, we can handle just by adding zeros to the beginning. That's simple.

What do we do when we have numbers bigger than that? Well, it gets a little more complicated. Say we have a number that is 128 or greater but less than 2,048. We pad it with leading zeros until it's 11 bits, just like we padded g, so keep adding zeros until it's 11 bits.

This case works on exactly 11 bits, so make sure you have 11 bits first by adding zeros to the beginning. Once you've done that, take the first five bits and put one, one, zero in front of them. Then take the next six bits and put one, zero in front of those. This looks a bit mysterious, and there doesn't seem to be any rhyme or reason to it yet, but you'll see the pattern later.

Let's take this symbol here, the pound sign, which is 163. In binary that's one zero one zero zero zero one one. We add zeros until it has eleven bits, take the first five, one two three four five, and put one one zero in front, then take the next six and put one zero in front, exactly the steps I just described. It doesn't really seem to make sense yet why we're doing this, so let's continue with what happens from 2,048 up to, but not including, 65,536.
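The first two cases can be followed literally, as string padding and slicing. This is a teaching sketch in Python with hypothetical function names; it works on strings of binary digits so the steps match the description, not because that is how a real encoder would do it.

```python
def encode_one_byte(code_point):
    """Case 1: seven bits or fewer. Pad with leading zeros to eight bits."""
    if code_point >= 0x80:
        raise ValueError("needs more than seven bits")
    return format(code_point, "08b")


def encode_two_bytes(code_point):
    """Case 2: 128 up to 2,047. Pad to 11 bits, then split 5 + 6."""
    if not 0x80 <= code_point < 0x800:
        raise ValueError("not in the two-byte range")
    bits = format(code_point, "011b")          # pad to exactly 11 bits
    return "110" + bits[:5], "10" + bits[5:]   # prefix each piece


print(encode_one_byte(ord("?")))  # 00111111
print(encode_one_byte(ord("g")))  # 01100111
print(encode_two_bytes(163))      # ('11000010', '10100011')
```

Those two bytes are `0xC2 0xA3`, which matches what Python's own `utf-8` codec produces for the pound sign.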

These ranges look a bit strange, but they're very specific, and you'll see why shortly. They're powers of two: 128, 2,048, and 65,536 are the points where 7, 11, and 16 bits run out.

In our third case, we pad with zeros until we have 16 bits, a bit more than before. Then we take the first four bits. Before we said the first five bits; now it's the first four, and we put one one one zero in front. So we took one fewer bit than before but added another one to the prefix; just keep that in mind. For the second and third bytes it's the same thing as before: one zero and then the next six bits.

Let's take an example with our arrow, which is number 65,515. It's a little smaller than the limit here. In binary it's mostly ones, with 0 1 0 1 1 at the end, and we don't need to pad it, because it's already 16 bits, which is perfect, but usually you'd pad it if it weren't. We take the first four bits and put one one one zero in front, then take the next six and put one zero in front, and then the next six and put one zero in front. Now you might be seeing a bit more of a pattern.

Let's go to the last step, case four: 65,536 and up. I didn't put an upper limit here, but this technically covers everything up to the 21-bit range. So again, pad with zeros until you have 21 bits. Then take the first three bits. We're going down again: we said the first five bits, then four, now three. The first three bits get one one one one zero in front. Again, we add another one to the prefix but take one fewer bit, because UTF-8 works in bytes: each chunk is eight bits long, so each one has to add up to eight bits. Then for the second, third, and fourth bytes, it's one zero and then the next six bits.

So finally, let's use the poop emoji as our example of how the system works. It's 17 bits, not quite 21, so we add four zeros in front. Then we put one one one one zero in front of the first three bits, and in this case the first three bits are just zeros from the padding we added. For the second, third, and fourth bytes it's exactly the same system: one zero, then six bits.

So in conclusion, how should you think about UTF-8? If you have seven bits or fewer, pad with leading zeros until you have eight bits.
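All four cases can be written compactly with bitwise operators. This is my own minimal Python sketch, not the encoder from the article; the function name is hypothetical, and it deliberately skips one rule of real UTF-8 (rejecting the surrogate range) to stay short.

```python
def utf8_encode_code_point(cp):
    """Encode one code point into UTF-8 bytes using the four cases."""
    if cp < 0x80:        # case 1: up to 7 bits -> 0xxxxxxx
        return bytes([cp])
    if cp < 0x800:       # case 2: up to 11 bits -> 110xxxxx 10xxxxxx
        return bytes([0b11000000 | (cp >> 6),
                      0b10000000 | (cp & 0b111111)])
    if cp < 0x10000:     # case 3: up to 16 bits -> 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0b11100000 | (cp >> 12),
                      0b10000000 | ((cp >> 6) & 0b111111),
                      0b10000000 | (cp & 0b111111)])
    if cp <= 0x10FFFF:   # case 4: up to 21 bits -> 11110xxx and three 10xxxxxx
        return bytes([0b11110000 | (cp >> 18),
                      0b10000000 | ((cp >> 12) & 0b111111),
                      0b10000000 | ((cp >> 6) & 0b111111),
                      0b10000000 | (cp & 0b111111)])
    raise ValueError("code point is outside the Unicode range")


# The examples from the walkthrough: ?, g, pound sign, arrow, poop emoji.
for cp in (63, 103, 163, 65515, 128169):
    print(cp, utf8_encode_code_point(cp).hex())
```

Shifting right drops the bits that belong to later bytes, and `& 0b111111` keeps just the six bits for the current one, so no explicit zero padding is needed.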

That's the simple case. If it's more than seven bits, the first byte starts with as many ones as the number of bytes you need, followed by a zero. So in the four-byte case, where we had one, two, three, four bytes, we had four leading ones, which means this is the start of a four-byte chunk. Then you add a zero, then you fill the rest of that byte with as many bits as fit. After that you continue with the next six bits, then six, then six, and each of those gets one zero in front, consistently. If the number doesn't fill the slots evenly, you add zeros to the beginning until it does. And that's basically all there is to it.

If you think about this for a second, you might say, hey, we could keep going. Yes, we stop at 21 bits for now, because that's all Unicode allows. But say the standard gets extended in the future: could we keep using UTF-8? Could it continue? And yes, you'd be right, the pattern can keep adding more ones, as long as there's still a zero after them. A whole byte could even be all ones: if you wanted to express some massive number, you could say the first byte is eight ones and the next byte starts with a few more ones and then a zero, so the prefix alone takes up about two bytes. So the pattern could potentially be stretched to allow these massive code points. Not now, but it could.

That's essentially the difference between UTF-32, UTF-16, UTF-8, UCS-4, and UCS-2 in general. Our message now looks like this. It might look like a lot, but UTF-8 generally ends up being more efficient than the other methods because of the kind of text it's usually given.
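The "count the leading ones" rule is also how a reader finds character boundaries. Here is a rough Python sketch with hypothetical helper names; it assumes the input is valid UTF-8 and starts at the beginning of a character.

```python
def leading_ones(byte):
    """Count how many one-bits a byte starts with."""
    count = 0
    mask = 0b10000000
    while mask and byte & mask:
        count += 1
        mask >>= 1
    return count


def split_characters(data):
    """Cut UTF-8 bytes into one chunk per character."""
    chunks = []
    index = 0
    while index < len(data):
        ones = leading_ones(data[index])
        length = 1 if ones == 0 else ones  # 0xxxxxxx is a one-byte character
        chunks.append(data[index:index + length])
        index += length
    return chunks


message = "Ha \U0001F4A9".encode("utf-8")
print(len(message))                                      # 7 bytes in total
print([chunk.hex() for chunk in split_characters(message)])
```

A byte with exactly one leading one (`10xxxxxx`) is a continuation byte, which plays the same "you are in the middle of a character" role that low surrogates play in UTF-16.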

I wrote an article going over a couple of these comparisons. I took Moby Dick by Herman Melville, and in UTF-16 it was 2.39 megabytes versus 1.19 megabytes in UTF-8. UTF-8 was about half the size, because UTF-16 wastes a lot of space if all you're dealing with is English letters, commas, exclamation points, and numbers, which all fit in seven bits. So for that kind of text, UTF-8 will generally be about half the size of UTF-16.

When we have something in a completely different language, though, like Simplified Chinese, UTF-16 might do a bit better. I picked Dan Abramov's article "My Decade in Review", which has a Simplified Chinese translation. I took that full article and saved it as UTF-16 and as UTF-8, and UTF-8 is a bit bigger, 36 kilobytes versus 28.8 kilobytes. Or kilobits? Sorry, is that actually kilobytes? I don't actually remember; I probably wrote down bytes. Either way, we can see that depending on the text, UTF-16 can be more efficient than the other formats, or less.

I also made a little UTF-8 encoder just to show how you can use things like bitwise operators, which I talk about a bit more in the article. By the way, I made a small mistake in the picture at the start of the article: the surrogates aren't where I drew them. (They are inside the BMP, though, at code points 55,296 through 57,343.) Overall, it might be a good article to go over if you specifically want to understand how UTF-8 works.
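The same effect shows up even on a single sentence. This Python sketch uses two short sample strings of my own choosing, not the files measured in the article, and the helper name `sizes` is hypothetical.

```python
def sizes(text):
    """Byte counts for the same text in UTF-8 and UTF-16 (no byte order mark)."""
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),
    }


english = "Call me Ishmael."
# Five Chinese characters ("hello, world" with a fullwidth comma), as escapes.
chinese = "\u4f60\u597d\uff0c\u4e16\u754c"

print(sizes(english))  # {'utf-8': 16, 'utf-16': 32}
print(sizes(chinese))  # {'utf-8': 15, 'utf-16': 10}
```

English text lands in the one-byte case, so UTF-8 is half the size; these Chinese characters land in the three-byte case, so UTF-16 at two bytes each comes out ahead.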

I have another video where I talk a bit more about UTF-16, how interesting it is, how you can play with it, and how the length property doesn't line up perfectly with how much text there is. If JavaScript used UTF-8 for its strings, it would be much worse; you could never rely on the .length property. But it uses UTF-16, so length is mostly accurate, with some exceptions.

And that's the idea of Unicode and its encodings. It's not that hard once you get down to it. But you can't really trust the chunk size of any text you're handed. If you're a web server and you're receiving text, you can't trust it to be UTF-8, you can't trust it to be UTF-16, you can't trust it to be UTF-32. A lot of the time the headers of a document will tell you, oh, by the way, we're using UTF-16, or we're using UCS-2. But a lot of the time we have to guess and check: just look at it and say, hey, these are valid symbols, and if they're not, oops, I'm using the wrong encoding. It's really frustrating.
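Guess-and-check can be sketched as "try each decoder and keep the first one that doesn't complain". This is a deliberately naive Python illustration: the function name and the order of candidates are my assumptions, and a successful decode only means the bytes are valid in that encoding, not that the guess is right.

```python
def guess_encoding(data, candidates=("utf-8", "utf-16", "utf-32")):
    """Return the first candidate encoding that decodes without an error."""
    for name in candidates:
        try:
            data.decode(name)
        except UnicodeDecodeError:
            continue  # invalid in this encoding, try the next one
        return name
    return None       # nothing fit


text = "Ha \U0001F4A9"
print(guess_encoding(text.encode("utf-8")))   # utf-8
print(guess_encoding(text.encode("utf-16")))  # utf-16 (the byte order mark is invalid UTF-8)
```

Plenty of byte sequences are valid in more than one encoding, which is exactly why a header that states the encoding is so much better than guessing.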

It's something you have to guess if you haven't been told beforehand, but that's how these encodings work. I hope you learned something here. If you want any clarification, ask in the comments; I'll be more than happy to explain any little aspect of this whole system. I think I spent about a month digging into Unicode. It's something I didn't really know myself until very recently, so I hope you learned something. See you later.
