Code Pages, Character Encoding, Unicode, UTF-8 and the BOM - Computer Stuff They Didn't Teach You #2

May 29, 2021

Hello friends, I'm Scott Hanselman and these are things they didn't teach you. Some of the people in the last video commented that maybe they were actually taught these things. I don't care, that's the title. So maybe you learned these things and maybe this isn't the video for you, but it's a fun title and I like it. The idea is that these are things that maybe you forgot, or didn't know, or didn't have a class on. Maybe you did. There are a lot of people who are learning in boot camps, or learning to be programmers or IT people on the job, and sometimes you just don't pick these things up as you go through your life.

So in episode 1 we talked about carriage returns and line feeds and just touched a little bit on character encoding. I thought it would be interesting to talk a little bit more about character encoding right here. Every once in a while you find yourself on the Internet and you find a character like this, like a Chinese character or a non-Latin character, and then you go to your Notepad, or maybe you open a text file from somewhere, you paste it in and say, oh God, you know what's happening: it's all nonsense. Then you get frustrated. Or maybe you open a text file and it's all black squares and you don't know what's going on. Which brings us to the question of character encoding.

I talked a little bit about the ASCII character table in the last episode. I said the keyword ASCII, the American Standard Code for Information Interchange, and I said it all casually, as if everyone knew what that meant. Here's the deal. In the past we had only a few bytes and every byte mattered; in fact, every bit mattered. So someone came up with a way to put the first 128 characters you might need into seven bits. Not 8 bits, 7 bits. They made all of that fit into 7 bits, and 2 to the power of 7 is 128. If you add an extra bit, if you had all the space in the world, you'd get 256, but let's stick with 2 to the power of 7. It's an agreement that this byte means this and that byte means that. So they went around and said, you know, capital A will mean 65 in base 10, or 41 in hexadecimal. Let's take a look at this, the way I'm going to teach you.
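The arithmetic above is easy to check for yourself. Here is a minimal sketch in Python (the video uses C#, but as the speaker says, any language works); the variable names are illustrative only.

```python
# ASCII assigns each character a number that fits in 7 bits.
ascii_size = 2 ** 7          # 128 possible values, numbered 0 through 127
extended_size = 2 ** 8       # 256 values if all 8 bits of a byte are used

code_upper_a = ord("A")      # 65 in base 10
code_lower_a = ord("a")      # 97 in base 10

print(ascii_size, extended_size)
# Decimal, hexadecimal, and the 7-bit binary pattern for each letter.
print(code_upper_a, hex(code_upper_a), format(code_upper_a, "07b"))
print(code_lower_a, hex(code_lower_a), format(code_lower_a, "07b"))
```

Both letters fit in seven binary digits, which is the whole point of the 7-bit table.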

I'm going to do something stupid. I'm going to write a computer program. Let's create a directory and call it "write some bytes". It doesn't matter what language you use. I'm going to use C#, but you can use whatever makes you happy. So I'll say dotnet new console; let's create a slightly stupid console application. I want to make this point very, very clear. I'll open Visual Studio Code with "code ." and go to our program right here. We have Hello World; we're going to get rid of that. We're going to need some bytes, so we'll call them bytes and make 128 of them. Now I have an array of 128 bytes. Then we create a for loop from 0 up to, but not including, 128. Inside it we take i and put it in the array as a byte, so we end up with a bunch of bytes from 0 to 127. Then we need some IO, and we'll say File.WriteAllBytes, call the file 128bytes.txt, and give it our bytes. Boom, pow, that's cool.
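For readers following along without .NET, the same experiment can be sketched in Python. This is not the program from the video, just an equivalent under stated assumptions: the file name and the use of a temporary directory are my own choices.

```python
import tempfile
from pathlib import Path

# Build the 128 byte values 0..127, one per ASCII code.
data = bytes(range(128))

# The file name is illustrative; a temporary directory keeps the sketch tidy.
out_path = Path(tempfile.mkdtemp()) / "128bytes.txt"
out_path.write_bytes(data)

print(out_path.stat().st_size)   # size on disk in bytes
print(data[0x41:0x44])           # the slice holding the codes for A, B, C
```

Writing raw bytes (rather than text) matters here: no encoding is applied, so the file contains exactly the values 0 through 127.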

That is our application, very simple. Now I'm going out to the terminal to run it, and we're going to look at those bytes. Look at the left side, because if I did it right, boom: 128 bytes. I'm going to open it and the editor says it's using an unsupported text encoding. What's going on? Why? Because the first byte is 0 and the second byte is 1, and so on. If we go back to the table here on the right, zero is NUL, which is not interesting, but let's open it anyway. OK, it's a bunch of nonsense until we get to recognizable characters. In fact, if I delete all this stuff, press save, then right-click and look at it in a hex dump, we can see that the interesting bits start around hex 21. And as we said, there's hex 41 for uppercase A and hex 61 for lowercase a. What would happen if we made 256 bytes? Look, 256 is bigger than a byte can hold.

We're going to be too big, so we're going to change the loop variable back to an int, and then when we store it we cast it to a byte. So we're going to go from 0 to 255. I'm going to put it in here and say dotnet run. OK, the file got bigger, but it's all on one line. What's all this crap? Look at all this stuff here. What's going on? Let's look at it in a hex dump. I can see that we went from 00 to FF, but does the text on the right really reflect reality?
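Two things happen in this step: the value 256 does not fit in a byte, and the hex dump shows bytes next to a best-effort text column. The sketch below illustrates both in Python. The hex_dump helper is a hypothetical stand-in for the editor's hex view, not the tool used in the video.

```python
# A byte holds 0..255, so 256 itself does not fit.
try:
    bytes([256])
    overflow_rejected = False
except ValueError:
    overflow_rejected = True

# Counting with an ordinary int and converting each value works fine.
all_bytes = bytes(i for i in range(256))

def hex_dump(data, width=16):
    """Return hex-dump style lines: offset, hex values, printable ASCII."""
    lines = []
    for offset in range(0, len(data), width):
        chunk = data[offset:offset + width]
        hex_part = " ".join(f"{b:02X}" for b in chunk)
        # Only 0x20..0x7E are printable ASCII; everything else becomes a dot.
        text_part = "".join(chr(b) if 0x20 <= b < 0x7F else "." for b in chunk)
        lines.append(f"{offset:04X}  {hex_part.ljust(width * 3)} {text_part}")
    return lines

print(overflow_rejected)
for line in hex_dump(all_bytes[:0x50]):
    print(line)
```

The helper prints a dot for anything outside printable ASCII, which sidesteps the question the transcript raises next: what a byte above 127 should look like depends on who is doing the looking.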

Who decides what this number means? It depends. That's called character encoding. There are many different character encodings. All this time, when I said ASCII, that's just a 7-bit character encoding, which means it only defines values up to 127. Beyond that there are many code pages. The Windows code page, for whatever reason, is called code page 1252; it's the one for graphical applications in Windows, and code page 437 is for console applications. There are many other ASCII-based code pages that are identical until you go beyond 127. So, for example, one code page might say a given byte is a euro character, another might say it's a non-breaking space, and another might say it's an a with an accent on top. It all depends. These cool DOS-looking console drawings will only show up like that if you apply the correct code page. You have to have a font that supports it and you have to know what the code page is. So another way to think about this is that the string you have doesn't mean anything unless it has a code page associated with it.
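To make "it depends" concrete, here is a small Python sketch that decodes the same single bytes under three code pages. The byte values were picked by me to match the kinds of examples mentioned above (a euro sign, a non-breaking space, an accented a); the transcript does not name specific bytes.

```python
# One byte, several readings, depending on the code page applied.
samples = [0x41, 0x80, 0xA0, 0xE9]
code_pages = ["cp1252", "cp437", "latin-1"]

def decode_byte(value, code_page):
    """Decode a single byte under the given code page."""
    return bytes([value]).decode(code_page, errors="replace")

for value in samples:
    readings = {cp: decode_byte(value, cp) for cp in code_pages}
    # repr() makes invisible characters such as the non-breaking space visible.
    print(f"0x{value:02X}", {cp: repr(ch) for cp, ch in readings.items()})
```

Byte 0x41 is "A" everywhere because all three agree with ASCII below 128. Byte 0xA0 is a non-breaking space in code page 1252 but an accented a in code page 437, which is exactly the disagreement described in the talk.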

Now, if we take one of these files and open it in Notepad, what is that? It looks like shit. What happened here is that Notepad guessed. Notepad guessed and said: I think this is UTF-16. We'll talk about that in a second. And it was wrong. Let's open it in a different Notepad-style editor. It also guessed, and if I double-click down here it says ANSI, which is actually code page 1252. The one we saw for the console is called OEM 437. Who knows why they gave them those numbers; it's nonsense. Here's the deal, though: these are all different views on how to present these bytes. Another common one is ISO 8859-1. If I click on it, the editor says "wait a second": if I change it, things could go wrong. If it finds any character it doesn't recognize, it will convert it to something else, a default character. In this case nothing happened, which is good. But what if I change it to something like Unicode and we take that Chinese character again?

I will just take the character for mother. I'll throw it in here, along with a couple of other things, and then what we're going to do is change this to ANSI, or in this case to the more basic Windows 8-bit encoding. It warns you: hey, it all went wrong. What if I just make a new file and set it to ANSI? Look, I can't even paste it in there. What happens if I create a new file, click down here where it says ANSI, and say Unicode UTF-8? I'm going to paste the character for mother, hit save, put it on my desktop, and go to the command line. Look at that: in this case I have three bytes, it doesn't look Chinese, and it's wrong. But what could we do to make sure people get it right?

What happens if we keep a signature in front of it? A byte order mark. We'll save it again. I want to point something out: this is a nine-byte file right now. I hit save, and now it's a six-byte file. If I change it to plain Unicode and save it again, it's a three-byte file. Change it back to Unicode with signature, and it's back to six bytes. It's still wrong in the DOS box, because that's how things are going to work for a while, but there are three bytes in front of it that give me information that maybe I didn't have before. What if I change the code page in DOS, which says: display these characters using this code page? Oops. You could change the code page to 1252, which doesn't look good.
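The byte counts in this passage can be reproduced in a few lines of Python. A loud assumption: the transcript never shows which character "mother" is, so this sketch uses 妈 (U+5988) as a stand-in. Python's "utf-8-sig" codec is the one that writes the three-byte signature.

```python
import codecs

# Assumption: 妈 (U+5988) stands in for the "mother" character.
text = "妈"

plain = text.encode("utf-8")          # UTF-8 without a signature
signed = text.encode("utf-8-sig")     # UTF-8 with the 3-byte signature (BOM)

print(len(plain), plain.hex(" "))
print(len(signed), signed.hex(" "))
print(signed[:3] == codecs.BOM_UTF8)

# What a console set to code page 437 makes of the unsigned bytes.
mojibake = plain.decode("cp437")
print(mojibake)

# An 8-bit Windows code page has no slot for this character at all.
try:
    text.encode("cp1252")
    fits_in_cp1252 = True
except UnicodeEncodeError as err:
    fits_in_cp1252 = False
    print("cannot encode:", err.reason)
```

Three bytes without the signature, six with it, and three unrelated-looking characters when a console reads those bytes through code page 437: the data is fine, the interpretation is not.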

I could change it to 437, which is where we were at the beginning. This is the default code page. Or you could change it to Unicode, which works, but what is that first character? What is happening there? Remember when we saved that file, we said: save it with a signature. Let's open it and find out what's going on. Let's go to a hex dump. Those three bytes are called the BOM, the Unicode byte order mark. The idea was that if you had this magic string at the start, it would tell you what to expect. It says: expect things to look like this and the bytes to be in this order from now on. So the byte order mark comes first, and once I have that byte order mark in my text file, everything from that point on is supposed to be stored as encoded Unicode code points. A code point is a number that expresses a point on a map that could be any character that Unicode supports, and in UTF-8 each one takes between one and four bytes. In fact, Unicode has this lovely website where you can go and find all of these characters.
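The difference between a code point (the number on the Unicode map) and the bytes UTF-8 uses to store it can be seen directly. A minimal Python sketch, with sample characters chosen by me to cover each UTF-8 length:

```python
def utf8_length(ch):
    """Number of bytes UTF-8 needs for one character."""
    return len(ch.encode("utf-8"))

# Code point versus UTF-8 bytes on disk.
for ch in ["A", "é", "妈", "😀"]:
    encoded = ch.encode("utf-8")
    print(f"U+{ord(ch):04X}", utf8_length(ch), encoded.hex(" "))

# A file body that starts with the signature EF BB BF:
raw = b"\xef\xbb\xbfABC"
with_sig_codec = raw.decode("utf-8-sig")   # signature consumed
with_plain_codec = raw.decode("utf-8")     # signature kept as U+FEFF
print(repr(with_sig_codec), repr(with_plain_codec))
```

The last two lines show why a stray "first character" can appear: a reader that does not treat the signature specially hands it back as the invisible code point U+FEFF, and a reader using a legacy code page shows it as three odd-looking characters.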

If you have Windows, and you type Character Map, you get this wonderfully fabulous old application. Choose any font, I'll just choose a normal font like Arial, click on a character, and you can see the Unicode code point for that character. And this is interesting: it says Keystroke: Alt+0233. If I run Notepad and use the numeric keypad on my keyboard, I hold down the Alt key with my left hand and with my right hand I type 0 2 3 3, and I just typed that symbol. If I want to type the registered trademark, it's Alt+0174. Makes sense. When you get further down, you can't type them yourself, but if you're looking for a character that you can't type, you can grab it, select it, hit copy, and paste it in there. OK.
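Those Alt codes are code page lookups in disguise. The sketch below is a simplified model, not a description of how Windows implements the feature: it assumes a Western European system where codes typed with a leading zero are looked up in code page 1252 and codes typed without one are looked up in code page 437. The alt_code function is hypothetical.

```python
def alt_code(digits):
    """Map typed Alt-code digits to a character (simplified model).

    Assumes code page 1252 for a leading zero and code page 437 otherwise.
    """
    code_page = "cp1252" if digits.startswith("0") else "cp437"
    return bytes([int(digits)]).decode(code_page)

for typed in ["0233", "0174", "233"]:
    print("Alt+" + typed, repr(alt_code(typed)))
```

Under that model, Alt+0233 and Alt+233 give different characters, because the same number 233 is being read through two different code pages.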

If you don't pay attention to your encoding when you save, you're potentially going to lose information, because anything over 255, or whatever your encoding's limit is, won't survive. What you need to understand is that once you have this BOM, this byte order mark, it works immediately. I can go and put ASCII before and after the character, ABC on each side, and look at the file here: we can see that it was loaded correctly as UTF-8 with BOM. I can right-click on it and see the byte order mark, ABC, the Chinese character. This is interesting.

Look right there: my hex inspector actually points to the character in the text at the bottom, and then ABC again. Great. Without that byte order mark, things would go wrong. The last thing we'll do: what if I make some space and write 256 bytes with a BOM? This is not how you would normally do it, just to be clear. I'm going to hard-code the first three bytes. Byte 0 will be EF, and bytes 1 and 2 will be BB and BF. I'm hard-coding the BOM, then I'll write my other 256 characters, and then we'll run this program.

Up here, boom, is my byte order mark. Now let's open these in the editor. The 128-byte text file is just control characters and such. But the 256 bytes with BOM, watch right here where it says ANSI: when I open it, see how it says UTF-8 Signature. It recognized that we wrote that byte order mark and was smart enough to even give us characters for those upper bytes, the ones beyond 128. So everything from here down is fine. I realize there are a lot of different ways to express this information and maybe this wasn't the easiest for you.
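The hard-coded BOM experiment translates directly to Python. One caveat worth sketching, and this part is my addition rather than something shown in the video: the raw bytes 0x80 through 0xFF written one after another are not valid UTF-8 sequences, so a strict decoder rejects them even though the signature is present. An editor that still shows characters for them is being forgiving.

```python
import codecs

# Hard-code the signature EF BB BF, then append the 256 byte values.
payload = codecs.BOM_UTF8 + bytes(range(256))
starts_with_bom = payload.startswith(b"\xef\xbb\xbf")

# A strict UTF-8 decoder accepts the signature but stops at the first
# raw byte above 0x7F that is not part of a valid sequence.
try:
    payload.decode("utf-8-sig")
    strictly_valid = True
except UnicodeDecodeError as err:
    strictly_valid = False
    print("strict decode failed at payload offset", err.start)

# A lenient reader substitutes U+FFFD for what it cannot interpret.
lenient = payload.decode("utf-8-sig", errors="replace")
print(starts_with_bom, strictly_valid, len(payload))
```

So the signature tells a reader what encoding to expect, but it cannot make arbitrary bytes valid in that encoding. This is one more reason the speaker says it is not how you would normally build such a file.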

I'm doing the best I can, but I want people to have a general idea of character encoding: what it means, how it works, and why they need to know it. When you get a string and you don't know the encoding of the string, the best you can do is guess. If you have a byte order mark, then you have a lot more to go on. Not all bytes are created equal. If you have any more comments or questions, please put them in the comments below, and if you have an idea for a future video, please let me know and I will do my best to make one.
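That guessing process can be sketched as a small function. This is a deliberately naive illustration, not how Notepad or any real editor decides: guess_encoding is a hypothetical helper, it ignores UTF-32, and the fallback to code page 1252 is an assumption about a Western European Windows machine.

```python
import codecs

def guess_encoding(data):
    """Guess a text encoding: trust a BOM if present, otherwise try UTF-8."""
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    if data.startswith(codecs.BOM_UTF16_LE) or data.startswith(codecs.BOM_UTF16_BE):
        return "utf-16"
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        # Assumed fallback; any 8-bit code page would "work" here,
        # which is exactly why this is only a guess.
        return "cp1252"

for sample in ["妈".encode("utf-8-sig"), b"ABC", b"caf\xe9", "hi".encode("utf-16")]:
    print(sample, "->", guess_encoding(sample))
```

Only the BOM branches are based on evidence in the file. Everything after them is a heuristic, and the final fallback can never be proven right from the bytes alone.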

Thank you very much and subscribe.
