Code Pages, Character Encoding, Unicode, UTF-8 and the BOM - Computer Stuff They Didn't Teach You #2
May 29, 2021
Hi guys, I'm Scott Hanselman, and these are things you weren't taught. Some of the people on the last video said in the comments that maybe you were taught these things. I don't care, that's the title of the thing. So maybe you did learn these things and this might not be the video for you, but it's a fun title and I like it. The idea is things that maybe you forgot, or didn't know, or didn't have a class on. Maybe you just didn't pick it up. There are many people learning from boot camps, learning to be a programmer or an IT person, and sometimes you just don't learn these things as you go through life.
In episode 1 we talked about carriage return and line feed and just touched on a bit of this, and I thought it would be interesting to talk a little bit about encoding right here. Every once in a while you go out on the internet and find a character like this, a Chinese character or some other non-Latin character, and then you go into Notepad, or maybe you've opened a text file from somewhere, and you paste it in, and you know what happens: it's all nonsense, and you get frustrated. Or you open a text file and it's all black squares and you don't know what's going on. That brings us to the question of character encoding.
In our last episode we talked a little bit about the ASCII character table. I said the word ASCII, the American Standard Code for Information Interchange, and I said it all casually, like anyone knew what that meant. So here's the deal. Back in the day we only had a few bytes and every byte mattered; in fact every bit mattered. So someone came up with a way to put the first 128 characters you might need into seven bits. Not 8 bits, 7 bits. If you added one more bit, if you had all the space in the world, you would get 256, but let's talk about 2 to the 7th. It's a deal: this byte means this and that byte means that.
So they turned around and said, you know, an uppercase A is going to mean 65 in base 10, or 41 in hex. Let's look at this. The way I'm going to teach you this is I'm going to do something dumb: I'm going to write a computer program. Let's make a directory and call it write some bytes. It doesn't matter what language I use. I'm going to use C#, but you can use whatever makes you happy. So I'm going to say dotnet new console; let's make a simple console app. I'll clear this up and open Visual Studio Code with code dot. We'll go to our program here, and we've got Hello World. We're going to get rid of that.
We're going to say, hey, we're going to need some bytes. We'll call them bytes and we'll make 128 of them, so we have a byte array of 128 bytes. Then we create a for loop from 0 up to 128, and we say bytes at i: we take i, which is an integer, convert it to a byte, and stick it in there. So we make a bunch of bytes from 0 to 127. Then we say File.WriteAllBytes, we'll call the file 128bytes.txt, and we give it our bytes. Boom, that's our application, very simple.
Now I'm going to open the terminal here and run it, and we'll go and see those bytes. Look at the left-hand side here, because if I got it right, boom: 128 bytes. I open it up and it says you're using an unsupported text encoding. What's happening? Why is that? Because the first byte is 0 and the second byte is 1, et cetera. If we go back to the table here, zero is NUL, which isn't interesting. But let's open it up anyway, shall we?
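The video writes this program in C#. Since the speaker says any language will do, here is a rough Python sketch of the same idea: write the byte values 0 through 127 to a file, then check that position hex 41 holds the value ASCII assigns to an uppercase A. The helper name `write_bytes` and the temp-directory path are illustrative assumptions, not taken from the video.

```python
import os
import tempfile

def write_bytes(path, count):
    """Write the byte values 0, 1, ..., count-1 to a file and return them."""
    data = bytes(range(count))  # bytes() only accepts values 0..255
    with open(path, "wb") as f:
        f.write(data)
    return data

# Illustrative location and file name (an assumption, not the video's exact path).
path = os.path.join(tempfile.gettempdir(), "128bytes.txt")
data = write_bytes(path, 128)

print(os.path.getsize(path))                   # 128
print(data[0x41], chr(data[0x41]))             # 65 A
print(data[0x61], chr(data[0x61]))             # 97 a
print(" ".join(f"{b:02x}" for b in data[:4]))  # 00 01 02 03
```

The file starts with NUL and other control characters, which is why an editor may complain about the encoding when it opens it.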
It's a bunch of garbage until we get to recognizable characters. In fact, if I delete all of this, hit save, then right-click and look at it in a hex dump, we can see the interesting bits start around 21, and as we said, there's 41 for uppercase A and 61 hexadecimal for lowercase a.
OK, now what if we make 256 bytes? 256 is bigger than a byte can hold, so what are you going to do? It goes too big. So we change this back to an int, and when we're done we cast it to a byte. We go from 0 to 255, I say dotnet run, and take a look. Well, it got bigger, but it's all on one line. What's all this? Look at all this stuff here. What's going on? Let's look at it in a hex dump. We can see we go from 00 to FF, but does the text on the right really reflect reality? Who decides what this number means? It depends, and that's called character encoding.
Now there are many different character encodings. All this time when you were saying ASCII, that's just a 7-bit character encoding, which means it only goes up to 127. There are a lot of code pages out there. The Windows code page, for whatever reason, is called code page 1252; it's the one for graphical applications in Windows. Code page 437 is the one for console applications. And there's a bunch of other code pages that are identical until you get beyond 127. So, for example, one code page might say a byte is a euro character, another might say it's a non-breaking space, and another that it's an a with an accent on top. It all depends. These cool DOS-looking console drawings will only show up like this if you apply the correct code page. So you have to have a font that supports it, and you have to know what the code page is. Another way of thinking about this is that the string you have means nothing unless you have a code page associated with it.
Now if we take one of these files and open it in Notepad, what is that? It looks like garbage. What happened here is Notepad took a guess and said, I think this is UTF-16 (we'll talk about that in a second), and it messed up. Let's open it in Notepad2, a different app from Notepad, which also had to take a guess. It said ANSI, and if I double-click that, ANSI in Notepad2 is in fact code page 1252, and the one we saw for the console is called OEM 437. Who knows why they gave them those numbers; it's silly. Here's the deal though: these are all different views on how you can present these bytes. Another common one is ISO 8859-1. If I click on that, it says, wait a second, if you change this things could go wrong; if we find some character we don't recognize, we'll convert it to something else, to a default character. In this case nothing happened, which is a good thing.
But what if I change it to something like Unicode and take that Chinese character again? I paste it here a couple of times, and then what we're going to do is change this to ANSI, in this case the more basic Windows 8-bit encoding.
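To make the code page point concrete, here is a small Python sketch (not from the video) that decodes the same three byte values under code page 1252, code page 437, and ISO 8859-1. The byte values 0x80, 0xA0 and 0xE9 and the helper name `code_points` are choices made for this illustration; the codec names are Python's standard ones.

```python
# The same three byte values, interpreted under three different code pages.
raw = bytes([0x80, 0xA0, 0xE9])

def code_points(data, encoding):
    """Decode data with the given encoding and list the resulting code points."""
    return [f"U+{ord(ch):04X}" for ch in data.decode(encoding)]

for encoding in ("cp1252", "cp437", "latin-1"):
    print(encoding, code_points(raw, encoding))

# Below 128 the three agree, because they all match ASCII there.
ascii_part = bytes(range(0x20, 0x7F))
same = (ascii_part.decode("cp1252")
        == ascii_part.decode("cp437")
        == ascii_part.decode("latin-1"))
print(same)  # True
```

Under code page 1252 byte 0x80 comes out as the euro sign and 0xA0 as a non-breaking space, while under code page 437 that same 0xA0 is an a with an acute accent. The bytes never changed, only the table used to read them.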
I just made a new file set to ANSI, and look, I can't even paste it in there. What if I make a new file? I click here where it says ANSI and say Unicode (UTF-8), paste the character, and press save. I put it on my desktop, go to the command line, and look at that: in this case I have three bytes. It doesn't look Chinese, it's wrong. So what can we do to make sure people get it right? What if we put a signature in front of it, a byte order mark?
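The three bytes can be reproduced in a few lines of Python. This is only a sketch: the transcript doesn't say which Chinese character was used, so U+4E2D is a stand-in chosen for illustration.

```python
ch = "\u4e2d"  # hypothetical sample: a common Chinese character, U+4E2D

utf8 = ch.encode("utf-8")
print(len(utf8), " ".join(f"{b:02X}" for b in utf8))  # 3 E4 B8 AD

# An 8-bit code page has no slot for it, so encoding fails outright...
try:
    ch.encode("cp1252")
    fits = True
except UnicodeEncodeError:
    fits = False
print(fits)  # False

# ...and a console that reads the UTF-8 bytes as code page 437 shows three
# unrelated characters instead of one Chinese character.
misread = utf8.decode("cp437")
print(len(misread), [f"U+{ord(c):04X}" for c in misread])
```

That last step is roughly what the command line is doing when the output "doesn't look Chinese": the bytes are right, the code page used to display them is not.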
Let's save it again. I want to point something out. This is a nine-byte file right now. I hit save, and now it's a six-byte file. If I change it to Unicode and save it again, it's a three-byte file. Change it to Unicode with signature, and it's back to six bytes. It's still wrong in the DOS box, because that's how things are going to work for a while, but there are three bytes in front of it that give me information I maybe didn't have before. Well, what if I said I want to change the code page in DOS? What I'd be saying is: display this character using this code page. I could change the code page to 1252, which doesn't look good. I could change it to 437, which is where we were at the beginning; remember, that is the default code page. Or I could change it to Unicode, which makes it work. But what is that first character?
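The file sizes can be checked with a short Python sketch. It assumes a stand-in character (U+4E2D) and uses Python's `utf-8-sig` codec, which writes the 3-byte signature before the text. The helper `saved_size` is made up for this example, and the 9-byte case is just one way to arrive at nine bytes, since the transcript doesn't say exactly what the file contained at that point.

```python
import os
import tempfile

ch = "\u4e2d"  # hypothetical sample character (U+4E2D)

def saved_size(text, encoding):
    """Save text with the given encoding and return the file size in bytes."""
    fd, path = tempfile.mkstemp(suffix=".txt")
    os.close(fd)
    try:
        with open(path, "w", encoding=encoding, newline="") as f:
            f.write(text)
        return os.path.getsize(path)
    finally:
        os.remove(path)

print(saved_size(ch, "utf-8"))          # 3: just the character
print(saved_size(ch, "utf-8-sig"))      # 6: 3-byte signature + character
print(saved_size(ch * 2, "utf-8-sig"))  # 9: signature + two characters
```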
What is going on there? What is this here? Let's go to a hex dump. Those three bytes are called the BOM, the Unicode byte order mark. The idea is that if you have this magic string at the start, it tells you what to expect: expect things to look like this, and expect the bytes to be in this order from this point on. Once I have that byte order mark in my text file, everything from then on is supposed to be stored as Unicode code points. A code point is a magic number that points to a spot on a map, which could be any Unicode character, and in UTF-8 each one is encoded as one to four bytes. In fact, Unicode has this beautiful website where you can go and find all these characters. If you have Windows, type Character Map and you get this wonderfully fabulous old app where you can choose any font.
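For reference, Python's standard `codecs` module exposes the BOM byte sequences as constants, so the three bytes from the hex dump can be printed directly. The sketch below also shows how many bytes UTF-8 spends on a few sample code points; the samples themselves are arbitrary picks for illustration.

```python
import codecs

print(" ".join(f"{b:02X}" for b in codecs.BOM_UTF8))      # EF BB BF
# The BOM is the code point U+FEFF; in UTF-16 its byte order reveals endianness.
print(" ".join(f"{b:02X}" for b in codecs.BOM_UTF16_LE))  # FF FE
print(" ".join(f"{b:02X}" for b in codecs.BOM_UTF16_BE))  # FE FF
print("\ufeff".encode("utf-8") == codecs.BOM_UTF8)        # True

# UTF-8 spends a different number of bytes depending on the code point.
samples = ["\u0041", "\u00e9", "\u4e2d", "\U0001F600"]
for s in samples:
    print(f"U+{ord(s):04X}", len(s.encode("utf-8")))
```

The four samples take 1, 2, 3 and 4 bytes respectively. In UTF-8 the byte order never varies, so the mark there acts as a signature rather than as an indicator of order.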
I'll just choose a normal font like Arial and click on a character, and you can see the Unicode code point for that character. And this is interesting, it says keystroke: Alt+0233. If I run Notepad, bring it down here, and use the numpad on my keyboard, I hold Alt with my left hand and with my right hand I type 0 2 3 3, and I just typed that symbol. If I want to type the registered trademark, that's Alt+0174. Makes sense. When you go much further down, you can't type them yourself, but if you're looking for a character you can't type, you can grab it, select it, hit copy, and paste it there. That's fine, but again, if you're not paying attention to the encoding when you save, you might lose information, because anything over 255 doesn't fit in a single byte. That's your limit.
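As a rough check of the Alt codes, the Python sketch below decodes byte values 233 and 174 as code page 1252, on the assumption that the machine's ANSI code page is 1252 as in the video. It then shows the information loss from saving a character above 255 in a single-byte encoding. The helper `save_lossy` and the sample text are invented for this example.

```python
import unicodedata

# Alt+0233 and Alt+0174 enter Windows-1252 byte values 233 and 174.
for value in (233, 174):
    ch = bytes([value]).decode("cp1252")
    print(value, f"U+{ord(ch):04X}", unicodedata.name(ch))

def save_lossy(text, encoding):
    """Encode text, replacing anything the encoding cannot represent with '?'."""
    return text.encode(encoding, errors="replace")

text = "caf\u00e9 \u4e2d"  # the Chinese sample character is a hypothetical choice
print(save_lossy(text, "latin-1"))                        # b'caf\xe9 ?'
print(save_lossy(text, "utf-8").decode("utf-8") == text)  # True
```

The accented e survives the single-byte save because its code point is 233, but the Chinese character turns into a question mark and cannot be recovered afterwards.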
What you need to understand, though, is that once you have this BOM, this byte order mark, it works right out of the box. I can put ASCII before and after it: ABC, the character, ABC. I look at the file here and we can see it was successfully loaded as UTF-8 with BOM. I can right-click on it and see the byte order mark, ABC, the Chinese character (this is interesting: look, my hex inspector actually points to the string at the bottom there), then ABC again. That's great. Without that byte order mark things would go south.
The last thing we're going to do: I make a little bit of space and write 256 bytes with a BOM. That's not how you'd do this, just to be clear. I'm going to hard-code the first three bytes: byte 0 is EF, and byte 1 and byte 2 are BB and BF. I'm hard-coding the BOM. Then I write my other 256 characters and we run this. It appears here, boom, there's my byte order mark.
Now let's open these with Notepad2. The 128-byte text file got confused, the 256 one got confused; opened as ANSI it looks pretty decent, though remember the first 32 or so characters are a bit of garbage, they're just control characters. But 256 bytes with BOM: look right here where it said ANSI, and see how it now says UTF-8 Signature. It recognized that we wrote that byte order mark, and was smart enough to even give us the characters for those top-level bytes, the bytes from 128 up, so everything from here down is fine.
I realize there are many different ways to express this information, and maybe this wasn't the easiest for you.
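What an editor does with the signature can be sketched roughly like this in Python: if the data starts with the UTF-8 BOM, trust it; otherwise fall back to a guess. The function name `decode_with_bom_hint`, the fallback of code page 1252, and the sample character U+4E2D are all assumptions made for illustration, not how any particular editor is implemented.

```python
import codecs

def decode_with_bom_hint(data, fallback="cp1252"):
    """Decode as UTF-8 if a UTF-8 BOM is present, else guess the fallback."""
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8 (signature)", data[len(codecs.BOM_UTF8):].decode("utf-8")
    return fallback + " (guess)", data.decode(fallback, errors="replace")

payload = "ABC\u4e2dABC".encode("utf-8")  # U+4E2D is a hypothetical sample

label, text = decode_with_bom_hint(codecs.BOM_UTF8 + payload)
print(label, len(text))  # utf-8 (signature) 7

label, text = decode_with_bom_hint(payload)
print(label, len(text))  # cp1252 (guess) 9
```

With the signature, the nine payload bytes decode to seven characters. Without it, the guess turns the one Chinese character into three wrong ones. One caveat: the raw run of bytes 128 to 255 is not valid UTF-8 on its own, so a strict decoder such as Python's raises an error on it, and the editor in the video is evidently more forgiving.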
I'm doing the best I can, but I want people to have a general idea of character encoding: what it means, how it works, and stuff you need to know. Because when you get a string and you don't know the encoding of the string, the best you can do is guess. If it has a byte order mark, then you have a lot more to go on, but not all bytes are created equal.
If you have any more comments or questions, put them in the comments below, and if you have an idea for a future video, please let me know and I will do my best to make one. Thank you very much, and please subscribe.