Booting faster

Jun 03, 2021

I have Stewart Smith with a faster start.

Thank you. I'm Stewart. I work at IBM. I came up with a good way to describe my work, and that is to bring the F word more to the open source communities, and that F word is firmware. I work on OpenPOWER firmware, mainly OPAL, and everyone should have open source firmware stacks. I'm going to talk about how we're trying to make machines boot faster.

A wise person once said premature optimization is the root of all evil. When you combine this with the idea of bringing up a new machine or a new processor, and you realize that boot is a feature, you might think you should bring it up first: make the machine boot before you make it boot quickly. That's definitely what we do with our hardware. I had this thought going into this talk: hey, what's the fastest booting machine I've ever used?

It looks like this. You flick the power switch by the keyboard, everything is working, and you can go about your life. It was very fast: the processor starts executing instructions directly from ROM. The next step on is a little bit slower because we're doing I/O. You hear the sound of a disk drive and that just brings back memories, right? As kids we had this, the Apple II, and it's slower because we're reading things from disk. That involves a lot of I/O and possibly an operating system and everything that's in there. Back then there really wasn't a difference between firmware and an operating system, but once there was, things still took longer because they were doing more things. One thing I started thinking about is that what users care about is the total boot time. If you look at a PC, you might spend all this time going through the BIOS and then finally start booting an operating system.

The user doesn't care whether the OS only takes a second or not. They care about the total time until they can get to a login prompt or whatever their workload is. So this talk will work backwards from there: let's go from the moment your computer is ready to do something useful right back to the point where the power cord goes in and you press the button.

At the end we have user space, that fancy land where you have things like memory protection and prompts and graphical things. It looks something like this: we have user space, that's your actual workload. Before that we have to boot an operating system kernel. Before that we have to boot some kind of bootloader that works out where to load an operating system from. Before that there is some firmware. And really, before that, you have something else: if you have server-type hardware you have a service processor, a BMC, and that is what allows you to turn on your computer. It is the little computer inside your computer that allows you to turn on your computer, or rather it turns on the smaller computer inside your processor, the one that powers up the other computer to make it look like a computer. It's turtles all the way down. All of our OpenPOWER machines have a service processor, and I'm going to talk about the service processor.

Well, let's say it's already running. If we look at an OS boot process, what's the full path to do that? This is just running in a virtual machine on my laptop, which was obviously loaded because I had a web browser open and that uses all your memory. The operating system getting to user space is really fast, and booting into a graphical environment is actually very fast. It's the same when we run it on our machines: this is a POWER server just booting Ubuntu. Many things are done in parallel now. We have wonderful programs that track dependencies and figure out what to start, in what order, and what you can do in parallel, and they start things very quickly. You can get to user space and a login prompt and start your services very quickly. I wish it were faster, but it's already pretty fast.

If we look at the total boot time, we start to see things a little differently. This is a POWER8 machine booting, and you may already realize that this doesn't look like Linux. It's firmware, and it's taking a decent amount of time, probably longer than it took the Fedora system to boot in that virtual machine.

Remember what I was saying about total boot time: as a user we don't care whether it's the operating system's fault or the firmware's fault, we care about things being faster in general. And things didn't necessarily get better when we moved to POWER9. That was a POWER8 system. In this video we get a bit of initial output a few seconds after the BMC decides to start the machine booting, but we spend a lot of time going through the firmware.

So where are we? POWER9 systems have been out for a while, about a year. It's not the first system we've shipped with this firmware stack, so I thought I'd go back and see how long our original system took to boot. It was this one: it's kind of an enterprise IBM box, but it ships with a firmware stack like you would get in an OpenPOWER system. It has this beautiful web UI where you can click the button that says please turn on my system, and it has four states. What you get is that nothing appears on the IPMI serial console, and you have this little web thing that gives you very limited status on what's happening. If you can figure out what those hex codes mean and remember them, then you're a better person than me, because I just Google them. That's what you get when you boot the machine, and in fact it took a long time. The only thing we see on the way, before we get to an operating system, is a little bootloader and Linux.

This is the interesting part, because a quirk of our systems is that before we load an operating system we boot an operating system. If you were going to design a system that reads a Linux file system to find a kernel in it, starts that kernel, and parses a bunch of on-disk data structures and a configuration file, what do you think would be a good environment to write that in? Yes, there is always one person in the whole world who wants to write it in Forth.

It turns out that's not how I want to do it. I'd love to run it in userspace. Wouldn't you love to write all this code in userspace, where we have memory protection and good debugging tools and all that? And hey, didn't someone already write code to parse all those filesystems? Isn't it called Linux? Why don't we boot a Linux kernel, and then how do we move from it to another operating system? Well, kexec, that's literally what it's for.

So we boot a little embedded Linux environment from firmware flash, with a very small amount of user space that gives you a curses UI, and it reads the filesystems using the kernel's own filesystem code. If that doesn't work correctly we can just fix the kernel and it will be fixed for everyone. That sounds good: we contribute, and we change things to be better for everyone, not just our firmware stack. It also means that everyone else can maintain it for us, so thanks for maintaining large parts of it. It's very cheap to hire people when other companies do it for you. And we don't have to rewrite the drivers: you have to write drivers for an OS anyway, and we don't have to rewrite them in the firmware, in Forth. We can write them once, in the kernel, and get the vendors to push their driver upstream. That's our strategy: you want to have a device that our systems boot from? Get it upstream. It turns out that works very well, and isn't that a good change in the world.

So we're now shipping kernels, and when we boot Linux here, in this Linux environment, there are improvements to be made. On that first machine that we shipped, you can tell that it's actually a pretty slow Linux boot. People have seen Linux boot very fast before, and this was not that: it took about 40-odd seconds to boot the kernel and present the menu with all the options available, booting from a network, booting from a disk, and all that jazz. That's quite a while to boot a Linux system, so how can we deal with it?

There are already a lot of Linux distributions that have done a lot of tricks to make Linux boot faster, so we should copy all of their work first, before we start on the rest of the firmware. This is a solved problem that people have been working on for years, so we did that. Our first efforts went well. We made all the drivers modules instead of building them into the kernel, which means we could get to user space faster instead of waiting for the hardware to probe. Then we could probe in parallel, and as the disks come up and RAID arrays are scanned, we could start parsing them, and it would be a lot faster. It turns out we can do that, and we've gotten that Linux boot from about 40-odd seconds down to about seven, which is a nice improvement: we've shaved off 35-odd seconds of boot time. But one strange thing is that you will notice the disks have not appeared yet.

You had a menu, but did you have an option to boot? It took an extra 30 seconds once you're in the menu to determine whether you could actually boot something. You can go and set up a network connection, drop to a shell, and do all that, but you still can't start the OS. So what do we do there? Get faster hardware. It turns out that enterprise storage adapters can take a long time to start. What if we use different adapters, or poke the teams that make those adapters to make them detect the drives faster? If we take a different machine with a different device, maybe a normal NVMe drive or some cheap card, what does that do?

Well, it turns out that hardware matters for boot time. We could shave tens of seconds off boot time just by having a quick way to find disks. For machines with the big enterprise RAID cards, that card works out to adding about 30 seconds of boot time just to discover disks. Fast, right? Brilliant job, well done.

We combined all these tricks, and one crazy trick was that we added the quiet option to the kernel. It turns out that many IPMI implementations are less than ideal and can spend a lot of time trying to emulate a serial port. We added the quiet option and on some machines that cut 18 seconds off the boot time. That was a bit of a side quest, but now we can bring up the Linux kernel very quickly on a big machine: dual socket, 22 cores per socket, 4 threads per core, a lot of memory. We can make Linux get to user space and run very fast. So the kernel saved many seconds of boot, and we fixed the serial problem on the BMC side with an open source BMC called OpenBMC, and you should definitely run it on your hardware. This is great: we saved a lot of time, going from 40 seconds to seven to nine seconds. And it's not just faster, it feels faster, because we bring up the UI first, before we've detected everything, so it's like, oh, it's done.

You notice how you turn on a desktop and the disk is still running and things are still starting up, because you don't need everything to be up first. It's kind of lovely and simple in a way: it feels faster, and it actually is considerably faster to get there. So where does that time go, in those 7 or 9 seconds? If you look in the kernel log, you see that it is between 4.5 and 4.9 seconds before we get to initial user space, the point where we run init. You say, well, that's a long time before we get to user space, what the hell is it doing?

I investigated it, and we spend about 2.5 seconds decompressing the initramfs. Because we're putting this in firmware flash, we want to use the flash space as efficiently as possible, so it is compressed. On a laptop or on the big servers we sell, it always seems to take two and a half seconds. So one idea we have is to eliminate that two and a half seconds. Why? There are earlier boot phases. What if we did the decompression earlier, while we were just waiting for PCI devices to be discovered? Could we save two and a half seconds? There is a patch series written that does that, and we save two and a half seconds, and that's great.
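The idea of hiding decompression behind a wait that has to happen anyway can be sketched in a few lines of Python. This is only an illustration of the overlap, not the firmware patch series mentioned in the talk: `PCI_WAIT_SECONDS`, `boot_overlapped`, and the payload are hypothetical stand-ins, and the standard `lzma` module is assumed as a proxy for the compressed initramfs.

```python
import lzma
import threading
import time

PCI_WAIT_SECONDS = 0.05  # hypothetical stand-in for waiting on device discovery


def boot_overlapped(compressed: bytes) -> bytes:
    """Decompress on a second thread while the main thread waits for PCI."""
    result = {}

    def decompress() -> None:
        result["image"] = lzma.decompress(compressed)

    worker = threading.Thread(target=decompress)
    worker.start()
    time.sleep(PCI_WAIT_SECONDS)  # the wait we had to do anyway
    worker.join()                 # image is ready once both have finished
    return result["image"]


if __name__ == "__main__":
    payload = bytes(range(256)) * 4096  # hypothetical initramfs contents
    image = boot_overlapped(lzma.compress(payload))
    print(f"decompressed {len(image)} bytes during the wait")
```

If decompression takes less time than the wait, its cost disappears from the critical path entirely; otherwise only the difference remains.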

It's not currently in production due to a bunch of really horrible, deep technical details in how we assemble the firmware images, something I'll only talk about with copious amounts of alcohol. But we'll get there soon, and we'll be able to take a further two and a half seconds off, so that's great. Once we get to that, we're around two and a half seconds from starting a kernel to user space on the machine. We can probably get it down further, we can probably get user space itself to only take a couple of seconds, but let's look at the entire boot time.

We have made the Linux part really fast. We copied everyone else's work on how to do it quickly, and off we go, with an improvement on the way to shave a couple more seconds off. So what about the total boot time from a powered off system? The first one I showed was slow, and we never got to the end of it. It was originally more than three minutes from clicking the button in the web UI to being able to load your OS. We got it down to less than two minutes, so that's a full minute of boot time, which is pretty good: about 33 percent without terrible effort.

POWER9 is a bit slower for various reasons. We usually add features in various places that slow things down, such as secure boot. It takes some time to verify the firmware, and you have to pin more pages in memory while you're running cache-contained, which means that you have to verify more signatures while having less memory to page things in and out of. There's all that kind of stuff. We also have more code in the small processor that starts your big processor, so it takes more time for it to load its firmware, boot, and hand things over. But it's obvious that when we're talking about more than two minutes in total, those nine seconds may not be the next place we should look to optimize.

Okay, we're probably done at that layer, so here's what our firmware stack looks like. The last part is what I've been talking about: Linux and petitboot, which is the bit where you go and probe some disks. Before that we have a piece of firmware called skiboot, which is where I mainly work. It's a little thing written in C, and it's both boot and runtime firmware, so it stays resident with Linux running. Before that we have hostboot, which turns your computer from a pile of transistors into something that looks like a computer. And before that we have the Self Boot Engine (SBE), which is a microcontroller inside the CPU that makes it so that a core can execute instructions from cache. What it needs to do is push some instructions into the cache that say how to load the rest of the firmware, configure the CPU to be able to execute instructions, and press "go". So that's our boot process.

Remember, we're going back in time. What are we doing here? On a POWER8 OpenPOWER system we spend about 20 seconds in this layer, and it used to take a lot longer. In that first iteration it used to take about a minute, and that was because we did things in series. It turns out you have a lot of CPU cores, so do things in parallel, and you know, it worked. So what's taking the time? Well, the PCIe spec says you have to wait a certain amount of time to find out if there are any cards there. It's like you say "hi" and then wait to see whether anyone answers before you move on.

That time is built into the spec, so we are guaranteed to always spend a couple of seconds when you have PCI; that is the minimum we could achieve. Our original code took a long time because we did that "hey" for each PHB, one after another, to try to find things, and it turns out you can do them all in one go, so we cut a lot of seconds by doing it that way. We have a lot of CPU threads: you can have 22 cores a socket and 4 threads per core, and that adds up to a lot. So even without task switching, you just run everything on different CPU threads and it's fine. PCIe now takes about three seconds of that boot time, and you may know that 20 minus 3 equals 17. See, you learned something today. So where does that time go?
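The difference between waiting on each PHB in turn and waiting on all of them at once can be sketched as follows. This is a minimal illustration and not skiboot code: `SETTLE_SECONDS`, `PHB_COUNT`, and `probe_phb` are hypothetical, the mandated wait is scaled down to milliseconds, and Python threads stand in for the spare CPU threads the talk mentions.

```python
import time
from concurrent.futures import ThreadPoolExecutor

SETTLE_SECONDS = 0.05  # hypothetical, scaled-down stand-in for the mandated wait
PHB_COUNT = 6          # hypothetical number of PCI host bridges


def probe_phb(phb_id: int) -> int:
    """Pretend to reset one PHB and wait to see whether a card answers."""
    time.sleep(SETTLE_SECONDS)
    return phb_id


def probe_serial(count: int) -> float:
    """Wait on each PHB one after another; cost grows with the PHB count."""
    start = time.perf_counter()
    for phb_id in range(count):
        probe_phb(phb_id)
    return time.perf_counter() - start


def probe_parallel(count: int) -> float:
    """Wait on every PHB at once; cost is roughly one settle period."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=count) as pool:
        list(pool.map(probe_phb, range(count)))
    return time.perf_counter() - start


if __name__ == "__main__":
    print(f"serial:   {probe_serial(PHB_COUNT):.2f} s")
    print(f"parallel: {probe_parallel(PHB_COUNT):.2f} s")
```

The serial version pays the settle time once per PHB, while the parallel version pays it roughly once in total, which is why the spec-mandated wait becomes a floor rather than a multiplier.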

It turns out that if you measure it, which may involve turning up firmware logging levels or adding some extra printf, because printf is the absolute supreme way of debugging, you find that all this time is spent loading one thing. Linux and the petitboot environment I was talking about live in a compressed sixteen megabyte partition on flash. Our flash is partitioned, the kernel is compressed with xz, and the initramfs is compressed with xz, so that's probably the smallest we can make it. Part of skiboot's job is to load it from flash into memory and jump to it at the end. So how long does that take?

On our enterprise system, the one with the web UI and the hex codes, it takes less than a second to load those 16 megabytes, because it's an enterprise system that has a fast connection between the service processor, where all the firmware is stored, and the POWER CPU, so you can just put it in memory really fast. But what about our OpenPOWER systems, the ones you can buy now for a thousand dollars a board, or the ones that are the top two supercomputers in the Top 500? What happens there?

Well, to tell this story, we need to go back in time. One comment when I gave a draft of this talk at work was that some people may not know ISA buses, and I said, are you telling me I'm old? So please, someone assure me that someone knows what this is. We had these slots, ISA slots, and the bus still exists in modern computers, in almost all of them. It's called Low Pin Count (LPC), because they basically took ISA, reduced the number of wires to 4, quadrupled the clock speed, and added some extra features. Otherwise it looks very similar and the ideas are the same. I haven't benchmarked loading our firmware from flash on one of those old cards, but I suspect performance is similar.

This is what an LPC firmware read cycle looks like. It's a 33 megahertz bus and it's quite chatty for reading data. This is reading a single byte, and you can see that there are a lot of cycles involved in reading a single byte. 33 megahertz is not many megahertz any more, so you can work out that it can't be too quick, and you go: there's a lot of overhead, it could be better. They thought about that, and there's a 128 byte read: you pay all that overhead once and get 128 bytes instead of one, so that would be great. You might notice a little word there that says optional, but let's do the math. For the small reads you can ask, okay, how many clocks is all this? It's in the spec, which is a great specification from Intel and a very useful read if you need a book for a flight. Here is the total access time in clocks and the bandwidth in megabytes per second: 1.75. For some reason the math is missing from this slide, so let's do it here. What does 1.75 mean? 16 megabytes divided by 1.75 megabytes per second, in the optimal situation where you never have to wait for flash, nothing is busy, and our code is perfect, is 9.14 seconds. It turns out we are not quite ideal, and ours takes around 12 seconds or so, but that is the lower bound to read those 16 megabytes.

What if we had this 128 byte read cycle? The spec does the math for you, because who wants to do the math when you're urgently trying to write a talk: we could get 5.74 megabytes per second. That would be great, but it turns out our hardware has a limitation: our service processor doesn't support these 128 byte read cycles. I know this because not only did they tell us it didn't support it, I tried it and it didn't work. I never believe almost any documentation; I always try it. It turns out that if we changed our system design so that, instead of having the flash connected to the service processor, the BMC, and reading it through that, we had it connected directly to the LPC bus, maybe we could get flash chips that support this and boot faster. I think some x86 systems do this; they have a mux deciding who controls the flash. We ended up doing it our way because we get some extra features and other things that work, but making hardware changes there would really improve boot time with a single line of code changed. That way we'd shave that time down by five seconds or so. So what other buses do we have to the BMC to pull the firmware over?
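The arithmetic from the slide can be reproduced directly. A small sketch, using only the figures quoted in the talk (a 16 MB partition, 1.75 MB/s for single-byte reads, 5.74 MB/s for 128-byte reads); the function and constant names are hypothetical, and the result is a best-case floor that ignores flash latency and software overhead.

```python
PAYLOAD_MB = 16.0             # size of the boot kernel partition quoted in the talk
SINGLE_BYTE_MB_PER_S = 1.75   # single-byte firmware read bandwidth from the talk
BURST_128_MB_PER_S = 5.74     # 128-byte read bandwidth from the talk (optional cycle)


def transfer_seconds(size_mb: float, mb_per_s: float) -> float:
    """Best-case time to move size_mb over a link running at mb_per_s."""
    if mb_per_s <= 0:
        raise ValueError("bandwidth must be positive")
    return size_mb / mb_per_s


if __name__ == "__main__":
    single = transfer_seconds(PAYLOAD_MB, SINGLE_BYTE_MB_PER_S)
    burst = transfer_seconds(PAYLOAD_MB, BURST_128_MB_PER_S)
    print(f"single-byte reads: {single:.2f} s")  # 9.14 s
    print(f"128-byte reads:    {burst:.2f} s")   # 2.79 s
```

The measured figure of around 12 seconds sits above the 9.14 second floor, which is what you would expect once real flash and software overheads are included.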

Could we just read it via PCI? The BMC also connects to the host via PCI. That's why on servers you have a VGA port on the back: it is actually a PCI device on the BMC which is exported to the host. But PCI appears quite late in boot, in skiboot, and to do the data transfer over it we would have to write more code, and that sounds like work. We can do it, but we also want to think about whether this is the best place to invest. Will this solve everything?

Let's look at the full boot picture again. The total boot time of our system is 2 minutes 25. Linux and petitboot is now about 9 seconds, which is great. Skiboot is now about 20 seconds, including reading those 16 megabytes. That means the previous two firmware phases, SBE and hostboot, total about one minute and 50 seconds, so we should possibly see what they're doing and whether we could make them faster.

Let's take a look. We spend about a minute and seven seconds in hostboot. This is a bit of firmware that, among other things, brings up memory. It boots the little on-chip controller that does thermal management, and it starts the little thing that is used for deeper stop states within the CPU cores. That is power management on the individual core: there's a little microcontroller that knows how to turn the core on and off, you want it configured correctly, and so it needs firmware. Hostboot sets all of that up, trains the memory, brings up DRAM, and all that. This is the boot process on a dual socket POWER9, so it's pretty fast once we start to train the memory and it comes back. But how does this work on large systems?

Let's go on a detour into crazy big systems. Hostboot also runs on the giant multi-drawer, many-socket systems that you buy for a lot of money when you need a ton of memory and a ton of compute, and the way they boot is kind of interesting. You have a different SMP bus between the drawers than you do between the CPUs on a board, so you end up running hostboot in each drawer individually. You have four instances of this piece of firmware at some point, and you get to the point where you train the links between the drawers that have the additional processors, and then you just put everything together and convert from four separate computers running something that looks like an operating system into one computer, which then runs an OS with a lot more memory and cores. It's a bit strange, but it's really cool. I won't talk too much about that; I just want you to know that's how the big machines work. We shouldn't break those machines, because they turn out to make a lot of money, and the firmware team would be very grumpy if they couldn't do the rest of their work any more. So let's break it down: what takes the time in hostboot?

Before we have DRAM enabled, while we run cache-contained, it takes around 33 seconds. On, say, a relatively small machine with only 512 gigs of RAM, it takes about 10 seconds to train that memory, and that's pretty fast: 10 seconds for half a terabyte. Post-RAM is about 25 seconds. So what is it doing?

Remember how I was just talking about all that time reading things from flash. Let's look at our flash's partition table. These are all the partitions in a firmware image for a Witherspoon system, and they all have names and offsets and flags and all that kind of jazz. The ones that hostboot reads during a normal boot with no errors are highlighted. They are really hard to see and read on the slide, and you don't really need to worry about them other than the fact that if you add up all the numbers it's about 32 megabytes. So let's do some math: 32 megabytes divided by 1.75 megabytes per second is about 18 seconds. That's the best case, where we are never waiting on flash, everything is working perfectly, and we are going as fast as we can. So at least 18 seconds of that minute and 7 will, by definition of the hardware, be spent reading things.

One of the strange things about that bit of firmware is that it does on-demand paging. It sets up the MMU and pages things in from flash and back out again, because when you're running cache-contained you have a small computer with a few megabytes of memory, and that memory is quite scarce, so you just page in the parts of your code you need. That does mean that not many firmware authors care so much about code size. It's also a way to get the computer to boot at all before making it go really fast: premature optimization and evil. So we have the strange situation of paging before there is memory: we have swap before there is RAM. So how can we take a look at what's happening?
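The hostboot budget above can be checked with a few lines of arithmetic. This sketch only restates the figures quoted in the talk (33, 10, and 25 seconds for the three phases; 32 MB of flash reads at 1.75 MB/s); the names are hypothetical, and the phase labels are paraphrases.

```python
# Phase timings quoted in the talk, in seconds (labels are paraphrased).
HOSTBOOT_PHASES_S = {
    "cache-contained, before DRAM": 33,
    "memory training (512 GB)": 10,
    "after DRAM is up": 25,
}
FLASH_READ_MB = 32.0   # partitions hostboot reads on a normal boot
LPC_MB_PER_S = 1.75    # single-byte LPC read bandwidth from the talk


def phase_total(phases: dict) -> int:
    """Sum of all phase timings, in seconds."""
    return sum(phases.values())


def flash_floor_seconds(size_mb: float, mb_per_s: float) -> float:
    """Minimum time the bus needs to deliver size_mb, ignoring all overhead."""
    return size_mb / mb_per_s


if __name__ == "__main__":
    total = phase_total(HOSTBOOT_PHASES_S)
    floor = flash_floor_seconds(FLASH_READ_MB, LPC_MB_PER_S)
    print(f"hostboot phases add up to {total} s")      # 68 s, close to the quoted 1:07
    print(f"flash read floor is {floor:.1f} s")        # 18.3 s
    print(f"share of hostboot spent on the bus: {floor / total:.0%}")
```

The phases sum to 68 seconds, which agrees with the "about a minute and seven seconds" quoted earlier to within rounding, and the flash floor accounts for roughly a quarter of it.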

We want to know. It's probably doing this, and it's probably reading a lot, but is it reading optimally? What's going on? You could find out by putting printf into hostboot, but remember how I said it adds to the boot time when you print too many things: quiet saved 13 seconds. It would be annoying to print everything and then have to account for that. Or you could put it in memory somewhere and copy it out later, but you don't have memory yet, so that's a bit annoying too, and you don't want to perturb the boot process if you want to know what actually happens when you actually start the computer. So you go to what you would do in Linux: blktrace.

There's this lovely kernel feature where you say, please trace everything that's happening to this block device across the entire I/O layer, and you get wonderful, pretty graphs and movies of which blocks were read and written, when they were queued, and all kinds of fancy stuff. Cool, that would be good for firmware, so I wrote it. We have a flash daemon on the BMC. The host reads flash over that LPC bus, but there's a little piece of software on the BMC that marshals what is visible in the LPC window so it can be read. A small modification was made to that piece of software on the other side, saying: we can only show you one 4K page at a time, so you'll have to request a new one each time. That slowed things down, but we get a very accurate representation of which pages are being read.

Hostboot runs with 4K pages. Our Linux runs with 64K pages on this hardware, but 4K is good when you have memory-limited systems, like when running cache-contained. So what do we see? Let's take a look. This is a full boot of a system, showing all the reads and writes to flash while it boots, and you can see some important things there. I tried to label it a little bit. There's the boot kernel bit at the end, a big linear read from flash. There are reads of NVRAM for the boot variables. There's the runtime firmware, skiboot, being read. There's some other data we need to read, another chunk of firmware for the on-chip controller that does thermal stuff. And we have this big spread of something here that looks strangely like a bunch of paging going on, and as you look along the lines, you'll see that some of those pages are read more than once.

If we look at the full statistics of that traced run, there's 50 megabytes of total I/O. Of the roughly 30 to 35 megabytes of flash that we need to read in total, we're actually doing about 15 megabytes more I/O than essential, which, if you do the math, is about eight and a half seconds. So we spend about eight and a half seconds just thrashing swap during boot, which doesn't sound ideal. So how do you fix that?
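The kind of analysis described here, spotting pages that were fetched more than once and pricing the wasted I/O, can be sketched as below. The trace format is an assumption made for illustration (a plain list of 4K page numbers in request order), not the actual output of the BMC flash daemon, and `summarize` is a hypothetical helper. With the talk's numbers, 15 MB of excess at 1.75 MB/s comes to roughly 8.6 seconds.

```python
from collections import Counter

PAGE_BYTES = 4096          # hostboot page size mentioned in the talk
LPC_BYTES_PER_S = 1.75e6   # 1.75 MB/s from the talk, assuming decimal megabytes


def summarize(trace: list) -> dict:
    """Summarize a list of page numbers in the order they were requested."""
    counts = Counter(trace)
    total_bytes = len(trace) * PAGE_BYTES
    unique_bytes = len(counts) * PAGE_BYTES
    excess_bytes = total_bytes - unique_bytes  # bytes read more than once
    return {
        "total_bytes": total_bytes,
        "unique_bytes": unique_bytes,
        "excess_bytes": excess_bytes,
        "excess_seconds": excess_bytes / LPC_BYTES_PER_S,
        "reread_pages": sorted(p for p, n in counts.items() if n > 1),
    }


if __name__ == "__main__":
    # Hypothetical trace: pages 0 and 1 get paged out and fetched again.
    stats = summarize([0, 1, 2, 1, 3, 1, 0])
    for key, value in stats.items():
        print(f"{key}: {value}")
```

Sorting the re-read pages by count would show which code is being evicted and fetched again most often, which is where shrinking or pinning would pay off first.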

How do you fix it? Make everything smaller. You would think GCC -Os would be the first step: ask the compiler to make the smallest thing. It turns out that's hard, because, for reasons no one can remember, hostboot uses a custom linker that doesn't handle all the possible relocations that GCC can emit, and we have survived for the last six years by pure luck, as our toolchain guys put it when I asked them. We possibly shouldn't have a custom linker in the first place. There's more than one way to skin this cat. I don't actually know how to skin a cat, but let's try to fix that first, so that we can then have -Os.

That should probably shave a lot of boot time essentially for free. It's annoying, but there's pretty good motivation: we could save several seconds of boot without having to actually change any of the core code, just by having the compiler do it, so that sounds great.

Then there's the SBE, the Self Boot Engine, that little microcontroller inside the CPU that starts a single core of the CPU and gets its firmware going. It's been described as a 20 second black hole, and that's for a bunch of reasons. Some of our initialization had to move into the SBE from hostboot, or happen earlier, for reasons where, when you ask one of the developers most involved why they did that, you get the kind of look that means: don't ask. So there's some of that. Part of it too is added features: the SBE does much more than it did on POWER8.

Because we have secure boot for the firmware, the SBE ends up being a sort of gatekeeper at runtime, making sure that you can't go through debug interfaces and get access to things you shouldn't. So we have to account for that as well. But we have a good idea of how to address that Hostboot issue and get a bunch of seconds back. We also have other ideas. There are several bits of hardware initialization in Hostboot that are done serially, and we could probably do them in parallel. And there are a couple more things we could do by simply applying xz compression to partitions, so that we have these large linear reads.
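To get a feel for what "serial versus parallel" could buy, here is a toy timing model. It is only a sketch: the step names and durations are hypothetical rather than taken from Hostboot, and it assumes the steps are fully independent, which real hardware initialization often is not.

```python
import heapq


def serial_time(durations: dict) -> float:
    """Total time when every step runs one after another."""
    return sum(durations.values())


def parallel_time(durations: dict, workers: int) -> float:
    """Greedy longest-first schedule of independent steps onto `workers` lanes."""
    lanes = [0.0] * workers
    heapq.heapify(lanes)
    for d in sorted(durations.values(), reverse=True):
        least_loaded = heapq.heappop(lanes)
        heapq.heappush(lanes, least_loaded + d)
    return max(lanes)


# Hypothetical step names and durations in seconds.
init_steps = {"bus_a": 3.0, "bus_b": 2.0, "memory_ctrl": 2.0, "sensors": 1.0}

print("serial:  ", serial_time(init_steps))
print("parallel:", parallel_time(init_steps, workers=2))
```

The gain is bounded by the longest single step and by whatever ordering constraints the hardware imposes, so a model like this is only useful for deciding which steps are worth measuring first.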

If we use that trick too, we figure we can probably shave 10 to 20 seconds off over the next year. On the SBE I'm not sure; I haven't looked at it too closely yet. But one question I asked was: how long do other computers take? I've really just been focused on my own computers. And it turns out, in the server space, I was talking about this with some people from very large companies at an OpenPOWER Foundation meeting, and one of the guys said, "You realize that you're faster to boot than literally all of our other servers?"

I thought, shut up, I want to make it faster. I also thought, how cool. But one of the problems is that we now have machines that are being used on the desktop or as a workstation again. Take a look at the Blackbird board: a micro ATX board with really cool CPU sockets, because you should definitely have a desktop with a completely open source firmware stack like no other. We have a BoF in this room after these talks, and you should come and see it.

So why do people care about booting faster, or about things that seem to boot faster? Part of it is that, I don't know why, there's a VGA monitor or something plugged in somewhere in the world. If we take too long to boot to the point where we can get some sort of video, the user experience for a desktop ends up being: you press a button, it beeps, the user starts cursing, and then two minutes later the monitor turns on. You sit there for two minutes thinking, what the hell is going on?

Right, people like feedback. So we fixed boot progress on our machines in the BMC, basically by having a daemon on the BMC that reads the serial port and puts it into the video buffer before Linux boots and takes over. That works fine, but it doesn't work when what you have is not the BMC's VGA but a discrete graphics card. Whenever you use a discrete graphics card, it is not connected to the BMC at all. You might have an AMD card in there, because an open source driver stack means you can use it on something that isn't a PC.

So you say, well, I need to bring up the graphics. It turns out that graphics drivers are not really simple, and we don't want to write them very early in firmware. But we do want boot progress to be shown on a discrete graphics card. This is a desktop running on a POWER9 system, which is great, and you don't want two minutes of blank screen on a desktop PC. PCIe needs to be trained, so you're already some seconds in before we can find the graphics card that's there. You might think all of that happens pretty quickly when you turn on a computer, so I timed some desktop computers and asked other people, and actually no: modern machines take a decent number of seconds before the screen turns on, much like ours. So we might be okay with doing this a little bit later.

These cards are definitely not as easy to deal with as a frame buffer, which is a little annoying. We don't have a VGA BIOS, because we don't have legacy PC stuff in there, which makes it a little bit harder. Could we even handle it while we're cache-contained, before we have DRAM? Could we bring up PCI enough to talk to the card, and then figure out how to write a driver for these things in just megabytes of memory? Someone asked the hardware engineers whether we could bring up PCI while we're still cache-contained, and the answer was "good luck". So I think it's probably not verified hardware behaviour, and it would be somewhat adventurous to try. But what could we do to improve things in some situations?

You know, if we can't necessarily make it faster, and that's not going to be a short-term thing, is there a way we can cheat? It turns out we can. What I've been talking about so far is booting a computer. What's the other thing? Rebooting. So we have a feature called fast reboot. The idea is that we don't necessarily need to reset all the hardware when you reboot. Could we find a minimum set of hardware to reset when you say reboot, get back into Linux and our bootloader reliably enough, and figure out when it's going to be unreliable and do a full reboot in that case? It turns out we can, and it now takes a few seconds to reboot, so rebooting is orders of magnitude faster than booting.

It can do over a thousand boots in 23 hours now, so it's pretty good for rebooting as opposed to booting. We ship this on some machines, not all, because it depends on the hardware you have and on the code being enabled in firmware. That test is a machine that fast reboots and runs a hardware exerciser, so you can see how fast it can be to reboot, boot into an OS and start up a bunch of things. But it is no use for cold boot, so it doesn't solve that problem for us.

There's also the argument that this is a bit of a cheat, because sometimes when you say reboot you mean reboot, done properly, because something went horribly wrong. So there's an argument to be made there. I've been looking at other systems and what they do, trying to infer what happens on a cold boot versus a reboot, and it seems other people do similar-ish things; they just take longer. So I'm not too worried. It would still be good to have the whole boot process be faster instead of relying on some kind of cheat for reboot, but it's a nice feature to have. So where did we get to?

P9 was better than P8, even though we regressed in some areas. Our first GA release for P8 took over three minutes to boot, and our first release of P9 took two minutes 25. That was slower than where we ended up on P8, but P8 had had several years of development and optimization, so we do get better over the life of a release. We have a good track record of optimizing the simpler components, like Linux and user space. We need to dig a bit more into the hairier parts, like the depths of Hostboot and some legacy code, and we've been planning where we can fit that into development time and make it work.

We want to get good measurements and act on data instead of guesswork. That was a big reason why I did the "let's get something like block tracing for flash" work: we have these theories about what's happening, and we should test them and act on real data, not just educated guesses. With good measurements we get a good idea of what we can attack, as individual things that different people can do at different times. And we're already faster than some competing machines, so that's a good place to start. I certainly wouldn't have a problem with being not just a fast computer but the one that boots faster than anyone else. Here's the demo of the machine you might want to buy: it's up and running, it has booted an operating system and installed it, before the others have even turned on. That sounds fantastic, and I have no problem with that.

So where can we find the code? The code is on GitHub: you can get all the source code of our firmware under open-power on GitHub. If you don't have a machine that runs it, then you should reconsider your life choices and ask yourself why not. So thank you. There are photo credits and movie credits for the things in the slides, there's an OpenPOWER BoF here in a few minutes, and there's probably some time for some questions.

Regarding the decompression: is it the reading or the decompression that takes the time?

Decompression takes ages, and so does reading. Reading takes 12 seconds and then decompressing the initramfs takes two and a half.

And is that parallelizable?

Even if it's already parallelized, it still sucks. So we're thinking about moving the decompression earlier in boot, so that we do it while probing PCIe devices, for example. Or we read the initramfs before reading the kernel, so that while we read the kernel, which takes about two seconds, we can decompress the initramfs. We're thinking about switching to doing it that way. It also needs us to be able to build a firmware image without constantly changing partition sizes by hand in several different XML files and Perl scripts, so part of the work is making that build reliable. The prototype works, and on enterprise machines it cuts about half a second right away, which is what it should do. Thank you so much.
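The reordering described in that answer (read the initramfs first, then decompress it while the kernel is still being read) can be sketched as below. This is only an illustration in Python, not how the firmware does it: the dictionary standing in for flash, the partition names and the helper functions are all hypothetical. With the figures quoted (about two seconds to read the kernel, about two and a half to decompress), the overlap could hide at most about two seconds.

```python
import lzma
from concurrent.futures import ThreadPoolExecutor


def read_partition(flash: dict, name: str) -> bytes:
    """Stand-in for a slow linear read of one flash partition."""
    return flash[name]


def load_serial(flash: dict):
    """Read the kernel, read the initramfs, then decompress it."""
    kernel = read_partition(flash, "kernel")
    initramfs = lzma.decompress(read_partition(flash, "initramfs"))
    return kernel, initramfs


def load_overlapped(flash: dict):
    """Read the initramfs first and decompress it while the kernel is read."""
    with ThreadPoolExecutor(max_workers=1) as pool:
        compressed = read_partition(flash, "initramfs")
        pending = pool.submit(lzma.decompress, compressed)
        kernel = read_partition(flash, "kernel")  # overlaps with decompression
        initramfs = pending.result()
    return kernel, initramfs


# Tiny fake flash image so the sketch runs anywhere.
fake_flash = {
    "kernel": b"kernel-image-bytes",
    "initramfs": lzma.compress(b"initramfs-contents" * 1000),
}

# Both orderings must produce identical payloads; only the timing differs.
assert load_serial(fake_flash) == load_overlapped(fake_flash)
```

The point of the sketch is the ordering, not the threading mechanism: whichever payload needs post-processing is read first, so that the post-processing runs in the shadow of the next read.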

A lot of this is really great work. I'm not familiar with this deep hardware area, but it seems like a lot of the things you wait on during initialization need RAM. So why not start with RAM? Yes, initializing 512 gigabytes of RAM takes 10 seconds, so why not do it sooner? I noticed some laughter here. Why not do RAM earlier?

That's a good question for the IPL flow, as it's called, for Initial Program Load, because we are IBM and we have terms that date from before everyone else joined the industry. The order of the boot process has been designed together with the hardware engineers: do this, step one, two, three, four, five, six, then 6.1, 6.2, through to 21.3, and then there are the other bootstrap bits. That's one of the things we could go back and look at. Here's the verified sequence that everyone believes works, and the question is whether we could rearrange it a little bit to make things better.

We haven't had that discussion yet, and maybe it's one we should have had. For something like that, we could maybe do it on existing processors, or we could talk about it for next-generation ones. I can't say what future processors might be called, but there have been POWER, POWER2, 3, 4, 5, 6, 7, 8, 9, so we have no idea what the next one might be called. We could look at that and have the discussion there: hey guys, could we make sure this hardware comes up first? Then we could implement this and improve boot time, and also let everyone write firmware with a lot fewer headaches, because the minute you start running out of memory while you're cache-contained, people start crying and having a terrible day.

Whether it's a firmware or a hardware question is something we need to figure out. If it's hardware, then we should talk to the hardware people, because it turns out they also work at a company called IBM, and we can send them things that look like email.

How much of this is architecture-independent? Are there benefits coming out of this that will help traditional x86 hardware boot, or that the Arm community can take advantage of, or is most of this effort very focused on what you do in the OpenPOWER space?

We periodically get questions like, why don't you use UEFI or coreboot? So we've dug into it. One of the things is that nobody in the world has UEFI and finds that it brings them joy, which is the thing right now. And if you look at the components that are in coreboot, pretty much all we would get from it, as far as I can figure out, is someone else's limited libc, and that doesn't give us much more than what we already have in the current codebase. So it's a little difficult to say whether something more common, shared among all platforms, would really help, because a lot of that time is spent on very chip-specific operations: these are the exact registers to poke with these values, because that register is at that address, because of the physical layout of the chip, and these are the values because they're the ones that were worked out when the thing was manufactured, because transistors and physics. A lot of those things end up in firmware because we never want to put them in the kernel. That's our line: we should never have the kernel poking at internal processor registers that are not architected.

Most of our firmware time is spent on things that are POWER-specific and not really translatable to general Linux work. So for the parts that are, like whether we can build faster boot images and make the image build faster, that's what we look at, so we can push it upstream and be done.

It really should make that RAID controller run faster. Before that we did cleanups in kexec, for example, and we get a lot of bug fixes into kexec, because that's how we boot an OS and we want it for kdump anyway. So that's de-duplication of work. There are little bits like that, but a lot of it is POWER-specific, and going anywhere else is a questionable return on investment. And why bother with UEFI? We just keep using Linux and improve life for everyone.

Hey, so you mentioned maybe connecting the flash more directly to the CPU in the future.

I assume you've already given this as feedback to the hardware team so they know for future iterations. Have you considered just fitting more flash so you don't need to compress anything?

Yes, we could get more flash. But not compressing means we have to read more megabytes from flash, which might take longer, so it's a trade-off. One of the things we might consider is whether we could do something like a squashfs and page things in. But then you have a problem with secure boot. It's very easy when it's: here's a file, you sign it; here's a partition in flash, you sign it, you load it, you verify the signature. When you're paging things in, you now need to be able to verify individual blocks, or read everything ahead of time anyway, and that becomes something you have to think about a lot.

You could get larger flash chips, but that also adds cost. We already have a fairly large flash chip, typically 64 megs or 128 megs, and that's pretty big for a system firmware chip. We could get more; some of the enterprise machines have much bigger ones, because they want more features that use the space. But it's a balancing act. You also never want to use a bus that's too complex to bring up. LPC is great because it just works: it's slow enough and well enough known that it just works. You don't need complex training or magic setup. It's really simple hardware, and that's why we use it. There are newer generation buses, but there are other complications. So it's a discussion about what you can reasonably get and what you can implement with the least amount of effort, so that you can spend your time on other things. It's all that complex negotiation of where you're going, and then sometimes it's just that I don't want to do that real work this week.
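The compression trade-off in that answer (fewer megabytes to read from a slow bus, at the cost of decompression time) can be put into a small model. This is a sketch under stated assumptions: the sizes, ratio and rates below are hypothetical, the read and the decompression are assumed to run one after the other, and both are assumed to proceed at constant rates.

```python
from typing import Optional


def load_time_seconds(
    uncompressed_mb: float,
    ratio: float,
    flash_mb_s: float,
    decompress_mb_s: Optional[float] = None,
) -> float:
    """Time to get `uncompressed_mb` of firmware from flash into memory.

    ratio: compressed size / uncompressed size (1.0 means stored uncompressed).
    decompress_mb_s: output rate of the decompressor, or None if uncompressed.
    """
    read = (uncompressed_mb * ratio) / flash_mb_s
    decompress = 0.0 if decompress_mb_s is None else uncompressed_mb / decompress_mb_s
    return read + decompress


# Hypothetical numbers: 32 MB payload, 2 MB/s flash, 2:1 compression,
# decompressor producing 16 MB/s of output.
plain = load_time_seconds(32.0, ratio=1.0, flash_mb_s=2.0)
packed = load_time_seconds(32.0, ratio=0.5, flash_mb_s=2.0, decompress_mb_s=16.0)

print(f"uncompressed: {plain:.1f} s, compressed: {packed:.1f} s")
```

On a slow bus the read term dominates, which is why compressing tends to win there; with a much faster flash interface the balance could tip the other way.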

I'm going to go see if we have more time. Our time is over? So stick around if you want and check out the board. It's cool. I also have a POWER9 as a bare die, if you want to see the bare CPU, and you can point out where the CPU cores are, which is kind of neat. So stick around, and thank you.
