
Zig's I/O and Concurrency Story - King Protty - Software You Can Love 2022

Mar 13, 2024
Uh, testing, can you hear me? So yeah, my talk is about pretty much the only thing I do in Zig, which is I/O and concurrency.

Here's a little introduction about me. I go by kprotty online; my real name is King Butcher, if you want to find me in a real professional environment. Right now I'm a co-founder of TigerBeetle, a company that writes a database in Zig, originally for financial transactions, though now we're expanding to broader use cases. Alongside TigerBeetle, I'm part of the Zig core team. I was working on an I/O and concurrency project called zap before I got a call from Andrew, and it kind of snowballed from there. On the core team I contribute I/O and concurrency stuff to the standard library and other parts, but concurrency is where my focus is. I also write blog posts and do drawings, and you can find some of my drawings on Discord or Twitter. Okay, more volume and more precision with the microphone, yes, it will be like at home.

So the first thing to think about with I/O and concurrency is what you actually use it for. One of the first things that comes to mind is building a web server, because it's the most marketable thing for a language: everyone knows the TechEmpower benchmarks are fake, but we still use them as a good bragging system.
It's also a good place to try out base design decisions, because a web server is not just receiving requests and sending responses; it also has to do scheduling for efficient request handling, and there are optimizations around how I/O is sent in batches, things like that. It's actually a pretty interesting problem in general. The goal for this particular web server is to support at least 10,000 concurrent clients with low latencies, and that last part is pretty important. If you tried this yourself without any knowledge of how to write efficient web servers, you would probably just spawn one thread per socket to make it work, and that's actually a reasonable start, but the problem is that it doesn't scale: once you have more concurrency than the machine can handle, you start oversubscribing threads to the cores. What this can mean is that some threads hog the cores so much that other threads that still need to run never get scheduled, and in an I/O setting that's a problem.
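The thread-per-socket starting point just described can be sketched in a few lines. This is a minimal illustration in Python rather than Zig (the handler and the echo "protocol" are made up for the demo); the point is only that each client costs one OS thread, which is what stops scaling at high concurrency.

```python
import socket
import threading

def handle(conn: socket.socket) -> None:
    # One OS thread services exactly one client. Simple and fine at low
    # concurrency, but with 10,000 clients this oversubscribes the
    # scheduler and tail latency suffers, as the talk describes.
    with conn:
        while data := conn.recv(4096):
            conn.sendall(data)  # trivial echo standing in for "handle request"

def serve_one(listener: socket.socket) -> None:
    conn, _addr = listener.accept()
    t = threading.Thread(target=handle, args=(conn,))
    t.start()
    t.join()

# Demo: a loopback client/server pair on an ephemeral port.
listener = socket.create_server(("127.0.0.1", 0))
port = listener.getsockname()[1]
server = threading.Thread(target=serve_one, args=(listener,))
server.start()

with socket.create_connection(("127.0.0.1", port)) as c:
    c.sendall(b"hello")
    reply = c.recv(4096)
server.join()
listener.close()
print(reply)
```

The echo round-trip works, but every additional client would add another thread, which is exactly the oversubscription problem described above.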
Latency is actually more important than throughput here. The Linux scheduler, which is probably what you're going to run the web server on, optimizes for throughput, so if you have, say, 1,000 threads, maybe 200 of them will be running most of the time while the others get swapped in behind them, and your server's tail latency gets really bad: it can take more than a second to process a request that lands on one of the unlucky threads. That's the problem on the latency side, and one way to solve it is by making the network I/O non-blocking, so you're no longer stuck with the thread-per-socket model where each thread can only service one socket because doing I/O blocks the whole thread. Instead of blocking a thread on one socket operation, you can have multiple socket operations in flight on a single thread, which means you don't have to rely on the OS thread scheduler for concurrency and can optimize the latency yourself.
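One thread servicing many sockets can be sketched with Python's standard `selectors` module, which wraps exactly the readiness APIs discussed next (epoll on Linux, kqueue on BSD/macOS). This is an illustrative sketch, not how a Zig server would be written; the socket pairs just simulate peers.

```python
import selectors
import socket

# One thread multiplexing several sockets via readiness notifications.
# selectors.DefaultSelector picks epoll on Linux and kqueue on BSD/macOS.
sel = selectors.DefaultSelector()

pairs = [socket.socketpair() for _ in range(3)]
for ours, _theirs in pairs:
    ours.setblocking(False)
    sel.register(ours, selectors.EVENT_READ)

# Two of the three peers write; the selector reports exactly those two
# as readable, so the single thread only touches sockets with work.
pairs[0][1].sendall(b"a")
pairs[2][1].sendall(b"c")

ready = {key.fileobj for key, _mask in sel.select(timeout=1)}
got = sorted(s.recv(1) for s in ready)
print(got)

for ours, theirs in pairs:
    sel.unregister(ours)
    ours.close()
    theirs.close()
```

No thread ever blocks on a single socket; the selector tells us which of the registered sockets are serviceable right now.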
There are two ways to model non-blocking I/O, and which one you get really depends on what your operating system offers. The one you're probably familiar with, if you know POSIX, is the reactive, readiness-based approach. This is where you try to do the I/O, the call tells you whether or not it could complete, and you use an external system to tell you when to retry. It's convenient because it builds on the existing POSIX model where you already issue the I/O operation up front: you don't have to modify POSIX, it only changes what the POSIX functions return, so you don't have to change the APIs, and calls like read and write still work for files and sockets.
All that changes is that read and write can now return a different error code that you handle in an asynchronous setup. The other version, which I'll get to, and I'll explain why it's better, is completion-based: you start an operation and the system tells you when it's done, instead of you having to retry or wait to retry later. I'll go over the pros and cons of both in a second, but basically they're two ways to handle multiple I/O operations on the same thread, which lets you avoid using multiple threads and relying on the Linux scheduler for I/O concurrency. One caveat: non-blocking I/O can actually be slower than blocking I/O, which is something people don't really like to talk about. Everyone pushes for non-blocking I/O because they've seen Apache get destroyed in benchmarks, so everyone thinks non-blocking I/O is the future. Technically it is, but there are some cases where it isn't.
We'll get to that once we've gone over the pros and cons. The readiness-based APIs are probably the first version of non-blocking I/O you'll touch if you ever try writing high-scale servers: on Linux the popular syscall is epoll, and on BSD you have kqueue. They basically do the same thing: they let you poll sockets, which I'll explain a bit. Both are readiness-based, or reactive, approaches. You issue the POSIX calls as before: you try the I/O, it reports EAGAIN, and then these two systems tell you when to retry. The I/O calls can return EAGAIN or EWOULDBLOCK, which are basically the same value; I don't know why both exist, probably some POSIX specification detail. Then, optionally, you register the file descriptor for when you want to be notified. There are ways to avoid re-registering, and avoiding it is actually the most efficient, because re-registering the file descriptor with another syscall every time you get an EAGAIN doesn't scale. Either way, the APIs give you a way to poll for readiness when you want to retry multiple operations. The problem is that there can be multiple syscalls involved in doing a single read: it starts with the read itself, which gives you EAGAIN.
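The retry dance just described, and its syscall count, can be made concrete with a small sketch. Python surfaces EAGAIN/EWOULDBLOCK as `BlockingIOError`, and `selectors` is the epoll/kqueue wrapper; the comments count the underlying syscalls as the talk does.

```python
import selectors
import socket

# The readiness dance: try the read, get EAGAIN, register the fd,
# poll, then read again -- three or four syscalls where blocking
# I/O would have needed one.
sel = selectors.DefaultSelector()
a, b = socket.socketpair()
a.setblocking(False)

try:
    a.recv(4096)                   # syscall 1: read -> EAGAIN
    raised = False
except BlockingIOError:            # Python's surface for EAGAIN/EWOULDBLOCK
    raised = True
assert raised                      # nothing to read yet

sel.register(a, selectors.EVENT_READ)  # syscall 2: epoll_ctl / kevent
b.sendall(b"x")                        # the peer makes `a` readable
sel.select(timeout=1)                  # syscall 3: epoll_wait / kevent
data = a.recv(4096)                    # syscall 4: the retried read
print(data)

sel.unregister(a)
a.close()
b.close()
```

At low concurrency, a plain blocking `recv` would have done the same work in a single syscall, which is exactly why blocking I/O can win with only a handful of threads.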
Then you optionally register the descriptor, then you poll for readiness, then you try the read again, so it's three or four syscalls for something that would normally be one syscall in a blocking I/O approach. If you have low concurrency levels, maybe five or six threads just doing blocking I/O, that would actually be faster than the non-blocking approach, because you're making fewer syscalls overall and you're not oversubscribing the scheduler, so latency spikes don't matter because there wouldn't be any; there's no oversubscription. So epoll and kqueue are reactive, and they have this, I guess you'd call it a con, in their API design, which Windows solves, for some strange reason. One of the things Windows actually does really well is its I/O APIs: IOCP is proactive and completion-based, which means read and write take an extra argument called an OVERLAPPED, and this OVERLAPPED is a representation of your asynchronous request. You try the read or write and either it completes immediately, which is also the case with POSIX, or it tells you it will complete later. What's different about IOCP versus epoll is that when you get the results, it doesn't tell you what to retry; it tells you what the result values were, so you don't have to retry at all. That completely avoids the second syscall.
You still have two syscalls: one for the startup operation, and then you poll and it gives you the results when they're ready, but that's still better than three syscalls per operation, at least compared to Linux or any Unix-based system. There's one last inefficiency that Windows still has: you can't submit multiple I/O operations in the same syscall. For a read and a write, you still have to make separate syscalls for IOCP to work.
I was going to say that's technically not true, because you can get around it with some internal Windows things, but I don't want to defend that because I've seen people screw it up so many times. So, you have your server handling web requests, and you might want to make it useful by sending files over the network; otherwise what's the server for? So you want non-blocking file I/O, because if you already have non-blocking network I/O and one of your file operations blocks, that somehow defeats the whole purpose. Then you go to implement non-blocking file I/O, you search online, you try something, and it just doesn't work, for some reasons I know and some I just don't. Basically, the file system layers in multiple kernels simply aren't structured for asynchronous I/O, or have quirks that cause supposedly asynchronous I/O to become synchronous in some way. Take epoll and kqueue: file or disk fds basically always report ready-to-read or ready-to-write, so you can't use them to efficiently find out when to retry. On Windows, even though you have asynchronous I/O for files, the API can randomly block under circumstances that are actually documented by them, and I always have the link ready for people who come saying we should implement async file I/O in the standard library. It can block if the file is compressed, if it's encrypted, if the Windows page cache suddenly decides your file isn't cached and has to fetch it again, or for any other random driver detail. There is basically an equivalent of a shortcut: you can open your file in unbuffered mode so it doesn't go through the page cache. I forget the exact flag, but it's similar to O_DIRECT on Linux, where it goes directly to the disk, and yet it can still block: it's direct disk access, but it blocks for other strange reasons.
Linux also has something called AIO that people use and then stop using: it has an option to do non-blocking I/O, but it can also block randomly. Linus has multiple complaints online about why it shouldn't exist, and there's good reason for that. So basically, if you can't do non-blocking file I/O in the kernel, all you're left with is providing that API in user space, and the way people do that is with a thread pool: put the file I/O on a thread pool, let it do its thing and block over there, and your I/O thread goes back to its non-blocking loop. Now blocking I/O looks asynchronous from the outside.
One problem with this is that the work goes to another thread and comes back, so there's overhead. You can amortize it by writing a really good thread pool, which I have a blog post on, but you still pay at least for the handoff of the request going over and the completion coming back, and that can add up. libuv is a practical example: it has a function called uv_queue_work, and libuv keeps a dedicated thread pool specifically for blocking operations. It uses this for files, for DNS resolution, and for any operation that would normally block a thread and that it can't handle otherwise.

Tokio, which is a Rust framework, does this too: it has a separate thread pool just for executing blocking functions, configured separately from the main thread pool used for non-blocking work. Even Go does this to some extent, and it actually does it quite interestingly: Go has a background thread called sysmon which periodically monitors all the goroutines running on the system, and if it sees one of them blocking on a syscall, it moves the scheduler's resources to another thread, basically spawning a new thread automatically instead of explicitly like the other two. But it's still using a thread to do the blocking work. This is what io_uring was intended to solve, and why Jens Axboe saved Linux. io_uring is proactive, completion-based I/O, so it's similar to Windows: you start an operation and it tells you when it's complete, with the result, so it's efficient in that sense. It also uses ring buffers shared between user space and kernel space. I put some little animations here.
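The uv_queue_work / spawn_blocking pattern described above amounts to pushing the blocking call onto a dedicated pool and treating its future as a completion. A minimal sketch using Python's standard `concurrent.futures` (the pool size and helper name are arbitrary choices for the demo):

```python
import concurrent.futures
import os
import tempfile

# The user-space answer to blocking file I/O: hand the blocking
# syscall to a worker pool so the I/O thread never blocks on it.
# This mirrors what libuv's uv_queue_work and Tokio's blocking pool do.
pool = concurrent.futures.ThreadPoolExecutor(max_workers=4)

def blocking_read(path: str) -> bytes:
    with open(path, "rb") as f:   # this open/read may block in the kernel
        return f.read()

fd, path = tempfile.mkstemp()
os.write(fd, b"file contents")
os.close(fd)

future = pool.submit(blocking_read, path)  # handoff: request goes to a worker
data = future.result()                     # completion: result comes back
os.unlink(path)
pool.shutdown()
print(data)
```

The two crossings (`submit` going over, `result` coming back) are exactly the overhead the talk says you can amortize but never fully eliminate.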
I don't know if you can see them, but this means that to poll for I/O you don't have to make a syscall: it's just a ring buffer, you can check memory, so you can poll for I/O without a syscall. That eliminates one of the minimum of two syscalls Windows had and reduces it to one. And you can submit a batch of I/O, since it's a ring buffer with a size larger than one, so you can queue multiple operations, which solves the other problem Windows has. How it works is: there's a submission ring buffer of structures that basically describe syscalls, and then a completion ring buffer with the outputs of those syscalls. io_uring is just a method to schedule a bunch of syscalls to the operating system, make your call once, and then get all the results.

I think I clicked the wrong button. Okay, so, yeah, the syscall to drive this is called io_uring_enter, and it both submits and waits for completions at the same time. Jens was very smart with this API decision: you don't have to make two syscalls like Windows, nor three syscalls like classic Linux; you can simply submit and wait for completions in one go. Having a single syscall also enables other optimizations: for example, when you submit, the kernel can choose to complete a file operation inline, since it's the kernel and knows whether a file I/O will block or not, so it can complete it directly inline when you enter, or it falls back to a kernel thread pool, because file I/O still isn't asynchronous inside the kernel; io_uring just uses its own thread pool. That's even better than a user-space thread pool, because the kernel knows more about the state and can do more with it. So, bringing this back to Zig: we know io_uring is pretty efficient at scheduling I/O, so what would happen if we did something similar?
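The submission-queue / completion-queue shape just described can be modeled in a few lines. This is a toy model, not io_uring itself: the operation set is invented for illustration, and the single `enter()` function stands in for the one io_uring_enter syscall that both submits the batch and reaps results.

```python
from collections import deque
from dataclasses import dataclass

# Toy model of io_uring's two rings: user code queues operation
# descriptors, one "enter" call processes the whole batch, and
# results land on a completion queue. Real SQEs describe actual
# syscalls; these ops are made up for the sketch.
@dataclass
class Sqe:
    op: str
    data: bytes = b""

submission_q: deque = deque()
completion_q: deque = deque()

def enter() -> None:
    """Stand-in for io_uring_enter: submit AND reap in one call."""
    while submission_q:
        sqe = submission_q.popleft()
        if sqe.op == "nop":
            completion_q.append(("nop", 0))
        elif sqe.op == "write":
            # completion carries the result value, e.g. bytes written
            completion_q.append(("write", len(sqe.data)))

# Batch several operations, then make the single "syscall".
submission_q.append(Sqe("write", b"hello"))
submission_q.append(Sqe("write", b"world!"))
submission_q.append(Sqe("nop"))
enter()
results = list(completion_q)
print(results)
```

Three queued operations, one `enter()`, three completions with their result values: that batching is the property the talk wants to keep when porting the idea to other platforms.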
What if we made something like io_uring, getting its benefits, but in the standard library where it's compatible with all platforms? I mean, it seems like a good idea. One of the problems with io_uring's design is the ring buffer itself. It's efficient for io_uring, but when it comes to adapting it to other operating systems, it fails in some areas. One is that you can have more operations in flight than the size of the ring buffer: say your ring buffer size is 32, you submit 32 operations, and you want to submit another one. Either you have to allocate somewhere to queue the extra one, or, like io_uring does in flight, you block submission until there's room. That means you have a basically implicit allocation that is normally handled by the kernel but now has to be dealt with in user space, and the Zig community doesn't really like implicit allocation, so that goes against some of the goals. You also can't get the inline completion: if you imitate io_uring's API and have an enter operation, the kernel's trick of completing some of the I/O operations inline isn't efficient in user space, because you'd still make a syscall for each of the I/O operations, so doing it inline saves you virtually nothing. So most of the benefits of io_uring are somewhat tied to the ring buffer, which doesn't port well if you want it to work on Mac or Windows. Instead of a ring buffer, you can use an intrusive linked list of operations: this way the user provides the memory and decides what to do with the result, so they control allocation, and you can still batch-schedule efficiently.
This works for Mac and Windows, and this system is actually used in TigerBeetle right now. I started a year ago working on porting TigerBeetle, because at that time it only supported Linux io_uring, and I was tasked with supporting other systems like Mac and Windows. I got TigerBeetle running on macOS using the linked-list version, and then we ported it to Windows, and now we have a cross-platform io_uring-style implementation, and TigerBeetle just uses that for efficient disk I/O submission. Jarred made a pretty smart decision and just took the code, copied it, and put it in Bun, but he made the strange decision to copy everything except the Windows part, so Bun still doesn't support Windows for whatever reason. I don't know why, but yeah, those are the two projects I know of that are written in Zig and use our cross-platform io_uring-style API.
We have this stuff already being used in practice, so it seems like a good idea to put it in the standard library. Should we just drop it in? It's a difficult question, because we still have the Zig event loop. The Zig event loop is basically our take on how to make I/O asynchronous, and the model was pretty much copied from existing systems like Go or Tokio: you have multiple threads and a background runtime that handles I/O, your tasks get submitted to something like a thread pool, and they don't block on I/O. That doesn't fit well with the io_uring approach, because io_uring's ring buffer isn't synchronized, so it doesn't work with multithreading as-is.

To make it multithreaded you'd probably just wrap it in a mutex, but that doesn't scale, and it defeats the purpose of having an event loop that's also a thread pool. You could instead use an io_uring per thread, which is what's typically referred to as thread-per-core. This is actually quite popular: there are blog posts from Redpanda, a database that does this, and I have benchmarks of this approach versus the event loop. It's actually quite efficient because there's virtually no scheduling overhead: in a thread pool, all threads typically have to compete with each other to grab the same task or different tasks, everything contends on the same shared resource, while in a thread-per-core architecture everything has its own queue and there's no synchronization overhead. This gives the highest throughput you can get, higher than a shared-runtime thread pool or any other custom architecture for I/O. It's also what nginx uses.
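The thread-per-core contrast just drawn can be sketched in miniature: each worker owns a private queue and private results, so nothing on the hot path is shared between threads. This is an illustrative sketch (two workers instead of real per-CPU pinning; doubling a number stands in for handling a request); work is sharded up front instead of contended for.

```python
import queue
import threading

# Thread-per-core in miniature: each worker owns its queue and its
# results list, so there is no cross-thread contention on the hot
# path -- the property nginx/Redpanda-style architectures exploit.
NUM_CORES = 2  # illustrative; real systems pin one thread per CPU core

def worker(jobs: queue.Queue, results: list) -> None:
    while (job := jobs.get()) is not None:  # None is the shutdown sentinel
        results.append(job * 2)             # "handle" the request locally

per_core_jobs = [queue.Queue() for _ in range(NUM_CORES)]
per_core_results = [[] for _ in range(NUM_CORES)]
threads = [
    threading.Thread(target=worker, args=(per_core_jobs[i], per_core_results[i]))
    for i in range(NUM_CORES)
]
for t in threads:
    t.start()

for n in range(6):
    per_core_jobs[n % NUM_CORES].put(n)  # shard work; never a shared queue
for jobs in per_core_jobs:
    jobs.put(None)
for t in threads:
    t.join()

combined = sorted(per_core_results[0] + per_core_results[1])
print(combined)
```

Compare this with a shared work queue: here no lock is ever contended by both workers, which is why the scheduling overhead is "virtually none" as the talk puts it.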
You probably know nginx dominates almost all web servers on the web right now. H2O is another web server, less popular, but really specialized for high I/O performance. There's also HAProxy, which is basically a TCP proxy and load balancer for Linux that sits in front of things like nginx. The catch with these is that they're usually paired with Linux-specific APIs: nginx and H2O have schemes where they can restart the server by binding the port the server uses to multiple sockets and transitioning that way, but not all operating systems have that.
Supporting that kind of thing in a thread-per-core setup in the standard library's event loop doesn't quite make sense, so maybe we should get rid of the event loop, because io_uring seems to be the future of I/O: it solves the batching problems, it solves the multiple-syscall problems, Jens is working on it and optimizing it practically every day, and it keeps up with the hardware to be as fast as possible. Since the current event loop is a thread pool, it doesn't map efficiently onto io_uring, so one question is whether we just throw it out. Before we do that, let's look at the event loop programming iceberg, because that exists. At the top you have single-threaded languages: this is where everyone just slides in, you're single-threaded, you can use any system you want for I/O, it works, you don't have to worry. We're not that. The next level is actor systems: these can use non-blocking I/O and they have multithreading, but a per-thread io_uring is somewhat off the table for them too, and they don't like sharing memory, at least in the programming model. We're not that either, because we have shared memory. The next part of the iceberg is the runtimes with shared memory that are also thread pools and do non-blocking I/O, and this is basically just the Rust ecosystem and the Go ecosystem; there are probably others, but those are the most famous.
The Tokio runtime is basically a port of Go's runtime with its own specific additions; it uses the same scheduling concepts and all that. The last part of the iceberg, probably something most people won't touch, is DPDK, which is like a custom driver setup for doing I/O on Linux, and also the Seastar framework. Seastar is a thread-per-core architecture that ScyllaDB uses, and ScyllaDB is a database that is tuned to the limit: if you could write the most efficient database in the world, it would probably look like ScyllaDB, because of how it takes over the entire system and assumes the architecture.

It targets just Linux and tries everything to be fast: it implements its own TCP stack, its own drivers and I/O interactions. This is the lowest level of the iceberg you can get to, short of custom FPGAs or whatever, but we're not that either. The current Zig event loop is in the Rust-and-Go section, but what if we could do something in the middle, something closer to the Seastar approach, but not quite the multi-threaded runtime we had before? One idea is to still have the cross-platform io_uring-style API, but also have a thread pool on the side, so users are more explicit about scheduling: the user decides whether they want a per-thread ring, an io_uring plus a thread pool, or no thread pool at all to avoid the overall cost of threads, things like that. We wouldn't have a global runtime either, because again, the user decides what gets executed where. Then we could finally remove io_mode, which I would like to remove because it causes a lot of confusion, like the "what color is your function" blog post about sync and async.
We could also remove the event loop from the standard library. So let's try applying this to existing projects. One of them, obviously, is an HTTP server. An HTTP server doesn't really care much about CPU throughput; it's more about servicing each request in the shortest time possible, so it's very latency-sensitive. In that case you'd probably use multiple per-thread rings, because they have the lowest scheduling overhead and the highest I/O throughput. Now say you want to do distributed computing. This is generally what people do if they have specialized workloads: lithdew actually wrote a blockchain, and it uses something like this. It has one I/O thread and then multiple CPU threads to do all the hashing, and this is generally the pattern people use to replace the event loop. The nice thing is that if you're CPU-bound, the thread pool can handle that, while the I/O part handles all the I/O efficiently even if you're not I/O-bound.

So we've covered the I/O-bound case, which is the HTTP server, and the CPU-bound case, which is distributed computing. If you're neither CPU-bound nor I/O-bound, you can just use a single ring. This is probably for people who don't care about performance, or who are just writing simple scripts that do I/O and don't want all the other scheduling decisions made for them. And that's it for my talk. I was going to make this talk about Zig async, but I decided to put that in a separate blog post, because there's a lot to cover and it's very hard to convince people of what's right, since people have different goals for what async is supposed to do. If you want to know more about the discussion on removing the event loop and what the current goal is, or if you want to argue for preserving it, you can go to the GitHub issue I've been pushing for a few months and also neglecting for a few months. For existing implementations of this, you can look at the TigerBeetle source code, lithdew's API, and my API.
The lithdew repository also has a thread pool implementation, if you want to just copy that. Bun also copied my thread pool, used it, had problems, and then modified it, so that's fine. And yes, that's it, thank you very much. [Host] Thank you very much, King. Okay, any questions? We have some time. I see a hand, maybe two. Okay. [Audience] Thanks, excellent talk. I was wondering if you've looked into something called the LMAX Disruptor, a kind of approach to concurrency. It started in the financial applications space, and the idea was to avoid locks and syscalls altogether and work with the hardware caches.
What you're doing with the ring sounds like maybe you're bringing some of that back. Any thoughts on that? [King] I've played with LMAX. Its idea is to have queues between multiple subsystems. It's similar to the idea of libdispatch, Apple's GCD: their own framework for scheduling in a way that really benefits the specific user-space domain of macOS. The idea of having queues like that is actually quite efficient; it's just that most people don't think about concurrency that way, so it's hard to put it in a standard library or somewhere people can reach for it as a tool. It probably exists in other languages, but there's no particular LMAX Disruptor here. Is this going to be controversial?

There are parts of it that aren't as efficient at that scale, even though it's highly optimized. I found this out through some benchmarks, specifically around the MPMC queues: those can be optimized by using Vyukov's MPMC queue, which I think is used in Rust's Tokio and also for Erlang's message queues. But LMAX's idea of using queues for things is pretty efficient; I just wouldn't use their specific implementation. [Host] Okay, next question. [Audience] So about io_uring: on Windows there is no io_uring, that is, no way for user space and the kernel to make calls to each other with the minimum amount of overhead. How do you do that on Windows? [King] Windows 11 has its own port of io_uring, but one, I don't use Windows 11, and two, it only supports file reading, so it's very simplistic right now. The only system that has something like io_uring fully supported by the kernel, without external drivers, is Linux. So what we do is basically make an io_uring-like API that uses queues underneath: for Mac it uses kqueue underneath, and for Windows
it still uses IOCP, so it's not as efficient. [Host] Any other questions? I don't see a hand, so I think that's it. Oh, there is a question, okay. [Audience] So we didn't see any benchmark numbers for the HTTP server you mentioned. [King] One thing I started doing, especially with my online publishing, is avoiding showing benchmark numbers, because people always draw the wrong conclusions from them, specifically with HTTP servers. I didn't want to show how fast it could be. I had an idea for showing it, but then I was too lazy to implement it and didn't.

I think it would be compelling, but instead I just talked about how to achieve it rather than what it looks like in practice. I don't know if that answers your question. [Audience] Yes.
