
Why Distributed Systems Are Hard

Jun 11, 2021
Hello, thanks for inviting me. My name is Denise, and this is the talk on why distributed systems are so hard. If you're in this room by accident, you have a few more minutes to leave and find the talk you actually want. I'm pleased to see you all, and I really want to get started. Oh, we're actually a few minutes early. Okay, that's how it is. Anyway, I'm getting dangerously close to time, so I'll just get started.

So I actually want to start by launching into a meta-discussion about why we should bother learning about distributed computing in the year 2020.
After all, we've all been using the cloud for a couple of years now, and many of us are using open-source cloud orchestration tools like Kubernetes to manage our services. So the question remains: why worry about the theory and history of distributed systems in 2020, when we arguably have tools that can do our jobs better than we can? After all, we've known for a long time how to run services, haven't we? And if microservices are just smaller versions of services, then why go for this type of architecture at all? Why reach for smaller, decoupled, decomposed services?

I generally think that microservices architectures are motivated by a set of socio-technical goals. Among those goals, you may want to make the boundaries of your engineering teams reflect the boundaries of your business domains. For example, if you work for an e-commerce company, instead of having all your front-end people over there and all your DBAs over here, you could create cross-functional, integrated teams: maybe some of them work on payments, maybe some of them work on search, and so on. Why would you want to do this? Well, if you can decouple your engineering teams to resemble the boundaries of your business, maybe that means you allow these teams to release and deploy more frequently and more independently, so that the payments team doesn't get blocked on a feature that the search team has to merge into trunk before they can ship successfully. So: continuous delivery, higher-frequency, faster deploys. These are all good things, but why bother with it at all?
I think achieving these goals supports the overall goal that we ultimately want: to improve the resilience of the critical systems we need to run our businesses. We want to protect these systems against catastrophic and spectacular failures. There are a lot of pictures of cats in this deck, you have been warned. Because, after all, we've had microservices for a couple of years now, and we know that if we're doing microservices right, we can make a change to one microservice without having to change any other, right? Okay, I'm just going to take that as a yes. But I'm not going to stand here preaching to the converted.
I'm not going to spend the rest of this talk reciting a list of why microservices can be good; they can be useful in the right context. I'm going to assume that everyone is here because they've at least heard of microservices before. Maybe you've used them, maybe you're here because you have a healthy skepticism towards them; that's fine. But the thing is, every team that runs microservices is now at the helm of a large distributed system, and of course its size will depend on the specifics of your business. It's almost like a law that any talk about microservices needs a slide showing someone else's architecture, and this is my slide for that. Monzo is a bank; great company, fantastic people, love them to death. Monzo reached 1,500 microservices at the end of last year, and this diagram shows all the network rules that govern how each service communicates with every other service in the system. Even if your architecture doesn't look like that yet, the truth is that the specific technologies we use to manage, deploy, and orchestrate these systems will come and go, and honestly, technology trends are really fickle, just like my cat's eating habits. But the fundamental principles for designing and operating services over distributed systems haven't changed much in the last few decades, and understanding the fundamentals, taking some time to learn these things and internalize them, is the best way to future-proof the systems we're building today and beyond.
I honestly think learning about distributed systems is really rewarding and a lot of fun, so I hope that in the next 45 minutes and 35 seconds you'll at least have fun. I hope you'll laugh at least once, and I hope you enjoy the cat photos. I found it very fun to learn about these things, and I hope you will too. By way of introduction, my name is Denise, that's my Twitter, and that's the wrong hashtag; it's actually qconlondon. For accessibility reasons I like to upload my slides online before my talk starts, so you can access all 100 of them right now at deniseyu.io/qcon. You have my permission to take photos and live-tweet, whatever you want to do, and you can use the higher-resolution images that are there instead of the ones you're taking with your phone. I'm a senior engineer at GitHub; I work on the Community and Safety team. In general, my team builds tools to make GitHub a more productive and safer place to build communities.
I haven't actually started yet, so if you have specific questions about the team, I'll direct you to my future colleagues. I'm based in Toronto. Any other Canadians here? There's like four of us, yeah. I'm not actually Canadian, I'm American, but I lived here for five years, so it's really fun to be back. And finally, when I'm not on stage rambling about cats and distributed systems, I spend my time making art about technology that bridges gaps in understanding, so feel free to check it out later. Many of the images in this talk are taken from a book that I co-authored with my friend Steve Smith, so if you have nephews, nieces, or children in your life who are going to write enterprise software, consider buying them my book. So, let's go on a journey together today.
Generally speaking, we'll cover a few important topic areas. We'll talk about why distributed computing is a thing in the first place; we'll summarize the theory of distributed systems, so if you've studied it before you already know this, and if not, don't worry, by the end you will. We'll talk a lot about networks and partial failures, and then we'll close by talking about the human side of things, that is, the socio-technical mitigations, and how the human part of the system is the most adaptable but possibly the most complex part. So, history lesson time: how did all of this happen?
How did we get to where we are today? A long time ago, in a data center not too far away (maybe some of you know from your own experience where this data center is located), all of your business applications tended to be structured this way: a client-server architecture with multiple applications all reading and writing from the same database, and that database was probably hosted on some machine sitting in a windowless server room in the basement of your building. This worked for quite a while. It worked because it was a kind of necessary evil; it was a cost center that we needed to support so that other people could go and do the actual operations that made the money. But this stopped being true sometime in the 90s, when computers stopped being just a cost center for many companies.
I mean, there's that famous Marc Andreessen quote about how software is eating the world, but at some point in the 90s computers actually started to become business differentiators and competitive advantages, which meant we had to start taking our roles much more seriously. And of course the core value for many companies then, and for many companies today, is customer data. So as an industry we have always needed better ways of reading and writing, of storing and retrieving, our customers' data, because it became a differentiator. The way we store and retrieve our data has of course evolved.
I don't want to say the old way doesn't work for anyone anymore, but for most businesses it's no longer enough to have one massive database on a server in the basement. One driver here is that data-driven business analysis has become more important: business analysts and product managers want to run expensive SQL queries so they can make informed business decisions. Additionally, things like machine learning, artificial intelligence, and natural language processing have created a new set of requirements for how we interact with our data. And the truth is, we simply have more data than we have ever had before in human history, and that statement will always be true. So we like to have middle layers, like key-value caches and things like that, which help us speed up data retrieval in many different circumstances, but this means our data is more distributed than it used to be. The first thing we did was scale vertically, which simply means adding more computing power to the machines you already have, and this worked for a while, until it didn't. At some point it no longer made financial sense for many companies to add that last one percent; it's a kind of unit economics, where the marginal cost was greater than the marginal value it gave the business. But even if you had a lot of money and said, "I want all the CPU, just give me all of it," at some point you will actually hit the limits of hardware engineering. What do I mean by that?
The physical limits of hardware engineering: Moore's Law is the most common way to think about this. Moore's Law states that approximately every two years the number of transistors on a chip doubles, which basically means the processing power doubles, much like the size of my kitten in the first two years of her life. I know Moore's Law is becoming less and less true these days, but historically speaking, it has been a very good way to track the increase in processing power. Luckily for us, in the early 2000s cloud computing solutions appeared. Of course we know and love Amazon Web Services, Google Cloud, and Microsoft Azure, but you also have the option to keep your data on-premises: if you know you need to store your data on-prem, there are solutions you can run on your existing hardware (I think vSphere from VMware is one of the most well-known options, though not the only one) that let you take your own hardware and make it behave like a cloud. Cloud computing gave us an easy way to provision new machines on demand, just in time, exactly when you need to scale, which means we're no longer constrained by the limits of vertical scaling, because now we can do this thing called horizontal scaling: you can take a workload and distribute it across multiple machines. So why might teams want to take advantage of this ability to distribute horizontally?
The first reason is scalability. Sometimes you have a machine that can't handle the volume of data you want to store, or perhaps the request sizes are too large, so one solution is to split that data into multiple chunks using some index, like an encyclopedia. You don't have an encyclopedia that's one four-foot-long book; encyclopedias are an example of real-life sharding, where you divide them into several volumes by first letter. Another reason is availability. If you're operating with multiple machines, you get the ability to keep multiple copies of your data in different locations, so by having your data spread across more than one machine we build redundancy into our systems. And the final reason is latency: if you can store your data physically closer to where your end users will request it, the data has to travel along less cable, and request times will be faster.
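To make that encyclopedia-style sharding a bit more concrete, here's a minimal sketch in Go. The shard count and the choice of hashing the whole key (rather than taking the first letter) are my own illustrative assumptions, not something from the talk.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// shardFor picks which of n shards should hold a given key,
// much like picking an encyclopedia volume by first letter,
// except we hash the whole key so data spreads out more evenly.
func shardFor(key string, numShards int) int {
	h := fnv.New32a()
	h.Write([]byte(key))
	return int(h.Sum32() % uint32(numShards))
}

func main() {
	for _, key := range []string{"aardvark", "badger", "cat"} {
		fmt.Printf("%q -> shard %d of 4\n", key, shardFor(key, 4))
	}
}
```

The trade-off, as with the encyclopedia, is that once data is split across volumes, any question that spans several keys now has to touch several shards.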
So now I want to spend a little time talking about modern distributed systems. You may have come across the term shared-nothing architecture before; it's okay if you haven't. It's the most popular form of networked computing, and it basically means that different processes on the same machine do not share access to physical resources: they don't share the same memory, they don't share the CPU, they don't share devices. Honestly, it's a really reasonable and sensible design, even if it makes life a little harder for lazy programmers like me who say, "oh, just give me the pointer" all the time. No, no, that's really bad. In fact, it's so sensible that the idea of process-based memory isolation is built into some programming languages by default. In the Go programming language, for example, Rob Pike, one of the biggest contributors to the language, said "don't communicate by sharing memory; share memory by communicating," which basically means: have your processes send messages to each other, and don't let them arbitrarily read and write to the same memory locations that other processes are currently accessing.
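As a small illustration of that "share memory by communicating" idea, here's a toy Go sketch (my own example, not from the talk) where workers receive jobs over a channel instead of touching a shared data structure directly.

```go
package main

import (
	"fmt"
	"sync"
)

func main() {
	jobs := make(chan int)
	results := make(chan int)
	var wg sync.WaitGroup

	// Each worker only sees values passed to it over the channel;
	// no two goroutines read or write the same memory directly.
	for w := 0; w < 3; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for j := range jobs {
				results <- j * j
			}
		}()
	}

	go func() {
		for i := 1; i <= 5; i++ {
			jobs <- i
		}
		close(jobs)
		wg.Wait()
		close(results)
	}()

	for r := range results {
		fmt.Println("result:", r)
	}
}
```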
So let's zoom out for a second: what does it really mean to run a distributed system today? I think it's pretty clear to most people in this room, whether through intuition or through things you've run or read about in the past, that building and operating a distributed system is a fundamentally different game from building a system where everything is function calls on the same host, in the same process space. But this intuition and this perception were not always so obvious.
One of the first discussions of how distributed computing is fundamentally different is the classic 1994 paper "A Note on Distributed Computing" by Jim Waldo, Geoff Wyant, Ann Wollrath, and Sam Kendall, who all worked together at Sun Microsystems. This paper is really interesting, a real blast from the past, because in 1994 no one was building distributed systems at the scale we see and take for granted today. A lot of people at the time were hand-waving and theorizing about what these systems were going to be like, and some were asking whether hardware engineering would simply solve these problems for us by the time we needed to build such systems (spoiler: it never did). The paper is really worth reading in its entirety if you're interested in this sort of thing, but I'll try to summarize it briefly here. In this paper they identify three reasons why distributed computing is a fundamentally different beast from local computing.
The first is latency, meaning the difference between processor speed and network speed; the second is memory access, this idea of accessing pointers and memory locations; and the third is partial failure. Of these three, I would say memory access is the one that turned out not to be a big obstacle, because remember our earlier discussion about shared-nothing architecture. Latency has been addressed a little bit too: by being able to replicate data closer to where it's needed, you can reduce that delta somewhat, though it's not a totally solved problem. And partial failure is what we're going to dive into more deeply throughout this talk. According to Martin Kleppmann, a good summary of these challenges is that modern distributed systems look like this: lots of different machines running different processes, with only message passing over unreliable networks with variable delays, and the system may suffer from a range of partial failures, unreliable clocks, and process pauses. So it is really difficult to reason about distributed computing.
We've known this since the early 90s; we've probably always known it, since it was on the radar of people building networked systems back then. A different group of people at Sun Microsystems came up with the fallacies of distributed computing: originally there were seven, and James Gosling, the Java guy, added the eighth later. To run through them quickly: we know today that the network is not reliable, that latency is not zero, that bandwidth is not infinite, that the network cannot be assumed to be secure, that the topology must be assumed to change, that there is not always one administrator (sometimes there is no administrator, and sometimes, if you don't protect your systems, everyone is the administrator), that transport cost is not zero, and, these days with more and more different devices connecting to the internet, that the network is not homogeneous. Each of these is worth delving into, but I have limited time in this talk and I'm already dangerously close to it. When I was learning about distributed computing, I felt like every new article I read fenced off a whole category of things I wasn't supposed to believe and wasn't supposed to assume were true, and I felt on multiple occasions like, wow, there is so much unreliability, there are so many things that are only conditionally true if other very specific things are true. So how can we know what is true about the state of the world when we start learning about and building distributed systems? This really took me back: I studied philosophy in undergrad, and I never thought I'd be talking about my philosophy degree at a tech conference, but here we are. I actually thought, this sounds like a problem of epistemology, and epistemology is the philosophical study of knowledge.
It's a branch of philosophy that asks: we think we know some things, but how do I really know I'm in London, speaking at QCon, standing on a stage? How do I know this isn't just an illusion? Generally speaking, within epistemology there are two main schools. Foundationalism says there are fundamental truths about the universe, like first principles in mathematics, and everything else is built on top of them. Coherentism, the second school, says that nothing is absolutely true on its own, but when we have enough intertwined, mutually reinforcing truths, they support each other, even though the individual pieces can't stand alone. In distributed systems reasoning, regardless of which school of epistemology you subscribe to, it's quite difficult to even begin to define the basic building blocks of truth. And of course, what if we're all just brains in vats and nothing is real? By the way, the skeptics (you won't find this in philosophy textbooks) were actually the world's first internet trolls. Fun fact. So, setting that aside, let's go back to message passing for a second. Unreliable message delivery is definitely one thing, and then there's another thing in the category of unreliability, of uncertainty: the classic case people cite is the Byzantine generals problem. The thought experiment works like this: imagine two generals.
They're trying to coordinate a war, but they can't communicate directly, so they rely on an unreliable little messenger to carry messages between them. But the whole time, they can't know for sure whether a message really came from the other general, whether it was tampered with in flight, or whether it was delivered somewhere completely wrong. This sounds a bit silly, but the principle applies all the time when you build distributed systems. Of course we have some technical mitigations: we have tools to check the validity of the sender, for example. But we always have to be thinking about things like spoofing, message deletion, and messages getting corrupted in flight. We can mitigate a lot of this by doing a good job of monitoring, observing, and tracing our systems, but this is not a talk about any of those things today.
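Just to make that "check the validity of the sender" idea concrete, here's a minimal sketch using an HMAC over the message body; the shared key and the message format are my own assumptions for illustration, not something from the talk.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"fmt"
)

// sign computes an HMAC tag over the message using a key shared
// between the two generals, so the receiver can detect tampering
// or messages forged by someone who doesn't hold the key.
func sign(key, message []byte) []byte {
	mac := hmac.New(sha256.New, key)
	mac.Write(message)
	return mac.Sum(nil)
}

// verify recomputes the tag and compares it in constant time.
func verify(key, message, tag []byte) bool {
	return hmac.Equal(sign(key, message), tag)
}

func main() {
	key := []byte("shared-secret")
	msg := []byte("attack at dawn")
	tag := sign(key, msg)

	fmt.Println("genuine message ok?", verify(key, msg, tag))
	fmt.Println("tampered message ok?", verify(key, []byte("retreat at dawn"), tag))
}
```

Note that this only addresses forgery and tampering; it does nothing about messages being dropped, delayed, or delivered to the wrong place.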
I can refer you to other people who can talk more about those topics. So, there are a lot of things we're just never going to be able to know, but we can be sure of one thing: stuff is going to fail. Which brings us to the next chapter of this talk, the CAP theorem. The CAP theorem debuted in the year 2000, when Dr. Eric Brewer, who is now at Google, gave a keynote called "Towards Robust Distributed Systems" at the Principles of Distributed Computing conference. A lot of people on the internet like to write about the CAP theorem as if there are three things here and you can pick two, you can just throw out the third one. But that doesn't really make sense; actually, it's wrong, it's not possible to design a distributed system that way. There are many different alternative frameworks, and CAP is not the only way to think about trade-offs in distributed systems design, but if you use it at all, you should at least think about it this way: partition tolerance is the constant, and consistency and availability are what we trade off against each other, because "sacrificing partition tolerance" is a statement that doesn't even make sense.
I'll talk a lot more about this soon, but even if you're running a distributed system within a single data center, even on a single physical host, you can't be 100% immunized against network partitions or partition events. Literally the only way to avoid the possibility of a partition is to have only one node, at which point you are definitely not running a distributed system. Brewer himself talked about this in an update to CAP twelve years later, in 2012, where he acknowledged that, as some researchers point out, exactly what it means to forfeit P is unclear. I realize I haven't gone through every letter yet, so let's dig a little deeper into each part of CAP. C is for... does anyone know? C is for linearizability. You're all wrong; it's funny because it doesn't even start with C. So what does this mean?
So, linearizability is a super narrow form of consistency. The word you'll find in the literature is "register": you have an update operation on a register, which you can think of as a row in a database that can only hold one value at any given time, right? Imagine you have a register update where, at time zero, which is before time one, the cat's state changes from hungry to full. Linearizability means that as soon as a single client sees that the cat is full, from then on every node in the cluster has to return that the cat is full; you can no longer show any client that the cat is hungry. This is actually very difficult, ridiculously difficult, probably impossible in the strictest sense, because it basically requires instantaneous, universal replication. But replication time can't really be zero: we know there will always be a delay in replication, with the speed of light as your upper bound, the time it takes for a pulse to travel along some fiber-optic cable.
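One way to picture that rule is as a check over the reads clients observe. Here's a toy Go sketch (entirely my own illustration, not from the talk) that flags a violation if any client still sees "hungry" after some client has already seen "full".

```go
package main

import "fmt"

// A read observed by some client, listed in real-time order.
type read struct {
	client string
	value  string // "hungry" or "full"
}

// violatesLinearizability reports whether, after any client has
// observed "full", a later read (by any client) still returned
// "hungry", which the single-register model forbids.
func violatesLinearizability(reads []read) bool {
	seenFull := false
	for _, r := range reads {
		if r.value == "full" {
			seenFull = true
		} else if seenFull && r.value == "hungry" {
			return true
		}
	}
	return false
}

func main() {
	history := []read{
		{"client-1", "hungry"},
		{"client-2", "full"},
		{"client-3", "hungry"}, // stale read after someone saw "full"
	}
	fmt.Println("violation?", violatesLinearizability(history))
}
```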
But what can we do about it? Well, people who work on databases of course spend a lot of time and energy trying to reduce replication delay, though there are other trade-offs they have to make along the way. The final note I want to make here is that eventual consistency doesn't count: asynchronous replication in the background is not what the CAP formulation is talking about. There are many different ways of defining consistency, and if you haven't seen this blog post from Kyle Kingsbury on the jepsen.io blog before,
I really recommend you check it out; it maps out all the different definitions of consistency that we casually use and what each of them logically entails. It's amazing how many different ways there are to define consistency, a word we think we understand. The takeaway I want you to leave with is that consistency is not a binary state. I don't quite want to say it's a spectrum, because that implies something continuous, but it is a matter of degrees: there are many different degrees of consistency, so we have to be very deliberate and careful about which one we actually need to design for. The next letter, A, is for availability, which refers to the ability of clients to update data when they're connected to any node. We tend to think of availability as a binary state too, but again the reality is much more complicated because of this wonderful thing called latency. If you issue an update, for example, and you don't get a response for a long time, is that due to latency, or is it because the node is down and your request can't be processed at that moment, or is there some outage?
Latency wasn't part of the original CAP formulation, but it has some really important impacts on detecting and responding to network partitions. As a real-world example, everyone has a friend who is chronically late: how long do we wait before we give up on our dinner reservation? One way to deal with this, both in real life and when you're building systems, is to set a timeout, so you can say "I'm only going to wait ten minutes for my friend." In a computer system, though, determining what constitutes a reasonable wait time is really complicated, and the first time you set up a system you don't have historical data, so you might as well roll some dice and say "oh yeah, 12 seconds sounds good." Of course, tracking latency over time is a good way to learn what's normal, and you may be lucky enough to choose software that can learn for itself what a reasonable timeout is.
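Here's a minimal sketch of that timeout idea in Go, using a context deadline around a request; the durations and helper names are mine, purely for illustration.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// waitForFriend simulates a dependency that takes a while to respond.
func waitForFriend(ctx context.Context, arrivesIn time.Duration) error {
	select {
	case <-time.After(arrivesIn):
		return nil // they showed up
	case <-ctx.Done():
		return ctx.Err() // we gave up waiting
	}
}

func main() {
	// We decide up front how long we're willing to wait.
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()

	err := waitForFriend(ctx, 5*time.Second)
	if errors.Is(err, context.DeadlineExceeded) {
		fmt.Println("timed out: is the friend late, or not coming at all? We can't tell.")
		return
	}
	fmt.Println("friend arrived")
}
```

The awkward part the talk is pointing at is exactly that timeout branch: a timeout alone can't distinguish a slow node from a dead one.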
You may be lucky enough to choose software that can learn on its own what a reasonable wait time is over time and the final letter. p is for partition tolerance, a partition in this case refers to a network partition. I know sometimes this word is used to refer to individual volumes like encyclopedias when talking about database fragmentation, that is not the sense in which I mean partitioning today, I mean network failure or a partition event partitioned network, there are many different terms for this, it basically means that when you have an event that breaks the connectivity between two nodes of your system that are running in the same data center, different data centers during a partition event Yours Notes could also be on different sides of a wormhole, as if there was no way to know what was happening on the other side.
You don't know whether the other side is responding to health checks; you don't know whether it's still reading, writing, and processing client requests. So, to quickly recap: C is for consistency, A is for availability with an asterisk, and P is for partition tolerance. The proof of the CAP theorem is actually quite simple. This isn't a rigorous mathematical proof or anything, but intuitively we can reason as follows: imagine a partition event happens in a cluster of two nodes (which, by the way, is not a good way to design a cluster), with three different clients connected, one on the right side and two on the left. A partition occurs, and now you have only two logical ways to respond. Either you let clients continue reading and writing on both sides of the partition, but then you necessarily sacrifice linearizability, because if an update happens on the green side of the partition, the pink client will never see it; or you pause one side of the cluster until the partition event is over, which necessarily sacrifices availability, because the side that's paused can't take updates.
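That binary choice during a partition can be sketched as a little policy flag in code. Here's a toy Go version (my own framing; the field and function names are invented for illustration) where a node that can't reach its peer either serves a possibly stale local value or refuses to answer.

```go
package main

import (
	"errors"
	"fmt"
)

type node struct {
	localValue   string // last value this node knows about
	peerHealthy  bool   // can we reach the other side of the cluster?
	preferStrict bool   // true: favor consistency; false: favor availability
}

var errUnavailable = errors.New("refusing to answer during partition")

// read shows the two logical options from the talk: during a partition,
// either keep answering (and risk returning stale data), or pause and
// sacrifice availability until connectivity comes back.
func (n *node) read() (string, error) {
	if n.peerHealthy {
		return n.localValue, nil
	}
	if n.preferStrict {
		return "", errUnavailable // consistency over availability
	}
	return n.localValue, nil // availability over consistency (may be stale)
}

func main() {
	partitioned := node{localValue: "cat is hungry", peerHealthy: false}

	partitioned.preferStrict = true
	if _, err := partitioned.read(); err != nil {
		fmt.Println("strict node:", err)
	}

	partitioned.preferStrict = false
	v, _ := partitioned.read()
	fmt.Println("lenient node serves:", v)
}
```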
So let's dig a little deeper into partition tolerance. What is this idea? Why can't I just throw it away? The reason is that network partitions, partition events, are inevitable. How inevitable? Let me pick a small startup that probably knows a little bit about this space: Google. In the first year of a Google cluster's life, it will experience around five rack failures, three router failures, eight network maintenances, and a host of other hardware-related issues that Jeff Dean has written about extensively. Why, though? Why can't big, highly successful companies with a lot of smart people just avoid hardware failures? Well, hardware will simply fail; you can't design 100% indestructible hardware. Maybe the hardware that holds your routers together fails mysteriously, I don't know.
Network cables eventually wear out, and in fact, apparently sharks sometimes mistake the cables under the sea for fish because of the small electrical pulses and try to eat them. But that's okay, because some Ars Technica journalists want you to know that as of 2015, sharks are no longer a threat to undersea internet cables, because Google and Facebook wrap them in Kevlar before submerging them. The second category of reasons: anyone who has written software probably knows that sometimes the software we write does things we don't expect, and some of those unexpected things result in events that look or feel a lot like network outages. On multi-tenant servers, which is almost always what you get in public clouds, we don't
have perfect resource isolation, which means that other tenants may burst CPU or memory a little as they need it. That's actually a good thing in general, but it can result in this strange situation where, if a process that's part of your system is running on that same host and everyone else is bursting, your process may slow down, and it may look like there's a partition event, like there's an outage. Some languages have stop-the-world garbage collection: if we run low on memory, we have to pause everything and reclaim some resources before we can continue, and this suspends everything, which also makes it look like a node is down. And network failures just happen, randomly.
This slide isn't really an illustration of the principle; it's just Ralph from Ralph Breaks the Internet, which is a great movie, I recommend you watch it, it's about the internet. And also, sometimes people fail, sometimes people do bad things. In April 2009 a person took a giant, I think it was an axe actually, I don't know why I drew scissors, went down into a sewer, and cut a bunch of fiber-optic cables in San Jose, so a lot of people and a lot of data centers around San Jose were on the wrong side of a network partition for a while. So I hope you're convinced by now, whether from this talk or your own experience, that running distributed systems is really, really difficult. Peter Alvaro, a professor at UC Santa Cruz, asked his students:
think about what the hardest thing about distributed systems is, if you had to pick one word. One student says "uncertainty." Yeah, very good, uncertainty. Then another student raises his hand and says "Docker," and the first student says, "actually, yeah, that's better." Okay, thanks for coming, that's the talk. No, joking. So why does any of this matter? You might be sitting there wondering, "cool story, bro, why are you telling me all this?" The practical reality is that we can't guarantee that every node in a system will always be alive and reachable, which means some part of every distributed system is always at risk of failing. Think about how hard it is to coordinate plans with your chronically late friend, and now imagine that at a huge scale, over and over again, with a lot of machines you can't see and can't text (actually, maybe you can text machines now, I don't know). This whole discussion points to the Fischer-Lynch-Paterson (FLP) impossibility result, from a very famous 1985 paper, which basically states that distributed consensus is impossible in an asynchronous system when even one process can fail. There's a lot to unpack there; in the references for this talk I've linked more resources if you want to go deeper. So, we've just seen that in almost every case of running a distributed system today there is at least one element outside of your control that represents the possibility of failure. So what
can we do about it? We can't eliminate failure, and we can't pretend it isn't going to happen, but what we can do is manage it; we can build for it. To manage uncertainty we have a set of technical mitigation strategies, and just as a warning, I'm going to put up two fairly high-density slides. I don't expect you to read everything on them; I'm going to wave at them a little bit, and they're in the slides, which you can get later. One category, one genre of strategy, is that we can limit the amount of chaos in the world by limiting who can write at any given time. A very common pattern here is the leader-follower pattern, where the leader is the only node that can write new data. Followers can still receive requests, but they have to forward those write requests to the leader, which then determines what order to write them all in. The challenging part is that the leader is also a node that can go offline, so we need a contingency plan to keep things writable in case that happens. We use a process called leader election to choose a new leader from the remaining nodes, and then a failover, which usually needs someone or something to say "okay, you can be the new leader" and trigger the promotion.
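Here's a highly simplified Go sketch of that leader-follower rule: followers don't apply writes themselves, they forward them to whichever node they currently believe is the leader. All the types and names here are invented for illustration, and real systems handle far more edge cases (stale leaders, retries, re-election, and so on).

```go
package main

import (
	"errors"
	"fmt"
)

type nodeID string

type cluster struct {
	leader nodeID
	logs   map[nodeID][]string // each node's local copy of the write log
}

// write applies the value if asked on the leader; otherwise it forwards
// the request to the leader, so only one node decides the write order.
func (c *cluster) write(on nodeID, value string) error {
	if c.leader == "" {
		return errors.New("no leader elected; writes are paused")
	}
	if on != c.leader {
		fmt.Printf("%s is a follower, forwarding %q to leader %s\n", on, value, c.leader)
	}
	c.logs[c.leader] = append(c.logs[c.leader], value)
	return nil
}

func main() {
	c := &cluster{
		leader: "node-a",
		logs:   map[nodeID][]string{"node-a": {}, "node-b": {}},
	}
	c.write("node-b", "cat ate breakfast") // follower forwards to node-a
	c.write("node-a", "cat took a nap")    // leader writes directly
	fmt.Println("leader log:", c.logs[c.leader])
}
```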
Another way to mitigate uncertainty is to set some rules about how many "yes" votes from the whole group are enough to proceed. This is an oversimplification, but these sets of rules can generally be thought of as consensus algorithms. Raft is one of many; these strategies, along with things like two-phase commit, try to keep a simple majority of nodes agreeing on what the most recent data is and what goes into the log next. And of course, some of them use leader-follower inside them as an additional mitigation.
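The "how many yes votes are enough" rule usually boils down to a simple majority. Here's a tiny Go sketch of that arithmetic (my own illustration, far simpler than what Raft or any real consensus protocol actually does).

```go
package main

import "fmt"

// quorum returns the minimum number of "yes" votes needed for a
// simple majority in a cluster of n nodes.
func quorum(n int) int {
	return n/2 + 1
}

// committed reports whether enough nodes acknowledged a proposal.
func committed(acks, clusterSize int) bool {
	return acks >= quorum(clusterSize)
}

func main() {
	for _, n := range []int{3, 5, 7} {
		fmt.Printf("cluster of %d needs %d votes\n", n, quorum(n))
	}
	// With 5 nodes, 2 acks isn't enough, 3 is.
	fmt.Println(committed(2, 5), committed(3, 5))
}
```

A majority quorum is also why odd-sized clusters are common: five nodes tolerate two failures, while six nodes still only tolerate two.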
I also love telling people who don't know this yet: did you know Raft doesn't stand for anything? It's called Raft because it's a bunch of logs. Seriously, that's true. I've also heard it's because you use a raft to escape the island of Paxos, which I don't know about at all. So what is even harder than getting machines to agree? We've had consensus algorithms for a long time, and we create better ones every day, but what we haven't quite figured out is getting humans to agree and work together. So I'm going to dedicate the final part of this talk
not to computer systems but to human systems. So here's the thing: the more fault-tolerant we try to make our systems, the more complex they inherently become. We start adding things to accommodate distributed failures like dropped messages: message queues, maybe load balancers, replication across different members of the system, or maybe our favorite container-based workload orchestration engine, like Kubernetes. We introduce complexity into our systems, and this complexity is not in itself a bad thing. I'm not saying never use Kubernetes; sometimes that inherent complexity is exactly what you need to take on to build systems with the degree of resilience and fault tolerance your business requires. But of course, uncertainty is also introduced by the humans operating and building those systems. Charity Majors talks extensively about this increasing complexity: today's systems are becoming more complicated, mental models are becoming harder to build, and now we are all distributed systems engineers.
So it's getting harder and harder for us to reason about what's going on underneath. How do we manage this growing complexity well? The first step is to spend some time understanding where cognitive complexity comes from. If you're building complex systems from microservices, or whatever you want to use, Woods' theorem becomes more and more important. Dr. David Woods wrote in 1988 that as the complexity of a system increases, the accuracy of any single agent's model of that system decreases rapidly. As humans operating large, invisible systems, we all somehow maintain mental models of what we think is happening. So imagine when you first join a new team or start learning a new system.
I think we imagine we build mental models like this: we take knowledge and stack it neatly on top of knowledge we had before. But in reality we don't always have the time or resources to build those clean mental models, and often the context we're operating in changes right under our feet. So most of the time we just do our best to preserve the few relevant bits, because, like computers, humans are limited by the size of their L1 cache (I know I am). The question we have to ask when we build complex systems as teams of humans is: how can we reach consensus in our understanding of the world, when often we don't even have very good language to compare and contrast what's in our heads? In this example, if you tell three software engineers that we're going to have fish for dinner, they might have
Typically, teams conduct an incident. Review is sometimes called postmortem although I personally don't like the term postmortem because it implies that something terrible happened and something died like maybe your soul dies a little bit I don't know but no one died maybe depending on your systems I guess but Los Similar incidents are actually opportunities for new conversations and new learnings. If you want to know more about this, I really recommend it and I've linked some resources that will allow you to look into the work that John Ospal and his team have been doing over the last few years.
You can also come to the session with Dr. Laura Maguire at 4:10 today; she knows a lot more about this than I do. But how does incident analysis help surface mental models? Where is the connection? During an incident review, there are a couple of typical activities that teams might do: maybe you build a timeline of everything that happened, you create architecture diagrams, you try to map out decision trees. In the end it doesn't really matter which format you choose; the goal of the meeting is to create an environment where the team can learn from incidents.
The only non-negotiable component of an incident review is that it has to be blameless. Blameless discussions focus on learning rather than on assigning blame, and by having these conversations you will almost certainly discover that each person in the room had a different idea of what they thought was going on. To close out this point about blameless postmortems: I think a really clear litmus test for whether your incident reviews are actually blameless is to listen for counterfactuals. Counterfactuals are statements like, "if Denise had not put the coffee cup on the table when the cat was in the room, then the incident would not have happened." Counterfactuals are difficult because they're hypothetical: x happened, and they ask about a version of events that did not.
So what if you establish that a different course of events could have happened but didn't? You can't observe it, you can't learn from it, it's not actionable. Instead, remember that human decisions never happen in a vacuum (the slide is a silly joke, it's this little Henry vacuum, which we don't have in Canada, so be Henry), and ask what seemed reasonable at the time, because every human decision was formed by some belief that a course of action was reasonable given the circumstances. If you want to learn more about this, I've linked Courtney and Lex's presentation from SREcon last year, where they did a whole workshop on it; it's really great. So, in addition to keeping your incident reviews blameless, please keep this in mind: don't accept human error as the root cause. Don't say "oh, Denise screwed up and that's why the incident happened." Instead, dig deeper and ask harder questions that will actually be more useful. Ask, for example, whether a user was misled by a design that was unintuitive or really difficult to navigate, or whether someone had too many alerts and was exhausted, experiencing alert fatigue, or whether the way some software was designed didn't account for the assumptions a user would bring with them, you know, in the web portal or the control room, whatever it is. In fact, if you start reading the latest thinking in resilience engineering, a lot of people are suggesting that maybe there are no root causes at all.
One of the best talks on this is by Netflix's Ryan Kitchens, also linked in the notes. So, earlier I was half joking about that epistemology thing; I just wanted a way to work my philosophy degree into this talk, like "oh, I'll make everyone learn about epistemology." But in all seriousness, we all have different frames of reference, and we don't have a great vocabulary for working out the differences. Of course humans make mistakes; I think that's just part of being human. But the best thing about humans is that we are also much more adaptable than machines.
The more we depend on technology and push it to its limits, the more we're going to need well-trained, well-practiced people to make our systems resilient and to act as the last line of defense against the failures that will inevitably occur. This comes from an article titled "Ironies of Automation: Still Going Strong at 30?", which is also very much worth reading. So, in pursuit of socio-technical goals, in building microservices, in building resilient systems, in building large, sprawling distributed systems, we are always taking out a loan against inherent complexity. But we as humans can learn and adapt; that is our superpower. So with this superpower, I challenge you to empathize with each and every user, not just the customers, not just the paying ones, but also the people operating your systems, the ones who are going to get paged at 2 a.m. When we think about where to draw the lines, for example around a bounded context, those are design decisions, right? Those are not purely technical decisions, quote-unquote; those are design choices, and we
should always challenge ourselves to make those decisions in a way that optimizes for the humans who operate and build these systems. That means we should choose tools and processes that help humans do their jobs better, and make sure those tools and processes promote learning and a sustainable pace, because, after all, we owe it to our end users and our teams to understand and design the entire system, including the meaty human parts. Many thanks to Nikki Wrightson and QCon for inviting me to be here today to talk about cats, I guess. Once again, those are the slides and the references.
Thank you very much for listening.
