
Markov Decision Processes - Computerphile

Apr 02, 2024
Today I wanted to talk a little bit about the models and algorithms that we use to plan under uncertainty for robots, and how robots make decisions when the world is against them. The world is always against robots? Yes, yes, it's complicated. Sure, probably the best place to think, or the best place to start, is a shortest path problem, so I'm going to try to draw it and just warm up the drawing. You can imagine that you're starting at home and you're trying to get to work, and I have to say this is an example stolen from a talk I saw a few years ago, and I'll thank them for that later. You have, let's say, the simplest shortest path problem.
There's just a single step: you can go by bike, and that might take you 45 minutes; you can go by car, and that might take you, let's say, 30 minutes; and you can go by train, and maybe that will take you 35. This is kind of a standard decision-making problem. We have two states, so our states are being at home or at work, and we have three actions: we can go by bike, we can go by car, or we can go by train. We use these kinds of state transition models everywhere in robotics, so if I'm programming a robot to pick up something or navigate a map (last time we were together we were looking at robots navigating through these graphs), we turn those graphs into these types of decision problems as well. So this is a type of model that you could use to decide what route to take to work, and in this case, whether we're trying to minimize time or minimize cost, we would take the car option, which is 30.
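To make that deterministic baseline concrete, here's a minimal sketch in Python; the dictionary layout and names are just one illustrative choice, not anything from the video:

```python
# Deterministic version: one decision, and each action has a fixed cost in minutes.
actions = {
    "bike": 45,
    "car": 30,
    "train": 35,
}

# With no uncertainty, the best action is simply the cheapest one.
best_action = min(actions, key=actions.get)
print(best_action, actions[best_action])  # car 30
```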

Now I'm going to show what the problem is with this model. It's too easy, it's too simple, so it doesn't reflect the fact that, yes, the world can be against you, the world can be uncertain. The standard thing we do in robotics is take what we call deterministic models, so this is a model where the outcome is completely determined by the state you're in and the action you take, so you know where you're going to end up. And we can say, well, actually, we can make them more complicated; we want to reflect the uncertainty. So I'll draw a kind of big version first and we can think about that. We imagine that we take our action that we could call car, and that's still going to cost us 30, or 30 minutes, right; that's my planning model, that's how we're going to model the world. But then what we're going to think about is that there will actually be different outcomes.
So when I execute this action, when I execute that action in the world, I can only end up in one of three different states, which can be different traffic levels, so it can be light, medium, and heavy. That's kind of: surprise, I'm in my car, I chose to take it to work, I drove a little bit and got onto the motorway or whatever, and it's one of these traffic states, and then I can go to work from there. But what we can imagine is that when we leave those states, it's going to take us a different amount of time.
So what we could do is use that to change our model. This is the basic idea: what we would do is assign different probabilities, so we could say 20% of the time, or 0.2, the traffic is light; 70% of the time, or 0.7, it is medium; and 10% of the time, or with probability 0.1, there's heavy traffic. This reflects some of that uncertainty, some of that complexity in the world. This is kind of the standard action model we would use, and when we use action models like this, it's called a Markov decision process. So a Markov decision process has states and actions like a shortest path problem, except now when I perform my action, there is a probability distribution over the outcomes I can reach.

I think we did Markov some time ago, talking about text, is that right? Yeah, so Markov is a probabilistic assumption. When you have a state, it doesn't need actions, right, it can just be state transitions, excuse me, state transition systems, and what we make is a Markov assumption, which says the probability of an outcome only depends on the current state. So the history doesn't matter, it doesn't matter what I was doing before I chose to leave home; the only thing that determines that probability is the state I'm in and the action I'm taking now.
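As a concrete sketch of that action model, here's one way it might look in Python, using the probabilities from the example; the dictionary and function names are illustrative assumptions:

```python
import random

# P(next state | state, action): under the Markov assumption this distribution
# depends only on the current state and the chosen action, not on the history.
car_outcomes = {"light_traffic": 0.2, "medium_traffic": 0.7, "heavy_traffic": 0.1}

def sample_outcome(outcome_probs):
    """Sample one outcome from a distribution given as {state: probability}."""
    states = list(outcome_probs)
    weights = list(outcome_probs.values())
    return random.choices(states, weights=weights)[0]

print(sample_outcome(car_outcomes))  # e.g. 'medium_traffic'
```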
To be more formal, you can think of that as a kind of first-order Markov system, and you can have a second-order Markov system that says the current state and the previous state determine my probabilities. In general, if you think of something like a Markov chain, if you're processing text, you can think of word generation following probability distributions, and there you can think of one step back: if I give you "the", there's a probability distribution over the next word, or if I give you "I'm the", then you get a different probability distribution, and if you take more of that history, if you make a third-order Markov assumption, you'll get yet another probability distribution over the word you'd get next.

Okay, so that's our action model, and we can take this action model and paste it into the graph that we had before, the process that we had before. Let's set this up again, and we'll draw that small arrow at the top, which means this is our initial state. Like in a shortest path problem, we will start from somewhere, and we normally do this in robotics because our robot has to be somewhere when it starts thinking. And we're going to make our whole world a little more complex, so I can go by train, I can go by car, I can go by bike, and these actions now really reflect the decisions that I'm going to make to begin with: not necessarily going ahead and doing the whole journey, but making the decision that this is what I want to do, and maybe taking that form of transport. So now the car action will cost one: this is me leaving the house, going to my parking space and finding my car. The railway, maybe you have to walk a little bit further to get to the railway station, so that will cost two. And then we can start putting in these actions (so actually these arrows should be dots), and I can put in the action that we had before, where we have different levels of traffic, so we have light, we have medium.
I'm going to start abbreviating these things, because my handwriting is atrocious and it will take forever to just watch me write the word "heavy". And then down at the bottom here we're going to have work, and I'm drawing two circles on purpose: when we cook up this state transition system, we have our initial state with the arrow coming in, and I draw these two rings to mean that this is where we want to end up. We call this an absorbing state; there's no way to get out of it once you're there, which, since this is work, is a bit depressing, but here we go. And then we could have an action called drive, so our drive action will be deterministic in the sense that I can take it and it always gets me to work, but I can only take it in certain states, so I can only take it once I'm in my car.
From there I'm just left to drive, and we can use this now to model the fact that we could get different trip lengths. So let's say in heavy traffic it takes me 70 minutes to get to work, in medium traffic 30, and 20 minutes in light traffic, and I think we had these probabilities of 0.2, 0.7, and 0.1. So this shows me that if I go by car, I have some probability of these different levels of traffic, and once I've experienced that level of traffic there's no turning back: I have to follow the car, I have to follow that action choice, so use the car to get to work, and it may take a different amount of time.
Let's say here that the bicycle is deterministic, so if I'm on a bicycle I have full control of the time it takes me: that will take 45 minutes, so it's further by bike, but there is no uncertainty, and that could be interesting later in this example. We can say that when I go to the train station, well, maybe 90% of the time (which seems quite unlikely) the train is there, ready to leave, and then it takes me; I have an action that we'll call relax, because once we're on the train we're happy, and that will take me, let's say, 35 to get to work. And then 10% of the time I have to wait: I'm in the waiting room, and from the waiting room I have to wait, and then 90% of the time the train comes, ten percent of the time I have to wait again. So the interesting thing here is that there's a kind of notion of the passage of time. You can start to imagine that I'm wandering around, hoping the train comes, hoping the train comes, again and again, and of course if I'm stuck in the waiting room for too long I might as well choose to go home, and the nice thing about that is that I can then make another decision. So going home could cost me, say, two, and the waits maybe three minutes each time.

So now I have this big graph, and this is the Markov decision process. This is a process because I'm going through a sequence of steps; at each step I have to choose what action to perform, and then that evolves the state, so the state I am in changes, and we have different probabilities of reaching different states. But in a sense we can still think of this as the shortest path problem, except instead of a shortest path we call it a stochastic shortest path, because what we want to do is reach our goal, so we are still trying to reach our goal, but now we are also dealing with uncertainty.
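Putting the whole commute example together as data, a minimal sketch of the MDP might look like the following. The state and action names are mine, the costs and probabilities are the ones from the example, and the cost of the go-home action is an assumption based on the discussion:

```python
# Stochastic shortest path MDP for the commute example.
# transitions[(state, action)] = list of (probability, next_state, cost_in_minutes)
transitions = {
    ("home", "bike"):      [(1.0, "work", 45)],               # deterministic, 45 min
    ("home", "car"):       [(0.2, "light", 1),                 # 1 min to reach the car,
                            (0.7, "medium", 1),                # then traffic is revealed
                            (0.1, "heavy", 1)],
    ("light", "drive"):    [(1.0, "work", 20)],
    ("medium", "drive"):   [(1.0, "work", 30)],
    ("heavy", "drive"):    [(1.0, "work", 70)],
    ("home", "train"):     [(0.9, "platform", 2),              # 2 min walk to the station
                            (0.1, "waiting", 2)],
    ("platform", "relax"): [(1.0, "work", 35)],
    ("waiting", "wait"):   [(0.9, "platform", 3),              # 3 min per wait cycle
                            (0.1, "waiting", 3)],
    ("waiting", "go_home"): [(1.0, "home", 2)],                # assumed cost of giving up
}
GOAL = "work"  # absorbing state
```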
This is a model that describes the options your agent might have, so these are the options you might have when you go to work. We could imagine I'm building an autonomous vehicle, and this is, you know, an autonomous taxi system, and it has to decide how to get you to work. These kinds of models really capture, you know, the decision-making part of artificial intelligence, whether it's robotics, whether it's chatbots, all kinds of stuff. So the next question is really how do we solve this, and there are two parts to the answer: the first is what the solution looks like, the second bit is what algorithm we use. When you say what it looks like, do you mean...
Is it just a simple number, or... Yes, well, what structure does the solution have? So if it were the shortest path problem, the solution is a path, right, and that's a sequence of actions that you're going to take from start to finish. But in an MDP, in a Markov decision process, we can't use a path, because when I take a step the world changes probabilistically, stochastically, so I don't know what state I'm going to end up in. So instead of a path we have what's called a policy, and what the policy is is a lookup table that simply tells you: when you're in this state, take this action.
That state could be one of the problem's states, or we could even augment them, so as these models become more complex we could add extra information to the state: we could add the time of day, we could add, you know, how much time is left before some kind of deadline, how many times we've waited in the waiting room. That could be interesting; you can add more to your state to help make decisions. But overall, the policy, instead of just being a straight-line plan, a sequence of actions, will be a lookup table that tells you the optimal action, the best action to take to achieve a particular specification, a particular goal you're trying to achieve, when you're in this state.

For MDPs there are all kinds of specifications; the most common for something like this, for a stochastic shortest path problem, is to minimize the expected cost to the goal. The expected cost is the average cost: if I go to work, you know, every day for the rest of my life and I have these options, then the expected cost, so averaging the cost, makes sense, because I can execute this policy for this problem many, many times; sometimes I do well, sometimes I do badly, but all that evens out over time. So we could think about that to start. My goal is to get to work, my cost is the time, so what I want to do is minimize the expected cost to reach the goal, so I need an algorithm that calculates that expected cost.
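Before getting to that algorithm, here's a small sketch of the "run the policy many, many times and it evens out" idea, using the transitions table sketched above: a policy is just a dictionary from state to action, and averaging the simulated costs approximates the expected cost. The particular policy shown is only an illustrative one:

```python
import random

# A policy is just a lookup table: state -> action.
policy = {"home": "car", "light": "drive", "medium": "drive", "heavy": "drive"}

def rollout(policy, start="home", goal="work"):
    """Simulate one commute under the policy and return its total cost."""
    state, total = start, 0
    while state != goal:
        outcomes = transitions[(state, policy[state])]
        weights = [p for p, _, _ in outcomes]
        _, state, cost = random.choices(outcomes, weights=weights)[0]
        total += cost
    return total

costs = [rollout(policy) for _ in range(100_000)]
print(sum(costs) / len(costs))  # roughly 33 for the 'car' policy
```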
The expected cost of the entire policy from the initial state is a bit like thinking about the average cost of a path from the start to the goal, rather than the cost of any single path, because the paths change due to the probabilities. The way we solve this is with something called the Bellman equation. The Bellman equation describes how good a particular action choice is, so we need to think about the different action choices available to us in a state, we value those choices, and we basically choose the best action, which in this case will be the action with the lowest expected cost, and we put that in the policy. Which is also why... why can't we use things like A* or Dijkstra? A* and Dijkstra deal with deterministic problems, so they deal with state transition systems that might look a little like this, but they assume that actions are purely deterministic, so they produce paths from start to finish. No probability? No probability, yeah.
They assume that when you perform an action, the world changes exactly the way the action says it will. There are sort of analogues for some of these things in the probabilistic world, so the algorithm we'll come up with has similarities to the way Dijkstra's algorithm works, and actually A* is a heuristic search algorithm, and there are also heuristic search algorithms for Markov decision processes, so they're classes of algorithms that fit well together. But for this we can't just take Dijkstra or A* and say, you know, go. If you're just thinking about the average cost, then everything gets a lot easier, and what we're basically going to do is take all the different outcomes, multiply each cost by the probability of getting that cost, sum them up, and that really tells us how good that action is, so it's just kind of...
It's almost like taking something like a normal shortest path algorithm and collapsing all the outcomes of an action into one average outcome, but we have to do it recursively, because each action takes me to another state where I have to make a different action choice. So we have this recursive problem: my actions are only as good as the states they reach, that's the recursive part, but I also have to think about the fact that this happens probabilistically, which is where this averaging comes in. Beyond expected cost, we can also think about trying to bound the probability of success or failure, or the probability of certain extremes.
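As a sketch of that recursive averaging, here's a minimal value iteration loop over the transitions table from earlier: the Bellman backup multiplies each outcome's cost by its probability, sums them, and takes the cheapest action. The iteration count is an arbitrary choice of mine, and the number it produces for starting at home (about 33, choosing the car) matches what's discussed below:

```python
def q_values(s, V):
    """Bellman backup: expected cost of each action = sum of prob * (cost + V[next])."""
    return {a: sum(p * (c + V[s2]) for p, s2, c in outs)
            for (s_, a), outs in transitions.items() if s_ == s}

states = {s for s, _ in transitions} | {GOAL}
V = {s: 0.0 for s in states}            # V[s] = expected cost-to-go from s

for _ in range(1000):                   # plenty of sweeps for this small problem
    for s in states - {GOAL}:
        V[s] = min(q_values(s, V).values())

# The policy is the lookup table: in each state, take the cheapest action.
policy = {}
for s in states - {GOAL}:
    q = q_values(s, V)
    policy[s] = min(q, key=q.get)

print(round(V["home"], 1), policy["home"])  # about 33.0, 'car'
```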
I think we can talk about that later. Those algorithms are much more difficult, but super interesting, and they really are the things that matter to you when your boss tells you that you have to be there at 10 o'clock for a meeting, or you run out of patience if your commute takes more than 45 minutes. So I think real-world problems are better captured by those richer specifications. I mean, it seems to me that the waiting room is the sticking point here, because it's potentially a little bit more complicated; if you took the train route, then you can have multiple rounds of that, right? Yes, it depends on the problem you're trying to solve. If we're solving for expected value, if I create a policy to optimize only the average cost of a trip, then the best option is the car.
I think it's something like 33 or 34 minutes, which is faster than the bike, which is 45, and I think the train comes out over 35, more like 37 on average, so it's more expensive than the car, cheaper than the bicycle. But yes, the average is interesting, and this is where these models become fun in robotics and problem solving: if you choose the car, 10% of your trips will take you 71 minutes, right, which would make you very late for work if it should generally take 30 to 40. So if you only solve for the average value, then 10% of the time you are very late, and therefore what you can think about in these types of problems is different, what we call specifications or objective functions. If you're doing any optimization, you usually think about an objective function, the thing you're minimizing or maximizing, and when we plan or make decisions with these types of models, we often express it as a specification, which is a mix of the goal you're trying to achieve and the constraints you're trying to put on it. Yes, so, for example, you're trying to do it in the shortest time possible, but it should never be longer than a certain time.
Yes, that is a perfect example: if you try to do it in the shortest time possible but say it can never be longer than, um, yeah, I'm going to say 60, because I remember the maths of this. So if you set it at 60, you'd always take the bike, because with everything else there's a chance of going over: the car takes 70 some of the time, and there's a non-zero probability that the train will also take longer than 60; in fact there's a non-zero probability that if you take the train you'll be waiting forever. But the bike is deterministic. The bike is long, so it's guaranteed to be less than 60, but it's also guaranteed to always be 45, so you'll always arrive later than you'd like on average.
Actually, I think the optimal strategy to get below 60, guaranteed to get below 60 while minimizing time, is to go to the train, wait up to three cycles, and then go home and take the bike. So effectively you give yourself the chance of catching the train within three cycles, and if you'd have to wait any longer, you know you'd risk violating your deadline, so you go back and get the bike. But it's those kinds of resolution mechanisms... I've actually been in that situation. Yeah, exactly, um, it happens a lot, particularly post-Covid, where you're like, well,
I could work from home, but I should be there... oh, the train is late, I'm going to go back and put my pyjamas back on. But you have to think about what we call an augmented state space, where you put how long I've waited into the state, and that allows your policy to include actions, you know, actions for the case where I'm in the waiting room and the time spent is now nine minutes. Those things you do in your head, saying oh God, I've waited for so long, maybe the train will never arrive, you can also include in the decision making.
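A minimal sketch of that augmented-state idea, building on the transitions table above and assuming the three-wait limit and go-home cost of two from the discussion (the worst-case arithmetic in the comments is mine): the state tracks how many times we've waited, and the policy switches to the bike fallback once the limit is reached.

```python
import random

# Augmented state: (location, number_of_waits_so_far).
WAIT_LIMIT = 3  # after three waits, fall back to the bike (assumed from the discussion)

def constrained_policy(location, waits):
    """Policy over the augmented state space: try the train, bike as a fallback."""
    if location == "home":
        return "bike" if waits >= WAIT_LIMIT else "train"
    if location == "waiting":
        return "go_home" if waits >= WAIT_LIMIT else "wait"
    if location == "platform":
        return "relax"
    return "drive"  # traffic states, not reachable under this policy

def rollout_augmented(start="home"):
    """Simulate one commute under the constrained policy, tracking waits."""
    location, waits, total = start, 0, 0
    while location != "work":
        action = constrained_policy(location, waits)
        outcomes = transitions[(location, action)]
        weights = [p for p, _, _ in outcomes]
        _, location, cost = random.choices(outcomes, weights=weights)[0]
        total += cost
        if action == "wait":
            waits += 1
    return total

# Worst case: 2 (walk) + 3 * 3 (waits) + 2 (go home) + 45 (bike) = 58 minutes,
# so the 60-minute deadline is never violated, while the train is usually faster.
print(max(rollout_augmented() for _ in range(100_000)))  # never exceeds 58
```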
If you know exactly where you want the robot to go, you take the 3D map and then manually annotate it with the points. In this case, the graph is built autonomously using that Lego representation, so, what we call Submitted Computing, in particular, what we want is the
