Bellman equation
It's named after Richard Bellman and is defined as a necessary condition for optimality. It is associated with the mathematical optimization method known as dynamic programming. The deterministic equation is listed below: 
V(s) = maxₐ  (R(s,a) + ɣ(V(s'))
maxₐ: represents all the possible actions that we can take.
R(s, a): The reward of taking an action at a particular state.
ɣ: Discount. Works like the time value of money.
You can see the dynamic programming aspect as we call the same method on s'. So it will recursively operate to solve the problem.
Deterministic vs non-deterministic: Deterministic is definite, there is no randomness whereas non-deterministic is stochastic.
So above we were not adding any randomness (deterministic) but nothing in this world is truly predictable, let's add randomness. Whereby each step is not so certain to be done (adding probability). It makes our agent more natural (being drunk! lol) so that means maybe our agent will have an 80% probability to listen 10% to go another way and 10% to go another way.
Markov Decision Process
This is essentially the Bellman equation but catering for non-deterministic scenarios. We will loop through each possible next state, multiply the value of that state by its probability of occurring and sum them all up together.
The Markov Property
This property states that the past does not matter - it only looks to the future and the present. So it only looks to the current state. It also helps us save our memory since we do not need to save past states. 
Concept
A stochastic process has the Markov property if the conditional probability of the future states of the process depends only on the current state. The MDP provides a mathematical framework for modeling decision making where outcomes are partly random and partly under the control of the decision-maker. It adds more realism and perfection to the Bellman equation. It changes the Bellman from being deterministic to be non-deterministic.
The equation
S: This is the state our agent is in.
Model : T(s,a, s') ~ Pr(s'|s, a). Also known as the transition model. Determines the physics or rules of the environment and the function produces the probability.
Actions: A(s), A. Things you can do in a state.
Reward : R(s), R(s,a), R(s,a, s' ) Represents the reward of being in a state. The main focus is R(s). This is the reward function.
Policy: 𝝿(s)→a, 𝝿*. The solution to an MDP. It differs from the plan because a plan is just a sequence of events whereas the policy tells you what to do in any state. Shows the key-value pairs.
V(s) = maxₐ (R(s,a)+ ɣ∑s' P(s, a, s') V(s'))
Living Penalty- It's a negative reward for being in a particular state. The main incentive for this is for the agent to want to finish the game as quickly as possible.
Comments
Post a Comment