Q Learning
Previously we were looking at the value of a state. Q-learning now moves to calculating the value of an action: we choose our moves based on the values of actions rather than the value of a state. The "Q" is commonly thought to stand for "quality". Now let's derive the Q-learning equation.
Deriving the equation
Remember the Bellman equation for a stochastic environment?
V(s) = maxₐ (R(s,a)+ ɣ∑s' P(s, a, s') V(s'))
The value of an action equals the value of a state, i.e. V(s) = Q(s,a).
↡
Q(s,a) = R(s,a)+ ɣ∑s' P(s, a, s') V(s')
There is no max here because we are not considering all of the alternative actions, just the one action taken. We still need to wean ourselves off V, so we have to replace V(s'). V(s') is the value of each possible next state, and it is worth noting that V(s') is also maxₐ(Q(s', a')). With that substitution it becomes
↡
Q(s,a) = R(s,a)+ ɣ∑s' P(s, a, s') maxₐ (Q(s', a'))
Why the max? Because from the next state we still want the best of all the possible next actions. Everything is now expressed through a single function Q, which makes it easier to calculate. These qualities of actions are what we call Q values.
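To make this concrete, here is a minimal sketch in Python of a one-step Q-value calculation for a tiny, made-up MDP. The names P, R, gamma and Q, and all of the numbers, are hypothetical placeholders chosen for illustration, not part of the derivation above.

# Sketch of Q(s,a) = R(s,a) + gamma * sum_s' P(s,a,s') * max_a' Q(s',a') for a toy MDP.
# P, R, gamma and Q are illustrative placeholders.

gamma = 0.9  # discount factor (assumed value)

# Transition probabilities: P[(s, a)] = {s': probability of landing in s'}
P = {
    ("s0", "right"): {"s1": 0.8, "s0": 0.2},
    ("s0", "left"):  {"s0": 1.0},
}

# Immediate rewards: R[(s, a)]
R = {("s0", "right"): 1.0, ("s0", "left"): 0.0}

# Current Q-value estimates: Q[s][a]
Q = {
    "s0": {"right": 0.0, "left": 0.0},
    "s1": {"right": 2.0, "left": 0.5},
}

def q_value(s, a):
    # Evaluate the right-hand side of the stochastic Q equation.
    expected_next = sum(
        prob * max(Q[s_next].values())   # max over the next actions a'
        for s_next, prob in P[(s, a)].items()
    )
    return R[(s, a)] + gamma * expected_next

print(q_value("s0", "right"))  # 1.0 + 0.9 * (0.8*2.0 + 0.2*0.0) = 2.44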
An important piece of the puzzle now is the temporal difference.
Temporal Difference
It helps us calculate Q values as the environment changes over time, and it is a very important piece of Q-learning. It lets us estimate the values of actions in a non-deterministic environment, which is especially helpful when the environment keeps changing; the moment your temporal difference converges to 0, the algorithm has converged.
Deriving the temporal difference equation
For easier understanding, let's temporarily use the deterministic Bellman equation, which is
V(s) = maxₐ(R(s,a) + ɣV(s'))
Let's refactor this in terms of Q values.
↡
Q(s,a) = R(s,a) + ɣ maxₐQ(s',a')
Now let's imagine our agent has moved. The temporal difference (the difference between Q values) is then defined as:
↡
TD(a,s) = R(s,a) + ɣ maxₐQ(s',a') - Q(s,a)
R(s,a) + ɣ maxₐQ(s',a') : the new Q value.
Q(s,a) : the old Q value.
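As a quick numeric sanity check of the definition above, here is a tiny Python snippet; all the numbers are made up for illustration.

# Hypothetical values: R(s,a) = 2, gamma = 0.9, max Q(s',a') = 10, old Q(s,a) = 9
reward, gamma, max_next_q, old_q = 2.0, 0.9, 10.0, 9.0
td = reward + gamma * max_next_q - old_q   # new Q value minus old Q value
print(td)  # 2.0 -> the old estimate sits 2.0 below the new one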
Does that mean we dispose of the old Q value? No, because if the move our agent made was stochastic, the new Q value could just be another random outcome, and we wouldn't want to overwrite our Q values because of a random and possibly one-time event. So the Q value should be
↡
Qₜ(s,a) = Qₜ₋₁(s,a) + αTDₜ(a,s)
The new variable α represents our learning rate (it has to be between 0 and 1). This equation sums up how Q values are updated. If you want the whole equation in one line, it becomes
↡
Qₜ(s,a) = Qₜ₋₁(s,a) + α(R(s,a) + ɣ maxₐQ(s',a') - Qₜ₋₁(s,a))
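Putting the pieces together, here is a minimal sketch of the tabular Q-learning update in Python. The Q table, the example transition, and the values of alpha and gamma are hypothetical placeholders; in practice this update runs inside a loop of agent-environment interactions, and you can watch the returned temporal difference shrink toward 0 as a sign of convergence.

# Sketch of Q_t(s,a) = Q_{t-1}(s,a) + alpha * (R(s,a) + gamma * max Q(s',a') - Q_{t-1}(s,a))
# Names and numbers are illustrative placeholders.
from collections import defaultdict

alpha = 0.1   # learning rate, between 0 and 1
gamma = 0.9   # discount factor

# Q[s][a], defaulting to 0 for unseen state-action pairs
Q = defaultdict(lambda: defaultdict(float))

def q_learning_update(s, a, reward, s_next, next_actions):
    # Apply one temporal-difference update and return the TD error.
    max_next_q = max((Q[s_next][a2] for a2 in next_actions), default=0.0)
    td = reward + gamma * max_next_q - Q[s][a]   # temporal difference
    Q[s][a] += alpha * td                        # nudge the old estimate toward the new one
    return td

# One hypothetical transition: taking "right" in "s0" gives reward 1 and lands in "s1"
td_error = q_learning_update("s0", "right", 1.0, "s1", ["left", "right"])
print(Q["s0"]["right"], td_error)  # 0.1 1.0 on the first update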