Q Learning
Previously we were looking at the value of a state. Q-learning now moves to calculating the value of an action: we choose our moves based on the values of actions rather than the value of a state. The "Q" is commonly thought to stand for "quality". Now let's derive the Q-learning equation.
Deriving the equation
Remember the Bellman equation for a stochastic environment?
V(s) = maxₐ (R(s,a)+ ɣ∑s' P(s, a, s') V(s'))
The value of an action equals the value of a state, i.e. V(s) = Q(s,a).
↡
Q(s,a) = R(s,a)+ ɣ∑s' P(s, a, s') V(s')
There is no max here because we are not considering all of the alternative actions, just the one action taken. We still need to wean ourselves off V, so we have to replace V(s'). V(s') is the value of each possible next state, and it is worth noting that V(s') is also maxₐ(Q(s', a')). With that substitution it becomes
↡
Q(s,a) = R(s,a)+ ɣ∑s' P(s, a, s') maxₐ (Q(s', a'))
Why the max? Because from the next state we still want the best of all the possible next actions. Everything is now expressed through a single function Q, which makes it easier to calculate. These qualities of actions are what we call Q values.
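To make this concrete, here is a minimal sketch in Python of a one-step Q-value calculation for a tiny, made-up MDP. The names P, R, gamma and Q, and all of the numbers, are hypothetical placeholders chosen for illustration, not part of the derivation above.

# Sketch of Q(s,a) = R(s,a) + gamma * sum_s' P(s,a,s') * max_a' Q(s',a') for a toy MDP.
# P, R, gamma and Q are illustrative placeholders.

gamma = 0.9  # discount factor (assumed value)

# Transition probabilities: P[(s, a)] = {s': probability of landing in s'}
P = {
    ("s0", "right"): {"s1": 0.8, "s0": 0.2},
    ("s0", "left"):  {"s0": 1.0},
}

# Immediate rewards: R[(s, a)]
R = {("s0", "right"): 1.0, ("s0", "left"): 0.0}

# Current Q-value estimates: Q[s][a]
Q = {
    "s0": {"right": 0.0, "left": 0.0},
    "s1": {"right": 2.0, "left": 0.5},
}

def q_value(s, a):
    # Evaluate the right-hand side of the stochastic Q equation.
    expected_next = sum(
        prob * max(Q[s_next].values())   # max over the next actions a'
        for s_next, prob in P[(s, a)].items()
    )
    return R[(s, a)] + gamma * expected_next

print(q_value("s0", "right"))  # 1.0 + 0.9 * (0.8*2.0 + 0.2*0.0) = 2.44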
An important piece of the puzzle now is the temporal difference.
Temporal Difference
It helps us calculate Q values as the environment changes over time, and it is a very important piece of Q-learning. It lets us estimate the values of actions in a non-deterministic environment, which is especially helpful when the environment keeps changing; the moment your temporal difference converges to 0, the algorithm has converged.
Deriving the temporal difference equation
For easier understanding, let's temporarily use the deterministic Bellman equation, which is
V(s) = maxₐ(R(s,a) + ɣV(s'))
Let's refactor this in terms of Q values.
↡
Q(s,a) = R(s,a) + ɣ maxₐQ(s',a')
Now let's imagine our agent has moved. The temporal difference (the difference between Q values) is then defined as:
↡
TD(a,s) = R(s,a) + ɣ maxₐQ(s',a') - Q(s,a)
R(s,a) + ɣ maxₐQ(s',a') : the new Q value.
Q(s,a) : the old Q value.
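As a quick numeric sanity check of the definition above, here is a tiny Python snippet; all the numbers are made up for illustration.

# Hypothetical values: R(s,a) = 2, gamma = 0.9, max Q(s',a') = 10, old Q(s,a) = 9
reward, gamma, max_next_q, old_q = 2.0, 0.9, 10.0, 9.0
td = reward + gamma * max_next_q - old_q   # new Q value minus old Q value
print(td)  # 2.0 -> the old estimate sits 2.0 below the new one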
Does that mean we dispose of the old Q value? No, because if the move our agent made was stochastic, the new Q value could just be another random outcome, and we wouldn't want to overwrite our Q values because of a random and possibly one-time event. So the Q value should be
↡
Qₜ(s,a) = Qₜ₋₁(s,a) + αTDₜ(a,s)
The new variable α represents our learning rate (it has to be between 0 and 1). This equation sums up how Q values are updated. If you want the whole equation in one line, it becomes
↡
Qₜ(s,a) = Qₜ₋₁(s,a) + α(R(s,a) + ɣ maxₐQ(s',a') - Qₜ₋₁(s,a))
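Putting the pieces together, here is a minimal sketch of the tabular Q-learning update in Python. The Q table, the example transition, and the values of alpha and gamma are hypothetical placeholders; in practice this update runs inside a loop of agent-environment interactions, and you can watch the returned temporal difference shrink toward 0 as a sign of convergence.

# Sketch of Q_t(s,a) = Q_{t-1}(s,a) + alpha * (R(s,a) + gamma * max Q(s',a') - Q_{t-1}(s,a))
# Names and numbers are illustrative placeholders.
from collections import defaultdict

alpha = 0.1   # learning rate, between 0 and 1
gamma = 0.9   # discount factor

# Q[s][a], defaulting to 0 for unseen state-action pairs
Q = defaultdict(lambda: defaultdict(float))

def q_learning_update(s, a, reward, s_next, next_actions):
    # Apply one temporal-difference update and return the TD error.
    max_next_q = max((Q[s_next][a2] for a2 in next_actions), default=0.0)
    td = reward + gamma * max_next_q - Q[s][a]   # temporal difference
    Q[s][a] += alpha * td                        # nudge the old estimate toward the new one
    return td

# One hypothetical transition: taking "right" in "s0" gives reward 1 and lands in "s1"
td_error = q_learning_update("s0", "right", 1.0, "s1", ["left", "right"])
print(Q["s0"]["right"], td_error)  # 0.1 1.0 on the first update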