Q Learning

Previously we were looking at the value of a state. Q-learning moves to calculating the value of an action: we now act based on the values of our actions rather than the value of a state. People tend to think the Q is shorthand for "quality". Now let's derive the equation for Q-learning.

Deriving the equation

Remember the Bellman equation for the stochastic case?

V(s) = maxₐ (R(s,a) + γ ∑s' P(s, a, s') V(s'))

The value of a state is the value of the best action available in it. If we commit to a single action a, we can write the value of that action directly, i.e. V(s) = Q(s,a) for the chosen action:

↡
Q(s,a) = R(s,a) + γ ∑s' P(s, a, s') V(s')

There is no max because we are not comparing all the alternative actions, just evaluating one action. We need to wean ourselves off V, so we need to replace V(s'). V(s') is the value of the next state s', and it is worth noting that V(s') = maxₐ' Q(s', a'). With that, the equation becomes:

↡
Q(s,a) = R(s,a) + γ ∑s' P(s, a, s') maxₐ' Q(s', a')

Why the max? Well, we still want to consider all the possible actions a' from the next state s' and take the best one.
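To make the derivation concrete, here is a minimal sketch in Python (assuming NumPy and a made-up two-state, two-action MDP; P, R, γ, and the iteration count are illustrative choices, not from the text). It simply iterates the final equation until Q settles, then recovers V(s) = maxₐ Q(s,a):

```python
import numpy as np

# Toy MDP (assumed numbers, purely for illustration):
# P[s, a, s'] = transition probability, R[s, a] = immediate reward.
gamma = 0.9

P = np.array([
    [[0.8, 0.2], [0.1, 0.9]],   # transitions from state 0 under actions 0 and 1
    [[0.9, 0.1], [0.2, 0.8]],   # transitions from state 1 under actions 0 and 1
])
R = np.array([
    [1.0, 0.0],                 # rewards for (state 0, action 0) and (state 0, action 1)
    [0.0, 2.0],                 # rewards for (state 1, action 0) and (state 1, action 1)
])

Q = np.zeros((2, 2))            # Q[s, a], initialised to zero
for _ in range(200):
    # Q(s,a) = R(s,a) + γ ∑s' P(s,a,s') maxₐ' Q(s',a')
    Q = R + gamma * P @ Q.max(axis=1)

V = Q.max(axis=1)               # V(s) = maxₐ Q(s,a)
print(Q)
print(V)
```

Note that this sketch assumes the transition model P and rewards R are known, which keeps it as close as possible to the equation we just derived.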