AI Simplified 3: Q Learning - From State to Action

Q Learning

Previously we were looking at the value of a state. Q-learning moves on to calculating the value of an action: the agent now acts based on the values of its actions rather than the value of the state it is in. People tend to read the "Q" as shorthand for quality. Now let's derive the equation for Q-learning.
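To make the shift concrete, here is a minimal sketch (the states and actions are made up purely for illustration, not from this post) of how the two are usually stored in code: state values as a table keyed by state, Q-values as a table keyed by state-action pairs.

# Illustrative only: the states and actions below are hypothetical.
# State values: one number per state.
V = {"s0": 0.0, "s1": 0.0}

# Q-values: one number per (state, action) pair.
Q = {("s0", "left"): 0.0, ("s0", "right"): 0.0,
     ("s1", "left"): 0.0, ("s1", "right"): 0.0}

# Acting greedily with V needs a model of where each action leads,
# while acting greedily with Q is just a lookup over the actions.
best_action = max(["left", "right"], key=lambda a: Q[("s0", a)])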

Deriving the equation

Remember the stochastic Bellman equation?

V(s) = maxₐ(R(s,a) + ɣ ∑s' P(s,a,s') V(s'))
If we look at a single action instead of the best one, the expression inside the max is the value of that action, which we call Q(s,a):



Q(s,a) = R(s,a) + ɣ ∑s' P(s,a,s') V(s')
There is no max because we are no longer considering all of the alternative actions, just the one action a. We still need to wean ourselves off V, so we have to replace V(s'), the value of the next state. Note that V(s') is itself maxₐ' Q(s',a'), the value of the best action available from s'. With that substitution it becomes



Q(s,a) = R(s,a) + ɣ ∑s' P(s,a,s') maxₐ' Q(s',a')
Why the max? Because from the next state s' we still want the value of the best possible next action a'. Everything is now expressed through a single function Q, which makes it easier to calculate, and the quality of each action, Q(s,a), is what we call a Q-value.
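If the transition probabilities P(s,a,s') and rewards R(s,a) are known, this equation can be applied repeatedly until the Q-values settle. Below is a minimal sketch on a made-up two-state, two-action MDP (all the numbers are hypothetical, not from this post); each pass simply re-evaluates the right-hand side of the equation above.

states = ["s0", "s1"]
actions = ["stay", "move"]
gamma = 0.9  # discount factor ɣ

# Hypothetical rewards R(s,a) and transition probabilities P(s,a,s').
R = {("s0", "stay"): 0.0, ("s0", "move"): 1.0,
     ("s1", "stay"): 2.0, ("s1", "move"): 0.0}
P = {("s0", "stay"): {"s0": 1.0},
     ("s0", "move"): {"s1": 0.8, "s0": 0.2},
     ("s1", "stay"): {"s1": 1.0},
     ("s1", "move"): {"s0": 0.9, "s1": 0.1}}

Q = {(s, a): 0.0 for s in states for a in actions}

for _ in range(100):  # repeat until the values stop changing
    Q = {(s, a): R[(s, a)] + gamma * sum(
             p * max(Q[(s2, a2)] for a2 in actions)
             for s2, p in P[(s, a)].items())
         for s in states for a in actions}

print(Q)

This only works because the model (P and R) is known. Q-learning itself, derived next, drops that requirement and learns from observed transitions instead.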


An important piece of the puzzle now is the temporal difference.

Temporal Difference

It helps us update the Q-values as the environment changes over time, which makes it a very important piece of Q-learning: it lets us estimate the value of actions in a non-deterministic environment. This is especially helpful when your environment keeps changing, and the moment your temporal difference converges to 0, the algorithm has converged.

Deriving the temporal difference equation

For easier understanding, let's temporarily use the deterministic Bellman equation, which is

V(s) = maxₐ(R(s,a) + ɣV(s'))
Let's refactor this in terms of Q-values:
 


Q(s,a) = R(s,a) + ɣ maxₐ' Q(s',a')
Now let's imagine our agent has moved. The temporal difference (the difference between the new and old Q-values) is then defined as:


TD(s,a) = R(s,a) + ɣ maxₐ' Q(s',a') - Q(s,a)
R(s,a) + ɣ maxₐ' Q(s',a') : the new Q-value.
Q(s,a) : the old Q-value.
Does that mean we dispose of the old Q-value? No. If the move our agent made was stochastic, the new estimate could just be another random outcome, and we wouldn't want to overwrite our Q-values because of a random, possibly one-time event. So the Q-value should be updated as


Qₜ(s,a) = Qₜ₋₁(s,a) + αTDₜ(s,a)
The new variable α represents our learning rate (it has to be between 0 and 1). This equation sums up how Q-values are updated. If you want the whole equation in one line, it becomes


Qₜ(s,a) = Qₜ₋₁(s,a) + α(R(s,a) + ɣ maxₐ' Q(s',a') - Qₜ₋₁(s,a))
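Putting it all together, one Q-learning step computes the temporal difference from an observed transition and nudges the old Q-value towards it by a fraction α. The sketch below is a minimal, hypothetical version: the env object with reset() and step(action) methods, and all hyperparameter values, are assumptions made for illustration, not something from this post.

import random
from collections import defaultdict

alpha = 0.1    # learning rate α, between 0 and 1
gamma = 0.9    # discount factor ɣ
epsilon = 0.1  # exploration rate for epsilon-greedy action choice

Q = defaultdict(float)  # Q[(state, action)], starts at 0.0

def choose_action(state, actions):
    # Mostly pick the best-known action, sometimes explore a random one.
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def q_learning_step(state, action, reward, next_state, actions):
    # Temporal difference: new estimate minus old estimate.
    td = reward + gamma * max(Q[(next_state, a)] for a in actions) - Q[(state, action)]
    # Nudge the old Q-value by a fraction alpha of the TD error.
    Q[(state, action)] += alpha * td
    return td

# Training loop sketch, assuming a hypothetical environment object `env`
# with reset() and step(action) methods and a list of actions.
# for episode in range(1000):
#     state = env.reset()
#     done = False
#     while not done:
#         action = choose_action(state, env.actions)
#         next_state, reward, done = env.step(action)
#         td = q_learning_step(state, action, reward, next_state, env.actions)
#         state = next_state
#     # When the TD errors stay close to 0, the Q-values have converged.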