
AI Simplified 3 : Q Learning - From State to Action

Q Learning

Previously we looked at the value of a state. Q-learning instead calculates the value of an action, so the agent now chooses its moves based on the values of actions rather than the values of states. The "Q" is commonly thought to be shorthand for "quality". Now let's derive the equation for Q-learning.

Deriving the equation

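Remember the Bellman equation for a stochastic environment (a Markov decision process)?

V(s) = maxₐ (R(s,a) + ɣ ∑s' P(s, a, s') V(s'))
The value of a state equals the value of the best action available in it, i.e. V(s) = maxₐ Q(s,a). The expression inside the max is therefore the value of one particular action: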



Q(s,a) = R(s,a) + ɣ ∑s' P(s, a, s') V(s')
There is no max here because we are evaluating just one action rather than picking the best among the alternatives. To wean ourselves off V, we need to replace V(s'), the value of each possible next state. Since V(s') = maxₐ' Q(s', a'), the equation becomes



Q(s,a) = R(s,a) + ɣ ∑s' P(s, a, s') maxₐ' Q(s', a')
Why max? Because we assume the agent takes the best available action in the next state, so each possible next state is worth the value of its best action. Everything is now expressed through a single function Q, which makes it easier to calculate. These action qualities are called Q values.
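To make the equation concrete, here is a minimal sketch in Python. The two-state environment, its rewards R, its transition probabilities P and the discount ɣ = 0.9 are all made up for illustration; only the q_backup function mirrors the equation above.

```python
# Hypothetical two-state, two-action environment, used only for illustration.
GAMMA = 0.9  # discount factor ɣ (assumed value)

# R[s][a]: immediate reward for taking action a in state s.
R = {
    "A": {"left": 0.0, "right": 1.0},
    "B": {"left": 0.0, "right": 2.0},
}

# P[s][a][s']: probability of landing in s' after taking action a in state s.
P = {
    "A": {"left": {"A": 1.0, "B": 0.0}, "right": {"A": 0.2, "B": 0.8}},
    "B": {"left": {"A": 0.8, "B": 0.2}, "right": {"A": 0.0, "B": 1.0}},
}


def q_backup(Q, s, a):
    """Q(s,a) = R(s,a) + ɣ ∑s' P(s,a,s') maxₐ' Q(s',a')."""
    expected_next = sum(
        prob * max(Q[s_next].values())  # best action value in each next state
        for s_next, prob in P[s][a].items()
    )
    return R[s][a] + GAMMA * expected_next


# Start from all-zero Q values and apply the backup twice to every (s, a) pair.
Q0 = {s: {a: 0.0 for a in R[s]} for s in R}
Q1 = {s: {a: q_backup(Q0, s, a) for a in R[s]} for s in R}
Q2 = {s: {a: q_backup(Q1, s, a) for a in R[s]} for s in R}
```

After the first pass the Q values are just the immediate rewards, because every next state is still worth 0. The second pass starts to fold in the discounted value of the best next action.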


An important piece of the puzzle now is the temporal difference.

Temporal Difference

Temporal difference is a very important piece of Q-learning. It lets us update the Q values as the agent experiences the environment over time, which helps when calculating the values of actions in a non-deterministic environment. This is especially helpful if your environment keeps changing. Once the temporal difference settles around 0 (on average, when outcomes are random), the algorithm has converged.

Deriving temporal difference equation

For easier understanding, let's temporarily use the deterministic Bellman equation, which is

V(s) = maxₐ (R(s,a) + ɣ V(s'))
Let's rewrite this in terms of Q values:
 


Q(s,a) = R(s,a) + ɣ maxₐ' Q(s',a')
Now let's imagine our agent has moved. The temporal difference (the difference between the new and the old Q value) is then defined as:


TD(a,s) = R(s,a) + ɣ maxₐ' Q(s',a') - Q(s,a)
R(s,a) + ɣ maxₐ' Q(s',a') : the new Q value.
Q(s,a) : the old Q value.
Does that mean we dispose of the old Q value? No. If the move our agent made had a stochastic outcome, the new Q value could just reflect a random and possibly one-time event, and we wouldn't want to overwrite our Q values because of it. Instead, we nudge the old Q value toward the new one:
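As a quick illustration, the temporal difference for a single observed move can be sketched like this. The Q table, the state and action names and ɣ = 0.9 are assumptions picked for the example.

```python
GAMMA = 0.9  # discount factor ɣ (assumed value)


def td_error(Q, s, a, reward, s_next):
    """TD(a,s) = R(s,a) + ɣ maxₐ' Q(s',a') - Q(s,a)."""
    new_q = reward + GAMMA * max(Q[s_next].values())  # the new Q value
    old_q = Q[s][a]                                   # the old Q value
    return new_q - old_q


# Hypothetical Q table: Q[state][action].
Q = {
    "A": {"left": 0.0, "right": 1.0},
    "B": {"left": 0.5, "right": 2.0},
}

# The agent took "right" in state A, received reward 1.0 and landed in state B.
td = td_error(Q, "A", "right", reward=1.0, s_next="B")
```

A positive temporal difference means the move turned out better than the old Q value predicted, and a negative one means it turned out worse.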


Qₜ(s,a) = Qₜ₋₁(s,a) + α TDₜ(a,s)
The new variable α represents our learning rate (it has to be between 0 and 1). This equation sums up how Q values are updated. If you want the whole equation in one line, it becomes


Qₜ(s,a) = Qₜ₋₁(s,a) + α (R(s,a) + ɣ maxₐ' Q(s',a') - Qₜ₋₁(s,a))
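To see the update rule at work, here is a minimal sketch that applies it repeatedly while an agent wanders around a tiny made-up environment. The four-cell corridor, the reward of 1 for reaching the last cell, the purely random choice of actions, and the values α = 0.5 and ɣ = 0.9 are all assumptions made for the example.

```python
import random

ALPHA = 0.5        # learning rate α (assumed value)
GAMMA = 0.9        # discount factor ɣ (assumed value)
GOAL = 3           # last cell of a hypothetical corridor with cells 0, 1, 2, 3
ACTIONS = (-1, +1)  # step left or step right


def step(s, a):
    """Move along the corridor; walls keep the agent inside cells 0..GOAL."""
    s_next = min(max(s + a, 0), GOAL)
    reward = 1.0 if s_next == GOAL else 0.0
    return reward, s_next


def q_update(Q, s, a, reward, s_next):
    """Q(s,a) <- Q(s,a) + α (R(s,a) + ɣ maxₐ' Q(s',a') - Q(s,a))."""
    td = reward + GAMMA * max(Q[s_next].values()) - Q[s][a]
    Q[s][a] += ALPHA * td


def train(episodes=500, seed=0):
    rng = random.Random(seed)
    # Q values start at 0; the goal cell is never updated, so it stays at 0.
    Q = {s: {a: 0.0 for a in ACTIONS} for s in range(GOAL + 1)}
    for _ in range(episodes):
        s = 0
        while s != GOAL:
            a = rng.choice(ACTIONS)  # explore by acting at random
            reward, s_next = step(s, a)
            q_update(Q, s, a, reward, s_next)
            s = s_next
    return Q


Q = train()
```

With these settings the Q values for stepping right should approach 1.0 in cell 2, 0.9 in cell 1 and 0.81 in cell 0, each one being the next cell's best value discounted by ɣ. At that point the temporal differences are close to 0.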















