AI Simplified 3 : Q Learning - From State to Action

Q Learning

Previously we looked at the value of a state. Q-learning moves on to the value of an action: the agent now chooses its move by comparing the values of the available actions rather than the values of the states they lead to. The Q is commonly taken to be shorthand for "quality", as in the quality of an action. Now let's derive the equation for Q-learning.

Deriving the equation

Remember the Bellman equation for a stochastic environment (a Markov decision process)?

V(s) = maxₐ (R(s,a)+ ɣ∑s' P(s, a, s') V(s'))
The value of a state equals the value of its best action, i.e. V(s) = maxₐ Q(s,a). The expression inside the max is therefore the value of a single action:

Q(s,a) = R(s,a)+ ɣ∑s' P(s, a, s') V(s')
There is no max here because we are evaluating just one action rather than comparing all of the alternatives. To wean ourselves off V we need to replace V(s'), the value of the next state s'. Since the value of a state is the value of its best action, V(s') = maxₐ' Q(s', a'). With that it becomes

Q(s,a) = R(s,a) + ɣ ∑s' P(s, a, s') maxₐ' Q(s', a')
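Why max? Because the value of the next state is the value of the best action available from it, so we look at every possible next action a' and keep the largest value. Everything is now expressed with one function, Q, which makes it easier to calculate. The values this function assigns to actions are called Q-values.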
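To make the equation concrete, here is a minimal Python sketch of its right-hand side. The two-state, two-action problem, the reward table R, the transition table P and the discount factor of 0.9 are all made up for illustration and do not come from this series.

```python
GAMMA = 0.9  # discount factor (assumed value)

# Hypothetical problem with 2 states and 2 actions, invented for this example.
# R[s][a] is the reward for taking action a in state s.
R = [[0.0, 1.0],
     [2.0, 0.0]]

# P[s][a][s2] is the probability of landing in state s2 after taking a in s.
P = [[[0.8, 0.2], [0.1, 0.9]],
     [[0.5, 0.5], [0.0, 1.0]]]


def q_backup(Q, s, a, gamma=GAMMA):
    """Compute R(s,a) + gamma * sum over s' of P(s,a,s') * max over a' of Q(s',a')."""
    expected_next = sum(
        P[s][a][s2] * max(Q[s2])  # max(Q[s2]) plays the role of V(s')
        for s2 in range(len(Q))
    )
    return R[s][a] + gamma * expected_next


# With every Q-value at zero, only the immediate reward is left.
Q_zero = [[0.0, 0.0], [0.0, 0.0]]
value_from_zero = q_backup(Q_zero, 0, 1)

# With some arbitrary current estimates, the next states contribute too.
Q_guess = [[1.0, 3.0], [2.0, 0.5]]
value_from_guess = q_backup(Q_guess, 0, 0)
```

Note that the function only ever reads Q: the state value V never appears, which is the point of the substitution.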

An important piece of the puzzle now is the temporal difference.

Temporal Difference

The temporal difference measures how a Q-value changes over time as the agent interacts with its environment. It is a very important piece of Q-learning because it lets us calculate the values of actions in a non-deterministic environment. This is especially helpful if your environment keeps changing. Once the temporal difference converges to 0, the algorithm has converged.

Deriving temporal difference equation

For easier understanding, let's temporarily use the deterministic Bellman equation, which is

V(s) = maxₐ (R(s,a) + ɣ V(s'))
Let's rewrite this in terms of Q-values. There is no max in front of R because we are looking at one particular action:

Q(s,a) = R(s,a) + ɣ maxₐ' Q(s',a')
Now let's imagine our agent has moved. The temporal difference (the difference between the new and the old Q-value) is then defined as:

TD(a,s) = R(s,a) + ɣ maxₐ' Q(s',a') - Q(s,a)
R(s,a) + ɣ maxₐ' Q(s',a') : the new Q-value.
Q(s,a) : the old Q-value.
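As a rough sketch, the temporal difference for one observed move could be computed like this in Python. The Q-table values, the reward and the discount factor of 0.9 are invented for the example, and td_error is a hypothetical helper name.

```python
def td_error(Q, s, a, reward, s_next, gamma=0.9):
    """Temporal difference: (new Q-value) minus (old Q-value)."""
    new_q = reward + gamma * max(Q[s_next])  # R(s,a) + gamma * max over a' of Q(s',a')
    old_q = Q[s][a]                          # current estimate of Q(s,a)
    return new_q - old_q


# Hypothetical Q-table: Q[state][action], values made up for illustration.
Q = [[0.0, 0.5],
     [1.0, 2.0]]

# The agent took action 1 in state 0, received reward 1.0 and landed in state 1.
td = td_error(Q, s=0, a=1, reward=1.0, s_next=1)  # 1.0 + 0.9 * 2.0 - 0.5
```

A positive result means the move turned out better than the old Q-value predicted, a negative one means it turned out worse.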
Does that mean we dispose of the old Q-value? No. If the move our agent made had a stochastic outcome, the new Q-value could reflect a random and possibly one-time event, and we wouldn't want to overwrite our estimate because of that. Instead, we move the old Q-value only part of the way towards the new one:

Qₜ(s,a) = Qₜ₋₁(s,a) + α TDₜ(a,s)
The new variable α is our learning rate (it has to be between 0 and 1). This equation sums up how Q-values are updated. If you want the whole equation in one line, it becomes

Qₜ(s,a) = Qₜ₋₁(s,a) + α (R(s,a) + ɣ maxₐ' Qₜ₋₁(s',a') - Qₜ₋₁(s,a))
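The update can be sketched in a few lines of Python. This is only an illustration: the Q-table, the learning rate of 0.5, the discount factor of 0.9 and the repeated move are assumptions chosen to keep the numbers easy to follow, and q_update is a hypothetical helper name.

```python
def q_update(Q, s, a, reward, s_next, alpha=0.5, gamma=0.9):
    """Apply one Q-learning update in place and return the temporal difference."""
    td = reward + gamma * max(Q[s_next]) - Q[s][a]
    Q[s][a] += alpha * td  # move the old Q-value part of the way towards the new one
    return td


# Hypothetical Q-table: Q[state][action], all estimates start at zero.
Q = [[0.0, 0.0],
     [0.0, 0.0]]

# Repeat the same move three times: action 1 in state 0, reward 1.0, landing in state 1.
tds = [q_update(Q, s=0, a=1, reward=1.0, s_next=1) for _ in range(3)]

print(tds)      # temporal difference of each update
print(Q[0][1])  # Q-value after three updates
```

In this toy run the temporal difference halves with every update (1.0, 0.5, 0.25) while Q[0][1] climbs towards 1.0, which is the convergence behaviour described in the temporal difference section.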

