
AI Simplified 3 : Q Learning - From State to Action

Q Learning

Previously we looked at the value of a state. Q-Learning now moves to calculating the value of an action: the agent chooses moves based on the value of its actions rather than the value of the states themselves. The "Q" is commonly read as shorthand for "quality". Now let's derive the Q-learning equation.

Deriving the equation

Remember the stochastic Bellman equation for a Markov Decision Process?

V(s) = maxₐ (R(s,a)+ ɣ∑s' P(s, a, s') V(s'))
The value of a state is the value of the best action we can take from it, i.e. V(s) = maxₐ Q(s,a). If we fix a single action a instead of taking the max, the bracketed term becomes the value of that action:



Q(s,a) = R(s,a)+ ɣ∑s' P(s, a, s') V(s')
There is no max here because we are not considering all of the alternative actions, only this one. We still need to wean ourselves off V, so we need to replace V(s'), the value of the possible next state. Note that V(s') is itself maxₐ' Q(s', a'), the value of the best action taken from s'. Substituting this in gives



Q(s,a) = R(s,a) + ɣ ∑s' P(s, a, s') maxₐ' Q(s', a')
Why the max? We still want the value of the best possible next action a'. Everything is now written in terms of a single function Q, which makes it easier to calculate. These qualities of actions are what we call Q-values.
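As a rough sketch, the right-hand side above can be evaluated directly when the model is known. The names R, P, Q and q_value below are illustrative assumptions (plain NumPy arrays), not part of any particular library:

import numpy as np

# Sketch of Q(s,a) = R(s,a) + gamma * sum_s' P(s,a,s') * max_a' Q(s',a')
#   R: rewards, shape (n_states, n_actions)
#   P: transition probabilities, shape (n_states, n_actions, n_states)
#   Q: current Q-table, shape (n_states, n_actions)
def q_value(s, a, R, P, Q, gamma=0.9):
    best_next = np.max(Q, axis=1)               # max_a' Q(s', a') for every s'
    expected_next = np.dot(P[s, a], best_next)  # sum over s', weighted by P(s, a, s')
    return R[s, a] + gamma * expected_next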


An important piece of the puzzle now is the temporal difference.

Temporal Difference

It lets us update the Q-values in response to changes in the environment over time, and it is a very important piece of Q-learning. It is what makes it possible to learn the value of actions in a non-deterministic environment: if the environment keeps changing, the temporal difference keeps correcting our estimates, and once it converges to 0 the algorithm has converged.

Deriving the temporal difference equation

For easier understanding, let's temporarily use the deterministic Bellman equation:

V(s) = maxₐ (R(s,a) + ɣ V(s'))
Let's refactor this in terms of Q-values:
 


Q(s,a) = R(s,a) + ɣ maxₐ' Q(s',a')
Now let's imagine our agent has moved. The temporal difference (the difference between the new and old Q-value estimates) is then defined as:


TD(a,s) = R(s,a) + ɣ maxₐ' Q(s',a') - Q(s,a)
(R(s,a) + ɣ maxₐ' Q(s',a')) : the new Q-value estimate.
Q(s,a) : the old Q-value.
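For instance, with purely illustrative numbers: if R(s,a) = 2, ɣ = 0.9, maxₐ' Q(s',a') = 10 and the old Q(s,a) = 8, then TD = 2 + 0.9 × 10 - 8 = 3, so the new estimate sits 3 above what we currently believe.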
Does that mean we dispose of the old Q-value? No. If the move our agent made was stochastic, the new estimate could just be another random value, and we wouldn't want to overwrite our Q-values because of a random, possibly one-time event. So the Q-value should be updated as


Qₜ(s,a) = Qₜ₋₁(s,a) + α TDₜ(a,s)
The new variable α represents our learning rate (it has to be between 0 and 1). This equation sums up how Q-values are updated. If you want the whole equation in one line, it becomes


Qₜ(s,a) = Qₜ₋₁(s,a) + α (R(s,a) + ɣ maxₐ' Q(s',a') - Qₜ₋₁(s,a))
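Putting the pieces together, here is a minimal tabular Q-learning sketch. The environment interface (n_states, n_actions, reset(), step()) and the epsilon-greedy action choice are assumptions added for illustration, not something defined above:

import numpy as np

# Minimal tabular Q-learning. `env` is assumed to expose n_states, n_actions,
# reset() -> state, and step(action) -> (next_state, reward, done).
def q_learning(env, episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1):
    Q = np.zeros((env.n_states, env.n_actions))
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # Explore occasionally, otherwise take the best-known action
            if np.random.rand() < epsilon:
                a = np.random.randint(env.n_actions)
            else:
                a = int(np.argmax(Q[s]))
            s_next, r, done = env.step(a)
            # Temporal difference: new estimate minus old estimate
            target = r if done else r + gamma * np.max(Q[s_next])
            td = target - Q[s, a]
            # Update rule: Q_t(s,a) = Q_(t-1)(s,a) + alpha * TD
            Q[s, a] += alpha * td
            s = s_next
    return Q

Note that on the last step of an episode the target is just the immediate reward, since there is no next state to bootstrap from, and once the TD terms settle around zero the Q-table has effectively converged.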