
AI Simplified 2 : Bellman and Markov

Bellman equation 

It is named after Richard Bellman and expresses a necessary condition for optimality. It is associated with the mathematical optimization method known as dynamic programming. The deterministic form of the equation is:

V(s) = maxₐ (R(s,a) + ɣV(s'))


maxₐ: take the maximum over all the possible actions a we can take.
R(s, a): the reward for taking action a in state s.
ɣ: the discount factor. It works like the time value of money: rewards further in the future are worth less today.
s': the next state we land in after taking action a.

You can see the dynamic programming aspect in the fact that the same value function V is applied again to the next state s', so the equation is solved recursively.
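The recursion above can be sketched in a few lines of Python. The environment here is a hypothetical 4-cell corridor (entirely made up for illustration): moving right from cell 3 reaches the goal and earns reward 1, every other move earns 0. The recursion is truncated at a fixed depth so it always terminates.

```python
# Minimal sketch of the deterministic Bellman equation:
# V(s) = max_a (R(s,a) + gamma * V(s'))
# Environment: a hypothetical 4-cell corridor with a goal at cell 4.

GAMMA = 0.9  # discount factor
GOAL = 4     # terminal state

def step(s, a):
    """Deterministic transition: returns (next_state, reward)."""
    s2 = max(0, min(GOAL, s + a))
    return s2, (1.0 if s2 == GOAL else 0.0)

def value(s, depth=12):
    """Recursive Bellman value, truncated at `depth` to guarantee termination."""
    if s == GOAL or depth == 0:
        return 0.0
    return max(r + GAMMA * value(s2, depth - 1)
               for a in (-1, +1)           # the two possible actions: left, right
               for s2, r in [step(s, a)])

print(round(value(3), 3))  # one step from the goal -> 1.0
```

Note how `value` calls itself on the next state s', exactly as the equation does; a real implementation would add memoization so each state is only solved once.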

Deterministic vs non-deterministic: deterministic means definite, with no randomness, whereas non-deterministic means stochastic (random).

Above we added no randomness (deterministic), but nothing in this world is truly predictable, so let's add some. Now each action is no longer certain to do what we intended (we introduce probability). This makes our agent more natural (being drunk! lol): the agent might have an 80% probability of going where we tell it, a 10% probability of veering one way, and a 10% probability of veering the other.

Markov Decision Process 

This is essentially the Bellman equation extended to handle non-deterministic scenarios. We loop through each possible next state, multiply the value of that state by its probability of occurring, and sum the results.
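That "multiply each next-state value by its probability and sum" step is just an expected value. Here is a minimal sketch using the 80/10/10 drunk-agent probabilities from above; the next-state values and the reward are made-up numbers purely for illustration.

```python
GAMMA = 0.9  # discount factor

# Hypothetical transition outcomes for one (state, action) pair:
# 80% chance of going where intended, 10% for each sideways slip.
transitions = [
    # (probability, value of the next state)
    (0.8, 10.0),  # intended direction
    (0.1, 2.0),   # slips one way
    (0.1, 0.0),   # slips the other way
]

reward = 1.0  # R(s, a): illustrative reward for taking this action here

# Expected discounted value: R(s,a) + gamma * sum over s' of P(s'|s,a) * V(s')
q_value = reward + GAMMA * sum(p * v for p, v in transitions)
print(round(q_value, 2))  # 1.0 + 0.9 * (8.0 + 0.2 + 0.0) = 8.38
```

The loop-and-sum is the only change from the deterministic case: instead of trusting a single next state, we average over all of them, weighted by how likely they are.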

The Markov Property

This property states that the past does not matter: the future depends only on the present. So the process only looks at the current state. This also saves memory, since we do not need to store past states.

Concept

A stochastic process has the Markov property if the conditional probability of its future states depends only on the current state. The MDP provides a mathematical framework for modeling decision making where outcomes are partly random and partly under the control of the decision-maker. It adds realism to the Bellman equation by turning it from deterministic into non-deterministic.

The equation

S: the state our agent is in.
Model: T(s, a, s') ~ Pr(s'|s, a). Also known as the transition model. It captures the physics or rules of the environment, and the function produces the probability of landing in state s' after taking action a in state s.
Actions: A(s), A. The things you can do in a state.
Reward: R(s), R(s, a), R(s, a, s'). Represents the reward of being in a state (the main focus is R(s)). This is the reward function.
Policy: 𝝿(s) → a, with 𝝿* denoting the optimal policy. The solution to an MDP. It differs from a plan: a plan is just a sequence of actions, whereas a policy tells you what to do in any state. You can think of it as a set of state-action key-value pairs.

V(s) = maxₐ (R(s,a) + ɣ ∑ₛ' P(s, a, s') V(s'))
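Putting it all together, the equation above can be solved by repeatedly applying the update until the values stop changing (value iteration), and the policy then simply picks the best action in each state. Below is a minimal sketch on a tiny made-up two-state MDP; the states, action names ("stay", "go"), probabilities, and rewards are all invented for illustration, and rewards here use the R(s, a, s') form.

```python
# Value iteration sketch: V(s) = max_a (R(s,a,s') averaged over s', discounted).
# P[s][a] is a list of (probability, next_state, reward) triples.
# All numbers are illustrative, not from any real environment.

GAMMA = 0.9

P = {
    0: {"stay": [(1.0, 0, 0.0)],
        "go":   [(0.8, 1, 0.0), (0.2, 0, 0.0)]},
    1: {"stay": [(1.0, 1, 1.0)],   # state 1 pays 1.0 per step if we stay
        "go":   [(1.0, 0, 0.0)]},
}

V = {s: 0.0 for s in P}
for _ in range(200):  # repeat the Bellman update until values settle
    V = {s: max(sum(p * (r + GAMMA * V[s2]) for p, s2, s2r in []) if False else
                sum(p * (r + GAMMA * V[s2]) for p, s2, r in P[s][a])
                for a in P[s])
         for s in P}

# The policy pi(s) -> a picks, in each state, the highest-valued action.
policy = {s: max(P[s], key=lambda a, s=s: sum(p * (r + GAMMA * V[s2])
                                              for p, s2, r in P[s][a]))
          for s in P}
print(policy)  # {0: 'go', 1: 'stay'}
```

Note how the policy comes out as key-value pairs, one action per state, exactly as described above: the agent heads for the rewarding state and then stays there.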


Living penalty: a negative reward for simply being in a (non-terminal) state. The main incentive it creates is for the agent to want to finish the game as quickly as possible.
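A quick illustration of why the living penalty matters, with made-up numbers: without it, a direct path and a long wandering path to the exit score exactly the same, so the agent has no reason to hurry.

```python
# Sketch of the living penalty's effect. Assumed setup: the agent earns a
# fixed exit reward plus one penalty per step taken. Numbers are illustrative.

def episode_return(n_steps, living_penalty, exit_reward=10.0):
    """Total reward for an episode that reaches the exit in n_steps moves."""
    return n_steps * living_penalty + exit_reward

# Without a penalty, a 3-step path and a wandering 30-step path tie:
print(round(episode_return(3, 0.0), 1), round(episode_return(30, 0.0), 1))    # 10.0 10.0
# With a -0.1 living penalty, the shorter path clearly wins:
print(round(episode_return(3, -0.1), 1), round(episode_return(30, -0.1), 1))  # 9.7 7.0
```

The more negative the penalty, the stronger the pressure to finish fast; make it too harsh, though, and the agent may prefer risky shortcuts over safe routes.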


