Reinforcement Learning with Multi Arm Bandit (Part 2)

This is Part 2 of two part articles. The combined article is present in the Lazy Data Science Guide.

Let make the problem a little bit more….complex!


This is in continuation of the original post, I highly recommend to go through it first, where we understood the intuition of multi arm bandit and tried to apply e-greedy algorithms to solve a representative problem. But the world isn’t that simple. There are some factors, introducing which, the problem changes completely and the solution has to be redefined. We will pick where we left, introduce new problem, show how our beloved algorithm fails and try to build intuition of what might help in the case.

Non-stationary problems

# define a function to update the expected_action_value  
def update_expected_action_value(expected_action_value, arms):  
    expected_action_value += np.random.normal(0, 0.01, arms)   
# inside the multi_arm_bandit_problem function add,   
estimate_action_value = update_estimate_action_value(estimate_action_value)


Try to recall the estimation function of the true reward distribution for stationary problem, it goes something like this,


  1. The first equation gives the formula for the estimated reward value for an action at t which is a simple average of all the rewards we received for that action till time t-1
  2. The second equation is just a nice way of writing the same thing. It implies that the estimation of the reward for n+1 step will be average of all the rewards till step n
  3. The third equation is what you get when you expand the second equation and put in the formulae of $Q_n$ which is similar to the one of $Q_{n+1}$, just one step less (replace n with n-1). Here, $Q_{n+1}$ is the new estimation for the n+1 steps, $Q_n$ is the old estimation i.e. estimation till step n, $R_n$ is the rewards for nth step and 1/n is step size by which we want to update the new estimation.

To be clear, as per this formulation, if a particular action was chosen say 5 times, then each of the 5 rewards (n actions leads to n rewards) will be divide by 1/5 and then added to get the estimation till step 5. If you look closer, you will see we are giving equal weights to all the rewards, irrespective of their time of occurrence, which means we want to say, every reward is equally important to us and hence the equal weights. This holds true for the stationary problems but what about the newer problem? With rewards distribution changing, isn’t the latest rewards better estimation of the true rewards distribution. So shouldn’t we just give more weight to the newer rewards and lesser to the old one?

Well this thought is definitely worth pursuing. And this would mean just changing the reward weights, which can be done by replacing the average reward estimation to exponential recency-weighted average. We can further make this process generic by providing an option of if the newer or older rewards should be given more weight or a middle workaround of decreasing weights with time. As it turns out this could be easily done by replacing the step function of 1/n in the older formulation with a constant, say 𝛂. This updates the estimation function to,


  1. If 𝛂 = 1; $R_n$ i.e. the very latest reward will have maximum weight of 1 and rest will have 0. So if your excepted action value’s deviation is too drastic, we could use this setting.
  2. If 𝛂 = 0; $Q_1$ i.e. the oldest reward estimation will have maximum weight of 1 and rest will have 0. We could use this when we only want to consider initial estimation values.
  3. If 0 < 𝛂 < 1; the weight decreases exponentially according to the exponent 1-𝛂. In this case the oldest reward will have smaller weight and latest rewards higher. And this is what we want to try out.

The non-stationary solution

Lets formalize the solution by simply updating the code to replace the step size by an constant of value, say 0.1 and keeping rest of the parameters same. This will implement the exponential decreasing weight. Later we will compute the optimal action selection percentage and reflect on it.

# update the estimation calculation  
estimate_action_value[action] = \
    estimate_action_value[action] + \
    0.1 * (reward - estimate_action_value[action])

And we are back in business! e=0.01 is out-performing it’s rival and the performance converges to maximum after some steps. Here we are not seeing any decrease in performance, because of our modified estimation function which factor the changing nature of reward distribution.


The take away from this post could be understanding the stationary and non-stationary environment. The art of designing estimation functions by questioning and considering all the properties of the environment. And how sometimes minimal but intuitive changes could lead to maximum advancements in accuracy and performance. We still have some set of problems and improvements left which could lead to better accuracy and optimization, but later on that.

Cheers. :wave:


[1] Reinforcement Learning — An Introduction; Richard S. Sutton and Andrew G. Barto