Explore versus Exploit - Part 1
11 Aug 2019

Prologue To Explore versus Exploit - Part 1
Begin with K-Armed Bandits
[1] K-armed bandits

Suppose we are given a K-armed bandit, where $K$=$7$ in the exhibition below. The bandit has no state transition; it is all about stochasticity: you pull an arm, and you either get a jackpot or you don't.
[2] Which one will you pull?
In this example, suppose each machine has a different payoff. The question is: which bandit will you pull?
[3] We have to explore
➀The one that has the best chance of winning the jackpot, giving you money.
➁The point is that we don't know what the payoffs are, so it is hard to decide which bandit to choose.
➂If we knew the payoffs, we could just take $\max_{arm}Bandit(arm)$ to get the answer.

Suppose each bandit has some probability of paying off a one, and some probability of paying off nothing, say zero. But we don't know what those probabilities are at first.
If we pull bandit a and it doesn't pay off, does that mean bandit a is not the best one? No; we don't know what any of the others are, either.
One arm pull gives us some information, but not an awful lot. We have to combine information across a whole lot of pulls to start to evaluate which is the best bandit.
So, we have to explore!!
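To make this concrete, below is a minimal sketch of such a bandit in Python. The class name `BernoulliBandit` and the payoff probabilities are invented for illustration; the player never gets to see `payoffs` directly.

```python
import random

class BernoulliBandit:
    """A K-armed bandit: arm i pays 1 with hidden probability payoffs[i], else 0."""
    def __init__(self, payoffs):
        self.payoffs = payoffs  # true payoff probabilities, unknown to the player

    def pull(self, arm):
        # No state transition: every pull is an independent Bernoulli trial.
        return 1 if random.random() < self.payoffs[arm] else 0

# K = 7 arms as in the exhibition; these probabilities are made up for illustration.
bandit = BernoulliBandit([0.1, 0.3, 0.15, 0.2, 0.25, 0.05, 0.2])
print(bandit.pull(0))  # a single pull returns only 0 or 1 -- not much information
```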
[4] Illustration

Suppose each of these bandits has been pulled multiple times, and we record the number of pulls and the number of times it pays off 1, with the former in the denominator and the latter in the numerator, as given in the exhibition below.
Between a and e, it looks as if bandit e is better than bandit a: over the same 10 pulls, e has paid off 2 times, while a paid off just once. Next, examine a against d: bandit d gave one payoff after only 5 pulls, so it might also be plausible to prefer bandit d over a.
By the given data, we have 2 concerns:
➀which bandit has the highest expected pay off?
➁which are we most confident in?

For ➀, the bandits d, e, f, g just play to a tie, since
$\frac {1}{5}\cdot 1+\frac {4}{5}\cdot 0$, $\frac {2}{10}\cdot 1+\frac {8}{10}\cdot 0$, $\frac {4}{20}\cdot 1+\frac {16}{20}\cdot 0$, $\frac {8}{40}\cdot 1+\frac {32}{40}\cdot 0$ are the expected pay offs for bandits d, e, f, g respectively. They all equal $\frac {1}{5}$, higher than the rest.

For ➁, as to the confidence level, if you choose g it might be a good answer, but we need the right reason. When it comes to expectation, the confidence level is a monotonically increasing function of the number of samples. So, whichever bandit has the largest number of pulls should be the one we are most confident in, because it has the most samples. We finally compare c and g, and choose g.
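A few lines of Python, using the counts read off the exhibition (bandit c's counts are not listed in the text, so it is left out here), reproduce both comparisons:

```python
# (times paid 1, total pulls) for each bandit, as given in the exhibition.
records = {"a": (1, 10), "d": (1, 5), "e": (2, 10), "f": (4, 20), "g": (8, 40)}

for arm, (wins, pulls) in records.items():
    mean = wins / pulls  # expected pay off: wins*1 + (pulls - wins)*0, over pulls
    print(f"{arm}: mean={mean:.2f}, pulls={pulls}")

# d, e, f, g all tie at 0.20; among the tied arms, g has the most pulls,
# hence the most confident estimate.
```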
Confidence-Based Exploration
[1] Something else in stochasticity

Given 2 bandits a and b, you make 1 pull of a and get 0 pay off, whereas you make 2 pulls of b and get 1 pay off. Do you think the best thing at this point is to always pick a, always pick b, or something else?
Trivially, something else.
First, let's calculate the probability for:
➀always a to get 0 pay off
$\frac {1}{C_{1}^{3}}\cdot 1$=$\frac {1}{3}$
Since there exists 1 pull of bandit a out of the 3 pulls, and it indeed came out with 0 pay off.

➁always b to get 1 pay off
$\frac {1}{C_{2}^{3}}\cdot 2$=$\frac {2}{3}$
We treat the 3 pulls as distinct units. By the given data, there exist 2 pulls of bandit b, which came out with 1 pay off in total. Out of the 3 pulls, if we prefer always bandit b, each pull is not replaceable, and the 1 pay off might have appeared in either the 1st or the 2nd pull of b; that's why we multiply by 2 at the end.

Suppose you choose bandit b by intuition: you are right $\frac {2}{3}$ of the time, but with $\frac {1}{3}$ chance you make a mistake. That's why we end up with something else.
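The arithmetic above can be checked with Python's `math.comb`; the counting follows the argument just given:

```python
from math import comb

# 3 pulls in total: 1 of bandit a (paid 0) and 2 of bandit b (one pull paid 1).
p_always_a = (1 / comb(3, 1)) * 1  # = 1/3
p_always_b = (1 / comb(3, 2)) * 2  # = 2/3: the payoff may sit in b's 1st or 2nd pull

print(p_always_a, p_always_b)      # 0.333..., 0.666...
```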
[2] What is something else?

If we run an arbitrary number of pulls, what is something else? I think we might just evaluate which bandit to pull after some time, or an infinite amount of time later. Then I might know the true expected value of pulling bandit a or b, so I know which to choose.
It refers to whatever I'd like to do to gain more confidence in my expectation, which explicitly means more pulls. Getting more confidence in the expectation would be the next step:
[3] Maximum likelihood
➀maximum likelihood
➁maximum confidence level
➂minimum confidence level

The maximum likelihood approach here is to find the bandit which has the highest expected pay off, choose it, and then keep revisiting this question.
Each time you pull a bandit, recalculate your expected values; choose the highest pay off bandit, and continue, and so on.
As the new data of each next pull comes in, we redo our calculation and make a new decision again. But that's not a good approach.
By the given case, bandit b is the maximum-likelihood choice, say $MLE_{a}$=$0$, $MLE_{b}$=$0.5$. Under maximum likelihood, you will just choose b at the beginning, since $MLE_{b}$>$MLE_{a}$.
When I choose b for the very first time, whatever the outcome, the expected pay off of b remains higher than bandit a's. Next, I continue to choose b; it still has an expected pay off higher than 0. Over a long, long period such behavior just converges toward the true expectation, and the estimate only approaches 0 after an infinite amount of time.
The major point is that even if I choose b and it never pays off, it still has a non-zero expected pay off, which implies that I never get the chance to choose a. It is equivalent to saying that $\frac {2}{3}$ of the time you are right, but $\frac {1}{3}$ of the time you make a mistake.
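Here is a minimal sketch of this maximum-likelihood (greedy) rule; the function name `max_likelihood_pick` is made up for illustration:

```python
def max_likelihood_pick(wins, pulls):
    """Greedy rule: pick the arm with the highest empirical mean pay off."""
    means = [w / n if n > 0 else 0.0 for w, n in zip(wins, pulls)]
    return max(range(len(means)), key=lambda i: means[i])

# The given case: a pulled once with 0 pay off, b pulled twice with 1 pay off.
wins, pulls = [0, 1], [1, 2]
print(max_likelihood_pick(wins, pulls))  # 1, i.e. bandit b, since 0.5 > 0.0
# Even if b never pays off again, its estimate 1/n stays strictly above 0,
# so this rule never switches to a.
```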
We can do better than that.
[4] Maximum confidence level

The maximum confidence level strategy is: whichever bandit you have the best estimate of, choose that one. But that's a bad idea, too. Why?
Because whichever one I start out with must have the (currently maximum) confidence level, which implies that I will keep gathering more and more data that further consolidates the currently most confident one.
Although the bandit with the current maximum confidence level might not guarantee its future pay off, it can still keep a non-zero expected pay off over some period of time. In this example, the strategy will always choose b.
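A sketch of this strategy, assuming we measure confidence simply by the number of pulls (the monotonic relation noted earlier); `max_confidence_pick` is a hypothetical name:

```python
def max_confidence_pick(pulls):
    """Pick the arm whose estimate we trust most, i.e. the most-pulled arm."""
    return max(range(len(pulls)), key=lambda i: pulls[i])

pulls = [1, 2]                     # a pulled once, b twice
print(max_confidence_pick(pulls))  # 1, i.e. bandit b -- and every further pull
                                   # only raises b's sample count, locking it in
```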
[5] Minimum confidence level

The minimum confidence level strategy is to choose the bandit whose expected pay off we are least confident in (the current minimum).
In this case, you would choose bandit a. If it happens to come out with 1 pay off, that is $P(a)$=$0.5$ and $P(b)$=$0.5$, it just breaks to a tie. If multiple bandits have the same minimum confidence level, it doesn't matter which one you choose.
What happens if you choose bandit a and it comes out with 0 pay off? This strategy would continue to choose the one with the minimum confidence level, that is, bandit a, until $P(a)$=$P(b)$=$0.5$.
At the moment it plays all bandits to a tie, it just proceeds in round-robin fashion. The very next time the result is skewed, it will choose the one with the lower confidence level, and continue, and so on.
If you use this strategy, the pulls will be uniformly distributed over all the bandits, and the answer it gives is to choose any of the bandits.
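A sketch of the minimum confidence rule under the same pull-count measure of confidence; `min_confidence_pick` is again a hypothetical name:

```python
def min_confidence_pick(pulls):
    """Pick the arm we know least about, i.e. the least-pulled arm (ties: first)."""
    return min(range(len(pulls)), key=lambda i: pulls[i])

pulls = [1, 2]      # a pulled once, b twice
for _ in range(6):  # keep pulling the least-known arm
    pulls[min_confidence_pick(pulls)] += 1
print(pulls)        # [5, 4] -- the pulls spread out almost evenly
```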
[6] There must be something else

So, there must be something else besides maximum likelihood, maximum confidence level, and minimum confidence level. These alone are no better than always a or always b.
Caution must be taken: maximum likelihood is not a bad idea, nor is minimum confidence level a terrible method, for both of them bring you new data.
Some textbooks treat maximum likelihood as exploitation, and relate minimum confidence level to exploration:
➀the maximum likelihood gets you more rewards.
➁the minimum confidence level gives you better estimates.