Explore versus Exploit - Part 1
11 Aug 2019

Prologue To Explore versus Exploit - Part 1
Begin with K-Armed Bandits
[1] K-armed bandits

Suppose we are given a K-armed bandit, where $K$=$7$ in the exhibition below. The bandit has no state transition; it is all about stochasticity: you pull an arm, and you either get a jackpot or you don't.
[2] Which one will you pull?
In this example, suppose each machine has a different payoff. The question is: which bandit will you pull?
[3] We have to explore
➀The one that has the best chance of winning the jackpot, giving you money.
➁The point is that we don't know what the payoffs are, so it is hard to decide which bandit to choose.
➂If we knew the payoffs, we could just take $\max_{arm}Bandit(arm)$ to get the answer.

Suppose each bandit has some probability of paying off a one, and some probability of paying off nothing, say zero. But we don't know what those probabilities are at first.
If we pull bandit a and it doesn't pay off, does that mean bandit a is not the best one? No; we don't know what any of the others are, either.
One arm pull gives us some information, but not an awful lot. We have to combine information across a whole lot of pulls to start to evaluate which is the best bandit.
So, we have to explore!!
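To make this concrete, below is a minimal sketch of such a bandit in Python. The class name `BernoulliBandit` and the payoff probabilities are invented for illustration; the player never gets to see `payoffs` directly.

```python
import random

class BernoulliBandit:
    """A K-armed bandit: arm i pays 1 with hidden probability payoffs[i], else 0."""
    def __init__(self, payoffs):
        self.payoffs = payoffs  # true payoff probabilities, unknown to the player

    def pull(self, arm):
        # No state transition: every pull is an independent Bernoulli trial.
        return 1 if random.random() < self.payoffs[arm] else 0

# K = 7 arms as in the exhibition; these probabilities are made up for illustration.
bandit = BernoulliBandit([0.1, 0.3, 0.15, 0.2, 0.25, 0.05, 0.2])
print(bandit.pull(0))  # a single pull returns only 0 or 1 -- not much information
```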
[4] Illustration

Suppose each of these bandits has been pulled multiple times, and we record the number of pulls and the number of times it pays off 1, with the former in the denominator and the latter in the numerator, as given in the exhibition below.
Between a and e, it looks as if bandit e is better than bandit a: over the same 10 pulls, e has paid off 2 times, while a paid off just once. Next, examine a against d: bandit d gave one payoff after only 5 pulls, so it might also be plausible to prefer bandit d over a.
By the given data, we have 2 concerns:
➀which bandit has the highest expected pay off?
➁which are we most confident in?

For ➀, the bandits d, e, f, g just play to a tie, since
$\frac {1}{5}\cdot 1+\frac {4}{5}\cdot 0$, $\frac {2}{10}\cdot 1+\frac {8}{10}\cdot 0$, $\frac {4}{20}\cdot 1+\frac {16}{20}\cdot 0$, $\frac {8}{40}\cdot 1+\frac {32}{40}\cdot 0$ are the expected pay offs for bandits d, e, f, g respectively. They all equal $\frac {1}{5}$, higher than the rest.

For ➁, as to the confidence level, if you choose g it might be a good answer, but we need the right reason. When it comes to expectation, the confidence level is a monotonically increasing function of the number of samples. So, whichever bandit has the largest number of pulls should be the one we are most confident in, because it has the most samples. We finally compare c and g, and choose g.
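A few lines of Python, using the counts read off the exhibition (bandit c's counts are not listed in the text, so it is left out here), reproduce both comparisons:

```python
# (times paid 1, total pulls) for each bandit, as given in the exhibition.
records = {"a": (1, 10), "d": (1, 5), "e": (2, 10), "f": (4, 20), "g": (8, 40)}

for arm, (wins, pulls) in records.items():
    mean = wins / pulls  # expected pay off: wins*1 + (pulls - wins)*0, over pulls
    print(f"{arm}: mean={mean:.2f}, pulls={pulls}")

# d, e, f, g all tie at 0.20; among the tied arms, g has the most pulls,
# hence the most confident estimate.
```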
Confidence-Based Exploration
[1] Something else in stochasticity

Given 2 bandits a and b, you make 1 pull of a and get 0 pay off, whereas you make 2 pulls of b and get 1 pay off. Do you think the best thing at this point is to always pick a, always pick b, or something else?
Trivially, something else.
First, let's calculate the probability for:
➀always a to get 0 pay off
$\frac {1}{C_{1}^{3}}\cdot 1$=$\frac {1}{3}$
Since there exists 1 pull of bandit a out of the 3 pulls, and it indeed came out with 0 pay off.

➁always b to get 1 pay off
$\frac {1}{C_{2}^{3}}\cdot 2$=$\frac {2}{3}$
We treat the 3 pulls as distinct units. By the given data, there exist 2 pulls of bandit b, which came out with 1 pay off in total. Out of the 3 pulls, if we prefer always bandit b, each pull is not replaceable, and the 1 pay off might have appeared in either the 1st or the 2nd pull of b; that's why we multiply by 2 at the end.

Suppose you choose bandit b by intuition: you are right $\frac {2}{3}$ of the time, but with $\frac {1}{3}$ chance you make a mistake. That's why we end up with something else.
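The arithmetic above can be checked with Python's `math.comb`; the counting follows the argument just given:

```python
from math import comb

# 3 pulls in total: 1 of bandit a (paid 0) and 2 of bandit b (one pull paid 1).
p_always_a = (1 / comb(3, 1)) * 1  # = 1/3
p_always_b = (1 / comb(3, 2)) * 2  # = 2/3: the payoff may sit in b's 1st or 2nd pull

print(p_always_a, p_always_b)      # 0.333..., 0.666...
```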
[2] What is something else?

If we run an arbitrary number of pulls, what is something else? I think we might just evaluate which bandit to pull after some time, or an infinite amount of time later. Then I might know the true expected value of pulling bandit a or b, so I know which to choose.
It refers to whatever I'd like to do to gain more confidence in my expectation, which explicitly means more pulls. Getting more confidence in the expectation would be the next step:
[3] Maximum likelihood
➀maximum likelihood
➁maximum confidence level
➂minimum confidence level

The maximum likelihood approach here is to find the bandit which has the highest expected pay off, choose it, and then keep revisiting this question.
Each time you pull a bandit, recalculate your expected values; choose the highest pay off bandit, and continue, and so on.
As the new data of each next pull comes in, we redo our calculation and make a new decision again. But that's not a good approach.
By the given case, bandit b is the maximum-likelihood choice, say $MLE_{a}$=$0$, $MLE_{b}$=$0.5$. Under maximum likelihood, you will just choose b at the beginning, since $MLE_{b}$>$MLE_{a}$.
When I choose b for the very first time, whatever the outcome, the expected pay off of b remains higher than bandit a's. Next, I continue to choose b; it still has an expected pay off higher than 0. Over a long, long period such behavior just converges toward the true expectation, and the estimate only approaches 0 after an infinite amount of time.
The major point is that even if I choose b and it never pays off, it still has a non-zero expected pay off, which implies that I never get the chance to choose a. It is equivalent to saying that $\frac {2}{3}$ of the time you are right, but $\frac {1}{3}$ of the time you make a mistake.
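Here is a minimal sketch of this maximum-likelihood (greedy) rule; the function name `max_likelihood_pick` is made up for illustration:

```python
def max_likelihood_pick(wins, pulls):
    """Greedy rule: pick the arm with the highest empirical mean pay off."""
    means = [w / n if n > 0 else 0.0 for w, n in zip(wins, pulls)]
    return max(range(len(means)), key=lambda i: means[i])

# The given case: a pulled once with 0 pay off, b pulled twice with 1 pay off.
wins, pulls = [0, 1], [1, 2]
print(max_likelihood_pick(wins, pulls))  # 1, i.e. bandit b, since 0.5 > 0.0
# Even if b never pays off again, its estimate 1/n stays strictly above 0,
# so this rule never switches to a.
```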
We can do better than that.
[4] Maximum confidence level

The maximum confidence level strategy is: whichever bandit you have the best estimate of, choose that one. But that's a bad idea, too. Why?
Because whichever one I start out with must have the (currently maximum) confidence level, which implies that I will keep gathering more and more data that further consolidates the currently most confident one.
Although the bandit with the current maximum confidence level might not guarantee its future pay off, it can still keep a non-zero expected pay off over some period of time. In this example, the strategy will always choose b.
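A sketch of this strategy, assuming we measure confidence simply by the number of pulls (the monotonic relation noted earlier); `max_confidence_pick` is a hypothetical name:

```python
def max_confidence_pick(pulls):
    """Pick the arm whose estimate we trust most, i.e. the most-pulled arm."""
    return max(range(len(pulls)), key=lambda i: pulls[i])

pulls = [1, 2]                     # a pulled once, b twice
print(max_confidence_pick(pulls))  # 1, i.e. bandit b -- and every further pull
                                   # only raises b's sample count, locking it in
```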
[5] Minimum confidence level

The minimum confidence level strategy is to choose the bandit whose expected pay off we are least confident in (the current minimum).
In this case, you would choose bandit a. If it happens to come out with 1 pay off, that is $P(a)$=$0.5$ and $P(b)$=$0.5$, it just breaks to a tie. If multiple bandits have the same minimum confidence level, it doesn't matter which one you choose.
What happens if you choose bandit a and it comes out with 0 pay off? This strategy would continue to choose the one with the minimum confidence level, that is, bandit a, until $P(a)$=$P(b)$=$0.5$.
At the moment it plays all bandits to a tie, it just proceeds in round-robin fashion. The very next time the result is skewed, it will choose the one with the lower confidence level, and continue, and so on.
If you use this strategy, the pulls will be uniformly distributed over all the bandits, and the answer it gives is to choose any of the bandits.
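A sketch of the minimum confidence rule under the same pull-count measure of confidence; `min_confidence_pick` is again a hypothetical name:

```python
def min_confidence_pick(pulls):
    """Pick the arm we know least about, i.e. the least-pulled arm (ties: first)."""
    return min(range(len(pulls)), key=lambda i: pulls[i])

pulls = [1, 2]      # a pulled once, b twice
for _ in range(6):  # keep pulling the least-known arm
    pulls[min_confidence_pick(pulls)] += 1
print(pulls)        # [5, 4] -- the pulls spread out almost evenly
```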
[6] There must be something else

So, there must be something else besides maximum likelihood, maximum confidence level, and minimum confidence level. These alone are no better than always a or always b.
Caution must be taken: maximum likelihood is not a bad idea, nor is minimum confidence level a terrible method, for both of them bring you new data.
Some textbooks treat maximum likelihood as exploitation, and relate minimum confidence level to exploration:
➀the maximum likelihood gets you more rewards.
➁the minimum confidence level gives you better estimates.