# The story motivating the many-armed setting
## From the first page of [The Unreasonable Effectiveness of Greedy Algorithms in Multi-Armed Bandit with Many Arms](https://arxiv.org/pdf/2002.10121.pdf)
### Many applications need to test far more arms than the budget allows
We consider the standard stochastic multi-armed bandit (MAB) problem, in which a decision-maker takes actions sequentially over $T$ time periods (the horizon). At each time period, the decision-maker chooses one of $k$ arms and receives an uncertain reward. The goal is to maximize the cumulative reward attained over the horizon. Crucially, in the typical formulation of this problem, the number of arms $k$ is assumed to be “small” relative to the time horizon $T$; in particular, in standard asymptotic analysis of the MAB setting, the horizon $T$ scales to infinity while $k$ remains constant. In practice, however, there are many situations where the number of arms is large relative to the time horizon of interest. For example, drug development typically considers many combinations of basic substances, so MABs for adaptive drug design inherently involve a large set of arms. Similarly, when MABs are used in recommendation engines for online platforms, the number of choices available to users is enormous: this is the case in e-commerce (many products available), media platforms (many content options), online labor markets (wide variety of jobs or workers available), dating markets (many possible partners), etc.
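To make this setup concrete, below is a minimal Python sketch of such a bandit instance. The Uniform(0, 1) prior on arm means and the unit-variance Gaussian rewards are illustrative assumptions for this note, not necessarily the paper's exact simulation setup.

```python
import numpy as np

# A minimal stochastic MAB instance (illustrative assumptions, not the
# paper's exact setup): arm means drawn i.i.d. from Uniform(0, 1), and
# each pull returning a Gaussian reward centered at the chosen arm's mean.

rng = np.random.default_rng(0)

def make_bandit(k):
    """Return the k arm means and a pull(i) function giving a noisy reward."""
    mu = rng.uniform(0.0, 1.0, size=k)
    def pull(i):
        return mu[i] + rng.normal(0.0, 1.0)
    return mu, pull

# The many-armed regime: the number of arms k is large relative to the horizon T.
mu, pull = make_bandit(k=500)
T = 1000
```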
### How to sell the interesting simulation phenomena
However, numerical investigation reveals interesting behaviors. In Figure 1, we simulate several different algorithms over 400 simulations, for two pairs of $(T, k)$ in the many-armed regime. Notably, the greedy algorithm (Greedy), i.e., an algorithm that pulls each arm once and thereafter pulls the empirically best arm for all remaining periods, performs extremely well. This is despite the well-known fact that Greedy can suffer linear regret in the standard MAB problem, as it can fixate too early on a suboptimal arm. Observe that, in line with our first insight above, subsampling improves the performance of all algorithms, including UCB, Thompson sampling (TS), and Greedy. In particular, the subsampled greedy algorithm (SS-Greedy) outperforms all other algorithms.
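To pin down what Greedy and SS-Greedy do, here is a minimal sketch under the same illustrative assumptions as above (Uniform(0, 1) arm means, unit-variance Gaussian rewards). The subsample size `m` on the order of $\sqrt{T}$ is a hypothetical choice for illustration, not necessarily the paper's tuning.

```python
import numpy as np

rng = np.random.default_rng(0)

def greedy(mu, T):
    """Pull each arm once, then always pull the empirically best arm."""
    k = len(mu)
    sums = mu + rng.normal(0.0, 1.0, size=k)  # one initial pull per arm
    counts = np.ones(k)
    total = sums.sum()
    for _ in range(T - k):
        i = np.argmax(sums / counts)          # empirically best arm so far
        r = mu[i] + rng.normal(0.0, 1.0)
        sums[i] += r
        counts[i] += 1
        total += r
    return total

def ss_greedy(mu, T, m):
    """Subsampled greedy: run Greedy on m arms chosen uniformly at random."""
    sub = rng.choice(len(mu), size=m, replace=False)
    return greedy(mu[sub], T)

k, T = 500, 1000
mu = rng.uniform(0.0, 1.0, size=k)
print("Greedy reward:   ", greedy(mu, T))
print("SS-Greedy reward:", ss_greedy(mu, T, m=int(np.sqrt(T))))
```

Note that when $k > T$, plain Greedy cannot even finish its one-pull-per-arm initialization, whereas SS-Greedy only ever touches $m$ arms; this is one intuition for why subsampling helps in the many-armed regime.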
### Selling the motivation for our algorithm design
Motivated by these observations, in §5 and §7 we embark on a theoretical analysis of Greedy in the many-armed regime to complement our empirical investigation. We show that, with high probability, one of the arms on which Greedy concentrates attention has a high mean reward (as also observed in the right panel of Figure 1).