Outline
In this tutorial, we will first motivate the need for exploration in machine learning algorithms and highlight its importance in many real-world problems that involve online sequential decision making. In such settings, considerable challenges arise, including sample complexity, costly and even outdated feedback, and ethical considerations of exploration (such as fairness and privacy). We will introduce several classical exploration strategies, then highlight the three aforementioned fundamental challenges in the learning by exploration paradigm and survey recent research developments on addressing each of them.
1. The learning by exploration paradigm
In this section, we will introduce the importance of exploration in general learning problems, which motivates the learning by exploration paradigm. Most importantly, we will use several important application scenarios where efficient and effective exploration is needed to demonstrate the key challenges, including dependent observations, non-stationary environments, safety, privacy, and ethical concerns. We will then formulate the mathematical foundation of the learning by exploration paradigm in the multi-armed bandit setting, as formalized below.
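To make this formulation concrete, the standard (pseudo-)regret objective for a stochastic K-armed bandit is given below; the notation is generic and not tied to any single paper covered later in the tutorial.

```latex
% Stochastic K-armed bandit with unknown mean rewards \mu_1, \dots, \mu_K.
% At round t the learner pulls arm a_t and observes a noisy reward with mean \mu_{a_t}.
% The goal is to minimize the expected (pseudo-)regret over a horizon of T rounds:
R(T) = T\,\mu^{*} - \mathbb{E}\Big[\sum_{t=1}^{T} \mu_{a_t}\Big],
\qquad \mu^{*} = \max_{1 \le k \le K} \mu_k .
```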
2. Classical exploration strategies
In this section, we will introduce several classical exploration strategies developed over the years, together with their theoretical properties and practical usage in real-world application scenarios.
- Random exploration. We will introduce the basic random exploration strategies, ε-greedy and epoch-greedy, which motivate the more advanced exploration methods introduced later (see the first code sketch after this list).
- Optimism in the face of uncertainty. The basic idea is to “overestimate” (i.e., be optimistic about) the reward of every action, so that currently under-explored arms still have a chance to be chosen; the amount of overestimation should shrink as the number of observations on each arm grows. We will introduce upper confidence bound (UCB) based exploration strategies: UCB1, linear bandits with upper confidence bounds (LinUCB), its generalized linear extension, and KL-UCB (UCB1 is included in the first code sketch after this list).
- Posterior sampling based exploration. The basic idea is to sample from the posterior distribution of the reward or model parameters, and act greedily with respect to the sampled reward/model. We will introduce Thompson Sampling based exploration strategies (see the Thompson Sampling sketch after this list).
- Perturbation based exploration. The basic idea is to perturb the history of observed rewards with pseudo rewards so that each arm is occasionally “overestimated”, allowing a greedy policy to explore effectively. We will introduce the recent development of perturbed-history exploration in stochastic multi-armed bandits and in stochastic linear bandits (see the perturbed-history sketch after this list). We will also cover a smoothed analysis of the greedy policy’s behavior, which explains why its empirical performance is often not too bad.
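To make the first two families concrete, here is a minimal sketch of ε-greedy and UCB1 arm selection for a K-armed stochastic bandit; the class and method names are illustrative choices for this outline, not code from the cited papers.

```python
import math
import random

class EpsilonGreedy:
    """epsilon-greedy: explore uniformly at random with probability epsilon, otherwise exploit."""
    def __init__(self, n_arms, epsilon=0.1):
        self.epsilon = epsilon
        self.counts = [0] * n_arms      # number of pulls per arm
        self.means = [0.0] * n_arms     # empirical mean reward per arm

    def select_arm(self):
        if random.random() < self.epsilon:
            return random.randrange(len(self.means))
        return max(range(len(self.means)), key=lambda k: self.means[k])

    def update(self, arm, reward):
        self.counts[arm] += 1
        self.means[arm] += (reward - self.means[arm]) / self.counts[arm]


class UCB1:
    """UCB1: pull the arm with the largest optimistic index mean + sqrt(2 ln t / n)."""
    def __init__(self, n_arms):
        self.t = 0
        self.counts = [0] * n_arms
        self.means = [0.0] * n_arms

    def select_arm(self):
        self.t += 1
        for arm, n in enumerate(self.counts):
            if n == 0:                  # pull every arm once before trusting the index
                return arm
        return max(range(len(self.means)),
                   key=lambda k: self.means[k] + math.sqrt(2 * math.log(self.t) / self.counts[k]))

    def update(self, arm, reward):
        self.counts[arm] += 1
        self.means[arm] += (reward - self.means[arm]) / self.counts[arm]
```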
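For posterior sampling, here is a minimal Beta-Bernoulli Thompson Sampling sketch, assuming binary rewards and independent Beta(1, 1) priors per arm (again, an illustrative sketch rather than code from the cited papers).

```python
import random

class BernoulliThompsonSampling:
    """Thompson Sampling for Bernoulli rewards with Beta(1, 1) priors on each arm."""
    def __init__(self, n_arms):
        self.successes = [1] * n_arms   # alpha parameters of the Beta posteriors
        self.failures = [1] * n_arms    # beta parameters of the Beta posteriors

    def select_arm(self):
        # Sample a mean reward for each arm from its posterior, then act greedily on the samples.
        samples = [random.betavariate(self.successes[k], self.failures[k])
                   for k in range(len(self.successes))]
        return max(range(len(samples)), key=lambda k: samples[k])

    def update(self, arm, reward):
        # reward is assumed to be 0 or 1
        if reward == 1:
            self.successes[arm] += 1
        else:
            self.failures[arm] += 1
```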
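The perturbed-history idea can be illustrated as follows: before each greedy decision, the observed rewards of each arm are augmented with random pseudo rewards, which randomizes the empirical means enough to drive exploration. The sketch below is a simplified rendering in the spirit of the perturbed-history exploration papers listed below; the perturbation scheme (Bernoulli(1/2) pseudo rewards for rewards in [0, 1]) and the constant `a` are illustrative simplifications, not the exact published algorithm.

```python
import math
import random

class PerturbedHistoryExploration:
    """Greedy play on empirical means computed from pseudo-reward-augmented histories."""
    def __init__(self, n_arms, a=1.0):
        self.a = a                              # pseudo rewards per real observation
        self.counts = [0] * n_arms              # real pulls per arm
        self.reward_sums = [0.0] * n_arms       # sums of real rewards per arm (assumed in [0, 1])

    def select_arm(self):
        perturbed_means = []
        for arm in range(len(self.counts)):
            if self.counts[arm] == 0:
                return arm                      # pull every arm once first
            # augment this arm's history with ceil(a * pulls) pseudo rewards drawn from {0, 1}
            n_pseudo = math.ceil(self.a * self.counts[arm])
            pseudo = sum(random.randint(0, 1) for _ in range(n_pseudo))
            perturbed_means.append((self.reward_sums[arm] + pseudo) / (self.counts[arm] + n_pseudo))
        return max(range(len(perturbed_means)), key=lambda k: perturbed_means[k])

    def update(self, arm, reward):
        self.counts[arm] += 1
        self.reward_sums[arm] += reward
```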
Papers to be covered in this section include:
- Auer, Peter. “Using confidence bounds for exploitation-exploration trade-offs.” Journal of Machine Learning Research 3.Nov (2002): 397-422.
- Langford, John, and Tong Zhang. “The epoch-greedy algorithm for multi-armed bandits with side information.” Advances in Neural Information Processing Systems. 2008.
- Li, Lihong, Wei Chu, John Langford, and Robert E. Schapire. “A contextual-bandit approach to personalized news article recommendation.” In Proceedings of the 19th International Conference on World Wide Web, pp. 661-670. 2010.
- Filippi, Sarah, Olivier Cappe, Aurélien Garivier, and Csaba Szepesvári. “Parametric bandits: The generalized linear case.” In Advances in Neural Information Processing Systems, pp. 586-594. 2010.
- Garivier, Aurélien, and Olivier Cappé. “The KL-UCB algorithm for bounded stochastic bandits and beyond.” In Proceedings of the 24th annual conference on learning theory, pp. 359-376. 2011.
- Agrawal, Shipra, and Navin Goyal. “Thompson sampling for contextual bandits with linear payoffs.” In International Conference on Machine Learning, pp. 127-135. 2013.
- Russo, Daniel, Benjamin Van Roy, Abbas Kazerouni, Ian Osband, and Zheng Wen. “A tutorial on thompson sampling.” arXiv preprint arXiv:1707.02038 (2017).
- Kveton, Branislav, Csaba Szepesvari, Mohammad Ghavamzadeh, and Craig Boutilier. “Perturbed-history exploration in stochastic multi-armed bandits.” arXiv preprint arXiv:1902.10089 (2019).
- Kveton, Branislav, Csaba Szepesvari, Mohammad Ghavamzadeh, and Craig Boutilier. “Perturbed-history exploration in stochastic linear bandits.” arXiv preprint arXiv:1903.09132 (2019).
- Kannan, Sampath, Jamie H. Morgenstern, Aaron Roth, Bo Waggoner, and Zhiwei Steven Wu. “A smoothed analysis of the greedy algorithm for the linear contextual bandit problem.” In Advances in Neural Information Processing Systems, pp. 2227-2236. 2018.
3. Efficient exploration in complicated real-world environments
In this section, we will illustrate the need for more efficient exploration in real-world intelligent systems, where obtaining feedback is costly, for example when the feedback is provided by humans. We will introduce recent developments on more efficient exploration in various learning scenarios.
- Exploration through information propagation across multiple learning agents, realized by collaborative bandit learning.
- Exploration through factorization based bandit learning, which discovers the underlying low-rank structure in the problem.
- Warm-start exploration using offline data (see the sketch after this list).
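As one concrete illustration of warm starting, below is a minimal sketch of a LinUCB-style linear bandit whose sufficient statistics are initialized from offline logged (context, reward) pairs before online exploration begins. It assumes a linear reward model and is an illustrative sketch, not the specific warm-start procedure of the papers listed below.

```python
import numpy as np

class WarmStartLinUCB:
    """LinUCB-style linear bandit whose statistics can be warm-started from logged data."""
    def __init__(self, dim, alpha=1.0, reg=1.0):
        self.alpha = alpha                      # width of the confidence bonus
        self.A = reg * np.eye(dim)              # ridge-regression design matrix
        self.b = np.zeros(dim)                  # accumulated reward-weighted contexts

    def warm_start(self, contexts, rewards):
        """Fold offline logged (context, reward) pairs into the sufficient statistics."""
        for x, r in zip(contexts, rewards):
            self.update(np.asarray(x, dtype=float), float(r))

    def select_arm(self, arm_contexts):
        """Pick the arm whose context has the largest upper confidence bound."""
        A_inv = np.linalg.inv(self.A)
        theta = A_inv @ self.b                  # ridge estimate of the reward parameter
        scores = [x @ theta + self.alpha * np.sqrt(x @ A_inv @ x) for x in arm_contexts]
        return int(np.argmax(scores))

    def update(self, x, reward):
        self.A += np.outer(x, x)
        self.b += reward * x
```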
Papers to be covered in this section include:
- Cesa-Bianchi, Nicolo, Claudio Gentile, and Giovanni Zappella. “A gang of bandits.” Advances in Neural Information Processing Systems. 2013.
- Wu, Qingyun, et al. “Contextual bandits in a collaborative environment.” Proceedings of the 39th International ACM SIGIR conference on Research and Development in Information Retrieval. 2016.
- Gentile, Claudio, Shuai Li, and Giovanni Zappella. “Online clustering of bandits.” International Conference on Machine Learning. 2014.
- Li, Shuai, Alexandros Karatzoglou, and Claudio Gentile. “Collaborative filtering bandits.” Proceedings of the 39th International ACM SIGIR conference on Research and Development in Information Retrieval. 2016.
- Gentile, Claudio, et al. “On context-dependent clustering of bandits.” International Conference on Machine Learning. 2017.
- Kawale, Jaya, et al. “Efficient Thompson Sampling for Online Matrix-Factorization Recommendation.” Advances in Neural Information Processing Systems. 2015.
- Wang, Huazheng, Qingyun Wu, and Hongning Wang. “Learning hidden features for contextual bandits.” Proceedings of the 25th ACM International Conference on Information and Knowledge Management. 2016.
- Zhang, Chicheng, et al. “Warm-starting Contextual Bandits: Robustly Combining Supervised and Bandit Feedback.” ICML. 2019.
4. Learning by exploration in non-stationary environments
There is growing interest in understanding interactive online learning in non-stationary environments, as non-stationarity naturally arises in many real-world application scenarios. In this section, we will introduce the need for and the challenges of learning by exploration in non-stationary environments, and survey recent developments on bandit learning in such environments (a sliding-window UCB sketch is given below).
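As a concrete example of adapting to non-stationarity, below is a minimal sliding-window UCB sketch in the spirit of Garivier and Moulines (2008): only the most recent observations are used when computing each arm's optimistic index. The window size and bonus constant are illustrative choices, not tuned values from the cited papers.

```python
import math
from collections import deque

class SlidingWindowUCB:
    """UCB that only uses the last `window` rounds when estimating each arm's mean."""
    def __init__(self, n_arms, window=1000, c=2.0):
        self.n_arms = n_arms
        self.window = window
        self.c = c
        self.history = deque()                  # (arm, reward) pairs from the last `window` rounds

    def _windowed_stats(self):
        counts, sums = [0] * self.n_arms, [0.0] * self.n_arms
        for arm, reward in self.history:
            counts[arm] += 1
            sums[arm] += reward
        return counts, sums

    def select_arm(self):
        counts, sums = self._windowed_stats()
        for arm in range(self.n_arms):
            if counts[arm] == 0:                # pull arms with no recent observations first
                return arm
        window_len = len(self.history)
        return max(range(self.n_arms),
                   key=lambda k: sums[k] / counts[k]
                   + math.sqrt(self.c * math.log(window_len) / counts[k]))

    def update(self, arm, reward):
        self.history.append((arm, reward))
        if len(self.history) > self.window:     # forget observations outside the window
            self.history.popleft()
```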
Papers to be covered in this section include:
- Garivier, Aurélien, and Eric Moulines. “On upper-confidence bound policies for non-stationary bandit problems.” arXiv preprint arXiv:0805.3415 (2008).
- Yu, Jia Yuan, and Shie Mannor. “Piecewise-stationary bandit problems with side observations.” Proceedings of the 26th annual international conference on machine learning. 2009.
- Hariri, Negar, Bamshad Mobasher, and Robin Burke. “Adapting to user preference changes in interactive recommendation.” Twenty-Fourth International Joint Conference on Artificial Intelligence. 2015.
- Wu, Qingyun, Naveen Iyer, and Hongning Wang. “Learning contextual bandits in a non-stationary environment.” The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval. 2018.
- Luo, Haipeng, et al. “Efficient contextual bandits in non-stationary worlds.” Conference On Learning Theory. 2018.
- Liu, Fang, Joohyun Lee, and Ness Shroff. “A change-detection based framework for piecewise-stationary multi-armed bandit problem.” Thirty-Second AAAI Conference on Artificial Intelligence. 2018.
- Cao, Yang, et al. “Nearly optimal adaptive procedure with change detection for piecewise-stationary bandit.” The 22nd International Conference on Artificial Intelligence and Statistics. 2019.
- Wu, Qingyun, et al. “Dynamic Ensemble of Contextual Bandits to Satisfy Users’ Changing Interests.” The World Wide Web Conference. 2019.
- Russac, Yoan, Claire Vernade, and Olivier Cappé. “Weighted linear bandits for non-stationary environments.” Advances in Neural Information Processing Systems. 2019.
- Chen, Yifang, et al. “A New Algorithm for Non-stationary Contextual Bandits: Efficient, Optimal, and Parameter-free.” Conference on Learning Theory. 2019.
- Besson, Lilian, and Emilie Kaufmann. “The generalized likelihood ratio test meets klUCB: an improved algorithm for piece-wise non-stationary bandits.” arXiv preprint arXiv:1902.01575 (2019).
- Maillard, Odalric-Ambrym. “Sequential change-point detection: Laplace concentration of scan statistics and non-asymptotic delay bounds.” (2019).
5. Ethical considerations of exploration
Ethical considerations, such as fairness and privacy preservation, are becoming important constraints for online learning algorithms, especially when sensitive personal information is involved in the learning pipeline and/or the online decisions have important consequences on people’s lives. In this section we will introduce recent efforts on developing fair exploration strategies and on protecting against potential privacy leakage during online exploration.
Papers to be covered in this section include:
- Mishra, Nikita, and Abhradeep Thakurta. “(Nearly) optimal differentially private stochastic multi-arm bandits.” Proceedings of the Thirty-First Conference on Uncertainty in Artificial Intelligence. 2015.
- Neel, Seth, and Aaron Roth. “Mitigating Bias in Adaptive Data Gathering via Differential Privacy.” International Conference on Machine Learning. 2018.
- Shariff, Roshan, and Or Sheffet. “Differentially private contextual linear bandits.” Advances in Neural Information Processing Systems. 2018.
- Ren, Wenbo, et al. “Multi-Armed Bandits with Local Differential Privacy.” arXiv preprint arXiv:2007.03121 (2020).
- Joseph, Matthew, et al. “Fairness in learning: Classic and contextual bandits.” Advances in Neural Information Processing Systems. 2016.
- Chen, Yifang, et al. “The Fair Contextual Multi-Armed Bandit.” Proceedings of the 19th International Conference on Autonomous Agents and MultiAgent Systems. 2020.
6. Future research directions
We will discuss some urgent and important research directions that we believe would make an impact on the practice of learning by exploration. These directions include, but are not limited to: 1) how to perform effective exploration with non-linear deep models, given that most existing exploration strategies are designed for simple linear models; 2) privacy preservation when learning in a collaborative multi-user environment; 3) learning in a multi-agent adversarial environment; and 4) incentivizing the environment for exploration. We will share our thoughts on these interesting and important directions, and hope to spark new ideas and collaborations in this exciting field of research.