Outline
In this tutorial, we will first motivate the need for exploration in machine learning algorithms and highlight its importance in many real-world problems that involve online sequential decision making. In such settings, considerable challenges arise, including sample complexity, costly and even outdated feedback, and ethical considerations of exploration (such as fairness and privacy). We will introduce several classical exploration strategies, then highlight the three aforementioned fundamental challenges in the learning by exploration paradigm and survey recent research developments on addressing each of them.
1. The learning by exploration paradigm
In this section, we will introduce the importance of exploration in general learning problems, which motivates the learning by exploration paradigm. Most importantly, we will use several important application scenarios where efficient and effective exploration is needed to demonstrate the key challenges, including dependent observations, non-stationary environments, safety, privacy, and ethical concerns. We will then formulate the mathematical foundation of the learning by exploration paradigm in the multi-armed bandit setting, as formalized below.
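To make this formulation concrete, the standard (pseudo-)regret objective for a stochastic K-armed bandit is given below; the notation is generic and not tied to any single paper covered later in the tutorial.

```latex
% Stochastic K-armed bandit with unknown mean rewards \mu_1, \dots, \mu_K.
% At round t the learner pulls arm a_t and observes a noisy reward with mean \mu_{a_t}.
% The goal is to minimize the expected (pseudo-)regret over a horizon of T rounds:
R(T) = T\,\mu^{*} - \mathbb{E}\Big[\sum_{t=1}^{T} \mu_{a_t}\Big],
\qquad \mu^{*} = \max_{1 \le k \le K} \mu_k .
```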
2. Classical exploration strategies
In this section, we will introduce several classical exploration strategies developed over the years, together with their theoretical properties and practical usage in real-world application scenarios.
- Random exploration. We will introduce the basic random exploration strategies, ε-greedy and epoch-greedy, which motivate the more advanced exploration methods introduced later (see the first code sketch after this list).
- Optimism in the face of uncertainty. The basic idea is to “overestimate” (i.e., be optimistic about) the reward of every action, so that currently under-explored arms still have a chance to be chosen; the amount of overestimation should shrink as the number of observations on each arm grows. We will introduce upper confidence bound (UCB) based exploration strategies: UCB1, linear bandits with upper confidence bounds (LinUCB), its generalized linear extension, and KL-UCB (UCB1 is included in the first code sketch after this list).
- Posterior sampling based exploration. The basic idea is to sample from the posterior distribution of the reward or model parameters, and act greedily with respect to the sampled reward/model. We will introduce Thompson Sampling based exploration strategies (see the Thompson Sampling sketch after this list).
- Perturbation based exploration. The basic idea is to perturb the history of observed rewards with pseudo rewards so that each arm is occasionally “overestimated”, allowing a greedy policy to explore effectively. We will introduce the recent development of perturbed-history exploration in stochastic multi-armed bandits and in stochastic linear bandits (see the perturbed-history sketch after this list). We will also cover a smoothed analysis of the greedy policy’s behavior, which explains why its empirical performance is often not too bad.
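To make the first two families concrete, here is a minimal sketch of ε-greedy and UCB1 arm selection for a K-armed stochastic bandit; the class and method names are illustrative choices for this outline, not code from the cited papers.

```python
import math
import random

class EpsilonGreedy:
    """epsilon-greedy: explore uniformly at random with probability epsilon, otherwise exploit."""
    def __init__(self, n_arms, epsilon=0.1):
        self.epsilon = epsilon
        self.counts = [0] * n_arms      # number of pulls per arm
        self.means = [0.0] * n_arms     # empirical mean reward per arm

    def select_arm(self):
        if random.random() < self.epsilon:
            return random.randrange(len(self.means))
        return max(range(len(self.means)), key=lambda k: self.means[k])

    def update(self, arm, reward):
        self.counts[arm] += 1
        self.means[arm] += (reward - self.means[arm]) / self.counts[arm]


class UCB1:
    """UCB1: pull the arm with the largest optimistic index mean + sqrt(2 ln t / n)."""
    def __init__(self, n_arms):
        self.t = 0
        self.counts = [0] * n_arms
        self.means = [0.0] * n_arms

    def select_arm(self):
        self.t += 1
        for arm, n in enumerate(self.counts):
            if n == 0:                  # pull every arm once before trusting the index
                return arm
        return max(range(len(self.means)),
                   key=lambda k: self.means[k] + math.sqrt(2 * math.log(self.t) / self.counts[k]))

    def update(self, arm, reward):
        self.counts[arm] += 1
        self.means[arm] += (reward - self.means[arm]) / self.counts[arm]
```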
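For posterior sampling, here is a minimal Beta-Bernoulli Thompson Sampling sketch, assuming binary rewards and independent Beta(1, 1) priors per arm (again, an illustrative sketch rather than code from the cited papers).

```python
import random

class BernoulliThompsonSampling:
    """Thompson Sampling for Bernoulli rewards with Beta(1, 1) priors on each arm."""
    def __init__(self, n_arms):
        self.successes = [1] * n_arms   # alpha parameters of the Beta posteriors
        self.failures = [1] * n_arms    # beta parameters of the Beta posteriors

    def select_arm(self):
        # Sample a mean reward for each arm from its posterior, then act greedily on the samples.
        samples = [random.betavariate(self.successes[k], self.failures[k])
                   for k in range(len(self.successes))]
        return max(range(len(samples)), key=lambda k: samples[k])

    def update(self, arm, reward):
        # reward is assumed to be 0 or 1
        if reward == 1:
            self.successes[arm] += 1
        else:
            self.failures[arm] += 1
```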
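The perturbed-history idea can be illustrated as follows: before each greedy decision, the observed rewards of each arm are augmented with random pseudo rewards, which randomizes the empirical means enough to drive exploration. The sketch below is a simplified rendering in the spirit of the perturbed-history exploration papers listed below; the perturbation scheme (Bernoulli(1/2) pseudo rewards for rewards in [0, 1]) and the constant `a` are illustrative simplifications, not the exact published algorithm.

```python
import math
import random

class PerturbedHistoryExploration:
    """Greedy play on empirical means computed from pseudo-reward-augmented histories."""
    def __init__(self, n_arms, a=1.0):
        self.a = a                              # pseudo rewards per real observation
        self.counts = [0] * n_arms              # real pulls per arm
        self.reward_sums = [0.0] * n_arms       # sums of real rewards per arm (assumed in [0, 1])

    def select_arm(self):
        perturbed_means = []
        for arm in range(len(self.counts)):
            if self.counts[arm] == 0:
                return arm                      # pull every arm once first
            # augment this arm's history with ceil(a * pulls) pseudo rewards drawn from {0, 1}
            n_pseudo = math.ceil(self.a * self.counts[arm])
            pseudo = sum(random.randint(0, 1) for _ in range(n_pseudo))
            perturbed_means.append((self.reward_sums[arm] + pseudo) / (self.counts[arm] + n_pseudo))
        return max(range(len(perturbed_means)), key=lambda k: perturbed_means[k])

    def update(self, arm, reward):
        self.counts[arm] += 1
        self.reward_sums[arm] += reward
```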
Papers to be covered in this section include:
- Auer, Peter. “Using confidence bounds for exploitation-exploration trade-offs.” Journal of Machine Learning Research 3.Nov (2002): 397-422.
- Langford, John, and Tong Zhang. “The epoch-greedy algorithm for multi-armed bandits with side information.” Advances in Neural Information Processing Systems. 2008.
- Li, Lihong, Wei Chu, John Langford, and Robert E. Schapire. “A contextual-bandit approach to personalized news article recommendation.” In Proceedings of the 19th International Conference on World Wide Web, pp. 661-670. 2010.
- Filippi, Sarah, Olivier Cappe, Aurélien Garivier, and Csaba Szepesvári. “Parametric bandits: The generalized linear case.” In Advances in Neural Information Processing Systems, pp. 586-594. 2010.
- Garivier, Aurélien, and Olivier Cappé. “The KL-UCB algorithm for bounded stochastic bandits and beyond.” In Proceedings of the 24th annual conference on learning theory, pp. 359-376. 2011.
- Agrawal, Shipra, and Navin Goyal. “Thompson sampling for contextual bandits with linear payoffs.” In International Conference on Machine Learning, pp. 127-135. 2013.
- Russo, Daniel, Benjamin Van Roy, Abbas Kazerouni, Ian Osband, and Zheng Wen. “A tutorial on thompson sampling.” arXiv preprint arXiv:1707.02038 (2017).
- Kveton, Branislav, Csaba Szepesvari, Mohammad Ghavamzadeh, and Craig Boutilier. “Perturbed-history exploration in stochastic multi-armed bandits.” arXiv preprint arXiv:1902.10089 (2019).
- Kveton, Branislav, Csaba Szepesvari, Mohammad Ghavamzadeh, and Craig Boutilier. “Perturbed-history exploration in stochastic linear bandits.” arXiv preprint arXiv:1903.09132 (2019).
- Kannan, Sampath, Jamie H. Morgenstern, Aaron Roth, Bo Waggoner, and Zhiwei Steven Wu. “A smoothed analysis of the greedy algorithm for the linear contextual bandit problem.” In Advances in Neural Information Processing Systems, pp. 2227-2236. 2018.
3. Efficient exploration in complicated real-world environments
In this section, we will illustrate the need for more efficient exploration in real-world intelligent systems, where obtaining feedback is costly, for example when the feedback is provided by humans. We will introduce recent developments on more efficient exploration in various learning scenarios.
- Exploration through information propagation across multiple learning agents, realized by collaborative bandit learning.
- Exploration through factorization based bandit learning, which discovers the underlying low-rank structure in the problem.
- Warm-start exploration using offline data (see the sketch after this list).
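As one concrete illustration of warm starting, below is a minimal sketch of a LinUCB-style linear bandit whose sufficient statistics are initialized from offline logged (context, reward) pairs before online exploration begins. It assumes a linear reward model and is an illustrative sketch, not the specific warm-start procedure of the papers listed below.

```python
import numpy as np

class WarmStartLinUCB:
    """LinUCB-style linear bandit whose statistics can be warm-started from logged data."""
    def __init__(self, dim, alpha=1.0, reg=1.0):
        self.alpha = alpha                      # width of the confidence bonus
        self.A = reg * np.eye(dim)              # ridge-regression design matrix
        self.b = np.zeros(dim)                  # accumulated reward-weighted contexts

    def warm_start(self, contexts, rewards):
        """Fold offline logged (context, reward) pairs into the sufficient statistics."""
        for x, r in zip(contexts, rewards):
            self.update(np.asarray(x, dtype=float), float(r))

    def select_arm(self, arm_contexts):
        """Pick the arm whose context has the largest upper confidence bound."""
        A_inv = np.linalg.inv(self.A)
        theta = A_inv @ self.b                  # ridge estimate of the reward parameter
        scores = [x @ theta + self.alpha * np.sqrt(x @ A_inv @ x) for x in arm_contexts]
        return int(np.argmax(scores))

    def update(self, x, reward):
        self.A += np.outer(x, x)
        self.b += reward * x
```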
Papers to be covered in this section include:
- Cesa-Bianchi, Nicolo, Claudio Gentile, and Giovanni Zappella. “A gang of bandits.” Advances in Neural Information Processing Systems. 2013.
- Wu, Qingyun, et al. “Contextual bandits in a collaborative environment.” Proceedings of the 39th International ACM SIGIR conference on Research and Development in Information Retrieval. 2016.
- Gentile, Claudio, Shuai Li, and Giovanni Zappella. “Online clustering of bandits.” International Conference on Machine Learning. 2014.
- Li, Shuai, Alexandros Karatzoglou, and Claudio Gentile. “Collaborative filtering bandits.” Proceedings of the 39th International ACM SIGIR conference on Research and Development in Information Retrieval. 2016.
- Gentile, Claudio, et al. “On context-dependent clustering of bandits.” International Conference on Machine Learning. 2017.
- Kawale, Jaya, et al. “Efficient Thompson Sampling for Online Matrix-Factorization Recommendation.” Advances in Neural Information Processing Systems. 2015.
- Wang, Huazheng, Qingyun Wu, and Hongning Wang. “Learning hidden features for contextual bandits.” Proceedings of the 25th ACM International Conference on Information and Knowledge Management. 2016.
- Zhang, Chicheng, et al. “Warm-starting Contextual Bandits: Robustly Combining Supervised and Bandit Feedback.” ICML. 2019.
4. Learning by exploration in non-stationary environments
There is growing interest in understanding interactive online learning in non-stationary environments, as non-stationarity naturally arises in many real-world application scenarios. In this section, we will introduce the need for and the challenges of learning by exploration in non-stationary environments, and survey recent developments on bandit learning in such environments (a sliding-window UCB sketch is given below).
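As a concrete example of adapting to non-stationarity, below is a minimal sliding-window UCB sketch in the spirit of Garivier and Moulines (2008): only the most recent observations are used when computing each arm's optimistic index. The window size and bonus constant are illustrative choices, not tuned values from the cited papers.

```python
import math
from collections import deque

class SlidingWindowUCB:
    """UCB that only uses the last `window` rounds when estimating each arm's mean."""
    def __init__(self, n_arms, window=1000, c=2.0):
        self.n_arms = n_arms
        self.window = window
        self.c = c
        self.history = deque()                  # (arm, reward) pairs from the last `window` rounds

    def _windowed_stats(self):
        counts, sums = [0] * self.n_arms, [0.0] * self.n_arms
        for arm, reward in self.history:
            counts[arm] += 1
            sums[arm] += reward
        return counts, sums

    def select_arm(self):
        counts, sums = self._windowed_stats()
        for arm in range(self.n_arms):
            if counts[arm] == 0:                # pull arms with no recent observations first
                return arm
        window_len = len(self.history)
        return max(range(self.n_arms),
                   key=lambda k: sums[k] / counts[k]
                   + math.sqrt(self.c * math.log(window_len) / counts[k]))

    def update(self, arm, reward):
        self.history.append((arm, reward))
        if len(self.history) > self.window:     # forget observations outside the window
            self.history.popleft()
```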
Papers to be covered in this section include:
- Garivier, Aurélien, and Eric Moulines. “On upper-confidence bound policies for non-stationary bandit problems.” arXiv preprint arXiv:0805.3415 (2008).
- Yu, Jia Yuan, and Shie Mannor. “Piecewise-stationary bandit problems with side observations.” Proceedings of the 26th annual international conference on machine learning. 2009.
- Hariri, Negar, Bamshad Mobasher, and Robin Burke. “Adapting to user preference changes in interactive recommendation.” Twenty-Fourth International Joint Conference on Artificial Intelligence. 2015.
- Wu, Qingyun, Naveen Iyer, and Hongning Wang. “Learning contextual bandits in a non-stationary environment.” The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval. 2018.
- Luo, Haipeng, et al. “Efficient contextual bandits in non-stationary worlds.” Conference On Learning Theory. 2018.
- Liu, Fang, Joohyun Lee, and Ness Shroff. “A change-detection based framework for piecewise-stationary multi-armed bandit problem.” Thirty-Second AAAI Conference on Artificial Intelligence. 2018.
- Cao, Yang, et al. “Nearly optimal adaptive procedure with change detection for piecewise-stationary bandit.” The 22nd International Conference on Artificial Intelligence and Statistics. 2019.
- Wu, Qingyun, et al. “Dynamic Ensemble of Contextual Bandits to Satisfy Users’ Changing Interests.” The World Wide Web Conference. 2019.
- Russac, Yoan, Claire Vernade, and Olivier Cappé. “Weighted linear bandits for non-stationary environments.” Advances in Neural Information Processing Systems. 2019.
- Chen, Yifang, et al. “A New Algorithm for Non-stationary Contextual Bandits: Efficient, Optimal, and Parameter-free.” Conference on Learning Theory. 2019.
- Besson, Lilian, and Emilie Kaufmann. “The generalized likelihood ratio test meets klUCB: an improved algorithm for piece-wise non-stationary bandits.” arXiv preprint arXiv:1902.01575 (2019).
- Maillard, Odalric-Ambrym. “Sequential change-point detection: Laplace concentration of scan statistics and non-asymptotic delay bounds.” (2019).
5. Ethical considerations of exploration
Ethical considerations, such as fairness and privacy preservation, are becoming important constraints for online learning algorithms, especially when sensitive personal information is involved in the learning pipeline and/or the online decisions have important consequences on people’s lives. In this section we will introduce recent efforts on developing fair exploration strategies and on protecting against potential privacy leakage during online exploration.
Papers to be covered in this section include:
- Mishra, Nikita, and Abhradeep Thakurta. “(Nearly) optimal differentially private stochastic multi-arm bandits.” Proceedings of the Thirty-First Conference on Uncertainty in Artificial Intelligence. 2015.
- Neel, Seth, and Aaron Roth. “Mitigating Bias in Adaptive Data Gathering via Differential Privacy.” International Conference on Machine Learning. 2018.
- Shariff, Roshan, and Or Sheffet. “Differentially private contextual linear bandits.” Advances in Neural Information Processing Systems. 2018.
- Ren, Wenbo, et al. “Multi-Armed Bandits with Local Differential Privacy.” arXiv preprint arXiv:2007.03121 (2020).
- Joseph, Matthew, et al. “Fairness in learning: Classic and contextual bandits.” Advances in Neural Information Processing Systems. 2016.
- Chen, Yifang, et al. “The Fair Contextual Multi-Armed Bandit.” Proceedings of the 19th International Conference on Autonomous Agents and MultiAgent Systems. 2020.
6. Future research directions
We will discuss some urgent and important research directions that we believe would make an impact on the practice of learning by exploration. These directions include, but are not limited to: 1) how to perform effective exploration with non-linear deep models, given that most existing exploration strategies are designed for simple linear models; 2) privacy preservation when learning in a collaborative multi-user environment; 3) learning in a multi-agent adversarial environment; and 4) incentivizing the environment for exploration. We will share our thoughts on these interesting and important directions, and hope to spark new ideas and collaborations in this exciting field of research.