\documentclass[a4paper]{article}
\usepackage[english]{babel}
\usepackage[utf8]{inputenc}
\usepackage{fancyhdr}
\usepackage{graphicx}
\usepackage{hyperref}
\usepackage{amsmath,amsfonts,amssymb,amsthm}
\usepackage[a4paper, bottom=1.3in, top=1.3in, right=1in, left=1in]{geometry}
\usepackage[usenames,dvipsnames]{xcolor}
\usepackage[lined,boxed]{algorithm2e}
\usepackage{natbib}
\usepackage{dsfont}
\usepackage{tikz}
\usetikzlibrary{calc}
\definecolor{amaranth}{rgb}{0.9, 0.17, 0.31}
\newcommand{\rcol}[1]{{\color{amaranth}#1}}
\usepackage{todonotes}
\newcommand{\todomp}[1]{\todo[color=Green!10, inline]{\small MP: #1}}
\newcommand{\todompout}[1]{\todo[color=Green!10]{\scriptsize MP: #1}}
\newcommand{\wh}[1]{\widehat{#1}}
\newcommand{\wt}[1]{\widetilde{#1}}
\newcommand{\transp}{\intercal}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Insert your name here
\newcommand{\fullname}{Pierre Fernandez}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\newcommand{\lecture}[2]{
  \pagestyle{myheadings}
  \thispagestyle{plain}
  \newpage
  \setcounter{page}{1}
  \noindent
  \begin{center}
  \framebox{
    \vbox{\vspace{2mm}
    \hbox to .97\textwidth { {\bf MVA: Reinforcement Learning (2020/2021) \hfill Homework 4} }
    \vspace{6mm}
    \hbox to .97\textwidth { {\Large \hfill #1 \hfill } }
    \vspace{6mm}
    \hbox to .97\textwidth { {Lecturers: \it A. Lazaric, M. Pirotta \hfill {{\footnotesize( January 9, 2020 % \today
    )}}} }
    \vspace{2mm}}
  }
  \end{center}
  Solution by \fullname
  \markboth{#1}{#1}
  \vspace*{4mm}
}

\DeclareMathOperator*{\argmax}{\arg\,\max}
\DeclareMathOperator*{\argmin}{\arg\,\min}
\DeclareMathOperator*{\arginf}{\arg\,\inf}

\setlength{\parindent}{0cm}

\begin{document}
\lecture{Model Selection for Contextual Bandit}{4}
\pagestyle{fancy}
\fancyhf{}
\rhead{Full name: \fullname}
\lhead{Model Selection for Contextual Bandit}
\cfoot{\thepage}

\textbf{Instructions}
\begin{itemize}
  \item The deadline is \textbf{February 5, 2021, 23h00}.
  \item By doing this homework you agree to the \emph{late day policy, collaboration and misconduct rules} reported on \href{https://piazza.com/class/kf86owfvi2u2lg?cid=8}{Piazza}.
  \item \textbf{Mysterious or unsupported answers will not receive full credit}. A correct answer, unsupported by calculations, explanation, or algebraic work will receive no credit; an incorrect answer supported by substantially correct calculations and explanations might still receive partial credit.
  \item Answers should be provided in \textbf{English}.
\end{itemize}

\section{Model Selection}

In this homework, we look into the problem of model selection for bandits. For example,
\begin{itemize}
  \item Optimizing the exploration rate in an $\epsilon$-greedy algorithm
  \item Learning the most effective representation of a problem (e.g., UCB vs.\ LinUCB, or selection among multiple linear representations)
\end{itemize}
These are just two examples of possible applications. In general, we assume that we are given a set of $M$ bandit algorithms. We aim to design a meta algorithm (or \emph{master algorithm}) that selects, at each round, which base algorithm to play. The goal is to ensure that the performance of the master is not far from that of the best base algorithm, had it been run alone. Denoting by $R_i(T)$ the regret bound of base algorithm $i$, the objective is to design a master algorithm with regret bound $\wt{O}(\mathrm{poly}(M)\, R_{i^\star}(T))$, where $R_{i^\star}(T) = \min_i R_i(T)$.
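To fix ideas, the interaction protocol between the master and the base algorithms can be summarized by the following minimal Python sketch (the class and method names are illustrative and do not refer to the provided code):
\begin{verbatim}
# Minimal sketch of the master / base-algorithm protocol described above.
# All names (BaseAlgorithm, Master, env, ...) are illustrative.
from abc import ABC, abstractmethod

class BaseAlgorithm(ABC):
    @abstractmethod
    def select_action(self, context): ...           # choose a_t given x_t
    @abstractmethod
    def update(self, context, action, reward): ...  # learn from (x_t, a_t, r_t)

class Master(ABC):
    @abstractmethod
    def select_base(self, t): ...                   # index i_t of the base to play
    @abstractmethod
    def update(self, i_t, reward): ...              # track per-base statistics

def run(master, bases, env, T):
    """At each round the master picks a base, the base picks the action,
    and only the played base (plus the master) is updated."""
    for t in range(T):
        x = env.sample_context()          # i.i.d. context x_t
        i = master.select_base(t)
        a = bases[i].select_action(x)
        r = env.reward(x, a)              # noisy reward
        bases[i].update(x, a, r)
        master.update(i, r)
\end{verbatim}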
This problem has received a lot of attention over the years~\citep{AgarwalDKLS12,AgarwalLNS17,SinglaH017}, with a surge of interest this year~\citep{AbbasiYadkori2020regrebalancing,PacchianoPA0ZLS20,pacchiano2020regret}.

\vspace{.2in}

While this idea applies to many scenarios, we consider the problem of selecting a representation for LinUCB. We start by reviewing the setting of linear contextual bandits. At each round $t$, the algorithm receives a context $x_t \in \mathcal{X}$ (we assume the contexts are drawn i.i.d.\ from $\mathcal{X}$) and needs to select an action $a_t \in \mathcal{A}$ to play (we assume $\mathcal{A}$ is finite). The reward associated with a context-action pair $(x,a)$ is a linear function of a set of features in $\mathbb{R}^d$: $r(x,a) = \langle \phi(x,a), \theta^\star \rangle$. The vector $\theta^\star \in \mathbb{R}^d$ is unknown and the observed reward at time $t$ is a noisy version of the true reward: $r_t(x,a) = r(x,a) + \eta_t$, with $\eta_t$ a conditionally zero-mean $\sigma$-subgaussian random variable. The objective is to control the pseudo-regret $R(T) = \sum_{t=1}^T \langle \phi(x_t, a_t^\star) - \phi(x_t,a_t), \theta^\star \rangle$, where $a_t^\star$ is the optimal action for context $x_t$. The LinUCB algorithm suffers a regret of at most $\wt{O}(\log(\det(\Sigma_t)/\delta) \sqrt{t}) \leq \wt{O}(d\sqrt{t})$ for any $t \leq T$, where $\Sigma_t = \sum_{i=1}^t \phi(x_i, a_i) \phi(x_i, a_i)^\top + \lambda I$ is the $(\lambda>0)$-regularized design matrix.

We now assume that the mapping $\phi$ is unknown but we have access to a set of candidates $F = (f_i)_{i \in [M]}$. We can instantiate a version of LinUCB for each representation $f \in F$; denote by $B_i = \mathrm{LinUCB}(f_i)$ the corresponding base algorithm. The new interaction loop is as follows. At each round $t$, the master observes a context $x_t$ and needs to select the base algorithm $B_t$ to play among the $M$ variants of LinUCB. $B_t$ is executed, i.e., $B_t$ selects the action $a_t$ to play and a reward $r_t(x_t,a_t) = \langle \phi(x_t,a_t), \theta^\star \rangle + \eta_t$ is observed. The master and $B_t$ are updated using the observed reward. It is common to assume that at least one mapping is realizable:
\[
\exists f^\star \in F, \omega ~~:~~ \forall x,a, ~ r(x,a) = \langle \phi(x,a), \theta^\star \rangle = \langle f^\star(x,a), \omega \rangle
\]

\vspace{.2in}

We provide initial code implementing a linear representation and a bandit problem. The goal is to investigate the approach in the paper~\citep{pacchiano2020regret} and implement their algorithm (Algorithm 1) in this setting. You can play with the code to test your implementation. We provide a simple entry point in file \texttt{main\_simple.py} to test your implementation on small problems.

Consider the following setting for the final experiment (already implemented in the file \texttt{main\_final.py}). We consider a finite set of contexts and actions: $|\mathcal{X}| = 200$, $|\mathcal{A}|=20$. Let $|F| = 8$ and $f_8 = f^\star$ with $d_8 = d^\star = 20$ (i.e., the realizable representation). The others are such that
\begin{itemize}
  \item $f_1, f_2, f_3$ are nested representations of $f^\star$ with sizes $d_1 = 5$, $d_2 = 10$ and $d_3=25$, obtained by selecting the first $d_i$ features when $d_i < d^\star$ and padding with random features when $d_i > d^\star$.
  \item $f_4, f_5, f_6, f_7$ are randomly generated features of dimensions $3, 7, 16, 30$.
\end{itemize}
Let $T=50000$, $\sigma = 0.3$, $\delta = 0.01$ and $\lambda=1$. Remember to average the regret over multiple runs.

Questions:
\begin{enumerate}
\item Start from implementing LinUCB (a.k.a.\ OFUL). Use the slides or any other material as reference for the implementation (e.g., Section 6.1 in~\citep{pacchiano2020regret}). Suggestion: use the Sherman–Morrison formula for a more efficient implementation.
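For concreteness, here is a minimal NumPy sketch of what such an implementation could look like (class and variable names are illustrative, not those of the provided code); it maintains $\Sigma_t^{-1}$ directly through a Sherman–Morrison rank-one update, so each round costs $O(|\mathcal{A}|d^2)$ instead of a full $O(d^3)$ matrix inversion.
\begin{verbatim}
import numpy as np

class LinUCB:
    """Sketch of LinUCB / OFUL with a Sherman-Morrison update of Sigma^{-1}.
    Names and interface are hypothetical, not those of the provided code."""

    def __init__(self, features, reg=1.0, noise_std=0.3, delta=0.01, bound_theta=1.0):
        # features[x, a] = f(x, a), array of shape (n_contexts, n_actions, d)
        self.features = features
        self.d = features.shape[2]
        self.reg = reg                      # lambda
        self.noise_std = noise_std          # sigma
        self.delta = delta
        self.S = bound_theta                # assumed bound on ||theta*||_2
        self.L = np.max(np.linalg.norm(features, axis=2))  # bound on ||f(x,a)||_2
        self.inv_sigma = np.eye(self.d) / reg   # (lambda I)^{-1}
        self.b = np.zeros(self.d)               # sum_s r_s f(x_s, a_s)
        self.theta_hat = np.zeros(self.d)
        self.t = 0

    def beta(self):
        # beta_t = sigma sqrt(d ln((1 + t L^2/lambda)/delta)) + sqrt(lambda) S
        return (self.noise_std
                * np.sqrt(self.d * np.log((1 + self.t * self.L**2 / self.reg) / self.delta))
                + np.sqrt(self.reg) * self.S)

    def select_action(self, context):
        phi = self.features[context]        # shape (n_actions, d)
        mean = phi @ self.theta_hat
        width = np.sqrt(np.einsum('ad,dk,ak->a', phi, self.inv_sigma, phi))
        return int(np.argmax(mean + self.beta() * width))

    def update(self, context, action, reward):
        v = self.features[context, action]
        # Sherman-Morrison: (S + v v^T)^{-1} = S^{-1} - S^{-1} v v^T S^{-1} / (1 + v^T S^{-1} v)
        u = self.inv_sigma @ v
        self.inv_sigma -= np.outer(u, u) / (1.0 + v @ u)
        self.b += reward * v
        self.theta_hat = self.inv_sigma @ self.b
        self.t += 1
\end{verbatim}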
{\color{blue}
See code. The confidence scaling that was used is
$$ \beta_t = \sqrt{\sigma^2 d \ln \left( \frac{1+tL^2/\lambda}{\delta} \right) } + \sqrt{\lambda}S, $$
where $L$ is a bound on the norms of the features and $S$ a bound on $\Vert\theta^\star\Vert_2$. Here is the plot obtained by running \texttt{main\_simple.py}, showing the instantaneous regret as a function of $t$.
\begin{figure}[h!]
  \centering
  \includegraphics[width=9cm]{fig/oful.png}
\end{figure}
}

\item Understand the ``Regret Bound Balancing and Elimination Algorithm'' (RegBalAlg) in~\citep{pacchiano2020regret} (Algorithm 1).

\tikz{ \node[text width=0.9\textwidth, color=red] at (0,0) {
  \textbf{!} It's a long document, focus only on the parts relevant for the homework. Section 4 should be enough for implementing the algorithm. Clearly, the previous sections are necessary to better understand the setting and the notation.
}; }

\begin{enumerate}
  \item Could you briefly explain the idea behind the algorithm?
  \item Could you explain the elimination condition? What does the set $\mathcal{W}$ contain?
\end{enumerate}

{\color{blue}
\begin{enumerate}
  \item The algorithm relies on \emph{regret bound balancing}: at each round, it plays the active base learner with the smallest current candidate regret bound $R_i(n_i(t))$, so that the (putative) regret bounds of all active learners grow at roughly the same rate. On top of this, it adds an \emph{elimination} step that detects learners violating their candidate regret bound and removes them from the set of active learners. Combining the two, RegBalAlg ends up selecting the best models among the given set while keeping its own regret close to that of the best base learner.
  \item An active learner $i$ is eliminated as soon as there exists an active learner $j$ such that
  $$ \frac{U_i(t)}{n_i(t)} + c\sqrt{\frac{\ln(M \ln(n_i(t))/\delta)}{n_i(t)}} + \frac{R_i(n_i(t))}{n_i(t)} < \frac{U_j(t)}{n_j(t)} - c\sqrt{\frac{\ln(M \ln(n_j(t))/\delta)}{n_j(t)}}, $$
  where $U_i(t)$ is the cumulative reward collected by learner $i$ and $n_i(t)$ the number of rounds it was played. This condition comes from the fact that $U_i(t)$ concentrates around its expectation $n_i(t)\mu^\star - \mathrm{Reg}_i(t)$, where $\mathrm{Reg}_i$ is the pseudo-regret of learner $i$ and $\mu^\star$ the optimal mean reward. More rigorously, one can introduce a concentration inequality that holds with probability at least $1-\delta$ and whose deviation term is $c\sqrt{\frac{\ln(M \ln(n_i(t))/\delta)}{n_i(t)}}$. From this inequality and the fact that $\mathrm{Reg}_i(t) \leq R_i(n_i(t))$ when $i$ is well-specified, one gets that if $i$ is well-specified then, for every learner $j$,
  $$ \frac{U_i(t)}{n_i(t)} + c\sqrt{\frac{\ln(M \ln(n_i(t))/\delta)}{n_i(t)}} + \frac{R_i(n_i(t))}{n_i(t)} \;\geq\; \mu^\star \;\geq\; \frac{U_j(t)}{n_j(t)} - c\sqrt{\frac{\ln(M \ln(n_j(t))/\delta)}{n_j(t)}}. $$
  Therefore, when the elimination condition above holds, it means (with high probability) that $\mathrm{Reg}_i(t) > R_i(n_i(t))$, i.e., learner $i$ has violated its candidate regret bound, so it is not well-specified and can safely be eliminated (see the implementation sketch below). The set $\mathcal{W}$ contains the well-specified learners, i.e., those whose candidate regret bound actually holds.
\end{enumerate}
}

\item Implement the algorithm ``Regret Bound Balancing and Elimination Algorithm'' (RegBalAlg) for the mentioned problem (see provided code).
\begin{enumerate}
  \item Compare the RegBalAlg algorithm versus LinUCB on each representation (i.e., $B_i$ for $i \in [M]$). Report the picture with the pseudo-regret of each algorithm.
  \item What upper-bound did you use for implementing RegBalAlg?
  \item Is any representation eliminated? Is the optimal representation the only one not eliminated? Try to explain what you observed.
  \item Plot the regret of RegBalAlg and that of the optimal representation. If you see a sudden change in the regret of RegBalAlg, could you explain why this happens?
\end{enumerate}
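Before turning to the results, here is a minimal sketch of a master implementing the balancing selection rule and the elimination test of Algorithm 1 (names are hypothetical and differ from the provided code). It is meant to sit on top of the LinUCB base learners, which supply their current candidate regret bounds $R_i(n_i(t))$, e.g., the data-dependent quantity discussed in answer (b) below.
\begin{verbatim}
import numpy as np

class RegretBalancing:
    """Sketch of the balancing + elimination master (Algorithm 1 of the paper).
    Hypothetical names; regret_bounds[i] stands for R_i(n_i(t))."""

    def __init__(self, n_learners, delta=0.01, c=1.0):
        self.M = n_learners
        self.delta = delta
        self.c = c                           # constant of the concentration term
        self.active = set(range(n_learners))
        self.n = np.zeros(n_learners)        # n_i(t): number of plays of learner i
        self.U = np.zeros(n_learners)        # U_i(t): cumulative reward of learner i

    def select_base(self, regret_bounds):
        # balancing: play the active learner with the smallest current regret bound
        return min(self.active, key=lambda i: regret_bounds[i])

    def update(self, i, reward, regret_bounds):
        self.n[i] += 1
        self.U[i] += reward
        self._eliminate(regret_bounds)

    def _conf(self, i):
        # deviation term c * sqrt( ln(M ln(n_i) / delta) / n_i )
        n_i = self.n[i]
        return self.c * np.sqrt(np.log(self.M * np.log(max(n_i, np.e)) / self.delta) / n_i)

    def _eliminate(self, regret_bounds):
        played = [i for i in self.active if self.n[i] > 0]
        if not played:
            return
        # best pessimistic average reward among the active learners
        best_lcb = max(self.U[j] / self.n[j] - self._conf(j) for j in played)
        for i in played:
            ucb_i = self.U[i] / self.n[i] + self._conf(i) + regret_bounds[i] / self.n[i]
            if ucb_i < best_lcb:             # learner i violated its regret bound
                self.active.remove(i)
\end{verbatim}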
{\color{blue}
\begin{enumerate}
  \item We report the pseudo-regret of each algorithm:
  \begin{figure}[h!]
    \centering
    \includegraphics[width=9cm]{fig/pseudo_regret.png}
  \end{figure}
  One can see that RegBalAlg finds the best learners quite fast (after around 10000 iterations here) without incurring a high regret in the meantime. As expected, the misspecified learners end up with a large final regret. The overall algorithm is therefore much better than a naive approach that would test all the learners one after the other: it suffers less cumulative regret and quickly discriminates between the learners.
  \item The upper bound used in the implementation is the data-dependent bound
  $$ \mathrm{Reg}(T) \leq 2\sum_{t=1}^{T}\beta_t \Vert \phi(x_t, a_t) \Vert_{\Sigma_{t-1}^{-1}}, $$
  accumulated by each base learner over the rounds where it is played (with its own features and design matrix). It is easy to compute online and is tighter in practice than the worst-case $\wt{O}(d\sqrt{t})$ bound of the original OFUL analysis used in the paper.
  \item Recall that $f_8 = f^\star$ (with $d_8 = d^\star = 20$) is the realizable representation, that $f_1, f_2, f_3$ are nested representations of $f^\star$ of sizes $5$, $10$ and $25$ (truncation when $d_i < d^\star$, padding with random features when $d_i > d^\star$), and that $f_4, \dots, f_7$ are randomly generated representations of dimensions $3, 7, 16, 30$.

  In RegBalAlg, all the representations except $f_3$ and $f_8$ are eliminated, so the optimal representation is not the only one remaining at the end. This is not surprising: since $f_3$ contains all the features of $f^\star$ (plus $5$ random ones), it is itself well-specified (the extra coordinates of $\omega$ can simply be set to $0$), only with a slightly larger dimension, so it also leads to good performance. One can also observe that $f_2$ leads to a smaller slope of the pseudo-regret than $f_1$, which, again, is plausible since $f_2$ contains more features of $f^\star$ than $f_1$. Finally, both $f_1$ and $f_2$ behave better than the representations $f_4$ to $f_7$, which are randomly generated and hence misspecified with high probability.
  \item Here is the regret of RegBalAlg together with that of the optimal representation.
  \begin{figure}[h!]
    \centering
    \includegraphics[width=9cm]{fig/regret.png}
  \end{figure}
  One can see a sudden change in the regret of RegBalAlg around $t \approx 10000$. This horizon corresponds to the moment when the last misspecified representation is eliminated. From this moment on, the actions are always selected by the well-specified representations ($f_8$ or $f_3$), which at this point essentially play the optimal actions, so almost no additional regret is incurred and the curve flattens.
\end{enumerate}
}
\end{enumerate}

\bibliographystyle{plainnat}
\bibliography{bibliography}

\end{document}