No-Regret Learning: Multi-Armed Bandits 2
Instructor: Thomas Kesselheim
1 Last Lectures
In the last lecture, we turned the Multiplicative Weights algorithm from the lecture before into one that works with bandit feedback.
We can choose from $n$ actions in every step. An adversary determines the sequence of cost vectors $\ell^{(1)}, \dots, \ell^{(T)}$ in advance, with $\ell^{(t)}_i \in [0,1]$. The sequence is unknown to the algorithm. In step $t$, the algorithm chooses one of the $n$ actions at random by defining probabilities $p^{(t)}_1, \dots, p^{(t)}_n$. The algorithm's choice in step $t$ is denoted by $I_t$. The algorithm gets to know $\ell^{(t)}_{I_t}$; the other entries of the cost vector remain unknown.
We used the Multiplicative Weights algorithm in a way that lets us reuse its regret bound by computing "fake costs" $\tilde\ell^{(t)}_i$. The final combined algorithm then looks as follows, using $\gamma$, $\eta$, and $\rho$ as parameters.
• Initially, set $w^{(1)}_i = 1$ and $p^{(1)}_i = \frac{1}{n}$ for every $i \in [n]$.
• At every time $t$:
  – Define $q^{(t)}_i = (1-\gamma)\, p^{(t)}_i + \frac{\gamma}{n}$.
  – Choose $I_t$ based on $q^{(t)}$.
  – Define $\tilde\ell^{(t)}_{I_t} = \ell^{(t)}_{I_t} / q^{(t)}_{I_t}$ and $\tilde\ell^{(t)}_i = 0$ for $i \neq I_t$.
  – Multiplicative-Weights update:
    ∗ Set $w^{(t+1)}_i = w^{(t)}_i \cdot \exp\left(-\eta\, \frac{1}{\rho}\, \tilde\ell^{(t)}_i\right)$
    ∗ $W^{(t+1)} = \sum_{i=1}^n w^{(t+1)}_i$
    ∗ $p^{(t+1)}_i = w^{(t+1)}_i / W^{(t+1)}$
We set $\gamma = \sqrt[3]{\frac{n \ln n}{T}}$, $\eta = \ln\frac{1}{1-\gamma}$, and $\rho = \frac{n}{\gamma}$ to get a regret bound of $3 (n \ln n)^{1/3} T^{2/3}$. Note that we use the weight update $w^{(t+1)}_i = w^{(t)}_i \cdot \exp\left(-\eta \frac{1}{\rho} \tilde\ell^{(t)}_i\right)$ instead of $w^{(t+1)}_i = w^{(t)}_i \cdot (1-\gamma)^{\frac{1}{\rho} \tilde\ell^{(t)}_i}$; with this choice of $\eta$, the two are only different parameterizations of the same update.
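As a concrete illustration, the combined algorithm above can be sketched in a few lines of Python. This is my own minimal sketch; the function name `bandit_mw` and the list-of-cost-vectors interface are assumptions, not part of the lecture.

```python
import math
import random

def bandit_mw(costs, gamma, eta, rho):
    """Bandit multiplicative-weights sketch. costs[t][i] lies in [0, 1];
    only the entry of the arm actually pulled is ever looked at."""
    n = len(costs[0])
    w = [1.0] * n  # w_i^(1) = 1
    total = 0.0
    for cost in costs:
        W = sum(w)
        p = [wi / W for wi in w]
        # q mixes p with uniform exploration
        q = [(1 - gamma) * pi + gamma / n for pi in p]
        I = random.choices(range(n), weights=q)[0]  # draw I_t from q
        total += cost[I]  # only this entry is observed
        fake = cost[I] / q[I]  # fake cost of the pulled arm; 0 for the rest
        w[I] *= math.exp(-eta * fake / rho)  # exp(0) = 1 for unpulled arms
    return total
```

Note that only the pulled arm's weight changes, since $\tilde\ell^{(t)}_i = 0$ and $\exp(0) = 1$ for every other arm.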
2 The Exp3 Algorithm
There is a way to improve the regret guarantee to $O(\sqrt{nT \log n})$, which we will get to know today. The algorithm is called Exp3, which stands for "Explore and Exploit with Exponential Weights". And, in fact, we already know the algorithm: it is exactly the one listed above, only with a smarter choice of parameters and a more careful analysis.
Our original analysis of the multiplicative-weights update could only deal with cost vectors such that $0 \le \tilde\ell^{(t)}_i \le \rho$. Now a single entry $\tilde\ell^{(t)}_i$ can be as large as $\frac{n}{\gamma}$; this is why we chose $\rho = \frac{n}{\gamma}$. Exp3 instead sets $\rho = 1$. This means the update step is much more aggressive than with our previous parameter choice. The vague idea to keep in mind for why this is reasonable is that $\tilde\ell^{(t)}_i = 0$ most of the time: the fake cost is non-zero only for the action that has just been chosen.
The other parameters, γ and η, will be determined later.
3 A Refined Bound of the Multiplicative-Weights Update
The key to proving the regret guarantee of Exp3 is a more careful analysis of the multiplicative-weights update, now allowing $\tilde\ell^{(t)}_i > 1$ despite setting $\rho = 1$. We can show the following bound.
Lemma 18.1. Fix $\tilde\ell^{(1)}, \dots, \tilde\ell^{(T)}$ arbitrarily such that $0 \le \tilde\ell^{(t)}_i \le \frac{1}{\eta}$ for all $i$ and $t$. Then the vectors $p^{(1)}, \dots, p^{(T)}$ computed by the multiplicative-weights update (with $\rho = 1$) fulfill
$$\sum_{t=1}^T \sum_{i=1}^n p^{(t)}_i \tilde\ell^{(t)}_i \;-\; \eta \sum_{t=1}^T \sum_{i=1}^n p^{(t)}_i \left(\tilde\ell^{(t)}_i\right)^2 \;\le\; \min_i \sum_{t=1}^T \tilde\ell^{(t)}_i + \frac{\ln n}{\eta}.$$
Proof. We prove this bound in a very similar way to our original analysis of the multiplicative-weights algorithm. We again use the sum of the weights to lower-bound any expert's cost as well as to upper-bound the algorithm's cost. For the lower bound, we use that for every expert $i$
$$W^{(T+1)} \ge w^{(T+1)}_i = \exp\left(-\eta \sum_{t=1}^T \tilde\ell^{(t)}_i\right).$$
Since this holds for every $i$, taking the logarithm yields
$$\ln W^{(T+1)} \ge -\eta \min_i \sum_{t=1}^T \tilde\ell^{(t)}_i.$$
For the upper bound, we consider the weight change in step $t$. We have
$$W^{(t+1)} = \sum_{i=1}^n w^{(t)}_i e^{-\eta \tilde\ell^{(t)}_i}.$$
We use that $e^z \le 1 + z + z^2$ for $-1 \le z \le 1$. So we have $e^{-\eta \tilde\ell^{(t)}_i} \le 1 - \eta \tilde\ell^{(t)}_i + \left(\eta \tilde\ell^{(t)}_i\right)^2$ because $0 \le \eta \tilde\ell^{(t)}_i \le 1$. Furthermore, note that we can write $w^{(t)}_i = W^{(t)} p^{(t)}_i$ to get
$$W^{(t+1)} \le \sum_{i=1}^n w^{(t)}_i \left(1 - \eta \tilde\ell^{(t)}_i + \left(\eta \tilde\ell^{(t)}_i\right)^2\right) = \sum_{i=1}^n w^{(t)}_i - \sum_{i=1}^n w^{(t)}_i\, \eta \tilde\ell^{(t)}_i + \sum_{i=1}^n w^{(t)}_i \left(\eta \tilde\ell^{(t)}_i\right)^2 = W^{(t)} \left(1 - \eta \sum_{i=1}^n p^{(t)}_i \tilde\ell^{(t)}_i + \eta^2 \sum_{i=1}^n p^{(t)}_i \left(\tilde\ell^{(t)}_i\right)^2\right).$$
Repeatedly applying this bound and using that $W^{(1)} = n$, we get
$$W^{(T+1)} \le n \prod_{t=1}^T \left(1 - \eta \sum_{i=1}^n p^{(t)}_i \tilde\ell^{(t)}_i + \eta^2 \sum_{i=1}^n p^{(t)}_i \left(\tilde\ell^{(t)}_i\right)^2\right).$$
Again, we take the logarithm to get
$$\ln W^{(T+1)} \le \ln n + \sum_{t=1}^T \ln\left(1 - \eta \sum_{i=1}^n p^{(t)}_i \tilde\ell^{(t)}_i + \eta^2 \sum_{i=1}^n p^{(t)}_i \left(\tilde\ell^{(t)}_i\right)^2\right).$$
We use that $\ln(1+z) \le z$ for all $z > -1$ to simplify this expression to
$$\ln W^{(T+1)} \le \ln n - \eta \sum_{t=1}^T \sum_{i=1}^n p^{(t)}_i \tilde\ell^{(t)}_i + \eta^2 \sum_{t=1}^T \sum_{i=1}^n p^{(t)}_i \left(\tilde\ell^{(t)}_i\right)^2.$$
Combining the two bounds on $\ln W^{(T+1)}$, we get
$$-\eta \min_i \sum_{t=1}^T \tilde\ell^{(t)}_i \le \ln n - \eta \sum_{t=1}^T \sum_{i=1}^n p^{(t)}_i \tilde\ell^{(t)}_i + \eta^2 \sum_{t=1}^T \sum_{i=1}^n p^{(t)}_i \left(\tilde\ell^{(t)}_i\right)^2,$$
which is equivalent to the claim after rearranging and dividing by $\eta$.
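Lemma 18.1 can also be checked empirically. The sketch below (my own code, not part of the lecture) runs the multiplicative-weights update with $\rho = 1$ on random fake costs in $[0, 1/\eta]$ and evaluates both sides of the inequality:

```python
import math
import random

random.seed(7)
n, T, eta = 5, 200, 0.1
# random fake costs satisfying 0 <= fake[t][i] <= 1/eta
fake = [[random.uniform(0, 1 / eta) for _ in range(n)] for _ in range(T)]

w = [1.0] * n
lhs = 0.0
for ell in fake:
    W = sum(w)
    p = [wi / W for wi in w]
    lin = sum(pi * li for pi, li in zip(p, ell))      # sum_i p_i * fake_i
    sq = sum(pi * li * li for pi, li in zip(p, ell))  # sum_i p_i * fake_i^2
    lhs += lin - eta * sq
    w = [wi * math.exp(-eta * li) for wi, li in zip(w, ell)]  # rho = 1 update

# best single arm's total fake cost, plus ln(n)/eta
rhs = min(sum(fake[t][i] for t in range(T)) for i in range(n)) + math.log(n) / eta
# Lemma 18.1 asserts lhs <= rhs
```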
4 Analysis of Exp3
Based on Lemma 18.1, the remaining analysis of Exp3 works almost the same way as the one for the basic algorithm.
Theorem 18.2. If $\eta \le \frac{\gamma}{n}$, Exp3 has expected cost at most
$$\min_i \sum_{t=1}^T \ell^{(t)}_i + \frac{\ln n}{\eta} + \eta n T + \gamma T.$$
Proof. Once again, we first fix $I_1, \dots, I_T$ arbitrarily. This also fixes $\tilde\ell^{(1)}, \dots, \tilde\ell^{(T)}$, which are fed into the multiplicative-weights part; this way $p^{(1)}, \dots, p^{(T)}$ are fixed as well. So we can invoke Lemma 18.1: since $q^{(t)}_i \ge \frac{\gamma}{n}$, we have $\tilde\ell^{(t)}_i \le \frac{n}{\gamma} \le \frac{1}{\eta}$ by the assumption $\eta \le \frac{\gamma}{n}$. Replacing $q^{(t)}_i = (1-\gamma) p^{(t)}_i + \frac{\gamma}{n}$, we have
$$\sum_{t=1}^T \sum_{i=1}^n q^{(t)}_i \tilde\ell^{(t)}_i = \sum_{t=1}^T \sum_{i=1}^n \left((1-\gamma) p^{(t)}_i + \frac{\gamma}{n}\right) \tilde\ell^{(t)}_i = (1-\gamma) \sum_{t=1}^T \sum_{i=1}^n p^{(t)}_i \tilde\ell^{(t)}_i + \frac{\gamma}{n} \sum_{t=1}^T \sum_{i=1}^n \tilde\ell^{(t)}_i$$
$$\le (1-\gamma) \left(\min_i \sum_{t=1}^T \tilde\ell^{(t)}_i + \frac{\ln n}{\eta} + \eta \sum_{t=1}^T \sum_{i=1}^n p^{(t)}_i \left(\tilde\ell^{(t)}_i\right)^2\right) + \frac{\gamma}{n} \sum_{t=1}^T \sum_{i=1}^n \tilde\ell^{(t)}_i$$
$$\le \min_i \sum_{t=1}^T \tilde\ell^{(t)}_i + \frac{\ln n}{\eta} + \eta \sum_{t=1}^T \sum_{i=1}^n q^{(t)}_i \left(\tilde\ell^{(t)}_i\right)^2 + \frac{\gamma}{n} \sum_{t=1}^T \sum_{i=1}^n \tilde\ell^{(t)}_i,$$
where the last step uses $(1-\gamma) \le 1$ and $(1-\gamma) p^{(t)}_i \le q^{(t)}_i$.
Next, we consider how the values $\tilde\ell^{(t)}_i$ are derived from the $\ell^{(t)}_i$. To this end, keep $I_1, \dots, I_{t-1}$ fixed. Like in the analysis of our black-box transformation, we have
$$\mathbf{E}\left[\tilde\ell^{(t)}_i \,\middle|\, I_1, \dots, I_{t-1}\right] = \Pr[I_t = i \mid I_1, \dots, I_{t-1}] \cdot \frac{\ell^{(t)}_i}{q^{(t)}_i} + \Pr[I_t \neq i \mid I_1, \dots, I_{t-1}] \cdot 0 = \ell^{(t)}_i.$$
So, also
$$\mathbf{E}\left[\sum_{i=1}^n q^{(t)}_i \tilde\ell^{(t)}_i\right] = \sum_{i=1}^n \mathbf{E}\left[q^{(t)}_i \tilde\ell^{(t)}_i\right] = \sum_{i=1}^n \mathbf{E}\left[q^{(t)}_i\right] \ell^{(t)}_i = \mathbf{E}\left[\ell^{(t)}_{I_t}\right].$$
Now, we also have quadratic terms. For these, we can derive
$$\mathbf{E}\left[\left(\tilde\ell^{(t)}_i\right)^2 \,\middle|\, I_1, \dots, I_{t-1}\right] = \Pr[I_t = i \mid I_1, \dots, I_{t-1}] \left(\frac{\ell^{(t)}_i}{q^{(t)}_i}\right)^2 + \Pr[I_t \neq i \mid I_1, \dots, I_{t-1}] \cdot 0 = \frac{\left(\ell^{(t)}_i\right)^2}{q^{(t)}_i}.$$
This gives us, for any choice of $I_1, \dots, I_{t-1}$,
$$\mathbf{E}\left[\sum_{i=1}^n q^{(t)}_i \left(\tilde\ell^{(t)}_i\right)^2 \,\middle|\, I_1, \dots, I_{t-1}\right] = \sum_{i=1}^n q^{(t)}_i \frac{\left(\ell^{(t)}_i\right)^2}{q^{(t)}_i} = \sum_{i=1}^n \left(\ell^{(t)}_i\right)^2.$$
As the right-hand side is independent of $I_1, \dots, I_{t-1}$, this identity also holds for the unconditional expectation:
$$\mathbf{E}\left[\sum_{i=1}^n q^{(t)}_i \left(\tilde\ell^{(t)}_i\right)^2\right] = \sum_{i=1}^n \left(\ell^{(t)}_i\right)^2.$$
Taking the expectation over the bound from the multiplicative-weights part, we get
$$\mathbf{E}\left[\sum_{t=1}^T \ell^{(t)}_{I_t}\right] \le \mathbf{E}\left[\min_i \sum_{t=1}^T \tilde\ell^{(t)}_i\right] + \frac{\ln n}{\eta} + \eta \sum_{t=1}^T \sum_{i=1}^n \mathbf{E}\left[q^{(t)}_i \left(\tilde\ell^{(t)}_i\right)^2\right] + \frac{\gamma}{n} \sum_{t=1}^T \sum_{i=1}^n \mathbf{E}\left[\tilde\ell^{(t)}_i\right].$$
Inserting the above identities, and using that $\mathbf{E}\left[\min_i \sum_{t=1}^T \tilde\ell^{(t)}_i\right] \le \min_i \mathbf{E}\left[\sum_{t=1}^T \tilde\ell^{(t)}_i\right] = \min_i \sum_{t=1}^T \ell^{(t)}_i$, this implies
$$\mathbf{E}\left[\sum_{t=1}^T \ell^{(t)}_{I_t}\right] \le \min_i \sum_{t=1}^T \ell^{(t)}_i + \frac{\ln n}{\eta} + \eta \sum_{t=1}^T \sum_{i=1}^n \left(\ell^{(t)}_i\right)^2 + \frac{\gamma}{n} \sum_{t=1}^T \sum_{i=1}^n \ell^{(t)}_i.$$
Finally, we use that $\ell^{(t)}_i \le 1$ for all $i$ and $t$. This lets us bound each double sum by $nT$. (This is not too wasteful because they are multiplied by $\eta$ or $\frac{\gamma}{n}$, which are small.) Therefore
$$\mathbf{E}\left[\sum_{t=1}^T \ell^{(t)}_{I_t}\right] \le \min_i \sum_{t=1}^T \ell^{(t)}_i + \frac{\ln n}{\eta} + \eta n T + \gamma T.$$
Corollary 18.3. Setting $\eta = \sqrt{\frac{\ln n}{nT}}$ and $\gamma = n\eta$, the external regret of Exp3 is at most $3\sqrt{nT \ln n}$.
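Putting everything together, Exp3 with exactly these parameters can be sketched as follows. The toy cost sequence at the bottom is my own easy test instance (not a worst case), used only to compare the empirical regret against the bound:

```python
import math
import random

def exp3(costs):
    """Exp3 with the parameters of Corollary 18.3 (and rho = 1)."""
    T, n = len(costs), len(costs[0])
    eta = math.sqrt(math.log(n) / (n * T))
    gamma = n * eta
    w = [1.0] * n
    total = 0.0
    for cost in costs:
        W = sum(w)
        q = [(1 - gamma) * wi / W + gamma / n for wi in w]
        I = random.choices(range(n), weights=q)[0]
        total += cost[I]
        w[I] *= math.exp(-eta * cost[I] / q[I])  # aggressive update, rho = 1
    return total

random.seed(1)
T, n = 20_000, 2
costs = [[0.3, 0.7]] * T  # arm 0 is clearly better
best = min(sum(c[i] for c in costs) for i in range(n))
regret = exp3(costs) - best
bound = 3 * math.sqrt(n * T * math.log(n))  # roughly 500 for these values
```

On such an easy instance the empirical regret should stay well below the worst-case guarantee.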
5 Lower Bound on the Regret
Theorem 18.4. Even for $n = 2$, no algorithm guarantees external regret $o(\sqrt{T})$.
Proof. Let $T$ be an even square number. We generate a random sequence $\ell^{(1)}, \dots, \ell^{(T)}$: for each $t$, we set $\ell^{(t)}$ independently to $(1,0)$ or to $(0,1)$ with probability $1/2$ each. Observe that in each step, no matter how the algorithm chooses its probabilities, its expected cost will be $1/2$. So $\mathbf{E}\left[L^{(T)}_{\mathrm{Alg}}\right] = T/2$, where the expectation is also over the randomization of the sequence.
We have to compare this to $\mathbf{E}\left[\min_i L^{(T)}_i\right]$. We will show that $\mathbf{E}\left[\min_i L^{(T)}_i\right] = T/2 - \Omega(\sqrt{T})$. Note that $L^{(T)}_1$ and $L^{(T)}_2$ are identically distributed, namely according to a binomial distribution with parameters $T$ and $1/2$. So each is distributed like the number of times we see heads in $T$ independent fair coin tosses.
Furthermore, $L^{(T)}_1 + L^{(T)}_2 = T$. So $\min_i L^{(T)}_i$ never exceeds $T/2$. Therefore, we can write
$$\mathbf{E}\left[\min_i L^{(T)}_i\right] \le \Pr\left[\min_i L^{(T)}_i < \frac{T}{2} - \alpha\sqrt{T}\right] \left(\frac{T}{2} - \alpha\sqrt{T}\right) + \Pr\left[\min_i L^{(T)}_i \ge \frac{T}{2} - \alpha\sqrt{T}\right] \frac{T}{2}$$
$$\le \frac{T}{2} - \alpha\sqrt{T} + \alpha\sqrt{T}\, \Pr\left[\min_i L^{(T)}_i \ge \frac{T}{2} - \alpha\sqrt{T}\right].$$
We have $\min_i L^{(T)}_i \ge \frac{T}{2} - \alpha\sqrt{T}$ if and only if $\frac{T}{2} - \alpha\sqrt{T} \le L^{(T)}_1 \le \frac{T}{2} + \alpha\sqrt{T}$, so
$$\Pr\left[\min_i L^{(T)}_i \ge \frac{T}{2} - \alpha\sqrt{T}\right] = \Pr\left[\frac{T}{2} - \alpha\sqrt{T} \le L^{(T)}_1 \le \frac{T}{2} + \alpha\sqrt{T}\right].$$
We have to show that $L^{(T)}_1$ is not always close to its expectation (which is $T/2$). Pictorially, we have to show that the gray area carries at least a constant probability.

[Figure: the distribution of $L^{(T)}_1$, with the region outside the interval $\left[\frac{T}{2} - \alpha\sqrt{T},\ \frac{T}{2} + \alpha\sqrt{T}\right]$ shaded gray.]
As $L^{(T)}_1$ is binomially distributed, we have
$$\Pr\left[\frac{T}{2} - \alpha\sqrt{T} \le L^{(T)}_1 \le \frac{T}{2} + \alpha\sqrt{T}\right] = \sum_{j = T/2 - \alpha\sqrt{T}}^{T/2 + \alpha\sqrt{T}} \Pr\left[L^{(T)}_1 = j\right] \qquad\text{and}\qquad \Pr\left[L^{(T)}_1 = j\right] = \frac{1}{2^T} \binom{T}{j}.$$
We have to bound the binomial coefficient. We can do this using Stirling's approximation, which says $\sqrt{2\pi}\, k^{k+\frac{1}{2}} e^{-k} \le k! \le e\, k^{k+\frac{1}{2}} e^{-k}$ for all $k$. This gives us $\binom{T}{T/2} \le \frac{e}{\pi} \cdot \frac{2^T}{\sqrt{T}}$. Using the monotonicity of binomial coefficients, we have $\binom{T}{j} \le \frac{e}{\pi} \cdot \frac{2^T}{\sqrt{T}}$ for all $j$. So
$$\Pr\left[L^{(T)}_1 = j\right] = \frac{1}{2^T} \binom{T}{j} \le \frac{e}{\pi} \cdot \frac{1}{\sqrt{T}}$$
and therefore
$$\Pr\left[\min_i L^{(T)}_i \ge \frac{T}{2} - \alpha\sqrt{T}\right] \le 2\alpha\sqrt{T} \cdot \frac{e}{\pi} \cdot \frac{1}{\sqrt{T}} = \frac{2\alpha e}{\pi}$$
and also
$$\mathbf{E}\left[\min_i L^{(T)}_i\right] \le \frac{T}{2} - \alpha\sqrt{T} + \alpha\sqrt{T} \cdot \frac{2\alpha e}{\pi}.$$
Using, for example, $\alpha = \frac{1}{2}$, we get $\mathbf{E}\left[\min_i L^{(T)}_i\right] \le \frac{T}{2} - \frac{1}{2}\left(1 - \frac{e}{\pi}\right)\sqrt{T} \le \frac{T}{2} - 0.06\sqrt{T}$.
6 Reference
Peter Auer, Nicolò Cesa-Bianchi, Yoav Freund, and Robert E. Schapire. The Nonstochastic Multiarmed Bandit Problem. SIAM Journal on Computing 32(1), January 2003, 48–77.