No-Regret Learning: Multi-Armed Bandits 2
Instructor: Thomas Kesselheim
1 Last Lectures
In the last lecture, we turned the Multiplicative Weights algorithm from the lecture before into one that works with bandit feedback.
We can choose from $n$ actions in every step. An adversary determines the sequence of cost vectors $\ell^{(1)}, \dots, \ell^{(T)}$ in advance, with $\ell^{(t)}_i \in [0,1]$. The sequence is unknown to the algorithm. In step $t$, the algorithm chooses one of the $n$ actions at random by defining probabilities $p^{(t)}_1, \dots, p^{(t)}_n$. The algorithm's choice in step $t$ is denoted by $I_t$. The algorithm gets to know $\ell^{(t)}_{I_t}$; the other entries of the cost vector remain unknown.
We used the Multiplicative Weights algorithm in a way that lets us reuse its regret bound by computing "fake costs" $\tilde\ell^{(t)}_i$. The final combined algorithm then looks as follows, using $\gamma$, $\eta$, and $\rho$ as parameters.
• Initially, set $w^{(1)}_i = 1$ and $p^{(1)}_i = \frac{1}{n}$ for every $i \in [n]$.
• At every time $t$:
  – Define $q^{(t)}_i = (1-\gamma)\, p^{(t)}_i + \frac{\gamma}{n}$.
  – Choose $I_t$ based on $q^{(t)}$.
  – Define $\tilde\ell^{(t)}_{I_t} = \ell^{(t)}_{I_t} / q^{(t)}_{I_t}$ and $\tilde\ell^{(t)}_i = 0$ for $i \neq I_t$.
  – Multiplicative-Weights update:
    ∗ Set $w^{(t+1)}_i = w^{(t)}_i \cdot \exp\left(-\eta\, \frac{1}{\rho}\, \tilde\ell^{(t)}_i\right)$
    ∗ $W^{(t+1)} = \sum_{i=1}^n w^{(t+1)}_i$
    ∗ $p^{(t+1)}_i = w^{(t+1)}_i / W^{(t+1)}$
We set $\gamma = \sqrt[3]{\frac{n \ln n}{T}}$, $\eta = \ln\frac{1}{1-\gamma}$, and $\rho = \frac{n}{\gamma}$ to get a regret bound of $3 (n \ln n)^{1/3} T^{2/3}$. Note that we use the weight update $w^{(t+1)}_i = w^{(t)}_i \cdot \exp\left(-\eta \frac{1}{\rho} \tilde\ell^{(t)}_i\right)$ instead of $w^{(t+1)}_i = w^{(t)}_i \cdot (1-\gamma)^{\frac{1}{\rho} \tilde\ell^{(t)}_i}$; with this choice of $\eta$, the two are only different parameterizations of the same update.
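As a concrete illustration, the combined algorithm above can be sketched in a few lines of Python. This is my own minimal sketch; the function name `bandit_mw` and the list-of-cost-vectors interface are assumptions, not part of the lecture.

```python
import math
import random

def bandit_mw(costs, gamma, eta, rho):
    """Bandit multiplicative-weights sketch. costs[t][i] lies in [0, 1];
    only the entry of the arm actually pulled is ever looked at."""
    n = len(costs[0])
    w = [1.0] * n  # w_i^(1) = 1
    total = 0.0
    for cost in costs:
        W = sum(w)
        p = [wi / W for wi in w]
        # q mixes p with uniform exploration
        q = [(1 - gamma) * pi + gamma / n for pi in p]
        I = random.choices(range(n), weights=q)[0]  # draw I_t from q
        total += cost[I]  # only this entry is observed
        fake = cost[I] / q[I]  # fake cost of the pulled arm; 0 for the rest
        w[I] *= math.exp(-eta * fake / rho)  # exp(0) = 1 for unpulled arms
    return total
```

Note that only the pulled arm's weight changes, since $\tilde\ell^{(t)}_i = 0$ and $\exp(0) = 1$ for every other arm.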
2 The Exp3 Algorithm
There is a way to improve the regret guarantee to $O(\sqrt{nT \log n})$, which we will get to know today. The algorithm is called Exp3, which stands for "Explore and Exploit with Exponential Weights". And, in fact, we already know the algorithm: it is exactly the one listed above, only with a smarter choice of parameters and a more careful analysis.
Our original analysis of the multiplicative-weights update could only deal with cost vectors such that $0 \le \tilde\ell^{(t)}_i \le \rho$. Now a single entry $\tilde\ell^{(t)}_i$ can be as large as $\frac{n}{\gamma}$; this is why we chose $\rho = \frac{n}{\gamma}$. Exp3 instead sets $\rho = 1$. This means the update step is much more aggressive than with our previous parameter choice. The vague idea to keep in mind for why this is reasonable is that $\tilde\ell^{(t)}_i = 0$ most of the time: the fake cost is non-zero only for the action that has just been chosen.
The other parameters, γ and η, will be determined later.
3 A Refined Bound of the Multiplicative-Weights Update
The key to proving the regret guarantee of Exp3 is a more careful analysis of the multiplicative-weights update, now allowing $\tilde\ell^{(t)}_i > 1$ despite setting $\rho = 1$. We can show the following bound.
Lemma 18.1. Fix $\tilde\ell^{(1)}, \dots, \tilde\ell^{(T)}$ arbitrarily such that $0 \le \tilde\ell^{(t)}_i \le \frac{1}{\eta}$ for all $i$ and $t$. Then the vectors $p^{(1)}, \dots, p^{(T)}$ computed by the multiplicative-weights update (with $\rho = 1$) fulfill
$$\sum_{t=1}^T \sum_{i=1}^n p^{(t)}_i \tilde\ell^{(t)}_i \;-\; \eta \sum_{t=1}^T \sum_{i=1}^n p^{(t)}_i \left(\tilde\ell^{(t)}_i\right)^2 \;\le\; \min_i \sum_{t=1}^T \tilde\ell^{(t)}_i + \frac{\ln n}{\eta}.$$
Proof. We prove this bound in a very similar way to our original analysis of the multiplicative-weights algorithm. We again use the sum of the weights to lower-bound any expert's cost as well as to upper-bound the algorithm's cost. For the lower bound, we use that for every expert $i$
$$W^{(T+1)} \ge w^{(T+1)}_i = \exp\left(-\eta \sum_{t=1}^T \tilde\ell^{(t)}_i\right).$$
Since this holds for every $i$, taking the logarithm yields
$$\ln W^{(T+1)} \ge -\eta \min_i \sum_{t=1}^T \tilde\ell^{(t)}_i.$$
For the upper bound, we consider the weight change in step $t$. We have
$$W^{(t+1)} = \sum_{i=1}^n w^{(t)}_i e^{-\eta \tilde\ell^{(t)}_i}.$$
We use that $e^z \le 1 + z + z^2$ for $-1 \le z \le 1$. So we have $e^{-\eta \tilde\ell^{(t)}_i} \le 1 - \eta \tilde\ell^{(t)}_i + \left(\eta \tilde\ell^{(t)}_i\right)^2$ because $0 \le \eta \tilde\ell^{(t)}_i \le 1$. Furthermore, note that we can write $w^{(t)}_i = W^{(t)} p^{(t)}_i$ to get
$$W^{(t+1)} \le \sum_{i=1}^n w^{(t)}_i \left(1 - \eta \tilde\ell^{(t)}_i + \left(\eta \tilde\ell^{(t)}_i\right)^2\right) = \sum_{i=1}^n w^{(t)}_i - \sum_{i=1}^n w^{(t)}_i\, \eta \tilde\ell^{(t)}_i + \sum_{i=1}^n w^{(t)}_i \left(\eta \tilde\ell^{(t)}_i\right)^2 = W^{(t)} \left(1 - \eta \sum_{i=1}^n p^{(t)}_i \tilde\ell^{(t)}_i + \eta^2 \sum_{i=1}^n p^{(t)}_i \left(\tilde\ell^{(t)}_i\right)^2\right).$$
Repeatedly applying this bound and using that $W^{(1)} = n$, we get
$$W^{(T+1)} \le n \prod_{t=1}^T \left(1 - \eta \sum_{i=1}^n p^{(t)}_i \tilde\ell^{(t)}_i + \eta^2 \sum_{i=1}^n p^{(t)}_i \left(\tilde\ell^{(t)}_i\right)^2\right).$$
Again, we take the logarithm to get
$$\ln W^{(T+1)} \le \ln n + \sum_{t=1}^T \ln\left(1 - \eta \sum_{i=1}^n p^{(t)}_i \tilde\ell^{(t)}_i + \eta^2 \sum_{i=1}^n p^{(t)}_i \left(\tilde\ell^{(t)}_i\right)^2\right).$$
We use that $\ln(1+z) \le z$ for all $z > -1$ to simplify this expression to
$$\ln W^{(T+1)} \le \ln n - \eta \sum_{t=1}^T \sum_{i=1}^n p^{(t)}_i \tilde\ell^{(t)}_i + \eta^2 \sum_{t=1}^T \sum_{i=1}^n p^{(t)}_i \left(\tilde\ell^{(t)}_i\right)^2.$$
Combining the two bounds on $\ln W^{(T+1)}$, we get
$$-\eta \min_i \sum_{t=1}^T \tilde\ell^{(t)}_i \le \ln n - \eta \sum_{t=1}^T \sum_{i=1}^n p^{(t)}_i \tilde\ell^{(t)}_i + \eta^2 \sum_{t=1}^T \sum_{i=1}^n p^{(t)}_i \left(\tilde\ell^{(t)}_i\right)^2,$$
which is equivalent to the claim after rearranging and dividing by $\eta$.
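Lemma 18.1 can also be checked empirically. The sketch below (my own code, not part of the lecture) runs the multiplicative-weights update with $\rho = 1$ on random fake costs in $[0, 1/\eta]$ and evaluates both sides of the inequality:

```python
import math
import random

random.seed(7)
n, T, eta = 5, 200, 0.1
# random fake costs satisfying 0 <= fake[t][i] <= 1/eta
fake = [[random.uniform(0, 1 / eta) for _ in range(n)] for _ in range(T)]

w = [1.0] * n
lhs = 0.0
for ell in fake:
    W = sum(w)
    p = [wi / W for wi in w]
    lin = sum(pi * li for pi, li in zip(p, ell))      # sum_i p_i * fake_i
    sq = sum(pi * li * li for pi, li in zip(p, ell))  # sum_i p_i * fake_i^2
    lhs += lin - eta * sq
    w = [wi * math.exp(-eta * li) for wi, li in zip(w, ell)]  # rho = 1 update

# best single arm's total fake cost, plus ln(n)/eta
rhs = min(sum(fake[t][i] for t in range(T)) for i in range(n)) + math.log(n) / eta
# Lemma 18.1 asserts lhs <= rhs
```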
4 Analysis of Exp3
Based on Lemma 18.1, the remaining analysis of Exp3 works almost the same way as the one for the basic algorithm.
Theorem 18.2. If $\eta \le \frac{\gamma}{n}$, Exp3 has expected cost at most
$$\min_i \sum_{t=1}^T \ell^{(t)}_i + \frac{\ln n}{\eta} + \eta n T + \gamma T.$$
Proof. Once again, we first fix $I_1, \dots, I_T$ arbitrarily. This also fixes $\tilde\ell^{(1)}, \dots, \tilde\ell^{(T)}$, which are fed into the multiplicative-weights part; this way $p^{(1)}, \dots, p^{(T)}$ are fixed as well. So we can invoke Lemma 18.1: since $q^{(t)}_i \ge \frac{\gamma}{n}$, we have $\tilde\ell^{(t)}_i \le \frac{n}{\gamma} \le \frac{1}{\eta}$ by the assumption $\eta \le \frac{\gamma}{n}$. Replacing $q^{(t)}_i = (1-\gamma) p^{(t)}_i + \frac{\gamma}{n}$, we have
$$\sum_{t=1}^T \sum_{i=1}^n q^{(t)}_i \tilde\ell^{(t)}_i = \sum_{t=1}^T \sum_{i=1}^n \left((1-\gamma) p^{(t)}_i + \frac{\gamma}{n}\right) \tilde\ell^{(t)}_i = (1-\gamma) \sum_{t=1}^T \sum_{i=1}^n p^{(t)}_i \tilde\ell^{(t)}_i + \frac{\gamma}{n} \sum_{t=1}^T \sum_{i=1}^n \tilde\ell^{(t)}_i$$
$$\le (1-\gamma) \left(\min_i \sum_{t=1}^T \tilde\ell^{(t)}_i + \frac{\ln n}{\eta} + \eta \sum_{t=1}^T \sum_{i=1}^n p^{(t)}_i \left(\tilde\ell^{(t)}_i\right)^2\right) + \frac{\gamma}{n} \sum_{t=1}^T \sum_{i=1}^n \tilde\ell^{(t)}_i$$
$$\le \min_i \sum_{t=1}^T \tilde\ell^{(t)}_i + \frac{\ln n}{\eta} + \eta \sum_{t=1}^T \sum_{i=1}^n q^{(t)}_i \left(\tilde\ell^{(t)}_i\right)^2 + \frac{\gamma}{n} \sum_{t=1}^T \sum_{i=1}^n \tilde\ell^{(t)}_i,$$
where the last step uses $(1-\gamma) \le 1$ and $(1-\gamma) p^{(t)}_i \le q^{(t)}_i$.
Next, we consider how the values $\tilde\ell^{(t)}_i$ are derived from the $\ell^{(t)}_i$. To this end, keep $I_1, \dots, I_{t-1}$ fixed. Like in the analysis of our black-box transformation, we have
$$\mathbf{E}\left[\tilde\ell^{(t)}_i \,\middle|\, I_1, \dots, I_{t-1}\right] = \Pr[I_t = i \mid I_1, \dots, I_{t-1}] \cdot \frac{\ell^{(t)}_i}{q^{(t)}_i} + \Pr[I_t \neq i \mid I_1, \dots, I_{t-1}] \cdot 0 = \ell^{(t)}_i.$$
So, also
$$\mathbf{E}\left[\sum_{i=1}^n q^{(t)}_i \tilde\ell^{(t)}_i\right] = \sum_{i=1}^n \mathbf{E}\left[q^{(t)}_i \tilde\ell^{(t)}_i\right] = \sum_{i=1}^n \mathbf{E}\left[q^{(t)}_i\right] \ell^{(t)}_i = \mathbf{E}\left[\ell^{(t)}_{I_t}\right].$$
Now, we also have quadratic terms. For these, we can derive
$$\mathbf{E}\left[\left(\tilde\ell^{(t)}_i\right)^2 \,\middle|\, I_1, \dots, I_{t-1}\right] = \Pr[I_t = i \mid I_1, \dots, I_{t-1}] \left(\frac{\ell^{(t)}_i}{q^{(t)}_i}\right)^2 + \Pr[I_t \neq i \mid I_1, \dots, I_{t-1}] \cdot 0 = \frac{\left(\ell^{(t)}_i\right)^2}{q^{(t)}_i}.$$
This gives us, for any choice of $I_1, \dots, I_{t-1}$,
$$\mathbf{E}\left[\sum_{i=1}^n q^{(t)}_i \left(\tilde\ell^{(t)}_i\right)^2 \,\middle|\, I_1, \dots, I_{t-1}\right] = \sum_{i=1}^n q^{(t)}_i \frac{\left(\ell^{(t)}_i\right)^2}{q^{(t)}_i} = \sum_{i=1}^n \left(\ell^{(t)}_i\right)^2.$$
As the right-hand side is independent of $I_1, \dots, I_{t-1}$, this identity also holds for the unconditional expectation:
$$\mathbf{E}\left[\sum_{i=1}^n q^{(t)}_i \left(\tilde\ell^{(t)}_i\right)^2\right] = \sum_{i=1}^n \left(\ell^{(t)}_i\right)^2.$$
Taking the expectation over the bound from the multiplicative-weights part, we get
$$\mathbf{E}\left[\sum_{t=1}^T \ell^{(t)}_{I_t}\right] \le \mathbf{E}\left[\min_i \sum_{t=1}^T \tilde\ell^{(t)}_i\right] + \frac{\ln n}{\eta} + \eta \sum_{t=1}^T \sum_{i=1}^n \mathbf{E}\left[q^{(t)}_i \left(\tilde\ell^{(t)}_i\right)^2\right] + \frac{\gamma}{n} \sum_{t=1}^T \sum_{i=1}^n \mathbf{E}\left[\tilde\ell^{(t)}_i\right].$$
Inserting the above identities, and using that $\mathbf{E}\left[\min_i \sum_{t=1}^T \tilde\ell^{(t)}_i\right] \le \min_i \mathbf{E}\left[\sum_{t=1}^T \tilde\ell^{(t)}_i\right] = \min_i \sum_{t=1}^T \ell^{(t)}_i$, this implies
$$\mathbf{E}\left[\sum_{t=1}^T \ell^{(t)}_{I_t}\right] \le \min_i \sum_{t=1}^T \ell^{(t)}_i + \frac{\ln n}{\eta} + \eta \sum_{t=1}^T \sum_{i=1}^n \left(\ell^{(t)}_i\right)^2 + \frac{\gamma}{n} \sum_{t=1}^T \sum_{i=1}^n \ell^{(t)}_i.$$
Finally, we use that $\ell^{(t)}_i \le 1$ for all $i$ and $t$. This lets us bound each double sum by $nT$. (This is not too wasteful because they are multiplied by $\eta$ or $\frac{\gamma}{n}$, which are small.) Therefore
$$\mathbf{E}\left[\sum_{t=1}^T \ell^{(t)}_{I_t}\right] \le \min_i \sum_{t=1}^T \ell^{(t)}_i + \frac{\ln n}{\eta} + \eta n T + \gamma T.$$
Corollary 18.3. Setting $\eta = \sqrt{\frac{\ln n}{nT}}$ and $\gamma = n\eta$, the external regret of Exp3 is at most $3\sqrt{nT \ln n}$.
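Putting everything together, Exp3 with exactly these parameters can be sketched as follows. The toy cost sequence at the bottom is my own easy test instance (not a worst case), used only to compare the empirical regret against the bound:

```python
import math
import random

def exp3(costs):
    """Exp3 with the parameters of Corollary 18.3 (and rho = 1)."""
    T, n = len(costs), len(costs[0])
    eta = math.sqrt(math.log(n) / (n * T))
    gamma = n * eta
    w = [1.0] * n
    total = 0.0
    for cost in costs:
        W = sum(w)
        q = [(1 - gamma) * wi / W + gamma / n for wi in w]
        I = random.choices(range(n), weights=q)[0]
        total += cost[I]
        w[I] *= math.exp(-eta * cost[I] / q[I])  # aggressive update, rho = 1
    return total

random.seed(1)
T, n = 20_000, 2
costs = [[0.3, 0.7]] * T  # arm 0 is clearly better
best = min(sum(c[i] for c in costs) for i in range(n))
regret = exp3(costs) - best
bound = 3 * math.sqrt(n * T * math.log(n))  # roughly 500 for these values
```

On such an easy instance the empirical regret should stay well below the worst-case guarantee.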
5 Lower Bound on the Regret
Theorem 18.4. Even for $n = 2$, no algorithm guarantees external regret $o(\sqrt{T})$.
Proof. Let $T$ be an even square number. We generate a random sequence $\ell^{(1)}, \dots, \ell^{(T)}$: for each $t$, we set $\ell^{(t)}$ independently to $(1,0)$ or to $(0,1)$ with probability $1/2$ each. Observe that in each step, no matter how the algorithm chooses its probabilities, its expected cost will be $1/2$. So $\mathbf{E}\left[L^{(T)}_{\mathrm{Alg}}\right] = T/2$, where the expectation is also over the randomization of the sequence.
We have to compare this to $\mathbf{E}\left[\min_i L^{(T)}_i\right]$. We will show that $\mathbf{E}\left[\min_i L^{(T)}_i\right] = T/2 - \Omega(\sqrt{T})$. Note that $L^{(T)}_1$ and $L^{(T)}_2$ are identically distributed, namely according to a binomial distribution with parameters $T$ and $1/2$. So each is distributed like the number of times we see heads in $T$ independent fair coin tosses.
Furthermore, $L^{(T)}_1 + L^{(T)}_2 = T$. So $\min_i L^{(T)}_i$ never exceeds $T/2$. Therefore, we can write
$$\mathbf{E}\left[\min_i L^{(T)}_i\right] \le \Pr\left[\min_i L^{(T)}_i < \frac{T}{2} - \alpha\sqrt{T}\right] \left(\frac{T}{2} - \alpha\sqrt{T}\right) + \Pr\left[\min_i L^{(T)}_i \ge \frac{T}{2} - \alpha\sqrt{T}\right] \frac{T}{2}$$
$$\le \frac{T}{2} - \alpha\sqrt{T} + \alpha\sqrt{T}\, \Pr\left[\min_i L^{(T)}_i \ge \frac{T}{2} - \alpha\sqrt{T}\right].$$
We have $\min_i L^{(T)}_i \ge \frac{T}{2} - \alpha\sqrt{T}$ if and only if $\frac{T}{2} - \alpha\sqrt{T} \le L^{(T)}_1 \le \frac{T}{2} + \alpha\sqrt{T}$, so
$$\Pr\left[\min_i L^{(T)}_i \ge \frac{T}{2} - \alpha\sqrt{T}\right] = \Pr\left[\frac{T}{2} - \alpha\sqrt{T} \le L^{(T)}_1 \le \frac{T}{2} + \alpha\sqrt{T}\right].$$
We have to show that $L^{(T)}_1$ is not always close to its expectation (which is $T/2$). Pictorially, we have to show that the gray area carries at least a constant probability.

[Figure: the distribution of $L^{(T)}_1$, with the region outside the interval $\left[\frac{T}{2} - \alpha\sqrt{T},\ \frac{T}{2} + \alpha\sqrt{T}\right]$ shaded gray.]
As $L^{(T)}_1$ is binomially distributed, we have
$$\Pr\left[\frac{T}{2} - \alpha\sqrt{T} \le L^{(T)}_1 \le \frac{T}{2} + \alpha\sqrt{T}\right] = \sum_{j = T/2 - \alpha\sqrt{T}}^{T/2 + \alpha\sqrt{T}} \Pr\left[L^{(T)}_1 = j\right] \qquad\text{and}\qquad \Pr\left[L^{(T)}_1 = j\right] = \frac{1}{2^T} \binom{T}{j}.$$
We have to bound the binomial coefficient. We can do this using Stirling's approximation, which says $\sqrt{2\pi}\, k^{k+\frac{1}{2}} e^{-k} \le k! \le e\, k^{k+\frac{1}{2}} e^{-k}$ for all $k$. This gives us $\binom{T}{T/2} \le \frac{e}{\pi} \cdot \frac{2^T}{\sqrt{T}}$. Using the monotonicity of binomial coefficients, we have $\binom{T}{j} \le \frac{e}{\pi} \cdot \frac{2^T}{\sqrt{T}}$ for all $j$. So
$$\Pr\left[L^{(T)}_1 = j\right] = \frac{1}{2^T} \binom{T}{j} \le \frac{e}{\pi} \cdot \frac{1}{\sqrt{T}}$$
and therefore
$$\Pr\left[\min_i L^{(T)}_i \ge \frac{T}{2} - \alpha\sqrt{T}\right] \le 2\alpha\sqrt{T} \cdot \frac{e}{\pi} \cdot \frac{1}{\sqrt{T}} = \frac{2\alpha e}{\pi}$$
and also
$$\mathbf{E}\left[\min_i L^{(T)}_i\right] \le \frac{T}{2} - \alpha\sqrt{T} + \alpha\sqrt{T} \cdot \frac{2\alpha e}{\pi}.$$
Using, for example, $\alpha = \frac{1}{2}$, we get $\mathbf{E}\left[\min_i L^{(T)}_i\right] \le \frac{T}{2} - \frac{1}{2}\left(1 - \frac{e}{\pi}\right)\sqrt{T} \le \frac{T}{2} - 0.06\sqrt{T}$.
6 Reference
Peter Auer, Nicolò Cesa-Bianchi, Yoav Freund, and Robert E. Schapire. The Nonstochastic Multiarmed Bandit Problem. SIAM Journal on Computing 32(1), January 2003, 48–77.