No-Regret Learning: Multi-Armed Bandits 2

Instructor: Thomas Kesselheim

1 Last Lectures

In the last lecture, we turned the Multiplicative Weights algorithm from the lecture before into one that works with bandit feedback.

We can choose from $n$ actions in every step. An adversary determines the sequence of cost vectors $\ell^{(1)}, \dots, \ell^{(T)}$ in advance, with $\ell^{(t)}_i \in [0,1]$. The sequence is unknown to the algorithm. In step $t$, the algorithm chooses one of the $n$ actions at random by defining probabilities $p^{(t)}_1, \dots, p^{(t)}_n$. The algorithm's choice in step $t$ is denoted by $I_t$. The algorithm gets to know $\ell^{(t)}_{I_t}$. The other entries of the cost vector remain unknown.

We used the Multiplicative Weights algorithm in a way that we could reuse the regret bound by computing "fake costs" $\tilde\ell^{(t)}_i$. The final combined algorithm then looks as follows, using $\gamma$, $\eta$, and $\rho$ as parameters.

• Initially, set $w^{(1)}_i = 1$, $p^{(1)}_i = \frac{1}{n}$, for every $i \in [n]$.

• At every time $t$:

  – Define $q^{(t)}_i = (1-\gamma)\, p^{(t)}_i + \frac{\gamma}{n}$.
  – Choose $I_t$ based on $q^{(t)}$.
  – Define $\tilde\ell^{(t)}_{I_t} = \ell^{(t)}_{I_t} / q^{(t)}_{I_t}$ and $\tilde\ell^{(t)}_i = 0$ for $i \neq I_t$.
  – Multiplicative-Weights Update:
    ∗ Set $w^{(t+1)}_i = w^{(t)}_i \cdot \exp\left(-\eta \frac{1}{\rho} \tilde\ell^{(t)}_i\right)$
    ∗ $W^{(t+1)} = \sum_{i=1}^{n} w^{(t+1)}_i$
    ∗ $p^{(t+1)}_i = w^{(t+1)}_i / W^{(t+1)}$
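As a minimal sketch (function and variable names are ours, not from the lecture), one round of this combined algorithm can be written in Python as follows:

```python
import math
import random

def bandit_mw_round(w, ell, gamma, eta, rho):
    """One round of the combined bandit algorithm above.

    w   : current weights w_i^(t)
    ell : the full cost vector l^(t); only ell[It] is actually observed
    Returns the updated weights and the cost incurred in this round.
    """
    n = len(w)
    W = sum(w)
    p = [wi / W for wi in w]
    # q_i = (1 - gamma) * p_i + gamma / n  (uniform exploration mixed in)
    q = [(1 - gamma) * pi + gamma / n for pi in p]
    It = random.choices(range(n), weights=q)[0]
    # Fake cost: non-zero only for the chosen action, scaled up by 1/q_It
    fake = [0.0] * n
    fake[It] = ell[It] / q[It]
    # Multiplicative-weights update with exp(-eta * fake_i / rho)
    w_new = [wi * math.exp(-eta * fi / rho) for wi, fi in zip(w, fake)]
    return w_new, ell[It]

random.seed(0)
w = [1.0] * 3
for ell in ([0.9, 0.1, 0.5], [0.8, 0.2, 0.4]):
    w, cost = bandit_mw_round(w, ell, gamma=0.1, eta=0.05, rho=3 / 0.1)
```

Since the exponent is never positive, the weights only shrink; normalizing by $W^{(t)}$ recovers the probabilities $p^{(t)}$.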

We set $\gamma = \sqrt[3]{\frac{n \ln n}{T}}$, $\eta = \ln \frac{1}{1-\gamma}$, and $\rho = \frac{n}{\gamma}$ to get a regret bound of $3 (n \ln n)^{1/3} T^{2/3}$. Note that we use the weight update $w^{(t+1)}_i = w^{(t)}_i \cdot \exp\left(-\eta \frac{1}{\rho} \tilde\ell^{(t)}_i\right)$ instead of $w^{(t+1)}_i = w^{(t)}_i \cdot (1-\eta)^{\frac{1}{\rho} \tilde\ell^{(t)}_i}$, which is only a different parameterization.

2 The Exp3 Algorithm

There is a way to improve the regret guarantee to $O\left(\sqrt{nT \log n}\right)$, which we will get to know today. The algorithm is called Exp3, which stands for "Explore and Exploit with Exponential Weights". And, in fact, we already know the algorithm. It is exactly the one listed above, but with a smarter choice of parameters and a more careful analysis.

Our original analysis of the multiplicative-weights update could only deal with cost vectors such that $0 \le \tilde\ell^{(t)}_i \le \rho$. Now, a single entry $\tilde\ell^{(t)}_i$ can be as large as $\frac{n}{\gamma}$. This is why we chose $\rho = \frac{n}{\gamma}$. Exp3 instead sets $\rho = 1$. This means the update step is much more aggressive than with our previous parameter choice. The vague idea to keep in mind for why this is reasonable is that $\tilde\ell^{(t)}_i = 0$ most of the time: the fake cost is only non-zero for the action that has just been chosen.

The other parameters, γ and η, will be determined later.

3 A Refined Bound of the Multiplicative-Weights Update

The key to proving the regret guarantee of Exp3 is a more careful analysis of the multiplicative-weights update, now allowing $\tilde\ell^{(t)}_i > 1$ despite setting $\rho = 1$. We can show the following bound.

Lemma 18.1. Fix $\tilde\ell^{(1)}, \dots, \tilde\ell^{(T)}$ arbitrarily such that $0 \le \tilde\ell^{(t)}_i \le \frac{1}{\eta}$ for all $i$ and $t$. Then the vectors $p^{(1)}, \dots, p^{(T)}$ computed by the multiplicative-weights update (with $\rho = 1$) fulfill

$$\sum_{t=1}^{T} \sum_{i=1}^{n} p^{(t)}_i \tilde\ell^{(t)}_i \;-\; \eta \sum_{t=1}^{T} \sum_{i=1}^{n} p^{(t)}_i \left(\tilde\ell^{(t)}_i\right)^2 \;\le\; \min_i \sum_{t=1}^{T} \tilde\ell^{(t)}_i + \frac{\ln n}{\eta} \,.$$

Proof. We prove this bound in a very similar way to our original analysis of the multiplicative-weights algorithm. We again use the sum of the weights to lower-bound any expert's cost as well as to upper-bound the algorithm's cost. For the lower bound, we use that for all experts $i$

$$W^{(T+1)} \ge w^{(T+1)}_i = \exp\left(-\eta \sum_{t=1}^{T} \tilde\ell^{(t)}_i\right) .$$

Taking the logarithm, this is equivalent to

$$\ln W^{(T+1)} \ge -\eta \min_i \sum_{t=1}^{T} \tilde\ell^{(t)}_i \,.$$

For the upper bound, we consider the weight changes in step $t$. We have

$$W^{(t+1)} = \sum_{i=1}^{n} w^{(t)}_i e^{-\eta \tilde\ell^{(t)}_i} .$$

We use that $e^z \le 1 + z + z^2$ for $-1 \le z \le 1$. So, we have $e^{-\eta \tilde\ell^{(t)}_i} \le 1 - \eta \tilde\ell^{(t)}_i + \left(\eta \tilde\ell^{(t)}_i\right)^2$ because $0 \le \eta \tilde\ell^{(t)}_i \le 1$. Furthermore, note that we can write $w^{(t)}_i = W^{(t)} p^{(t)}_i$ to get

$$W^{(t+1)} \le \sum_{i=1}^{n} w^{(t)}_i \left(1 - \eta \tilde\ell^{(t)}_i + \left(\eta \tilde\ell^{(t)}_i\right)^2\right) = \sum_{i=1}^{n} w^{(t)}_i - \eta \sum_{i=1}^{n} w^{(t)}_i \tilde\ell^{(t)}_i + \eta^2 \sum_{i=1}^{n} w^{(t)}_i \left(\tilde\ell^{(t)}_i\right)^2 = W^{(t)} \left(1 - \eta \sum_{i=1}^{n} p^{(t)}_i \tilde\ell^{(t)}_i + \eta^2 \sum_{i=1}^{n} p^{(t)}_i \left(\tilde\ell^{(t)}_i\right)^2\right) .$$

Repeatedly applying this bound and using that $W^{(1)} = n$, we get

$$W^{(T+1)} \le n \prod_{t=1}^{T} \left(1 - \eta \sum_{i=1}^{n} p^{(t)}_i \tilde\ell^{(t)}_i + \eta^2 \sum_{i=1}^{n} p^{(t)}_i \left(\tilde\ell^{(t)}_i\right)^2\right) .$$


Again, we take the logarithm to get

$$\ln W^{(T+1)} \le \ln n + \sum_{t=1}^{T} \ln\left(1 - \eta \sum_{i=1}^{n} p^{(t)}_i \tilde\ell^{(t)}_i + \eta^2 \sum_{i=1}^{n} p^{(t)}_i \left(\tilde\ell^{(t)}_i\right)^2\right) .$$

We use that $\ln(1+z) \le z$ for all $z > -1$ to simplify this expression to

$$\ln W^{(T+1)} \le \ln n - \eta \sum_{t=1}^{T} \sum_{i=1}^{n} p^{(t)}_i \tilde\ell^{(t)}_i + \eta^2 \sum_{t=1}^{T} \sum_{i=1}^{n} p^{(t)}_i \left(\tilde\ell^{(t)}_i\right)^2 .$$

Combining the two bounds on $\ln W^{(T+1)}$, we get

$$-\eta \min_i \sum_{t=1}^{T} \tilde\ell^{(t)}_i \le \ln n - \eta \sum_{t=1}^{T} \sum_{i=1}^{n} p^{(t)}_i \tilde\ell^{(t)}_i + \eta^2 \sum_{t=1}^{T} \sum_{i=1}^{n} p^{(t)}_i \left(\tilde\ell^{(t)}_i\right)^2 ,$$

which is equivalent to the claim.
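Lemma 18.1 can be sanity-checked numerically. The sketch below (the helper name `mw_check` is ours) runs the multiplicative-weights update with $\rho = 1$ on random fake costs in $[0, 1/\eta]$ and compares both sides of the bound:

```python
import math
import random

def mw_check(fake_costs, eta):
    """Run the multiplicative-weights update (rho = 1) on a fixed
    sequence of fake cost vectors; return both sides of Lemma 18.1."""
    n = len(fake_costs[0])
    w = [1.0] * n
    lhs = 0.0
    for ell in fake_costs:
        W = sum(w)
        p = [wi / W for wi in w]
        # accumulate sum_i p_i * l_i  -  eta * sum_i p_i * l_i^2
        lhs += sum(pi * li for pi, li in zip(p, ell))
        lhs -= eta * sum(pi * li * li for pi, li in zip(p, ell))
        w = [wi * math.exp(-eta * li) for wi, li in zip(w, ell)]
    rhs = min(sum(ell[i] for ell in fake_costs) for i in range(n))
    rhs += math.log(n) / eta
    return lhs, rhs

random.seed(1)
eta = 0.1
# random fake costs respecting 0 <= cost <= 1/eta
costs = [[random.uniform(0, 1 / eta) for _ in range(4)] for _ in range(200)]
lhs, rhs = mw_check(costs, eta)
assert lhs <= rhs
```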

4 Analysis of Exp3

Based on Lemma 18.1, the remaining analysis of Exp3 works almost the same way as the one for the basic algorithm.

Theorem 18.2. If $\eta \le \frac{\gamma}{n}$, Exp3 has expected cost at most

$$\min_i \sum_{t=1}^{T} \ell^{(t)}_i + \frac{\ln n}{\eta} + \eta n T + \gamma T \,.$$

Proof. Once again, we first fix $I_1, \dots, I_T$ arbitrarily. This also fixes $\tilde\ell^{(1)}, \dots, \tilde\ell^{(T)}$, which are fed into the multiplicative-weights part, and this way $p^{(1)}, \dots, p^{(T)}$ are fixed as well. So, we can invoke Lemma 18.1; its condition $\tilde\ell^{(t)}_i \le \frac{1}{\eta}$ holds because $\tilde\ell^{(t)}_i \le \frac{n}{\gamma} \le \frac{1}{\eta}$. Replacing $q^{(t)}_i = (1-\gamma)\, p^{(t)}_i + \frac{\gamma}{n}$, we have

$$\sum_{t=1}^{T} \sum_{i=1}^{n} q^{(t)}_i \tilde\ell^{(t)}_i = \sum_{t=1}^{T} \sum_{i=1}^{n} \left((1-\gamma)\, p^{(t)}_i + \frac{\gamma}{n}\right) \tilde\ell^{(t)}_i = (1-\gamma) \sum_{t=1}^{T} \sum_{i=1}^{n} p^{(t)}_i \tilde\ell^{(t)}_i + \frac{\gamma}{n} \sum_{t=1}^{T} \sum_{i=1}^{n} \tilde\ell^{(t)}_i$$

$$\le (1-\gamma) \left(\min_i \sum_{t=1}^{T} \tilde\ell^{(t)}_i + \frac{\ln n}{\eta} + \eta \sum_{t=1}^{T} \sum_{i=1}^{n} p^{(t)}_i \left(\tilde\ell^{(t)}_i\right)^2\right) + \frac{\gamma}{n} \sum_{t=1}^{T} \sum_{i=1}^{n} \tilde\ell^{(t)}_i$$

$$\le \min_i \sum_{t=1}^{T} \tilde\ell^{(t)}_i + \frac{\ln n}{\eta} + \eta \sum_{t=1}^{T} \sum_{i=1}^{n} q^{(t)}_i \left(\tilde\ell^{(t)}_i\right)^2 + \frac{\gamma}{n} \sum_{t=1}^{T} \sum_{i=1}^{n} \tilde\ell^{(t)}_i \,.$$

Here, the last step drops the factor $(1-\gamma) \le 1$ in front of the non-negative terms and uses that $(1-\gamma)\, p^{(t)}_i \le q^{(t)}_i$.

Next, we consider how the values $\tilde\ell^{(t)}_i$ are derived from the $\ell^{(t)}_i$. To this end, keep $I_1, \dots, I_{t-1}$ fixed. Like in the analysis of our black-box transformation, we have

$$\mathbf{E}\left[\tilde\ell^{(t)}_i \,\middle|\, I_1, \dots, I_{t-1}\right] = \Pr[I_t = i \mid I_1, \dots, I_{t-1}] \cdot \frac{\ell^{(t)}_i}{q^{(t)}_i} + \Pr[I_t \ne i \mid I_1, \dots, I_{t-1}] \cdot 0 = q^{(t)}_i \cdot \frac{\ell^{(t)}_i}{q^{(t)}_i} = \ell^{(t)}_i \,.$$


So, also

$$\mathbf{E}\left[\sum_{i=1}^{n} q^{(t)}_i \tilde\ell^{(t)}_i\right] = \sum_{i=1}^{n} \mathbf{E}\left[q^{(t)}_i \tilde\ell^{(t)}_i\right] = \sum_{i=1}^{n} \mathbf{E}\left[q^{(t)}_i\right] \ell^{(t)}_i = \mathbf{E}\left[\ell^{(t)}_{I_t}\right] .$$

Now, we also have quadratic terms. For these, we can derive

$$\mathbf{E}\left[\left(\tilde\ell^{(t)}_i\right)^2 \,\middle|\, I_1, \dots, I_{t-1}\right] = \Pr[I_t = i \mid I_1, \dots, I_{t-1}] \left(\frac{\ell^{(t)}_i}{q^{(t)}_i}\right)^2 + \Pr[I_t \ne i \mid I_1, \dots, I_{t-1}] \cdot 0 = \frac{\left(\ell^{(t)}_i\right)^2}{q^{(t)}_i} \,.$$

This gives us, for any choice of $I_1, \dots, I_{t-1}$,

$$\mathbf{E}\left[\sum_{i=1}^{n} q^{(t)}_i \left(\tilde\ell^{(t)}_i\right)^2 \,\middle|\, I_1, \dots, I_{t-1}\right] = \sum_{i=1}^{n} q^{(t)}_i \frac{\left(\ell^{(t)}_i\right)^2}{q^{(t)}_i} = \sum_{i=1}^{n} \left(\ell^{(t)}_i\right)^2 .$$

As the right-hand side is independent of $I_1, \dots, I_{t-1}$, this identity also holds for the unconditional expectation

$$\mathbf{E}\left[\sum_{i=1}^{n} q^{(t)}_i \left(\tilde\ell^{(t)}_i\right)^2\right] = \sum_{i=1}^{n} \left(\ell^{(t)}_i\right)^2 .$$
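Both expectation identities can be checked by Monte-Carlo simulation. In the sketch below, the sampling distribution `q` and the cost vector `ell` are arbitrary illustrative values, not from the lecture:

```python
import random

random.seed(42)
q = [0.5, 0.3, 0.2]    # an arbitrary fixed sampling distribution q^(t)
ell = [0.8, 0.1, 0.6]  # an arbitrary true cost vector l^(t)
trials = 100_000

sum_fake = [0.0, 0.0, 0.0]
sum_quad = 0.0
for _ in range(trials):
    It = random.choices(range(3), weights=q)[0]
    fake = [0.0, 0.0, 0.0]
    fake[It] = ell[It] / q[It]          # importance-weighted fake cost
    for i in range(3):
        sum_fake[i] += fake[i]
    sum_quad += sum(q[i] * fake[i] ** 2 for i in range(3))

est_mean = [s / trials for s in sum_fake]   # should approach ell
est_quad = sum_quad / trials                # should approach sum_i ell_i^2
assert all(abs(m - l) < 0.02 for m, l in zip(est_mean, ell))
assert abs(est_quad - sum(l * l for l in ell)) < 0.05
```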

Taking the expectation over the bound from the multiplicative-weights part, we get

$$\mathbf{E}\left[\sum_{t=1}^{T} \ell^{(t)}_{I_t}\right] \le \mathbf{E}\left[\min_i \sum_{t=1}^{T} \tilde\ell^{(t)}_i\right] + \frac{\ln n}{\eta} + \eta \sum_{t=1}^{T} \sum_{i=1}^{n} \mathbf{E}\left[q^{(t)}_i \left(\tilde\ell^{(t)}_i\right)^2\right] + \frac{\gamma}{n} \sum_{t=1}^{T} \sum_{i=1}^{n} \mathbf{E}\left[\tilde\ell^{(t)}_i\right] .$$

Inserting the above identities and using that $\mathbf{E}\left[\min_i \sum_t \tilde\ell^{(t)}_i\right] \le \min_i \mathbf{E}\left[\sum_t \tilde\ell^{(t)}_i\right] = \min_i \sum_t \ell^{(t)}_i$, this implies

$$\mathbf{E}\left[\sum_{t=1}^{T} \ell^{(t)}_{I_t}\right] \le \min_i \sum_{t=1}^{T} \ell^{(t)}_i + \frac{\ln n}{\eta} + \eta \sum_{t=1}^{T} \sum_{i=1}^{n} \left(\ell^{(t)}_i\right)^2 + \frac{\gamma}{n} \sum_{t=1}^{T} \sum_{i=1}^{n} \ell^{(t)}_i \,.$$

Finally, we use that $\ell^{(t)}_i \le 1$ for all $i$ and $t$. This lets us bound the double sums by $nT$. (This is not too wasteful because they are multiplied by $\eta$ or $\frac{\gamma}{n}$, which are small.) Therefore

$$\mathbf{E}\left[\sum_{t=1}^{T} \ell^{(t)}_{I_t}\right] \le \min_i \sum_{t=1}^{T} \ell^{(t)}_i + \frac{\ln n}{\eta} + \eta n T + \gamma T \,.$$

Corollary 18.3. Setting $\eta = \sqrt{\frac{\ln n}{nT}}$ and $\gamma = n\eta$, the external regret of Exp3 is at most $3\sqrt{nT \ln n}$.
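As an illustrative sketch (against an i.i.d. uniform cost sequence, not the adversarial worst case; names are ours), the following runs Exp3 with the parameters of Corollary 18.3 and checks that the empirical regret stays below $3\sqrt{nT \ln n}$:

```python
import math
import random

def exp3(costs, gamma, eta):
    """Run Exp3 (the algorithm above with rho = 1) on a fixed
    cost sequence; return the total cost incurred."""
    n = len(costs[0])
    w = [1.0] * n
    total = 0.0
    for ell in costs:
        W = sum(w)
        p = [wi / W for wi in w]
        q = [(1 - gamma) * pi + gamma / n for pi in p]
        It = random.choices(range(n), weights=q)[0]
        total += ell[It]
        # fake cost for the chosen action only; update with rho = 1
        w[It] *= math.exp(-eta * ell[It] / q[It])
    return total

random.seed(3)
n, T = 4, 20_000
eta = math.sqrt(math.log(n) / (n * T))   # Corollary 18.3
gamma = n * eta
costs = [[random.random() for _ in range(n)] for _ in range(T)]
best = min(sum(ell[i] for ell in costs) for i in range(n))
regret = exp3(costs, gamma, eta) - best
assert regret <= 3 * math.sqrt(n * T * math.log(n))
```

On such a benign stochastic sequence the empirical regret is typically far below the worst-case bound.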

5 Lower Bound on the Regret

Theorem 18.4. Even for $n = 2$, no algorithm guarantees external regret $o(\sqrt{T})$.

Proof. Let $T$ be an even square number. We generate a random sequence $\ell^{(1)}, \dots, \ell^{(T)}$. For each $t$, we set $\ell^{(t)}$ independently to $(1,0)$ or to $(0,1)$ with probability $1/2$ each. Observe that in each step, no matter how the algorithm chooses the probabilities, its expected cost will be $1/2$. So $\mathbf{E}\left[L^{(T)}_{\mathrm{Alg}}\right] = T/2$, where the expectation is also over the randomization of the sequence.

We have to compare this to $\mathbf{E}\left[\min_i L^{(T)}_i\right]$. We will show that $\mathbf{E}\left[\min_i L^{(T)}_i\right] = T/2 - \Omega(\sqrt{T})$.

Note that $L^{(T)}_1$ and $L^{(T)}_2$ are identically distributed, namely according to a binomial distribution with parameters $T$ and $1/2$. So they are the number of times we see heads in $T$ independent fair coin tosses.

Furthermore, $L^{(T)}_1 + L^{(T)}_2 = T$. So, $\min_i L^{(T)}_i$ never exceeds $T/2$. Therefore, we can write

$$\mathbf{E}\left[\min_i L^{(T)}_i\right] \le \Pr\left[\min_i L^{(T)}_i < \frac{T}{2} - \alpha\sqrt{T}\right] \left(\frac{T}{2} - \alpha\sqrt{T}\right) + \Pr\left[\min_i L^{(T)}_i \ge \frac{T}{2} - \alpha\sqrt{T}\right] \frac{T}{2} \le \frac{T}{2} - \alpha\sqrt{T} + \alpha\sqrt{T}\, \Pr\left[\min_i L^{(T)}_i \ge \frac{T}{2} - \alpha\sqrt{T}\right] .$$

We have $\min_i L^{(T)}_i \ge \frac{T}{2} - \alpha\sqrt{T}$ if and only if $\frac{T}{2} - \alpha\sqrt{T} \le L^{(T)}_1 \le \frac{T}{2} + \alpha\sqrt{T}$, so

$$\Pr\left[\min_i L^{(T)}_i \ge \frac{T}{2} - \alpha\sqrt{T}\right] = \Pr\left[\frac{T}{2} - \alpha\sqrt{T} \le L^{(T)}_1 \le \frac{T}{2} + \alpha\sqrt{T}\right] .$$

We have to show that $L^{(T)}_1$ is not always close to its expectation (which is $T/2$). Pictorially, we have to show that $L^{(T)}_1$ falls outside the window $\left[\frac{T}{2} - \alpha\sqrt{T},\, \frac{T}{2} + \alpha\sqrt{T}\right]$ with at least constant probability.

As $L^{(T)}_1$ is binomially distributed, we have

$$\Pr\left[\frac{T}{2} - \alpha\sqrt{T} \le L^{(T)}_1 \le \frac{T}{2} + \alpha\sqrt{T}\right] = \sum_{j = T/2 - \alpha\sqrt{T}}^{T/2 + \alpha\sqrt{T}} \Pr\left[L^{(T)}_1 = j\right] \quad\text{and}\quad \Pr\left[L^{(T)}_1 = j\right] = \frac{1}{2^T} \binom{T}{j} .$$

We have to bound the binomial coefficient. We can do this using Stirling's approximation, which says $\sqrt{2\pi}\, k^{k+\frac{1}{2}} e^{-k} \le k! \le e\, k^{k+\frac{1}{2}} e^{-k}$ for all $k$. This gives us $\binom{T}{T/2} \le \frac{e}{\pi} \frac{2^T}{\sqrt{T}}$. Using the monotonicity of binomial coefficients, we have $\binom{T}{j} \le \frac{e}{\pi} \frac{2^T}{\sqrt{T}}$ for all $j$. So

$$\Pr\left[L^{(T)}_1 = j\right] = \frac{1}{2^T} \binom{T}{j} \le \frac{e}{\pi} \frac{1}{\sqrt{T}}$$

and therefore

$$\Pr\left[\min_i L^{(T)}_i \ge \frac{T}{2} - \alpha\sqrt{T}\right] \le 2\alpha\sqrt{T} \cdot \frac{e}{\pi} \frac{1}{\sqrt{T}} = \frac{2\alpha e}{\pi}$$

and also

$$\mathbf{E}\left[\min_i L^{(T)}_i\right] \le \frac{T}{2} - \alpha\sqrt{T} + \alpha\sqrt{T} \cdot \frac{2\alpha e}{\pi} \,.$$

Using, for example, $\alpha = \frac{1}{2}$, we get $\mathbf{E}\left[\min_i L^{(T)}_i\right] \le \frac{T}{2} - \frac{1}{2}\left(1 - \frac{e}{\pi}\right)\sqrt{T} \le \frac{T}{2} - 0.06\sqrt{T}$.
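The bound $\mathbf{E}\left[\min_i L^{(T)}_i\right] \le T/2 - 0.06\sqrt{T}$ can be checked empirically for a moderate $T$; this sketch (parameters are ours) averages over independently drawn cost sequences:

```python
import math
import random

random.seed(7)
T = 2500        # an even square number, as in the proof
runs = 1000
total = 0
for _ in range(runs):
    # L1 ~ Binomial(T, 1/2); the cost sequence forces L2 = T - L1
    L1 = sum(random.getrandbits(1) for _ in range(T))
    total += min(L1, T - L1)

avg_min = total / runs
# the empirical mean of min_i L_i should fall below T/2 - 0.06 sqrt(T)
assert avg_min <= T / 2 - 0.06 * math.sqrt(T)
```

The empirical gap is in fact closer to $0.4\sqrt{T}$, which matches the normal approximation $\mathbf{E}\left|L^{(T)}_1 - T/2\right| \approx \sqrt{T/(2\pi)}$, so the constant $0.06$ is conservative.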

6 Reference

Peter Auer, Nicolò Cesa-Bianchi, Yoav Freund, and Robert E. Schapire. 2003. The Nonstochastic Multiarmed Bandit Problem. SIAM J. Comput. 32, 1 (January 2003), 48-77.
