
9. Emphasizing the “positive” in positive reinforcement: using nonbinary rewarding for

9.5 Results

9.5.2 Example 2: Improving feedback and increasing alertness


fixation window of 2.7° radius (Fig. 26A, left). Starting with session 16, we introduced an outer fixation window of 6.7° radius to reestablish proper performance by reducing the overall number of errors. The monkey received the same amount of reward as for staying within the inner fixation window, such that the effective reward scheme was still binary (Fig. 26B, middle). With these very moderate requirements, the monkey performed more reliably and significantly increased its number of hits per session (Fig. 26, C and D, middle). In ~60% of the trials, it managed to stay within the inner fixation window. However, with the uniform task requirements during sessions 16-29, M2 did not improve fixation accuracy and still made many saccades, although no distracting objects were presented. Because of the monkey’s very low error tolerance, we used NB-PRT to avoid an increase in eye errors following more rigid requirements for fixation accuracy, and started to differentiate the amount of reward the monkey was given for successful trials with either precise or imprecise fixation (Fig. 26, A and B, right). This reward-based guidance toward precise fixation improved fixation accuracy quickly and allowed us to slowly decrease the outer fixation window without affecting the monkey’s error rate (Fig. 26D, right). At the end of the training block, the monkey performed more than 80% of all trials with precise fixation.


requiring him to respond as fast as possible. With a binary reward scheme, this would be achieved by reducing the length of the response time window. Consequently, a fraction of previously successful responses will become future errors. In the example shown in Fig. 27A, reducing the maximal response time from 750 to 450 ms would label ~9% of the trials as misses. Hence, the monkey must perform many trials to arrive at one “teacher trial,” i.e., a trial with a previously rewarded but now unrewarded behavior that provides essential information for the animal to adapt its responses. Increasing the number of teacher trials for faster learning (by requiring responses within an even shorter time) is obligatorily linked to a higher number of errors. For animals with low error tolerance, this strategy may counteract the monkey’s willingness to perform the task if errors exceed a critical number.

Fig. 27: Reaction time (RT)-dependent rewarding. A: empirical RT distribution for speed-change detection with binary reward, averaged over 2 days before introduction of RT-dependent rewarding. Shaded horizontal bar indicates future errors if the response window is limited to 450 ms. Dashed gray lines indicate response time windows as chosen in B. VOT, valve opening time (reward amount); gray arrow, median RT. B: empirical RT distribution averaged over 2 days following a 1-session introduction of RT-dependent rewarding. Shaded horizontal bars indicate VOT for fast, medium, and slow RTs. C: ex-Gaussian fits of RT distributions from 4 sessions before introduction of RT-dependent rewarding and 4 sessions afterward. Inset shows µ (mean) and σ (variability) values of single sessions. Bar plots at right show the mean µ and σ values of the Gaussian component and the mean τ (skew) of the exponential component before (open bars) and after (solid bars) introduction of RT-dependent rewarding. D-F: RT distributions during binary and nonbinary reward regimes. In D, the nonbinary schedule was chosen to label around one-third of all trials each as fast, medium, and slow and associate them with high, medium, and low reward, respectively. The schedule was calculated to provide the same amount of reward per 100 trials as during the binary schedule if the monkey kept its RT distribution the same. In E and F, edges to define RT fractions were chosen to better separate fast from slow trials, to allow the monkeys to associate different behavior with different reward amounts. Insets show the difference in the percentage of trials in each of the 3 RT fractions for the nonbinary compared with the binary schedule. Data in A-C were derived from M1, data in D and E from M3, and data in F from M2. All error bars are SD.

As already shown in example 1, NB-PRT allows emphasis on the positive, to-be-facilitated behavior, the fast response. Rather than relying on the failure to respond in time, providing a large reward for every fast response results in a considerably higher number of teacher trials.

For example, 25% of all trials shown in Fig. 27A had an RT of < 350 ms. To speed up response times, we associated such trials with a high reward, whereas trials with an RT of > 440 ms received a small reward. Trials with an RT in between received a medium reward, yet slightly less than in previous training sessions, to make the reward difference between fast and medium RTs clearly distinguishable. This new regime was highly effective and caused a decrease in the median RT from 380 to 350 ms within a single training session. Figure 27B shows the RT distribution of the following 2 sessions, revealing a total of 45% successful responses being faster than 350 ms. Because we did not shorten the absolute length of the response period, the rate of misses stayed at 0% during all sessions, including the one during which RT-dependent reward was introduced. For statistical analysis, we compared the RT distributions from before and after the new reward regime by using a Mann-Whitney U-test. We found a significant shift toward faster RT within the RT-dependent reward regime (Z = 11.028; P < 10^-27, n = [922, 1024]), with an effect size of R = 0.25. Based on U values, the Mann-Whitney statistics indicate an estimated probability of 64.4% for getting a faster response when applying RT-dependent rewarding.
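The effect size and the estimated probability reported above can be related to Z and the two sample sizes. The following is a minimal sketch, not the authors' analysis code. It assumes the common conventions R = Z / sqrt(N) and EP = U / (n1 · n2), recovers U from Z with the normal approximation, and ignores tie corrections. The function names are hypothetical.

```python
import math


def effect_size_r(z, n1, n2):
    """Effect size R = Z / sqrt(N), with N the total number of trials (assumed convention)."""
    return z / math.sqrt(n1 + n2)


def estimated_probability(z, n1, n2):
    """Estimate EP = U / (n1 * n2) from Z.

    Uses the normal approximation of the Mann-Whitney U statistic
    without tie correction, so the result is only approximate.
    """
    mean_u = n1 * n2 / 2.0
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    u = mean_u + z * sd_u
    return u / (n1 * n2)


# Values reported in the text for the comparison before vs. after the new regime.
r = effect_size_r(11.028, 922, 1024)
ep = estimated_probability(11.028, 922, 1024)
print(f"R = {r:.2f}, EP = {ep:.3f}")
```

Under these assumptions, the sketch reproduces the reported R = 0.25 and gives an estimated probability close to the reported 64.4%. Small deviations are expected because the original analysis may have handled ties differently.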

Additionally, we fitted ex-Gaussian probability density functions [Heathcote et al. 1991] to the RT distributions of four sessions each before and after the RT-dependent reward was introduced. An ex-Gaussian is the convolution of a Gaussian and an exponential distribution. It separately fits the Gaussian part (with parameters µ and σ) and the exponential part (with parameter τ) of the skewed RT distribution, which allowed us to investigate which parameters of the distributions were affected by the change in reward regime. Pooled over all RTs, these fits illustrate a clear leftward shift of the RT distribution after introduction of the RT-dependent reward (Fig. 27C). The mean of the Gaussian component [fitting the left (fast) part of the skewed RT distribution] decreased from 348 to 324 ms (Wilcoxon rank sum test, Z = 2.16, P = 0.03, n = 4), whereas the mean of the exponential component [fitting the right (slow) part of the skewed distribution] was unchanged (τ_binary: 41 ms; τ_nonbinary: 41 ms; Z = 0, P = 1). Thus, emphasizing the desired, fast response by RT-dependent rewarding, as an alternative to training down the undesired, slow response by reward rejection, caused an essentially instantaneous and lasting acceleration of response times.

To be effective, NB-PRT depends on a well-justified reward regime. It should allow the animal to associate different amounts of reward with clearly different behaviors. To illustrate this more directly, we performed a simple fixation task with monkey M3 under two different nonbinary regimes.

M3 had extensive experience with the task as well as with NB-PRT. Each condition consisted of five sessions (Monday to Friday), where sessions 1 and 2 were used to familiarize the monkey with the current schedule, and sessions 3-5 were used for data analysis. Both conditions were preceded by a block of sessions with a binary schedule, following the same outline, to provide reference RT distributions.
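For reference, the ex-Gaussian density used for the fits in Fig. 27C can be written out explicitly. The sketch below is an illustration in plain Python, not the fitting procedure used in the study. µ = 348 ms and τ = 41 ms are taken from the text, whereas σ = 30 ms is an assumed value chosen only for the example. The helper names are hypothetical.

```python
import math


def exgauss_pdf(x, mu, sigma, tau):
    """Density of a Gaussian(mu, sigma) convolved with an exponential of mean tau."""
    exponent = sigma ** 2 / (2.0 * tau ** 2) + (mu - x) / tau
    z = (mu + sigma ** 2 / tau - x) / (math.sqrt(2.0) * sigma)
    # exp() can overflow for x far below mu; the range used below avoids this.
    return math.exp(exponent) * math.erfc(z) / (2.0 * tau)


def integrate(f, lo, hi, step):
    """Trapezoidal rule on an equally spaced grid."""
    n = int(round((hi - lo) / step))
    total = 0.5 * (f(lo) + f(hi))
    for i in range(1, n):
        total += f(lo + i * step)
    return total * step


mu, sigma, tau = 348.0, 30.0, 41.0  # sigma is an assumed value, not from the text

area = integrate(lambda x: exgauss_pdf(x, mu, sigma, tau), 100.0, 1500.0, 0.5)
mean_rt = integrate(lambda x: x * exgauss_pdf(x, mu, sigma, tau), 100.0, 1500.0, 0.5)
print(f"area = {area:.4f}, mean RT = {mean_rt:.1f} ms")
```

The numerical mean of the density equals µ + τ, which shows why a change in µ alone shifts the whole distribution while an unchanged τ leaves the slow tail's shape intact.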

We first tested an NB-PRT regime in which about one-third of trials each were associated with high, medium, and low reward. To this end, we first gathered the reference RT distribution using a binary schedule. The monkey was given a medium reward for every correct response, whereas error trials resulted in immediate trial termination. With the binary regime, M3 yielded 98% correct responses (including eye errors) and responded faster than 400 ms in 96% of the trials (Fig. 27D). For the following NB-PRT block, we divided the RT distribution into three fractions of ~33% of trials with fast, medium, and slow RT, and calculated the reward schedule so as to provide the same amount of reward per 100 trials as during binary rewarding. Yet, because of the skewed RT distribution, the fraction of trials with medium RT consisted of only 2 bins, separating fast from slow responses by just 20 ms. This makes it hard (if not impossible) to associate behavior and reward, and leaves many trials with a reasonably quick response with low reward. As a consequence, M3 performed significantly worse with this schedule. The ratio of fast trials did not change, but many more trials were performed with slow RT (Fig. 27D). Thus, instead of keeping its previous RT distribution (and thus keeping the same amount of reward per 100 trials) or compensating for the low-reward trials by increasing the ratio of fast trials (and thus increasing the absolute reward amount), the monkey responded very slowly in many trials and obtained less reward per 100 trials than under the binary schedule.
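The idea of keeping the reward per 100 trials constant can be illustrated with a short calculation. This is a sketch under stated assumptions, not the schedule actually used: the 3:2:1 ratio of high, medium, and low reward, the binary reward of 1.0 (arbitrary units), and the shifted RT fractions are hypothetical values chosen for illustration.

```python
def equal_payoff_rewards(fractions, relative_rewards, binary_reward):
    """Scale relative reward sizes so the mean reward per trial matches the binary schedule."""
    expected = sum(f * r for f, r in zip(fractions, relative_rewards))
    scale = binary_reward / expected
    return [r * scale for r in relative_rewards]


def reward_per_100(fractions, rewards):
    """Expected reward over 100 correct trials for given RT fractions."""
    return 100.0 * sum(f * r for f, r in zip(fractions, rewards))


# One-third of trials each labeled fast, medium, slow (as in the text).
thirds = [1 / 3, 1 / 3, 1 / 3]
rewards = equal_payoff_rewards(thirds, [3.0, 2.0, 1.0], binary_reward=1.0)

# Unchanged RT distribution: same payoff as with binary rewarding.
unchanged = reward_per_100(thirds, rewards)

# Hypothetical shift toward slow responses: total payoff drops.
shifted = reward_per_100([0.33, 0.17, 0.50], rewards)
print(rewards, unchanged, shifted)
```

With these assumed numbers, an unchanged RT distribution yields the same total reward as the binary schedule, whereas a shift toward slow responses lowers it, which matches the pattern described for M3.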

We then tested the effect of NB-PRT when RT fractions were chosen more purposefully. We first switched back to binary rewarding and reestablished the previous RT distribution (Fig. 27E). We then applied an NB-PRT regime that only provided a low reward for RTs longer than 400 ms. Trials with RTs faster than 300 ms received a high reward, and all other RTs were rewarded as during the preceding block of binary rewarding. This regime is much more likely to allow the monkey to distinguish the association between different behaviors and different reward amounts. Note, however, that only 15.1% of trials were treated differently than during the preceding session with binary rewarding, whereas all other trials received exactly the same reward as before. Nevertheless, this NB-PRT regime had a strong effect on the monkey’s performance and caused a clear leftward shift of the entire RT distribution (Fig. 27E). Based on Mann-Whitney statistics (Z = 7.32; P < 10^-12, n = [1,352, 1,658], R = 0.13), the estimated probability EP for getting a faster response in the NB-PRT regime was 0.58.

We applied a similar regime to M2 (Fig. 27F) after it had finished fixation training and obtained essentially the same result (Z = 6.77; P < 10^-10, n = [851, 1,198], R = 0.15, EP = 0.59). Thus, even with a simple fixation paradigm, graded rewarding exerted clear effects on RT distributions, indicating that the trial-wise reward outcome has a direct impact on an animal’s performance. If properly chosen, graded rewarding can support the willingness of the animal to spend effort and provides a useful tool to guide the animal toward the desired behavior.
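The RT-dependent regimes described in this section share one rule: two RT edges split correct trials into three reward classes. The sketch below illustrates this rule with the 300 and 400 ms edges from the M3 example. The function names, the handling of RTs that fall exactly on an edge, and the example RTs are assumptions for illustration only.

```python
def reward_class(rt_ms, fast_edge_ms=300.0, slow_edge_ms=400.0):
    """Map a reaction time to a reward class (edge handling is an assumption)."""
    if rt_ms < fast_edge_ms:
        return "high"
    if rt_ms > slow_edge_ms:
        return "low"
    return "medium"  # rewarded as in the binary schedule


def fraction_changed(rts_ms, fast_edge_ms=300.0, slow_edge_ms=400.0):
    """Fraction of trials whose reward differs from the binary (medium) reward."""
    changed = sum(
        1 for rt in rts_ms
        if reward_class(rt, fast_edge_ms, slow_edge_ms) != "medium"
    )
    return changed / len(rts_ms)


# Hypothetical RTs in ms, not data from the study.
example_rts = [280, 310, 330, 350, 360, 370, 390, 410, 340, 320]
print([reward_class(rt) for rt in example_rts])
print(fraction_changed(example_rts))
```

Applied to a session's RTs, `fraction_changed` gives the share of trials that are rewarded differently than under binary rewarding, the quantity reported as 15.1% for M3.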