9. Emphasizing the “positive” in positive reinforcement: using nonbinary rewarding for

9.6 Discussion

behaviors, training procedures get more difficult and time consuming and may face unforeseen pitfalls, potentially causing additional delays.

The method we describe is a modification of the traditional reward schedule used for PRT. It was partly motivated by recent studies showing that monkeys integrate information about reward probabilities to bias their choices in free-choice paradigms [Feng et al. 2009; Kubanek & Snyder 2015; Rorie et al. 2010] and quickly learn selection rules based on stimulus-reward associations [Gaffan et al. 2002; Lennert & Martinez-Trujillo 2011]. NB-PRT relies on this natural expertise and uses it for interacting with the animal. NB-PRT can be applied both for training a new task and for maintaining high performance levels when learning has finished. The present study reports the potential of this approach for different training situations, ranging from simple to complex, in three individuals. This report attempts to share the experiences gathered with this approach but does not provide a systematic comparison of the pros and cons of binary vs. nonbinary reward schedules. Because of the many factors influencing success and progress in training (trainer’s experience, animal’s experience, temperament, age, social rank, task requirements, etc.), a systematic comparison needs a large cohort of animals to average out such differences, which is beyond the resources of our laboratory. However, this is also beyond the conclusions we want to draw: because of the many factors influencing the training of an individual, there is no single true approach to training an animal, but rather the requirement to design the training schedule to optimally meet the individual needs of the animal. NB-PRT, therefore, should be thought of as an additional instrument in the PRT toolbox that might be considered alongside established protocols to allow for optimal training progress.

9.6.1 Potential and benefits of NB-PRT

On the basis of the experience reported in this article, we identified three general hallmarks of NB-PRT for laboratory training of both simple and complex cognitive tasks. The first is that NB-PRT provides more differentiated feedback to the monkey. During trial-and-error learning, monkeys learn which behaviors to avoid in order to prevent trial termination and withholding of reward. This learning of behavioral errors is critical for the progression of the training and its overall success [Sutton & Barto 1998]. Yet, training of complex tasks involves introduction of new error sources at some critical steps. For the animal, such unfamiliar errors easily cause confusion about the current task rules, not only with regard to the new task component but also with respect to already established behaviors (cf. Fig. 29B). NB-PRT provides the opportunity to introduce such new error sources in a soft way and to maintain the animal’s confidence in the general task rules. Instead of trial termination and withholding of reward, undesired behavior related to a new task rule is rewarded, but much less than the desired behavior (e.g., Results, Example 1: Preventing high error rates). This makes the feedback more distinct: previously learned error sources still cause trial termination, but the new error source is systematically “taught” to the animal based on small rewards. After the animal learns the desired behavior (by figuring out how to get the most reward), the reward scheme can eventually be reset by first fully withholding reward for the undesired behavior and subsequently associating it with trial termination, such that new reward-behavior associations may be defined for the next training step. The more informative feedback provided by NB-PRT may be illustrated by a metal detector used to search for a coin on a beach. With a binary device, a tone signal may indicate that the coin is somewhat close, but the absence of a signal gives no hint about where the coin is. Graded feedback provides this hint: the higher the tone, the closer the coin. It allows the target zone to be enlarged such that less time is spent outside of it, and within the target zone, the signal provides information about its center.
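The staged handling of a newly introduced error source described above can be sketched in a few lines. This is only an illustration under stated assumptions: the stage names, the `feedback` function, and the reward amounts are hypothetical and are not taken from the training software or the values used in the study.

```python
from enum import Enum


class Outcome(Enum):
    ESTABLISHED_ERROR = 1  # error source the animal has already learned
    NEW_ERROR = 2          # undesired behavior tied to the newly introduced rule
    DESIRED = 3            # behavior the current training step aims at


class Stage(Enum):
    SOFT = 1       # new error still earns a small reward
    NO_REWARD = 2  # new error earns nothing, but the trial is not terminated
    TERMINATE = 3  # new error now terminates the trial


# Hypothetical reward amounts in arbitrary units; not values from the study.
FULL_REWARD = 1.0
SMALL_REWARD = 0.2


def feedback(outcome: Outcome, stage: Stage) -> tuple[bool, float]:
    """Return (trial_terminated, reward) for a single trial."""
    if outcome is Outcome.DESIRED:
        return False, FULL_REWARD
    if outcome is Outcome.ESTABLISHED_ERROR:
        # Previously learned errors keep terminating the trial at every stage.
        return True, 0.0
    # From here on, the outcome is NEW_ERROR and handling depends on the stage.
    if stage is Stage.SOFT:
        return False, SMALL_REWARD
    if stage is Stage.NO_REWARD:
        return False, 0.0
    return True, 0.0
```

In this sketch, moving from `Stage.SOFT` through `Stage.NO_REWARD` to `Stage.TERMINATE` corresponds to resetting the reward scheme once the animal reliably shows the desired behavior.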

The second hallmark is an increase in the learning rate. Because, in the laboratory, trial-and-error learning with binary feedback cannot signal more than “right” and “wrong,” it is the unsuccessful trials that drive the animal to adapt its behavior. Yet, their relative number must not exceed a critical ratio, to prevent a loss in task confidence and in the overall willingness to cooperate. This causes a conflict, because keeping the number of errors within a critical range also limits the number of trials available for learning. With a graded reward, it is possible to put emphasis on correctly performed trials. By a careful choice of criteria to obtain, e.g., high, medium, and low reward, NB-PRT provides the animal with a larger number of informative trials (Results, Example 2) and guides it toward the desired behavior more purposefully than error-based learning.
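To make the contrast between binary and graded feedback concrete, the following sketch compares the two for correctly performed trials that differ only in reaction time (RT). The RT criteria, the reward amounts, and all function names are assumptions chosen for illustration and do not reproduce the criteria used in the study.

```python
# Hypothetical RT criteria (ms) and reward amounts (arbitrary units);
# not values from the study.
FAST_RT_MS = 300.0
MEDIUM_RT_MS = 450.0
HIGH, MEDIUM, LOW = 1.0, 0.6, 0.3


def binary_reward(correct: bool) -> float:
    """Binary schedule: every correct trial earns the same reward."""
    return HIGH if correct else 0.0


def graded_reward(correct: bool, rt_ms: float) -> float:
    """Graded schedule: correct trials are rewarded according to RT."""
    if not correct:
        return 0.0
    if rt_ms <= FAST_RT_MS:
        return HIGH
    if rt_ms <= MEDIUM_RT_MS:
        return MEDIUM
    return LOW


def distinct_feedback_levels(rewards: list[float]) -> int:
    """Count how many different reward amounts occur in a set of trials."""
    return len(set(rewards))


# Three correct trials with fast, medium, and slow RTs.
rts_ms = [250.0, 400.0, 600.0]
binary = [binary_reward(True) for _ in rts_ms]
graded = [graded_reward(True, rt) for rt in rts_ms]
```

Under these assumptions, the binary schedule returns a single feedback level for all three correct trials, whereas the graded schedule returns three, so each correct trial carries information about how close the behavior is to the desired one.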

The third hallmark is the introduction of a “gambling factor” for tasks of otherwise uniform structure that helps to increase the animal’s overall alertness (e.g., Fig. 27, E and F). As mentioned previously, in neuroscience research, monkeys are usually required to perform hundreds of consecutive trials. Even when the focus is on individuals that are very good performers in the laboratory, there is not much reason to believe that performing the same task again and again is constantly thrilling. NB-PRT provides the opportunity for the monkey to “win” a trial by rewarding very good performance more than medium performance. Associating the task outcome with the animal’s performance makes the reward less predictable and, as such, provides a lasting incentive for the animal to stay focused even when learning has finished.

9.6.2 Limitations and side aspects of NB-PRT

It is worth mentioning, however, that NB-PRT has limits and side aspects that need to be taken into account. First, depending on the research question, variable reward amounts during neuronal data acquisition may impair the interpretability of neuronal responses. For example, neurons may be modulated by cognitive processes and by reward value or reward expectancy at the same time [Gottlieb 2007; Leon & Shadlen 1999; Stǎnişor et al. 2013]. For meaningful data analysis, investigating the response of such neurons usually requires all factors but the one under examination to be fixed, making variable rewards an unfavorable condition. Second, animals may become satiated more quickly if they receive higher reward for the desired behavior, yielding fewer trials overall. Although in the current study we observed no such effect, or even the opposite, compared with the animal’s performance in a binary regime, we cannot exclude this possibility because of the small number of animals tested. Third, combination, adaptation, and gradation of reward amounts entail possible pitfalls. For example, if a reward bonus for precise fixation is combined with an RT-dependent reward schedule, fast RTs in trials with imprecise fixation potentially yield a reward similar to that of medium RTs in trials with precise fixation. This makes the feedback ambiguous again. Furthermore, frequent adaptation of criteria for high, medium, and low reward in an RT-dependent reward schedule to fix the ratio of high-reward trials likely constitutes a punishment for focused performance, because achieving higher rewards gets more and more difficult. Under such circumstances, monkeys may easily learn that mediocre performance makes a high reward more easily accessible than good performance. Similarly, if the choice of criteria for different reward amounts makes it hard or even impossible for the animal to associate them with different behaviors, NB-PRT may exert undesired effects (Fig. 27D).
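The ambiguity that arises when a fixation bonus is added to an RT-dependent schedule can be checked before a schedule is used. The sketch below enumerates all behavior combinations and reports reward totals reached by more than one of them. The reward amounts (in hypothetical integer units, e.g., drops) and all names are assumptions for illustration only and are not taken from the study.

```python
from collections import defaultdict
from itertools import product

# Hypothetical reward amounts in integer units; not values from the study.
RT_REWARD = {"fast": 3, "medium": 2, "slow": 1}
FIXATION_BONUS = {"precise": 1, "imprecise": 0}


def total_reward(rt_class: str, fixation: str) -> int:
    """Total reward when the fixation bonus is added to the RT-based reward."""
    return RT_REWARD[rt_class] + FIXATION_BONUS[fixation]


def ambiguous_totals() -> dict[int, list[tuple[str, str]]]:
    """Map each reward total to the behaviors producing it, keeping only
    totals that are reached by more than one behavior combination."""
    by_total = defaultdict(list)
    for rt_class, fixation in product(RT_REWARD, FIXATION_BONUS):
        by_total[total_reward(rt_class, fixation)].append((rt_class, fixation))
    return {total: combos for total, combos in by_total.items() if len(combos) > 1}
```

With these assumed amounts, a fast RT with imprecise fixation and a medium RT with precise fixation earn the same total, so the reward no longer tells the animal which of the two behaviors was better. A schedule whose `ambiguous_totals()` result is empty would avoid this particular pitfall, although it says nothing about whether the animal can actually discriminate the resulting amounts.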

Thus, like any other training approach, successful application of NB-PRT requires careful planning of the individual training steps and taking the perspective of the animal. When based on careful choice of parameters and daily data inspection, NB-PRT constitutes a powerful technique that provides additional options to guide the animal’s behavior. Because it is based on success rather than failure, it adds a motivating factor, helps to prevent animals from developing a very low or very high error tolerance, and constitutes a promising approach for refining laboratory methods for nonhuman primates.

Author contribution

D.W. conceived and designed research; B.F. and D.W. performed experiments; B.F. and D.W. analyzed data; B.F. and D.W. interpreted results of experiments; B.F. and D.W. prepared figures; D.W. drafted manuscript; B.F. and D.W. edited and revised manuscript; B.F. and D.W. approved final version of manuscript.

Grants

This work was supported by Deutsche Forschungsgemeinschaft Grants WE 5469/2-1 and WE 5469/3-1, a Zentrale Forschungsförderung grant from the University of Bremen, and a scholarship from the German Academic Scholarship Foundation.

Acknowledgments

We acknowledge support by Constantin Neagu, Sally Lott, Miriam Schillner, Peter Bujotzek, Ramazani Hakizimani, and Katrin Thoss regarding various aspects of the study. We thank Eric Drebitz and Malte Persike for comments on an earlier draft.

Disclosures

No conflicts of interest, financial or otherwise, are declared by the authors.