LEARNING AND DECISION MAKING
Dissertation
for the attainment of the academic degree of Doctor of Natural Sciences (Dr. rer. nat.)
at the Universität Konstanz
Faculty of Sciences, Department of Psychology
submitted by
Benjamin Ernst
in May 2013
Date of the oral examination: July 9, 2013
1st referee: Prof. Dr. Marco Steinhauser
2nd referee: Prof. Dr. Ronald Hübner
Konstanzer Online-Publikations-System (KOPS) URL: http://nbn-resolving.de/urn:nbn:de:bsz:352-243593
Acknowledgments
The first and greatest thanks naturally go to my supervisor, Prof. Marco Steinhauser. Without his supervision, support, and tolerance, this thesis would consist either of a blank sheet of paper or of 500 pages of bad English with the word "feedback" liberally sprinkled in.
I would also like to thank Prof. Ronald Hübner for his support and for reviewing this thesis. I hope the effort was not too great!
Further thanks go to my family and my colleagues, who stood by me with advice and assistance over the years: Michael Dambacher, Kai Robin Grzyb, Sabine Hügelschäfer, Martin Maier, Manuela Sirrenberg, and Lisa Töbel.
Special thanks are due, on the one hand, to Melanie Renner – trying to lose a bet can be motivating, too – and, on the other hand, to my long-time friend Maj-Britt Isberner – her sensible help proofed to invaluable and should not be overestimate [sic].
Last but not least, thanks go to Lisa Kübler, who, as a student assistant, contributed to the data collection with great motivation and organization.
Table of Contents
Zusammenfassung
Abstract
Table of Abbreviations
I. INTRODUCTION
I.1 Dual Process Models and Feedback Processing
I.2 Automatic Feedback Processing
I.2.1 The Reward Prediction Error and the Dopamine System
I.2.1.1 The Reward Prediction Error
I.2.1.2 The Anatomy of the Dopamine System
I.2.1.3 The Effect of Dopamine on the Neuronal Level
I.2.1.4 Reward Signaling by Dopamine Neurons
I.2.2 Models of Automatic Feedback Processing
I.2.2.1 A Model of Basal Ganglia Functioning – the BG-DA Model
I.2.2.2 The Reinforcement Learning Model by Holroyd & Coles (2002)
I.2.2.2.1 EEG correlates of ACC activity
I.2.3 Summary
I.3 Controlled Feedback Processing
I.3.1 Basal Ganglia and Working Memory
I.3.2 The Augmented BG-DA Model
I.3.3 Recent Support for the Contribution of Controlled Processes to Feedback Processing
I.3.4 Controlled Feedback Processing in Educational Settings
I.3.5 Summary
I.4 EEG Components Involved in Error and Feedback Processing
I.4.1 Basics of EEG Research
I.4.2 The FRN
I.4.2.1 Basic Characteristics of the FRN
I.4.2.2 The Neural Origin of the FRN
I.4.2.3 FRN Theories
I.4.2.4 FRN-Modulating Variables
I.4.2.4.1 The FRN as an indicator of feedback valence
I.4.2.4.2 The FRN's sensitivity to quantitative differences in reward
I.4.2.4.3 The FRN's sensitivity to feedback expectancy
I.4.2.4.4 Effect of medication on the FRN
I.4.2.4.5 Individual differences, psychopathology, and the FRN
I.4.2.5 The Association between the FRN and Behavioral Adjustment
I.4.2.6 Summary
I.4.3 The P300
I.4.3.1 Properties of the P300
I.4.3.2 Neural Origin of the P300
I.4.3.3 Determinants of the P300 Amplitude
I.4.3.4 Variables Predicted by Differences in the P300 Amplitude
I.4.3.5 P300 Theories
I.4.3.6 Summary of Theoretical Positions
I.4.3.7 The Feedback-P300
I.4.3.8 P300, Pe, and Conscious Awareness of Errors
I.4.3.9 Summary
I.5 Synopsis
I.5.1 Overview of the Studies
II. STUDIES
II.1 Study 1: Feedback-Related Brain Activity Predicts Learning from Feedback in Multiple-Choice Testing
II.1.1 Abstract
II.1.2 Introduction
II.1.2.1 ERPs Related to Memory Processing
II.1.2.2 ERPs Related to Feedback Processing
II.1.2.3 The Present Study
II.1.3 Method
II.1.3.1 Participants
II.1.3.2 Stimulus Material
II.1.3.3 Design and Procedure
II.1.3.4 Electrophysiological Recordings
II.1.3.5 Data Analysis
II.1.4 Results
II.1.4.1 Behavioral Data
II.1.4.2 Positive vs. Negative Feedback in Feedback-Learning Blocks
II.1.4.3 Predictors of Successful Learning in Feedback-Learning Blocks
II.1.5 Discussion
II.1.5.1 The Feedback-Locked P300
II.1.5.2 The Frontal Positivity
II.1.5.3 The FRN
II.1.5.4 Conclusion
II.2 Study 2: Effects of Invalid Feedback on Learning and Feedback-Related Brain Activity in Decision-Making
II.2.1 Abstract
II.2.2 Introduction
II.2.3 Method
II.2.3.1 Participants
II.2.3.2 Stimulus Material
II.2.3.3 Design and Procedure
II.2.3.4 Electrophysiological Recordings
II.2.3.5 Data Analysis
II.2.4 Results
II.2.4.1 Behavioral Data
II.2.4.2 Feedback-Locked ERP Data
II.2.4.2.1 Irrelevant feedback
II.2.4.2.2 Relevant feedback
II.2.5 Discussion
II.3 Study 3: The Effect of Feedback Validity and Feedback Reliability on Learning and Feedback-Related Brain Activity
II.3.1 Abstract
II.3.2 Introduction
II.3.3 Method
II.3.3.1 Participants
II.3.3.2 Stimulus Material
II.3.3.3 Design and Procedure
II.3.3.4 Electrophysiological Recordings
II.3.3.5 Data Analysis
II.3.4 Results
II.3.4.1 Behavioral Data
II.3.4.2 Feedback-Locked ERP Data
II.3.5 Discussion
II.3.5.1 Behavioral Results
II.3.5.2 FRN Results
II.3.5.3 Conclusion
III. GENERAL DISCUSSION
III.1 The Present Research
III.1.1 Overview
III.1.2 Findings
III.2 Implications of the Results for Existing Research
III.2.1 Implications for the Dual-Process Account
III.2.2 Implications for Research on Feedback-Related ERPs
III.2.2.1 Relevance for FRN Research
III.2.2.2 Relevance for P300 Research
III.2.3 The Effect of Feedback Valence on Performance
III.3 Outlook for Future Research
IV. REFERENCES
V. APPENDIX
V.1 Appendix A: Overview of Published FRN Studies
V.2 Appendix B: ERP Results of the Test Phase in Study 2
V.3 Appendix C: List of Figures
V.4 Appendix D: Contributions of the Authors
Zusammenfassung
Feedback is essential for adaptive behavior in decision-making situations. Recent research and models of the neurocognitive underpinnings of learning and decision making suggest that feedback processing should be viewed from a dual-process perspective, i.e., that information about action outcomes can be processed both in an automatic and in a controlled fashion. While attentional resources and working memory updating are important for the latter, reinforcement learning processes are central to the former. However, previous studies have focused primarily on the details of automatic feedback processing and the electrophysiological correlate of reinforcement learning, the feedback-related negativity (FRN), whereas only recently have studies addressed the details of controlled processes and the feedback-P300, which appears to be a correlate of controlled processing.
The present dissertation aims to deepen the understanding of controlled feedback processing and the effect of top-down processes on automatic feedback processing by focusing specifically on feedback-related event-related potentials (ERPs). To this end, ERP data were recorded while participants used feedback to identify and learn the correct stimulus out of a set of four (Study 1) or two stimuli (Studies 2 and 3). Study 1 established that in multiple-choice tasks the feedback-P300 and an anterior positivity predict learning from errors and corrective feedback, whereas the FRN does not. Study 2 showed that ambiguous feedback – i.e., feedback that also contains irrelevant and potentially invalid information – is detrimental to test performance. Moreover, the FRN following irrelevant feedback was unaffected by feedback valence, whereas the feedback-P300 following relevant feedback was attenuated when the irrelevant feedback was invalid. Finally, Study 3 showed that an FRN effect can only be observed when the feedback is likely to be valid, but not when it is more likely to be invalid. This demonstrates that top-down processes can influence automatic feedback processing in order to prevent learning from potentially invalid feedback.
Together, these findings support the dual-process perspective of feedback processing by demonstrating the independence of the two processes. Furthermore, they are useful for ERP research in showing that the FRN can be influenced in a top-down manner and in suggesting that the amplitude of the feedback-P300 is an indicator of the amount of information that a feedback stimulus provides.
Abstract
Feedback is vital for adaptive behavior in decision-making situations. Recent research and models of the neurocognitive underpinnings of learning and decision making suggest that feedback processing should be addressed from a dual-process perspective, i.e., information about action outcomes can be processed both in an automatic and in a controlled fashion. While attentional resources and working memory updating are essential for the latter, reinforcement learning processes are more central to the former. However, prior studies have mainly focused on the details of automatic feedback processing and an electrophysiological correlate of reinforcement learning, the feedback-related negativity (FRN; Miltner, Braun, & Coles, 1997), whereas only recently have studies addressed the details of controlled processes (e.g., Collins & Frank, 2012; Frank & Claus, 2006; Walsh & Anderson, 2011) and the feedback-P300, which appears to be a correlate of controlled processing (e.g., Chase, Swainson, Durham, Benham, & Cools, 2011; Sailer, Fischmeister, & Bauer, 2010; Yeung & Sanfey, 2004).
The present dissertation aims to further the understanding of controlled feedback processing and the effect of top-down processes on automatic feedback processing by specifically focusing on feedback-related event-related potentials (ERPs). To this end, ERP data were recorded while participants utilized feedback to identify and learn the correct stimulus out of four (Study 1) or two alternatives (Studies 2 and 3). Study 1 established that for multiple-choice tests, the feedback-P300 and an anterior positivity are predictive of learning from errors and corrective feedback, whereas the FRN is not. Study 2 showed that ambiguous feedback, i.e., feedback that also contains contradicting but irrelevant information, is detrimental to test performance. Moreover, the FRN following irrelevant feedback was unaffected by feedback valence, whereas the feedback-P300 following relevant feedback was attenuated when the irrelevant feedback information was invalid. Finally, Study 3 showed that an FRN effect can only be observed when feedback is likely to be valid, but not when it is more likely to be invalid. This shows that top-down processes can bias automatic feedback processes to avoid learning from potentially invalid feedback.
Together, these results support the dual-process perspective of feedback processing by showing the independence of both processes. Furthermore, these studies provide input for ERP research by showing that the FRN effect can be subject to top-down influences and by suggesting that the feedback-P300 amplitude might be an indicator of the information provided by a feedback stimulus.
Table of Abbreviations
The following abbreviations were used in this thesis:
ACC = anterior cingulate cortex
ADHD = attention deficit / hyperactivity disorder
ANOVA = analysis of variance
BG-DA theory = basal ganglia theory of dopaminergic function
COMT = catechol-O-methyltransferase
COVIS = COmpetition between Verbal and Implicit Systems
CRN = correct-related negativity
CS = conditioned stimulus
DA = dopamine
dACC = dorsal anterior cingulate cortex
DLPFC = dorsolateral prefrontal cortex
EEG = electroencephalography
ERN = error-related negativity
ERP = event-related potential
fMRI = functional magnetic resonance imaging
FRN = feedback-related negativity
GPe = globus pallidus externa
GPi = globus pallidus interna
LC = locus coeruleus
LTD = long-term depression
LTP = long-term potentiation
MEG = magnetoencephalography
µV = microvolt
NAcc = nucleus accumbens
OCD = obsessive-compulsive disorder
OFC = orbitofrontal cortex
PD = Parkinson's disease
Pe = error positivity
PFC = prefrontal cortex
rACC = rostral anterior cingulate cortex
RL theory = reinforcement learning theory (of the ERN/FRN)
SNpc = substantia nigra pars compacta
US = unconditioned stimulus
VTA = ventral tegmental area
I. INTRODUCTION
"A learning experience is one of those things that says, 'You know that thing you just did? Don't do that.’"
— Douglas Adams (The Salmon of Doubt: Hitchhiking the Galaxy One Last Time)
It is hard to overestimate the importance of feedback for learning and decision making.
Without any information about the appropriateness of a decision in a given context, there is complete uncertainty about whether behavior should be repeated or altered in the future. This essential knowledge of action outcomes can be acquired by considering feedback, i.e., information about the consequences of one's actions. Feedback needs to be processed, that is, the information inherent in the feedback stimulus has to be transformed into knowledge of action outcomes and used to adapt future behavior. However, the rather obvious fact that learning is often not perfect implies that feedback processing can be flawed in multiple ways. Consider the course of feedback processing and learning: Feedback has to be perceived and its meaning understood. Then, this information has to be linked to a previous decision or response, and both have to be encoded and stored into memory together. Finally, both decision and feedback have to be recalled and applied appropriately. At each of these individual steps, certain factors influence feedback processing. For instance, a feedback stimulus might be difficult to perceive on a physical level. It might be unclear whether feedback is in fact relevant or irrelevant and thus potentially misleading. Its message might be ambiguous or hard to understand, or additional information might be necessary to interpret it correctly. In addition, the person receiving the feedback might not have enough cognitive resources at that moment to sufficiently process the feedback, or he or she might not be motivated to do so. Individual characteristics of the person, as a state or trait, or characteristics of the situation might bias feedback processing, favoring positive over negative feedback or vice versa. Finally, the person might fail to correctly recall the feedback later or might erroneously link it to the wrong decision.
In order to improve learning from feedback it is not only necessary to know which factors are detrimental or beneficial to performance, but to gain insight into why these factors have these effects. For this, a thorough understanding of the underlying processes and their
interaction with different variables is crucial. It is indeed justified to speak of multiple processes, as there is evidence for two types of systems involved in learning from feedback: On the one hand, feedback can be processed in a controlled, serial, and resource-intensive fashion that strongly relies on working memory (Baddeley, 1986); on the other hand, it can be processed in an automatic and effortless fashion involving the reinforcement-learning system (e.g., Schultz, Dayan, & Montague, 1997). Past research has led to
sophisticated models of feedback processing that also include neuroanatomical,
neurochemical and neuropsychological details (e.g., Holroyd & Coles, 2002; Holroyd &
Yeung, 2011). However, the focus of this research, at least in the neuroscience community, has been on reinforcement-learning processes (e.g., Frank, Seeberger, & O'Reilly, 2004;
Schultz, 2002) while working memory-based processes have only recently received some attention (e.g., Collins & Frank, 2012, 2013; Frank & Claus, 2006; O'Reilly & Frank, 2006;
Walsh & Anderson, 2011).
This dissertation aims to investigate the role of feedback and errors in decision making and learning, with a special focus on learning and decision situations in which controlled processes are likely to be strongly involved (e.g., multiple-choice tests).¹ To this end, event-related potentials (ERPs) of the human electroencephalogram (EEG) will be considered which can be linked to controlled and automatic processes, respectively: the P300 (Donchin, 1981;
Donchin & Coles, 1988; Polich, 2007) and the feedback-related negativity (FRN, Miltner et al., 1997). By considering these ERP components under circumstances that might be
detrimental to feedback processing, we aimed to gain insight into the contribution of
controlled and automatic feedback processes to learning and decision making, as well as into the interaction of these two processes. Besides dual processes and ERPs, an additional focus will be on the processing of negative feedback and thus errors. This is because the strength of controlled feedback processing lies in learning from errors and swift adaptation, whereas automatic feedback processing might even be prone to learning of errors (Steinhauser, 2010; Steinhauser & Hübner, 2006, 2008). Thus, it is in the processing of corrective feedback to errors where the different contributions of both systems are most likely to be identified.
¹ It is necessary to remark that, besides its basic function as information about action outcomes, feedback has also been investigated in the organizational context, where feedback also has a more motivational function and is embedded in a social situation (for a review, see Kluger & DeNisi, 1996). However, these additional aspects of feedback are beyond the scope of this dissertation and will not be discussed in detail.
This dissertation presents three original studies on feedback processing and feedback- related brain activity. For a better understanding of the theoretical underpinnings of these studies, a rough outline of the dual-process account will be given in the first section of the following introduction, followed by a discussion of research on reinforcement learning-based feedback processing and controlled feedback processing. Finally, because EEG components can only be evaluated appropriately in the context of previous research, the
electrophysiological indicators relevant for automatic (FRN) and controlled feedback processing (P300) will be discussed.
I.1 Dual Process Models and Feedback Processing
The underlying framework of the research presented in this dissertation is that of a dual process account of feedback processing. Dual process theories posit that information is processed by two distinct systems that operate in a controlled or in an automatic fashion, respectively. In this context, controlled processing relies on attention, working memory, and conscious control and, as a consequence, it is resource-intensive and relatively slow. In contrast, automatic processing does not require cognitive control or resources and is thus efficient and fast. Dual-process theories can be found in such diverse fields of psychology as attention (Shiffrin & Schneider, 1977), learning (e.g., Ashby & Maddox, 2005; Sun, Slusarz,
& Terry, 2005) and memory formation (e.g., Squire, 2004), reasoning (Evans, 2003), persuasion (Petty & Cacioppo, 1986), impression formation and self-regulation (e.g., Bargh
& Chartrand, 1999). Although different terms are used in these theories for controlled (e.g., explicit, rule-based, deliberative, rational) and automatic processes and systems (e.g., implicit, reflexive, impulsive), both systems are commonly described as differing with respect to cognitive resource demands and flexibility. Accordingly, the processes are differently suited for certain tasks. For instance, for dual-processes in memory formation, Squire (2004) identified two different principles:
“In the case of declarative memory, an important principle is the ability to detect and encode what is unique about a single event, which by definition occurs at a particular time and place.
In the case of nondeclarative memory, an important principle is the ability to gradually extract the common elements from a series of separate events.” (Squire, 2004, p. 174)
Transferred to the topic of feedback processing, this implies that feedback can be used by the controlled process system to rapidly adjust behavior - however, this ability comes with the need for cognitive resources. In contrast, the automatic system does not depend on cognitive resources and uses the feedback from multiple learning opportunities to identify underlying commonalities resulting in behavioral preferences in a complex environment. In brief, this system can process feedback effortlessly and integrate feedback information over time;
however, it is less flexible with respect to abrupt changes in the environment.
While the evidence for the existence of these two processes is abundant (for an overview, see, e.g., Sanfey & Chang, 2008), several important research questions still remain unanswered, namely, whether and how both processes interact and how conflict is resolved when the outcomes of controlled and automatic processes differ and thus implicate conflicting behavior. While research on the role of feedback in category learning has recognized the interaction of an explicit reasoning system and an implicit learning system as an important aspect that contributes to learning performance (e.g., Ashby, Maddox, & Bohil, 2002; Ashby & O'Brien, 2007; Maddox, Ashby, & Bohil, 2003; Maddox, Ashby, Ing, & Pickering, 2004; Zeithamova & Ashby, 2006), resulting in a sophisticated model (COVIS, COmpetition between Verbal and Implicit Systems; Ashby & Alfonso-Reese, 1998; Ashby, Paul, & Maddox, 2011), past cognitive neuroscience research on decision making has focused mainly on the characteristics of the automatic system alone, which is in this case mostly equivalent to the reinforcement learning system (e.g., Schultz, 1998; R. S. Sutton & Barto, 1998). As a consequence, most studies favored reinforcement learning paradigms (e.g., guessing task, probabilistic learning task, time estimation task, gambling task) over explicit learning paradigms, neglected the contribution of controlled, working memory-based processes to feedback processing, and rarely accounted for conflict and interaction between both systems. However, recent years saw the emergence of models that acknowledge and integrate both processes (e.g., Frank & Claus, 2006; Holroyd & Yeung, 2011) and inspired a number of studies that rely on these models (e.g., Chase et al., 2011; Collins & Frank, 2012;
Doll, Hutchison, & Frank, 2011; Doll, Jacobs, Sanfey, & Frank, 2009; Moustafa, Sherman, &
Frank, 2008; Walsh & Anderson, 2011).
It is the main goal of the original studies reported in this dissertation to add to this still limited, but growing literature, by addressing specific issues that arise from the dual-process account: First, if there are different systems underlying feedback processing, then the question arises whether specific neural correlates (i.e., components of the human ERP) can be
identified that relate to each respective system. Second, while some decision making tasks are more suited for automatic processing, performance in others should strongly rely on
controlled processes. Accordingly, it is necessary to devise an experimental decision making paradigm that is suited for the investigation of controlled processes. Third, both processes might affect each other and research on category learning implies that the controlled system is more dominant (Maddox et al., 2004). It remains to be seen whether such top-down influence can also be shown in a learning and decision making task. Last, a crucial role of controlled processes might be error correction or even learning from errors (Anderson & Craik, 2006), whereas the automatic system appears to tend more towards learning of errors (Baddeley &
Wilson, 1994; Steinhauser, 2010; Steinhauser & Hübner, 2006, 2008). Accordingly, indices of controlled processing should be related to the quality of error correction and learning from negative feedback.
The following sections will present both systems and the underlying processes in more detail. Note that because of the focus of prior feedback research on reinforcement learning, the neuroscience literature on automatic processes is far more extensive compared to research on controlled feedback processing.
I.2 Automatic Feedback Processing
The neural basis of reinforcement learning – which is, essentially, the core of automatic feedback processing – has received great attention over the past decade and a half. The resulting research led to sophisticated models that are especially informative for the understanding of the role of errors and feedback in learning and decision making. A common ground of these models is the so-called prediction error and its relation to the dopamine system, which is why both will be discussed in detail in the following before two prominent models of learning and decision making are presented.
I.2.1 The Reward Prediction Error and the Dopamine System²
According to Thorndike’s Law of Effect (Thorndike, 1911) the association between a stimulus and a response is not established by mere repetition, but because “the consequences of a connection work back upon it to influence it” (Thorndike, 1927, p.215). “Connection”
can be interpreted as the connection between a stimulus and a response, but Thorndike already considered it as a connection between groups of neurons: “Connections between neurons are strengthened every time they are used with indifferent or pleasurable results and weakened every time they are used with resulting discomfort.” (Thorndike, 1907, p.166).
I.2.1.1 The Reward Prediction Error
The idea of 'pleasurable' and 'discomforting' results was already rejected by Skinner (1953)³ and later replaced by reward expectancy. Based on research on the blocking effect (Kamin, 1969), Rescorla and Wagner (1972) introduced the idea that learning depends not on the mere pairing of stimulus or action and reward, but on the extent to which a reinforcer was unpredicted and to which a stimulus or action is predictive of the reinforcer. As further
fleshed out in later theories (Schultz, 1998; R. S. Sutton & Barto, 1998), the degree to which a reward occurs unpredictably, the so-called prediction error, can inform a system to what extent associations between the cognitive representations of an event and the reward should be adjusted. When a reward is initially presented in a novel situation, its occurrence is unpredicted and thus the prediction error is large. Further, when a stimulus preceded this reward in this situation, the prediction error can be used to increase the association between said stimulus and the reward by a certain degree. The consequence of the newly established association is that the reoccurrence of the stimulus activates the representation of the reward – that is, the stimulus predicts the reward. The association, however, is initially not perfect and thus the reward is not fully predicted, which again leads to a prediction error, although to a lesser degree. Eventually, when stimulus and reward have been presented together consistently, the stimulus predicts the reward perfectly, the resulting prediction error is zero, and the association is not altered in any way. However, it is obviously adaptive to have a learning system that implements changes in contingency – for instance, that an event might no longer be predictive of a reward. Under circumstances where the reward fails to occur after the stimulus, the prediction error becomes negative and the association between stimulus and reward is reduced.

² Throughout this thesis, the term "dopaminergic system" will also be employed. Note that while the dopamine system refers to DA neurons themselves, the "dopaminergic system" also refers to brain regions that are affected by DA.

³ "If we then go on to say that a stimulus is reinforcing because it is pleasant, what purports to be an explanation in terms of two effects is in reality a redundant description of one" (Skinner, 1953, p. 82, italics in the original).
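Formally, the prediction-error dynamics just described are captured by the Rescorla-Wagner rule; in a common textbook notation (the symbols are generic, not taken from the original paper):

$$\Delta V = \alpha \beta \, (\lambda - V)$$

Here, V is the associative strength between stimulus and reward (the prediction), λ is the maximum association the reinforcer can support (λ = 0 when the reward is omitted), and α and β are salience and learning-rate parameters. The term (λ − V) is the prediction error: large for an unpredicted reward, zero once the reward is fully predicted, and negative when a predicted reward fails to occur.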
The concept of the prediction error proved to be very productive and was realized in neuronal network models as the so-called delta rule (Rumelhart, Hinton, & Williams, 1986; R. S. Sutton & Barto, 1981; Widrow & Hoff, 1960), where it represents the difference between a desired and an actual output and is used to adjust synaptic weights between nodes in the network. Over time, it was further refined in reinforcement learning algorithms that make use of the temporal difference error (R. S. Sutton & Barto, 1998), which also incorporates information about the exact timing of reinforcement, allowing for the integration of temporal discounting. In this model, it is assumed that a CS, i.e., a stimulus that was associated with the reward before, can act as a reinforcer itself and thereby instigate a prediction error. In this way, "the effective reinforcement signal moves backward in time from the primary reinforcer to the earliest reinforcer-predicting stimulus" (Schultz, 1998, p. 13), a fact that will be important in the later discussion of correlates of the prediction error in this thesis (Holroyd & Coles, 2002).
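To make this backward shift concrete, the following is a minimal sketch of temporal difference learning in Python; the task structure, variable names, and parameter values are illustrative assumptions for this example, not taken from any of the cited models.

```python
import numpy as np

# A minimal TD(0) sketch of how the prediction error migrates backward in
# time, from the reward to the earliest reward-predicting event (the CS).
n_steps, n_trials = 5, 60
alpha, gamma = 0.3, 0.9              # learning rate, temporal discount factor
V = np.zeros(n_steps + 1)            # value estimate per time step (terminal = 0)

for trial in range(n_trials):
    deltas = []
    for t in range(n_steps):
        r = 1.0 if t == n_steps - 1 else 0.0   # reward only at the final step
        delta = r + gamma * V[t + 1] - V[t]    # temporal difference error
        V[t] += alpha * delta
        deltas.append(abs(delta))
    if trial in (0, n_trials - 1):
        print(f"trial {trial:2d}: largest TD error at step {int(np.argmax(deltas))}")
# On trial 0 the error peaks at the reward step; after learning, the
# residual error has shifted back toward step 0, the earliest predictor.
```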
Moreover, the temporal difference error plays an important role in neuronal approaches addressing the question of control and decision making, namely the actor-critic architectures (Barto, 1995; Witten, 1977). In these, a so-called ‘critic’ evaluates the choice of an ‘actor’ and sends reinforcement signals to it that serve to optimize the actor’s future choices. This
reinforcement signal results from the evaluation of the action that resulted from the actor’s choice. The signal takes the form of a temporal difference error, that is, the critic uses a representation of the expected outcome and determines whether the actual outcome of an action, as conveyed by feedback from the environment, is better or worse than expected.
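The division of labor between critic and actor can be sketched in the same illustrative fashion: a critic holds an outcome expectation for a single recurring choice situation and emits a TD-like error that trains both its own expectation and the actor's action preferences. This is an assumed toy setup, not a model taken from the literature.

```python
import numpy as np

rng = np.random.default_rng(0)
p_reward = np.array([0.8, 0.2])        # action 0 pays off more often
prefs = np.zeros(2)                    # actor: action preferences
expect = 0.0                           # critic: expected outcome
alpha_actor, alpha_critic = 0.2, 0.2

for _ in range(500):
    probs = np.exp(prefs) / np.exp(prefs).sum()    # softmax action selection
    a = rng.choice(2, p=probs)
    r = float(rng.random() < p_reward[a])          # feedback from the environment
    delta = r - expect                 # critic: better or worse than expected?
    expect += alpha_critic * delta     # critic refines its expectation
    prefs[a] += alpha_actor * delta    # actor: reinforce or punish the chosen action
print(f"P(action 0) = {np.exp(prefs[0]) / np.exp(prefs).sum():.2f}")
```

After training, the actor strongly prefers the more frequently rewarded action, even though only the critic ever evaluates outcomes.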
Actor-critic architectures play a central role when it comes to the function of the basal ganglia and the dopaminergic system and have recently been extended to describe the interplay of several brain areas, namely the basal ganglia, the anterior cingulate cortex (ACC), the orbitofrontal cortex (OFC), and the dorsolateral prefrontal cortex (DLPFC; Holroyd & Yeung, 2011). But before these models are described in further detail, the following section will give a rough overview of the dopamine system and its connection to the prediction error.
While the prediction error can explain learning on an algorithmic level, an important step was the discovery of how it is implemented on a neurobiological level. Schultz and colleagues (Schultz, Apicella, & Ljungberg, 1993) pointed out the connection between the prediction error and the dopaminergic system, and Schultz proposed in his seminal paper (Schultz, 1998) that a mesencephalic DA signal encodes the prediction error.
I.2.1.2 The Anatomy of the Dopamine System
Dopamine is a neurotransmitter and neuromodulator in the central nervous system, and early research on intracranial self-stimulation in rats (Olds & Milner, 1954) identified brain areas associated with DA as being involved in the processing of reward (for an overview, see Berridge & Robinson, 1998; Wise & Rompré, 1989). Central to the reward system are nuclei of the midbrain – the ventral tegmental area (VTA) and the substantia nigra pars compacta (SNpc) – from which DA neurons project to other brain regions.⁴ As depicted in Figure 1, there are three major projection pathways: the mesocortical DA system involves DA neurons that originate from the VTA and end in the prefrontal cortex (PFC), the cingulate and the perirhinal cortex, while neurons of the mesolimbic system terminate in the nucleus accumbens (NAcc), but also innervate the hippocampus and the amygdala (for an overview, see Arias-Carrión, Stamelou, Murillo-Rodríguez, Menéndez-González, & Pöppel, 2010;
Björklund & Dunnett, 2007). Finally, the nigrostriatal pathway projects from the SNpc to the basal ganglia, specifically the striatum. Electrical stimulation of mesencephalic neurons – and prominently, the NAcc – induces appetitive behavior in rats (e.g., Corbett & Wise, 1980; Olds, 1958; Olds & Milner, 1954; Wise, 1981), unless DA antagonists like haloperidol (Wauquier, Niemegeers, & Lal, 1974) or clozapine (Atrens, Ljungberg, & Ungerstedt, 1976) are administered, which abolish this behavior (see also Gallistel, 1986; Nakajima & McKenzie, 1986). Moreover, drugs like cocaine (Phillips, Stuber, Heien, Wightman, & Carelli, 2003; Wise, 1998) and amphetamines (Knutson et al., 2004), but also nicotine (Rice & Cragg, 2004) and, indirectly, opiates (Di Chiara, 1995; Di Chiara & Alan North, 1992) and cannabinoids (Cheer, Wassum, Heien, Phillips, & Wightman, 2004; Freund, Katona, & Piomelli, 2003) target the reward pathways and act as DA agonists, leading to increased activity in reward-related regions like the NAcc and the striatum (Breiter et al., 1997; Dixon et al., 2005; Knutson et al., 2004; Völlm et al., 2004).

⁴ DA also plays a role in the hypothalamus (see Ben-Jonathan & Hnasko, 2001), which, however, is irrelevant for this thesis.
Figure 1. Overview of the basal ganglia and projections of dopamine neurons originating from the substantia nigra and the ventral tegmental area (yellow).
Finally, diseases like Parkinson's disease (PD; Ehringer & Hornykiewicz, 1960) and schizophrenia (van Os & Kapur, 2009) that are marked by altered levels of DA reportedly involve problems in reward learning (for findings in PD patients, see Frank, Samanta, Moustafa, & Sherman, 2007; Frank et al., 2004; for schizophrenia patients, see Gold, Waltz, Prentice, Morris, & Heerey, 2008; Waltz, Frank, Robinson, & Gold, 2007; Waltz, Frank, Wiecki, & Gold, 2010). Taken together, this indicates that DA is involved in reward learning (see also Beninger, 1983; for an alternative perspective, however, see Redgrave & Gurney, 2006).
I.2.1.3 The Effect of Dopamine on the Neuronal Level
It is, however, crucial for the understanding of the dopaminergic system to be aware that DA is only the messenger, whereas its effect on postsynaptic neurons depends on the presence or absence of different DA receptors and usually involves other afferent neurons (e.g., Goldman-Rakic, Leranth, Williams, Mons, & Geffard, 1989; Smith, Bennett, Bolam, Parent, & Sadikot, 1994). All DA receptors are linked to second-messenger proteins that are released when DA binds to the receptor and that instigate short-term and long-term changes in the respective neuron related to long-term potentiation (LTP) or long-term depression (LTD; e.g., Kerr & Wickens, 2001). Roughly speaking, DA receptors can be categorized into two classes according to their function, with D1 and D5 receptors eventually inducing phosphorylation, and D2, D3, and D4 receptors opposing it (Lachowicz & Sibley, 1997). Phosphorylation increases AMPA receptor activity and availability and thereby temporarily increases the excitability of the neuron. In short, through phosphorylation, DA affects whether a neuron is activated more or less easily by an afferent neuron. Moreover, the cascade induced by DA can induce protein synthesis in both neurons, leading to lasting changes in receptor availability and modulation of the volume of neurotransmitter released, which increases the connectivity between pre- and postsynaptic neurons (Kelleher, Govindarajan, & Tonegawa, 2004).
Speaking in terms of neuronal modeling, DA can change synaptic weights.
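In modeling terms, this gating of plasticity by DA is often abstracted as a three-factor (neo-Hebbian) learning rule; the formulation below is a generic sketch, not the formulation of any specific model discussed here:

$$\Delta w_{ij} = \eta \, \delta \, x_i \, y_j$$

where x_i and y_j denote pre- and postsynaptic activity, δ is the DA-coded prediction error (positive for bursts, negative for dips), and η is a learning rate: coincident activity alone leaves a synaptic weight unchanged unless the dopamine signal gates the change.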
I.2.1.4 Reward Signaling by Dopamine Neurons
On the basis of this evidence and informed by prior research, Schultz and colleagues (Schultz, 1998; Schultz et al., 1997) devised an influential model of reward signaling by DA neurons. After the
presentation of an unexpected reward, mesencephalic neurons briefly increase their firing rate above an original baseline level, whereas omission of an expected reward or presentation of an aversive stimulus reduces the firing rate below the baseline. Importantly, this firing pattern is uninfluenced by the nature of the reward and its stimulus attributes, but depends on
previously learned information which is stored in the system (Schultz et al., 1993). Multiple studies have shown that the extent of a burst (Ljungberg, Apicella, & Schultz, 1992;
Mirenowicz & Schultz, 1994; Montague, Dayan, & Sejnowski, 1996) or a dip (Frank,
Moustafa, Haughey, Curran, & Hutchison, 2007; Hollerman & Schultz, 1998; Schultz, 2002;
Schultz et al., 1993) depends on the expectedness of the event, with larger bursts after the presentation of an unpredicted compared to a predicted reward, and greater dips after the omission of a strongly expected reward.⁵ Schultz argues that, in this way, dopaminergic neurons bidirectionally encode the neuronal equivalent of the reward-prediction error, with the firing rate and the resulting DA response being determined by the difference between the actual and predicted reward. Furthermore, research showed that during conditioning the presentation of an unconditioned stimulus (US) initially does not evoke a DA neuron response (e.g., Schultz, 1998; Schultz et al., 1995). Later, however, when a preceding stimulus has become a conditioned stimulus (CS), the presentation of the reward no longer evokes a strong burst – instead, the
presentation of the CS does. Apparently, the prediction error “wandered back in time”, similar to the temporal difference error. However, if the reward was already fully predicted by a different CS, then the DA neuron will not become sensitive to the presence of the US and no learning will occur, similar to the well-established blocking effect (Schultz et al., 1993).
Moreover, the DA neurons show another property of the temporal difference error as they are sensitive to the timing of the CS onset and also exhibit a firing behavior reminiscent of temporal discounting: compared to an early reward (2 s), a later reward (4-16 s later) leads to a monotonically reduced burst intensity, despite having the same objective value (Kobayashi
& Schultz, 2008).
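This delay sensitivity is what a temporally discounted value predicts. In the exponential form standardly assumed by TD models (empirical DA responses are often better described by hyperbolic discounting, so this is only the simplest sketch):

$$V(d) = \gamma^{d} \, r, \qquad 0 < \gamma < 1,$$

where r is the objective reward magnitude and d the delay until its delivery: the same reward commands a smaller value, and hence a smaller predicted burst, the later it arrives.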
Taken together, the dopaminergic neurons convey a temporal difference error to different areas of the brain by inducing a phasic increase in DA in the respective regions.
Mirroring actor-critic models, the midbrain acts, in conjunction with neurons at the terminal sites, as the critic, which sends a teaching signal to neurons that represent the actor, inducing a change in synaptic weights, i.e., connectivity.
⁵ Recently, Matsumoto and Hikosaka (2009) argued that only a subgroup of DA neurons encodes the prediction error and that a larger subset actually reacts to the unpredictedness of an event, irrespective of feedback valence (see, however, Frank & Surmeier, 2009; Wang, Tsien, & Tanimoto, 2011).
I.2.2 Models of Automatic Feedback Processing
Schultz’ research on reward signaling by DA neurons contributed to the conception of several theories and models of which two models of basic feedback processing proved to be influential in current research: The reinforcement learning theory of the FRN (RL theory;
Holroyd & Coles, 2002) and the basal ganglia/dopamine model (BG-DA; Frank & Claus, 2006). Both provide an actor-critic framework of how dopamine-based reinforcement
learning affects behavioral adjustment over time. However, they differ in their focus: The RL theory is mostly concerned with the effect of DA on the actor, which is in this theory the anterior cingulate cortex (ACC), and two electrophysiological correlates of this effect, the error-related negativity or error negativity (ERN/Ne; Falkenstein, Hohnsbein, Hoormann, &
Blanke, 1990; Gehring, Goss, Coles, Meyer, & Donchin, 1993) and the feedback-related negativity (FRN; Miltner et al., 1997). The BG-DA model focuses more on the dopamine-mediated updating of reinforcement expectations in the critic, the basal ganglia, and the effect of psychopathology, neuronal diseases, and individual differences in learning from feedback.
Together, they prove to be central for the understanding of automatic feedback processing and thus will be presented in the following.
I.2.2.1 A Model of Basal Ganglia Functioning – the BG-DA Model
Learning is the acquisition and modification of knowledge and ultimately serves adaptive behavior. On a basic level, this boils down to selecting one action over another – for instance, turning left at a crossroad or ticking answer A instead of B in a multiple-choice test.
Action selection in vertebrates, especially on the motor level, is associated with the basal ganglia (Redgrave, Prescott, & Gurney, 1999; Stocco, Lebiere, & Anderson, 2010). The nuclei that constitute the basal ganglia are interconnected with each other (Parent & Hazrati, 1995b) and with different areas of the cerebral cortex, as well as the thalamus and the limbic system (Parent & Hazrati, 1995a).
In a basal ganglia model proposed by Michael Frank and colleagues (e.g., Frank &
Claus, 2006), information about stimuli and the environment is transmitted from the cerebral cortex to the basal ganglia and activates representations of different actions, or 'channels' (see Figure 2 for an illustration of the model). Because only one action can be executed at a time, it is the function of the basal ganglia to resolve the resulting conflict between action channels
by further activating only the appropriate channel, while inhibiting competing actions. Frank and colleagues (M. X. Cohen & Frank, 2009; Frank, 2004; Frank et al., 2004) assume that this involves three pathways; a direct, an indirect, and a hyperdirect pathway. While the direct pathway acts to reinforce the representation of a certain action, the indirect pathway inhibits inappropriate action representations. Formulated more specifically in terms of the authors’
computational model, cortical input units transmit information about encountered stimuli to the premotor cortex and the striatum. This leads to a basic activation of candidate actions in the premotor cortex.
Meanwhile, the so-called Go and NoGo units in the striatum are also activated. Every candidate action is represented by activation of a specific Go and NoGo unit. While the striatal Go units can activate the thalamus via the direct pathway (involving only the globus pallidus interna, GPi), NoGo units inhibit representations of actions in the thalamus via the indirect pathway (involving not only the GPi, but also the globus pallidus externa, GPe).
Because the thalamus activates or inhibits action representations in the premotor cortex, an action is very likely to be implemented, when the respective Go unit is activated, whereas strong activation of the respective NoGo unit would prevent this action. Thereby, the
thalamus and the basal ganglia act as a gate that activates and thus facilitates the execution of a certain response, while inhibiting conflicting responses. Finally, the hyperdirect pathway involves the subthalamic nuclei and initially sends a NoGo signal to all action representations in the GPi and GPe, effectively suppressing premature responses due to random activity in these units (Frank, Samanta, et al., 2007).
Obviously, it is essential that a certain action is implemented in a certain situation, that is, that given a specific input the correct Go and NoGo units are activated. In order to achieve this, the DA signal from the SNpc is used in the striatum to adjust synaptic weights. Go units are associated with D1 receptors, which have an excitatory effect on the unit when DA binds to them and when the unit itself is already activated, whereas NoGo units have D2 receptors, which have an inhibitory effect upon exposure to DA without additional prerequisites (Hikida, Kimura, Wada, Funabiki, & Nakanishi, 2010). While, according to Frank, this further supports action selection, it has a more important effect on learning. When a selected action has led to a reward, a DA burst will reach the striatum via the SNpc. Because both – the Go unit associated with the implemented action as well as the input unit – will still be activated, DA will further increase
the activation of this Go unit by docking to its D1 receptors, eventually increasing the
synaptic weight between both units.⁶ The next time this specific stimulus and situation is presented, the input layer will activate this Go unit more strongly, which will increase the likelihood that the respective response will be selected over other actions and be implemented.

Figure 2. Simplified overview of the initial BG-DA model and the augmented model (gray background).

In contrast, if the response did not lead to a reward or even resulted in
punishment, a DA dip will reach the striatum. In addition to the fact that the Go unit will not be activated, the NoGo unit will be disinhibited: due to the decrease in DA, less DA binds to the inhibitory D2 receptors and thus the NoGo unit is activated more strongly. Again, this
⁶ The activation of the D1 receptor will not only result in increased activation of this unit, which leads to a short-term change in behavior, but will also result in a second-messenger cascade and thus structural changes that effectively result in long-term changes in connectivity.
results in strengthening of synaptic weights, but this time between the stimulus representation in the input layer and the NoGo unit for a specific response. Thereby, the stimulus will
activate the NoGo unit more strongly the next time it is encountered, effectively inhibiting the response. In brief, the temporal difference error coded by the dopamine signal results in
changes in the connectivity in the striatum that lead to progressively more adaptive responses to certain stimuli, while non-adaptive responses are inhibited and thus avoided.
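This learning logic can be compressed into a toy sketch: the two weight vectors below stand in for the Go and NoGo pathways, and the softmax stands in for the basal ganglia/thalamus gate. It is a didactic abstraction under assumed parameters, not the Frank & Claus (2006) network itself.

```python
import numpy as np

# Toy Go/NoGo learning: a DA burst after reward strengthens the Go weight
# of the chosen action, a DA dip after non-reward strengthens its NoGo weight.
rng = np.random.default_rng(1)
p_reward = np.array([0.8, 0.2])          # action 0 is the better choice
w_go = np.full(2, 0.5)                   # Go weights (D1, burst-sensitive)
w_nogo = np.full(2, 0.5)                 # NoGo weights (D2, dip-sensitive)
alpha = 0.1

for _ in range(300):
    net = w_go - w_nogo                  # Go facilitates, NoGo suppresses
    probs = np.exp(3 * net) / np.exp(3 * net).sum()   # gate as a softmax
    a = rng.choice(2, p=probs)
    rewarded = rng.random() < p_reward[a]
    if rewarded:                         # DA burst: strengthen Go for this action
        w_go[a] += alpha * (1 - w_go[a])
    else:                                # DA dip: strengthen NoGo for this action
        w_nogo[a] += alpha * (1 - w_nogo[a])
print("net Go-NoGo weights:", np.round(w_go - w_nogo, 2))
```

Over trials, the frequently rewarded action accumulates Go strength while the frequently punished action accumulates NoGo strength, so adaptive responses are facilitated and non-adaptive ones are suppressed.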
I.2.2.2 The Reinforcement Learning Model by Holroyd & Coles (2002)
As already indicated by the name, the reinforcement learning model was designed as a framework for the neural basis of reinforcement learning, like the aforementioned BG-DA model. Its unique feature, however, is its integration of and focus on the ACC, this brain region's strong link to the ERN and the FRN, and its functional role in error processing.
As these components will be discussed in detail later, it is for now sufficient to note that the ACC receives projections from multiple brain regions and is believed to be involved in attention (Weissman, Gopalakrishnan, Hazlett, & Woldorff, 2005), error monitoring (e.g., Bush, Luu, & Posner, 2000) and conflict resolution (e.g., Botvinick, Braver, Barch, Carter, &
Cohen, 2001; D'Esposito et al., 1995; Posner & DiGirolamo, 1998). Holroyd and Coles (2002) propose that multiple, parallel motor controllers project to the ACC, which again is connected to the motor system. Motor controllers, in this model, correspond to different subsystems or brain regions (e.g., the OFC, the amygdala, the DLPFC, etc.) that direct the motor system to implement or avoid a certain behavior given a certain stimulus input from the sensory cortex. The ACC acts as a filter and decides which of the possibly conflicting behaviors put forward by the different motor controllers are enacted. For instance, in an Eriksen flanker task, a target letter is presented and accompanied by distractor letters, and participants have to react to the target letter. Target and distractors can be incongruent, i.e., the distractors suggest a different response (e.g., pressing the left button) than the target stimulus (e.g., pressing the right button). Put in terms of the model, in this situation, two different motor controllers demand control over the motor system. This conflict is solved by the ACC, which ideally enables the correct motor controller to take command.
An important point is how the ACC is trained to delegate motor control appropriately.
The RL model proposes an actor-critic architecture in which the basal ganglia, particularly the VTA, act as an adaptive critic that send a temporal difference error signal to the actor, which
is mainly the ACC, but also the motor controllers (Holroyd & Coles, 2002). Consequently, learning not only occurs in the motor controllers, but also in the ACC, which can use this reinforcement learning signal to adaptively select the most appropriate motor controller, optimizing task performance. Furthermore, the authors suggest that an efference copy of the motor command might be immediately fed back to the adaptive critic, which would allow for an on-the-fly adaptation of suboptimal behavior.
I.2.2.2.1 EEG correlates of ACC activity.
Initially, the RL model focused mainly on the effect of erroneous behavior. The mesencephalic dopamine signal sent to the ACC inhibits local pyramidal cells, whose apical dendrites in turn receive excitatory input from the motor controllers. As a consequence, the dopamine dip associated with an error leads to a disinhibition of these neurons if a motor controller is activated. Because of the simultaneous disinhibition and the parallel alignment of the pyramidal cells in the ACC, a strong negative scalp potential is generated at frontocentral scalp positions – the ERN or the FRN, respectively. In other words, the ERN reflects adaptive changes in neural activity in the ACC following internal feedback. When the correct response is obvious, motor behavior can be compared to an already established internal reference, which would make external feedback unnecessary – the cognitive system already 'knows' that the response was erroneous. In contrast, when the outcome of a behavior is still unclear, the system has to rely on external feedback to evaluate it and accordingly implement adaptive changes. In short, the ERN that follows an incorrect response reflects the evaluation of internal feedback, while the FRN that follows the presentation of performance information reflects the evaluation of external feedback. Both internal and external negative feedback signals indicate that an action resulted in an outcome that is worse than expected. Further, Holroyd and Coles (2002) were able to simulate in their computational model, as well as show empirically, that the dopaminergic signal underlying both ERP components "propagates back in time" (p. 682) as predicted by the temporal difference method. External negative feedback after an erroneous response initially evokes a strong frontocentral negativity in the EEG, whereas the response itself does not. Over the course of a learning session the correct behavior is learned and the response itself becomes predictive of the outcome, whereas external feedback loses its informational value, as its valence is already predicted by the preceding response itself. Conversely, an incorrect response is immediately
followed by a negativity during later trials in the session, while at the same time the feedback negativity is strongly reduced (Holroyd & Coles, 2002).
Later refinements of the RL theory focused on the role of positive feedback (Holroyd
& Krigolson, 2007; Holroyd, Krigolson, & Lee, 2011a; Holroyd, Pakzad-Vaezi, & Krigolson, 2008; Pakzad-Vaezi, Krigolson, & Holroyd, 2006). Although the initial argument that the FRN is generated by the disinhibition of apical dendrites in the ACC appears compelling at first glance, it is challenged by the existence of the N200, a negativity similar in timing and scalp maximum. If, by default, feedback was followed by a strong negativity in the typical FRN time window (200–400 ms), then the absence of an FRN after positive feedback could reflect a positive component overlapping this N200. This positivity could potentially be a manifestation of the positive prediction error. Comparing the individual amplitudes of the N200 in an oddball task and the FRN in an Eriksen flanker task for each participant, Holroyd, Pakzad-Vaezi, et al. (2008) found that both components were not only strongly correlated, but that they also shared the same latency and scalp distribution as revealed by a principal component analysis (PCA). With respect to the finding that neutral stimuli also evoke an FRN (Holroyd, Hajcak, & Larsen, 2006), the authors proposed that "these findings suggest that events that fail to indicate that a task goal has been achieved (including the occurrence of both neutral and error feedback stimuli) elicit the N200" (Holroyd, Pakzad-Vaezi, et al., 2008, p. 695), whereas a positive component, the "feedback correct-related positivity" (fCRP), represents a reward signal from the midbrain dopamine system when a task goal has been met.
I.2.3 Summary
Processing of the temporal difference error is central for reinforcement learning. In the brain of vertebrates, the TD error is conveyed by a DA signal. Both the BG-DA model and the RL theory of the FRN provide a framework for how the DA signal is utilized in the basal ganglia and the ACC, respectively, to adapt future behavior. Consequently, knowledge of the effect of certain manipulations on the size of the DA signal is vital for the
understanding of automatic feedback processing. As a correlate of the DA signal and thus the TD error, the FRN is therefore well suited for feedback research. However, as reinforcement learning is only one process that contributes to learning from feedback, it is important not to
neglect controlled processes and to integrate them with existing models. Further, it is
necessary on the one hand to understand whether and how the FRN is affected by controlled processes and on the other hand to identify a component that is more indicative of the quality of controlled feedback processes.
I.3 Controlled Feedback Processing

I.3.1 Basal Ganglia and Working Memory
The presented models are appealing in their ability to bring together research on reinforcement learning, the reward-prediction error and the dopaminergic system and, in the case of the RL theory, point to a correlate of the reward-prediction error: the FRN.
Furthermore, a host of studies provided supporting evidence for the BG-DA model (e.g., Cavanagh, Frank, & Allen, 2010; Frank, 2005, 2008; Frank, Cohen, & Sanfey, 2009; Frank, Doll, Oas-Terpstra, & Moreno, 2009; Frank, Moustafa, et al., 2007; Frank, O'Reilly, &
Curran, 2006; Frank, Samanta, et al., 2007; Frank, Santamaria, O'Reilly, & Willcutt, 2006;
Frank et al., 2004; Gold et al., 2008; Gründler, Cavanagh, Figueroa, Frank, & Allen, 2009;
Kasanova, Waltz, Strauss, Frank, & Gold, 2011; Moustafa, Cohen, Sherman, & Frank, 2008;
Moustafa, Sherman, et al., 2008; Strauss et al., 2011; Waltz et al., 2010; Waltz et al., 2008; for an overview, see Maia & Frank, 2011) and for the RL theory (respective FRN results will be discussed later in detail in section I.4.2.3).
At first glance, both models and the respective results appear to apply only to basic motor learning and habit formation, with a strong focus on reinforcement learning processes.
In line with this, it is important to note that already during the conception of the BG-DA model it was assumed that the basal ganglia may not only orchestrate the activation of responses, but also influence the updating of representations in working memory (Frank, Loughry, & O'Reilly, 2001; O'Reilly & Frank, 2006). Specifically, O'Reilly and Frank (2006) proposed that the basal ganglia can act as a gating mechanism that supports adaptive working memory updating on the basis of reinforcement learning principles (see also Baier et al., 2010; McNab & Klingberg, 2007). This relates the implicit reinforcement learning system to the explicit, working-memory-based system, implying that variables influencing
reinforcement learning, such as drugs (Cools et al., 2009; Cools, Lewis, Clark, Barker, &
Robbins, 2006), disorders (for PD, see Beeler et al., 2012; Brück et al., 2001; Frank, 2005;
Wiecki & Frank, 2010; for schizophrenia, see Frank, 2008; Gold et al., 2012; for depression, see Cavanagh, Bismark, Frank, & Allen, 2011; for an overview, see also Maia & Frank, 2011), and individual genetic dispositions (Frank & Fossella, 2010; Frank, Moustafa, et al., 2007) can affect higher order cognitive processes. Although this insight is intriguing, it covers only one side of the interaction between both systems, while another important aspect, namely the influence of controlled processes on automatic feedback processing, remains unaddressed.
For this reason and based on prior research on the top-down biasing effect of information maintained in working memory (e.g., J. D. Cohen, Dunbar, & McClelland, 1990; Miller &
Cohen, 2001), Frank and Claus (2006) devised an augmented version of the BG-DA model which captures the working-memory-related aspects of feedback processing in more detail. This model thus provides vital input for the discussion of a dual-process account of feedback processing.
Accordingly, this model and further refinements of it will be presented in the following, accompanied by a discussion of related research. This research addresses a central aspect of controlled feedback processing, namely flexibility, which allows for (a) swift changes in behavior when contingencies change abruptly, and (b) the integration of prior information gained not by multiple experiences, but by a single instruction or clue.
Further, there is evidence that under these circumstances, an attention-based EEG component, the P300, is better suited to predict adaptive behavior than the FRN. Finally, research on the role of feedback in an explicit learning situation, i.e., in the educational context, will be briefly reviewed. The aim of this final section is not only to utilize prior research from different fields for the question at hand, but also to inform the conception of an experimental paradigm suited for investigating controlled feedback processes with behavioral and EEG measures.
I.3.2 The Augmented BG-DA Model
The previously presented BG-DA model accounts for the slow integration of feedback information over multiple trials, but it addresses neither rapid changes in behavior, nor the importance of gain and loss magnitude conveyed by feedback, nor prior research on the importance of the orbitofrontal cortex (OFC) for decision making (e.g., Kringelbach & Rolls, 2003; O'Doherty, Kringelbach, Rolls, Hornak, & Andrews, 2001). Therefore, the recent augmented version of the model proposed by Frank and Claus (2006) deserves special attention as it addresses several of these shortcomings. First, it includes the interaction of the basal ganglia, the premotor cortex, and the amygdala with the medial and lateral OFC (see Figure 2). According to this model, the OFC can hold recent outcomes in working memory, with the medial OFC being sensitive to rewarding stimuli and the lateral OFC to punishing stimuli (however, see Noonan, Mars, & Rushworth, 2012, for an alternative perspective on the function of these brain areas). Second, input from the amygdala allows not only for the representation of the likelihood of a reward, but also of its magnitude. Thereby, the expected utility (the product of reward magnitude and likelihood) of an action is also entered into the model.
Finally, the OFC can bias action selection in the basal ganglia through projections to the striatum, the SNpc and VTA, as well as to the premotor cortex. In sum, the OFC can act as a memory store for recent positive and negative outcomes and can influence reinforcement learning and later working memory updating. The authors show that these interactions can explain behavior in a gambling task and during reversal learning better than a model restricted to the basal ganglia. For instance, when contingencies are reversed during a reversal learning task, participants will initially face several instances of negative feedback until the new reward structure is learned and response behavior is adapted. Striatal learning alone would require several trials to integrate the new contingencies. However, the OFC can use information about this recent punishment to bias learning in the striatum, thus allowing for swifter adaptation to changes in the environment (Frank & Claus, 2006; Robinson, Frank, Sahakian, & Cools, 2010).
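To illustrate this mechanism, the following deliberately minimal toy simulation (in Python) contrasts a plain incremental, striatum-like learner with a variant in which a small store of recent outcomes boosts the effective learning rate after a streak of punishments, standing in for the proposed OFC bias. This is a sketch under simplifying assumptions, not the actual connectionist implementation of Frank and Claus (2006); all names and parameter values are illustrative.

```python
import random

def run_task(ofc_bias, trials=200, reversal=100, alpha=0.1, seed=1):
    """Toy reversal-learning task: action 0 is rewarded with p = .8
    before the reversal, action 1 with p = .8 afterwards."""
    rng = random.Random(seed)
    q = [0.5, 0.5]          # striatal action values
    recent = []             # OFC-like store of recent outcomes
    correct = 0
    for t in range(trials):
        good = 0 if t < reversal else 1      # currently better action
        a = 0 if q[0] >= q[1] else 1         # greedy choice
        # reward with p = .8 for the good action, p = .2 otherwise
        r = 1.0 if (a == good) == (rng.random() < 0.8) else 0.0
        correct += (a == good)
        lr = alpha
        if ofc_bias:
            recent = (recent + [r])[-5:]     # remember the last 5 outcomes
            if len(recent) == 5 and sum(recent) <= 1:
                lr = alpha * 4               # punishment streak: learn faster
        q[a] += lr * (r - q[a])              # prediction-error update
    return correct / trials

print("plain striatal learner:", run_task(ofc_bias=False))
print("with OFC-like bias:   ", run_task(ofc_bias=True))
```

In this toy version, the biased learner typically achieves a higher proportion of correct choices because it recovers faster after the reversal, which is the qualitative pattern the augmented model is meant to capture.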
I.3.3 Recent Support for the Contribution of Controlled Processes to Feedback Processing
Basic support for this dual-process account in general, and specifically for the contribution of higher-order cognitive processes to performance in basic reinforcement learning, comes from Collins and Frank (2012), who found that, in contrast to a simple reinforcement learning model, only the augmented model can account for performance variance in early learning trials and for the effect of memory load and time delay on learning performance. They further identified working memory capacity and learning speed in the basal ganglia as important determinants of learning performance and related these variables to individual genes: one encoding the enzyme catechol-O-methyltransferase (COMT), which is related to frontal dopamine levels and working memory capacity, and the other, GPR6, which encodes a G-protein-coupled receptor and is related to learning in the basal ganglia (see Frank & Fossella, 2010). Further support for the augmented model regarding the influence of higher cognitive processes on learning comes from Doll and colleagues (2009), who showed that instructions can influence response behavior in a reinforcement learning task.
They further added prefrontal cortex and hippocampal influences to Michael Frank’s
computational basal ganglia model and compared the simulation results to the behavioral data.
This led them to interpret their behavioral results as supporting the notion that the representation of a rule held in the PFC and hippocampus biases learning in the striatum, rather than simply overwriting the influence of the basal ganglia on the premotor cortex. It appears that an explicit rule, even if it is erroneous, does not simply lead to the “gut feeling” being ignored, but influences basic reinforcement learning. A subsequent study found that this higher cognitive influence can distort learning akin to a confirmation bias, leading to maladaptive behavior when the initial instruction about reinforcement probabilities was inaccurate, i.e., contradicted the true reward contingencies (Doll et al., 2011). The authors assume that information held in the PFC exerts a strong influence on the striatum, modifying reinforcement learning in the direction of the belief established by the instruction by emphasizing confirming feedback and deemphasizing contradicting feedback.
This assumption is supported not only by the comparison of different computational models, but also by the finding that a gene polymorphism previously linked to higher prefrontal DA levels and better working memory performance (the Met allele of the COMT gene) is also associated with a stronger confirmation bias – supposedly, the gene variant led to a more stable representation of the instruction in working memory, which was more resistant to contradicting information.
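A minimal sketch of how such an instruction-induced confirmation bias can be expressed computationally is given below (Python; an illustrative simplification under assumed parameter values, not the actual model used by Doll et al., 2011): prediction errors that agree with the instructed belief are amplified, whereas those that contradict it are attenuated.

```python
def biased_update(q, action, reward, instructed_action,
                  alpha=0.1, amplify=1.5, attenuate=0.5):
    """One value update with an instruction-induced confirmation bias.

    Outcomes that confirm the instructed belief (reward for the
    instructed action, or no reward for another action) are weighted
    more strongly; disconfirming outcomes are discounted. With
    amplify = attenuate = 1 this reduces to plain reinforcement learning.
    """
    delta = reward - q[action]                      # prediction error
    confirms = (reward > 0) == (action == instructed_action)
    weight = amplify if confirms else attenuate
    q[action] += alpha * weight * delta
    return q

# Example: the (false) instruction favors action 0, which is never rewarded.
q = [0.5, 0.5]
for _ in range(20):
    q = biased_update(q, action=0, reward=0.0, instructed_action=0)
print(q)  # q[0] decays only slowly: disconfirming feedback is deemphasized
```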
This study also suggests that although information held in working memory can influence processing in the striatum, reinforcement learning remains sufficiently independent from higher-order processes that participants eventually abandon their false beliefs in favor of the true contingencies as learned by reinforcement learning. Two studies further support this line of reasoning: Walsh and Anderson (2011) showed that although instructions can immediately dominate behavior in a probabilistic learning task and eliminate the effect of feedback on behavior, reinforcement learning as indicated by the FRN still occurs as if no instruction had been given. Specifically, the FRN was sensitive to outcome probabilities and the TD error as observed in other studies (e.g., Ichikawa, Siegle, Dombrovski, & Ohira, 2010; Yasuda, Sato, Miyawaki, Kumano, & Kuboki, 2004; see also section I.4.2.5 for an in-depth discussion). The feedback-P300, however, discriminated between conditions, as its amplitude was less pronounced when an instruction had been given than when it was absent. The authors propose that this reflects the decreased significance of feedback in the instruction condition and the resulting decrease in attention paid to it. Similarly, Sailer, Fischmeister, and Bauer (2010) reported that in a decision-making task, a reduction of the feedback-P300 amplitude indicated that participants had gained insight into a hidden sequence, which allowed for better performance, whereas the FRN decreased over the course of the experiment irrespective of such deeper understanding. Here, again, the P300 reduction can be interpreted as reflecting the reduced significance of the feedback, which became less important to participants once they were aware of the hidden sequence.
Whereas these studies showed decreased involvement of controlled processes under certain conditions, recent research on behavioral reversal implies that an increased feedback-P300 can indicate “updating of stimulus-response associations” (Chase et al., 2011, p. 944), resulting in behavioral adjustment after a change in contingencies. Interestingly, in contrast to an increased P300 on trials preceding a change in response behavior, the FRN after negative feedback was attenuated on these trials. It appears that participants fully expected this negative feedback because they had come to the correct assumption that contingencies had changed and that, consequently, previously correct responses should now lead to negative feedback.
Together, these recent studies further support a dual-process account of feedback processing and the assumption that while the basal ganglia are involved in the updating and maintenance of working memory representations (for a similar view, see McNab & Klingberg, 2007), representations held in working memory, such as explicit rules or recent events, can also bias learning in the striatum (Doll et al., 2009; but see also Biele, Rieskamp, & Gonzalez, 2009; Chang, Doll, van't Wout, Frank, & Sanfey, 2010). In short, although there are two systems that can process feedback independently, there is evidence for an interaction between the two, especially for top-down influences on automatic feedback processing. Importantly, these studies identified the P300 as an indicator of the involvement of controlled processes in learning, presumably because attention and working memory play a pivotal role in these processes.
I.3.4 Controlled Feedback Processing in Educational Settings
It is noteworthy that controlled feedback processes have caught the attention of ERP research only fairly recently, whereas they have been a focus of educational research for several decades. Moreover, in contrast to neuroscientific approaches, this line of research has mainly addressed what can be perceived as the controlled aspect of feedback processing.
Based on a review of the research existing at the time, Kulhavy (1977) argued that the central function of feedback in an educational setting is to correct erroneous responses, and opposed the idea that feedback mainly acts as a reinforcer (e.g., Skinner, 1968). In this sense, feedback in this context is primarily informational, not rewarding or punishing (see also Bangert-Drowns, Kulik, Kulik, & Morgan, 1991). Consequently, the role of attention in feedback processing was central to this line of research. One example of this is the so-called hypercorrection effect.
Kulhavy already acknowledged that feedback is processed in the light of previous knowledge and assumptions about the correctness of the response (see also Butterfield & Mangels, 2003;
Butterfield & Metcalfe, 2001; Butterfield & Metcalfe, 2006; Fazio & Marsh, 2009; Kulhavy
& Stock, 1989; Kulhavy, Yekovich, & Dyer, 1976; Mory, 2003). It seems obvious that high confidence in a response increases the likelihood that it is repeated on a later occasion, and indeed a study by Kulhavy, Yekovich, and Dyer (1976) supports this for initially correct responses. Surprisingly, this is not the case for erroneous high-confidence responses – these were in fact more likely to be corrected than responses held with low confidence. This hypercorrection effect is due to increased attention to the feedback (Butterfield & Mangels, 2003; Butterfield & Metcalfe, 2001; Butterfield & Metcalfe, 2006), which increases the quality of feedback encoding and results in improved memory for feedback details and the feedback source (Fazio & Marsh, 2009), as well as improved performance in a subsequent test.
Interestingly, Butterfield and Mangels (2003) found evidence for the sensitivity of the FRN to feedback valence and the violation of expectations, but no relation between the FRN amplitude and subsequent error correction. This is in line with the