5.5 Experiment 1a, b – The Raven’s Paradox and the WST

In Experiment 1, the Bayesian WST debate (Chapters 4 and 5) is connected to the raven paradox debate (Chapter 3) by using raven material. With regard to both debates, I have argued that the preconditions for the Bayesian models need to be established by knowledge or by a given context. In Experiment 1, an MST is used to establish the preconditions of the model (Section 5.3). Supporting the Bayesian resolution of the raven paradox by using raven material in a WST, and by showing an increase of non-raven selections in the predicted condition, would be a novel finding.

The Raven Paradox and the WST

The intimate connection between the raven paradox debate and the WST debate has been realised by only a few authors:

(a) On the theoretical level, Humberstone (1994) elaborated this analogy from a falsificationist perspective and Nickerson (1996) from a Bayesian perspective,32 but the two debates continued in a rather disconnected way.

The raven paradox is concerned with the question of whether single conjunctive observations, Oa = p ∧ q, Ob = p ∧ ¬q, Oc = ¬p ∧ q, or Od = ¬p ∧ ¬q, are confirmatory, disconfirmatory or irrelevant. In contrast, the WST debate is concerned with the turning over of cards, for instance a q card, and the resulting hypothetical observations, here either p | q or ¬p | q. In the WST, as in the raven paradox, observations can be confirmatory, disconfirmatory or irrelevant. Moreover, mathematically the evaluation of these hypothetical conditional observations in the WST rests on the evaluation of the conjunctive observations relevant for the raven debate.

32 Although Oaksford and Chater (1994) mentioned the raven paradox, they did not point out that their original Bayesian approach was incoherent with the standard Bayesian resolution of the paradox (see pp. 53 f.).

(b) Empirically, to my knowledge, only one earlier experiment by other authors has been directly concerned with the Bayesian resolution of the raven paradox (McKenzie & Mikkelsen, 2000). However, this study used neither a WST nor raven material.

Only one experiment, conducted by von Sydow (2002), has directly investigated WSTs using raven material. In that experiment, however, the marginal probabilities had not been clearly fixed, and the results were not in accordance with the predictions. In four WSTs, the participants had to check the truth or falsity of the hypothesis ‘all ravens are black’. P(p) and P(q) were manipulated by changing the size of the universe of discourse, which was varied subtly by using different labels for the cards. In the high probability (small universe of discourse) conditions, the cards were labelled ‘raven’ (p), ‘dove’ (non-p), ‘black bird’ (q) and ‘white bird’ (non-q). In the low probability (large universe of discourse) conditions, they were labelled ‘raven’ (p), ‘table’ (non-p), ‘a black object’ (q) and ‘a white object’ (non-q). Additionally, the context story was varied, describing the cards as referring either to ‘any object’ or to the ‘birds of a particular aviary’. The results showed no significant difference between the conditions; only some small positive effects were observable descriptively.

Von Sydow (2002) argued that the deviations from the model with fixed marginals might have been due to not explicitly fixing the marginal probabilities in the story. Additionally, a number of further problems were suggested as potentially responsible for the negative results, most of them connected to the raven material. For example, Humberstone (1994, 399) has suggested that, regarding ornithology in the field (as opposed to the second-hand ornithology of the birds-and-colours observation-recording cards), it is perhaps hard to take very seriously the distinction between observing a swan as white and observing a white bird as a swan. All you get to see is white swans.

(c) In the following Experiment 1 the aim was to remove all such possible misunderstandings. In contrast to the previous experiment with raven material (von Sydow, 2002) the following additional requirements (cf. p. 91) were met:

• The card-sides that report the species of the birds and the card-sides that report their colour were now presented as clearly distinct sides of observation-recording cards.

• The independence model was more clearly induced as the most plausible alternative hypothesis to the dependence model.

• In the MST used, the probabilities P(p) and P(q) are exactly known to the participants.

• The marginal probabilities, P(p_res) and P(q_res), were fixed by the instructions. In particular, in this MST the constancy assumption P(black | MD) = P(black | MI) is made plausible: the truth of the hypothesis that all ravens are black should not increase the assumed number of black birds.
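The constancy assumption in the last requirement can be illustrated by a short worked derivation. This is only a sketch under assumptions that are not spelled out in this section: MD is read as a deterministic dependence model without non-black ravens, so that P(p ∧ ¬q | MD) = 0; MI is read as the independence model; both models share the same marginals P(p) and P(q), with P(p) ≤ P(q).

```latex
\begin{align*}
P(q \mid M_I) &= P(p \wedge q \mid M_I) + P(\neg p \wedge q \mid M_I)
               = P(p)\,P(q) + \bigl(1 - P(p)\bigr)\,P(q) = P(q), \\
P(q \mid M_D) &= P(p \wedge q \mid M_D) + P(\neg p \wedge q \mid M_D)
               = P(p) + \bigl(P(q) - P(p)\bigr) = P(q).
\end{align*}
```

Under these assumptions the proportion of black birds is the same whether or not the rule holds. Only the distribution of black birds over ravens and non-ravens differs between the two models.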

Method Experiment 1a, b

The goal of Experiment 1 was to obtain the predicted Bayesian frequency effects of the Sydow model for a WST with raven material.

Design and Participants

Experiment 1a was concerned with the dependent variable of p versus non-p selections, while Experiment 1b investigated the dependent variable of q versus non-q selections. Formally, two independent experiments (with different participants) were run to exclude any possible sequential effects (cf. Klauer, 1999; cf. my requirement three, pp. 91 f., cf. 93 f.). For both dependent variables (Experiment 1a and Experiment 1b), a low (‘0.10 → 0.20’) versus a high probability condition (‘0.80 → 0.85’) was tested in a between-subjects design.

64 students of the University of Göttingen took part in the experiment (56 % female, 44 % male, mean age: 24 years). They were recruited on the campus and participated voluntarily. They received a small gift and were able to win a prize after finishing the task. The most frequent fields of study were law (36 %), humanities (11 %), and economics (9 %). None of the participants had prior knowledge of the selection task. 16 participants were randomly assigned to each of the resulting four conditions of Experiment 1a and 1b. Five participants were excluded from further analysis since they did not follow the instructions. (Four participants selected a card which they were not permitted to select in their condition: they selected p or non-p cards instead of q or non-q cards, or vice versa.)

General Predictions

Since the instructions used should induce the preconditions of the Sydow model (cf. 5.1), the advocated knowledge-based account predicts both frequency effects: q versus non-q effects (black versus white birds) and p versus non-p effects (ravens versus non-ravens). The predictions for the low (‘0.10 → 0.20’) versus high probability conditions (‘0.80 → 0.85’) can be derived from the model behaviour (see p. 81).

In Table 15, the resulting general predictions are presented. In the high probability condition (relative to the low probability condition) increased numbers of non-p and non-q selections are expected. In contrast, falsificationism for both conditions equally demands p & non-q selections (cf. 1.2, 3.1). Likewise, naïve inductionism – without Bayesian extensions – would not predict any frequency effects (cf. Section 3.2).
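The direction of these predictions can be reproduced with a minimal sketch. It rests on several assumptions that this section does not confirm: the labels ‘0.10 → 0.20’ and ‘0.80 → 0.85’ are read as P(p) → P(q); the dependence model MD is taken to be deterministic, with P(p ∧ ¬q) = 0; both hypotheses get equal priors; and card usefulness is measured as expected information gain (expected reduction in Shannon entropy), the measure known from Oaksford and Chater (1994). All function names are hypothetical.

```python
from math import log2


def entropy(p_md):
    """Shannon entropy (in bits) of the belief over the two models MD and MI."""
    return -sum(x * log2(x) for x in (p_md, 1.0 - p_md) if x > 0.0)


def hidden_side_probs(p_p, p_q):
    """P(hidden side shows the positive value | visible side, model).

    For p and non-p cards the hidden positive value is q; for q and non-q
    cards it is p. Assumed here: MD has no p & non-q cases, MI is the
    independence model, and both share the marginals p_p and p_q
    (requires 0 < p_p <= p_q < 1).
    """
    return {
        "p": {"MD": 1.0, "MI": p_q},
        "non-p": {"MD": (p_q - p_p) / (1.0 - p_p), "MI": p_q},
        "q": {"MD": p_p / p_q, "MI": p_p},
        "non-q": {"MD": 0.0, "MI": p_p},
    }


def expected_information_gain(p_p, p_q, prior_md=0.5):
    """Expected entropy reduction over {MD, MI} from turning each card."""
    gains = {}
    for card, lik in hidden_side_probs(p_p, p_q).items():
        expected_posterior_entropy = 0.0
        for positive in (True, False):
            l_md = lik["MD"] if positive else 1.0 - lik["MD"]
            l_mi = lik["MI"] if positive else 1.0 - lik["MI"]
            p_outcome = prior_md * l_md + (1.0 - prior_md) * l_mi
            if p_outcome > 0.0:
                posterior_md = prior_md * l_md / p_outcome
                expected_posterior_entropy += p_outcome * entropy(posterior_md)
        gains[card] = entropy(prior_md) - expected_posterior_entropy
    return gains


# Assumed reading of the condition labels as P(p) -> P(q).
low = expected_information_gain(0.10, 0.20)
high = expected_information_gain(0.80, 0.85)

for label, gains in (("low", low), ("high", high)):
    print(label, {card: round(gain, 3) for card, gain in gains.items()})
```

Under these assumptions the p and q cards are the more informative ones in the low probability condition, whereas the non-p and non-q cards are more informative in the high probability condition. The sketch only shows the direction of the effect; it does not model selection frequencies.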

Table 15

Predicted Increased Portion of Selections Relative to the Corresponding Alternative Probability Condition of the Bayesian Model and the Predictions of Falsificationism and Naïve Inductionism

[Table body not recoverable from the source. Surviving column headers: Exp. 1a: raven (p) vs. non-raven (non-p); Exp. 1b: black (q) vs. non-black (non-q). Surviving row label: Low.]

Excursus: Procedure and Materials in All Experiments

In all experiments in this work, participants were given a little booklet. It always consisted of a front page, a general introduction, one or sometimes several task pages, and a final questionnaire concerning biographical data and comments. The pages of the booklets were very similar in all experiments, apart from the task page(s). Before describing the task pages of Experiment 1, the other pages are described once for all the experiments. Important differences from the general procedure will be reported for each experiment separately.

(a) On the front page, the title of the study and contact information were provided. The title of the experiments in Part II was ‘Questionnaire – Styles of Human Hypothesis Testing’, supplemented by a name indicating the specific material used, for instance ‘Raven Experiment’.

(b) The general introduction page informed the participants that the task was not an intelligence test, but that we were interested in their style of reasoning. This was done in order to obtain answers that were as ‘natural’ as possible. Participants were told to read the instructions carefully and were asked to take as much time as they needed. They were allowed to use the margins for comments. Since some tasks were conducted in small groups, it was emphasised that the tasks had to be solved individually; this was enforced by the experimenter. The participants were encouraged to ask the experimenter if they had any questions about understanding the task. Finally, participants were informed that they could win prizes if they completely finished their task.

(c) In the final questionnaire, the participants were asked for biographical information about gender, age, and field of study. They were asked whether they had already known the task. In the experiments of Part II, participants were asked whether they were versed in probability theory and formal logic, and whether they had used formal logic, probability theory, or their judgement/intuition to solve the task. They were always asked to comment on the task. Finally, they were asked whether they wanted information on the task and whether they wanted to take part in the lottery to win a prize.

Editorial note: When describing the material of the experiments, italics are used for any text passages which were highlighted in some way in the original instruction.

Procedure and Materials of Experiment 1

The task was placed in the setting of a zoo where the following deterministic rule was to be tested: “If a bird is a raven, then its feathering is always black”. The participants were to imagine visiting this zoo with two acquaintances, Carl and Gustav (named in honour of Carl Gustav Hempel, father of the raven paradox). These two acquaintances make two opposed claims, which should correspond exactly to the two conflicting hypotheses tested in the Sydow model. The instruction read as follows (translation from German):

“With two acquaintances, Carl and Gustav, you are visiting an animal park. Carl describes the beautiful ‘black ravens’. Gustav objects that it is absurd to speak of ‘black ravens’, since all ravens are black anyway. Carl replies that in such an animal park there may well be ravens that are not black, adding that it is not even clear whether the majority of ravens in this park are black. Gustav insists that the ravens in this park are also all black. Finally, they decide to test this empirically.

In the keeper’s cabin, there is a card file in which all birds of the aviary are listed. In this file, 40 birds are listed on 40 cards. The cards are used on both sides. On one side of each card, the species is noted. Here it is only noted whether the bird is a raven ( = a silhouette of a raven) or not (R = no raven). On the other side, the feather colour of each bird is noted (z = black; { = symbol for a non-black, e.g., brown or white colour).

You only aim to test whether the following sentence is true or false (valid for the cards): If a bird is a raven, then its feathering is always black. [set in bold print]”

In order to provide information about P(p) and P(q), and to make sure that this information is taken to be independent of the truth or falsity of the hypothesis, 40 cards concerning birds were shown as in a standard MST (see p. 92). First, an overview of the 40 p and non-p card sides was displayed, and then the reversed 40 q and non-q sides were shown. The numbers of card sides shown in the low and the high probability conditions are given in Table 16. In the two displays, the following symbols were used: raven card, a silhouette of a raven; non-raven card, R; black card, z; non-black card, {.

The two displays of the 40 cards were each ordered in 4 rows × 10 columns. The two displays were said to be sorted independently of each other, so that the cards provide neither positive nor negative evidence for the hypothesis “raven → black”.

The card order used is shown in Table 17. (No rare card in the first display visually corresponded to a rare card in the second display.)

Table 16

Number of the Displayed 40 Cards in the Low and High Probability Conditions of Experiment 1

[Table body not recoverable from the source. Surviving header fragments: First display; Second display; Raven.]

Note: For Experiment 1a and Experiment 1b identical displays were used.

Table 17

Position of the Rare Cards in the Displays of the P and Non-P and of the Q and Non-Q Card Sides (‘0.90 → 0.925’)

Display                    Coordinates of the rare cards
P and non-P card sides     1,7; 2,4; 3,9; 4,3
Q and non-Q card sides     1,8; 2,3; 4,5

Note: The positions of cards in the 4 rows × 10 columns matrix are shown. For instance, ‘1,7’ represents a position in the first row and in the seventh column of the matrix.
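The layout described in Table 17 can be sketched as a small data structure. The sketch assumes that the first group of four coordinates belongs to the p and non-p display and the second group of three to the q and non-q display, following the order in the table caption. The cell symbols ‘x’ (rare card) and ‘.’ (frequent card) are hypothetical stand-ins for the pictograms of the original booklet.

```python
ROWS, COLS = 4, 10

# (row, column) coordinates of the rare cards, 1-indexed as in Table 17.
# Which group belongs to which display is an assumption based on caption order.
RARE_FIRST_DISPLAY = [(1, 7), (2, 4), (3, 9), (4, 3)]   # p / non-p card sides
RARE_SECOND_DISPLAY = [(1, 8), (2, 3), (4, 5)]          # q / non-q card sides


def build_display(rare_positions, rare_symbol="x", common_symbol="."):
    """Return a 4 x 10 grid with the rare cards placed at the given positions."""
    grid = [[common_symbol] * COLS for _ in range(ROWS)]
    for row, col in rare_positions:
        grid[row - 1][col - 1] = rare_symbol
    return grid


def render(grid):
    """Format a grid as text, one row of cards per line."""
    return "\n".join(" ".join(row) for row in grid)


first_display = build_display(RARE_FIRST_DISPLAY)
second_display = build_display(RARE_SECOND_DISPLAY)

# Positions that hold a rare card in both displays (should be none).
overlap = set(RARE_FIRST_DISPLAY) & set(RARE_SECOND_DISPLAY)

# Share of frequent cards in each display.
share_first = 1 - len(RARE_FIRST_DISPLAY) / (ROWS * COLS)
share_second = 1 - len(RARE_SECOND_DISPLAY) / (ROWS * COLS)

print(render(first_display))
print()
print(render(second_display))
```

With four and three rare cards out of 40, the frequent card sides make up 36/40 and 37/40 of the two displays, which matches the label ‘0.90 → 0.925’ in the table caption. No position holds a rare card in both displays.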

In Experiment 1a, which was concerned with p versus non-p selections, the p and non-p display was described as the one laid out later, and it was the display from which participants were allowed to make their selection:

“The keeper lays out the 40 cards for you. On the 40 cards placed in front of you, only the side is visible which lists the species of the bird.

[Display of the p and non-p card sides.]

You also know something about the reversed sides of the cards, showing the colour of the feathering. Earlier on, the keeper displayed the same cards showing the reversed side of the cards, concerned with the colour of the birds.

[Display of the q and non-q card sides.]

Between the two displays of the cards, the keeper has collected up the cards. In this process the cards get shuffled completely. Hence, the order of the two displays need not correspond to each other. Unfortunately, the keeper is now not prepared to turn over many more cards. He only allows you to turn over one single card separately. – Please encircle the card that you would turn over in order to test (with only this sample) the truth or falsity of the controversial sentence. You are only allowed to turn over one of the cards which are currently displayed (above: ‘ ’ or ‘R’)!”

The test of q versus non-q selections (Experiment 1b) necessitated differences in the context story, since the cards from which one was allowed to make selections needed to be in the temporally final display. At the same time, the graphical presentation order was kept constant.

“The keeper lays out the same 40 cards twice for you. In the first display of the cards he puts the card with the name of the species facing upwards.

[Display of the p and non-p card sides.]

You also aim to see the backside of these cards but the keeper – before you could stop him – grabs all 40 cards, and lays them out anew, now with the sides with the feather colour facing upwards. The cards get totally mixed. In this second display, the card sides can be distinguished by the colour of the feathering of each bird. Because the cards have been mixed, their order no longer corresponds to the first display.

[Display of the q and non-q card sides.]

Unfortunately, the keeper is now not prepared to turn over many more cards. He only allows you to turn over one single card separately. – Please encircle the card that you would turn over in order to test (with this sample) the truth or falsity of the controversial sentence. You are only allowed to turn over one of the cards which are currently displayed (above: ‘z’ or ‘{’)!”

Results of Experiment 1

Main results. The numbers and percentages of card selections in Experiment 1a and 1b are reported in Table 18. The results are presented both for the antecedent choice conditions (Exp. 1a: p versus non-p) and the consequent choice conditions (Exp. 1b: q versus non-q), contrasting the selections in the low and high probability conditions.

Descriptively the results support the predicted increase of non-p and non-q selections in the high probability conditions. The changed proportions of card selections are also illustrated by bar graphs in Figure 8.

Table 18

Percentage and Number of Card Selections in the Low and High Probability Conditions of the P versus Non-P, and the Q versus Non-Q Forced Choice Conditions

[Table body not recoverable from the source. Surviving column headers: Exp. 1a: p versus non-p; Exp. 1b: q versus non-q.]

Note. Selections which are predicted to increase are darkened.

Figure 8a, b. Bar graphs of the percentage of (a) p or non-p selections and (b) of q or non-q selections in each corresponding low and high probability condition.

The predicted changes in the proportions of card selections are statistically highly significant. The increase of non-p card selections relative to the p card selections in the high probability condition was reliable (Fisher’s exact test: df = 1, n = 29, p < 0.01; rφ = .64). The increase of non-q versus q selections was also highly significant (Pearson test: n = 30, χ2(1) = 7.03, p < 0.01; rφ = .48).
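The reported effect size for Experiment 1b can be cross-checked from the reported χ2 value, since for a 2 × 2 table rφ = sqrt(χ2 / n). The following sketch does this and also shows how a two-sided Fisher exact test on a 2 × 2 table can be computed. Because the cell counts of Table 18 are not available here, the table passed to the Fisher test is purely hypothetical and does not represent the data of Experiment 1; all function names are likewise illustrative.

```python
from math import comb, sqrt


def phi_from_chi2(chi2, n):
    """Phi coefficient of a 2 x 2 table from its Pearson chi-square and sample size."""
    return sqrt(chi2 / n)


def phi_from_table(a, b, c, d):
    """Signed phi coefficient computed directly from the cells [[a, b], [c, d]]."""
    return (a * d - b * c) / sqrt((a + b) * (c + d) * (a + c) * (b + d))


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2 x 2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of all tables with the same
    margins that are at most as probable as the observed table.
    """
    row1, row2, col1 = a + b, c + d, a + c
    n = row1 + row2

    def prob(x):
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    observed = prob(a)
    lowest, highest = max(0, col1 - row2), min(row1, col1)
    return sum(
        prob(x)
        for x in range(lowest, highest + 1)
        if prob(x) <= observed * (1 + 1e-9)  # tolerance for float rounding
    )


# Cross-check of the reported effect size for Experiment 1b.
phi_1b = phi_from_chi2(7.03, 30)
print(round(phi_1b, 2))

# Hypothetical 2 x 2 table, NOT the data of Experiment 1.
p_value = fisher_exact_two_sided(3, 1, 1, 3)
print(round(p_value, 4))
```

The value of rφ recomputed from χ2(1) = 7.03 and n = 30 rounds to .48, in agreement with the reported value. The two sample sizes, n = 29 and n = 30, add up to the 59 participants remaining after the five exclusions.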

Additional questions. In the questionnaire, the majority of participants stated they had no knowledge of probability theory (73 %) and no knowledge of formal logic (82 %). Over all conditions, self-reported knowledge of probability theory and actual answers which conformed to the Bayesian model were not positively correlated (rφ = -.14). However, there was also no correlation between knowledge of formal logic and answers which accorded with the logical-falsificationist norm (rφ = -.03). In any case, when asked for their own strategy, only a few participants mentioned logic (5 %) or probability theory (6 %), and most participants answered that they proceeded according to their own reasoning or intuition (75 %).

Discussion Experiment 1 – Support for a Bayesian Solution of the Raven Paradox

The results of Experiment 1 are discussed here mainly with regard to the dispute concerning the raven paradox and only briefly with regard to the WST debate. The WST debate will be considered in detail in Experiment 2, and alternative theories of the WST will be examined at the end of this chapter in the general discussion (Section 5.7).

Experiments 1a and 1b support the standard Bayesian resolution of the raven paradox (Section 3.2). In the WST context, it has been shown here for the first time that, in testing the hypothesis ‘if a bird is a raven then it is black’, changed probabilities P(raven) and P(black) can increase the portion of non-raven selections. Moreover, the predicted q versus non-q effects were also found.

It is consistent with the advocated knowledge-based account that these positive results were achieved in a context in which the marginal probabilities were fixed by the task instruction. In the only earlier WST also using raven material (v. Sydow, 2002, Exp. 1), these preconditions, as outlined before, were not firmly established. The contrast between the positive results obtained here on the one hand, and the negative result with the previous raven WST (Sydow, 2002; cf. pp. 94 f.) or the generally negative results for the model with fixed marginals (pp. 83 f.) on the other hand, suggests that the improvement may be due to the use of a task in which all preconditions for the model were actively induced. This at least indirectly supports the advocated knowledge-based approach.

In contrast, the results of Experiment 1 cannot be explained by a falsificationist psychological account of the raven paradox (Popper, 1934/2002, 1972; Watkins, 1957, 1958). Strict falsificationism demands p and non-q selections, invariably under all conditions.

However, even a ‘probabilistic falsificationism’, which allows for frequency effects regarding the falsifying p versus non-q selections (cf. Humberstone, 1994; Kirby, 1994b), cannot explain the p versus non-p and q versus non-q frequency effects found here.

Likewise, naïve tabula rasa empiricism (cf. pp. 32 f.), without Bayesian modification, cannot account for the findings. An empiricism that denounces any use of prior knowledge and only advocates enumerative induction (cf. Nicod’s criterion, cf. pp. 40 f.) would invariably predict p and q selections. Likewise, an account which holds that negative instances are not natural kinds and, hence, assumes that negative instances are generally irrelevant for testing a positively formulated rule (see Quine, 1969; Oaksford and Chater, 1994, 1998b) cannot explain the increased number of non-p and non-q selections in the high probability condition.

With regard to the WST debate, the results provide us with one of the first clear corroborations of the Bayesian model with fixed marginals (von Sydow, 2002; cf. p. 78). Most previous results were negative or ambivalent (pp. 70 f., 83 f.). This contrast, and also the contrast to the older WST with raven material by v. Sydow (2002), where the marginal probabilities were not fixed, cannot be explained by Oaksford and Chater’s universal approach (1994, 1998, 2003), but only by a knowledge-based account.

5.6 Experiment 2a, b – Letter-Number MST