
5 Data and methodology

5.5 Data sources

of LIKE is modified and LIKE becomes associated, for instance, with a reference group distinct from its reference group in the donor variety, then this would cause a difference between the social profiles of the donor and the recipient variety. Any difference between the use of LIKE in, for instance, AmE and PhiE or JamE would therefore indicate local modification and support the view that innovations are not adopted wholesale but that they are shaped during implementation to match local norms and needs. If the positioning of LIKE in the recipient variety is very similar to the positioning in the donor variety but the social profile is distinctly different, this would indicate that the social meaning of LIKE is (cognitively) more salient than its linguistic constraints and its language‐internal functionality. If, however, the social profile of LIKE in the recipient variety is very similar to the social profile in the donor variety but the positioning is distinctly different, this would indicate that the language‐internal functionality of LIKE is (cognitively) more salient than its social function. In any case, the degree of variability between varieties in terms of both positioning and social meaning will further our understanding of processes that accompany the global diffusion of pragmatic innovations.

surveys of this marker nor variety‐specific constraints on LIKE use have been established. As systematic accounts of the variation of LIKE usage across varieties of English are few, the present study dedicates itself to filling this gap. In particular, this study addresses basic questions concerning variety‐specific usage patterns of the discourse marker LIKE. This is relevant, as few sociolinguistic studies have focused on the transmission and diffusion of features across wider geographical areas. Buchstaller and D’Arcy (2009:298), for instance, emphasize in their cross‐varietal investigation of be like, which is very similar to the present study in its theoretical outlook and methodology, that…

the direct comparability […] of previous analyses of verbs of quotation in general and of be like in particular remains questionable since they tend to be based on dissimilar methodological premises and applications in terms of the definition of the variable and constraints, the form selected as part of the envelope of variation, quantitative methods, and statistical models.

To overcome the limitations of these previous studies, the present investigation uses a computationally edited version of the ICE and employs coherent methodology to guarantee cross‐varietal comparability of the data.

Hence, the present quantitative analysis of LIKE across varieties of English draws its data from the International Corpus of English (ICE). This family of corpora represents distinct regional varieties of English and, therefore, forms an appropriate starting point for a cross‐varietal study.

The issue of data selection is not trivial, as Buchstaller and D’Arcy emphasize: “[w]hat is needed, therefore, are reliable and comparable methods applied rigorously and uniformly across datasets to uncover which constraints hold both across and within varieties of English worldwide” (2009:298).

5.5.2 The ICE family of corpora

The ICE family of corpora meets the criteria mentioned above and thus allows a balanced and extensive overview of LIKE use across regional varieties of English. Seven of the available components of the ICE are considered: ICE Canada (Canadian English); ICE GB (British English; R1); ICE Ireland (Hiberno‐English); ICE India (Indian English); ICE Jamaica (Jamaican English); ICE Philippines (Philippine English); and ICE New Zealand (New Zealand English)23. In addition, the Santa Barbara Corpus of Spoken AmE was added to the analysis for three reasons. Firstly, the Santa Barbara Corpus contains AmE data and, hence, broadens the range of regional varieties. Secondly, the Santa Barbara Corpus matches the other ICE components, as it “forms part of the International Corpus of English (ICE). The Santa Barbara Corpus represents the main data of the American component of ICE”24 (http://www.linguistics.ucsb.edu/research/sbcorpus.html; April 4th 2010). Thirdly, the data included in the Santa Barbara Corpus consist mainly of private dialogue, i.e. face‐to‐face conversation, and thus contain the data particularly relevant for the present purpose.

      

23 ICE Hong Kong and ICE Singapore could not be analysed, as the respective ICE teams did not grant access to the speaker information, which is crucial for the analysis. ICE East Africa, on the other hand, was not included as it deviates significantly from the other components in terms of both size (the spoken part of ICE East Africa consists of 714,916 words, compared to an average of 651,822 words for the other ICE corpora) and annotation.

24 The comparability of the Santa Barbara Corpus also stems from the fact that “[i]n order to meet the specific design specifications of the International Corpus of English (allowing comparison between American and other national varieties of English), the Santa Barbara Corpus data have been supplemented by additional materials in certain genres (e.g. read speech), filling out the American component of ICE” (http://www.linguistics.ucsb.edu/research/sbcorpus.html; April 4th 2010).

The most prominent feature rendering the ICE family of corpora relevant for this investigation is that they were designed for comparability25 and, hence, enable contrastive analyses of geographic varieties.26

All regional components of the ICE contain 500 files, 300 of which represent spoken language of various text types and 200 of which represent written language of various text types. The transcribed spoken files are classified according to the type of spoken discourse (private or public dialogue, scripted or unscripted monologue).27 Therefore, the ICE corpora allow for a detailed analysis of the occurrence of LIKE in specific communicative situations. Beyond enabling large‐scale, quantitative analyses, the ICE family of corpora also enables specialized and fine‐grained exploration, as the linguistic context is available and data are not confined to specific communicative situations. Since discourse markers are most prevalent in informal spoken text types, the study uses data exclusively from files with the header S1A, which indicates that the data represent private dialogues, more specifically either face‐to‐face conversations or transcripts of telephone calls.

      

25 The ICE family of corpora was compiled during the 1990s or later, and the authors and speakers of the texts are aged 18 or above; they were either born in the country in whose corpus they are included or moved there at an early age, and they received their education through the medium of English in the country concerned (http://ice-corpora.net/ice/design.htm, 29.4.2010).

26 “To ensure compatibility across the individual corpora in ICE, each team is following a common corpus design, as well as common schemes for textual and grammatical annotation. Each component corpus contains 500 texts of approximately 2,000 words each ‐ a total of approximately one million words. Some of the texts are composite, made up of two or more samples of the same type” (http://ice-corpora.net/ice/design.htm, 29.4.2010).

27 Each regional ICE component contains 120 monologues, of which 70 are unscripted (20 spontaneous commentaries, 30 unscripted speeches, 10 demonstrations and 10 legal presentations) and 50 are scripted (20 broadcast news, 20 broadcast talks and 10 non‐broadcast talks).

Table 3: Common design of the ICE components

Mode          Conversation type  Register                     Text type                   Number of text files
SPOKEN (300)  Dialogues (180)    Private (100), header=S1A    Face‐to‐face conversations  90
                                                              Phonecalls                  10
                                 Public (80), header=S1B      Classroom Lessons           20
                                                              Broadcast Discussions       20
                                                              Broadcast Interviews        10
                                                              Parliamentary Debates       10
                                                              Legal cross‐examinations    10
                                                              Business Transactions       10
              Monologues (120)   Unscripted (70), header=S2A  Spontaneous commentaries    20
                                                              Unscripted Speeches         30
                                                              Demonstrations              10
                                                              Legal Presentations         10
                                 Scripted (50), header=S2B    Broadcast News              20
                                                              Broadcast Talks             20
                                                              Non‐broadcast Talks         10
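The counts in Table 3 can be checked for internal consistency with a short sketch. The dictionary layout and the names `ICE_SPOKEN_DESIGN` and `files_per_header` are illustrative assumptions and not part of the ICE distribution; only the numbers are taken from the table.

```python
# Text-file counts per text type, copied from Table 3.
# The nesting (header code -> text type -> count) is an illustrative choice.
ICE_SPOKEN_DESIGN = {
    "S1A": {"Face-to-face conversations": 90, "Phonecalls": 10},
    "S1B": {
        "Classroom Lessons": 20,
        "Broadcast Discussions": 20,
        "Broadcast Interviews": 10,
        "Parliamentary Debates": 10,
        "Legal cross-examinations": 10,
        "Business Transactions": 10,
    },
    "S2A": {
        "Spontaneous commentaries": 20,
        "Unscripted Speeches": 30,
        "Demonstrations": 10,
        "Legal Presentations": 10,
    },
    "S2B": {
        "Broadcast News": 20,
        "Broadcast Talks": 20,
        "Non-broadcast Talks": 10,
    },
}


def files_per_header(design):
    """Sum the number of text files listed under each header code."""
    return {header: sum(types.values()) for header, types in design.items()}


totals = files_per_header(ICE_SPOKEN_DESIGN)
dialogues = totals["S1A"] + totals["S1B"]   # private + public dialogues
monologues = totals["S2A"] + totals["S2B"]  # unscripted + scripted monologues
spoken = dialogues + monologues

print(totals, dialogues, monologues, spoken)
```

The per-header sums should reproduce the bracketed subtotals in the table (100, 80, 70 and 50 files), which in turn add up to 180 dialogues, 120 monologues and 300 spoken files.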


Adding to their advantageous design, the ICE corpora offer extensive information about the speakers, ranging from the age and gender of speakers to their level of education, occupation, first and second languages, and ethnicity. Based on this additional speaker information, the ICE family offers the opportunity for fine‐grained sociolinguistic analyses.

Despite their high degree of comparability, their wide range of registers and the detailed speaker information contained in them, the ICE corpora have some shortcomings that need to be addressed in the present context.

With the exception of the Santa Barbara Corpus, none of the ICE components contains detailed information about the phonological features of the original spoken data, nor do they offer the original audio files as complementary sources for data coding. This additional material would have been enormously helpful during the coding process: for example, coding of intonation would have permitted a more precise analysis of LIKE usage with regard to its position relative to the embedding intonation unit.

Furthermore, the ICE components are not representatively balanced with respect to the age, gender, and educational background of the speakers:

The corpus contains samples of speech and writing by both males and females, and it includes a wide range of age groups. The proportions, however, are not representative of the proportions in the population as a whole: women are not equally represented in professions such as politics and law, and so do not produce equal amounts of discourse in these fields. Similarly, various age groups are not equally represented among students or academic authors. (http://ice-corpora.net/ice/design.htm, 29.4.2010)

This imbalance turned out to be a major obstacle for this study and ultimately led to the decision to focus exclusively on data containing the most informal spoken dialogue, i.e. private face‐to‐face conversations and private phone calls (files headed S1A) while disregarding more formal spoken data (files headed S1B, S2A and S2B).
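The restriction to private dialogues can be illustrated with a minimal sketch. It assumes that file names begin with the ICE text-category header followed by a three-digit running number (e.g. "S1A-001.txt"); the naming scheme, the example file names and the function `select_private_dialogues` are hypothetical and may differ between ICE components.

```python
import re

# Assumption: file names start with the header code and a three-digit
# running number, e.g. "S1A-001.txt". Real components may use other schemes.
S1A_PATTERN = re.compile(r"^S1A-\d{3}", re.IGNORECASE)


def select_private_dialogues(file_names):
    """Keep only files headed S1A (private face-to-face and phone dialogues)."""
    return sorted(name for name in file_names if S1A_PATTERN.match(name))


# Hypothetical file names, for illustration only.
example = ["S1A-001.txt", "S1B-021.txt", "S2A-005.txt", "s1a-090.txt", "W1A-001.txt"]
selected = select_private_dialogues(example)
print(selected)
```

Matching on the header prefix rather than on a list of file numbers keeps the selection independent of how many S1A files a given component actually provides.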

Another disadvantage of the ICE family of corpora relates to its aim to represent the “national or regional variety of English”, which is to say that the respective ICE components represent the national or regional standard varieties rather than a random sample of the entire scope of language use in the respective region. Since the national standard is predominantly spoken by the educated elite of a country, the ICE corpora do not reflect the language use of the English‐speaking population at large but are substantially biased towards educated speakers, i.e. they reflect the language of the elite rather than that of the speech community. Although this is a drawback, the sample of speakers included in the respective ICE components is varied enough to allow testing for social stratification, because the data are biased towards speakers with higher education but not limited to speakers of such a profile.