Corpus linguistics and digital text analysis

4.3 Corpus linguistics: Introduction

4.3.1 Orientation

The reader will know by now that corpus linguistics (henceforth ‘CL’) is the software-based, quantitative investigation of a collection of electronic texts; such a collection is referred to as a “corpus” – a body of texts which is usually compiled in a principled manner.4 There has never been a time when so much English language data has been readily available for investigation.

The World Wide Web contains billions of words of English usage and is increasingly being trawled for corpus construction. Advances in computational memory and search software mean that big corpora consisting of billions of words, derived from the web and elsewhere, can readily be stored and swiftly explored. With these technological developments, linguists in the twenty-first century are in an exciting position to investigate English use on a massive scale. It is no exaggeration to claim that the use of corpora has revolutionised English language description.

The investigation of large amounts of language data in electronic form brings significant advantages. First, linguists are able to discover things about language use which may otherwise remain invisible. As one of the chief architects of corpus linguistics says:

the language looks rather different when you look at a lot of it at once.

(Sinclair, 1991: 100)

Second, investigation of a corpus provides a quantitative, and thus robust, basis for confirming or falsifying intuitions about language use. This means that linguists no longer have to speculate about how people generally use a language, something which is obviously prone to error. Third, the labour, time-drain and tedium of manual analysis of large quantities of language use data have been substantially reduced.

4.3.2 John Sinclair

Many of the ideas from corpus linguistics that I flag in this chapter emanate from the research of John Sinclair (1933–2007). Here is another corpus linguist, Michael Stubbs, on Sinclair’s achievement:

Sinclair is one of the very few linguists who has discovered many things which people had simply not noticed, despite thousands of years of textual study – because they are observable only with the help of computer techniques which he helped to invent.

(Stubbs, 2009: 116)

Sinclair prioritised methods for the analysis of digitally stored, naturally occurring language data rather than a theory (Hunston and Francis, 2000: 14–15). By definition, CL deals with observable data and is thus within the philosophical tradition of empiricism. The salient word for Sinclair is ‘evidence’. Sinclair was an uncompromising empiricist and his understanding of language use is based on countless observations of it at scale. With corpora in the millions and increasingly in the billions of words, we have access to evidence that is beyond the dreams of linguists living before the latter part of the twentieth century.

The first electronic corpus (Brown Corpus) was compiled in 1964 at Brown University by Nelson Francis and Henry Kučera. It contained a million words of American English from documents which had been published in 1961. In the UK at around the same time, Sinclair produced the first electronically searchable spoken corpus at the University of Edinburgh (1963–1965). It contained 166,000 words of informal conversation in English. In 1970, he co-wrote the first report on research into corpora and many of the seeds for later ideas were contained in this report (Sinclair et al., 2004). Then he took a step back from corpus research because of hardware and software limitations. In 1980, when the technology had developed sufficiently to enable extensive study of corpora, Sinclair organised a contract with the publishers Harper Collins for the production of a new kind of dictionary – one based on large corpora of written and spoken language. The corpus is known as COBUILD (Collins Birmingham University International Language Database). The groundbreaking Collins Cobuild dictionary was published in 1987.

In a short time, the COBUILD dictionary’s visionary use of digitised corpora transformed lexicography. Today, most authoritative dictionaries are grounded in large electronic corpora. The Oxford English Dictionary, for example, is now based on a very large electronic corpus – the Oxford English Corpus (OEC). At the time of writing, it consisted of around 2.5 billion words from texts across a wide range of genres such as news, magazine articles and message board postings in UK, US, Australian and other national varieties of English.5 The OEC is predominantly a web-based corpus – that is, a corpus derived from language on the web. Given its range and balance of genres, as well as its size, it is regarded as one of the most authoritative bases for judgements about contemporary language use. Corpus linguistics, then, has revolutionised lexicography. This is ‘sexier’ than it first appears. Every literate person uses a dictionary.

4.3.3 Big is beautiful

A fundamental principle of corpus linguistics is that we should not rely on our intuition of language use as to what is frequent and what is not. We may be able to work out from intuition alone that the grammatical word ‘the’, say, is usually very frequent in most texts. But this guessing game becomes harder, and error-prone, when we start to reflect on what might be the tenth, eleventh, twelfth, etc., most common lexical word in standard US English usage or its most common five-word expressions. We do not memorise information in this way. Even if we are proficient speakers with decades of using a language, we cannot readily access this information in our mind. Because it is committed to looking at language at scale, the great power of corpus linguistics – just like any branch of the digital humanities – is that it can render the invisible visible. Language use is always under our nose, but until corpus linguistics we did not know what was under a lot of noses.
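The kind of counting that intuition cannot do is simple to express in code. The following is a minimal sketch in Python, not the method of any particular corpus tool: the function names, the crude tokeniser and the tiny sample text are all assumptions made for illustration, and a real corpus would run to millions or billions of words.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens (a crude tokeniser)."""
    return re.findall(r"[a-z']+", text.lower())


def ngram_counts(tokens, n):
    """Count every run of n consecutive tokens.

    Simplification: runs are allowed to cross sentence boundaries.
    """
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


# Illustrative sample only; it stands in for a real corpus.
sample = (
    "The cat sat on the mat. The dog sat on the rug. "
    "The cat saw the dog on the mat."
)

tokens = tokenize(sample)
word_freq = Counter(tokens)            # frequency of each single word
trigram_freq = ngram_counts(tokens, 3)  # frequency of each three-word run

print(word_freq.most_common(5))
print(trigram_freq.most_common(5))
```

Changing the second argument of `ngram_counts` to 5 would count five-word expressions of the kind mentioned above; on a sample this small almost every such run occurs only once, which is itself a reminder of why corpus size matters.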

A good example of this is provided by Michael Stubbs. Stubbs (2007) discovered that ‘world’ is one of the top ten nouns in the British National Corpus, a corpus of 100 million words. He found through concordance searches that one reason it is so common is that it occurs in frequent expressions such as ‘the most natural thing in the world’ and ‘one of the world’s most gifted scientists’. Expressions such as these, in which superlatives are used or rankings are employed, are very frequent in English, but it is difficult without large corpora to intuit this so clearly. Once the evidence is presented, it is common for the cynic in some of us to say ‘well, it’s obvious that use of “world” is so frequent’. With hindsight, corpus linguistic findings may seem self-evident to proficient speakers. All the same, we are kidding ourselves if we think we would have been able to intuit, with complete confidence, quantitatively based phrasal facts about language without the use of corpora.
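A concordance search of the kind Stubbs used lists every occurrence of a search word together with the text on either side of it (often called keyword-in-context, or KWIC). The following Python sketch shows the basic idea; it is an illustration under stated assumptions, not the software Stubbs used. The function name `concordance`, its `width` parameter and the sample sentences are all hypothetical.

```python
import re


def concordance(text, node, width=30):
    """Return keyword-in-context lines for every whole-word match of `node`.

    `width` is the number of characters of context kept on each side.
    """
    pattern = re.compile(r"\b" + re.escape(node) + r"\b", re.IGNORECASE)
    lines = []
    for match in pattern.finditer(text):
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        # Pad both sides so that the search word lines up in a column.
        lines.append(f"{left:>{width}} [{match.group(0)}] {right:<{width}}")
    return lines


# Illustrative sample only; the first two sentences echo the expressions above.
sample = (
    "It was the most natural thing in the world. "
    "She is one of the world's most gifted scientists. "
    "The worldly traveller disagreed."
)

kwic = concordance(sample, "world")
for line in kwic:
    print(line)
```

Because the pattern matches whole words only, ‘world’s’ is found but ‘worldly’ is not. Reading down the aligned column is what lets recurrent neighbours, such as superlatives, stand out.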

Having sketched corpus linguistics, I move on to flagging some key concepts and insights that emerge from looking at ‘a lot of language data at once’. I draw on the 1.5-billion-word UKWaC corpus to illustrate some key corpus linguistic insights. UKWaC is accessed via the software Sketch Engine.6

4.4 Key concepts and insights from corpus