An Introduction to Corpora in English Language Teaching
Your Instructors: Michael McCarthy, Anne O'Keeffe, and Steve Walsh
Lesson 01
Chapter 1





Michael McCarthy, Anne O'Keeffe, & Steve Walsh talk about the ELT Advantage course An Introduction to Corpora in English Language Teaching (videoscript)

"Hello, my name is Michael McCarthy, my name is Anne O'Keeffe, and my name is Steve Walsh. Welcome to the ELT Advantage course An Introduction to Corpora in English Language Teaching.

In this course, we'll guide you through key issues relating to the use of language corpora in teaching. We'll go into detail about what a corpus is, how corpora are used, and how you can build and use a corpus to enhance your teaching and the experience of your learners. The course will also include extracts of real language to give you a flavor of what you can get from consulting a corpus.

We look forward to working with you."



Anne O'Keeffe introduces corpora in language teaching (videoscript)

"This lesson will introduce you to what a corpus is and how corpora have developed over the years. We'll look at spoken versus written corpora, and at specialized corpora of academic and business English. We'll also talk about some of the basic functions of corpus software."


Definition of a corpus
A corpus is a collection of texts that is stored electronically on a computer or other form of electronic storage. These texts can be from written sources such as books, magazines, junk mail, letters, advertisements, business documents, literature, academic papers, emails and Internet pages.

Corpora can also include spoken language "texts." These involve recordings of real talk that have been transcribed word-for-word. Types of spoken language that we can find in a corpus include everyday conversations, phone calls, university classes, television and radio programs, voice mails, speeches, and parliamentary debates.

For example, here are the categories under which the International Corpus of English (ICE) project builds its corpora. This project involves collecting one million words for each of many varieties of English from around the world. All of the collections use this as their design plan:

Information about language at our fingertips
A corpus, therefore, is a collection of real language that people use in all types of situations. Because this language is stored on a computer, it can be searched quickly and easily using special software. This makes it very suitable for research into how language is really used.

Using a corpus, we can see what the most commonly used word in a language is. We can see which prepositions follow certain verbs; we can find out if a certain word has gone out of use. For example, the phrase flower power was very popular in the 1960s. If we have an up-to-date collection of language, we can find out whether this is still the case.
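
To make these ideas concrete, here is a minimal sketch in Python of two of the questions above: which word is most common, and which word follows a given verb. The two sample sentences are invented for illustration and are not taken from any real corpus, and the function names are our own. Real corpus software does far more, so treat this only as a rough picture of the principle.

```python
import re
from collections import Counter

# Invented sample sentences standing in for a real corpus (an assumption).
SAMPLE_TEXTS = [
    "We depend on the bus, and the bus depends on the weather.",
    "You can depend on the team to listen to the plan.",
]

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def most_common_words(texts, n=3):
    """Return the n most frequent words across all texts."""
    counts = Counter()
    for text in texts:
        counts.update(tokenize(text))
    return counts.most_common(n)

def words_following(texts, verb_forms):
    """Count the word that comes immediately after any of the given verb forms."""
    followers = Counter()
    for text in texts:
        tokens = tokenize(text)
        # Pair each token with the token that follows it.
        for current, following in zip(tokens, tokens[1:]):
            if current in verb_forms:
                followers[following] += 1
    return followers

if __name__ == "__main__":
    print(most_common_words(SAMPLE_TEXTS))
    print(words_following(SAMPLE_TEXTS, {"depend", "depends"}))
```

With a dated collection of texts, the same kind of count could be run on each decade separately to check whether a phrase such as flower power is still in use.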

In many ways, using a corpus is not unlike using an Internet search engine. A corpus is essentially a large database in which you can find every occurrence of a word or phrase. Just as with an Internet search, the "hits" pop up on your screen in a matter of seconds.

For instance, if we ask a corpus to find examples of the word exactly, within seconds, the search software will give us a list that looks like this:

These are called concordance lines and later in this lesson (and throughout this course), we will talk more about their applications for language teaching.
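
For readers curious about how such lines can be produced, here is a minimal sketch in Python. It is not the method used by any particular corpus program. The function name, the width setting, and the sample text are all assumptions made for illustration. The idea is simply to find each occurrence of the search word and show a fixed amount of text on either side, so that the search word lines up in a column.

```python
import re

# Invented sample text standing in for a real corpus (an assumption).
SAMPLE_TEXT = (
    "That is exactly what I meant. I don't know exactly when she left. "
    "Exactly! That's the point."
)

def concordance(text, keyword, width=30):
    """Return one line per occurrence of keyword, with context on both sides."""
    # Match the keyword as a whole word, ignoring upper and lower case.
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    # Collapse line breaks and repeated spaces so each hit fits on one row.
    flat = " ".join(text.split())
    lines = []
    for match in pattern.finditer(flat):
        left = flat[max(0, match.start() - width):match.start()]
        right = flat[match.end():match.end() + width]
        # Right-align the left context so every keyword starts in the same column.
        lines.append(f"{left:>{width}}{match.group()}{right}")
    return lines

if __name__ == "__main__":
    for line in concordance(SAMPLE_TEXT, "exactly", width=20):
        print(line)
```

Because the left-hand context is padded to a fixed width, the search word always begins at the same position, which is what makes patterns to its left and right easy to spot.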

How important is the design of the corpus?
Apart from giving us concordance lines, corpus software can also tell us how many times a word occurs in the whole corpus. However, if this information is to be reliable and useful to us as teachers, materials designers, or researchers, it has to come from a well-designed corpus.

The key word in designing a corpus is representativeness. Here at the beginning of the course, we want to highlight its importance. A corpus is only as good as its design. For example, if we were to design a corpus of classroom interactions, we would need to consider how to make this representative. Here are some of the things we'd want to consider:

We'll talk more about designing your own corpus in a future lesson in the course.

Corpus size
Very often terms like "large," "vast," "specialized," and "small" are used to describe corpora. These terms are relative and subject to change as the biggest corpora become bigger. At the moment, the largest corpora are running at around one billion words!

These mega-corpora are usually held by English-language publishers who make them available to their authors for materials design. Large corpora are also used in making dictionaries. Here are some examples of large publisher corpora:

  • Collins, under the COBUILD project, has a corpus of 450 million words of contemporary written and spoken English. Of these, 56 million words are available online as part of Wordbanks Online English.
    Source: http://www.collins.co.uk/Corpus/CorpusSearch.aspx
  • Cambridge University Press has a corpus of one billion words which it makes available to its authors.
    Source: http://www.cambridge.org/elt/corpus/cic.htm
  • Longman has a written corpus of 100 million words, five million words of American spoken English and 10 million words of student writing.
    Source: http://www.longman.com/dictionaries/corpus/


Another large corpus development is WebCorp (which you can find at http://www.webcorp.org.uk/). This is available online, free of charge. It treats the whole of the English-language World Wide Web as its corpus.

As a rule of thumb, a "large" corpus contains more than five million words and a "small" corpus usually contains fewer than five million words.
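
If you are building your own collection, the rule of thumb can be checked with a short sketch like the one below. The function names are hypothetical. Counting words by splitting on spaces is only a rough approximation, and treating a corpus of exactly five million words as "small" is our own assumption, since the rule of thumb does not cover that case.

```python
# Threshold taken from the rule of thumb: five million words.
LARGE_THRESHOLD = 5_000_000

def count_words(texts):
    """Roughly count the words in a list of texts by splitting on whitespace."""
    return sum(len(text.split()) for text in texts)

def size_label(word_count, threshold=LARGE_THRESHOLD):
    """Label a corpus "large" if it has more words than the threshold.

    Assumption: a corpus of exactly the threshold size is labeled "small".
    """
    return "large" if word_count > threshold else "small"

if __name__ == "__main__":
    # Invented two-text "corpus" used only to show the functions working.
    texts = ["This is a very small corpus.", "It has only two texts."]
    total = count_words(texts)
    print(total, size_label(total))
```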

Course content © 1997-2007 by Thomson ELT. All rights reserved. Reproduction or redistribution of
any course material without prior written permission is prohibited.