[ COMPARE COCA TO THE ANC, BNC, BOE, OEC ]

The Oxford English Corpus (OEC) is the largest structured corpus of any language. Unfortunately, the corpus is generally only available to researchers who are working on projects for Oxford University Press. Nevertheless, it may still be useful to make a brief comparison of COCA and the OEC.


Corpus size and historical coverage

The Oxford English Corpus is about 1.9 billion words, or about 4-5 times the size of COCA. This means that for many very low-frequency phenomena it provides much more data than COCA, in the same way that COCA provides much more data than a corpus that is 4-5 times smaller than itself, such as the 100 million word British National Corpus.

In terms of historical coverage, however, COCA is more developed than the OEC. COCA has texts from 1990-2009 (20 million words each year), while the OEC has texts from just about one-third that number of years -- 2000-2006. No texts have been added to the OEC since 2006, whereas 20 million words of text continue to be added to COCA each year, bringing it right up to the present time.

The following is a summary of the number of words in each corpus in each year from 1990 to 2009:
 

Year    COCA           OEC
1990    20,459,999
1991    20,565,501
1992    20,636,288
1993    20,834,905
1994    20,896,229
1995    20,687,538
1996    20,344,253
1997    20,384,808
1998    20,641,624
1999    20,907,933
2000    20,703,668     133,340,604
2001    20,030,139     182,815,084
2002    20,377,984     280,691,405
2003    20,755,087     370,800,134
2004    20,702,033     518,717,955
2005    20,702,447     377,953,350
2006    20,795,229     25,099,165
2007    20,417,804
2008    20,376,279
2009    11,102,803
TOTAL   402,322,551    1,889,417,697


Genres and informal texts

The OEC has texts from a wide range of genres and text types, as well as dialects. COCA is divided evenly (20% each year, and therefore overall as well) between spoken, fiction, popular magazines, newspapers, and academic journals (see details by year, genre, sub-genre, and even down to the level of each of the 160,000 individual texts).

Some have incorrectly criticized COCA for not having enough "informal" texts, because they have not really understood what is in the spoken texts in COCA. In comparison with the Oxford English Corpus, however, COCA does a very good job of including informal language. There are many phenomena for which COCA has 4-5 times as much material (per million words) as does the OEC. For example, the following shows the number of tokens for the "quotative like" construction (and I'm like, you're crazy) in each year in the US portion of the OEC, alongside the figures for COCA in five-year periods. The overall average in the OEC is 0.76 tokens per million words. In COCA, on the other hand, it is 3.94, or more than five times what is in the OEC. This is just one of many examples that could be given.

OEC (US portion), by year
Year      Tokens   Size           Per million
2000      45       66,455,562     0.68
2001      40       89,913,492     0.44
2002      111      142,621,850    0.78
2003      121      191,239,937    0.63
2004      202      240,840,436    0.84
2005      177      180,930,648    0.98
2006      12       15,442,798     0.78

COCA, by five-year period
Years     Tokens   Size           Per million
1990-94   130      103,300,000    1.3
1995-99   347      102,900,000    3.4
2000-04   462      102,600,000    4.5
2005-09   645      93,600,000     6.9
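
The per-million figures above are simply token counts divided by corpus size. The following is a minimal sketch in Python that recomputes the two overall averages from the counts in the table. The function names (per_million, overall_rate) are only illustrative and are not part of any corpus interface, and the COCA sizes are the rounded figures given above.

```python
def per_million(tokens: int, size_in_words: int) -> float:
    """Normalized frequency: occurrences per million words of text."""
    return tokens / size_in_words * 1_000_000

# Quotative "like" in the US portion of the OEC, 2000-2006: (tokens, size in words)
oec_us = [
    (45, 66_455_562),
    (40, 89_913_492),
    (111, 142_621_850),
    (121, 191_239_937),
    (202, 240_840_436),
    (177, 180_930_648),
    (12, 15_442_798),
]

# COCA by five-year period, 1990-94 to 2005-09 (sizes as rounded in the table)
coca = [
    (130, 103_300_000),
    (347, 102_900_000),
    (462, 102_600_000),
    (645, 93_600_000),
]

def overall_rate(rows):
    """Pool all periods, then normalize (not an average of the yearly rates)."""
    total_tokens = sum(tokens for tokens, _ in rows)
    total_size = sum(size for _, size in rows)
    return per_million(total_tokens, total_size)

oec_rate = overall_rate(oec_us)
coca_rate = overall_rate(coca)
print(f"OEC (US): {oec_rate:.2f} per million")   # 0.76
print(f"COCA:     {coca_rate:.2f} per million")  # 3.94
print(f"COCA / OEC ratio: {coca_rate / oec_rate:.1f}")
```

Pooling the tokens and sizes before dividing weights each year by its size, which matters for the OEC because its yearly sizes differ so much.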


Genre balance over time

In order to use frequency statistics to look at changes over time -- as we would want to do with a monitor corpus -- each historical period needs to have the same genre composition. To take a worst-case example, suppose that a corpus had only newspapers from the 1990s and then only fiction from the 2000s. For any change that we see from the 1990s to the 2000s, we would not know if the change had actually occurred in the language as a whole, or if it is just an "artifact" of the changing genre composition from one period to the next.

What we find is that COCA is balanced across genres -- almost perfectly -- from year to year. In each and every year from 1990-2009, the corpus has been divided between spoken (20%), fiction (20%), popular magazines (20%), newspapers (20%), and academic journals (20%). Even at the level of sub-genre (e.g. Newspaper-Sports, or Academic-Medicine), the corpus composition changes very little from year to year.

In the OEC, however, the genre composition varies widely from one year (or set of years) to another. For example, the following figures show the percentage of fiction in the US sub-corpus in different time periods:
 
Year   Total          Fiction       % fiction
2000   66,455,562     6,479,988     9.8
2001   89,913,492     14,326,315    15.9
2002   142,621,850    36,938,545    25.9
2003   191,239,937    61,788,465    32.3
2004   240,840,436    53,462,736    22.2
2005   180,930,648    57,083,698    31.6
2006   15,442,798     12,740,916    82.5

Notice how the percentage of fiction varies widely from year to year (10% to 82%), and how even in two adjacent years it varies widely, such as 32% in 2003 and 22% in 2004. Let us briefly look at how this distorts the corpus data for these periods.
 
                             Entire corpus                Fiction
                             2001     2004     2006       2001     2004     2006
mutter (all forms)
  tokens                     1669     8552     1652       1557     5927     1647
  per million                18.6     44.7     107.0      110.1    110.9    129.3
had + VBN (e.g. had seen)
  tokens                     81811    245966   32178      36135    135952   30535
  per million                909.9    1021.3   2083.7     2522.3   2542.9   2396.6

These two forms (mutter and had + VBN) are characteristic of fiction. Notice that in just the US fiction part of the OEC (the Fiction columns), the frequency per million words stays about the same between 2001, 2004, and 2006, as we would expect. But in the entire US part of the OEC (all genres; the Entire corpus columns), the normalized frequency (per million words) varies widely. For example, had + VBN increases by more than 100% from 2001 to 2006. Why is this? Notice in the fiction-percentage table above that the share of the US corpus in the OEC that is fiction increases markedly over time. In other words, the increase in frequency of these phenomena in the corpus is probably just a function of the change in genre balance, rather than any change in "real world" language. (It would, after all, be quite strange if people really did all of a sudden say had eaten, had noticed, etc. twice as often in 2006 as in 2004!)
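
One way to check this reasoning is to split the overall rate into a fiction part and a non-fiction part. The sketch below does this in Python for had + VBN, using only the counts from the two tables above. The non-fiction figures are not given in the source and are derived here by subtraction, and the names (breakdown, per_million) are illustrative.

```python
def per_million(tokens, words):
    return tokens / words * 1_000_000

# US portion of the OEC, had + VBN:
# year -> (total words, fiction words, total tokens, fiction tokens)
oec_us = {
    2001: (89_913_492, 14_326_315, 81_811, 36_135),
    2004: (240_840_436, 53_462_736, 245_966, 135_952),
    2006: (15_442_798, 12_740_916, 32_178, 30_535),
}

def breakdown(total_words, fic_words, total_tokens, fic_tokens):
    """Split the overall rate into fiction and non-fiction components."""
    share = fic_words / total_words                      # fraction that is fiction
    fic_rate = per_million(fic_tokens, fic_words)
    # Assumption: everything that is not fiction is treated as one block
    other_rate = per_million(total_tokens - fic_tokens, total_words - fic_words)
    # The overall rate is the share-weighted mix of the two genre rates
    overall = share * fic_rate + (1 - share) * other_rate
    return share, fic_rate, other_rate, overall

results = {year: breakdown(*row) for year, row in oec_us.items()}

for year, (share, fic_rate, other_rate, overall) in results.items():
    print(f"{year}: fiction share {share:5.1%}  fiction {fic_rate:7.1f}  "
          f"non-fiction {other_rate:6.1f}  overall {overall:7.1f}")
```

Under these assumptions, the rate within fiction and the rate within non-fiction each stay within a fairly narrow band across the three years, while the overall rate more than doubles. That is what we would expect if the change comes from the fiction share rather than from the language itself.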

In COCA, on the other hand, the relative frequency of these forms in the overall corpus stays quite flat from 1990-94 until 2005-09, because the percentage of texts in the corpus that are from fiction (20% each year) stays the same.
 

mutter
            1990-1994   1995-1999   2000-2004   2005-2009
PER MIL     14.9        13.4        14.8        15.9
SIZE (MW)   103.3       102.9       102.6       93.6
FREQ        1542        1378        1516        1484

had [VVN]
            1990-1994   1995-1999   2000-2004   2005-2009
PER MIL     1,173.1     1,066.2     1,059.0     1,095.4
SIZE (MW)   103.3       102.9       102.6       93.6
FREQ        121208      109731      108624      102491
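
To put a rough number on "quite flat", one can compare the highest and lowest per-million rate for each form in each corpus. The following Python sketch uses the rates reported above. The max/min ratio is only one simple measure, chosen here for illustration, and the two corpora are sampled at different intervals (five-year periods for COCA, the single years 2001, 2004, and 2006 for the OEC), so the comparison is indicative only.

```python
def spread(rates):
    """Ratio of the highest to the lowest rate; 1.0 would be perfectly flat."""
    return max(rates) / min(rates)

# Per-million rates as reported in the tables above
rates = {
    "mutter": {
        "COCA": [14.9, 13.4, 14.8, 15.9],      # 1990-94 ... 2005-09
        "OEC_US": [18.6, 44.7, 107.0],         # 2001, 2004, 2006 (all genres)
    },
    "had + past participle": {
        "COCA": [1173.1, 1066.2, 1059.0, 1095.4],
        "OEC_US": [909.9, 1021.3, 2083.7],
    },
}

for form, corpora in rates.items():
    for corpus, values in corpora.items():
        print(f"{form:22} {corpus:7} max/min = {spread(values):.2f}")
```

With these figures the COCA ratios stay below 1.2 for both forms, while the ratios for the US portion of the OEC are above 2.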

Summary

The Oxford English Corpus is a great corpus in terms of its size and the wide range of genres and text types it covers. COCA is not as large, but it does cover more years. Perhaps most importantly, COCA has the same genre balance from year to year, which allows it to be used as a monitor corpus in ways that the OEC could not be.

The bottom line, however, is that the OEC is not really available to the general public, so very few people can actually use it. COCA, however, is freely available to all interested researchers, teachers, and language learners.