The Oxford English Corpus (OEC) is the largest structured corpus of any language. Unfortunately, the corpus is generally only available to researchers who are working on projects for Oxford University Press. Nevertheless, it may still be useful to make a brief comparison of COCA and the OEC.

Corpus size and historical coverage

The Oxford English Corpus is about 1.9 billion words, or about 4-5 times the size of COCA. This means that for many very low-frequency phenomena it provides much more data than COCA, in the same way that COCA provides much more data than a corpus that is 4-5 times smaller than itself, such as the 100 million word British National Corpus.

In terms of historical coverage, however, COCA is more developed than the OEC. COCA has texts from 1990-2009 (20 million words each year), while the OEC has texts from only about one-third that number of years: 2000-2006. No texts have been added to the OEC since 2006, whereas 20 million words of text continue to be added to COCA each year, bringing it right up to the present time. The following is a summary of the number of words in each corpus in each year from 1990 to 2009:
The OEC has texts from a wide range of genres and text types, as well as dialects. COCA is divided evenly (20% each year, and therefore overall as well) between spoken, fiction, popular magazines, newspapers, and academic journals (see details by year, genre, sub-genre, and even down to the level of each of the 160,000 individual texts). Some have incorrectly criticized COCA for not having enough "informal" texts, because they have not really understood what is in the spoken texts in COCA. In comparison with the Oxford English Corpus, however, COCA does a very good job of including informal language. There are many phenomena for which COCA has 4-5 times as much material (per million words) as does the OEC. For example, the following shows the number of tokens for the "quotative like" construction (and I'm like, you're crazy) in each year in the US portion of the OEC. The overall average is 0.76 tokens per million words. In COCA, on the other hand, it is 3.94, or more than five times the OEC figure. This is just one of many examples that could be given.
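The per-million figures quoted for the "quotative like" construction are normalized frequencies: raw token counts divided by corpus size and scaled to one million words. The following is a minimal Python sketch of that arithmetic. The function name `per_million` and the raw counts passed to it are hypothetical and chosen only for illustration; the two rates (0.76 and 3.94) are the ones given in the text.

```python
def per_million(tokens: int, corpus_words: int) -> float:
    """Normalized frequency: occurrences per million words of corpus text."""
    if corpus_words <= 0:
        raise ValueError("corpus_words must be positive")
    return tokens / corpus_words * 1_000_000


# Rates quoted in the text for the "quotative like" construction.
oec_us_rate = 0.76  # tokens per million words, US portion of the OEC
coca_rate = 3.94    # tokens per million words, COCA

# How many times more frequent the construction is in COCA.
ratio = coca_rate / oec_us_rate
print(f"COCA / OEC ratio: {ratio:.2f}")

# Hypothetical raw counts (NOT taken from either corpus), showing how a
# rate such as 3.94 per million could arise from a count and a corpus size.
example_rate = per_million(tokens=394, corpus_words=100_000_000)
print(f"Example normalized rate: {example_rate:.2f} per million")
```

Normalizing in this way is what allows corpora of very different sizes to be compared directly.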
In order to use frequency statistics to look at changes over time -- as we would want to do with a monitor corpus -- each historical period needs to have the same genre composition. To take a worst-case example, suppose that a corpus had only newspapers from the 1990s and then only fiction from the 2000s. For any change that we see from the 1990s to the 2000s, we would not know if the change had actually occurred in the language as a whole, or if it is just an "artifact" of the changing genre composition from one period to the next. What we find is that COCA is balanced across genres -- almost perfectly -- from year to year. In each and every year from 1990-2009, the corpus has been divided between spoken (20%), fiction (20%), popular magazines (20%), newspapers (20%), and academic journals (20%). Even at the level of sub-genre (e.g. Newspaper-Sports, or Academic-Medicine), the corpus composition changes very little from year to year.
In the OEC, however, the genre composition varies widely from one year
(or set of years) to another. For example, the following figures show
the percentage of fiction in the US sub-corpus in different time periods:
Notice how the percentage of fiction varies widely from year to year (10% to 82%), and how it can shift noticeably even between two adjacent years, such as 32% in 2003 and 22% in 2004. Let us briefly look at how this distorts the corpus data for these periods.
These two forms (mutter and had + VBN) are characteristic of fiction. Notice that in just the US fiction part of the OEC (green cells), the frequency per million words stays about the same between 2001, 2004, and 2006, as we would expect. But in the entire US part of the OEC (all genres; in blue), the normalized frequency (per million words) varies widely. For example, had + VBN increases by more than 100%. Why is this? Well, notice in the table above that the percentage of the US corpus in the OEC that is fiction increases markedly over time. In other words, the increase in frequency of these phenomena in the corpus is probably just a function of the change in genre balance, rather than any change in "real world" language. (It would, after all, be quite strange if people really did all of a sudden say had eaten, had noticed, etc. 200% as much in 2006 as in 2004!)
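The reasoning here can be made concrete with a small sketch. The overall per-million rate of a form is the average of its per-genre rates, weighted by each genre's share of the corpus, so it can change even when every per-genre rate is constant. The Python below is only an illustration under stated assumptions: the function names, the two-genre split ("fiction" vs. "other"), and all of the rates and shares are hypothetical and are not taken from the OEC or COCA.

```python
def overall_rate(genre_rates: dict, genre_shares: dict) -> float:
    """Whole-corpus per-million rate: share-weighted average of genre rates."""
    if abs(sum(genre_shares.values()) - 1.0) > 1e-9:
        raise ValueError("genre shares must sum to 1")
    return sum(genre_rates[g] * genre_shares[g] for g in genre_shares)


def rates_by_period(genre_rates: dict, fiction_share_by_period: dict) -> dict:
    """Overall rate in each period, given only the fiction share per period."""
    return {
        period: overall_rate(
            genre_rates, {"fiction": share, "other": 1.0 - share}
        )
        for period, share in fiction_share_by_period.items()
    }


# Hypothetical per-million rates for a fiction-heavy form such as had + VBN.
# They are held constant: the "language" itself does not change.
rates = {"fiction": 1000.0, "other": 100.0}

# Hypothetical corpus A: the fiction share grows from one period to the next.
shifting = rates_by_period(rates, {"period 1": 0.10, "period 2": 0.40})

# Hypothetical corpus B: the fiction share is fixed at 20% in every period.
balanced = rates_by_period(rates, {"period 1": 0.20, "period 2": 0.20})

for name, result in (("shifting", shifting), ("balanced", balanced)):
    for period, rate in result.items():
        print(f"{name:9s} {period}: {rate:.0f} per million")
```

In this toy setup the overall rate in the shifting corpus more than doubles (from about 190 to about 460 per million) although nothing has changed within either genre, while in the balanced corpus it stays flat.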
In COCA, on the other hand, the relative frequency of these forms
in the overall corpus stays quite flat from 1990-94 until
2005-09, because the percentage of texts in the corpus that are from
fiction (20% each year) stays the same.
Summary

The Oxford English Corpus is a great corpus in terms of its size and its wide range of genres and text types. COCA is not as large, but it does cover more years. Perhaps most importantly, COCA has the same genre balance from year to year, which allows it to be used as a monitor corpus in ways that the OEC could not be.
The bottom line, however, is that the OEC is not really available to
the general public, so very few
people can actually use it. COCA, by contrast, is freely available to
all interested researchers, teachers, and language learners.