The Web is much larger than the Corpus of Contemporary American English, and Google is a great search engine. So why not just use Google to see what's happening in contemporary American English? Well, as good as it is for most searches, there are things that Google (like any other search engine) either can't do or does only very poorly, but which are possible with our corpus. These include the following:

  • Looking at differences between different styles or types of English. Is a particular grammatical construction or a given phrase used more in informal (e.g. spoken) or formal (e.g. academic) English? Google is pretty good at knowing what domain something comes from (e.g. cbs.com or neh.org), but it can't really map that onto "genre" or "styles of speech".

  • Measuring changes over time. Is a word or phrase used more or less now than in the early 1990s? Which verbs have really been on the increase over the last 2-3 years? There is no way to check this with Google or other search engines.

  • Grammar-based searches. Is end up VERB-ing (e.g. ended up paying too much) on the increase or decrease? Is the passive (be VERB-ed: e.g. was seen) used more in spoken or academic? Google doesn't allow you to search by part of speech or by lemma (i.e. all of the forms of a word). You'd have to search for each string individually (e.g. all forms of end + up + every conceivable verb).

  • Semantically-based searches. How are fair, or strike, or sign used in the language? In order to find out, you need to look at collocates (nearby words), since (as corpus linguists are fond of saying) "the words that a word 'hangs out with' can tell you a lot about its meaning". But Google doesn't do collocates.

  • And more semantically-based searches. Since Google can't do collocates, it obviously can't use them to compare word meanings in different genres (e.g. chair in fiction and academic), or to see how they're changing over time (e.g. green = "environmentally friendly").

  • And even more complex semantically-based searches. Google only really knows how to search for specific words and strings. It doesn't let you search by words that are related in meaning, such as all of the synonyms of a given word, or all of the 100+ words in a list you've created (related to fashion, or food, or clothing, or whatever) as part of a query. Our corpus can do both of these.

  • Finding the word when you don't know what the word is. What are the nouns that are found mainly in engineering articles, collocates of hard that are used more in fiction, or synonyms of strong that are found mainly in spoken? Google allows you to find the occurrences of a given form that you already know, but it can't produce a list of words for you that match criteria like these.

  • Searching for strings of words. Sure, on Google you can search for a phrase like "might be taken for a". Go ahead and try it. How many hits does it say there are? Our search today shows 92,400. Start paging through the hits, though, and they run out at about 740. In other words, Google's "guess" is more than 100 times higher than it should be. This is because Google usually doesn't "know" the frequency of anything more than single words -- it's usually just guessing.
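
To make the idea of collocates from the list above a bit more concrete, here is a minimal sketch in Python of how nearby words might be counted. This is only an illustration and not how our corpus works internally: the function name collocates, the window parameter, the crude tokenizer, and the three-sentence sample text are all assumptions made up for the example.

```python
from collections import Counter
import re


def collocates(text, node, window=4):
    """Count the words found within `window` tokens of each occurrence of `node`.

    Hypothetical helper for illustration only. Tokenization is a crude
    lowercase split, and sentence boundaries are ignored.
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok != node:
            continue
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # skip the node word itself
                counts[tokens[j]] += 1
    return counts


# Made-up sample text; a real corpus would hold hundreds of millions of words.
sample = (
    "The judge promised a fair trial. "
    "They met at the county fair last summer. "
    "Everyone deserves a fair trial and a fair hearing."
)

result = collocates(sample, "fair", window=2)
print(result.most_common(5))
```

Even in this toy text, a collocate like trial points to the "just" sense of fair, while county points to the "festival" sense. A real corpus query would also weigh collocates by part of speech, genre, and time period, which a raw count like this does not attempt.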

So if you want to find web pages dealing with a certain topic, then Google is fine. But using Google as a full-blown linguistic search engine has real drawbacks. None of the preceding types of searches -- which are some of the most interesting ones that you can carry out to see what's going on with the language -- are possible with Google (or any other search engine). But they are all possible -- quickly and easily -- with the Corpus of Contemporary American English.