Language has a genome, and Google is mapping it.

By Kevin Kelleher
December 17, 2010

So many books, so little time. Nobody will ever be able to read them all, to distill their wisdom into an aggregate pool of knowledge that can be accessed at a moment’s notice.

But that’s not stopping Google from trying. The company is making available a new feature in Google Labs called Books Ngram Viewer, which charts the frequency of words and phrases across millions of digitized books from 1600 to the present in English, French, German, Spanish, Russian and – perhaps most ambitiously – Chinese.

A word of warning: If you are at all a word geek, Ngram can be a bit of a time vortex. I was quickly lost in the task of investigating the decline of outmoded terms (“oriental”) and the rise of new ones (“hipster”), the disappearance and re-emergence of others (“geek”), the comparison of common words (“he” vs. “she” and “I” vs. “you”). For much deeper analysis of what can be mined from Ngram, the Guardian has a good overview.

Google estimates that there are 129,864,880 unique books in the world, and counting. To correct the imbalance between the number of books published in, say, the 17th century and the 20th, Google normalized the corpus by sampling 6,000 books per year.
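In effect, the corpus is balanced before any counting happens: each year contributes at most the same number of books. A minimal sketch of that balancing step in Python, assuming the scanned books are held in a per-year dictionary of lists (the function name, cap constant and data layout are illustrative, not Google’s actual pipeline):

    import random

    CAP = 6_000  # books sampled per year, per the article

    def balance_corpus(books_by_year, cap=CAP, seed=0):
        """Downsample each year to at most `cap` books so prolific
        modern years don't outweigh the sparse 1600s."""
        rng = random.Random(seed)
        return {
            year: rng.sample(books, min(cap, len(books)))
            for year, books in books_by_year.items()
        }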

Ngram – a term borrowed from the fields of artificial intelligence and genome sequencing – is more than a word toy. It does for 500 years’ worth of books what Google Trends does for a few decades of web pages: it tracks the usage of words, phrases and ideas. It also adds weight to Google’s insistence that its primary interest in scanning published books and housing their text on its servers wasn’t to drive publishers out of business, but to pursue its self-appointed mandate to organize all the world’s information.
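For the curious, an n-gram is simply a run of n consecutive words, and the viewer plots each phrase’s share of all n-grams printed in a given year rather than its raw count, so the chart stays comparable across centuries. A toy sketch of that computation, using a tiny whitespace-tokenized corpus (all names and data here are illustrative, not Google’s actual code):

    from collections import Counter

    def ngrams(tokens, n):
        """Yield successive n-grams: tuples of n consecutive tokens."""
        for i in range(len(tokens) - n + 1):
            yield tuple(tokens[i:i + n])

    def relative_frequency(corpus_by_year, phrase, n):
        """For each year, divide the phrase's count by the year's total
        n-gram count, so wordier years don't dominate the chart."""
        target = tuple(phrase.lower().split())
        freqs = {}
        for year, texts in corpus_by_year.items():
            counts = Counter()
            for text in texts:
                counts.update(ngrams(text.lower().split(), n))
            total = sum(counts.values())
            freqs[year] = counts[target] / total if total else 0.0
        return freqs

    # Toy per-year corpus (hypothetical data).
    corpus = {
        1900: ["the telegraph carries the news", "news of the day"],
        2000: ["the internet carries the news", "news on the internet"],
    }
    print(relative_frequency(corpus, "the internet", 2))
    # {1900: 0.0, 2000: 0.2857...}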

Whether Ngram will ease any discomfort among publishers about working with Google remains to be seen. In the meantime, Google will work out the kinks that remain in Ngram. As the company points out, the word “Internet” seems to appear intermittently in previous centuries – not because of time-traveling engineers, but because of character-recognition errors.

Intriguingly, the company claims there is one usage of the word “Internet” before 1950. Try finding it without Google’s assistance and you’ll see how far the company’s technology has insinuated itself into our understanding of the world.

One comment

Looking at the pretty charts in Culturomics and the new Google Books interface is nice. But of course there is much more to studying cultural and language change than simple frequency charts of exact words and phrases.

The NEH-funded, 400 million word Corpus of Historical American English (freely available at http://corpus.byu.edu/coha) allows for a much wider range of searches. Besides frequency lists like Google Books (with essentially the same results), simple 2-3 second searches can find changes in word meaning and usage (e.g. gay, care, web; or what we’re saying about any topic over time), grammatical changes, and *all words* that are more frequent in one period than another (rather than one word at a time, as with Google Books), as well as much more.

More information at:
http://corpus.byu.edu/coha/compare-culturomics.asp

Posted by CorpusProf