Language has a genome, and Google is mapping it.
So many books, so little time. Nobody will ever be able to read them all, to distill their wisdom into an aggregate pool of knowledge that can be accessed at a moment’s notice.
But that’s not stopping Google from trying. The company is making available a new feature in Google Labs called Books Ngram Viewer that scans millions of books from 1600 to the present in English, French, German, Spanish, Russian and – perhaps most ambitiously – Chinese.
A word of warning: If you are at all a word geek, Ngram can be a bit of a time vortex. I was quickly lost in the task of investigating the decline of outmoded terms (“oriental”) and the rise of new ones (“hipster”), the disappearance and re-emergence of others (“geek”), the comparison of common words (“he” vs. “she” and “I” vs. “you”). For much deeper analysis of what can be mined from Ngram, the Guardian has a good overview.
Google estimates that there are 129,864,880 unique books in the world, and counting. To correct the imbalance in the number of published books in, say, the 17th century and the 20th century, Google normalized the books by sampling 6,000 per year.
Ngram – a term borrowed from the fields of artificial intelligence and genome sequencing – is more than a word toy. It does for 500 years’ worth of books what Google Trends does for a few decades of web pages: it tracks the usage of words, phrases and ideas. It also adds weight to Google’s insistence that its primary interest in scanning published books and housing their text on its servers wasn’t to drive publishers out of business, but to pursue its self-appointed mandate to organize all the world’s information.
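For the curious, the basic idea of tracking a phrase over time can be sketched in a few lines of Python. This is only an illustration and not Google's actual pipeline: the toy corpus, the whitespace tokenizer, and the function names `ngrams` and `yearly_frequency` are all assumptions made up for the example.

```python
from collections import Counter

def ngrams(text, n):
    """Split text on whitespace and return every run of n consecutive words."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def yearly_frequency(corpus, phrase):
    """Return {year: share of that year's n-grams that match the phrase}.

    corpus is an iterable of (year, text) pairs (a hypothetical format).
    """
    n = len(phrase.split())
    totals = Counter()  # all n-grams seen per year
    hits = Counter()    # n-grams matching the phrase per year
    for year, text in corpus:
        grams = ngrams(text, n)
        totals[year] += len(grams)
        hits[year] += grams.count(phrase.lower())
    # Dividing by the yearly total makes years with more text comparable
    # to years with less.
    return {year: hits[year] / totals[year] for year in sorted(totals) if totals[year]}

# Made-up sample "books", two per year.
corpus = [
    (1900, "the geek ate the apple"),
    (1900, "the apple fell"),
    (2000, "the geek wrote the code"),
    (2000, "a geek is a geek"),
]

freq = yearly_frequency(corpus, "geek")
for year, share in freq.items():
    print(year, round(share, 3))
```

Reporting a share rather than a raw count is what keeps a year with many books from swamping a year with few, the same imbalance the per-year sampling described above is meant to address.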
Whether Ngram will ease any discomfort among publishers about working with Google remains to be seen. In the meantime, Google will work out kinks that remain in Ngram. As the company points out, the word “Internet” seems to appear intermittently in previous centuries, not because of time-traveling engineers but because of character-recognition errors.
Intriguingly, the company claims there is one usage of the word “Internet” before 1950. Try finding it without Google’s assistance and you’ll see how far the company’s technology has insinuated itself into our understanding of the world.