Sunday, December 19, 2010

How to use ngrams, part 1

The new ngrams tool from Google Labs is awesome, but many of the examples floating around the blog world are flawed. I am not a linguist; I just like data. The Lousy Linguist blog has more to say about the validity of the tool.

The purpose of this post is to help you avoid several basic mistakes in formulating queries or interpreting results. I haven't read the Google paper in Science, so I can't really speak to methods or sources.

What does an ngram measure? A word's count divided by the count of all words. Start with the ngram for "the" and "and", where you can observe that these common words are roughly 1/20 and 1/40 of all printed English words, respectively. Nearly any word will appear to flatline when scaled against "the". Consider the ngram for "politics", paired with "the" and alone. At least right now, there appears to be no option to use a log scale.
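To make the scaling problem concrete, here is a minimal sketch of the arithmetic. The counts are made up for illustration (chosen only so that "the" lands at the rough 1/20 share mentioned above), and the function name is my own, not anything from the ngrams tool.

```python
import math

def relative_frequency(word_count, total_words):
    """Share of all printed words accounted for by one word."""
    if total_words <= 0:
        raise ValueError("total_words must be positive")
    return word_count / total_words

# Hypothetical counts for a single year, not real ngram data.
total = 1_000_000
the_freq = relative_frequency(50_000, total)  # about 1/20 of all words
rare_freq = relative_frequency(20, total)     # a made-up rare word

# On a linear axis scaled to "the", the rare word reaches only this
# fraction of the plot height, which is why it looks like a flat line.
linear_height = rare_freq / the_freq

# A log scale would instead show the gap in orders of magnitude.
log_gap = math.log10(the_freq) - math.log10(rare_freq)

print(f"rare word reaches {linear_height:.4%} of the height of 'the'")
print(f"gap on a log10 axis: {log_gap:.2f} orders of magnitude")
```

The practical upshot is to chart words of similar frequency together, or to chart a rare word alone so the axis rescales to it.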

Looking for famous people? Query formation is critical. Consider this ngram for "Franklin D Roosevelt". It looks like his popularity is growing. Check again, this time including "Franklin Roosevelt". Of course, FDR was best known by his initials. For this reason, Ngrams isn't set up right now for an easy comparison of presidential popularity. OCR errors also affect search results, particularly for early periods. Consider this ngram for "FDR" and "fdr". Ngrams are case sensitive.
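If you had the yearly numbers for each spelling in hand, one way to get a fairer picture would be to add the variants together. This is only a sketch: the frequencies below are invented, the function name is hypothetical, and it assumes the variants are normalized the same way, which may not hold when mixing two-word and three-word phrases.

```python
from collections import defaultdict

def combine_variants(series_by_query, variants):
    """Sum per-year frequencies over several spellings of the same name."""
    combined = defaultdict(float)
    for query in variants:
        # Lookups are exact, so "FDR" and "fdr" are separate series,
        # mirroring the case sensitivity of the ngram browser.
        for year, freq in series_by_query.get(query, {}).items():
            combined[year] += freq
    return dict(sorted(combined.items()))

# Made-up frequencies for illustration only, not real ngram data.
series = {
    "Franklin D Roosevelt": {1940: 2.0e-6, 1980: 3.0e-6},
    "Franklin Roosevelt": {1940: 4.0e-6, 1980: 1.0e-6},
    "FDR": {1940: 1.0e-6, 1980: 5.0e-6},
    "fdr": {1980: 0.5e-6},
}

fdr_total = combine_variants(
    series, ["Franklin D Roosevelt", "Franklin Roosevelt", "FDR", "fdr"]
)
for year, freq in fdr_total.items():
    print(year, f"{freq:.2e}")
```

With these invented numbers, the single query "Franklin D Roosevelt" rises over time while a different variant falls, which is the kind of shift a one-spelling query hides.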

Some conclusions can't be easily tested in ngrams. Is the rising popularity of "Friedman" due to "Milton Friedman" or something else? An exhaustive search of pairs would be difficult using the ngram browser.
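For a single candidate pair, the check itself is simple arithmetic: divide the frequency of the full name by the frequency of the surname, year by year. The sketch below uses invented numbers and a hypothetical function name, and it treats the two series as directly comparable, which is an assumption, since single words and two-word phrases may be normalized against different totals.

```python
def share_explained(surname_series, full_name_series):
    """Per-year ratio of full-name frequency to surname frequency."""
    shares = {}
    for year, surname_freq in surname_series.items():
        if surname_freq > 0:  # skip years where the ratio is undefined
            shares[year] = full_name_series.get(year, 0.0) / surname_freq
    return shares

# Made-up frequencies for illustration only, not real ngram data.
friedman = {1950: 2.0e-6, 1980: 8.0e-6}
milton_friedman = {1950: 0.2e-6, 1980: 2.0e-6}

shares = share_explained(friedman, milton_friedman)
for year, share in sorted(shares.items()):
    print(year, f"{share:.0%} of 'Friedman' mentions")
```

The hard part is not this division but repeating it for every plausible first name, which is what makes the exhaustive search impractical in the browser.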
