Thursday, August 28, 2008

Can Google Predict Election '08 Results?

Today's post on TechCrunch about possibly predicting election results using web traffic analysis reminded me of my thoughts from a few months ago.

Over the course of election season, starting with the presidential primaries through Super Tuesday and going into the national conventions, Google began releasing increasing amounts of election-related content and services. Reading the blog posts linked above and noticing the trend from Google leads me to the conclusion that the most relevant exit poll for November's presidential election this year might actually come from Google.

Google likely already tracks user searches and email data from Gmail. Combine that with some pretty basic data mining around traffic analysis, comments on their targeted election sites, and the popularity of their election-related content/services, and you could get a pretty good 'pulse-of-the-nation' with respect to red state vs. blue state. My guess is that the big push around election-related content and services is an attempt not only to increase pageviews, but also to harness large amounts of interesting data that can be mined for various purposes. Additionally, this gives them a large data set to use to refine their own data mining technologies. I strongly suspect that Google will hold on to the raw data past election time and use it as test data to track progress in their mining technologies. Good data to test your technologies is *very* hard to come by, and it would be foolish to let such rich data simply fall by the wayside.
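
To make the 'pulse-of-the-nation' idea a bit more concrete, here is a toy sketch in Python. Everything in it is my own assumption: the events are made-up placeholders rather than real data, `state_lean` is a hypothetical helper, and simply counting signals per state is far cruder than whatever Google might actually do.

```python
from collections import defaultdict


def state_lean(events):
    """Tally (state, party) events and return each state's leaning.

    `events` is an iterable of (state, party) pairs, where party is
    "red" or "blue". A tie is reported as "toss-up".
    """
    counts = defaultdict(lambda: {"red": 0, "blue": 0})
    for state, party in events:
        counts[state][party] += 1

    lean = {}
    for state, c in counts.items():
        if c["red"] > c["blue"]:
            lean[state] = "red"
        elif c["blue"] > c["red"]:
            lean[state] = "blue"
        else:
            lean[state] = "toss-up"
    return lean


# Made-up placeholder events (hypothetical), standing in for searches,
# comments, or page views that were somehow tagged with a party.
sample_events = [
    ("State A", "red"), ("State A", "red"), ("State A", "blue"),
    ("State B", "blue"), ("State B", "blue"),
    ("State C", "red"), ("State C", "blue"),
]
pulse = state_lean(sample_events)
```

The hard part, of course, is the tagging step this sketch takes for granted: deciding which party a given search or comment actually signals.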

Whether they publicly release their 'exit poll' is another story altogether. I don't think Google really has much to gain by releasing the data even if they do manage to mine it. I know I wouldn't - I would mine the data, distribute predictions among team members for entertainment purposes only, see how close I got to the result, refine my algorithms, run them over the same data again to see how much closer I get, and repeat. Releasing the data publicly, even if it turned out to be dead-on accurate, wouldn't buy me much. If it wasn't dead-on accurate, some amount of geek-cred would be lost.
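
The predict, compare, refine, repeat loop could look something like the sketch below. Again, this is only an illustration under my own assumptions: `accuracy` and `best_threshold` are hypothetical helpers, the per-state scores and results are invented placeholders, and "refining the algorithm" is reduced to picking a single cutoff.

```python
def accuracy(predicted, actual):
    """Fraction of states in `actual` whose predicted winner matches."""
    if not actual:
        raise ValueError("no results to compare against")
    hits = sum(1 for state, winner in actual.items()
               if predicted.get(state) == winner)
    return hits / len(actual)


def best_threshold(scores, actual, thresholds):
    """Return the cutoff whose predictions best match the results.

    A state with score >= cutoff is predicted "blue", otherwise "red".
    """
    def predict(cutoff):
        return {s: ("blue" if v >= cutoff else "red")
                for s, v in scores.items()}

    return max(thresholds, key=lambda t: accuracy(predict(t), actual))


# Invented placeholder numbers (hypothetical), not real data.
scores = {"State A": 0.30, "State B": 0.70, "State C": 0.55}
actual = {"State A": "red", "State B": "blue", "State C": "red"}

first_guess = {s: ("blue" if v >= 0.5 else "red") for s, v in scores.items()}
first_accuracy = accuracy(first_guess, actual)          # two of three right
refined_cutoff = best_threshold(scores, actual, [0.4, 0.5, 0.6])
```

With these placeholder numbers the first guess misses State C, and rerunning over the same data settles on the 0.6 cutoff, which gets all three right.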

I guess November will show whether my theory above is correct.

P.S. I hope you picked up on the paradox in that last sentence above... :)