Goodbye Polling, Hello Big Data
Whenever a pundit rushes to proclaim the “death of” something, that’s the surest sign it’ll probably outlive the person making that bold prediction.
Nonetheless, as a general rule, I tend to bet on the future, and on old incumbent industries and ways of doing things (eventually) being dislodged, even if progress in that direction is all too slow (Exhibit A: TV vs. online advertising in 2010). At a minimum, the trendlines become clear, even if the actual moment of transition isn’t yet.
With that in mind, I think we should be paying closer attention to what Facebook (and to some degree, Foursquare) was able to do on Election Day as an alternative to traditional polling.
On Election Day, Facebook placed an “I Voted” button on its home page. Over 12.4 million clicked it. That’s roughly one in seven people who voted on November 2nd. It’s also more than double the 5.4 million who clicked the same button in 2008, when overall turnout was roughly 50% higher.
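As a quick sanity check on those figures, here is a back-of-the-envelope sketch in Python. It uses only the rounded numbers quoted above (12.4 million clicks, 5.4 million in 2008, "one in seven," "roughly 50% higher"), so the results are rough implied values rather than official turnout counts, and the variable names are purely illustrative.

```python
# Back-of-the-envelope check using only the rounded figures quoted in the post.
clicks_2010 = 12.4e6   # "I Voted" clicks on Election Day 2010
clicks_2008 = 5.4e6    # clicks on the same button in 2008
share_2010 = 1 / 7     # "roughly one in seven people who voted"

# If 12.4M clicks is one in seven voters, total turnout is about 7x the clicks.
implied_turnout_2010 = clicks_2010 / share_2010

# "Roughly 50% higher" turnout in 2008 (a rounded figure, so treat as approximate).
implied_turnout_2008 = implied_turnout_2010 * 1.5

# Share of 2008 voters who clicked, and growth in clicks between the two elections.
share_2008 = clicks_2008 / implied_turnout_2008
click_growth = clicks_2010 / clicks_2008

print(f"Implied 2010 turnout: {implied_turnout_2010 / 1e6:.1f}M")
print(f"Implied 2008 turnout: {implied_turnout_2008 / 1e6:.1f}M")
print(f"Share of voters clicking: {share_2008:.1%} (2008) vs {share_2010:.1%} (2010)")
print(f"Clicks grew {click_growth:.1f}x")
```

On these numbers, the button went from reaching roughly one voter in 24 in 2008 to one in seven in 2010, so the raw growth in clicks actually understates how much more of the electorate it covered.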
The coolest thing about the button, speaking as a political data geek, wasn’t the fact of its very presence. It was the analysis Facebook was later able to do on turnout patterns by age, political affiliation, and even degrees of connection with other voters.
The chart of turnout levels by age and political party is exactly what you would expect: a steep rise from the low 20s among young voters to nearly 50% in the 60-65 age bracket. The enthusiasm gap was evident in these numbers, too. At almost every age level, Republicans were more likely to vote.*
The breakdown of political party affiliation by state also strikes me as perfectly valid:
This is also the first year I really didn’t look at the exit polls much, if at all. Since 2004, it’s become abundantly clear, with the rise of early voting and their well-documented issues in predicting the Presidency of John F. Kerry, that they are no more valid than a regular opinion poll conducted over the phone, and in some ways have tended to miss the mark dramatically in ways no regular pollster would tolerate (I have a hard time believing that a phone pollster would have come up with Kerry by 20 in Pennsylvania or within the margin of error in South Carolina). And they still have to be adjusted to match actual results a week after the vote! Shouldn’t a poll of 17,000 people, weighted properly, be able to produce results within 1% of the actual results without the benefit of such “adjustments”? Analysts routinely raise questions when the exit polls show voting preferences among groups like Hispanics off from all other polling. If the accuracy of the underlying data can’t be trusted, why would we take the “adjusted” figures at face value, as the political community seemingly does?
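A rough way to put a number on the 17,000-person question is the textbook margin-of-error formula for a proportion. The sketch below assumes a simple random sample, which real exit polls are not: they interview voters at a sample of precincts and are then weighted, so their true margin is wider than this idealized figure. The function name and defaults are illustrative.

```python
import math

def margin_of_error(n, p=0.5, z=1.96):
    """95% margin of error for a proportion p, assuming a simple random sample of size n."""
    return z * math.sqrt(p * (1 - p) / n)

# Idealized margin for a 17,000-person sample at the worst case p = 0.5.
moe = margin_of_error(17_000)
print(f"n = 17,000: +/- {moe * 100:.2f} points")
```

Under that idealized assumption the margin comes out to about three-quarters of a point, which is why misses of several points suggest something other than ordinary sampling error.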
This isn’t to say that I distrust all polling. As discussed on the podcast the other week, I love polling and consume it religiously in the run-up to every election. High-profile failures like the 2008 New Hampshire Democratic primary and the 2010 Nevada Senate race aren’t reflective of the overall accuracy of polls in predicting most races. For the most part, they give us a pretty good read on who is likely to win and by how much, and I don’t find them as problematic as the exit polls.
Nonetheless, even with the vastly increased volume of polls, they miss important things, like:
- Individual House races didn’t get polled as much as they should have to get a true and accurate read on the state of play in the House. We instead rely on the pseudo-science of Cook and Rothenberg to fill in the blanks, and they always seem to be playing catch up.
- Polling in primaries can be very spotty, with weeks if not months between public polls. Low-budget House campaigns can’t afford much more than a baseline and then one or two brushfire surveys to augment the corpus of public polling, leaving them mostly in the dark about real conditions on the ground.
- Polling can’t give you the kind of granular data down to the county level you really need to optimize your GOTV efforts, only by broad regions like “Southern California” or the “San Francisco Bay Area.”
- Trying to build an RCP or Pollster-like average for different demographic groupings or for core questions like Party ID that are actually pretty crucial to gauging overall dynamics is virtually impossible because of the different methodologies pollsters use to weight and even define these groups. Some pollsters hold party ID constant, others don’t. You can hedge against uncertainty by averaging the ballot test between polls but the sample sizes on subgroups are often so small that they are practically worthless in developing overall strategy.
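To see why small subgroups are so hard to use, here is a minimal sketch of how the margin of error grows as the sample shrinks. The poll size of 600 and the subgroup sizes are hypothetical numbers chosen for illustration, and the formula assumes a simple random sample, so real polls with weighting will be somewhat worse.

```python
import math

def margin_of_error(n, p=0.5, z=1.96):
    """95% margin of error for a proportion p, assuming a simple random sample of size n."""
    return z * math.sqrt(p * (1 - p) / n)

# Hypothetical 600-person statewide poll and two subgroups within it.
samples = {
    "full sample": 600,
    "subgroup that is 30% of the sample": 180,
    "subgroup that is 10% of the sample": 60,
    # Pooling five polls only works if they all define the group the same way.
    "10% subgroup pooled across 5 polls": 300,
}

for label, n in samples.items():
    print(f"{label} (n={n}): +/- {margin_of_error(n) * 100:.1f} points")
```

A 60-person subgroup carries a margin of more than twelve points in either direction, and even pooling five such polls leaves it wider than the margin on a single full sample, assuming the pollsters' definitions match at all.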
This is why I find what Facebook did with their election data so appealing. They have no sample size issues, as they reflect an overall sample of one-seventh of the electorate. Only self-selection issues. And increasingly I’ll trade less scientific data for a more insightful, larger data set that gives me granularity a poll can’t. It’s like the difference between a 100×50 thumbnail and a digital photograph in full 12-megapixel glory. You’re likely to get the basic idea from the thumbnail, but good luck reading the text on that sign in the background.
Likewise, the “I Voted” project we were part of via Foursquare gave us data a poll couldn’t, visualizing for the first time I’ve seen anywhere online when people vote during the day. Even with all the timezones, you get a clear picture that most people really do tend to vote during the evening, with the 50% mark of total votes cast being reached at around 3pm.
You can nitpick this for a host of demographic reasons, by saying that seniors are not likely to be accounted for, etc. etc. — but what’s the alternative? No data? Flawed exit poll data? When people vote is actually a pretty crucial fact if you’re a field director and the entire campaign comes down to your turnout operation. And if we’re fully transparent about known problems in how people tend to use these services and thus how data is recorded, we can at least try to hedge against them or conduct longitudinal comparison only amongst those subgroupings most likely to have valid data, which is still pretty darn useful.
Nor is self-selection an unknown problem in the world of polling. With refusal rates being what they are, actually taking an entire survey seems to me to be a form of self-selection — how do you know you’re not biasing the results towards folks who are just plain lonely, or don’t have kids who demand their attention? The problem of polling cell phone-only households has also been much discussed, and the fix most pollsters have settled on is to reweight youth and minority numbers up, assuming that the cell phone-only voters in those groups match up nicely with landline voters. (Nate Silver’s post on this is a must-read.)
As services like Facebook get better about collecting anonymous data on tens of millions of users and cross-referencing it to party affiliation and variables most pollsters haven’t even thought of yet — how do MST3K fans break down? — I can see us moving away from polls as the be-all end-all for demographic research and moving to study large troves of data based on millions of user profiles. Self-selection and self-ID remain valid concerns, but less and less so as Facebook penetrates deeper into every age and ethnic group and region of the country. Three years ago, I was able to use Facebook data to study how fans of popular movies, TV shows, and bands broke down ideologically, and how ideology shifted for individual ages (not just age groups, ages) year to year. I bet the data today would be even more interesting.
At Engage, we’ve started conducting experiments with large datasets we encounter based on actual voter behavior and not surveys. We’ve been able to track the extent of an opponent’s media buy by looking at Google search query data, and the likelihood of voters in individual counties to interact with a candidate in a teletownhall setting, based on sample sizes in the tens of thousands. The former allowed us to get a better sense of the precise day the polling started to move, and the latter turned out to be eerily prescient in predicting the final results. There are countless other experiments one could do with access to the right data, which is becoming more and more available.
None of this is to say that the discipline of marrying data mining and traditional survey research isn’t messy. Relying on metrics like counting Facebook fans or Google search query volume can be downright misleading because they’re subject to campaigns themselves manipulating the numbers or the digital equivalent of highway onlookers slowing down to gawk at a car wreck. You might be getting a lot of attention, but not for the right reason. Models will need to be built that account for the effect of celebrity candidates, with these less reliable data points occasionally discarded (as Nate Silver has said in predicting the Academy Awards, don’t let the model make you predict something you know is wrong).
Despite the obvious drawbacks, I find the opportunity presented by Big Data (the kind with millions, rather than just hundreds or thousands, of records) intensely exciting. Obama ’08 was a Big Data campaign. Instead of relying only on polls, they used trends collected daily in hundreds of thousands of voter IDs to allocate money in real time. Done right, we can use access to data to route around some of the shortcomings of traditional polls (cost, sample size limits, speed of data collection) in the same way that blogs and social media, albeit messier, have routed around the failures of elite media.
* The dropoff among very old voters, which manifests some in the real electorate but not as dramatically as on Facebook, can likely be explained by diminished overall online usage among the elderly. If you’re 80 and on Facebook, it’s demographically likely that not as many of your peers are on it, so you’re less likely to use it daily and hence click the button, among other factors.