Why you shouldn’t always trust data

We live in a world of big data.

Every day, new data sets are released and new jobs are created for data scientists.

People are predisposed to trust data because, well, it’s about numbers, right? Data is objective, free from personal influence. Three is three. I can’t just tell you that three is two. You wouldn’t believe me. You’re not that easily fooled.

But what happens when the numbers are right, but the analysis is off?

The Providence Journal recently gave us an example of using solid numbers, but failing to consider all aspects of the situation during the analysis:

The U.S. Census Bureau says Rhode Island has about 770,000 adult citizens and that 73.5 percent of them are registered to vote.

So how many voters does the state have on its rolls?

Need help? Here’s a hint: 73.5 percent of 770,000 is about 566,000.

If that’s your guess, you’re wrong.

Rhode Island’s voting list claims more than 748,000 people.

That’s 180,000 more than the Census Bureau numbers suggest belong there.

And 20 of Rhode Island’s 39 municipalities, from the largest city to the smallest town, had more registered voters than it had citizens old enough to vote.
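If you’d like to check the arithmetic in that excerpt yourself, here is a quick sketch in Python. It uses only the three figures quoted above; the variable names are mine, and because the voting list is described as "more than 748,000," the gap it computes is a floor, not an exact count.

```python
# Figures quoted in the Providence Journal excerpt above.
adult_citizens = 770_000        # Census Bureau estimate of adult citizens
registered_rate_tenths = 735    # 73.5 percent, kept in tenths of a percent to avoid float rounding
voters_on_rolls = 748_000       # "more than 748,000", so treat this as a lower bound

# How many registered voters the Census Bureau figures would suggest.
expected_registered = adult_citizens * registered_rate_tenths // 1000

# How many more names are on the rolls than that estimate.
surplus = voters_on_rolls - expected_registered

# The rolls as a share of all adult citizens, in percent.
rolls_share_pct = 100 * voters_on_rolls / adult_citizens

print(f"Expected registered voters: {expected_registered:,}")
print(f"Names on the rolls beyond that: {surplus:,}")
print(f"Rolls as a share of adult citizens: {rolls_share_pct:.1f}%")
```

The numbers hold up: about 566,000 expected, a gap of roughly 182,000, and rolls that cover about 97 percent of adult citizens. The arithmetic is fine. What it can’t tell you is why the gap exists.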

The article goes on to explain the possible reason for this crazy disparity: It’s hard to track voters. People move and people die, and these important life events are not always recorded.

Most people assume that if you change your address with the post office or the DMV, your voter registration is automatically updated. But that’s not actually the case. Changing your postal address has nothing to do with your voter registration and, in Pennsylvania, you have to check a special box on your DMV form to change voter registration.

Didn’t know that? It’s OK. The helpful polling officials at my voting location didn’t know it either. When I asked for a voter registration change-of-address form, they told me just to change my address with the post office, and my voter registration would be updated.

You might be asking yourself at this point, ‘Why does this matter? Who cares if there are more registered voters than citizens in a given municipality? That doesn’t mean that there’s voter fraud.’

But that’s exactly why it does matter. People can look at numbers like 770,000, 73.5 percent and 748,000 and infer wrongdoing, even if there isn’t any.

Such perceived wrongdoing could spur any number of reactions, including calls for voter ID laws.

Another example of dangerous data analysis is Google Flu Trends. Here’s Google’s explanation of the project:

“We've found that certain search terms are good indicators of flu activity. Google Flu Trends uses aggregated Google search data to estimate current flu activity around the world in near real-time.”

Awesome! You can literally see the flu spreading in your geographic area with Google’s Flu Trends maps. Google even has years of data that lend credibility to the correlation it suggests between actual flu outbreaks and its search-term tracking algorithm.

So what went wrong in 2012, when Flu Trends substantially overestimated how much flu was going around? One explanation is that Google was relying too heavily on only one form of data collection: search terms. Another explanation is that people are hypochondriacs who, if given the chance, will WebMD themselves into a myocardial infarction.

What is the takeaway here? Are we supposed to distrust those seemingly faithful numbers we love so much?

Certainly not.

However, when we look at data sets that relate to humans, we must be careful to account for that relevant variable, the subjective human, in our equations.

Reach Allie Kanik at 412-350-0264 or akanik@publicsource.org.