Wednesday, February 28, 2007

Ignoring Base Rates

One irksome thing I encounter in my job is somebody getting unduly excited about a statistic they’ve read, like “participation in kayaking doubled in the last 10 years and is now the fastest growing recreational activity in the country” or “female participation in hunting is up 70% from 1990,” when the impressive growth basically comes down to a small base. It’s very easy for something to be up a lot in percentage terms when the original number is very low. But it’s not as exciting to report “in 2000, 5% of the US adult population participated in kayaking, up from 2% in 1990,” or similarly unimpressive percentage-point increases. [Please note that I made up all these numbers.]
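Since the arithmetic is the whole trick, here’s a quick sketch in Python, using the made-up kayaking numbers above, of how the very same change looks in relative terms versus percentage-point terms:

    # The same (invented) change, framed two ways.
    base_1990 = 0.02   # 2% of US adults kayaked in 1990 (made-up number)
    rate_2000 = 0.05   # 5% in 2000 (made-up number)

    relative_growth = (rate_2000 - base_1990) / base_1990
    point_change = (rate_2000 - base_1990) * 100

    print(f"Relative growth: {relative_growth:.0%}")        # "up 150%!"
    print(f"Change: {point_change:.0f} percentage points")  # "up 3 points"

Same data, two headlines; only the small base makes the first one sound dramatic.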

The situation is worsened by the fact that frequently, given a reasonably high sample size, small differences between two high or two low percentages are statistically significant. And people have real trouble understanding that statistical significance and substantive significance (or meaningfulness) aren’t the same thing. Statistical significance can tell you how sure you can be that the difference exists, but not how large the difference is or whether it means anything. For example, if I could talk to every single adult in this country and find out that 64% of women like dogs and 65% of men like dogs, I could say with assurance that men are more likely to like dogs than women are, but the difference is tiny, and what would be the implication of this difference? Probably nothing.
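To put a number on it: suppose those dog figures came not from a census but from a survey of a million women and a million men (sample sizes invented for illustration). A standard two-proportion z-test, sketched here in Python, would still declare the one-point gap wildly significant:

    # Two-sided z-test for the difference between two proportions.
    # Standard library only; all figures are illustrative.
    from math import sqrt, erf

    def two_proportion_z(p1, n1, p2, n2):
        pooled = (p1 * n1 + p2 * n2) / (n1 + n2)  # pooled proportion
        se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
        z = (p2 - p1) / se
        p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))  # two-sided
        return z, p_value

    # 64% of a million women vs. 65% of a million men like dogs:
    z, p = two_proportion_z(0.64, 1_000_000, 0.65, 1_000_000)
    print(f"z = {z:.1f}, p = {p:.3g}")  # z is about 15; p is effectively 0

Statistically airtight, substantively a shrug.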

So next time you hear a statistic reported like this, remember, it might not mean much of anything at all, despite how impressive the numbers sound. Base rates matter. Doubling the number of football players majoring in electrical engineering at American universities may only require one more guy doing it.

2 comments:

Anonymous said...

The statistics thing that drives me the most nutso is when people tout numbers that differ in the predicted direction even though the difference is not statistically significant. That means the difference is uncomfortably likely to be due to random factors. Uh, why are you computing statistics if you are going to ignore them? If the difference is real, it just isn't big enough to show up with such a small sample, so if you're going to insist you're right, you need to get a bigger sample. (A quick simulation below makes the point.)

When I was a typist for professors of science--hard science even--I once had to type a sentence like that in a manuscript. I don't type crap without an argument. He still made me type it.
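For what it's worth, "directional" by itself is cheap. Here's a quick simulation sketch (all numbers invented) showing that even when two groups are truly identical, the sample gap points in the predicted direction about half the time:

    # When there is NO real difference, how often does the observed gap
    # still "favor" the predicted group? (All numbers invented.)
    import random

    random.seed(1)
    trials, n_per_group, true_rate = 10_000, 50, 0.5
    favored = 0
    for _ in range(trials):
        a = sum(random.random() < true_rate for _ in range(n_per_group))
        b = sum(random.random() < true_rate for _ in range(n_per_group))
        if a > b:          # gap happens to point the "predicted" way
            favored += 1
    print(favored / trials)  # about 0.46 (the rest are ties or reversals)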

Sally said...

Yeah, that's a good one too. You'll see stuff like "the findings are directional, though not statistically significant." I can understand how this happens (though I don't excuse it): people often haven't done a power analysis ahead of time to figure out what kind of sample they need, or they misjudge how the split in the numbers will turn out or something, and it's not easy to go back and increase your sample later. And when you end up with "obviously" different numbers like 27% vs. 43% and a p = .055, it just kills you to have to let that go. (A rough sketch of that kind of sample-size calculation is below.)

This is sort of related to that other thing where people will act as though failing to reject a null hypothesis is the same as proving that the null hypothesis is correct, even when they didn't have the power to detect a difference even if it existed. (And it doesn't really work that way anyway.)
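For concreteness, here's a rough sketch of that kind of power calculation: the standard two-proportion sample-size formula applied to the made-up 27%/43% split above, at alpha = .05 two-sided with 80% power:

    # Per-group sample size for a two-sided two-proportion test.
    # Standard library only; the 27%/43% split is the invented one above.
    from math import sqrt, ceil

    def n_per_group(p1, p2, z_alpha=1.96, z_beta=0.8416):
        # z_alpha: critical value for alpha = .05 two-sided
        # z_beta: critical value for 80% power
        p_bar = (p1 + p2) / 2
        num = (z_alpha * sqrt(2 * p_bar * (1 - p_bar))
               + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
        return ceil(num / (p1 - p2) ** 2)

    print(n_per_group(0.27, 0.43))  # about 139 per group

So if you only collected, say, 50 per group, coming up short on significance doesn't tell you much about whether the difference is real.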