Earlier in the week a couple of Microsoft researchers released a study of cybercrime financial-loss statistics (Sex, Lies and Cybercrime Surveys - PDF link). In essence, their research indicates that bad sampling, survey, and statistical methods have led to a number of dubious results. I think most of us in the industry have known this intuitively for a while. Any time you have metrics purportedly measuring the same thing that vary by factors of 10 to 1,000, something isn't quite right.
The conclusion of the paper is essentially that estimates of the cybercrime economy are grossly exaggerated, and the authors make the point well enough that I won't belabor it here. Go read the actual article (linked above). I'm more interested in how this applies to other areas and studies. Here are the points I think are particularly relevant.
- Heavy Tails. Means (averages) are most useful when the data cluster closely around them. When the distribution is very wide, people will have trouble understanding what the results mean. For example, saying that the average cost of a DVD player is $100 tells you nothing meaningful about the market for DVD players: costs range so broadly that the mean lands almost arbitrarily somewhere in the middle.
- Garbage In, Garbage Out (GIGO). Since the data in these studies are typically collected through surveys, it's impossible to verify their integrity. Some people outright lie, but others simply don't know their true costs and are guessing. Any individual guess may be higher or lower than the actual figure, but since reported losses can never be negative, the errors almost necessarily push the aggregate above the true value. By how much is impossible to know.
- Attribution. It's not easy to know where fraud came from. How do you know that somebody stole your credit card number from an online database, versus going through your trash or copying the card at the restaurant down the street? This kind of attribution is especially hard for consumers, who often learn of an incident only after actual fraud occurs or when they are issued a new card. If both happen within a year or so, the consumer is likely to conclude that one caused the other, though as we know, correlation does not imply causation.
- Self-Selecting Population. The people who respond have at least one thing in common: they return surveys. They may have other things in common too, like a tendency to overestimate numbers, a particular susceptibility to cybercrime, or any of dozens of other traits that could undermine the validity of these studies.
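The heavy-tails and GIGO points above can be sketched with a quick simulation. This is a hypothetical illustration, not data from any of the studies discussed: it assumes lognormal losses and symmetric guessing error, and the parameters are made up.

```python
import random
import statistics

random.seed(0)  # reproducible illustration

# Assume true per-victim losses follow a heavy-tailed lognormal
# distribution: most losses are modest, a few are enormous.
true_losses = [random.lognormvariate(4, 2) for _ in range(10_000)]

mean_loss = statistics.mean(true_losses)
median_loss = statistics.median(true_losses)
# The tail drags the mean far above the typical (median) loss, so
# quoting "the average loss" misleads about the typical victim.
print(f"true mean: {mean_loss:,.0f}  true median: {median_loss:,.0f}")

# Now model survey respondents who guess their losses: the guessing
# error is symmetric, but a reported loss can never be negative, so
# errors are clipped at zero and the aggregate is biased upward.
reported = [max(0.0, loss + random.gauss(0, 50)) for loss in true_losses]
print(f"survey mean: {statistics.mean(reported):,.0f}")
```

Running this shows both effects at once: the mean sits several times above the median, and the survey-style estimate comes out higher than the true mean even though no individual respondent is systematically dishonest.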
This isn't a problem that affects only cybercrime statistics, though. The Ponemon Institute annually puts out a similar report on losses due to breaches (as well as a report on cybercrime). Its methodology is similar to the ones discussed in the Microsoft paper, and therefore suffers from some of the same flaws. To get results that consistently trend the way expectations point, I suspect some data manipulation goes on, which would add yet another layer of bad science (if true - I have only my gut instinct to go on, not any facts).
One group that does try obsessively to get the science right is Verizon Business, which puts out an annual breach report. It uses much better science and statistics and can be counted on for some rigor. Results can vary wildly year over year because new data populations keep being introduced (several groups contribute their figures, most serving a different demographic), but that should normalize over the next five years as the data set grows large enough that new populations skew it less. The raw data is collected and published openly, allowing genuine peer review - an important step in ensuring the validity of results and conclusions.
But these studies don't have to be done poorly. By changing their approach, researchers could get much better results. For example, if instead of surveying consumers the authors had been able to get the information from banks, the numbers would likely have been very different. Banks have objective measurements of customer losses, a properly large sample size, a randomly distributed population, and likely better knowledge of the source of the fraud.
Although many of these studies fail at basic science, I'm hopeful that the information security industry will get better, both at true academic research and at producing accurate public metrics for the measurements that matter most. We'll get there as we mature as an industry, but it will take a while. Until then, stay skeptical.