Wednesday, 30 September 2009

What a Statistician's Investigative Instincts Can Tell Us About Modelling Risk

Man, James Kwak wasn't kidding about a neat little time waster ... he laid a link honeypot over at Baseline Scenario that led to here, and I chased it, and I must have blown ... oh, God all the minutes. And I really had stuff to do today. Drat.

At the heart of "Are Oklahoma Students Really This Dumb? Or Is Strategic Vision Really This Stupid?" lies the question: Did a pollster finesse a little data to please its conservative client?

Here's what's going on (and please do chase the link, but be forewarned -- it's a time eater): Nate Silver, author of FiveThirtyEight.com (the number of electors in that quaint anachronism we call an electoral college), smelled a rat when an outfit called Strategic Vision released the results of another one of those "you won't believe how stupid Americans are" polls. It showed Oklahoma high school students were pretty close to drop-dead dumb (only 23% reportedly could identify George Washington as the first president). Not a single one of 1,000 students could even correctly answer as many as eight of ten basic questions, mainly civics stuff (example: "We elect a U.S. senator for how many years?").

The poll was commissioned by what Silver describes as a conservative-leaning Oklahoma group. Obviously, the results supply ammunition to those who decry the parlous state of public schools. Which, come to think of it, pretty well describes a significant chunk of right wingers. Hmm, Silver thought, something stinks here. After all, it's kind of hard to fathom that not one Oklahoma high schooler out of 1,000 happened to be one of those annoying Poindexter types who's always got his hand up in class and seems to blurt out the answer to everything. What are the odds?

Exactly. I enjoyed Silver's statistical analysis, though I found it a little hard to digest on a first scan. Judging by the comments (and he got a slew!), I think his argument sailed over a lot of heads. But there's a really neat statistical lesson embedded in the detective work he did, so I'm going to revisit it here.

Silver, after sensing something fishy, decided to try to square two bits of information. (1) The results of the ten questions that pollsters claimed to have asked. For instance, 28% of the students got question number 1 right, 26% question number 2, 27% question number 3, etc. (2) The breakdown of how many students answered how many questions correctly. For instance, 24.6% got two questions right, but only 8% managed to get five. And so on.

You may wonder, what is there to "square" here? Well, we have a distribution issue at hand that turns out to be quite fascinating. Namely, how are all these correct answers distributed among the 1,000 students? To illustrate why this matters, look at the first two questions alone. The first was answered correctly by 280 of 1,000 students (28%), the second by 260 (26%). Now consider two radically different scenarios:

1. It's Harrison Bergeron High School, and all students have exactly the same chance of getting any question right or wrong. So that means 280 got the first question right, and 26 percent (73 people) of that group also knew the answer to the second question. That leaves 207 who got only the first question right and 187 (260 minus 73) who got only the second, or 394 with exactly one correct. So, based on my all-students-are-equal assumption, the distribution of correct answers now looks like:
0 correct: 533 (53.3%)
1 correct: 394 (39.4%)
2 correct: 73 (7.3%)

2. For the sake of making a point, let's say these 1,000 students are wildly disparate. There's a group of 200 pretty smart students and 800 not-so-bright ones. Let's say the 200 smart kids got both questions right, but nobody in the not-so-bright group did. In this case the distribution looks like:
0 correct: 660 (66%)
1 correct: 140 (14%)
2 correct: 200 (20%)
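
For anyone who wants to check the arithmetic behind the two scenarios, here is a minimal Python sketch. The function names are mine, made up for illustration, and the only inputs are the numbers already quoted: 1,000 students, 28% and 26% correct on the first two questions, and my invented group of 200 smart kids.

```python
def independent_breakdown(n, p1, p2):
    """Expected counts with 0, 1, 2 correct when every student has the
    same chance on each question (answers independent of each other)."""
    both = n * p1 * p2
    one = n * (p1 * (1 - p2) + (1 - p1) * p2)
    none = n * (1 - p1) * (1 - p2)
    return [round(none), round(one), round(both)]


def two_group_breakdown(n, q1_right, q2_right, smart):
    """Counts when `smart` students get both questions right and nobody
    else gets both (the exaggerated two-group scenario)."""
    only_q1 = q1_right - smart   # right on question 1 only
    only_q2 = q2_right - smart   # right on question 2 only
    one = only_q1 + only_q2
    none = n - smart - one
    return [none, one, smart]


same_chance = independent_breakdown(1000, 0.28, 0.26)
two_groups = two_group_breakdown(1000, 280, 260, 200)

for label, counts in (("same chance", same_chance), ("two groups", two_groups)):
    print(label, counts, "total:", sum(counts))
```

Whatever the assumption, the three counts have to add back up to 1,000 students, which is a handy sanity check on any breakdown like this.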

Notice that the breakdowns are hugely different. Of course I've exaggerated the groups to make a point. But what's more realistic? In any group of 1,000 people, we'd expect to find the smart, the not-so-smart, the kind-of-dumb: a variety of intelligences, in other words. So what Silver does is generate a hypothetical set of scores, one per student, for two different populations: (1) the unreal population I created in example number one, and (2) a normal population (his example is much more normal than mine above) of kids who are smart and dumb and everything in between.

And he makes a real "ah hah!" discovery: the distribution for the "unreal" population produces a cluster of points almost identical to what Strategic Vision found. Which makes you wonder if these are, well, really real kids and a real poll ... or someone in a back room just whipped up some data.
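
To get a feel for that comparison, here is a rough Python sketch of the idea. It is not Silver's code or his numbers. I'm assuming a flat 27% chance of a correct answer on all ten questions (the poll's actual per-question figures varied), and for the "real" population I'm assuming each student's personal hit rate is drawn from a Beta(1.08, 2.92) distribution, picked only because it also averages 27%.

```python
import random
import statistics
from collections import Counter

N_STUDENTS = 1000
N_QUESTIONS = 10
MEAN_RATE = 0.27  # assumption: one flat rate stands in for all ten questions


def simulate_scores(draw_ability, rng):
    """Return one score (0-10 correct) per simulated student."""
    scores = []
    for _ in range(N_STUDENTS):
        p = draw_ability(rng)  # this student's chance on every question
        scores.append(sum(rng.random() < p for _ in range(N_QUESTIONS)))
    return scores


def high_scorers(scores, cutoff=8):
    """Count students who got at least `cutoff` questions right."""
    return sum(s >= cutoff for s in scores)


rng = random.Random(2009)  # fixed seed so the run is repeatable

# "Unreal" population: every student has the identical chance.
identical = simulate_scores(lambda r: MEAN_RATE, rng)
# Mixed population: abilities vary, but still average 27% (assumed Beta shape).
varied = simulate_scores(lambda r: r.betavariate(1.08, 2.92), rng)

for label, scores in (("identical", identical), ("varied", varied)):
    hist = Counter(scores)
    print(label,
          [hist.get(k, 0) for k in range(N_QUESTIONS + 1)],
          "8+ correct:", high_scorers(scores),
          "variance:", round(statistics.pvariance(scores), 2))
```

Under these assumptions the identical-students population bunches tightly around two or three correct and almost never produces an eight-or-better, while the mixed population spreads out and turns up a handful of Poindexters. The exact counts depend on the seed and on my made-up ability distribution, so treat the shape, not the numbers, as the takeaway.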

Great statistical analysis.

The relevance to finance: Silver's analysis is really about correlation. Is it more likely you'll get question two correct if you got number one correct? The answer to that should be clear: Of course it is. Smarter students are more likely to get both correct; kids who slept through half of grade school and flunked every other course have a greater chance of getting neither right. There is a positive, meaningful correlation.
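
One way to put a number on that correlation is the phi coefficient, which is just the ordinary correlation applied to two right/wrong answers. The sketch below is my own illustration, not something from Silver's post. It compares the all-students-are-equal case with my exaggerated two-group example, where 200 got both questions right, 80 got only the first (280 minus 200), 60 got only the second (260 minus 200), and 660 got neither.

```python
from math import sqrt


def phi(both, only_first, only_second, neither):
    """Phi coefficient for a 2x2 table of right/wrong on two questions."""
    first = both + only_first          # got question 1 right
    second = both + only_second        # got question 2 right
    not_first = only_second + neither  # got question 1 wrong
    not_second = only_first + neither  # got question 2 wrong
    return (both * neither - only_first * only_second) / sqrt(
        first * not_first * second * not_second
    )


n, p1, p2 = 1000, 0.28, 0.26

# Identical students: expected counts when the two answers are independent.
independent = phi(n * p1 * p2,
                  n * p1 * (1 - p2),
                  n * (1 - p1) * p2,
                  n * (1 - p1) * (1 - p2))

# Two-group example: counts taken from the exaggerated scenario.
two_groups = phi(200, 80, 60, 660)

print("independent:", round(abs(independent), 3))
print("two groups: ", round(two_groups, 3))
```

The independent case comes out at zero by construction, and the two-group case comes out strongly positive. Real students presumably sit somewhere in between, but not at zero.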

Before this financial crisis, there were a lot of busy modelers behind the scenes who thought they had managed to evaporate risk into the mist. One element of their flawed models now worth revisiting: the degree to which variables are correlated, in ways that we aren't immediately aware of. Is the chance of A happening truly independent of the chance of B, or is there some kind of subtle multiplier effect resonating through the equation?

As Buffett famously said, "Beware of geeks bearing formulas." I'm very suspicious of clean models that attempt to predict complex systems. I think there are a lot of interactions that aren't understood well at all, especially in this world of modern finance.