Generally speaking, frequentist approaches posit that the world is one way (e.g., a parameter has one particular true value), and try to conduct experiments whose resulting conclusion -- no matter the true value of the parameter -- will be correct with at least some minimum probability.

As a result, to express uncertainty in our knowledge after an experiment, the frequentist approach uses a "confidence interval" -- a range of values designed to include the true value of the parameter with some minimum probability, say 95%. A frequentist will design the experiment and 95% confidence interval procedure so that out of every 100 experiments run start to finish, at least 95 of the resulting confidence intervals will be expected to include the true value of the parameter. The other 5 might be slightly wrong, or they might be complete nonsense -- formally speaking that's ok as far as the approach is concerned, as long as 95 out of 100 inferences are correct. (Of course we would prefer them to be slightly wrong, not total nonsense.)

Bayesian approaches formulate the problem differently. Instead of saying the parameter simply has one (unknown) true value, a Bayesian method says the parameter's value is fixed but has been chosen from some probability distribution -- known as the prior probability distribution. (Another way to say that is that before taking any measurements, the Bayesian assigns a probability distribution, which they call a belief state, on what the true value of the parameter happens to be.) This "prior" might be known (imagine trying to estimate the size of a truck, if we know the overall distribution of truck sizes from the DMV) or it might be an assumption drawn out of thin air. The Bayesian inference is simpler -- we collect some data, and then calculate the probability of different values of the parameter GIVEN the data. This new probability distribution is called the "a posteriori probability" or simply the "posterior." Bayesian approaches can summarize their uncertainty by giving a range of values on the posterior probability distribution that includes 95% of the probability -- this is called a "95% credibility interval."
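In symbols, the two guarantees described so far can be sketched roughly as follows. The notation is an assumption added here, not part of the original discussion: $\theta$ is the parameter, $x$ the observed data, $C(x)$ the frequentist confidence interval, $I(x)$ the Bayesian interval, and $p(\theta)$ the prior. The sum becomes an integral when the parameter is continuous.

```latex
% Frequentist coverage: the guarantee must hold for every possible true value
\Pr\bigl[\theta \in C(x) \mid \theta\bigr] \ge 0.95 \quad \text{for all } \theta

% Bayesian posterior, and the 95% interval for the data actually observed
p(\theta \mid x) = \frac{p(x \mid \theta)\, p(\theta)}{\sum_{\theta'} p(x \mid \theta')\, p(\theta')},
\qquad
\Pr\bigl[\theta \in I(x) \mid x\bigr] \ge 0.95
```

The first statement averages over the data for a fixed parameter; the second averages over the parameter for fixed data. That difference in what gets conditioned on is what the rest of this post is about.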

A Bayesian partisan might criticize the frequentist confidence interval like this: "So what if 95 out of 100 experiments yield a confidence interval that includes the true value? I don't care about 99 experiments I DIDN'T DO; I care about this experiment I DID DO. Your rule allows 5 out of the 100 to be complete nonsense [negative values, impossible values] as long as the other 95 are correct; that's ridiculous."

A frequentist die-hard might criticize the Bayesian credibility interval like this: "So what if 95% of the posterior probability is included in this range? What if the true value is, say, 0.37? If it is, then your method, run start to finish, will be WRONG 75% of the time. Your response is, 'Oh well, that's ok because according to the prior it's very rare that the value is 0.37,' and that may be so, but I want a method that works for ANY possible value of the parameter. I don't care about 99 values of the parameter that IT DOESN'T HAVE; I care about the one true value IT DOES HAVE. Oh also, by the way, your answers are only correct if the prior is correct. If you just pull it out of thin air because it feels right, you can be way off."

In a sense both of these partisans are correct in their criticisms of each other's methods, but I would urge you to think mathematically about the distinction. There don't need to be Bayesians and frequentists any more than there are realnumberists and integeristos; there are different kinds of methods that apply math to calculate different things. This is a complex subject with a lot of sides to it, of which these examples are a tiny part -- books on Bayesian analysis could fill many bookshelves, not to mention classical statistics, which would fill a whole library.

------------

Here's an extended example that shows the difference precisely in a discrete setting.

When I was a child my mother used to occasionally surprise me by ordering a jar of chocolate-chip cookies to be delivered by mail. The delivery company stocked four different kinds of cookie jars -- type A, type B, type C, and type D, and they were all on the same truck and you were never sure what type you would get. Each jar had exactly 100 cookies, but the feature that distinguished the different cookie jars was their respective distributions of chocolate chips per cookie. If you reached into a jar and took out a single cookie uniformly at random, these are the probability distributions you would get on the number of chips:

A type-A cookie jar, for example, has 70 cookies with two chips each, and no cookies with four chips or more! A type-D cookie jar has 70 cookies with one chip each. Notice how each vertical column is a probability mass function -- the conditional probability of the number of chips you'd get, given that the jar = A, or B, or C, or D, and each column sums to 100.

I used to love to play a game as soon as the deliveryman dropped off my new cookie jar. I'd pull one single cookie at random from the jar, count the chips on the cookie, and try to express my uncertainty -- at the 70% level -- of which jars it could be. Thus it's the identity of the jar (A, B, C or D) that is the **value of the parameter** being estimated. The number of chips (0, 1, 2, 3 or 4) is the **outcome** or the observation or the sample.

Originally I played this game using a frequentist, 70% confidence interval. Such an interval needs to make sure that **no matter** the true value of the parameter, meaning no matter which cookie jar I got, the interval would cover that true value with at least 70% probability.

An interval, of course, is a function that relates an outcome (a row) to a set of values of the parameter (a set of columns). But to *construct* the confidence interval and guarantee 70% coverage, we need to work "vertically" -- looking at each column in turn, and making sure that at least 70% of the probability mass function is covered so that at least 70% of the time, that column's identity will be part of the interval that results. Remember that it's the vertical columns that form a p.m.f.

So after doing that procedure, I ended up with these intervals:

For example, if the number of chips on the cookie I draw is 1, my confidence interval will be {B,C,D}. If the number is 4, my confidence interval will be {B,C}. Notice that since each column sums to 70% or greater, then no matter which column we are truly in (no matter which jar the deliveryman dropped off), the interval resulting from this procedure will include the correct jar with at least 70% probability.

Notice also that the procedure I followed in constructing the intervals had some discretion. In the column for type-B, I could have just as easily made sure that the intervals that included B would be 0,1,2,3 instead of 1,2,3,4. That would have resulted in 75% coverage for type-B jars (12+19+24+20), still meeting the lower bound of 70%.
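The arithmetic for the type-B column can be checked with a short sketch. The values 12, 19, 24 and 20 are the ones quoted above; the 25 for four chips is inferred from the column summing to 100. The helper names `coverage` and `greedy_outcomes` are hypothetical, and the "most probable outcomes first" rule is only one possible way to use the discretion, not necessarily the rule I followed.

```python
# Type-B column: cookies (out of 100) by number of chips.
# 12, 19, 24, 20 are quoted in the text; 25 is inferred (column sums to 100).
TYPE_B = {0: 12, 1: 19, 2: 24, 3: 20, 4: 25}


def coverage(column, outcomes):
    """Percent of draws from this jar whose interval would include the jar."""
    return sum(column[k] for k in outcomes)


def greedy_outcomes(column, level=70):
    """Assumed rule: add the most probable outcomes first until coverage >= level."""
    chosen, total = [], 0
    for chips, pct in sorted(column.items(), key=lambda kv: kv[1], reverse=True):
        if total >= level:
            break
        chosen.append(chips)
        total += pct
    return sorted(chosen)


print(greedy_outcomes(TYPE_B))         # outcomes whose interval includes B
print(coverage(TYPE_B, [1, 2, 3, 4]))  # the choice described in the text
print(coverage(TYPE_B, [0, 1, 2, 3]))  # the alternative choice
```

Under this assumed rule the sketch picks outcomes 1, 2, 3 and 4 for type B, with 88% coverage, and the alternative covers 75%. Both clear the 70% bar, which is exactly the discretion described above.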

My sister Bayesia thought this approach was crazy, though. "You have to consider the deliveryman as part of the system," she said. "Let's treat the identity of the jar as a random variable itself, and let's *assume* that the deliveryman chooses among them uniformly -- meaning he has all four on his truck, and when he gets to our house he picks one at random, each with uniform probability."

"With that assumption, now let's look at the joint probabilities of the whole event -- the jar type **and** the number of chips you draw from your first cookie," she said, drawing the following table:

Notice that the whole table is now a probability mass function -- meaning the whole table sums to 100%.
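Here is a minimal sketch of how one column of that joint table comes about, using only the type-B column (the 25 for four chips is inferred from the column summing to 100) and Bayesia's assumed uniform prior of 1/4 per jar. The function name `joint_column` is hypothetical.

```python
# Type-B column in percent; 25 for four chips is inferred (column sums to 100).
TYPE_B = {0: 12, 1: 19, 2: 24, 3: 20, 4: 25}
PRIOR_B = 0.25  # Bayesia's assumption: each of the four jar types is equally likely


def joint_column(conditional, prior):
    """P(jar, chips) = P(chips | jar) * P(jar), kept in percent units."""
    return {chips: pct * prior for chips, pct in conditional.items()}


joint_b = joint_column(TYPE_B, PRIOR_B)
print(joint_b)                # {0: 3.0, 1: 4.75, 2: 6.0, 3: 5.0, 4: 6.25}
print(sum(joint_b.values()))  # 25.0, i.e. the prior probability of a type-B jar
```

Each of the four columns contributes 25% in the same way, which is why the whole table sums to 100%.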

"Ok," I said, "where are you headed with this?"

"You've been looking at the conditional probability of the number of chips, given the jar," said Bayesia. "That's all wrong! What you really care about is the conditional probability of which jar it is, given the number of chips on the cookie! Your 70% interval should simply include the list of jars that, in total, have 70% probability of being the true jar. Isn't that a lot simpler and more intuitive?"

"Sure, but how do we calculate that?" I asked.

"Let's say we **know** that you got 3 chips. Then we can ignore all the other rows in the table, and simply treat that row as a probability mass function. We'll need to scale up the probabilities proportionately so each row sums to 100, though."

She did:

"Notice how each row is now a p.m.f., and sums to 100%. We've flipped the conditional probability from what you started with -- now it's the probability of the man having dropped off a certain jar, given the number of chips on the first cookie."

"Interesting," I said. "So now we just circle enough jars in each row to get up to 70% probability?" We did just that, making these credibility intervals:
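The two steps, rescaling a row and then circling jars, can be sketched like this. The input row below is made up purely so the code runs; it is NOT a row from the table above. The function names are hypothetical, and adding jars from most to least probable is an assumed rule, since any set of jars reaching 70% would qualify.

```python
def posterior_row(joint_row):
    """Rescale one row of the joint table so that it sums to 1."""
    total = sum(joint_row.values())
    return {jar: p / total for jar, p in joint_row.items()}


def credible_set(posterior, level=0.70):
    """Assumed rule: add jars from most to least probable until total >= level."""
    chosen, total = [], 0.0
    for jar, p in sorted(posterior.items(), key=lambda kv: kv[1], reverse=True):
        if total >= level:
            break
        chosen.append(jar)
        total += p
    return sorted(chosen), total


# Hypothetical joint-table row in percent (illustrative numbers only).
toy_row = {"A": 12.0, "B": 6.0, "C": 0.0, "D": 2.0}
post = posterior_row(toy_row)
jars, mass = credible_set(post)
print(post)
print(jars, round(mass, 2))
```

With this toy row the most probable jar alone falls short of 70%, so a second jar has to be circled as well.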

Each interval includes a set of jars that, *a posteriori*, sum to 70% probability of being the true jar.

"Well, hang on," I said. "I'm not convinced. Let's put the two kinds of intervals side-by-side and compare them for coverage and, assuming that the deliveryman picks each kind of jar with equal probability, credibility."

Here they are:

**Confidence intervals:**

"See how crazy your confidence intervals are?" said Bayesia. "You don't even have a sensible answer when you draw a cookie with zero chips! You just say it's the empty interval. But that's obviously wrong -- it has to be one of the four types of jars. How can you live with yourself, stating an interval at the end of the day when you **know the interval is wrong?** And ditto when you pull a cookie with 3 chips -- your interval is only correct 41% of the time. Calling this a '70%' confidence interval is bullshit."

"Well, hey," I replied. "It's correct 70% of the time, no matter which jar the deliveryman dropped off. That's a lot more than you can say about your credibility intervals. What if the jar is type B? Then your interval will be wrong 80% of the time, and only correct 20% of the time!"

"This seems like a big problem," I continued, "because your mistakes will be correlated with the type of jar. If you send out 100 'Bayesian' robots to assess what type of jar you have, each robot sampling one cookie, you're telling me that on type-B days, you will expect 80 of the robots to get the wrong answer, each having >73% belief in its incorrect conclusion! That's troublesome, especially if you want most of the robots to agree on the right answer."

"PLUS we had to make this assumption that the deliveryman behaves uniformly and selects each type of jar at random," I said. "Where did that come from? What if it's wrong? You haven't talked to him; you haven't interviewed him. Yet all your statements of *a posteriori* probability rest on this statement about his behavior. I didn't have to make any such assumptions, and my interval meets its criterion even in the worst case."

"It's true that my credibility interval does perform poorly on type-B jars," Bayesia said. "But so what? Type B jars happen only 25% of the time. It's balanced out by my good coverage of type A, C, and D jars. And I never publish nonsense."

"It's true that my confidence interval does perform poorly when I've drawn a cookie with zero chips," I said. "But so what? Chipless cookies happen, at most, 27% of the time in the worst case (a type-D jar). I can afford to give nonsense for this outcome because NO jar will result in a wrong answer more than 30% of the time."

"The column sums matter," I said.

"The row sums matter," Bayesia said.

"I can see we're at an impasse," I said. "We're both correct in the mathematical statements we're making, but we disagree about the appropriate way to quantify uncertainty."

"That's true," said my sister. "Want a cookie?"

I very much like your story about cookies. But I don't think Bayesia has any reason to claim "Type B jars happen only 25% of the time." I think a fairer characterization of the Bayesian position would be "I have no reason to think Type B jars are especially common, so I don't think poor performance on Type B is that big of a problem." Regardless, nice post.

P.S. Did you happen to post a version of this on a HackerNews comment many years ago? Or maybe someone else did? I have a faint recollection of this 4x4 confidence interval story to illustrate the difference between frequentists and Bayesians.

The story is great at explaining the conceptual differences between the two approaches. However, the way the performances are currently compared leaves the impression that it is a matter of taste which method to prefer. I think this is misleading. You never explicitly say how one should evaluate the performance of a given method, but you create the impression that it is simply a matter of how often the true value is contained in the interval. However, by that criterion one can have a perfect method: simply always predict the interval that contains all jars. This is guaranteed to be true 100% of the time. But obviously such a prediction would be useless. The value of a prediction is given by how much it narrows down the possibilities. So one is inherently trading the risk of a wrong prediction against making more specific predictions.

Once one realizes that, it becomes rather telling that the Bayesian method not only makes a prediction for every case (i.e. it does not refuse to make a prediction when one got zero chips) but the intervals are significantly shorter on average as well.

If one wants to compare the methods in a rigorous way one would have to specify what the value V(n) is of a prediction that narrows the number of possibilities down from 4 to n if it turns out to be correct, and what the cost C(n) is, if it is incorrect.

Once one does that, you will find that the optimal intervals are the ones that maximize the Bayesian posterior expectation of V-C.