First, the format of the book. It is a short book of 127 pages, plus 40 pages of glossary, appendices, references and index. I eventually found the name of the publisher, Sebtel Press, but for a while thought the book was self-produced. While the LaTeX output is fine and the (Matlab) graphs readable, pictures are not of the best quality and the display editing is minimal in that there are several huge white spaces between pages. Nothing major there, obviously, it simply makes the book look like course notes, but this is in no way detrimental to its potential appeal. (I will not comment on the numerous appearances of Bayes’ alleged portrait in the book.)
“… (on average) the adjusted value θ^{MAP} is more accurate than θ^{MLE}.” (p.82)
Bayes’ Rule has the interesting feature that, in the very first chapter, after spending a rather long time on Bayes’ formula, it introduces Bayes factors (p.15), with the somewhat confusing choice of calling the prior probabilities of hypotheses marginal probabilities. Even though they are indeed marginals of the joint, the term marginal is usually reserved for the sample, as in marginal likelihood. The chapter then returns to more (binary) applications of Bayes’ formula. The second chapter is about probability theory, which here means introducing the three axioms of probability and discussing geometric interpretations of those axioms and of Bayes’ rule. Chapter 3 moves to the case of discrete random variables with more than two values, i.e. contingency tables, on which the range of probability distributions is (re-)defined, producing a new entry to Bayes’ rule, and to the MAP. Given this pattern, it is not surprising that Chapter 4 does the same for continuous parameters, namely the parameter of a coin flip. This allows for a discussion of uniform and reference priors, including maximum entropy priors à la Jaynes, of bootstrap samples presented as approximating the posterior distribution under the “fairest prior”, and even of two pages on standard loss functions. This chapter is followed by a short chapter dedicated to estimating a normal mean, then another short one exploring the notion of a continuous joint (Gaussian) density.
“To some people the word Bayesian is like a red rag to a bull.” (p.119)
Bayes’ Rule concludes with a chapter entitled Bayesian wars, a rather surprising choice given the intended audience, and one rather bound to confuse it… The first part is about probabilistic ways of representing information, leading to subjective probability. The discussion goes on for a few pages to justify the use of priors, but I find completely unfair the argument that, because Bayes’ rule is a mathematical theorem, it “has been proven to be true”. It is indeed a maths theorem, however that does not imply that any inference based on this theorem is correct! (A surprising parallel is Kadane’s Principles of Uncertainty with its anti-objective final chapter.)
All in all, I remain puzzled after reading Bayes’ Rule. Puzzled by the intended audience, as, contrary to other books I recently reviewed, the author does not shy away from mathematical notations and concepts, even though he proceeds quite gently through the basics of probability. Potential readers therefore need some modicum of mathematical background that some students may lack (although it actually corresponds to what my kids would have learned in high school). The book could thus constitute a soft entry to Bayesian concepts before taking a formal course on Bayesian analysis, hence doing no harm to the perception of the field.
_________________________________________________________________
Problem A
Let be a random variable with the density function where . For each realized value , the conditional variable is uniformly distributed over the interval , denoted symbolically by . Obtain solutions for the following:
_________________________________________________________________
Problem B
Let be a random variable with the density function where . For each realized value , the conditional variable is uniformly distributed over the interval , denoted symbolically by . Obtain solutions for the following:
_________________________________________________________________
Discussion of Problem A
Problem A-1
The support of the joint density function is the unbounded lower triangle in the xy-plane (see the shaded region in green in the figure below).
Figure 1
The unbounded green region consists of vertical lines: for each , ranges from to (the red vertical line in the figure below is one such line).
Figure 2
For each point on each vertical line, we assign a density value, which is a positive number. Taken together, these density values integrate to 1.0 and describe the behavior of the variables and across the green region. If a realized value of is , then the conditional density function of is:
Thus we have . In our problem at hand, the joint density function is:
As indicated above, the support of is the region and (the region shaded green in the above figures).
Problem A-2
The unconditional density function of (given above in the problem) is that of the sum of two independent exponential variables with a common density (see this blog post for a derivation using the convolution method). Since is the independent sum of two identically distributed exponential variables, the mean and variance of are twice those of the underlying exponential distribution. We have:
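As a quick sanity check, a Monte Carlo simulation of the sum of two independent exponentials recovers a mean of 2/lam and a variance of 2/lam². This is only a sketch: the exponential rate is not reproduced above, so the choice lam = 1 (and all variable names) is an assumption of mine.

```python
import random

# Sketch: simulate X as the sum of two iid exponentials.
# The rate lam = 1 is an assumed value, not taken from the problem.
lam = 1.0
random.seed(0)
n = 200_000
xs = [random.expovariate(lam) + random.expovariate(lam) for _ in range(n)]

mean_x = sum(xs) / n                             # should be near 2/lam = 2
var_x = sum((x - mean_x) ** 2 for x in xs) / n   # should be near 2/lam**2 = 2
```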
Problem A-3
To find the marginal density of , for each applicable , we need to integrate out the . According to the following figure, for each , we integrate over all values on a horizontal line such that (see the blue horizontal line).
Figure 3
Thus we have:
Thus the marginal distribution of is an exponential distribution. The mean and variance of are:
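The exponential marginal can likewise be checked by simulation: draw x first, then y uniformly on (0, x). As before, this is a sketch under my assumed rate lam = 1 (in which case the marginal is exponential with mean 1 and survival probability e⁻¹ ≈ 0.368 at 1).

```python
import random

# Sketch (assumed rate lam = 1): simulate the two-stage model and check
# that the second coordinate behaves like an exponential variable.
lam = 1.0
random.seed(1)
n = 200_000
ys = []
for _ in range(n):
    x = random.expovariate(lam) + random.expovariate(lam)
    ys.append(random.uniform(0, x))   # uniform on the vertical segment (0, x)

mean_y = sum(ys) / n                       # exponential mean is 1/lam
frac_gt_1 = sum(y > 1.0 for y in ys) / n   # survival P(Y > 1) = e**(-1)
```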
Problem A-4
The covariance of and is defined as , which is equivalent to:
where and . Knowing the joint density , we can calculate directly. We have:
Note that the integrand in the last integral of the above derivation is that of a Gamma density (hence the integral is 1.0). Now the covariance of and is:
The following is the calculation of the correlation coefficient:
Even without the calculation of , we know that and are positively and quite strongly correlated. The conditional distribution of is which increases with . The calculation of and confirms our observation.
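A simulation sketch supports the claim of a strong positive correlation (same assumed rate lam = 1 and my own variable names; under this model the correlation works out to 1/√2 ≈ 0.707 regardless of the rate):

```python
import math
import random

# Sketch: estimate Cov(X, Y) and the correlation coefficient by simulation.
lam = 1.0   # assumed rate, not taken from the problem
random.seed(2)
n = 200_000
xs, ys = [], []
for _ in range(n):
    x = random.expovariate(lam) + random.expovariate(lam)
    xs.append(x)
    ys.append(random.uniform(0, x))

mx = sum(xs) / n
my = sum(ys) / n
cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
sx = math.sqrt(sum((x - mx) ** 2 for x in xs) / n)
sy = math.sqrt(sum((y - my) ** 2 for y in ys) / n)
rho = cov / (sx * sy)   # should be near 1/sqrt(2) ≈ 0.707
```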
_________________________________________________________________
Answers for Problem B
Problem B-1
Problem B-2
Problem B-3
Problem B-4
_________________________________________________________________
]]>____________________________________________________________
Problem 1a
There are two identical looking bowls. Let’s call them Bowl 1 and Bowl 2. In Bowl 1, there are 1 red ball and 4 white balls. In Bowl 2, there are 4 red balls and 1 white ball. One bowl is selected at random and its identity is kept from you. From the chosen bowl, you randomly select 5 balls (one at a time, putting each ball back before picking another one). What is the expected number of red balls among the 5 selected balls? What is the variance of the number of red balls?
Problem 1b
Use the same information in Problem 1a. Suppose there are 3 red balls in the 5 selected balls. What is the probability that the unknown chosen bowl is Bowl 1? What is the probability that the unknown chosen bowl is Bowl 2?
____________________________________________________________
Problem 2a
There are three identical looking bowls. Let’s call them Bowl 1, Bowl 2 and Bowl 3. Bowl 1 has 1 red ball and 9 white balls. Bowl 2 has 4 red balls and 6 white balls. Bowl 3 has 6 red balls and 4 white balls. A bowl is chosen according to the following probabilities:
The bowl is chosen so that its identity is kept from you. From the chosen bowl, 5 balls are selected sequentially with replacement. What is the expected number of red balls in the 5 selected balls? What is the variance of the number of red balls?
Problem 2b
Use the same information in Problem 2a. Given that there are 4 red balls in the 5 selected balls, what is the probability that the chosen bowl is Bowl i, where ?
____________________________________________________________
Solution – Problem 1a
Problem 1a is a mixture of two binomial distributions and is similar to Problem 1 in the previous post Mixing Binomial Distributions. Let be the number of red balls in the 5 balls chosen from the unknown bowl. The following is the probability function:
where .
The above probability function is the weighted average of two conditional binomial distributions (with equal weights). Thus the mean (first moment) and the second moment of are the weighted averages of the corresponding moments of the two conditional distributions. We have:
See Mixing Binomial Distributions for a more detailed explanation of the calculation.
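The mixture mean and variance can also be computed exactly with a short script (a sketch with my own variable names; Bowl 1 gives p = 1/5 and Bowl 2 gives p = 4/5, as in the problem):

```python
from math import comb

# Mixture pmf with equal weights on Binomial(5, 1/5) and Binomial(5, 4/5).
n = 5
ps = [1 / 5, 4 / 5]
weights = [1 / 2, 1 / 2]

def pmf(k):
    """Mixture probability of seeing k red balls in 5 draws."""
    return sum(w * comb(n, k) * p**k * (1 - p)**(n - k)
               for w, p in zip(weights, ps))

mean = sum(k * pmf(k) for k in range(n + 1))         # weighted average of means
second = sum(k * k * pmf(k) for k in range(n + 1))   # weighted average of 2nd moments
variance = second - mean ** 2
```

Note that the variance is *not* the weighted average of the two conditional variances; only the raw moments mix linearly, which is why the text works with the second moment.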
____________________________________________________________
Solution – Problem 1b
As above, let be the number of red balls in the 5 selected balls. The probability must account for the two bowls. Thus it is obtained by mixing two binomial probabilities:
The following is the conditional probability :
Thus
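The Bayes update can be checked numerically (a sketch with my own variable names, using the bowl compositions stated in the problem):

```python
from math import comb

# Prior 1/2 on each bowl; likelihood of 3 red in 5 draws is Binomial(5, p).
prior = [1 / 2, 1 / 2]
ps = [1 / 5, 4 / 5]   # P(red) for Bowl 1 and Bowl 2
k, n = 3, 5

likelihood = [comb(n, k) * p**k * (1 - p)**(n - k) for p in ps]
joint = [pr * lk for pr, lk in zip(prior, likelihood)]
total = sum(joint)                        # P(3 red in 5 draws)
posterior = [j / total for j in joint]    # posterior bowl probabilities
```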
____________________________________________________________
Answers for Problem 2
Problem 2a
Let be the number of red balls in the 5 balls chosen at random from the unknown bowl.
Problem 2b
I also regret not mentioning that Bayes’ formula was taught in French high schools, as illustrated by the anecdote of Bayes at the bac. And not reacting to the question about Bayes in the courtroom with yet another anecdote, of Bayes’ formula being thrown out of the accepted tools by an English court of appeal about a year ago. Oh well, another argument for sticking to the written word.
Example 1
As indicated in the diagram, Box 1 has 1 red ball and 3 white balls and Box 2 has 2 red balls and 2 white balls. The example involves a sequence of two steps. In the first step (the green arrow in the above diagram), a box is randomly chosen from the two boxes. In the second step (the blue arrow), a ball is randomly selected from the chosen box. We assume that the identity of the chosen box is unknown to the participants of this random experiment (e.g. suppose the two boxes are identical in appearance and a box is chosen by your friend and its identity is kept from you). Since a box is chosen at random, it is easy to see that .
The example involves conditional probabilities. Some of the conditional probabilities are natural and are easy to see. For example, if the chosen box is Box 1, it is clear that the probability of selecting a red ball is , i.e. . Likewise, the conditional probability is . These two conditional probabilities are “forward” conditional probabilities since the events and occur in a natural chronological order.
What about the reversed conditional probabilities and ? In other words, if the selected ball from the unknown box (unknown to you) is red, what is the probability that the ball is from Box 1?
The above question seems a little backward. After the box is randomly chosen, it is fixed (though its identity is unknown to you). Since it is fixed, shouldn’t the probability of the box being Box 1 be ? Since the box is already chosen, how can the identity of the box be influenced by the color of the ball selected from it? The answer is of course no.
We should not look at the chronological sequence of events. Instead, the key to understanding the example is to perform the random experiment repeatedly. Think of the experiment of choosing one box and then selecting one ball from the chosen box. Focus only on the trials that result in a red ball. For the result to be a red ball, we need to get either Box 1/Red or Box 2/Red. Compute the probabilities of these two cases; adding them gives the probability that the selected ball is red. The following diagram illustrates this calculation.
Example 1 – Tree Diagram
The outcomes with red borders in the above diagram are the outcomes that result in a red ball. The diagram shows that if we perform this experiment many times, about 37.5% of the trials will result in a red ball (on average 3 out of 8 trials). In how many of these trials is Box 1 the source of the red ball? In the diagram, we see that the case Box 2/Red is twice as likely as the case Box 1/Red. We conclude that the case Box 1/Red accounts for about one third of the cases in which the selected ball is red. In other words, one third of the red balls come from Box 1 and two thirds of the red balls come from Box 2. We have:
Instead of using the tree diagram or the reasoning indicated in the paragraph after the tree diagram, we could just as easily apply the Bayes’ formula:
In the calculation in (as in the tree diagram), we use the law of total probability:
______________________________________________________________
Remark
We are not saying that an earlier event (the choosing of the box) is altered in some way by a subsequent event (the observing of a red ball). The above probabilities are subjective. How strongly do you believe that the “unknown” box is Box 1? If you use probabilities to quantify your belief, without knowing any additional information, you would say the probability that the “unknown” box being Box 1 is .
Suppose you reach into the “unknown” box and get a red ball. This additional information alters your belief about the chosen box. Since Box 2 has more red balls, the fact that you observe a red ball will tell you that it is more likely that the “unknown” chosen box is Box 2. According to the above calculation, you update the probability of the chosen box being Box 1 to and the probability of it being Box 2 as .
In the language of Bayesian probability theory, the initial belief of and is called the prior probability distribution. After a red ball is observed, the updated belief as in the probabilities and is called the posterior probability distribution.
As demonstrated by this example, the Bayes’ formula is for updating probabilities in light of new information. Though the updated probabilities are subjective, they are not arbitrary. We can make sense of these probabilities by assessing the long run results of the experiment objectively.
______________________________________________________________
An Insurance Perspective
The example discussed here has an insurance interpretation. Suppose an insurer has two groups of policyholders, both equal in size. One group consists of low risk insureds where the probability of experiencing a claim in a year is (i.e. the proportion of red balls in Box 1). The insureds in other group, a high risk group, have a higher probability of experiencing a claim in a year, which is (i.e. the proportion of red balls in Box 2).
Suppose someone has just purchased a policy. Initially, the risk profile of this newly insured is uncertain. So the initial belief is that it is equally likely for him to be in the low risk group as in the high risk group.
Suppose that during the first policy year, the insured has incurred one claim. This observation alters our belief about the insured. With the additional information of having one claim, the probability that the insured belongs to the high risk group is increased to . The risk profile of this insured is updated based on new information. The insurance point of view described here involves the exact same calculation as the box-ball example: it uses past claims experience to update the assessment of future claims experience.
______________________________________________________________
Bayes’ Formula
Suppose we have a collection of mutually exclusive and exhaustive events , so that their probabilities sum to 1.0. Suppose is an event. Think of the events as “causes” that can explain the event , an observed result. Given that is observed, what is the probability that the cause of is ? In other words, we are interested in finding the conditional probability .
Before we have the observed result , the probabilities are the prior probabilities of the causes. We also know the probability of observing given a particular cause (i.e. we know ). The probabilities are “forward” conditional probabilities.
Given that we observe , we are interested in knowing the “backward” probabilities . These probabilities are called the posterior probabilities of the causes. Mathematically, the Bayes’ formula is simply an alternative way of writing the following conditional probability.
In , as in the discussion of the random experiment of choosing box and selecting ball, we are restricting ourselves to only the cases where the event is observed. Then we ask, out of all the cases where is observed, how many of these cases are caused by the event ?
The numerator of can be written as
The denominator of is obtained by applying the law of total probability.
Plugging and into , we obtain a statement of the Bayes’ formula.
Of course, for any computation problem involving the Bayes’ formula, it is best not to memorize the formula in . Instead, simply apply the thought process that gives rise to the formula (e.g. the tree diagram shown above).
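That thought process can be packaged as a small helper (a sketch; the name `bayes_update` is mine, not from the post): multiply each prior by its likelihood, then normalize by the total.

```python
def bayes_update(priors, likelihoods):
    """Posterior probabilities of the causes, given the observed result.

    priors      : prior probabilities of the mutually exclusive causes
    likelihoods : probability of the observed result under each cause
    """
    joint = [p * l for p, l in zip(priors, likelihoods)]
    total = sum(joint)            # law of total probability
    return [j / total for j in joint]

# Example 1 again: posterior of the two boxes given that a red ball is drawn.
post = bayes_update([1 / 2, 1 / 2], [1 / 4, 2 / 4])
```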
The Bayes’ formula has some profound philosophical implications, evidenced by the fact that it spawned a separate school of thought called Bayesian statistics. However, our discussion here is solely on its original role in finding certain backward conditional probabilities.
______________________________________________________________
Example 2
Example 2 is left as an exercise. The event that both selected balls are red would give even more weight to Box 2. In other words, in the event that a red ball is selected twice in a row, we would believe that it is even more likely that the unknown box is Box 2.
______________________________________________________________
Reference
It occurred to me that Bayesian inference can be thought of as filtering: the objects of interest are the model parameters but, instead of being measured directly, their measurement is implicit in the data.
Consider standard linear regression:
where is an vector of observations, is an matrix, is a parameter vector and is an noise vector. Typically, we take normally distributed noise, , and here we’ll assume the covariance matrix is known. Thus our probabilistic model is
In Bayesian inference, what we are after is This connects to filtering if you think of the pair as an implicit measurement of given the model. Bayes’ formula tells us
where is our prior for the parameters given . Typically, however, our prior beliefs about will be independent of i.e.
For simplicity, we’ll assume a normal prior: , and, in a later post, we’ll compute the posterior for , which is a nice little mathematical problem in its own right! Till then, I’ll only point out that the posterior is also a normal:
Our job is to compute and
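Since the actual computation is deferred to a later post, the following is only a sketch of the standard conjugate-normal result on simulated data; every name here (`A`, `m0`, `S0`, `sigma2`, `m_post`, `S_post`) is my own placeholder for the symbols above. The posterior precision is the sum of the prior precision and the data precision, and the posterior mean is the matching precision-weighted combination.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data from a known linear model (all names are mine, not the post's)
n, k = 50, 2
A = np.column_stack([np.ones(n), rng.normal(size=n)])   # design matrix
beta_true = np.array([1.0, 2.0])
sigma2 = 0.5 ** 2                                       # known noise variance
y = A @ beta_true + rng.normal(scale=0.5, size=n)

# Normal prior on the parameter vector
m0 = np.zeros(k)
S0 = 10.0 * np.eye(k)

# Conjugate-normal posterior: precisions add, means combine by precision.
S0_inv = np.linalg.inv(S0)
S_post = np.linalg.inv(S0_inv + A.T @ A / sigma2)       # posterior covariance
m_post = S_post @ (S0_inv @ m0 + A.T @ y / sigma2)      # posterior mean
```

With a weak prior and 50 observations, the posterior mean lands close to the ordinary least-squares estimate, as expected.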
_____________________________________________________________
Discussion of Problem 1
Problem 2 is found at the end of the post.
Problem 1.1
This is an example of a joint distribution constructed by taking the product of a conditional distribution and a marginal distribution. The marginal distribution of is a uniform distribution on the set (rolling a fair die). Conditional on , has a binomial distribution . Think of the conditional variable as tossing a coin times where the probability of a head is . The following is the sample space of the joint distribution of and .
Figure 1
The joint probability function of and may be written as:
Thus the probability at each point in Figure 1 is the product of , which is , with the conditional probability , which is binomial. For example, the following diagram and equation demonstrate the calculation of
Figure 2
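The joint probability function can be tabulated exactly in a short script. This is a sketch only: the formulas above are not reproduced here, so the coin is assumed fair (p = 1/2) and all names are mine.

```python
from fractions import Fraction
from math import comb

p = Fraction(1, 2)   # assumed coin probability (not stated above)

def joint(n, y):
    """P(die = n, heads = y): a fair-die factor 1/6 times a Binomial(n, p) term."""
    return Fraction(1, 6) * comb(n, y) * p**y * (1 - p)**(n - y)

# The probabilities over the whole sample space sum to 1.
total = sum(joint(n, y) for n in range(1, 7) for y in range(0, n + 1))
```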
Problem 1.2
The following shows the calculation of the binomial distributions.
Problem 1.3
To find the marginal probability , we need to sum over all . For example, is the sum of for all . See the following diagram.
Figure 3
As indicated in , each is the product of a conditional probability and . Thus the probability indicated in Figure 3 can be translated as:
We now begin the calculation.
The following is the calculation of the mean and variance of .
Problem 1.4
The conditional probability is easy to compute since it is a given that is a binomial variable conditional on a value of . Now we want to find the backward probability . Given the binomial observation is , what is the probability that the roll of the die is ? This is an application of the Bayes’ theorem. We can start by looking at Figure 3 once more.
Consider . In calculating this conditional probability, we only consider the 5 sample points encircled in Figure 3 and disregard all the other points. These 5 points become a new sample space if you will (this is the essence of conditional probability and conditional distribution). The sum of the joint probability for these 5 points is , calculated in the previous step. The conditional probability is simply the probability of one of these 5 points as a fraction of the total probability . Thus we have:
We do not have to evaluate the components that go into . As a practical matter, to find , we take each of the 5 probabilities shown in and evaluate it as a fraction of the total probability . Thus we have:
Calculation of
Here’s the rest of the Bayes’ calculation:
Calculation of
Calculation of
Calculation of done earlier
Calculation of
Calculation of
Calculation of
Calculation of
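These backward probabilities can be verified with a short script, under the same assumptions as before: a fair die, an assumed coin probability p = 1/2, and an observed value y = 2 (matching the 5 encircled points for n = 2, …, 6). All names are mine.

```python
from fractions import Fraction
from math import comb

p, y = Fraction(1, 2), 2   # assumed coin probability and observed value

# Joint probabilities P(die = n, heads = 2) for the 5 feasible die values.
joint = {n: Fraction(1, 6) * comb(n, y) * p**n for n in range(2, 7)}

p_y = sum(joint.values())                   # marginal P(heads = 2)
posterior = {n: q / p_y for n, q in joint.items()}   # Bayes' theorem
```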
_____________________________________________________________
Problem 2
Let be the value of one roll of a fair die. If the value of the die is , we are given that has a binomial distribution with and (we use the notation ).
_____________________________________________________________
Answers to Problem 2
Problem 2.3
Problem 2.4