Book review: Bernoulli’s Fallacy

The last time I did a book review, I lamented how long it took me to get through the book. The topic of dependent types was both new and unfamiliar, and maneuvering through the exercises took a long time and required engaged thought. This time, I’ve approached “Bernoulli’s Fallacy: Statistical Illogic and the Crisis of Modern Science.” The book broaches subjects that are neither wholly new nor unfamiliar to me, someone who practices engineering and science. It presents a fascinating perspective on the history of probability as well as a condemnation of many statistical norms, or orthodoxies. Controversial as that may sound, and controversial it is! But I think it underscores some very important mistakes made in modern statistical practice, and on reflection of my own education, I think it’s worth discussing!

This book, authored by Aubrey Clayton, clocks in at around 300 pages of actual text, with notes / bibliography / index comprising about 50 pages thereafter. Unlike my last review, where I spent many months reading through The Little Typer, I blew through Bernoulli’s Fallacy in about 3 days. Needless to say, I quite enjoyed it. Which brings me to the point:

The short form of my opinion is that you should definitely read this book. If you want to do that without being tainted by my review, this is the place to stop!1

Where I’m coming from

My education in statistics was largely steeped in the orthodoxy; that is, the frequentist version of statistics (I’ll get to what that means later). Out of all my courses during both my Bachelor’s and Master’s degrees, I only ever took one “pure” engineering statistics course. This was a speciality biomedical engineering statistics course, and in hindsight it was somewhat lackluster. The course itself taught Bayes’ theorem as well as many other orthodox or frequentist methodologies in statistics, but didn’t really raise any philosophical distinctions or say much about what probability means. “Bernoulli’s Fallacy” pits the Bayesian philosophy against the frequentist interpretation, and lays out why the Bayesian approach is an extension of formal inductive logic, and how the frequentist interpretation side-steps this in an attempt to make probability “objective.”

Needless to say, I don’t think this impression was ever given to me in my formal education. The idea was always that Bayes’ theorem was just something you do when you have to update a probability and you have known priors from some experiment. A “hack” to correct for the base-rate fallacy, but not fundamentally an extension of logic or some epistemic process unto itself. As I said, in hindsight my education was lackluster. Not terrible, as I still had some foundation in the mathematics, but in the way that a robot can know the math and not understand the underlying principles for why that math is used.

Future courses in Geomatics and Surveying were not tailored to statistics, but required statistical thinking as one of the underlying tools. Eventually this culminated in a graduate-level course on “robust” statistics in least-squares and Kalman filtering. I won’t criticize that course too harshly, but just like my statistics course, I felt that there were many topics which seemed a little too specific or required some special interpretation of what counted as “data.” I don’t think I could ever reconcile that or explain it well, but Bernoulli’s Fallacy did at least give me a good starting point to begin thinking about it from an epistemic perspective (i.e. reasoning under uncertainty).2

So that’s fundamentally where I’m coming from when I read through Bernoulli’s Fallacy. While I’ve been mostly subject to the orthodoxy of present-day statistics classes, I also had an easier time bridging the gap to the Bayesian way of thinking. This is important context for later, and may explain some pieces of the book I didn’t fully understand or wish to criticize.

What is Bernoulli’s Fallacy?

The book could fundamentally be thought of as a mathematical text; however, it is perhaps closer to the realm of philosophy (specifically, epistemology). Clayton argues that the orthodox methods of probability and statistics, namely the frequentist interpretation of probability, are fundamentally illogical and as a result are not useful when trying to learn something about the world (i.e. form inferences and weigh hypotheses from data).

Let’s get the definition of “frequentist” out of the way. The frequentist interpretation is the precursor to the eponymous fallacy. Specifically, the “frequentist” interpretation of probability can be summed up as:

Probability is the frequency of occurrence of an event in proportion to the total number of possible events that could have occurred.

Put mathematically:

\[P(A) = \frac{\textsf{# of event A outcomes}}{\textsf{# of total possible outcomes}}\]

Specifically, Jacob Bernoulli defined probability this way as a logical extension of an exercise in drawing coloured stones from an urn. But this is not the entire fallacy. Bernoulli then goes on to say that, given a large enough sample, the frequency defined as probability above lets us make an inference as to the true probability. If we denote the probability computed from a sample of size \(n\) as \(P_n(A)\), then Bernoulli’s fallacy was assuming:

\[P(H | D) = \lim_{n \to \infty} P_{n}(D | H)\]

In short, that we could infer the probability of a hypothesis given our real-world data (\(P(H | D)\)) from the frequency of that data in our collected sample (\(P_n(D | H)\)): as the sample gets larger, we supposedly get closer to the “true” ratio of coloured pebbles in the urn.
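
To make the distinction concrete, here is a minimal sketch (my own, not from the book) of the uncontroversial half of Bernoulli’s argument: the sampled frequency really does converge to the urn’s true ratio as the sample grows. The urn composition and sample sizes here are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical urn: 3 white stones for every 2 black stones (true ratio 0.6).
true_ratio = 0.6

# Sampled frequency of white stones for increasing sample sizes n.
for n in [10, 100, 1_000, 10_000, 100_000]:
    draws = rng.random(n) < true_ratio   # True means "drew a white stone"
    print(f"n = {n:>6}: sampled frequency = {draws.mean():.4f}")

# The frequency converges to 0.6 as n grows (the law of large numbers).
# The fallacy is the leap from this sampling behaviour to the inverse claim:
# that the observed frequency alone tells us P(H | D), the probability of a
# hypothesis about the urn's contents, without ever stating a prior.
```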

The Bayesian alternative, then, defines probability as a relative measure of knowledge (i.e. uncertainty) about some fact about the world. It does away with the notion of “sample sizes,” possible worlds, etc. and only considers actual data collected from an experiment. This changes the relationship of conditional probabilities above to:3

\[P(H_i | D \chi) = P(H_i | \chi) \frac{P(D | H_i \chi)}{P(D | \chi)}\]

There’s a million articles about Bayesian probability on the internet, so I’ll try to summarize these terms briefly.

  • \(\chi\): Our knowledge about the world. In the Bayesian school of thought, we condition all our knowledge on what we already know, subjectively. This is an extension of the philosophy of epistemics. We can’t “know” something in a vacuum; it has to be based on something.
  • \(P(H_i | \chi)\): the probability of a given hypothesis \(H_i\) or explanation given our knowledge about the world \(\chi\). This is the base rate of \(H_i\).
  • \(P(D | H_i \chi)\): the sampling probability of collected data \(D\), given that our hypothesis \(H_i\) is true and given our knowledge about the world \(\chi\).
  • \(P(D | \chi)\): The sum of all pathway probabilities for all possible hypotheses. This can be defined as the probability that our data could be collected given our knowledge of the world, regardless of whether our hypothesis was true. This may be the most confusing bit of this equation, and I will return to this point in my criticisms of the book.
  • \(P(H_i | D \chi)\): The probability that our hypothesis is true given both our existing knowledge about the world \(\chi\) and considering the data we collected.
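
To make these terms concrete, here is a minimal sketch (mine, not Clayton’s) that applies the formula to two made-up hypotheses about an urn. The priors and sampling probabilities are invented purely for illustration.

```python
# Two hypothetical hypotheses about the contents of an urn, e.g.:
#   H1: the urn is mostly white stones
#   H2: the urn is mostly black stones
priors = {"H1": 0.5, "H2": 0.5}          # P(H_i | X): base rates (made up)

# P(D | H_i X): sampling probability of the observed data under each hypothesis
likelihoods = {"H1": 0.8, "H2": 0.1}

# P(D | X): the sum of all "pathway" probabilities over every hypothesis considered
evidence = sum(priors[h] * likelihoods[h] for h in priors)

# P(H_i | D X): the posterior for each hypothesis after seeing the data
posteriors = {h: priors[h] * likelihoods[h] / evidence for h in priors}

print(posteriors)   # {'H1': 0.888..., 'H2': 0.111...}
```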

From a certain perspective, this Bayesian formulation looks pretty close to the frequentist interpretation. In fact, you could even say it is a more general form of the frequentist interpretation, and that the frequentist interpretation falls out of it if \(P(H | \chi)\) and \(P(D | \chi)\) are both 1. Bernoulli’s fallacy, then, is pretending that our base rate is always 1, and that the data (\(P(D | \chi)\)) are always objective (i.e. if the data was collected, it supports a hypothesis objectively and independently, not considering other hypotheses or interpretations).

The book makes a wonderful argument for why the Bayesian school of thought works, and more importantly how the above formula is all you really need to make inferences about the world. So why, then, does the frequentist interpretation persist today, and how deeply is that fallacy engrained in our present day practice?

A bad history

Chapters 3 and 4 of Bernoulli’s Fallacy dive into the history of modern statistical practice, largely looking at three people: Galton, Pearson, and Fisher. These three men are some of the most prominent names in the practice, having invented many of the same techniques taught in university level classes even today.

Clayton reveals that these three men all had a vested interest in the ideas behind eugenics in the early 20th century. They clung to the frequentist interpretation for largely political reasons, namely racism. All three were invested in the colonial ideologies of the day, and they vociferously rejected Bayes’ theorem because they felt that adding subjectivity to knowledge (as opposed to the allegedly “objective” frequentist interpretation of probability) would invalidate their claims that certain races were better, or had “objective” measures that made them so.

Clayton calls this “The Frequentist Jihad,” and even names chapter 4 as such. The orthodoxy was enforced both through the educational structures of the time and through what was publishable in Pearson’s journal, Biometrika, which was considered the “gold standard” of publishing. The unfortunate echoes of this slant towards eugenics are heard today in the names we use for many statistical concepts, such as:

  • Regression: What we call “linear regression” today was named for an observation that genes “regressed” to some average. It was a way of saying, “don’t mix with those people, or your children will regress and make society worse for it.”
  • Correlation: Literally from co-relation, or how related two “desirable” traits were in a population. You might guess which traits eugenicists thought made one desirable.

and so on. There’s a much larger list of phrases, techniques, etc. that were coined for explicitly loaded prerogatives. The racist and eugenicist past of the field was certainly never taught when I learned statistics.4

Chapters 3 and 4 largely paint the background of the social and political conditions of the time that incentivized pushing the frequentist interpretation of probability. After reading these chapters, an important take-away is perhaps to ask:

  1. Why did those social and political conditions push otherwise intelligent people to incorporate bad ideas from an aristocracy with an agenda?
  2. How, regardless of whether the math was right or wrong, subjective or objective, or otherwise, did the fields of probability and statistics evolve into what we are taught today?

Clayton does a great job of citing his sources for these two chapters. It was perhaps shocking to learn of the degree to which early eugenicism in the colonial era influenced the direction of purported “objective” science. Recognizing the aforementioned rhetorical device does support one of Clayton’s later conclusions, however. Namely, that we should consider the hypotheses [we] didn’t assume. Just don’t walk away thinking that because you used Bayes’ theorem you’re not skewing your priors to support ideas on the wrong side of history; you still have to justify and write out those priors as well.

Logic and the replication crisis

I have little to say about this other than Clayton does a wonderful job at portraying the Kafka-esque process that is trying to write a paper using orthodox probability and statistics. He examines several common issues that do not work under frequentist methods, but present no problem at all for Bayesian methods.

Chapters 5 and 6 were the point in the book where I realized that maybe much of what Clayton was arguing was not oriented towards someone in photogrammetry and Geomatics. Earlier hints in the book suggested that Gauss and Laplace laid the foundations for the fields of surveying, astronomy, and physics; in contrast, Galton, Pearson, and Fisher laid the groundwork for biology, medicine, and many of the softer sciences.

Notably, many of the problems of base-rate neglect, extreme or unlikely data, and optional stopping aren’t really an issue in Geomatics problems. Although many in the field of Computer Vision will almost exclusively use a uniform set of priors (about which I could write a whole article), it is rare in photogrammetry to ignore the prior and posterior uncertainties. My class on “robust” least-squares came in handy here after all, since I could imagine ways to estimate the “lost spinning robot” or unknowns with “extreme or unlikely data” using distributions that were not normal, or that contained obvious outliers.

Unfortunately, I don’t necessarily think that the intuitions of Bayesian methods are evenly distributed across everyone working in my field. While we do start with the right tools, there is a tendency to assume uniform priors where there is real information at hand. Likewise, I rarely if ever see anyone solve the optional stopping problem in least-squares; few people will utilize what was taught to me as the “summation-of-normals” formulation of least-squares. Importantly, we shouldn’t treat several “sampled” experiments differently from a single experiment that contains all of the samples. Summing our inferences across many smaller experiments should be no different than putting all the data into the same inference problem.
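
As a toy illustration of that last point (my own sketch, not an example from the book), consider estimating a single unknown from two separate batches of measurements. Fusing the two batch estimates by inverse-variance weighting (the “summation-of-normals” idea as I understand it) gives exactly the same answer as throwing all the data into one least-squares problem. The numbers below are made up.

```python
import numpy as np

rng = np.random.default_rng(1)

# One unknown (say, a distance) observed with the same measurement noise.
truth, sigma = 42.0, 0.5
batch_a = truth + sigma * rng.standard_normal(30)
batch_b = truth + sigma * rng.standard_normal(70)

# Batch solution: throw every observation into one least-squares problem.
all_obs = np.concatenate([batch_a, batch_b])
x_batch = all_obs.mean()
var_batch = sigma**2 / all_obs.size

# Sequential solution: solve each experiment separately, then fuse the two
# normal estimates by inverse-variance weighting ("summation of normals").
x_a, var_a = batch_a.mean(), sigma**2 / batch_a.size
x_b, var_b = batch_b.mean(), sigma**2 / batch_b.size
w_a, w_b = 1.0 / var_a, 1.0 / var_b
x_seq = (w_a * x_a + w_b * x_b) / (w_a + w_b)
var_seq = 1.0 / (w_a + w_b)

# Both routes give the same estimate and the same uncertainty, which is the
# point: stopping, restarting, and combining doesn't change the inference.
print(x_batch, x_seq)       # identical (up to floating point)
print(var_batch, var_seq)   # identical
```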

Criticisms

The case is pretty strong for Bayesian methodologies to take over from our current orthodox methods. I agree with most of the messaging of the book. The history seems to check out, and the copious number of examples in the book strengthens the case that what we’re doing today is wrong.

Yet, I may still have misunderstood some parts of the book. I think that the book speaks with pretty broad strokes, and for the purpose of replacing our current illogical methods, I think that’s fine. But there are some nuances that I think are worth bringing up.

Take-aways from the preface

Clayton makes some claims in the preface of the book that are labeled as intentionally controversial at the outset. Further, he claims that by the end of the book we should be much more accepting of them, with a deeper perspective on the underlying logic. While I think the book does, for the most part, make good on its attempts to support them, there is one that I’m not sure I can square up:

No special care is required to avoid “overfitting” a model to the data, and validating the model against a separate set of test data is generally a waste.

Most of what I work on with statistics is in the realm of optimization. Namely, I am often performing photogrammetric bundle adjustments to model optical sensors according to some mathematical model. A common practice is to leave out some points in a photogrammetric network and use these as test or check points. By using the final optimized parameters and “un-projecting” image points back into 3D space, we can use the coordinates of the check points (observed through some separate control survey) to evaluate the relative accuracy (not precision) of the final solution.

This is an important step, because many of the parameters we are optimizing for suffer from a form of projective compensation (more about this in a later criticism). It is entirely possible to be unable to observe (not directly observe, but in the sense of an observable parameter) many of the unknowns in the optimization. We don’t use check points or test points to validate over/under-fitting of the data; we use them to determine if our least-squares optimization resulted in a local optimum or if it converged to an answer that agrees with geometric reality!
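
For the curious, here is a hypothetical sketch (my own, with made-up coordinates; not the output of any real pipeline) of what that check looks like: compare the “un-projected” coordinates of the withheld check points against their surveyed coordinates and report the per-axis RMSE.

```python
import numpy as np

# Hypothetical check-point coordinates (metres), e.g. from a control survey...
surveyed = np.array([
    [100.012, 250.034, 12.501],
    [102.998, 251.010, 12.476],
    [ 99.503, 248.977, 12.533],
])

# ...and the same points "un-projected" from the final adjusted solution.
reconstructed = np.array([
    [100.019, 250.041, 12.489],
    [103.006, 251.002, 12.470],
    [ 99.497, 248.985, 12.541],
])

# Residuals and per-axis RMSE: a measure of accuracy against external truth,
# not the internal precision reported by the adjustment itself.
residuals = reconstructed - surveyed
rmse = np.sqrt(np.mean(residuals**2, axis=0))
print(f"RMSE (X, Y, Z) in metres: {rmse}")
```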

Of course, the above quote says that it is “generally a waste.” Perhaps this is meant to be indicative of the fact that methods resembling this are abused in softer sciences. Needless to say, I can tell I’m experiencing confusion here, because my intuition of how to compare my model of a camera to geometry in the real world seems to clash with a broader condemnation of the method.

On the topic of overfitting, I would also (partially?) disagree. I’ve written numerous articles for my employer, Tangram Vision, detailing ways in which classical computer vision models (such as using \(f_x\) and \(f_y\)) are incorrect models, precisely because they overfit observable quantities. More recently, I’ve gone into detail as to why those models are worse off than the alternative, because they introduce more projective compensations throughout the final solution. It’s not that these models can’t be useful, but they aren’t nearly as repeatable, nor are they more indicative of the actual physics in practice. More on this in a later criticism, but I give Clayton the benefit of the doubt here. If I had zero prior knowledge of a better way to model some quantity, then worrying about overfitting before I have any indication that my model is even close to correct would be premature.

The Way Out, or what we should do from now on

The last chapter in the book is called “The Way Out,” and serves as a conclusion not just on what went wrong, but likewise what should be done to correct ourselves and move away from the errors of the past. In short, Clayton’s recommendations are:

  1. Abandon the frequentist interpretation and its language
  2. Don’t fear the prior
  3. Ignore the data you didn’t get; focus on the hypotheses you didn’t assume
  4. Get used to approximate answers
  5. Give up on objectivity; try for validity instead

I actually don’t have much criticism for many of these, and find myself in strong agreement. The first point is perhaps the only one I will question here, in the following sub-section.

Abandoning the frequentist interpretation & language

Abandoning the frequentist interpretation is actually quite easy. Bayes’ theorem has not exactly evolved over time, and the case for a more unified practice of probability and statistics is made quite well.

Changing our language, however, I am much more skeptical of. Clayton actually enumerates what he finds to be the most important changes in our vocabulary. I’ve provided the following table to summarize these:

Orthodoxy says…          We should replace with…
Random variable          Unknown
Standard deviation       Uncertainty (or, if using the inverse, precision)
Variance / Covariance    Second central moment
Correlation5             N/A (don’t use this word ever)
Linear regression        Linear modeling
Significant difference   N/A (instead, report a probability distribution)

This is pretty compact as far as these kinds of changes go, and I have to wonder if he didn’t hold back in this part of the book somehow (I mean, there is a LOT of jargon in statistics). Nevertheless, I actually agree with the first and second (unknowns and uncertainty). In geomatics parlance, we actually already prefer unknown and uncertainty, although standard deviation will be used colloquially quite often. I’m not sure if that habit is specific to Canada / North America or if it’s general across my own field, so perhaps I could be better informed here.

As for (co)variance, correlation, or regression: I’m seriously unconvinced this language will ever change. It may have a terrible past, but I genuinely think the field is damned to keep these terms forevermore. With respect to correlation, I prefer projective compensation, mostly because in any optimization / machine learning context the definition we apply to it is more useful:

Projective compensation is a relative measure of the degree to which residual errors in the modeling of two unknowns will correspond with one another. In the estimation of unknowns, it is an effect where the estimate of one parameter shifts as a result of compensating for residual error that has been projected into it due to its functional (model) or observed (data) relationship with another parameter.

This is the working definition that I use for projective compensation when dealing with it in the photogrammetry realm. If this doesn’t make a lot of sense, I’ve written a much better take on it over on my employer’s blog.
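
As a toy numerical illustration of where this comes from (my own sketch, not Clayton’s), consider fitting a line \(y = a + bx\) to points whose \(x\) values sit far from zero. The estimates of \(a\) and \(b\) become strongly anti-correlated: residual error that belongs to one parameter gets projected into the other.

```python
import numpy as np

rng = np.random.default_rng(2)

# Observations clustered far from x = 0 make the intercept and slope
# estimates lean on each other: a classic source of projective compensation.
x = np.linspace(100.0, 110.0, 20)
y = 3.0 + 0.5 * x + 0.1 * rng.standard_normal(x.size)

# Design matrix for y = a + b*x, the least-squares solution, and the
# covariance of the estimates (a, b) for measurement noise sigma = 0.1.
A = np.column_stack([np.ones_like(x), x])
a_b, *_ = np.linalg.lstsq(A, y, rcond=None)
cov = 0.1**2 * np.linalg.inv(A.T @ A)

# Normalized off-diagonal term: close to -1 here, meaning error that should
# land in the slope is "projected into" the intercept, and vice versa.
corr_ab = cov[0, 1] / np.sqrt(cov[0, 0] * cov[1, 1])
print(f"estimates (a, b): {a_b}")
print(f"correlation between intercept and slope estimates: {corr_ab:.4f}")
```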

As for (co)variance and regression: I’m not sure we can rightly replace these terms. They see far too much use in the active machine-learning / deep-learning space for me to believe they could be changed in short order. My main criticism here is mostly that variance is pretty benign, in my opinion. The variance across a selection of points in space does not bring to mind the kind of work that Galton or Fisher were doing. Moreover, I can’t imagine our primary APIs in the multitude of mathematical libraries / languages / etc. changing for this. For example, swapping out numpy.cov with numpy.second_central_moment is unlikely to work out.
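
For what it’s worth, any such renaming would be purely cosmetic; the quantity is the same either way. A quick check with numpy (the data here are arbitrary):

```python
import numpy as np

data = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

# "Variance" and "second central moment" are the same number; only the name
# carries the historical baggage.
variance = np.var(data)
second_central_moment = np.mean((data - data.mean())**2)
print(variance, second_central_moment)   # both 4.0
```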

Perhaps my “criticism” is quite weak here, though; after all, I am advocating for doing nothing in the face of a book I largely agree with and am happy to have read. Maybe something a bit more succinct than “second central moment” is worth mulling over.

Conclusion

“Bernoulli’s Fallacy” is a wonderful introduction to the history of probability and statistics, and a scathing damnation of orthodox statistics as they exist today. While the book is not perfectly comprehensive, and its tone is aimed at those in the fields of medicine / biology / the softer sciences, it remains a good read with a lot of worthwhile food for thought for anyone who has ever touched statistics in their life.

The book discusses the racist past of statistics and how it was used to promote “eugenics” and its associated ideologies. Beyond pure history, it demonstrates the confusion of frequentist methodologies, and questions how well they map onto reality. The book is eloquent, and it kept me gripped from beginning to end.

Needless to say, I quite enjoyed it despite my criticisms, and I suspect that many of those criticisms may be overly nit-picky or may stem from me misunderstanding something in particular. The topic is extremely relevant in my own day-to-day work, and I’ve done as much as I can to incorporate it. Not because it is morally right, but because Bayesian statistics are just more effective.

Lastly, if you’ve managed to tag along this far into the review, thank you. This has been one of my longer posts, and on a topic few will engross themselves in for “fun.” As always, feel free to contact me if you have some comment on the review, or if you just want to share your own insights from the book.


  1. Also if you’re the author, Aubrey Clayton: hi! I hope I haven’t grossly misinterpreted anything you’ve said. I hope you enjoy the review. 

  2. I’d be remiss if I didn’t also mention that in 2016 / 2017, near the end of my Master’s degree, I started reading the LessWrong sequences. This was perhaps the first experience I had where the Bayesian approach was not only heavily used in practical contexts, but generally admired and to some degree even worshipped. It was through this that I eventually tied together the concept that Least-Squares was fundamentally based on a Bayesian process.

    I’m also happy to say that this was not a unique insight. Bernoulli’s Fallacy does indeed go through the history of how Gauss and Laplace independently came up with the least-squares method using the Bayesian interpretation of statistics. It was refreshing having this intuition and then learning the history afterwards. Perhaps it should have been the other way around (history first, insights second), but it was somewhat validating to know that how I was applying my tools had the epistemic backing that the Bayesian approach argues for. 

  3. There are also a sum rule and a product rule for probabilities. These are:

    \[P(A | \chi) + P(\neg A| \chi) = 1 \textsf{ ;; Sum Rule}\]

    and

    \[P(A \land B | \chi) = P(A | \chi) \cdot P(B | A \land \chi) \textsf{ ;; Product Rule}\]

    Clayton copies these directly from Edwin Jaynes’ book, “Probability Theory: The Logic of Science.” From these, we can derive Bayes’ theorem as well as pretty much any probabilistic logic. 
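
    For instance, since \(A \land B\) is the same proposition as \(B \land A\), applying the product rule in both orders and dividing through by \(P(B | \chi)\) gives Bayes’ theorem:

    \[P(A | B \land \chi) = \frac{P(A | \chi) \cdot P(B | A \land \chi)}{P(B | \chi)}\]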

  4. Where have I heard that before? Oh right, just about every history lesson. :(

    It may be worth stepping back for a moment to recognize the rhetorical trick Clayton is employing here, and to evaluate whether it changes how to perceive the content of the book. The rhetorical device in play is “the orthodoxy is founded on racist/colonial/awful ideas and they were really against Bayes’ theorem.” What is left between the gaps is a sense that the Bayesian school of thought isn’t racist/colonial/awful. I think it’s important not to paint a clear canvas over history there either. Bayes’ theorem may be more mathematically correct, but coming away from the book with a notion that frequentist = racist, Bayesian = not racist would be a mistake.

    Overall, I think a history mired in eugenics is probably bound to be problematic in a lot of ways, but in the spirit of critical thought it’s probably at least worth mentioning that Bayesian methods are not acquitted of any wrongdoing. You do have to at least be honest about your priors, so you may be more fault tolerant to bad hypotheses, but I have yet to be convinced that there’s any math that can change a man’s mind if he doesn’t want it changed. 

  5. I added this in here myself, but given the text throughout the book Clayton is quite scathing of this term. 

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.