The Bayes classifier

The Bayes classifier is a Data Scientist’s unbeatable arch enemy. When faced with the challenge of finding an optimal solution to a classification problem, it is the ultimate (theoretical!) benchmark when evaluating the performance of a classifier. The problem with the Bayes classifier is that it uses information that generally is not available to us in the real world. Therefore, it is more like a theoretical tool that shows that an optimal classifier always exists. Also, it shows us that given enough information about a problem, then a simple idea can lead to the optimal solution.

This blog post was inspired by exercise 7 of chapter 3.5 of the book Understanding Machine Learning. Since this blog post contains some hints for solving this exercise, it is a good idea to try and solve it yourself first. If you get stuck, feel free to return to this blog post.

Means and medians revisited

The Bayes classifier is related to the ordinary median; since we are about to find out why the Bayes classifier is always optimal (given that we precisely know the distribution of our data) let us first revise what optimization problem the median solves. Since the proof will be somewhat computation heavy, let us do the same for the mean first. This may be seen as an add-on to the blog post Means and medians where a general theme about these two invariants has already emerged: While both serve similar purposes, usages of the median in theoretical contexts are always more involved.

Theorem. Let \(X\) be a real valued random variable.

  • The mean minimizes the \(L^2\)-distance to \(X\), i.e., it satisfies \[ \mathbb{E}[X] = \mathop{\mathrm{argmin}}_{c \in \mathbf{R}} \mathbb{E}\left[(X-c)^2\right]^{1/2}. \]
  • The median minimizes the \(L^1\)-distance to \(X\), i.e., it satisfies \[ \mathbb{M}[X] = \mathop{\mathrm{argmin}}_{c \in \mathbf{R}} \mathbb{E}[|X-c|], \] where \(|\cdot|\) denotes the absolute value.

The theorem tells us that both the mean and the median are real numbers that are closest to the random variable; however, they are closest in different measures.

A short digress on different distance metrics

In case you are still wondering about what the different notions of distances are supposed to mean, the following should get you covered. There is an intrinsic need for different distance metrics: While a plane may travel the direct path from one point of the world to another, a car has to stick to roads and a train is bound to rails. So what exactly is a distance? Or, more precisely, what are the essential features of the different distance metrics that all of them share? This is where mathematicians get excited: Extracting similarities from a priori different entities and finding an appropriate abstraction that unites them is one of our main selling points. Here’s what mathematicians have come up with a long time ago:

A distance metric is a function assigning a real value (the distance) to any two points in space such that

  1. the distance from any point to itself is zero,
  2. the distance from any point \(a\) to another \(b\) is the same as the distance from \(b\) to \(a\) (distances should be symmetric), and
  3. the distance from any point \(a\) to another \(b\) is always smaller than or equal to the sum of the distances from \(a\) to yet another point \(c\) and from \(c\) to \(b\) (making roundabouts increases distance).

A related measure is length. Imagine a two dimensional coordinate system. If we use the usual picture of vectors as arrows in that coordinate system then we may assign a real value to each vector by measuring the length of that arrow. This yields the so-called euclidean norm or the \(2\)-norm. The formula for the length of the vector pointing from the origin to the point \((x,y)\) is \(\|(x,y)\|_2 := \sqrt{x^2 + y^2}\). We obtain a distance metric when postulating that this length is the distance from the point \((x,y)\) to the origin \((0,0)\). There are many other kinds of lengths; one of many examples is the \(1\)-norm defined by \(\|(x,y)\|_1 := |x| + |y|\). The distance metric induced by this norm is the distance when you are only allowed to move horizontally or vertically across two dimensional space. The difference between the 1-norm and the 2-norm.

Let us turn to (discrete) random variables and how to make sense of the \(L^1\)- and \(L^2\)-distances. As you might suspect by now, they are related to the \(1\)- and \(2\)-norms, respectively. Let us assume that we are given a random variable \(X\) attaining two values \(x\) and \(y\). Let the probability that \(X = x\) be \(p\) so that \(P(X=y) = 1-p\). By the explicit formula for the mean, we see \[ \mathbb{E}[|X|] = p|x| + (1-p)|y| = \|(px, (1-p)y\|_1\] and, likewise, \[ \mathbb{E}[X^2]^{1/2} = \sqrt{px^2 + (1-p)y^2} = \|(\sqrt{p}x, \sqrt{1-p}y)\|_2.\] This means that the \(L^1\)-distance is a probability-weighted version of the \(1\)-norm, while the same holds for the \(L^2\)-distance and the \(2\)-norm. This might explain why the mean is simpler to use in theoretical contexts or why it might feel more natural (especially given its simple definition): It comes from minimizing a distance metric induced by the natural euclidean distance.

The proof

For the mean

Let us tackle the mean first. There is a pretty straightforward proof that uses calculus: First notice that because the square root is a monotonic increasing function, is suffices to show that the mean minimizes \(c \mapsto \mathbb{E}\left[(X-c)^2\right]\). Then, by linearity of the expected value, this is a quadratic polynomial in \(c\) so that it attains precisely one minimum. By calculus, computing that minimum is now just a matter of taking the derivative and solving a linear equation. However, I want to give you one proof that does not even need calculus. We merely need to use an equation that you might know as the third binomial formula: \(a^2 – b^2 = (a+b)(a-b)\). Here’s how it goes. Getting rid of the square root is done by the same argument as before. Then, for any two real values \(c\) and \(d\), we compute

\mathbb{E}[(X-c)^2] – \mathbb{E}[(X-d)^2] &= \mathbb{E}[(X-c)^2 – (X-d)^2] \\
&= \mathbb{E}[(X-c + X-d)(X-c-X+d)] \\
&= \mathbb{E}[(2X-c-d)(d-c)] \\
&= (2\mathbb{E}[X]-c-d)(d-c).

Here, we have used linearity of the expected value in the first step, the third binomial formula with \(a = X-c\) and \(b = X-d\) in the second and linearity of the expected value in the last as well as the fact that the expected value of a constant random variable (such as \(c\) and \(d\)) equals that constant. This chain of equations holds for any two real values. This means it particularly holds for \(d = \mathbb{E}[X]\). Let’s plug this in:

\mathbb{E}[(X-c)^2] – \mathbb{E}[(X-\mathbb{E}[X])^2] &= (2\mathbb{E}[X]-c-\mathbb{E}[X])(\mathbb{E}[X]-c) \\
&= (\mathbb{E}[X]-c)^2 \geq 0,

as the square of any real number is non-negative. Reformulating, this means \[ \mathbb{E}[(X-c)^2] \geq \mathbb{E}[(X-\mathbb{E}[X])^2] \] holds for any real value \(c\) showing that \(\mathbb{E}[X]\) minimizes \(L^2\)-distance to \(X\).

For the median

While the idea for showing that the median minimizes \(L^1\)-distance is the same as in the proof above for the mean, be warned that this part is a lot more finicky. Because the absolute value is not differentiable, a straightforward proof using calculus is not available either. We will start like before by computing \(\mathbb{E}[|X-c|] – \mathbb{E}[|X-d|]\) for any two real values \(c\) and \(d\). One technicality I have to introduce to you is the so-called indicator function of a set \(A\) that I will denote by \(\chi_A\). This is a function that is \(1\) on \(A\) and \(0\) anywhere else. We will use it to deal with certain case distinctions in a concise way. For starters, let us assume \(d > c\) and compute

|X-d| – |X-c| &= \chi_{\{X \leq c\}}(|X-d| – |X-c|) + \chi_{\{X > c\}}(|X-d| – |X-c|) \\
&= \chi_{\{X \leq c\}}(d-X-c+X) + \chi_{\{X > c\}}(|X-d|- X + c) \\
&= \chi_{\{X \leq c\}}(d-c) + \chi_{\{d > X > c\}}(d-X – X + c) + \chi_{\{X \geq d\}}(X-d- X + c) \\
&= \chi_{\{X \leq c\}}(d-c) + \chi_{\{d > X > c\}}(d-2X + c) + \chi_{\{X \geq d\}}(-d+c) \\
&\geq \chi_{\{X \leq c\}}(d-c) + \chi_{\{d > X > c\}}(-d+c) + \chi_{\{X \geq d\}}(-d+c) \\
&= \chi_{\{X \leq c\}}(d-c) + \chi_{\{X > c\}}(-d+c).

Here, we first split up the computation into the cases \(X \leq c\) and \(X > c\), getting rid of some of the absolute values. The next step was to split up the second case into the cases \(d > X > c\) and \(X \geq d\). Finally, in this second case we realized that we can estimate \(-2X > -2d\). After that, we simply united these two cases back together. Now, applying the expected value to this inequality and using monotony and linearity, we end up with

\mathbb{E}[|X-d|] – \mathbb{E}[|X-c|] &\geq \mathbb{E}[\chi_{\{X \leq c\}}](d-c) + \mathbb{E}[\chi_{\{X > c\}}](-d+c) \\
&= (d-c)\left(P(X \leq c) – P(X > c)\right) \\
&= (d-c)\left(2P(X \leq c) – 1\right)

In the case \(d < c\), we may do a very similar computation by splitting up on \(X \geq c\) and \(X < c\). In this case, we end up with \[\mathbb{E}[|X-d|] – \mathbb{E}[|X-c|] \geq (c-d)\left(2P(X \geq c) – 1\right).\] What was this good for? If we plug in \(c = \mathbb{M}[X]\) in either case, we end up with \[\mathbb{E}[|X-d|] – \mathbb{E}[|X-\mathbb{M}[X]|] \geq 0\] as \(P(X \geq \mathbb{M}[X]) = P(X \leq \mathbb{M}[X]) = 1/2\). This shows that the median minimizes \(L^1\)-distance.

Classification problems

Let us use stochastics to model the following setup. We are given a bunch of data samples. Each data sample is either labeled (classified) as positive (1) or negative (0). Given an unlabeled data sample, we are tasked with giving a good estimate of whether it should be labeled as positive or negative.

An example

We are given news articles and are tasked with finding out whether they are about sports or not. In this case, our data samples consist of word count vectors generated from those articles (this means that for each article we create a table containing the information how often words appear in the article). Our positive data samples will be the sports articles.

Even for this simple example, we see that both the labeling as well as the information in the data sample are dependent of each other: For instance, an article containing sports-related words (‘football’, ‘competition’, …) is much more likely to be, in fact, a sports article while an article labeled as a sports article is much more likely to contain sports-related words more often than the average article.

An inference pipeline transforming an article into a word count vector that is fed into a classifier.
A classical inference pipeline consisting of a data transformation and a classifier receiving the transformed data.

Furthermore, this is a classical example of an inference pipeline. For data to become interpretable by a classifier, it has to be transformed appropriately first. In this case, the raw data might be given by the article, its headlines, and sub headlines. The raw data will then be transformed into a word count vector that in turn is fed into a classifier that predicts whether the article is about sports or not.

Modelling classification problems

Since we do not know anything else about the problem, we might as well assume that the data samples are given to us at random, adhering to a probability distribution \(P\). More precisely, given a set \(S\) as our sample value space, we think of the data samples as an \(S\)-valued random variable \(X\) and of the labeling as a \(\{0,1\}\)-valued random variable \(Y\). In general, \(X\) and \(Y\) will very much depend on each other, as we saw in the preceding example: The values of \(X\) could be word count vectors while \(Y\) could be the labeling of the underlying article as being sports-related or not. We could model the bunch of data samples as a set; however, it is possible that there are duplicate data samples, even with the same label. Since we want to research on the underlying probability distribution, the occurrence of such duplicate data samples could carry precious information. Thus, we opt for modeling the given data samples as an element of \((S \times \{0,1\})^m\), with \(m\) being the amount of data samples obtained. Given such a tuple, we are tasked with constructing a classifier, i.e. a function \[f\colon S \to \{0,1\}.\]

Loss functions

How do we know if we have actually found a good classifier? What could good mean in this context? The idea of a loss function is to penalize each false prediction of the classifier. The penalty could be a constant number or it could depend on whether or positive or a negative sample has been falsely predicted. For this simple task at hand, the easiest would be to take the empirical risk.

Definition. Given an \(m\)-tuple of labeled data samples \(D\), the empirical risk of a classifier \(f\) is given by \[ L_{\mathrm{ER},D}(f) := \frac{|\{(x,y) \in D \mid f(x) \neq y \}|}{m}. \]

In other words, the empirical risk of a classifier computes the fraction of falsely predicted data samples. This means that the classifier is penalized by \(1/m\) for each false prediction. While this might be a good start for evaluating the performance of a classifier, concentrating on minimizing empirical risk alone is prone to overfitting, i.e. producing classifiers that perform well on the data samples obtained but fail miserably on new data.

What we should be more interested in, even though will not be computable in real world examples, is the true risk of a classifier.

Definition. If \(P\) is the probability distribution of the tuple of data samples and labelings \((X,Y)\), then the true risk of a classifier \(f\) is given by \[ L_P(f) := P(\{(x,y) \mid f(x) \neq y\}).\]

In other words, the true risk of a classifier penalizes false predictions of labeled data \((x,y)\) by the true probability of them occurring. In comparison, the empirical risk tries to approximate the true probability by the relative frequency of the labeled data sample in our sample set. If our sample set is large enough, this should be a good approximation; however, if we are out of luck, our sample set could contain labeled data samples with much higher frequency than they usually appear in the wildness. This leads to some interesting approaches of dealing with classification problems by minimizing the true risk for all probability distributions simultaneously. These approaches allow for a slim margin of error depending on the size of the sample set. But this is for another blog post.

The Bayes classifier

What would we do if we knew the underlying probability distribution \(P\)? Given an unlabeled data sample \(x \in S\), we could compute the probability of its label being positive \(P(Y = 1 \mid X = x)\). This is a conditional probability and may be read as the probability of \(Y\) being \(1\) given that \(X = x\). If this probability is bigger than \(1/2\) we should place our bet on the label being positive. Otherwise, it should be negative. This simple assignment is the Bayes classifier.

Definition. For the probability distribution \(P\) of \((X,Y)\), the Bayes classifier is the classifier given by \[ f_P\colon x \mapsto \begin{cases} 1 & P(Y=1 \mid X = x) \geq 1/2 \\ 0 & \text{else}\end{cases}.\]

An idea, no matter how simple, might yield a good result. In this case, in fact, the simple idea underlying the Bayes classifier yields the optimal result.

Theorem. The Bayes classifier minimizes the true risk. More precisely, given any other classifier \(f\colon S \to \{0,1\}\), the inequality \[ L_P(f) \geq L_P(f_P) \] holds.

Note that even though the Bayes classifier minimizes the true risk, its empirical risk could be arbitrarily high. Indeed: Consider a sample set consisting of one data sample that is labeled as positive. Assume that the probability of that data sample being labeled positively is smaller than \(1/2\). This one data sample is then mislabeled by the Bayes classifier yielding an empirical risk of \(1\).

The connection to the median

Before diving into the proof and to prepare for it, let me explain the connection between the Bayes classifier and the median of a random variable. In fact, the value \(f_P(x)\) for any unlabeled data sample \(x\) may be interpreted as the conditional median of the random variable \(Y\) given \(X = x\). Here’s what this means: The value of the Bayes classifier satisfies

P(Y \geq f_P(x) \mid X = x) &\geq 1/2 \text{ and} \\
P(Y \leq f_P(x) \mid X = x) &\geq 1/2.

Indeed, if \(f_P(x) = 1\) then \[P(Y \geq f_P(x) \mid X = x) = P(Y = 1 \mid X = x) \geq 1/2\] and \[P(Y \leq f_P(x) \mid X = x) = P(Y \in \{0,1\} \mid X = x) = 1 \geq 1/2.\] The proof for \(f_P(x) = 0\) is similar.

A sketch for the proof

As the proof for the Bayes classifier minimizing the true risk uses conditional expectations that I have not yet talked about, I will only give a sketch consisting of hints for the experts here. Also, this is the solution of exercise 7 in chapter 3.5 of Understanding Machine Learning; feel free to read this if you are stuck but I suggest on getting your hands dirty yourself first and giving this exercise a good try.

Ready? The first step is to rephrase the true risk as \[ L_P(f) = \mathbb{E}[|Y – f(X)|]. \] Then, we may use one of the central properties of conditional expectations: The expected value of a conditional expectation of a random variable is the same as the expected value of the random variable. This yields \[ L_P(f) = \mathbb{E}\left[\mathbb{E}[|Y – f(X)| \mid X]\right]. \] Now we are almost done. By monotony of the expected value, it suffices to minimize \(\mathbb{E}[|Y – f(X)| \mid X]\) pointwise. And this is precisely what the Bayes classifier does by the preceding paragraph: For each \(x \in S\), \(\mathbb{E}[|Y – f(x)| \mid X = x]\) is minimal if \(f(x)\) is the conditional median of \(Y\) given \(X = x\) which is the value \(f_P(x)\) of the Bayes classifier.

Further reading

Means and medians

Understanding random experiments through deterministic invariants is a central goal of the mathematical branch known as probability theory or stochastics. While the expected value (or the mean) of a random experiment has a very natural interpretation (as its name suggests, it is the value we expect to occur on average when repeating the experiment) and a simple ad hoc definition as well as an easy generalization once foundations of measure theory have been established, the situation for the median is not as nice. Even though the median is more robust when dealing with outliers and, thus, might yield a more convenient substitute for the expected value, I have not encountered a general definition and treatment of it until I was concerned with learning the underlying statistical theory of Machine Learning (as for instance in The Elements of Machine Learning by Hastie et. al. or in Understanding Machine Learning: From Theory to Algorithms by Shalev-Shwartz and Ben-David). This post gives a basic treatment for discrete random variables and should be accessible to readers with some understanding of elementary discrete math and probability.

Expected values and medians for discrete random variables

Let \(X\) be a random variable attaining finitely many real values \(\{x_1, \dotsc, x_k\} \subset \mathbf{R}\). Denote the probability that \(X = x_i\) by \(p_i\). We know that

  • \(0 \leq p_i \leq 1\) for all \(i\) and
  • \(\sum_{i=1}^k p_i = p_1 + p_2 + \dotsc + p_k = 1.\)

Definition. The expected value (or the mean) of the random variable \(X\), denoted by \(\mathbb{E}[X]\) is the weighted average of its values, weighted by their respective probabilities, i.e. \(\mathbb{E}[X] := \sum_{i=1}^k p_i x_i.\)

Let us quickly discuss a couple of examples and some computational rules before moving on to the median.

  • If \(X\) is constant, i.e. it only attains one value \(x_1\) with probability \(p_1 = 1\), then the expected value agrees with that one value.
  • If we assume that \(p_1 = \dotsc = p_k = 1/k\), i.e. that \(X\) is distributed uniformly on its values, then the expected value will be the ordinary average of the values \(\{x_1, \dotsc, x_k\}\).
  • Let \(k = 2\) so that \(X\) takes on only two values. Let the first value be \(x_1 = 0\) taken on with probability \(p_1 = 0.9999\) and let the second value be \(x_2 = 10000000000 = 10^{10}\) taken on with probability \(p_2 = 1 – p_1 = 0.0001 = 10^{-4}\). By our definition, the expected value of \(X\) is \begin{align*}\mathbb{E}[X] &= p_1 x_1 + p_2 x_2 \\&= 0.999 \cdot 0 + 10^{-4} \cdot 10^{10} = 10^6.\end{align*} Even though we know with 99.99% probability that \(X\) will be zero, the expected value is some unexpectedly high number because there is the trace of a chance that \(X = 10^{10}\). This example is the heart of what we mean by that the expected value is prone to outliers.

Computational rules. The following rules are central to the expected value. They may be proved by (more or less) direct computation from the definition.

  • The expected value is monotone, i.e. for random variables \(X, Y\) such that \(X \geq Y\), we also have \(\mathbb{E}[X] \geq \mathbb{E}[Y]\).
  • The expected value is a linear functional, i.e. for real numbers \(\alpha, \beta \in \mathbb{R}\) and another random variable \(Y\), \(\mathbb{E}[\alpha X + \beta Y] = \alpha\mathbb{E}[X] + \beta\mathbb{E}[Y]\).

Definition. A median of the random variable \(X\) denoted by \(\mathbb{M}[X]\) or \(\mathbf{Median}[X]\) is a real number such that both the probabilities that \(X \leq \mathbb{M}[X]\) as well as \(X \geq \mathbb{M}[X]\) are bigger than \(1/2\). In formulae: \(P(X \leq \mathbb{M}[X]) \geq 1/2\) and \(P(X \geq \mathbb{M}[X]) \geq 1/2\).

The problem is immediately apparent: The definition does not give a formula for the median and there is no reason for the median to be unique (note that I only defined what a median is). Let us take a look at the same examples we exercised for the expected value.

  • If \(X\) is constant then the median agrees with the value that \(X\) attains with probability \(1\). Even this trivial example is not as immediate as in the case of the expected value, so let’s give some details: Let’s say the sole value that \(X\) attains is \(x\). Then we certainly have \(P(X \geq x) = 1 \geq 1/2\) and \(P(X \leq x) = 1 \leq 1/2\).
  • If \(X\) is distributed uniformly on its values, we may compute the median by ordering its values. Let \(x_1 < \dotsb < x_k\) be such an ordering of its values. We claim that every real number in the closed interval \(I = [x_{\lfloor(k+1)/2\rfloor}, x_{\lceil(k+1)/2\rceil}]\) is a median of \(X\). Here, \(\lfloor(k+1)/2\rfloor\) denotes the biggest integer smaller than \((k+1)/2\) and \(\lceil(k+1)/2\rceil\) denotes the smallest integer bigger than \((k+1)/2\) (in German, these brackets are known as the lower resp. upper Gauß brackets). If \(k\) is odd, \((k+1)/2\) already is an integer. In this case, \(I\) consists of one single point. Now, let \(x \in I\). Then \(x \geq x_{\lfloor(k+1)/2\rfloor}\) and, thus, \begin{align*} P(X \leq x) &\geq P(X \leq x_{\lfloor(k+1)/2\rfloor})\\ &= p_1 + \dotsb + p_{\lfloor(k+1)/2\rfloor} \\ &= 1/k (\lfloor(k+1)/2\rfloor) \\ &\geq 1/k ( (k+1)/2 – 1/2) = 1/2. \end{align*} The other defining inequality follows similarly. Often, the median is agreed to be the middle point of the interval \(I\) so that one can talk about the median in this case.
  • Let \(X\) be the random variable taking on the value \(0\) with probability \(0.9999\) and \(10^{10}\) with probability \(10^{-4}\). Then \(\mathbb{M}[X] = 0\) is the only median of \(X\). Indeed, \(P(X \geq 0) = 1 \geq 1/2\) and \(P(X \leq 0) = P(X = 0) = 0.9999 \geq 1/2\) so that \(0\) is a median of \(X\). If \(x > 0\), then \(P(X \geq x) = 10^{-4} < 1/2\) and if \(x < 0\), then \(P(X \leq x) = 0 < 1/2\) so that in both cases \(x\) cannot be a median of \(X\).

Comparing the outcome of the last example for the expected value and the median, we see that the median yields a more typical value of the random variable if its distribution is skewed. What if the distribution is symmetric, though?

Definition. Let \(X\) be a random variable. We say that \(X\) is distributed symmetrically onto its values if there is a symmetry point, i.e., a real value \(s\) such that \(P(X \geq s + x) = P(X \leq s – x)\) for all real values \(x\).

Observation. If \(X\) is distributed symmetrically onto its values, then the expected value is a median, i.e. \(\mathbb{E}[X] = \mathbb{M}[X]\).

I would like to prove this observation; actually, I would like to do a little more and give you some insight into the thought process that goes into writing a proof for this. In order to separate my thought process from the sentences that actually end up being written up in the proof, let us agree on this: My thought process will be written in normal weight, while sentences in the proof will be written in cursive. This should come as no surprise, as more often than not proofs tend to be written in dense, short language without any decoration, but let me give you a spoiler here: Much more will be spent on elaborating on my thought process than actually writing down the proof.

Proof. We start with one thing that mathematicians often do to make the proof more readable. We write “without loss of generality we may assume…”, followed by a replacement of the original statement in question with something that, at first, might seem like a special case.

Let me give you more details for the statement at hand. I would like to get rid of the nuisance that is the existence of the symmetry point in the definition of a symmetric distribution. What if I were to replace the random variable \(X\) by \(Y = X – s\)? Well, this shifts the values: \(Y\) takes on the values \(y_1 < \dotsb < y_k\) where \(y_i = x_i – s\). And this makes \(0\) a symmetry point for its values; indeed, \begin{align*}P(Y \geq 0 + x) &= P(X – s \geq x) = P(X \geq s + x) \\ &= P(X \leq s-x) = P(Y \leq 0-x).\end{align*}Okay, this means that we have replaced \(X\), the random variable that has some unknown symmetry point \(s\), by some special random variable \(Y\) that has a symmetry point \(0\) that looks like it could be easier to handle. But what about our task at hand? Let us assume that we have proven the statement for the random variable \(Y\), i.e., that the expected value \(\mathbb{E}[Y]\) is a median of \(Y\). Does this tell us something about \(X\)? Yes: since \(Y = X – s\) and because the expected value is linear, we have \(\mathbb{M}[X-s] = \mathbb{M}[Y] = \mathbb{E}[Y] = \mathbb{E}[X-s] = \mathbb{E}[X] – s\). We are close. If we knew that \(\mathbb{M}[X-s] = \mathbb{M}[X] – s\), it would immediately follow that the expected value of \(X\) is a median. And, in fact, this holds: Let \(m\) be a median of \(X\). Then \(P(X-s \geq m-s) = P(X \geq m) \geq 1/2\) and \(P(X-s \leq m-s) = P(X \leq m) \geq 1/2.\)

This all means that it is sufficient to prove the statement for the special random variable \(Y\) because, as we have shown above, this implies that the statement is also true for \(X\). Because mathematicians do not like to waste letters (or, rather, we like the letter \(X\) quite a lot), we tend to rename \(Y\) and call it \(X\) again. Still with me? We are ready to write the first sentence of our proof:

Without loss of generality, we may assume that \(X\) has \(0\) as its symmetry point.

More often than not, there is no justification given for why no generality has been lost, because after all that internal monologue that led to this sentence it should be obvious (at least to the author). Because we might feel pity on any future reader stumbling upon this first sentence, let us at least leave a small hint for why this replacement is sane:

Without loss of generality, we may assume that \(X\) has \(0\) as its symmetry point (otherwise, replace \(X\) by \(X – s\) where \(s\) is the original symmetry point).

One fact should be emphasized here: A written mathematical proof should not be expected to spell out every small detail to the reader. Rather, a proof should be a blueprint or a guideline that assists the reader in gaining understanding of why the statement is true. How much detail is spelled out depends on the mathematical level of the text the proof is embedded in. It is very much possible that the observation we are proving right now is left as an exercise to the reader if we are reading a text book on advanced probability theory or statistics so that the reader is expected to have some background on those topics.

Okay, let us turn to some computations now. Because the condition on the probabilities in the definition of a symmetric distribution and in the definition of the median look quite similar, let’s try and compute the median first. Since we have assumed that \(0\) is a symmetry point and the condition \(P(X \geq x) = P(X \leq -x)\) holds for all real values \(x\), it has to hold for \(0\), which means \(P(X \geq 0) = P(X \leq 0)\). This looks like it could be useful if we were to prove that \(0\) is a median. Let’s see: Because \(X\) is a real valued random variable, the probability that \(X\) attains any real value is \(1\) (otherwise, something would be very off). We could rephrase that by saying that \(X\) either is a negative real number, or a non-negative real number. Let us write this down in formulas: \(1 = P(X \geq 0 \text{ or } X < 0) = P(X \geq 0) + P(X < 0)\). Here, we have used that the events “\(X\) is negative” and “\(X\) is non-negative” are disjoint so that their probabilites simply add up to the union. Since the probability that \(X\) is negative is smaller than or equal to the probability that \(X\) is non-positive (because potentially \(0\) could be a value \(X\) attains), we see that \(P(X < 0) \leq P(X \leq 0)\). Putting this together with the assumption that \(P(X \leq 0) = P(X \geq 0)\), we obtain \(1 \leq P(X \geq 0) + P(X \leq 0) = 2P(X\geq 0)\), i.e. that \(P(X \leq 0) = P(X \geq 0) \geq 1/2)\) so that \(0\) is a median. This lays down a proof strategy: We may show the assertion somewhat indirectly, namely by showing that both the median and the mean are \(0\). Let us write this down to prepare our reader for the following.

We will show that both the median and the expected value are \(0\).

Since we have already worked out that the median is \(0\), we may pin that down.

We will start by showing that the median is \(0\).

Now, to write computations down more concisely, mathematicians tend to rewrite their whole thought process in reverse order in a single string of equations leading to the desired result. Let me show you how this could be done in this case.

Since \(0\) is a symmetry point, we have \begin{align*}P(X \leq 0) &= P(X \geq 0) \\ &= 1/2 \bigl(P(X \geq 0) + P(X \leq 0)\bigr) \\ &\geq 1/2 \bigl( P(X \geq 0) + P(X < 0)\bigr) = 1/2\end{align*} which shows that \(0\) is a median.

A reader well-versed in basic probability theory can immediately follow these computations so that we leave out any justifications.

Turning to the expected value, we could use its definition and the assumption on the random value but I’m not exactly sure how. Also, my guts tell me that this will be a mess of indices and intransparent computation. Maybe a more abstract argument will work here. Let’s rephrase the fact that \(0\) is a symmetry point slightly: For each real value \(x\) we have \(P(X \leq x) = P(X \geq -x) = P(-X \leq x).\) This means that the functions \(x \mapsto F_X(x) := P(X \leq x)\) and \(x \mapsto F_{-X}(x) := P(-X \leq x)\) agree everywhere. These functions are known as the distribution function of \(X\) resp. \(-X\) and, as their name suggests, encode information about the way a random variable is distributed. This suggests that \(X\) and \(-X\) are identically distributed, i.e. they attain the same values with the same probabilities and this of course means that they share the same expected value. And this is precisely what we need: Because the expected value is linear, \[\mathbb{E}[X] = \mathbb{E}[-X] = -\mathbb{E}[X].\] But this is only possible if \(\mathbb{E}[X] = 0\). If you have never heard of distribution functions before, let me give some hints for how to show that \(X\) and \(-X\) share the same values attained with the same probabilities more directly. Order the values \(x_1 < \dotsb < x_k\) of \(X\). Then for \(x < x_1\), \[P(-X \leq x) = P(X \leq x) = 0.\] This means that \(x_1\) is the smallest value that both \(X\) and \(-X\) attain. And they attain it with equal probability since \[P(-X \leq x_1) = P(X \leq x_1) = p_1.\] Now we can work our way up through the values of \(X\) inductively. If we have already shown that both \(X\) and \(-X\) attain the values \(x_1, \dotsc, x_i\) with the same probability, then \[P(-X \leq x) = P(X \leq x) = p_1 + \dotsb + p_i\] for all \(x_i \leq x < x_{i+1}\) which means that there is no value in between \(x_i\) and \(x_{i+1}\) that \(-X\) attains. And \(-X\) attains \(x_{i+1}\) with probability \(p_{i+1}\) since \[P(-X \leq x_{i+1}) = P(X \leq x_{i+1}) = p_1 + \dotsb + p_{i+1}.\] Okay, now let’s see what ends up being written in the proof:

Turning to the expected value, a rephrasing of \(0\) begin a symmetry point is that \(P(X \leq x) = P(X \geq -x) = P(-X \leq x)\) which means that \(X\) and \(-X\) share the same distribution function. Since this implies that they both attain the same values with the same probability, we have \(\mathbb{E}[X] = \mathbb{E}[-X] = -\mathbb{E}[X]\) so that \(\mathbb{E}[X] = 0\) follows.

Up to now, one could come to the conclusion that a random variable is far from being symmetrically distributed if the expected value and the median stray far from each other (especially when recalling the example with the huge outlier). However, this is not the case. Let us look at a modification of the example with the huge outlier.

Example. Let \(X\) be a random variable attaining the values \(\{0, 10^{10}\}\). Assume that \(0\) is attained with probability \(1/2 + \varepsilon\) where \(\varepsilon > 0\) is some arbitrarily small positive number. Then \(\mathbb{M}[X] = 0\) while \[\mathbb{E}[X] = (1/2 – \varepsilon) 10^{10} \geq 10^9\] at least for \(\varepsilon \leq 2/5\).

The last example shows that the strength of the median is also its weakness: The median only factors in the majority of values a random variable attains even if the majority is only ever-so-slightly bigger than \(1/2\). One fix could be to not only consider the median but also the first and the third quartile: These are real values that split up the values a random variable attains differently; namely, the first quartile \(q_1\) is a real value such that \(P(X \leq q_1) \geq 0.25\) and \(P(X \geq q_1) \geq 0.75\) while the third quartile satisfies \(P(X \leq q_3) \geq 0.75\) and \(P(X \geq q_3) \geq 0.25\). In this context, the median is also known as the second quartile. Even more generally, we could extract the follow invariant.

Definition. Let \(0 \leq \delta \leq 1\) and \(X\) be a random variable. A \(\delta\)-quantile for the random variable \(X\) is a real number \(q_\delta\) such that \(P(X \leq q_\delta) \geq \delta\) and \(P(X \geq q_\delta) \geq 1 – \delta.\)

Thus, the median may also be called a \(1/2\)-quantile. A warning: The notation here might not be standard; for instance, the wikipedia article about quantiles ( talks about \(k\)-th \(q\)-quantiles which we would simply refer to as \(k/q\)-quantiles.

Conclusion. The combination of both invariants gives us a first impression of how well-behaved the distribution is: On the one hand, both the expected value and the median being close to each other might suggest that the distribution is close to being symmetric; on the other hand, both straying far from each other hints at the random variable suffering from outliers. To be sure about this, one should also consider other quantiles.