Understanding random experiments through deterministic invariants is a central goal of the mathematical branch known as probability theory or stochastics. While the *expected value* (or the *mean*) of a random experiment has a very natural interpretation (as its name suggests, it is the value we expect to occur on average when repeating the experiment) and a simple ad hoc definition as well as an easy generalization once foundations of measure theory have been established, the situation for the *median* is not as nice. Even though the median is more robust when dealing with outliers and, thus, might yield a more convenient substitute for the expected value, I have not encountered a general definition and treatment of it until I was concerned with learning the underlying statistical theory of Machine Learning (as for instance in *The Elements of Machine Learning* by Hastie et. al. or in *Understanding Machine Learning: From Theory to Algorithms* by Shalev-Shwartz and Ben-David). This post gives a basic treatment for discrete random variables and should be accessible to readers with some understanding of elementary discrete math and probability.

## Expected values and medians for discrete random variables

Let $latex X$ be a random variable attaining finitely many real values $latex \{x_1, \dotsc, x_k\} \subset \mathbf{R}$. Denote the probability that $latex X = x_i$ by $latex p_i$. We know that

- $latex 0 \leq p_i \leq 1$ for all $latex i$ and
- $latex \sum_{i=1}^k p_i = p_1 + p_2 + \dotsc + p_k = 1.$

**Definition. **The *expected value* (or the *mean*) of the random variable $latex X$, denoted by $latex \mathbb{E}[X]$ is the weighted average of its values, weighted by their respective probabilities, i.e. $latex \mathbb{E}[X] := \sum_{i=1}^k p_i x_i.$

Let us quickly discuss a couple of examples and some computational rules before moving on to the median.

- If \(X\) is
*constant*, i.e. it only attains one value \(x_1\) with probability \(p_1 = 1\), then the expected value agrees with that one value. - If we assume that $latex p_1 = \dotsc = p_k = 1/k$, i.e. that $latex X$ is
*distributed uniformly on its values*, then the expected value will be the ordinary average of the values $latex \{x_1, \dotsc, x_k\}$. - Let $latex k = 2$ so that $latex X$ takes on only two values. Let the first value be $latex x_1 = 0$ taken on with probability $latex p_1 = 0.9999$ and let the second value be $latex x_2 = 10000000000 = 10^{10}$ taken on with probability $latex p_2 = 1 – p_1 = 0.0001 = 10^{-4}$. By our definition, the expected value of $latex X$ is \begin{align*}\mathbb{E}[X] &= p_1 x_1 + p_2 x_2 \\&= 0.999 \cdot 0 + 10^{-4} \cdot 10^{10} = 10^6.\end{align*} Even though we know with 99.99% probability that $latex X$ will be zero, the expected value is some unexpectedly high number because there is the trace of a chance that $latex X = 10^{10}$. This example is the heart of what we mean by that the expected value is prone to outliers.

**Computational rules.** The following rules are central to the expected value. They may be proved by (more or less) direct computation from the definition.

- The expected value is
*monotone*, i.e. for random variables $latex X, Y$ such that $latex X \geq Y$, we also have $latex \mathbb{E}[X] \geq \mathbb{E}[Y]$. - The expected value is a
*linear functional*, i.e. for real numbers $latex \alpha, \beta \in \mathbb{R}$ and another random variable $latex Y$, $latex \mathbb{E}[\alpha X + \beta Y] = \alpha\mathbb{E}[X] + \beta\mathbb{E}[Y]$.

**Definition. **A *median* of the random variable $latex X$ denoted by $latex \mathbb{M}[X]$ or $latex \mathbf{Median}[X]$ is a real number such that both the probabilities that $latex X \leq \mathbb{M}[X]$ as well as $latex X \geq \mathbb{M}[X]$ are bigger than $latex 1/2$. In formulae: $latex P(X \leq \mathbb{M}[X]) \geq 1/2$ and $latex P(X \geq \mathbb{M}[X]) \geq 1/2$.

The problem is immediately apparent: The definition does not give a formula for the median and there is no reason for the median to be unique (note that I only defined what *a* median is). Let us take a look at the same examples we exercised for the expected value.

- If \(X\) is constant then the median agrees with the value that \(X\) attains with probability \(1\). Even this trivial example is not as immediate as in the case of the expected value, so let’s give some details: Let’s say the sole value that \(X\) attains is \(x\). Then we certainly have \(P(X \geq x) = 1 \geq 1/2\) and \(P(X \leq x) = 1 \leq 1/2\).
- If $latex X$ is distributed uniformly on its values, we may compute the median by ordering its values. Let $latex x_1 < \dotsb < x_k$ be such an ordering of its values. We claim that every real number in the closed interval $latex I = [x_{\lfloor(k+1)/2\rfloor}, x_{\lceil(k+1)/2\rceil}]$ is a median of $latex X$. Here, $latex \lfloor(k+1)/2\rfloor$ denotes the biggest integer smaller than $latex (k+1)/2$ and $latex \lceil(k+1)/2\rceil$ denotes the smallest integer bigger than $latex (k+1)/2$ (in German, these brackets are known as the
*lower resp. upper Gauß brackets*). If $latex k$ is odd, $latex (k+1)/2$ already is an integer. In this case, $latex I$ consists of one single point. Now, let $latex x \in I$. Then $latex x \geq x_{\lfloor(k+1)/2\rfloor}$ and, thus, \begin{align*} P(X \leq x) &\geq P(X \leq x_{\lfloor(k+1)/2\rfloor})\\ &= p_1 + \dotsb + p_{\lfloor(k+1)/2\rfloor} \\ &= 1/k (\lfloor(k+1)/2\rfloor) \\ &\geq 1/k ( (k+1)/2 – 1/2) = 1/2. \end{align*} The other defining inequality follows similarly. Often, the median is agreed to be the middle point of the interval $latex I$ so that one can talk about *the* median in this case. - Let \(X\) be the random variable taking on the value $latex 0$ with probability $latex 0.9999$ and $latex 10^{10}$ with probability $latex 10^{-4}$. Then $latex \mathbb{M}[X] = 0$ is the only median of $latex X$. Indeed, $latex P(X \geq 0) = 1 \geq 1/2$ and $latex P(X \leq 0) = P(X = 0) = 0.9999 \geq 1/2$ so that $latex 0$ is a median of $latex X$. If $latex x > 0$, then $latex P(X \geq x) = 10^{-4} < 1/2$ and if $latex x < 0$, then $latex P(X \leq x) = 0 < 1/2$ so that in both cases $latex x$ cannot be a median of $latex X$.

Comparing the outcome of the last example for the expected value and the median, we see that the median yields a more typical value of the random variable if its distribution is *skewed*. What if the distribution is *symmetric*, though?

**Definition. **Let $latex X$ be a random variable. We say that $latex X$ is *distributed symmetrically onto its values* if there is a *symmetry point, *i.e., a real value $latex s$ such that $latex P(X \geq s + x) = P(X \leq s – x)$ for all real values $latex x$.

**Observation. **If $latex X$ is distributed symmetrically onto its values, then the expected value is a median, i.e. $latex \mathbb{E}[X] = \mathbb{M}[X]$.

I would like to prove this observation; actually, I would like to do a little more and give you some insight into the thought process that goes into writing a proof for this. In order to separate my thought process from the sentences that actually end up being written up in the proof, let us agree on this: My thought process will be written in normal weight, while *sentences in the proof will be written in cursive.* This should come as no surprise, as more often than not proofs tend to be written in dense, short language without any decoration, but let me give you a spoiler here: Much more will be spent on elaborating on my thought process than actually writing down the proof.

**Proof.** We start with one thing that mathematicians often do to make the proof more readable. We write *“without loss of generality we may assume…”*, followed by a replacement of the original statement in question with something that, at first, might seem like a special case.

Let me give you more details for the statement at hand. I would like to get rid of the nuisance that is the existence of the symmetry point in the definition of a symmetric distribution. What if I were to replace the random variable $latex X$ by $latex Y = X – s$? Well, this shifts the values: $latex Y$ takes on the values $latex y_1 < \dotsb < y_k$ where $latex y_i = x_i – s$. And this makes $latex 0$ a symmetry point for its values; indeed, \begin{align*}P(Y \geq 0 + x) &= P(X – s \geq x) = P(X \geq s + x) \\ &= P(X \leq s-x) = P(Y \leq 0-x).\end{align*}Okay, this means that we have replaced $latex X$, the random variable that has some unknown symmetry point $latex s$, by some special random variable $latex Y$ that has a symmetry point $latex 0$ that looks like it could be easier to handle. But what about our task at hand? Let us assume that we have proven the statement for the random variable $latex Y$, i.e., that the expected value $latex \mathbb{E}[Y]$ is a median of $latex Y$. Does this tell us something about $latex X$? Yes: since $latex Y = X – s$ and because the expected value is linear, we have $latex \mathbb{M}[X-s] = \mathbb{M}[Y] = \mathbb{E}[Y] = \mathbb{E}[X-s] = \mathbb{E}[X] – s$. We are close. If we knew that $latex \mathbb{M}[X-s] = \mathbb{M}[X] – s$, it would immediately follow that the expected value of $latex X$ is a median. And, in fact, this holds: Let $latex m$ be a median of $latex X$. Then \(P(X-s \geq m-s) = P(X \geq m) \geq 1/2\) and \(P(X-s \leq m-s) = P(X \leq m) \geq 1/2.\)

This all means that it is sufficient to prove the statement for the special random variable $latex Y$ because, as we have shown above, this implies that the statement is also true for $latex X$. Because mathematicians do not like to waste letters (or, rather, we like the letter $latex X$ quite a lot), we tend to rename $latex Y$ and call it $latex X$ again. Still with me? We are ready to write the first sentence of our proof:

*Without loss of generality, we may assume that $latex X$ has $latex 0$ as its symmetry point.*

More often than not, there is no justification given for why no generality has been lost, because after all that internal monologue that led to this sentence it should be obvious (at least to the author). Because we might feel pity on any future reader stumbling upon this first sentence, let us at least leave a small hint for why this replacement is sane:

*Without loss of generality, we may assume that $latex X$ has $latex 0$ as its symmetry point* *(otherwise, replace $latex X$ by $latex X – s$ where $latex s$ is the original symmetry point).*

One fact should be emphasized here: A written mathematical proof should not be expected to spell out every small detail to the reader. Rather, a proof should be a blueprint or a guideline that assists the reader in gaining understanding of why the statement is true. *How much* detail is spelled out depends on the mathematical level of the text the proof is embedded in. It is very much possible that the observation we are proving right now is left as an exercise to the reader if we are reading a text book on advanced probability theory or statistics so that the reader is expected to have some background on those topics.

Okay, let us turn to some computations now. Because the condition on the probabilities in the definition of a symmetric distribution and in the definition of the median look quite similar, let’s try and compute the median first. Since we have assumed that $latex 0$ is a symmetry point and the condition $latex P(X \geq x) = P(X \leq -x)$ holds for all real values $latex x$, it has to hold for $latex 0$, which means $latex P(X \geq 0) = P(X \leq 0)$. This looks like it could be useful if we were to prove that $latex 0$ is a median. Let’s see: Because $latex X$ is a real valued random variable, the probability that $latex X$ attains *any* real value is $latex 1$ (otherwise, something would be very off). We could rephrase that by saying that $latex X$ either is a negative real number, or a non-negative real number. Let us write this down in formulas: $latex 1 = P(X \geq 0 \text{ or } X < 0) = P(X \geq 0) + P(X < 0)$. Here, we have used that the events “$latex X$ is negative” and “$latex X$ is non-negative” are *disjoint* so that their probabilites simply add up to the union. Since the probability that $latex X$ is negative is smaller than or equal to the probability that $latex X$ is non-positive (because potentially $latex 0$ could be a value $latex X$ attains), we see that $latex P(X < 0) \leq P(X \leq 0)$. Putting this together with the assumption that $latex P(X \leq 0) = P(X \geq 0)$, we obtain $latex 1 \leq P(X \geq 0) + P(X \leq 0) = 2P(X\geq 0)$, i.e. that $latex P(X \leq 0) = P(X \geq 0) \geq 1/2)$ so that $latex 0$ is a median. This lays down a proof strategy: We may show the assertion somewhat indirectly, namely by showing that both the median and the mean are $latex 0$. Let us write this down to prepare our reader for the following.

*We will show that both the median and the expected value are $latex 0$.*

Since we have already worked out that the median is $latex 0$, we may pin that down.

*We will start by showing that the median is $latex 0$.*

Now, to write computations down more concisely, mathematicians tend to rewrite their whole thought process in reverse order in a single string of equations leading to the desired result. Let me show you how this could be done in this case.

*Since $latex 0$ is a symmetry point, we have \begin{align*}P(X \leq 0) &= P(X \geq 0) \\ &= 1/2 \bigl(P(X \geq 0) + P(X \leq 0)\bigr) \\ &\geq 1/2 \bigl( P(X \geq 0) + P(X < 0)\bigr) = 1/2\end{align*} which shows that $latex 0$ is a median.*

A reader well-versed in basic probability theory can immediately follow these computations so that we leave out any justifications.

Turning to the expected value, we could use its definition and the assumption on the random value but I’m not exactly sure how. Also, my guts tell me that this will be a mess of indices and intransparent computation. Maybe a more abstract argument will work here. Let’s rephrase the fact that \(0\) is a symmetry point slightly: For each real value \(x\) we have \(P(X \leq x) = P(X \geq -x) = P(-X \leq x).\) This means that the functions \(x \mapsto F_X(x) := P(X \leq x)\) and \(x \mapsto F_{-X}(x) := P(-X \leq x)\) agree everywhere. These functions are known as the *distribution function of \(X\) resp. \(-X\)* and, as their name suggests, encode information about the way a random variable is distributed. This suggests that \(X\) and \(-X\) are identically distributed, i.e. they attain the same values with the same probabilities and this of course means that they share the same expected value. And this is precisely what we need: Because the expected value is linear, \[\mathbb{E}[X] = \mathbb{E}[-X] = -\mathbb{E}[X].\] But this is only possible if \(\mathbb{E}[X] = 0\). If you have never heard of distribution functions before, let me give some hints for how to show that \(X\) and \(-X\) share the same values attained with the same probabilities more directly. Order the values \(x_1 < \dotsb < x_k\) of \(X\). Then for \(x < x_1\), \[P(-X \leq x) = P(X \leq x) = 0.\] This means that \(x_1\) is the smallest value that both \(X\) and \(-X\) attain. And they attain it with equal probability since \[P(-X \leq x_1) = P(X \leq x_1) = p_1.\] Now we can work our way up through the values of \(X\) inductively. If we have already shown that both \(X\) and \(-X\) attain the values \(x_1, \dotsc, x_i\) with the same probability, then \[P(-X \leq x) = P(X \leq x) = p_1 + \dotsb + p_i\] for all \(x_i \leq x < x_{i+1}\) which means that there is no value in between \(x_i\) and \(x_{i+1}\) that \(-X\) attains. And \(-X\) attains \(x_{i+1}\) with probability \(p_{i+1}\) since \[P(-X \leq x_{i+1}) = P(X \leq x_{i+1}) = p_1 + \dotsb + p_{i+1}.\] Okay, now let’s see what ends up being written in the proof:

*Turning to the expected value, a rephrasing of \(0\) begin a symmetry point is that \(P(X \leq x) = P(X \geq -x) = P(-X \leq x)\) which means that \(X\) and \(-X\) share the same distribution function. Since this implies that they both attain the same values with the same probability, we have \(\mathbb{E}[X] = \mathbb{E}[-X] = -\mathbb{E}[X]\) so that \(\mathbb{E}[X] = 0\) follows.*

Up to now, one could come to the conclusion that a random variable is far from being symmetrically distributed if the expected value and the median stray far from each other (especially when recalling the example with the huge outlier). However, this is not the case. Let us look at a modification of the example with the huge outlier.

**Example.** Let \(X\) be a random variable attaining the values \(\{0, 10^{10}\}\). Assume that \(0\) is attained with probability \(1/2 + \varepsilon\) where \(\varepsilon > 0\) is some arbitrarily small positive number. Then \(\mathbb{M}[X] = 0\) while \[\mathbb{E}[X] = (1/2 – \varepsilon) 10^{10} \geq 10^9\] at least for \(\varepsilon \leq 2/5\).

The last example shows that the strength of the median is also its weakness: The median only factors in the majority of values a random variable attains even if the majority is only ever-so-slightly bigger than \(1/2\). One fix could be to not only consider the median but also the *first and the third quartile*: These are real values that split up the values a random variable attains differently; namely, the first quartile \(q_1\) is a real value such that \(P(X \leq q_1) \geq 0.25\) and \(P(X \geq q_1) \geq 0.75\) while the third quartile satisfies \(P(X \leq q_3) \geq 0.75\) and \(P(X \geq q_3) \geq 0.25\). In this context, the median is also known as the *second quartile*. Even more generally, we could extract the follow invariant.

**Definition.** Let \(0 \leq \delta \leq 1\) and \(X\) be a random variable. A *\(\delta\)-quantile for the random variable \(X\)* is a real number \(q_\delta\) such that \(P(X \leq q_\delta) \geq \delta\) and \(P(X \geq q_\delta) \geq 1 – \delta.\)

Thus, the median may also be called a \(1/2\)-quantile. A warning: The notation here might not be standard; for instance, the wikipedia article about quantiles (https://en.wikipedia.org/wiki/Quantile) talks about *\(k\)-th \(q\)-quantiles* which we would simply refer to as \(k/q\)-quantiles.

**Conclusion.** The combination of both invariants gives us a first impression of how well-behaved the distribution is: On the one hand, both the expected value and the median being close to each other might suggest that the distribution is close to being symmetric; on the other hand, both straying far from each other hints at the random variable suffering from outliers. To be sure about this, one should also consider other quantiles.