Jensen’s Inequality
Theorem: If $X$ is an integrable real-valued random variable on a given probability space and $\phi:\mathbb{R}\to\mathbb{R}$ is a convex function such that $\phi\circ X$ is also integrable, then $$\phi\big(\mathbb{E}[X]\big)\le\mathbb{E}[\phi\circ X]$$
This post is about proving the above theorem. The given inequality is called Jensen’s inequality.
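Before diving into the proof, here is a quick numeric sanity check of the discrete form of the inequality, $\phi\left(\sum_i p_i x_i\right)\le\sum_i p_i\phi(x_i).$ The convex function ($\phi(x)=x^2$) and the finite distribution below are arbitrary choices of mine, just for illustration:

```python
# Sanity check of the discrete form of Jensen's inequality:
# phi(sum_i p_i * x_i) <= sum_i p_i * phi(x_i).
def phi(x):
    """A convex function (chosen for illustration)."""
    return x ** 2

xs = [-2.0, 0.5, 1.0, 3.0]   # support of X
ps = [0.1, 0.4, 0.3, 0.2]    # probabilities (sum to 1)

mean_x = sum(p * x for p, x in zip(ps, xs))          # E[X] = 0.9
mean_phi = sum(p * phi(x) for p, x in zip(ps, xs))   # E[phi(X)] = 2.6

print(phi(mean_x), "<=", mean_phi)   # 0.81 <= 2.6
assert phi(mean_x) <= mean_phi
```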
Motivation: I had just completed a graduate course in real analysis (covering essentially chapters 1 through 6 of Folland) and the following semester enrolled in a learning theory class. I was almost certain I’d have no problem with the math in the course (it’s a CS class after all), and, sure enough, I didn’t. However, there was this one theorem that was being used almost every other day (Jensen’s inequality) that I didn’t know much about. Of course, in this class no one was required to know measure theory, so no one was asked to prove it; we were only asked to prove the discrete version. Anyways, I was stumped. I was supposed to know this! Why didn’t Folland tell me anything about this? (I found out later that Folland did indeed write about it, but only as an exercise: the last problem of chapter 3.) Anyways, after reading some material on convex functions to try to prove the inequality myself (as you can see, my ego was hurt), I managed to prove it. But this was no Eureka moment. I was actually a little disappointed, because I now understood why Folland (and most other analysis books) doesn’t say much about Jensen’s inequality: the proof has NOTHING to do with measure theory. It’s just a property of convex functions (which I didn’t know at the time) that immediately implies Jensen’s inequality. So if you already know that property (Lemma 3 below), proving the inequality shouldn’t take you more than two lines. Naturally, 99% of this post is just proving properties of convex functions. The reason I’m writing this post, then, is to reassure other similarly ego-hurt people that it’s OK: your knowledge of measure theory is probably just fine.
To prove the theorem, we need the following lemmas.
Lemma 1: If $\phi:\mathbb{R}\to\mathbb{R}$ is convex then for any $s,t,s',t'\in \mathbb{R}$ with $s\le s'<t'$ and $s<t\le t',$ we have $$\frac{\phi(t)-\phi(s)}{t-s}\le\frac{\phi(t')-\phi(s')}{t'-s'}$$
Note: This is actually an “if-and-only-if” property of convex functions. We only prove one direction because that’s all we need for the main theorem. Reverse-engineering the following to derive the proof of the other direction should be straightforward.
Proof of Lemma 1: Since $\frac{t-s}{t'-s}\in (0,1],$ we can use the convexity of $\phi$ to get $$\phi(t)=\phi\left(\frac{t-s}{t'-s}(t'-s)+s\right)=\phi\left(\frac{t-s}{t'-s}t'+\left(1-\frac{t-s}{t'-s}\right)s\right)\le \frac{t-s}{t'-s}\phi(t')+\left(1-\frac{t-s}{t'-s}\right)\phi(s)\text{ }(\dagger_1)$$ Similarly, since $\frac{s'-s}{t'-s}\in[0,1),$ we have $$\phi(s')=\phi\left(\frac{s'-s}{t'-s}(t'-s)+s\right)=\phi\left(\frac{s'-s}{t'-s}t'+\left(1-\frac{s'-s}{t'-s}\right)s\right)\le \frac{s'-s}{t'-s}\phi(t')+\left(1-\frac{s'-s}{t'-s}\right)\phi(s)\text{ }(\dagger_2)$$
Using the two inequalities we get
$$\phi(t)-\phi(s)\le \frac{t-s}{t'-s}\phi(t')-\frac{t-s}{t'-s}\phi(s)\text{ (from $(\dagger_1)$)}$$$$\le\frac{t-s}{t'-s}\phi(t')-\frac{t-s}{t'-s}\left[1-\frac{s'-s}{t'-s}\right]^{-1}\left(\phi(s')-\frac{s'-s}{t'-s}\phi(t')\right)\text{ (from $(\dagger_2)$)}$$$$=\frac{t-s}{t'-s}\phi(t')-\frac{t-s}{t'-s}\cdot\frac{t'-s}{t'-s'}\left(\phi(s')-\frac{s'-s}{t'-s}\phi(t')\right)$$$$=\frac{t-s}{t'-s}\phi(t')-\frac{t-s}{t'-s'}\left(\phi(s')-\frac{s'-s}{t'-s}\phi(t')\right)$$$$=\left(\frac{t-s}{t'-s}+\frac{t-s}{t'-s'}\cdot\frac{s'-s}{t'-s}\right)\phi(t')-\frac{t-s}{t'-s'}\phi(s')$$$$=(t-s)\frac{t'-s'+s'-s}{(t'-s)(t'-s')}\phi(t')-\frac{t-s}{t'-s'}\phi(s')=\frac{t-s}{t'-s'}\big(\phi(t')-\phi(s')\big)$$
Thus, rearranging the terms, we get $$\frac{\phi(t)-\phi(s)}{t-s}\le\frac{\phi(t')-\phi(s')}{t'-s'}$$
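As an optional sanity check of Lemma 1 (not part of the proof), the snippet below verifies the chord-slope inequality numerically for the convex function $\exp$; the choice of function and the random test points are mine, purely for illustration:

```python
# Numeric check of Lemma 1: for s <= s' < t' and s < t <= t', the chord slope
# of a convex function over [s, t] is at most its chord slope over [s', t'].
import math
import random

def slope(u, v):
    """Slope of the chord of exp between u and v."""
    return (math.exp(v) - math.exp(u)) / (v - u)

random.seed(0)
for _ in range(10_000):
    a, b, c, d = sorted(random.uniform(-5.0, 5.0) for _ in range(4))
    if c - a < 1e-3 or d - b < 1e-3:
        continue   # skip nearly-degenerate chords to avoid round-off noise
    s, s_prime, t, t_prime = a, b, c, d   # satisfies s <= s' < t' and s < t <= t'
    assert slope(s, t) <= slope(s_prime, t_prime) + 1e-9
print("Lemma 1 held on all sampled configurations.")
```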
Lemma 2: Any convex $\phi:\mathbb{R}\to\mathbb{R}$ is Lipschitz continuous on every compact interval $[a,b].$
Proof of Lemma 2: Consider any compact interval $[a,b].$ Fix an arbitrary $c<a$ and an arbitrary $d>b$ and let $$M=\max\left\{-\frac{\phi(a)-\phi(c)}{a-c},\frac{\phi(d)-\phi(b)}{d-b}\right\}$$
Consider any $x,y\in [a,b]$ with $x\neq y;$ without loss of generality, assume $x<y$ (the difference quotient below is unchanged if $x$ and $y$ are swapped). Then, by Lemma 1 and the definition of $M,$
$$-M\le\frac{\phi(a)-\phi(c)}{a-c}\le \frac{\phi(y)-\phi(x)}{y-x}\le \frac{\phi(d)-\phi(b)}{d-b}\le M$$
Hence, $|\phi(y)-\phi(x)|\le M|y-x|.$ Since $x,y\in[a,b]$ are arbitrary, we conclude that $\phi$ is $M$-Lipschitz continuous on $[a,b].$
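Again as an optional sanity check, the snippet below computes the constant $M$ from the proof for $\phi(x)=x^2$ on $[a,b]=[-1,2]$ (with my arbitrary choices $c=-2$ and $d=3$) and verifies the Lipschitz bound on a grid:

```python
# Numeric check of Lemma 2's Lipschitz bound: with c < a <= b < d and
# M = max( -(phi(a) - phi(c)) / (a - c), (phi(d) - phi(b)) / (d - b) ),
# every difference quotient of phi on [a, b] lies in [-M, M].
def phi(x):
    return x * x

a, b = -1.0, 2.0   # the compact interval [a, b]
c, d = -2.0, 3.0   # auxiliary points with c < a and d > b

M = max(-(phi(a) - phi(c)) / (a - c), (phi(d) - phi(b)) / (d - b))

# Check |phi(y) - phi(x)| <= M * |y - x| on a grid of points in [a, b].
n = 200
grid = [a + (b - a) * i / n for i in range(n + 1)]
for x in grid:
    for y in grid:
        if x != y:
            assert abs(phi(y) - phi(x)) <= M * abs(y - x) + 1e-12
print(f"phi is {M}-Lipschitz on [{a}, {b}] (verified on a grid).")
```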
Lemma 3: For any convex $\phi:\mathbb{R}\to\mathbb{R}$ and $t_0\in \mathbb{R},$ there exists some $\beta\in \mathbb{R}$ such that $\phi(t)-\phi(t_0)\ge\beta(t-t_0)$ for all $t\in\mathbb{R}.$
Proof of Lemma 3: Consider the compact interval $I=[t_0-1,t_0+1].$ By Lemma 2, $\phi$ is Lipschitz continuous on this interval with some Lipschitz constant $M_0\ge 0.$ Now, consider any non-increasing sequence $(t_n)_{n=1}^{\infty}\subset I$ such that $t_n>t_0$ for all $n\in\mathbb N$ and $\lim_{n\to\infty}t_n=t_0.$ Then, for any $n\in\mathbb{N},$ since $t_0<t_{n+1}\le t_{n},$ we can apply Lemma 1 and Lipschitz continuity to conclude
$$-\infty<-M_0\le\frac{\phi(t_{n+1})-\phi(t_0)}{t_{n+1}-t_0}\le\frac{\phi(t_n)-\phi(t_0)}{t_n-t_0}$$
So, using the fact that every non-increasing sequence that is bounded below has a finite limit, $\lim_{n\to\infty}\frac{\phi(t_n)-\phi(t_0)}{t_n-t_0}$ exists and is finite. Moreover, since the difference quotient $t\mapsto\frac{\phi(t)-\phi(t_0)}{t-t_0}$ is non-decreasing on $(t_0,\infty)$ by Lemma 1, this limit equals $\inf_{t>t_0}\frac{\phi(t)-\phi(t_0)}{t-t_0}$ and hence does not depend on the particular sequence chosen; it is the right-hand derivative $\phi'(t_0^+)=\lim_{t\to t_0^+}\frac{\phi(t)-\phi(t_0)}{t-t_0}.$ Similarly, one can show that the left-hand derivative $\phi'(t_0^{-})$ exists.
Consider any $t'<t_0.$ Using Lemma 1, for any $t>t_0,$ we have $$\frac{\phi(t_0)-\phi(t')}{t_0-t'}\le \frac{\phi(t)-\phi(t_0)}{t-t_0}$$
Hence, the limit of the right-hand side as $t$ approaches $t_0$ from the right (which we just showed exists) must also satisfy the inequality. That is, $$\frac{\phi(t_0)-\phi(t')}{t_0-t'}\le \lim_{t\to t_0^+}\frac{\phi(t)-\phi(t_0)}{t-t_0}=\phi'(t_0^+)$$
Now consider any $t'<t_0$ and $t''>t_0.$ Let $(a_n)_{n=1}^{\infty}\subset [t',t_0)$ be any non-decreasing sequence converging to $t_0$ and $(b_n)_{n=1}^{\infty}\subset (t_0,t'']$ be any non-increasing sequence converging to $t_0.$ Then, using repeated applications of Lemma 1 (and the previous display with $t'$ replaced by $a_n$ for the middle inequality), we get $$\frac{\phi(t_0)-\phi(t')}{t_0-t'}\le \lim_{n\to \infty}\frac{\phi(t_0)-\phi(a_n)}{t_0-a_n}=\phi'(t_0^-)\le\phi'(t_0^+)=\lim_{n\to \infty}\frac{\phi(b_n)-\phi(t_0)}{b_n-t_0}\le \frac{\phi(t'')-\phi(t_0)}{t''-t_0}$$
Hence, $\beta=\phi'(t_0^+)$ does the job: multiplying $\phi'(t_0^+)\le\frac{\phi(t'')-\phi(t_0)}{t''-t_0}$ by $t''-t_0>0$ gives $\phi(t'')-\phi(t_0)\ge \phi'(t_0^+)(t''-t_0),$ while multiplying $\frac{\phi(t')-\phi(t_0)}{t'-t_0}=\frac{\phi(t_0)-\phi(t')}{t_0-t'}\le\phi'(t_0^+)$ by $t'-t_0<0$ (which flips the inequality) gives $\phi(t')-\phi(t_0)\ge \phi'(t_0^+)(t'-t_0).$ Hence, the claim holds for all $t>t_0$ (as $t''$ is arbitrary), for all $t<t_0$ (as $t'$ is arbitrary), and obviously for $t=t_0$ itself.
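Lemma 3 says that the graph of a convex function admits a supporting line at every point. Here is a small numeric illustration with $\phi=\exp$, which is differentiable, so $\phi'(t_0^+)$ is simply $e^{t_0}$; the function, the point $t_0$, and the grid of test points are my own choices:

```python
# Numeric check of Lemma 3: the line through (t0, phi(t0)) with slope
# beta = phi'(t0+) stays below the graph of phi everywhere.
import math

t0 = 0.7
beta = math.exp(t0)   # right-hand derivative of exp at t0

for i in range(-500, 501):
    t = i / 50.0      # t sweeps over [-10, 10]
    assert math.exp(t) - math.exp(t0) >= beta * (t - t0) - 1e-9
print(f"y = exp({t0}) + {beta:.4f} * (t - {t0}) supports exp at t0 = {t0}.")
```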
Proof of Jensen's Inequality:
Consider any convex function $\phi:\mathbb{R}\to\mathbb{R}.$ Let $X$ be any real-valued random variable on the probability space $(\Omega,\mathcal{F},\mathbb{P})$ such that both $X$ and $\phi\circ X$ are $L^1$ functions.
Let $t_0=\int_\Omega X(\omega)\,d\mathbb{P}(\omega)=\mathbb{E}[X].$ By Lemma 3, there exists a $\beta\in\mathbb{R}$ such that $\phi(t)-\phi(t_0)\ge\beta(t-t_0)$ for all $t\in\mathbb{R}.$ In particular, taking $t=X(\omega)$ for each $\omega\in\Omega$ and then integrating over $\Omega$ (which preserves the inequality by monotonicity and linearity of the integral), $$\phi\big(X(\omega)\big)-\phi\big(\mathbb{E}[X]\big)\ge \beta\big(X(\omega)-\mathbb{E}[X]\big)$$$$\implies\int_{\Omega}\phi\big(X(\omega)\big)d\mathbb{P}(\omega)-\int_{\Omega}\phi\big(\mathbb{E}[X]\big)d\mathbb{P}(\omega)\ge \int_{\Omega}\beta\big(X(\omega)-\mathbb{E}[X]\big)d\mathbb{P}(\omega)$$$$\implies\int_{\Omega}(\phi\circ X)(\omega)d\mathbb{P}(\omega)-\phi\big(\mathbb{E}[X]\big)\int_{\Omega}d\mathbb{P}(\omega)\ge \beta\left[\int_{\Omega}X(\omega)d\mathbb{P}(\omega)-\mathbb{E}[X]\int_{\Omega}d\mathbb{P}(\omega)\right]$$$$\implies\mathbb{E}[\phi\circ X]-\phi\big(\mathbb{E}[X]\big)\cdot 1\ge \beta\left[\mathbb{E}[X]-\mathbb{E}[X]\cdot 1\right]=0$$
Hence, $\mathbb{E}[\phi\circ X]\ge\phi\big(\mathbb{E}[X]\big).$ This proves Jensen's Inequality.
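For completeness, here is a small Monte Carlo illustration of the theorem; the convex function ($\exp$) and the distribution (uniform on $(-1,2)$) are, again, arbitrary choices of mine:

```python
# Monte Carlo illustration of Jensen's inequality, phi(E[X]) <= E[phi(X)],
# with phi = exp and X ~ Uniform(-1, 2). With finitely many samples this
# actually checks the sample analogue of the inequality, which is itself an
# instance of the discrete (equal-weights) Jensen's inequality, so the
# assertion never fails.
import math
import random

random.seed(0)
samples = [random.uniform(-1.0, 2.0) for _ in range(100_000)]

mean_x = sum(samples) / len(samples)                           # approximates E[X] = 0.5
mean_phi_x = sum(math.exp(x) for x in samples) / len(samples)  # approximates E[exp(X)]

print(f"phi(E[X]) ~ {math.exp(mean_x):.4f}  <=  E[phi(X)] ~ {mean_phi_x:.4f}")
assert math.exp(mean_x) <= mean_phi_x
```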