Hessians
In this post, I wish to introduce the concept of Hessians and explore two theorems regarding the interchangeability of orders of partial derivatives.
Definition: Let $\Omega$ be any open subset of $\mathbb{R}^n$ and let $f:\Omega\rightarrow\mathbb{R}$ be Frechet-differentiable on $\Omega.$ We denote the Frechet-derivative of $f$ at any $x\in \Omega$ as $Df|_{x}\in \mathcal{L}(\mathbb{R}^n,\mathbb{R}).$ Let the function $x\mapsto Df|_{x}$ on $\Omega$ also be Frechet-differentiable at $x_0.$ This is equivalent to the function $x\mapsto \nabla f(x)$ on $\Omega$ being Frechet-differentiable at $x_0.$ We denote the Frechet-derivative of the latter function at $x_0$ as $Hf|_{x_0}.$ Note that $Hf|_{x_0}\in\mathcal{L}(\mathbb{R}^n,\mathbb{R}^n).$ We call the matrix of this linear operator (with respect to the standard basis on $\mathbb{R}^n$) the Hessian of $f$ at $x_0.$ Hence, $$\big[Hf|_{x_0}\big]=\begin{bmatrix}\partial_1\partial_1 f(x_0)&\dots&\partial_n\partial_1 f(x_0)\\ \vdots&\ddots&\vdots\\ \partial_1\partial_n f(x_0)&\dots&\partial_n\partial_nf(x_0)\end{bmatrix}\text{ }(\star)$$
Recall: Let $\mathcal{X}$ and $\mathcal{Y}$ be two real normed vector spaces and let $\Omega\subset\mathcal{X}$ be an open subset. We say that $f:\Omega\rightarrow\mathcal{Y}$ is Frechet-differentiable at $x\in \Omega$ if there exists a bounded linear operator $A\in\mathcal{L}(\mathcal{X},\mathcal{Y})$ such that $$\lim_{h\to 0_{\mathcal{X}}}\frac{||f(x+h)-f(x)-Ah||_{\mathcal{Y}}}{||h||_{\mathcal{X}}}=0$$
$A$ is called the Frechet-derivative of $f$ at $x$ and is denoted $Df|_{x}.$
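To make the definition concrete, here is a small numerical sketch (the test function, the point, and the step size are my own choices, not part of the post): we approximate the matrix $(\star)$ as the Jacobian of $x\mapsto \nabla f(x)$ by central differences and compare the off-diagonal entries with the analytic mixed partial.

```python
import math

# Test function (my choice, not from the post): f(x, y) = x^2 y + sin(xy)
def f(x, y):
    return x * x * y + math.sin(x * y)

# Its analytic gradient, so the Hessian below is literally the Jacobian
# of the map x |-> grad f(x), as in the definition above
def grad_f(x, y):
    return (2 * x * y + y * math.cos(x * y),
            x * x + x * math.cos(x * y))

def hessian(x, y, h=1e-6):
    """Approximate the matrix in (star): entry (i, j) = d_j d_i f(x, y),
    via a central difference of the i-th gradient component in direction j."""
    H = [[0.0, 0.0], [0.0, 0.0]]
    for j, (dx, dy) in enumerate([(h, 0.0), (0.0, h)]):
        gp = grad_f(x + dx, y + dy)
        gm = grad_f(x - dx, y - dy)
        for i in range(2):
            H[i][j] = (gp[i] - gm[i]) / (2 * h)
    return H

H = hessian(0.7, -0.3)
# Analytic mixed partial: d_1 d_2 f = 2x + cos(xy) - xy sin(xy), here at xy = -0.21
exact_mixed = 2 * 0.7 + math.cos(-0.21) - (-0.21) * math.sin(-0.21)
print(H[0][1], H[1][0], exact_mixed)
```

The two off-diagonal entries agree, which is exactly what Theorems 1 and 2 below guarantee for a function this smooth.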
Theorem 1: Let $\Omega$ be any open subset of $\mathbb{R}^n,$ $f:\Omega\rightarrow\mathbb{R}$ be Frechet-differentiable on $\Omega,$ and $x\mapsto \nabla f(x)$ be Frechet-differentiable at some $x_0\in\Omega.$ Then, $\partial_i\partial_j f(x_0)=\partial_j\partial_i f(x_0)$ for all $i,j\in\{1,2,\dots,n\}.$
Theorem 2: Let $\Omega$ be any open subset of $\mathbb{R}^n$ and, for some $i,j\in\{1,2,\dots,n\}$ with $i\neq j,$ let $f:\Omega\rightarrow\mathbb{R}$ be such that $\partial_i f,\partial_j f,$ and $\partial_i\partial_j f$ exist everywhere in $\Omega$ and $x\mapsto \partial_i\partial_j f(x)$ is continuous at some $x_0\in \Omega.$ Then, $\partial_j\partial_i f(x_0)$ exists and equals $\partial_i\partial_j f(x_0).$
Notes about the two theorems: Firstly, neither of the two theorems is stronger than the other. Theorem 2 doesn’t require any information about $\partial_i^2f$ (which makes it “stronger” than theorem 1) but requires continuity of the mixed partial derivative $\partial_i\partial_j f,$ unlike theorem 1 (making it “weaker” than theorem 1). Secondly, when it comes to the proofs, theorem 1 requires some ingenuity (something that I probably wouldn’t have been able to come up with in a timed situation like an exam) while theorem 2 is similar in spirit to the proof of the fact that “existence and continuity of all first-order partial derivatives implies Frechet differentiability”: a clever “mean-value theorem + continuity of derivative” argument. Thirdly, when it comes to usefulness, again, there’s not a clear winner. Personally, I’ve had to flip the order of differentiation in only two cases:
In general distribution theory/Fourier analysis, where one can blindly do whatever, because the functions are $\mathcal{C}^\infty$ so both theorems work.
In physics/computer science applications, where one can blindly do whatever, because … I actually don’t know.
Anyways, before proving the theorems, we state a lemma and define some notations.
Lemma: Let $\Omega$ be any open subset of $\mathbb{R}^n.$ Then, $f:\Omega\rightarrow\mathbb{R}^m$ is Frechet-differentiable at $x\in\Omega$ if and only if $f_i:x\mapsto f(x)^T\mathbf{e}_i$ is Frechet-differentiable at $x$ for all $i\in\{1,2,\dots,m\}.$ Here, $\mathbf{e}_i$ denotes the $i^\text{th}$ standard basis vector in $\mathbb{R}^m.$
Sketch of Proof: Since this is relatively easy to prove, I will only present the main observation needed for the proof. $$\big[Df|_{x}\big]=\begin{bmatrix}\dots \mathbf{v}^T_1\dots\\\dots \mathbf{v}^T_2\dots\\\vdots\\\dots \mathbf{v}^T_m\dots\\\end{bmatrix}\iff \big[{Df_i}|_{x}\big]=\mathbf{v}^T_i\text{ for all }i\in\{1,2,\dots,m\}$$
Notations: Let $\mathcal{X}$ and $\mathcal{Y}$ be any two vector spaces. For any $f\in \mathcal{Y}^\mathcal{X},$ we define $\Delta_{h}f\in \mathcal{Y}^\mathcal{X}$ for any $h\in \mathcal{X}$ as $\Delta_{h}f:x\mapsto f(x+h)-f(x).$
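As a quick sanity check of the notation (the function, the point, and the increments below are my own arbitrary choices), the operators $\Delta_{h\mathbf{e}_i}$ and $\Delta_{h\mathbf{e}_j}$ commute as maps on functions — the elementary identity invoked near the end of the proof of Theorem 1:

```python
def delta(h, f):
    """The difference operator of the post: (Delta_h f)(x) = f(x + h) - f(x),
    with points and increments modelled as tuples in R^n."""
    return lambda x: f(tuple(a + b for a, b in zip(x, h))) - f(x)

# An arbitrary f: R^2 -> R (my choice)
f = lambda x: x[0] ** 3 * x[1] + x[1] ** 2

h1, h2 = (0.25, 0.0), (0.0, 0.25)   # increments h e_1 and h e_2 with h = 0.25
x = (1.0, 2.0)

# Delta_{h e_2} Delta_{h e_1} f = Delta_{h e_1} Delta_{h e_2} f: both expand to
# f(x + h e_1 + h e_2) - f(x + h e_1) - f(x + h e_2) + f(x)
a = delta(h2, delta(h1, f))(x)
b = delta(h1, delta(h2, f))(x)
print(a, b)
```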
At this point we are ready to prove the two theorems.
Proof of Theorem 1: We write $x$ for the point $x_0$ of the theorem. Since $\nabla f(y)^T\mathbf{e}_k=\partial_kf(y)$ for all $y\in\Omega,$ the above lemma implies that $\partial_kf$ is Frechet-differentiable at $x$ for all $k\in\{1,2,\dots,n\}.$ Fix any $i,j\in\{1,2,\dots,n\}.$
Since $\Omega$ is open, there exists a $\delta_0>0$ such that $B(x,\delta_0)\subset \Omega.$ Hence, for any $h\in\mathbb{R}$ such that $0<|h|<\delta_0/\sqrt{2},$ we have $$\left|\frac{\Delta_{h\mathbf{e}_j}\Delta_{h\mathbf{e}_i}f(x)}{h^2}-\partial_i\partial_jf(x)\right|\le \left|\frac{\Delta_{h\mathbf{e}_j}\Delta_{h\mathbf{e}_i}f(x)}{h^2}-\frac{\partial_jf(x+h\mathbf{e}_i)}{h}+\frac{\partial_jf(x)}{h}\right|+\left|\frac{\partial_jf(x+h\mathbf{e}_i)}{h}-\frac{\partial_jf(x)}{h}-\partial_i\partial_jf(x)\right|$$$$=\left|\frac{\Delta_{h\mathbf{e}_j}\Delta_{h\mathbf{e}_i}f(x)-h\partial_jf(x+h\mathbf{e}_i)+h\partial_jf(x)}{h^2}\right|+\left|\frac{\partial_jf(x+h\mathbf{e}_i)-\partial_jf(x)-h\partial_i\partial_jf(x)}{h}\right|$$
Note that all terms are well-defined since $h$ is small enough.
For a fixed $h,$ we define the function $\psi_h:[0,1]\rightarrow\mathbb{R}$ as $$\psi_h(t)=f(x+ht\mathbf{e}_j+h\mathbf{e}_i)-f(x+ht\mathbf{e}_j)-ht\partial_jf(x+h\mathbf{e}_i)+ht\partial_jf(x)$$
Note that $$\psi_h(1)-\psi_h(0)=f(x+h\mathbf{e}_j+h\mathbf{e}_i)-f(x+h\mathbf{e}_j)-h\partial_jf(x+h\mathbf{e}_i)+h\partial_jf(x)-f(x+h\mathbf{e}_i)+f(x)$$$$=\big(f(x+h\mathbf{e}_j+h\mathbf{e}_i)-f(x+h\mathbf{e}_j)\big)-\big(f(x+h\mathbf{e}_i)-f(x)\big)-h\partial_jf(x+h\mathbf{e}_i)+h\partial_jf(x)$$$$=\Delta_{h\mathbf{e}_j}\Delta_{h\mathbf{e}_i}f(x)-h\partial_jf(x+h\mathbf{e}_i)+h\partial_jf(x)\text{ }(\dagger_1)$$
Since $f$ is Frechet-differentiable (here, the mere existence of partial derivatives suffices), we can apply the mean value theorem to conclude that there exists some $m(h)\in (0,1)$ such that $$\psi_h(1)-\psi_h(0)=\psi’_h(m(h))$$$$=h\partial_jf(x+hm(h)\mathbf{e}_j+h\mathbf{e}_i)-h\partial_jf(x+hm(h)\mathbf{e}_j)-h\partial_jf(x+h\mathbf{e}_i)+h\partial_jf(x)\text{ }(\dagger_2)$$
Consider any $\epsilon>0.$
Using the Frechet-differentiability of $\partial_jf$ at $x,$ there exists a $\delta_1>0$ such that $$\left|\frac{\partial_jf(x+k_j\mathbf{e}_j+k_i\mathbf{e}_i)-\partial_jf(x)-k_j\partial^2_jf(x)-k_i\partial_i\partial_jf(x)}{\sqrt{k_j^2+k_i^2}}\right|<\frac{\epsilon}{12\sqrt{2}}\text{ whenever }0<\left|\left|k_j\mathbf{e}_j+k_i\mathbf{e}_i\right|\right|_{\mathbb{R}^n}=\sqrt{k_j^2+k_i^2}<\delta_1$$
Note that in the above step, Frechet-differentiability is necessary (this is the only step where it’s needed). Mere existence of second-order partial derivatives does not suffice since we are computing the directional derivative in the direction $hm(h)\mathbf{e}_j+h\mathbf{e}_i.$
Since $m(h)\in (0,1),$ we must have $\left|\left|hm(h)\mathbf{e}_j+h\mathbf{e}_i\right|\right|_{\mathbb{R}^n}=\sqrt{m(h)^2h^2+h^2}<\delta_1$ if $0<|h|<\delta_1/\sqrt{2}.$ Hence, if $0<|h|<\min\{\delta_0/\sqrt{2},\delta_1/\sqrt{2}\},$ we have $$\left|\frac{\partial_jf(x+hm(h)\mathbf{e}_j+h\mathbf{e}_i)-\partial_jf(x)-hm(h)\partial^2_jf(x)-h\partial_i\partial_jf(x)}{h}\right|<\frac{\epsilon}{12}\text{ }(\star_1)$$
Here, we used the fact that $2h^2\ge m(h)^2h^2+h^2$ meaning $1/|h|\le \sqrt{2}/\sqrt{m(h)^2h^2+h^2}.$
Similarly, there exists $\delta_2,\delta_3>0$ such that $$\left|\frac{\partial_jf(x+hm(h)\mathbf{e}_j)-\partial_jf(x)-hm(h)\partial^2_jf(x)}{h}\right|<\frac{\epsilon}{12}\text{ for }0<|h|<\min\{\delta_0/\sqrt{2},\delta_2\}\text{ }(\star_2)$$$$\left|\frac{\partial_jf(x+h\mathbf{e}_i)-\partial_jf(x)-h\partial_i\partial_jf(x)}{h}\right|<\frac{\epsilon}{12}\text{ for }0<|h|<\min\{\delta_0/\sqrt{2},\delta_3\}\text{ }(\star_3)$$
Note that for $(\star_2)$ and $(\star_3)$ the mere existence of the second-order partial derivatives suffices.
Hence, using $(\dagger_2),$ if $0<|h|<\delta’=\min\{\delta_0/\sqrt{2},\delta_1/\sqrt{2},\delta_2,\delta_3\}$ then, $$\left|\frac{\psi_h(1)-\psi_h(0)}{h^2}\right|=\left|\frac{\partial_jf(x+hm(h)\mathbf{e}_j+h\mathbf{e}_i)-\partial_jf(x+hm(h)\mathbf{e}_j)-\partial_jf(x+h\mathbf{e}_i)+\partial_jf(x)}{h}\right|$$$$=\Bigg|\frac{\partial_jf(x+hm(h)\mathbf{e}_j+h\mathbf{e}_i)-\partial_jf(x)-hm(h)\partial^2_jf(x)-h\partial_i\partial_jf(x)}{h}$$$$-\frac{\partial_jf(x+hm(h)\mathbf{e}_j)-\partial_jf(x)-hm(h)\partial^2_jf(x)}{h}-\frac{\partial_jf(x+h\mathbf{e}_i)-\partial_jf(x)-h\partial_i\partial_jf(x)}{h}\Bigg|<\frac{\epsilon}{4}$$
The last inequality follows from the triangle inequality and $(\star_1),(\star_2),$ and $(\star_3).$ Hence, $(\dagger_1)$ implies that $$\left|\frac{\Delta_{h\mathbf{e}_j}\Delta_{h\mathbf{e}_i}f(x)-h\partial_jf(x+h\mathbf{e}_i)+h\partial_jf(x)}{h^2}\right|<\frac{\epsilon}{4}\text{ for }0<|h|<\delta’\text{ }(\star_4)$$
Using $(\star_3)$ and $(\star_4),$ the inequality at the beginning of the proof implies that $$\left|\frac{\Delta_{h\mathbf{e}_j}\Delta_{h\mathbf{e}_i}f(x)}{h^2}-\partial_i\partial_jf(x)\right|<\frac{\epsilon}{4}+\frac{\epsilon}{12}<\frac{\epsilon}{2}\text{ for }0<|h|<\delta’$$
Similarly, there exists a $\delta’’>0$ such that $$\left|\frac{\Delta_{h\mathbf{e}_i}\Delta_{h\mathbf{e}_j}f(x)}{h^2}-\partial_j\partial_if(x)\right|<\frac{\epsilon}{2}\text{ for }0<|h|<\delta’’$$
However, it is easily verified that $\Delta_{h\mathbf{e}_i}\Delta_{h\mathbf{e}_j}f(x)=\Delta_{h\mathbf{e}_j}\Delta_{h\mathbf{e}_i}f(x).$ Thus, choosing any $h\in\mathbb{R}$ with $0<|h|<\min\{\delta’,\delta’’\}$ shows that $$\left|\partial_i\partial_jf(x)-\partial_j\partial_if(x)\right|\le \left|\partial_i\partial_jf(x)-\frac{\Delta_{h\mathbf{e}_j}\Delta_{h\mathbf{e}_i}f(x)}{h^2}\right|+\left|\frac{\Delta_{h\mathbf{e}_i}\Delta_{h\mathbf{e}_j}f(x)}{h^2}-\partial_j\partial_if(x)\right|<\epsilon$$
Since $\epsilon$ is arbitrary, we must have $\partial_i\partial_jf(x)=\partial_j\partial_if(x).$ This proves the theorem.
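The quantity driving the whole proof is the second difference quotient $\Delta_{h\mathbf{e}_j}\Delta_{h\mathbf{e}_i}f(x)/h^2.$ A numerical sketch of its convergence to $\partial_i\partial_jf(x)$ (the test function and the point are my own choices):

```python
import math

# Test function and point (my choices): d_1 d_2 f(x, y) = e^x cos(y)
f = lambda x, y: math.exp(x) * math.sin(y)
mixed = math.exp(0.5) * math.cos(1.0)

def second_quotient(h, x=0.5, y=1.0):
    # (Delta_{h e_2} Delta_{h e_1} f)(x, y) / h^2
    return (f(x + h, y + h) - f(x + h, y) - f(x, y + h) + f(x, y)) / (h * h)

# The approximation error shrinks as h -> 0, as the limit in the proof predicts
errors = [abs(second_quotient(10.0 ** -k) - mixed) for k in (1, 2, 3)]
print(errors)
```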
Proof of Theorem 2: We first consider the case where $n=2$ with $i=1,j=2$ and then reduce the arbitrary-$n$ case to the simpler case.
Let $x=(x_1,x_2).$ Since $\Omega$ is open, there exists a $\delta>0$ such that $B(x,\delta)\subset\Omega.$ Consider any $h_1,h_2\in\mathbb{R}$ such that $0<|h_1|,|h_2|<\delta/\sqrt{2}.$
Define $\phi_{h_1}:(x_2-\delta/\sqrt{2},x_2+\delta/\sqrt{2})\rightarrow\mathbb{R}$ as $$\phi_{h_1}(t)=f(x_1+h_1,t)-f(x_1,t)$$
This is well-defined since $|t-x_2|<\delta/\sqrt{2}$ so that both $||(x_1,x_2)-(x_1+h_1,t)||_{\mathbb{R}^2}<\delta$ and $||(x_1,x_2)-(x_1,t)||_{\mathbb{R}^2}<\delta$ meaning that both $(x_1+h_1,t),(x_1,t)\in\Omega.$
Since $\partial_2f$ exists everywhere in $\Omega,$ the mean-value theorem implies that there exists a $\beta=\beta(h_1,h_2)$ strictly between $x_2$ and $x_2+h_2$ such that $$\phi_{h_1}(x_2+h_2)-\phi_{h_1}(x_2)=h_2\phi’_{h_1}(\beta)=h_2[\partial_2f(x_1+h_1,\beta)-\partial_2f(x_1,\beta)]$$
Define $\psi_{(h_1,h_2)}:(x_1-\delta/\sqrt{2},x_1+\delta/\sqrt{2})\rightarrow\mathbb{R}$ as $$\psi_{(h_1,h_2)}(t)=\partial_2f(t,\beta)$$
This is well-defined since $|t-x_1|<\delta/\sqrt{2}$ so that $||(x_1,x_2)-(t,\beta)||_{\mathbb{R}^2}=\sqrt{(t-x_1)^2+(\beta-x_2)^2}<\sqrt{(t-x_1)^2+h_2^2}<\delta.$
Again, since $\partial_1\partial_2f$ exists everywhere in $\Omega,$ the mean-value theorem implies that there exists an $\alpha=\alpha(h_1,h_2)$ strictly between $x_1$ and $x_1+h_1$ such that $$\psi_{(h_1,h_2)}(x_1+h_1)-\psi_{(h_1,h_2)}(x_1)=h_1\psi’_{(h_1,h_2)}(\alpha)=h_1\partial_1\partial_2f(\alpha,\beta)$$
Note that $\Delta_{h_2\mathbf{e}_2}\Delta_{h_1\mathbf{e}_1}f(x_1,x_2)=\phi_{h_1}(x_2+h_2)-\phi_{h_1}(x_2).$ Hence, combining all the above steps, we get $$\Delta_{h_2\mathbf{e}_2}\Delta_{h_1\mathbf{e}_1}f(x_1,x_2)=h_1h_2\partial_1\partial_2f(\alpha,\beta)\text{ }(\star)$$
Consider any $\epsilon>0.$
Since $\partial_1\partial_2f$ is continuous at $x=(x_1,x_2),$ there exists a $\delta’>0$ such that $$\left|\partial_1\partial_2f(y)-\partial_1\partial_2f(x)\right|<\epsilon/2\text{ for all }y\in\Omega\text{ with }||y-x||_{\mathbb{R}^2}<\delta’$$
Hence, if $0<|h_1|,|h_2|<\min\{\delta,\delta’\}/\sqrt{2}$ then using the fact that $|\beta-x_2|<|h_2|$ and $|\alpha-x_1|<|h_1|,$ implying that $||(x_1,x_2)-(\alpha,\beta)||_{\mathbb{R}^2}<\delta’$ we get $$\left|\partial_1\partial_2f(\alpha,\beta)-\partial_1\partial_2f(x_1,x_2)\right|<\frac{\epsilon}{2}$$ Using $(\star)$, this means that whenever $0<|h_1|,|h_2|<\min\{\delta,\delta’\}/\sqrt{2},$ we have $$\left|\frac{\Delta_{h_2\mathbf{e}_2}\Delta_{h_1\mathbf{e}_1}f(x_1,x_2)}{h_1h_2}-\partial_1\partial_2f(x_1,x_2)\right|<\frac{\epsilon}{2}\text{ }(\dagger_1)$$
Now, for any $h_2$ satisfying the above constraints, since $\partial_1f$ exists everywhere (so both inner difference quotients below converge to the corresponding values of $\partial_1f$ as $h_1\to 0$), there exists a $\delta_{h_2}>0$ such that $$\left|\frac{\frac{f(x_1+h_1,x_2+h_2)-f(x_1,x_2+h_2)}{h_1}-\frac{f(x_1+h_1,x_2)-f(x_1,x_2)}{h_1}}{h_2}-\frac{\partial_1f(x_1,x_2+h_2)-\partial_1f(x_1,x_2)}{h_2}\right|<\frac{\epsilon}{2}\text{ for }0<|h_1|<\delta_{h_2}\text{ }(\dagger_2)$$
Note that $$\frac{\frac{f(x_1+h_1,x_2+h_2)-f(x_1,x_2+h_2)}{h_1}-\frac{f(x_1+h_1,x_2)-f(x_1,x_2)}{h_1}}{h_2}=\frac{\Delta_{h_2\mathbf{e}_2}\Delta_{h_1\mathbf{e}_1}f(x_1,x_2)}{h_1h_2}\text{ }(\dagger_3)$$
Now, for any $h_2\in\mathbb{R}$ such that $0<|h_2|<\min\{\delta,\delta’\}/\sqrt{2},$ pick an $h_1\in\mathbb{R}$ with $0<|h_1|<\min\{\delta_{h_2},\delta/\sqrt{2},\delta’/\sqrt{2}\}.$ Then, using $(\dagger_1),(\dagger_2),$ and $(\dagger_3),$ we get $$\left|\frac{\partial_1f(x_1,x_2+h_2)-\partial_1f(x_1,x_2)}{h_2}-\partial_1\partial_2f(x_1,x_2)\right|$$$$\le \left|\frac{\partial_1f(x_1,x_2+h_2)-\partial_1f(x_1,x_2)}{h_2}-\frac{\Delta_{h_2\mathbf{e}_2}\Delta_{h_1\mathbf{e}_1}f(x_1,x_2)}{h_1h_2}\right|+\left|\frac{\Delta_{h_2\mathbf{e}_2}\Delta_{h_1\mathbf{e}_1}f(x_1,x_2)}{h_1h_2}-\partial_1\partial_2f(x_1,x_2)\right|<\epsilon$$
This means that $$\lim_{h_2\to 0}\frac{\partial_1f(x_1,x_2+h_2)-\partial_1f(x_1,x_2)}{h_2}=\partial_1\partial_2f(x_1,x_2)$$
Thus, $\partial_2\partial_1f(x_1,x_2)$ exists and equals $\partial_1\partial_2f(x_1,x_2).$
For arbitrary $n$ and any two $i\neq j,$ consider the function $F$ defined on an open neighborhood of $(0,0)$ in $\mathbb{R}^2$ by $$F(u,v)=f(x+u\mathbf{e}_i+v\mathbf{e}_j)$$
Now, the reduction to the simpler case is immediate: $\partial_1F(u,v)=\partial_if(x+u\mathbf{e}_i+v\mathbf{e}_j)$ and $\partial_2F(u,v)=\partial_jf(x+u\mathbf{e}_i+v\mathbf{e}_j)$ wherever the right-hand sides exist, so the hypotheses on $f$ at $x$ transfer to $F$ at $(0,0).$ The $n=2$ case then gives $\partial_j\partial_if(x)=\partial_2\partial_1F(0,0)=\partial_1\partial_2F(0,0)=\partial_i\partial_jf(x).$ This proves our theorem.
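The reduction can be sketched numerically for a concrete $n=3$ example (the function $f,$ the point $x,$ and the indices are my own illustrative choices, not from the post):

```python
import math

# f: R^3 -> R and the point x are my own illustrative choices
f = lambda p: p[0] * p[2] ** 2 + math.cos(p[0] + p[1])
x = (0.3, -0.2, 0.8)
i, j = 0, 2          # 0-based indices: e_1 plays e_i, e_3 plays e_j

# The reduction of the proof: F(u, v) = f(x + u e_i + v e_j)
def F(u, v):
    p = list(x)
    p[i] += u
    p[j] += v
    return f(tuple(p))

def mixed_at_origin(g, h=1e-4):
    # Central second difference quotient for d_1 d_2 g at (0, 0)
    return (g(h, h) - g(h, -h) - g(-h, h) + g(-h, -h)) / (4 * h * h)

# For this f, d_i d_j f(x) = 2 x_3 = 1.6 (the cosine term does not involve x_3)
print(mixed_at_origin(F))
```

The mixed partial of the two-variable slice $F$ at the origin recovers $\partial_i\partial_jf(x),$ which is exactly why the $n=2$ case suffices.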