Hessians

In this post, I wish to introduce the concept of Hessians and explore two theorems about when the order of partial derivatives can be interchanged.

Definition: Let $\Omega$ be any open subset of $\mathbb{R}^n$ and let $f:\Omega\rightarrow\mathbb{R}$ be Frechet-differentiable on $\Omega.$ We denote the Frechet-derivative of $f$ at any $x\in \Omega$ as $Df|_{x}\in \mathcal{L}(\mathbb{R}^n,\mathbb{R}).$ Let the function $x\mapsto Df|_{x}$ on $\Omega$ also be Frechet-differentiable at $x_0.$ This is equivalent to the function $x\mapsto \nabla f(x)$ on $\Omega$ being Frechet-differentiable at $x_0.$ We denote the Frechet derivative of the latter function at $x_0$ as $Hf|_{x_0}.$ Note that $Hf|_{x_0}\in\mathcal{L}(\mathbb{R}^n,\mathbb{R}^n).$ We call the matrix of this linear operator (with respect to the standard basis on $\mathbb{R}^n$) the Hessian of $f$ at $x_0.$ Hence, $$\big[Hf|_{x_0}\big]=\begin{bmatrix}\partial_1\partial_1 f(x_0)&\dots&\partial_n\partial_1 f(x_0)\\ \vdots&\ddots&\vdots\\ \partial_1\partial_n f(x_0)&\dots&\partial_n\partial_nf(x_0)\end{bmatrix}\text{ }(\star)$$
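To make $(\star)$ concrete, here is a minimal numerical sketch (assuming NumPy; the test function and the evaluation point are arbitrary choices for illustration) that approximates the entries of the Hessian by central differences. Note that the central difference below is symmetric in $i$ and $j,$ so it cannot distinguish $\partial_i\partial_jf$ from $\partial_j\partial_if$; the theorems below say that, under suitable hypotheses, there is nothing to distinguish.

```python
import numpy as np

def f(x):
    # An arbitrary smooth test function on R^2.
    return np.sin(x[0]) * np.exp(x[1])

def hessian_fd(f, x0, h=1e-4):
    # Approximate the matrix in (star) by central second differences.
    n = x0.size
    H = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            ei, ej = np.eye(n)[i], np.eye(n)[j]
            H[i, j] = (f(x0 + h*ei + h*ej) - f(x0 + h*ei - h*ej)
                       - f(x0 - h*ei + h*ej) + f(x0 - h*ei - h*ej)) / (4 * h * h)
    return H

x0 = np.array([0.3, -0.7])
H = hessian_fd(f, x0)
print(H)
print(np.allclose(H, H.T))  # True: the approximation is numerically symmetric
```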

Recall: Let $\mathcal{X}$ and $\mathcal{Y}$ be two real normed vector spaces and let $\Omega\subset\mathcal{X}$ be an open subset. We say that $f:\Omega\rightarrow\mathcal{Y}$ is Frechet-differentiable at $x\in \Omega$ if there exists a bounded linear operator $A\in\mathcal{L}(\mathcal{X},\mathcal{Y})$ such that $$\lim_{h\to 0_{\mathcal{X}}}\frac{||f(x+h)-f(x)-Ah||_{\mathcal{Y}}}{||h||_{\mathcal{X}}}=0$$

$A$ is called the Frechet-derivative of $f$ at $x$ and is denoted $Df|_{x}.$
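The defining limit can be sanity-checked numerically. Below is a sketch (assuming NumPy; the map $f$ and the point $x$ are arbitrary choices) where $A$ is the hand-computed Jacobian of $f$ at $x$ and the quotient visibly tends to $0$ as $||h||\to 0.$

```python
import numpy as np

def f(x):
    # An arbitrary differentiable map R^2 -> R^2, chosen for illustration.
    return np.array([x[0]**2 + x[1], np.sin(x[0] * x[1])])

def jacobian(x):
    # Hand-computed Jacobian of f at x, playing the role of A = Df|_x.
    c = np.cos(x[0] * x[1])
    return np.array([[2 * x[0], 1.0],
                     [x[1] * c, x[0] * c]])

x = np.array([0.5, 1.2])
A = jacobian(x)
rng = np.random.default_rng(0)
for k in range(1, 6):
    h = rng.standard_normal(2) * 10.0**(-k)  # random directions, shrinking lengths
    ratio = np.linalg.norm(f(x + h) - f(x) - A @ h) / np.linalg.norm(h)
    print(f"||h|| ~ 1e-{k}:  ratio = {ratio:.2e}")  # decreases toward 0
```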

Theorem 1: Let $\Omega$ be any open subset of $\mathbb{R}^n,$ $f:\Omega\rightarrow\mathbb{R}$ be Frechet-differentiable on $\Omega,$ and $x\mapsto \nabla f(x)$ be Frechet-differentiable at some $x_0\in\Omega.$ Then, $\partial_i\partial_j f(x_0)=\partial_j\partial_i f(x_0)$ for all $i,j\in\{1,2,\dots,n\}.$

Theorem 2: Let $\Omega$ be any open subset of $\mathbb{R}^n$ and, for some $i,j\in\{1,2,\dots,n\}$ with $i\neq j,$ let $f:\Omega\rightarrow\mathbb{R}$ be such that $\partial_i f,\partial_j f,$ and $\partial_i\partial_j f$ exist everywhere in $\Omega$ and $x\mapsto \partial_i\partial_j f(x)$ is continuous at some $x_0\in \Omega.$ Then, $\partial_j\partial_i f(x_0)$ exists and equals $\partial_i\partial_j f(x_0).$

Notes about the two theorems: Firstly, neither theorem is stronger than the other. Theorem 2 doesn’t require any information about $\partial_i^2f$ (which makes it “stronger” than theorem 1) but requires continuity of the mixed partial derivative, unlike theorem 1 (making it “weaker” than theorem 1). Secondly, when it comes to the proofs, theorem 1 requires some ingenuity (something that I probably wouldn’t have been able to come up with in a timed situation like an exam) while theorem 2 is similar in spirit to the proof of the fact that “existence and continuity of all first-order partial derivatives implies Frechet differentiability”: a clever “mean-value theorem + continuity of derivative” argument. Thirdly, when it comes to usefulness, again, there’s no clear winner. Personally, I’ve had to flip the order of differentiation in only two cases:

  1. In general distribution theory/Fourier analysis, where one can blindly do whatever, because the functions are $\mathcal{C}^\infty$ so both theorems work.

  2. In physics/computer science applications, where one can blindly do whatever, because … I actually don’t know.

Anyways, before proving the theorems, we state a lemma and define some notations.

Lemma: Let $\Omega$ be any open subset of $\mathbb{R}^n.$ Then, $f:\Omega\rightarrow\mathbb{R}^m$ is Frechet-differentiable at $x\in\Omega$ if and only if $f_i:x\mapsto f(x)^T\mathbf{e}_i$ is Frechet-differentiable at $x$ for all $i\in\{1,2,\dots,m\}.$ Here, $\mathbf{e}_i$ denotes the $i^\text{th}$ standard basis vector in $\mathbb{R}^m.$

Sketch of Proof: Since this is relatively easy to prove, I will only present the main observation needed for the proof. $$\big[Df|_{x}\big]=\begin{bmatrix}\dots \mathbf{v}^T_1\dots\\\dots \mathbf{v}^T_2\dots\\\vdots\\\dots \mathbf{v}^T_m\dots\\\end{bmatrix}\iff \big[{Df_i}|_{x}\big]=\mathbf{v}^T_i\text{ for all }i\in\{1,2,\dots,m\}$$

Notations: Let $\mathcal{X}$ and $\mathcal{Y}$ be any two vector spaces. For any $f\in \mathcal{Y}^\mathcal{X},$ we define $\Delta_{h}f\in \mathcal{Y}^\mathcal{X}$ for any $h\in \mathcal{X}$ as $\Delta_{h}f:x\mapsto f(x+h)-f(x).$
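In code, $\Delta_h$ is just a higher-order function. The sketch below (assuming NumPy; the test function is an arbitrary choice) also checks the identity $\Delta_{h\mathbf{e}_i}\Delta_{h\mathbf{e}_j}f=\Delta_{h\mathbf{e}_j}\Delta_{h\mathbf{e}_i}f,$ which is used near the end of the proof of Theorem 1.

```python
import numpy as np

def delta(h):
    # Return the operator f -> Delta_h f, where (Delta_h f)(x) = f(x + h) - f(x).
    return lambda f: (lambda x: f(x + h) - f(x))

f = lambda x: np.sin(x[0]) * np.exp(x[1])  # arbitrary test function on R^2
x = np.array([0.3, -0.7])
h = 1e-3
e1, e2 = np.eye(2)

d1, d2 = delta(h * e1), delta(h * e2)
lhs = d2(d1(f))(x)  # Delta_{h e_2} Delta_{h e_1} f (x)
rhs = d1(d2(f))(x)  # Delta_{h e_1} Delta_{h e_2} f (x)
# Both expand to f(x + h e_1 + h e_2) - f(x + h e_1) - f(x + h e_2) + f(x).
print(lhs, rhs)
```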

At this point we are ready to prove the two theorems.

Proof of Theorem 1: Write $x$ for $x_0.$ Since $\nabla f(x)^T\mathbf{e}_k=\partial_kf(x)$ for all $x\in\Omega,$ the above lemma implies that $\partial_kf$ is Frechet-differentiable at $x$ for all $k\in\{1,2,\dots,n\}.$ Fix any $i,j\in\{1,2,\dots,n\}.$

Since $\Omega$ is open, there exists a $\delta_0>0$ such that $B(x,\delta_0)\subset \Omega.$ Hence, for any $h\in\mathbb{R}$ such that $0<|h|<\delta_0/\sqrt{2},$ we have $$\left|\frac{\Delta_{h\mathbf{e}_j}\Delta_{h\mathbf{e}_i}f(x)}{h^2}-\partial_i\partial_jf(x)\right|\le \left|\frac{\Delta_{h\mathbf{e}_j}\Delta_{h\mathbf{e}_i}f(x)}{h^2}-\frac{\partial_jf(x+h\mathbf{e}_i)}{h}+\frac{\partial_jf(x)}{h}\right|+\left|\frac{\partial_jf(x+h\mathbf{e}_i)}{h}-\frac{\partial_jf(x)}{h}-\partial_i\partial_jf(x)\right|$$$$=\left|\frac{\Delta_{h\mathbf{e}_j}\Delta_{h\mathbf{e}_i}f(x)-h\partial_jf(x+h\mathbf{e}_i)+h\partial_jf(x)}{h^2}\right|+\left|\frac{\partial_jf(x+h\mathbf{e}_i)-\partial_jf(x)-h\partial_i\partial_jf(x)}{h}\right|$$

Note that all terms are well-defined since $h$ is small enough.

For a fixed $h,$ we define the function $\psi_h:[0,1]\rightarrow\mathbb{R}$ as $$\psi_h(t)=f(x+ht\mathbf{e}_j+h\mathbf{e}_i)-f(x+ht\mathbf{e}_j)-ht\partial_jf(x+h\mathbf{e}_i)+ht\partial_jf(x)$$

Note that $$\psi_h(1)-\psi_h(0)=f(x+h\mathbf{e}_j+h\mathbf{e}_i)-f(x+h\mathbf{e}_j)-h\partial_jf(x+h\mathbf{e}_i)+h\partial_jf(x)-f(x+h\mathbf{e}_i)+f(x)$$$$=\big(f(x+h\mathbf{e}_j+h\mathbf{e}_i)-f(x+h\mathbf{e}_j)\big)-\big(f(x+h\mathbf{e}_i)-f(x)\big)-h\partial_jf(x+h\mathbf{e}_i)+h\partial_jf(x)$$$$=\Delta_{h\mathbf{e}_j}\Delta_{h\mathbf{e}_i}f(x)-h\partial_jf(x+h\mathbf{e}_i)+h\partial_jf(x)\text{ }(\dagger_1)$$

Since $f$ is Frechet-differentiable (here, the mere existence of $\partial_jf$ suffices), we can apply the mean-value theorem to conclude that there exists some $m(h)\in (0,1)$ such that $$\psi_h(1)-\psi_h(0)=\psi’_h(m(h))$$$$=h\partial_jf(x+hm(h)\mathbf{e}_j+h\mathbf{e}_i)-h\partial_jf(x+hm(h)\mathbf{e}_j)-h\partial_jf(x+h\mathbf{e}_i)+h\partial_jf(x)\text{ }(\dagger_2)$$

Consider any $\epsilon>0.$

Using the Frechet-differentiability of $\partial_jf$ at $x,$ there exists a $\delta_1>0$ such that $$\left|\frac{\partial_jf(x+k_j\mathbf{e}_j+k_i\mathbf{e}_i)-\partial_jf(x)-k_j\partial^2_jf(x)-k_i\partial_i\partial_jf(x)}{\sqrt{k_j^2+k_i^2}}\right|<\frac{\epsilon}{12\sqrt{2}}\text{ whenever }0<\left|\left|k_j\mathbf{e}_j+k_i\mathbf{e}_i\right|\right|_{\mathbb{R}^n}=\sqrt{k_j^2+k_i^2}<\delta_1$$

Note that in the above step, Frechet-differentiability is necessary (this is the only step where it is needed). Mere existence of second-order partial derivatives does not suffice, since we are computing the directional derivative in the direction $hm(h)\mathbf{e}_j+h\mathbf{e}_i.$

Since $m(h)\in (0,1),$ we must have $\left|\left|hm(h)\mathbf{e}_j+h\mathbf{e}_i\right|\right|_{\mathbb{R}^n}=\sqrt{m(h)^2h^2+h^2}<\delta_1$ if $0<|h|<\delta_1/\sqrt{2}.$ Hence, if $0<|h|<\min\{\delta_0/\sqrt{2},\delta_1/\sqrt{2}\},$ we have $$\left|\frac{\partial_jf(x+hm(h)\mathbf{e}_j+h\mathbf{e}_i)-\partial_jf(x)-hm(h)\partial^2_jf(x)-h\partial_i\partial_jf(x)}{h}\right|<\frac{\epsilon}{12}\text{ }(\star_1)$$

Here, we used the fact that $2h^2\ge m(h)^2h^2+h^2$ meaning $1/|h|\le \sqrt{2}/\sqrt{m(h)^2h^2+h^2}.$

Similarly, there exists $\delta_2,\delta_3>0$ such that $$\left|\frac{\partial_jf(x+hm(h)\mathbf{e}_j)-\partial_jf(x)-hm(h)\partial^2_jf(x)}{h}\right|<\frac{\epsilon}{12}\text{ for }0<|h|<\min\{\delta_0/\sqrt{2},\delta_2\}\text{ }(\star_2)$$$$\left|\frac{\partial_jf(x+h\mathbf{e}_i)-\partial_jf(x)-h\partial_i\partial_jf(x)}{h}\right|<\frac{\epsilon}{12}\text{ for }0<|h|<\min\{\delta_0/\sqrt{2},\delta_3\}\text{ }(\star_3)$$

Note that for $(\star_2)$ and $(\star_3)$ the mere existence of the second-order partial derivatives suffices.

Hence, using $(\dagger_2),$ if $0<|h|<\delta’=\min\{\delta_0/\sqrt{2},\delta_1/\sqrt{2},\delta_2,\delta_3\},$ then $$\left|\frac{\psi_h(1)-\psi_h(0)}{h^2}\right|=\left|\frac{\partial_jf(x+hm(h)\mathbf{e}_j+h\mathbf{e}_i)-\partial_jf(x+hm(h)\mathbf{e}_j)-\partial_jf(x+h\mathbf{e}_i)+\partial_jf(x)}{h}\right|$$$$=\Bigg|\frac{\partial_jf(x+hm(h)\mathbf{e}_j+h\mathbf{e}_i)-\partial_jf(x)-hm(h)\partial^2_jf(x)-h\partial_i\partial_jf(x)}{h}$$$$-\frac{\partial_jf(x+hm(h)\mathbf{e}_j)-\partial_jf(x)-hm(h)\partial^2_jf(x)}{h}-\frac{\partial_jf(x+h\mathbf{e}_i)-\partial_jf(x)-h\partial_i\partial_jf(x)}{h}\Bigg|<\frac{\epsilon}{4}$$

The last inequality follows from the triangle inequality and $(\star_1),(\star_2),$ and $(\star_3).$ Hence, $(\dagger_1)$ implies that $$\left|\frac{\Delta_{h\mathbf{e}_j}\Delta_{h\mathbf{e}_i}f(x)-h\partial_jf(x+h\mathbf{e}_i)+h\partial_jf(x)}{h^2}\right|<\frac{\epsilon}{4}\text{ for }0<|h|<\delta’\text{ }(\star_4)$$

Using $(\star_3)$ and $(\star_4),$ the inequality at the beginning of the proof implies that $$\left|\frac{\Delta_{h\mathbf{e}_j}\Delta_{h\mathbf{e}_i}f(x)}{h^2}-\partial_i\partial_jf(x)\right|<\frac{\epsilon}{4}+\frac{\epsilon}{12}<\frac{\epsilon}{2}\text{ for }0<|h|<\delta’$$

Similarly, there exists a $\delta’’>0$ such that $$\left|\frac{\Delta_{h\mathbf{e}_i}\Delta_{h\mathbf{e}_j}f(x)}{h^2}-\partial_j\partial_if(x)\right|<\frac{\epsilon}{2}\text{ for }0<|h|<\delta’’$$

However, it is easily verified that $\Delta_{h\mathbf{e}_i}\Delta_{h\mathbf{e}_j}f(x)=\Delta_{h\mathbf{e}_j}\Delta_{h\mathbf{e}_i}f(x).$ Thus, choosing any $h\in\mathbb{R}$ with $0<|h|<\min\{\delta’,\delta’’\}$ shows that $$\left|\partial_i\partial_jf(x)-\partial_j\partial_if(x)\right|\le \left|\partial_i\partial_jf(x)-\frac{\Delta_{h\mathbf{e}_j}\Delta_{h\mathbf{e}_i}f(x)}{h^2}\right|+\left|\frac{\Delta_{h\mathbf{e}_i}\Delta_{h\mathbf{e}_j}f(x)}{h^2}-\partial_j\partial_if(x)\right|<\epsilon$$

Since $\epsilon$ is arbitrary, we must have $\partial_i\partial_jf(x)=\partial_j\partial_if(x).$ This proves the theorem.
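The quantity controlled in the proof is easy to watch numerically: the double difference quotient $\Delta_{h\mathbf{e}_j}\Delta_{h\mathbf{e}_i}f(x)/h^2$ approaches the common value of the two mixed partials. A sketch (assuming NumPy; the test function, with its mixed partial computed by hand, is an arbitrary choice):

```python
import numpy as np

f = lambda x: np.sin(x[0]) * np.exp(x[1])  # arbitrary smooth test function
x = np.array([0.3, -0.7])
e1, e2 = np.eye(2)
mixed = np.cos(x[0]) * np.exp(x[1])        # d1 d2 f(x) = d2 d1 f(x), by hand

for h in [1e-1, 1e-2, 1e-3]:
    dq = (f(x + h*e1 + h*e2) - f(x + h*e1) - f(x + h*e2) + f(x)) / h**2
    print(f"h = {h:.0e}:  quotient = {dq:.6f},  target = {mixed:.6f}")
```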

Proof of Theorem 2: We first consider the case where $n=2$ with $i=1,j=2$ and then reduce the arbitrary-$n$ case to the simpler case.

Let $x=(x_1,x_2).$ Since $\Omega$ is open, there exists a $\delta>0$ such that $B(x,\delta)\subset\Omega.$ Consider any $h_1,h_2\in\mathbb{R}$ such that $0<|h_1|,|h_2|<\delta/\sqrt{2}.$

Define $\phi_{h_1}:(x_2-\delta/\sqrt{2},x_2+\delta/\sqrt{2})\rightarrow\mathbb{R}$ as $$\phi_{h_1}(t)=f(x_1+h_1,t)-f(x_1,t)$$

This is well-defined since $|t-x_2|<\delta/\sqrt{2}$ so that both $||(x_1,x_2)-(x_1+h_1,t)||_{\mathbb{R}^2}<\delta$ and $||(x_1,x_2)-(x_1,t)||_{\mathbb{R}^2}<\delta$ meaning that both $(x_1+h_1,t),(x_1,t)\in\Omega.$

Since $\partial_2f$ exists everywhere in $\Omega,$ the mean-value theorem implies that there exists a $\beta=\beta(h_1,h_2)$ strictly between $x_2$ and $x_2+h_2$ such that $$\phi_{h_1}(x_2+h_2)-\phi_{h_1}(x_2)=h_2\phi’_{h_1}(\beta)=h_2[\partial_2f(x_1+h_1,\beta)-\partial_2f(x_1,\beta)]$$

Define $\psi_{(h_1,h_2)}:(x_1-\delta/\sqrt{2},x_1+\delta/\sqrt{2})\rightarrow\mathbb{R}$ as $$\psi_{(h_1,h_2)}(t)=\partial_2f(t,\beta)$$

This is well-defined since $|t-x_1|<\delta/\sqrt{2}$ so that $||(x_1,x_2)-(t,\beta)||_{\mathbb{R}^2}=\sqrt{(t-x_1)^2+(\beta-x_2)^2}<\sqrt{(t-x_1)^2+h_2^2}<\delta.$

Again, since $\partial_1\partial_2f$ exists everywhere in $\Omega,$ the mean-value theorem implies that there exists an $\alpha=\alpha(h_1,h_2)$ strictly between $x_1$ and $x_1+h_1$ such that $$\psi_{(h_1,h_2)}(x_1+h_1)-\psi_{(h_1,h_2)}(x_1)=h_1\psi’_{(h_1,h_2)}(\alpha)=h_1\partial_1\partial_2f(\alpha,\beta)$$

Note that $\Delta_{h_2\mathbf{e}_2}\Delta_{h_1\mathbf{e}_1}f(x_1,x_2)=\phi_{h_1}(x_2+h_2)-\phi_{h_1}(x_2).$ Hence, combining all the above steps, we get $$\Delta_{h_2\mathbf{e}_2}\Delta_{h_1\mathbf{e}_1}f(x_1,x_2)=h_1h_2\partial_1\partial_2f(\alpha,\beta)\text{ }(\star)$$

Consider any $\epsilon>0.$

Since $\partial_1\partial_2f$ is continuous at $x=(x_1,x_2),$ there exists a $\delta’>0$ such that $$\left|\partial_1\partial_2f(y)-\partial_1\partial_2f(x)\right|<\epsilon/2\text{ for all }y\in\Omega\text{ with }||y-x||_{\mathbb{R}^2}<\delta’$$

Hence, if $0<|h_1|,|h_2|<\min\{\delta,\delta’\}/\sqrt{2},$ then, since $|\beta-x_2|<|h_2|$ and $|\alpha-x_1|<|h_1|$ imply $||(x_1,x_2)-(\alpha,\beta)||_{\mathbb{R}^2}<\delta’,$ we get $$\left|\partial_1\partial_2f(\alpha,\beta)-\partial_1\partial_2f(x_1,x_2)\right|<\frac{\epsilon}{2}$$ Using $(\star)$, this means that whenever $0<|h_1|,|h_2|<\min\{\delta,\delta’\}/\sqrt{2},$ we have $$\left|\frac{\Delta_{h_2\mathbf{e}_2}\Delta_{h_1\mathbf{e}_1}f(x_1,x_2)}{h_1h_2}-\partial_1\partial_2f(x_1,x_2)\right|<\frac{\epsilon}{2}\text{ }(\dagger_1)$$

Now, for any $h_2$ satisfying the above constraints, since $\partial_1f$ exists everywhere, there exists a $\delta_{h_2}>0$ such that $$\left|\frac{\frac{f(x_1+h_1,x_2+h_2)-f(x_1,x_2+h_2)}{h_1}-\frac{f(x_1+h_1,x_2)-f(x_1,x_2)}{h_1}}{h_2}-\frac{\partial_1f(x_1,x_2+h_2)-\partial_1f(x_1,x_2)}{h_2}\right|<\frac{\epsilon}{2}\text{ for }0<|h_1|<\delta_{h_2}\text{ }(\dagger_2)$$

Note that $$\frac{\frac{f(x_1+h_1,x_2+h_2)-f(x_1,x_2+h_2)}{h_1}-\frac{f(x_1+h_1,x_2)-f(x_1,x_2)}{h_1}}{h_2}=\frac{\Delta_{h_2\mathbf{e}_2}\Delta_{h_1\mathbf{e}_1}f(x_1,x_2)}{h_1h_2}\text{ }(\dagger_3)$$

Now, for any $h_2\in\mathbb{R}$ such that $0<|h_2|<\min\{\delta,\delta’\}/\sqrt{2},$ pick an $h_1\in\mathbb{R}$ with $0<|h_1|<\min\{\delta_{h_2},\delta/\sqrt{2},\delta’/\sqrt{2}\}.$ Then, using $(\dagger_1),(\dagger_2),$ and $(\dagger_3),$ we get $$\left|\frac{\partial_1f(x_1,x_2+h_2)-\partial_1f(x_1,x_2)}{h_2}-\partial_1\partial_2f(x_1,x_2)\right|$$$$\le \left|\frac{\partial_1f(x_1,x_2+h_2)-\partial_1f(x_1,x_2)}{h_2}-\frac{\Delta_{h_2\mathbf{e}_2}\Delta_{h_1\mathbf{e}_1}f(x_1,x_2)}{h_1h_2}\right|+\left|\frac{\Delta_{h_2\mathbf{e}_2}\Delta_{h_1\mathbf{e}_1}f(x_1,x_2)}{h_1h_2}-\partial_1\partial_2f(x_1,x_2)\right|<\epsilon$$

This means that $$\lim_{h_2\to 0}\frac{\partial_1f(x_1,x_2+h_2)-\partial_1f(x_1,x_2)}{h_2}=\partial_1\partial_2f(x_1,x_2)$$

Thus, $\partial_2\partial_1f(x_1,x_2)$ exists and equals $\partial_1\partial_2f(x_1,x_2).$
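The limit just established can also be watched numerically. In the sketch below (assuming NumPy; the test function is an arbitrary choice with $\partial_1f$ known in closed form), the difference quotient of $\partial_1f$ in the second variable approaches $\partial_1\partial_2f(x_1,x_2)$:

```python
import numpy as np

d1f = lambda x1, x2: np.cos(x1) * np.exp(x2)  # d1 f for f(x1, x2) = sin(x1) e^{x2}
x1, x2 = 0.3, -0.7
target = np.cos(x1) * np.exp(x2)              # d1 d2 f (x1, x2), computed by hand

for h2 in [1e-1, 1e-2, 1e-3, 1e-4]:
    dq = (d1f(x1, x2 + h2) - d1f(x1, x2)) / h2  # approximates d2 d1 f (x1, x2)
    print(f"h2 = {h2:.0e}:  {dq:.6f}  ->  {target:.6f}")
```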

For arbitrary $n$ and any two $i\neq j,$ consider the function $F$ defined on the open set $\{(u,v)\in\mathbb{R}^2:x_0+u\mathbf{e}_i+v\mathbf{e}_j\in\Omega\}$ as $$F(u,v)=f(x_0+u\mathbf{e}_i+v\mathbf{e}_j)$$

Now, the reduction to the simpler case is trivial: the hypotheses on $\partial_if,\partial_jf,$ and $\partial_i\partial_jf$ translate into the corresponding hypotheses on $\partial_1F,\partial_2F,$ and $\partial_1\partial_2F,$ so applying the two-variable case to $F$ at $(0,0)$ gives $\partial_j\partial_if(x_0)=\partial_2\partial_1F(0,0)=\partial_1\partial_2F(0,0)=\partial_i\partial_jf(x_0).$ This proves our theorem.
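For completeness, the slicing in the reduction is easy to express in code; a sketch (assuming NumPy; the function on $\mathbb{R}^3$ and the indices are arbitrary choices):

```python
import numpy as np

def slice_F(f, x0, i, j):
    # The two-variable slice F(u, v) = f(x0 + u e_i + v e_j) from the reduction.
    n = x0.size
    ei, ej = np.eye(n)[i], np.eye(n)[j]
    return lambda u, v: f(x0 + u * ei + v * ej)

f = lambda x: np.sin(x[0]) * np.exp(x[2])  # arbitrary function on R^3
F = slice_F(f, np.array([0.3, 0.0, -0.7]), 0, 2)
h = 1e-4
# Double difference quotient of F at (0, 0), approximately cos(0.3) e^{-0.7}:
print((F(h, h) - F(h, 0) - F(0, h) + F(0, 0)) / h**2)
```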
