Lagrange Multipliers

I tutor freshmen in calculus, and someone (I forgot his name; I’m terrible; let’s call him $X$) asked me for the intuition behind Lagrange multipliers (Calc III). I did give an answer, of course, and I hope I did a good job. In this post, I wish to formalize that intuition using some real analysis (at the level of Baby Rudin). If you, Sir $X$, take real analysis later, maybe you’ll like this answer better than the one I gave before.

Definitions and Notations

Definition 1: A function $f:X\to \mathbb{R}$ has a local maximum (or minimum) at $x_0\in X$ if there exists an open neighborhood $U$ of $x_0$ such that $f(x_0)\ge f(x)$ (or $f(x_0)\le f(x)$, respectively) for all $x\in U.$ Also, $f$ is said to have a local extremum at $x_0$ if $f$ has a local maximum or a local minimum at the point.

Definition 2: A function $f:X\to \mathbb{R}$ has a local maximum (or minimum) at $x_0\in X$ subject to the constraint $g=y_0$ where $g:X\to Y$ if $g(x_0)=y_0$ and there exists an open neighborhood $U$ of $x_0$ such that $f(x_0)\ge f(x)$ (or $f(x_0)\le f(x)$, respectively) for all $x\in U$ such that $g(x)=y_0.$ Also, $f$ is said to have a local extremum at $x_0$ subject to the same constraint if $f$ has a local maximum or a local minimum at the point subject to the same constraint.

Notation 1: For any $\mathbf{a}\in\mathbb{R}^m$ and $\mathbf{b}\in\mathbb{R}^n,$ we define $(\mathbf{a},\mathbf{b})=(a_1,a_2,\dots,a_m,b_1,b_2,\dots,b_n)\in\mathbb{R}^{m+n}.$

Notation 2: For any function $f:X\to\mathbb{R}^n,$ define $f_i:X\to\mathbb{R}$ for all $i\in\{1,2,\dots,n\}$ as $f_i:x\mapsto \langle \mathbf{e}_i,f(x)\rangle$ where $\mathbf{e}_i$ is the $i^\text{th}$ standard basis vector in $\mathbb{R}^n.$

Some Prerequisite Knowledge

Besides some basic results from Calculus II, the proof uses material from Baby Rudin. In particular, we need the Implicit Function Theorem.

The Implicit Function Theorem: Let $g$ be a $\mathcal{C}^1$ mapping of an open set $V\subset \mathbb{R}^{m+n}$ to $\mathbb{R}^m$ such that $g(\mathbf{a},\mathbf{b})=\mathbf{0}$ for some $(\mathbf{a},\mathbf{b})\in V$ where $\mathbf{a}\in\mathbb{R}^m$ and $\mathbf{b}\in\mathbb{R}^n.$ Let $A=Dg|_{(\mathbf{a},\mathbf{b})}.$ Define $A_x\in \mathcal{L}(\mathbb{R}^m)$ as $A_x:\mathbf{h}\mapsto A(\mathbf{h},\mathbf{0})$ and $A_y\in \mathcal{L}(\mathbb{R}^n,\mathbb{R}^m)$ as $A_y:\mathbf{k}\mapsto A(\mathbf{0},\mathbf{k}).$ Assume that $A_x$ is invertible. Then, there exist open sets $U\subset V$ and $W\subset \mathbb{R}^n$ with $(\mathbf{a},\mathbf{b})\in U$ and $\mathbf{b}\in W$ such that

(i) To every $\mathbf{y}\in W$ corresponds a unique $\mathbf{x}$ such that $$(\mathbf{x},\mathbf{y})\in U\quad\text{ and }\quad g(\mathbf{x},\mathbf{y})=\mathbf{0}$$

(ii) If this $\mathbf{x}$ is defined to be $\phi(\mathbf{y}),$ then $\phi$ is a $\mathcal{C}^1$ mapping of $W$ to $\mathbb{R}^m,$ $\phi(\mathbf{b})=\mathbf{a}$, and finally $$g(\phi(\mathbf{y}), \mathbf{y})=\mathbf{0}\quad\text{ for all }\quad\mathbf{y}\in W$$

(iii) The Fréchet derivative of $\phi$ at $\mathbf{b}$ is given by $$D\phi|_{\mathbf{b}}=-(A_x)^{-1}A_y$$
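Statement (iii) is easy to sanity-check numerically. Here is a minimal sketch with a toy example of my own (not from Rudin), taking $m=n=1$ and $g(x,y)=x^2+y-1$ at the point $(1,0)$: there $A_x=\partial_x g=2$ and $A_y=\partial_y g=1$, so the theorem predicts $\phi'(0)=-(A_x)^{-1}A_y=-1/2$, and indeed $\phi(y)=\sqrt{1-y}$ solves $g(\phi(y),y)=0$ near $x=1$.

```python
import math

# Toy example (my own, for illustration): g(x, y) = x^2 + y - 1, g(1, 0) = 0.
# At (1, 0): A_x = dg/dx = 2, A_y = dg/dy = 1, so (iii) predicts
# Dphi(0) = -(A_x)^{-1} A_y = -1/2.

def phi(y):
    # The explicit solution of g(phi(y), y) = 0 near x = 1.
    return math.sqrt(1.0 - y)

h = 1e-6
dphi = (phi(h) - phi(-h)) / (2 * h)  # central finite difference at y = 0
print(dphi)  # approximately -0.5
```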

Statement of the Theorem

Main Theorem: Let $f$ be a $\mathcal{C}^1$ mapping of an open set $E\subset \mathbb{R}^{m+n}$ to $\mathbb{R}$ and $g$ be a $\mathcal{C}^1$ mapping of $E$ to $\mathbb{R}^m.$ Suppose, for some $(\mathbf{a},\mathbf{b})\in E$ where $\mathbf{a}\in\mathbb{R}^m$ and $\mathbf{b}\in\mathbb{R}^n,$ we have $\text{rank}(Dg|_{(\mathbf{a},\mathbf{b})})=m.$ If $f$ has a local extremum at $(\mathbf{a},\mathbf{b})$ subject to the constraint $g=\mathbf{c},$ then there exist $\lambda_1,\lambda_2,\dots,\lambda_m\in\mathbb{R}$ such that $$\nabla f(\mathbf{a},\mathbf{b})=\sum_{k=1}^m\lambda_k\nabla g_k(\mathbf{a},\mathbf{b})$$
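Before the proof, a quick numerical sketch of the conclusion on a toy instance of my own (not part of the original argument): maximize $f(x,y)=xy$ subject to $g(x,y)=x+y-1=0$, so $m=n=1$. The constrained maximum sits at $(1/2,1/2)$, where $\nabla f=(1/2,1/2)$ and $\nabla g=(1,1)$, so $\nabla f=\lambda\nabla g$ with $\lambda=1/2$.

```python
# Toy instance: maximize f(x, y) = x*y subject to g(x, y) = x + y - 1 = 0.
# The constrained maximum is at (1/2, 1/2); the theorem then forces
# grad f = lam * grad g for a single multiplier lam.
a, b = 0.5, 0.5
grad_f = (b, a)        # grad f(x, y) = (y, x), evaluated at (a, b)
grad_g = (1.0, 1.0)    # grad g(x, y) = (1, 1)

lam = grad_f[0] / grad_g[0]  # the multiplier from the first coordinate

# Verify the same lam works in every coordinate, as the theorem asserts.
assert all(abs(gf - lam * gg) < 1e-12 for gf, gg in zip(grad_f, grad_g))
print(lam)  # 0.5
```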

Proof of The Theorem

Without loss of generality, assume $\mathbf{c}=\mathbf{0},$ since we can always work with $g-\mathbf{c}$ and all the assumptions about $g$ still hold.

Since $A=Dg|_{(\mathbf{a},\mathbf{b})}$ has rank $m,$ the matrix representation $M=[A]\in\mathbb{R}^{m\times (m+n)}$ of $A$ with respect to the standard basis of $\mathbb{R}^{m+n}$ has full rank $m$. So, $M$ must have $m$ linearly independent columns. By relabeling the co-ordinates of the domain $E\subset \mathbb{R}^{m+n},$ we can assume without loss of generality that the first $m$ columns of $M$ are linearly independent. It is easy to see that the columns of $M'=[A_x]\in\mathbb{R}^{m\times m}$ equal the first $m$ columns of $M.$ Hence, $M'$ is invertible. So, $A_x$ is invertible. $(\star)$

By $(\star)$, we know that $(M')^T$ is also invertible. Hence, there exists a unique vector $\boldsymbol{\lambda}\in\mathbb{R}^m$ such that $(M')^T\boldsymbol{\lambda}=[\partial_1 f(\mathbf{a},\mathbf{b}),\partial_2 f(\mathbf{a},\mathbf{b}),\dots, \partial_m f(\mathbf{a},\mathbf{b})]^T.$ In other words, $$\partial_i f(\mathbf{a},\mathbf{b})=\sum_{k=1}^m\lambda_k \partial_i g_k(\mathbf{a},\mathbf{b})\quad\text{ for all }i\in\{1,2,\dots,m\}\quad(\flat_1)$$
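This step is pure linear algebra, so it can be sketched with made-up numbers (hypothetical values, not tied to any particular $f$ or $g$): pick an invertible $M'$ whose $k^\text{th}$ row lists $\partial_i g_k(\mathbf{a},\mathbf{b})$, pick the first $m$ partials of $f$, and solve $(M')^T\boldsymbol{\lambda}=[\partial_1 f,\dots,\partial_m f]^T$ for the multipliers.

```python
# Hypothetical numbers with m = 2. Row k of M' lists d_i g_k at (a, b);
# grad_x_f lists the first m partials of f at (a, b).
M = [[2.0, 1.0],
     [0.0, 3.0]]
grad_x_f = [4.0, 5.0]

# Solve the 2x2 system (M')^T lambda = grad_x_f by Cramer's rule.
# (M')^T has entries B[i][k] = M[k][i].
a11, a12 = M[0][0], M[1][0]
a21, a22 = M[0][1], M[1][1]
det = a11 * a22 - a12 * a21  # nonzero since M' is invertible
lam1 = (grad_x_f[0] * a22 - a12 * grad_x_f[1]) / det
lam2 = (a11 * grad_x_f[1] - grad_x_f[0] * a21) / det

# Check (flat_1): d_i f = lam1 * d_i g_1 + lam2 * d_i g_2 for i = 1, 2.
for i in range(2):
    assert abs(grad_x_f[i] - (lam1 * M[0][i] + lam2 * M[1][i])) < 1e-12
print(lam1, lam2)  # 2.0 1.0
```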

Without loss of generality assume that the given extremum of $f$ subject to $g=\mathbf{0}$ at $(\mathbf{a},\mathbf{b})$ is a maximum. So, there exists some open neighborhood $V\subset E$ of $(\mathbf{a},\mathbf{b})$ such that $f(\mathbf{x})\le f(\mathbf{a},\mathbf{b})$ for all $\mathbf{x}\in V$ such that $g(\mathbf{x})=\mathbf{0}.$ $(\dagger)$

Applying the Implicit Function Theorem on $g|_{V}:V\to\mathbb{R}^m$ using $(\star)$ and the fact that $\big(Dg|_{V}\big)|_{(\mathbf{a},\mathbf{b})}=Dg|_{(\mathbf{a},\mathbf{b})}=A$, we can find open sets $U\subset V$ and $W\subset \mathbb{R}^n$ with $(\mathbf{a},\mathbf{b})\in U$ and $\mathbf{b}\in W$ and a function $\phi\in\mathcal{C}^1(W,\mathbb{R}^m)$ such that statements (i), (ii), and (iii) in the said theorem are satisfied.

Note that the function $F:W\to\mathbb{R}$ defined by $F:\mathbf{y}\mapsto f(\phi(\mathbf{y}),\mathbf{y})$ has a local maximum at $\mathbf{b}.$ This is because for all $\mathbf{y}\in W,$ we have $(\phi(\mathbf{y}),\mathbf{y})\in U\subset V$ by (i). So, $(\phi(\mathbf{y}),\mathbf{y})\in V$ and $g(\phi(\mathbf{y}),\mathbf{y})=\mathbf{0}$ by (ii). So, by $(\dagger),$ $F(\mathbf{y})=f(\phi(\mathbf{y}),\mathbf{y})\le f(\mathbf{a},\mathbf{b})=f(\phi(\mathbf{b}),\mathbf{b})=F(\mathbf{b}).$

Note that $F$ is $\mathcal{C}^1.$ Since $F$ is differentiable and has a local maximum at $\mathbf{b},$ we have $\nabla F(\mathbf{b})=\mathbf{0}.$ This is easy to prove by applying Theorem 5.8 (Baby Rudin) to the function $F_i:[-\delta,\delta]\to \mathbb{R}$ defined by $F_i:t\mapsto F(\mathbf{b}+t\mathbf{e}_i)$ for sufficiently small $\delta>0,$ where $\mathbf{e}_i$ is the $i^\text{th}$ standard basis vector of $\mathbb{R}^n,$ for each $i\in\{1,2,\dots,n\}.$
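The stationarity of $F$ at $\mathbf{b}$ can be seen numerically on a toy case of my own (hypothetical, with $m=n=1$, not part of the proof): take $f(x,y)=xy$ and $g(x,y)=x+y-1$, so $\phi(y)=1-y$ and the constrained maximum sits at $(\mathbf{a},\mathbf{b})=(1/2,1/2)$; a finite difference of $F(y)=f(\phi(y),y)$ then vanishes at $y=1/2$.

```python
# Toy data: f(x, y) = x*y, g(x, y) = x + y - 1, so phi(y) = 1 - y and
# the constrained maximum is at (a, b) = (1/2, 1/2).
def f(x, y):
    return x * y

def phi(y):
    return 1.0 - y

def F(y):
    # F(y) = f(phi(y), y), exactly as defined in the proof.
    return f(phi(y), y)

b, h = 0.5, 1e-6
dF = (F(b + h) - F(b - h)) / (2 * h)  # central finite difference at y = b
print(abs(dF) < 1e-9)  # True: F is stationary at y = b
```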

Let $\psi:W\to\mathbb{R}^{m+n}$ be defined as $\psi:\mathbf{y}\mapsto (\phi(\mathbf{y}),\mathbf{y}).$ Using the chain rule, we get $DF|_{\mathbf{b}}=Df|_{\psi(\mathbf{b})}D\psi|_{\mathbf{b}}.$ Let $\mathbf{e}_j$ denote the $j^\text{th}$ standard basis vector of $\mathbb{R}^n$ for some $j\in\{1,2,\dots,n\}.$ Then, $$D\psi|_{\mathbf{b}}\mathbf{e}_j=[\partial_j\psi_1(\mathbf{b}),\dots, \partial_j\psi_{m+n}(\mathbf{b})]^T=[\partial_j\phi_1(\mathbf{b}),\dots, \partial_j\phi_{m}(\mathbf{b}),\delta_{1j},\dots, \delta_{nj}]^T$$

Hence, $$DF|_{\mathbf{b}}\mathbf{e}_j=Df|_{\psi(\mathbf{b})}[\partial_j\phi_1(\mathbf{b}),\dots, \partial_j\phi_{m}(\mathbf{b}),\delta_{1j},\dots, \delta_{nj}]^T$$$$=\partial_{m+j}f(\psi(\mathbf{b}))+\sum_{i=1}^m\partial_if(\psi(\mathbf{b}))\partial_j\phi_i(\mathbf{b})=\partial_{m+j}f(\mathbf{a},\mathbf{b})+\sum_{i=1}^m\partial_if(\mathbf{a},\mathbf{b})\partial_j\phi_i(\mathbf{b})$$

Since $\nabla F(\mathbf{b})=[DF|_{\mathbf{b}}]^T=\mathbf{0},$ we conclude that $$\partial_{m+j}f(\mathbf{a},\mathbf{b})+\sum_{i=1}^m\partial_if(\mathbf{a},\mathbf{b})\partial_j\phi_i(\mathbf{b})=0\quad\text{ for all }\quad j\in\{1,2,\dots,n\}\quad(\flat_2)$$

Again, note that $g(\phi(\mathbf{y}),\mathbf{y})=\mathbf{0}$ for all $\mathbf{y}\in W.$ Define $G:W\to\mathbb{R}^{m}$ as $G:\mathbf{y}\mapsto g(\phi(\mathbf{y}),\mathbf{y}).$ Since $G$ is constant on $W,$ $DG|_{\mathbf{b}}$ is the zero transformation. Hence, for any $k\in\{1,2,\dots,m\},$ we have $\nabla G_k(\mathbf{b})=\mathbf{0}.$ Using the same arguments as we applied to $F$ before, $$\partial_{m+j}g_k(\mathbf{a},\mathbf{b})+\sum_{i=1}^m\partial_ig_k(\mathbf{a},\mathbf{b})\partial_j\phi_i(\mathbf{b})=0\quad\text{ for all }\quad j\in\{1,2,\dots,n\}\quad(\flat_3)$$

Hence, for any $j\in\{1,2,\dots,n\},$ we have

$$\partial_{m+j}f(\mathbf{a},\mathbf{b})=-\sum_{i=1}^m\partial_if(\mathbf{a},\mathbf{b})\partial_j\phi_i(\mathbf{b})\quad\text{ by }\quad(\flat_2)$$

$$=-\sum_{i=1}^m\left(\sum_{k=1}^m\lambda_k \partial_i g_k(\mathbf{a},\mathbf{b})\right)\partial_j\phi_i(\mathbf{b})=-\sum_{k=1}^m\lambda_k\left(\sum_{i=1}^m \partial_i g_k(\mathbf{a},\mathbf{b})\partial_j\phi_i(\mathbf{b})\right)\quad\text{ by }\quad(\flat_1)$$

$$=-\sum_{k=1}^m\lambda_k\Big(-\partial_{m+j}g_k(\mathbf{a},\mathbf{b})\Big)=\sum_{k=1}^m\lambda_k\partial_{m+j}g_k(\mathbf{a},\mathbf{b})\quad\text{ by }\quad(\flat_3)$$

This, in combination with $(\flat_1),$ gives $\nabla f(\mathbf{a},\mathbf{b})=\sum_{k=1}^m\lambda_k\nabla g_k(\mathbf{a},\mathbf{b}).$ Hence, proved. $\blacksquare$
