Lagrange Multipliers

I tutor freshmen in calculus, and someone (I forgot his name; I’m terrible; let’s call him $X$) asked me for the intuition behind Lagrange multipliers (Calc III). I did give an answer, of course, and I hope I did a good job. In this post, I wish to formalize that intuition using some real analysis (at the level of Baby Rudin). If you, Sir $X$, take real analysis later, maybe you’ll like this answer better than the one I gave before.

Definitions and Notations

Definition 1: A function $f:X\to \mathbb{R}$ has a local maximum (or minimum) at $x_0\in X$ if there exists an open neighborhood $U$ of $x_0$ such that $f(x_0)\ge f(x)$ (or $f(x_0)\le f(x)$, respectively) for all $x\in U.$ Also, $f$ is said to have a local extremum at $x_0$ if $f$ has a local maximum or a local minimum at the point.

Definition 2: A function $f:X\to \mathbb{R}$ has a local maximum (or minimum) at $x_0\in X$ subject to the constraint $g=y_0$ where $g:X\to Y$ if $g(x_0)=y_0$ and there exists an open neighborhood $U$ of $x_0$ such that $f(x_0)\ge f(x)$ (or $f(x_0)\le f(x)$, respectively) for all $x\in U$ such that $g(x)=y_0.$ Also, $f$ is said to have a local extremum at $x_0$ subject to the same constraint if $f$ has a local maximum or a local minimum at the point subject to the same constraint.

Notation 1: For any $\mathbf{a}\in\mathbb{R}^m$ and $\mathbf{b}\in\mathbb{R}^n,$ we define $(\mathbf{a},\mathbf{b})=(a_1,a_2,\dots,a_m,b_1,b_2,\dots,b_n)\in\mathbb{R}^{m+n}.$

Notation 2: For any function $f:X\to\mathbb{R}^n,$ define $f_i:X\to\mathbb{R}$ for all $i\in\{1,2,\dots,n\}$ as $f_i:x\mapsto \langle \mathbf{e}_i,f(x)\rangle$ where $\mathbf{e}_i$ is the $i^\text{th}$ standard basis vector in $\mathbb{R}^n.$

Some Prerequisite Knowledge

Besides some basic results from Calculus II, the proof uses material from Baby Rudin. In particular, we need the Implicit Function Theorem.

The Implicit Function Theorem: Let $g$ be a $\mathcal{C}^1$ mapping of an open set $V\subset \mathbb{R}^{m+n}$ to $\mathbb{R}^m$ such that $g(\mathbf{a},\mathbf{b})=\mathbf{0}$ for some $(\mathbf{a},\mathbf{b})\in V$ where $\mathbf{a}\in\mathbb{R}^m$ and $\mathbf{b}\in\mathbb{R}^n.$ Let $A=Dg|_{(\mathbf{a},\mathbf{b})}.$ Define $A_x\in \mathcal{L}(\mathbb{R}^m)$ as $A_x:\mathbf{h}\mapsto A(\mathbf{h},\mathbf{0})$ and $A_y\in \mathcal{L}(\mathbb{R}^n,\mathbb{R}^m)$ as $A_y:\mathbf{k}\mapsto A(\mathbf{0},\mathbf{k}).$ Assume that $A_x$ is invertible. Then, there exist open sets $U\subset V$ and $W\subset \mathbb{R}^n$ with $(\mathbf{a},\mathbf{b})\in U$ and $\mathbf{b}\in W$ such that

(i) To every $\mathbf{y}\in W$ corresponds a unique $\mathbf{x}$ such that $$(\mathbf{x},\mathbf{y})\in U\quad\text{ and }\quad g(\mathbf{x},\mathbf{y})=\mathbf{0}$$

(ii) If this $\mathbf{x}$ is defined to be $\phi(\mathbf{y}),$ then $\phi$ is a $\mathcal{C}^1$ mapping of $W$ to $\mathbb{R}^m,$ $\phi(\mathbf{b})=\mathbf{a}$, and finally $$g(\phi(\mathbf{y}), \mathbf{y})=\mathbf{0}\quad\text{ for all }\quad\mathbf{y}\in W$$

(iii) The Fréchet derivative of $\phi$ at $\mathbf{b}$ is given by $$D\phi|_{\mathbf{b}}=-(A_x)^{-1}A_y$$
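Statement (iii) is easy to sanity-check numerically. Here is a minimal sketch with a toy example of my own (not from Rudin), taking $m=n=1$ and $g(x,y)=x^2+y-1$ at the point $(1,0)$: there $A_x=\partial_x g=2$ and $A_y=\partial_y g=1$, so the theorem predicts $\phi'(0)=-(A_x)^{-1}A_y=-1/2$, and indeed $\phi(y)=\sqrt{1-y}$ solves $g(\phi(y),y)=0$ near $x=1$.

```python
import math

# Toy example (my own, for illustration): g(x, y) = x^2 + y - 1, g(1, 0) = 0.
# At (1, 0): A_x = dg/dx = 2, A_y = dg/dy = 1, so (iii) predicts
# Dphi(0) = -(A_x)^{-1} A_y = -1/2.

def phi(y):
    # The explicit solution of g(phi(y), y) = 0 near x = 1.
    return math.sqrt(1.0 - y)

h = 1e-6
dphi = (phi(h) - phi(-h)) / (2 * h)  # central finite difference at y = 0
print(dphi)  # approximately -0.5
```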

Statement of the Theorem

Main Theorem: Let $f$ be a $\mathcal{C}^1$ mapping of an open set $E\subset \mathbb{R}^{m+n}$ to $\mathbb{R}$ and $g$ be a $\mathcal{C}^1$ mapping of $E$ to $\mathbb{R}^m.$ Suppose, for some $(\mathbf{a},\mathbf{b})\in E$ where $\mathbf{a}\in\mathbb{R}^m$ and $\mathbf{b}\in\mathbb{R}^n,$ we have $\text{rank}(Dg|_{(\mathbf{a},\mathbf{b})})=m.$ If $f$ has a local extremum at $(\mathbf{a},\mathbf{b})$ subject to the constraint $g=\mathbf{c},$ then there exist $\lambda_1,\lambda_2,\dots,\lambda_m\in\mathbb{R}$ such that $$\nabla f(\mathbf{a},\mathbf{b})=\sum_{k=1}^m\lambda_k\nabla g_k(\mathbf{a},\mathbf{b})$$
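Before the proof, a quick numerical sketch of the conclusion on a toy instance of my own (not part of the original argument): maximize $f(x,y)=xy$ subject to $g(x,y)=x+y-1=0$, so $m=n=1$. The constrained maximum sits at $(1/2,1/2)$, where $\nabla f=(1/2,1/2)$ and $\nabla g=(1,1)$, so $\nabla f=\lambda\nabla g$ with $\lambda=1/2$.

```python
# Toy instance: maximize f(x, y) = x*y subject to g(x, y) = x + y - 1 = 0.
# The constrained maximum is at (1/2, 1/2); the theorem then forces
# grad f = lam * grad g for a single multiplier lam.
a, b = 0.5, 0.5
grad_f = (b, a)        # grad f(x, y) = (y, x), evaluated at (a, b)
grad_g = (1.0, 1.0)    # grad g(x, y) = (1, 1)

lam = grad_f[0] / grad_g[0]  # the multiplier from the first coordinate

# Verify the same lam works in every coordinate, as the theorem asserts.
assert all(abs(gf - lam * gg) < 1e-12 for gf, gg in zip(grad_f, grad_g))
print(lam)  # 0.5
```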

Proof of The Theorem

Without loss of generality, assume $\mathbf{c}=\mathbf{0},$ since we can always work with $g-\mathbf{c}$ and all the assumptions about $g$ still hold.

Since $A=Dg|_{(\mathbf{a},\mathbf{b})}$ has rank $m,$ the matrix representation $M=[A]\in\mathbb{R}^{m\times (m+n)}$ of $A$ with respect to the standard basis of $\mathbb{R}^{m+n}$ has full rank $m$. So, $M$ must have $m$ linearly independent columns. By relabeling the co-ordinates of the domain $E\subset \mathbb{R}^{m+n},$ we can assume without loss of generality that the first $m$ columns of $M$ are linearly independent. It is easy to see that the columns of $M'=[A_x]\in\mathbb{R}^{m\times m}$ equal the first $m$ columns of $M.$ Hence, $M'$ is invertible. So, $A_x$ is invertible. $(\star)$

By $(\star)$, we know that $(M')^T$ is also invertible. Hence, there exists a unique vector $\boldsymbol{\lambda}\in\mathbb{R}^m$ such that $(M')^T\boldsymbol{\lambda}=[\partial_1 f(\mathbf{a},\mathbf{b}),\partial_2 f(\mathbf{a},\mathbf{b}),\dots, \partial_m f(\mathbf{a},\mathbf{b})]^T.$ In other words, $$\partial_i f(\mathbf{a},\mathbf{b})=\sum_{k=1}^m\lambda_k \partial_i g_k(\mathbf{a},\mathbf{b})\quad\text{ for all }i\in\{1,2,\dots,m\}\quad(\flat_1)$$
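This step is pure linear algebra, so it can be sketched with made-up numbers (hypothetical values, not tied to any particular $f$ or $g$): pick an invertible $M'$ whose $k^\text{th}$ row lists $\partial_i g_k(\mathbf{a},\mathbf{b})$, pick the first $m$ partials of $f$, and solve $(M')^T\boldsymbol{\lambda}=[\partial_1 f,\dots,\partial_m f]^T$ for the multipliers.

```python
# Hypothetical numbers with m = 2. Row k of M' lists d_i g_k at (a, b);
# grad_x_f lists the first m partials of f at (a, b).
M = [[2.0, 1.0],
     [0.0, 3.0]]
grad_x_f = [4.0, 5.0]

# Solve the 2x2 system (M')^T lambda = grad_x_f by Cramer's rule.
# (M')^T has entries B[i][k] = M[k][i].
a11, a12 = M[0][0], M[1][0]
a21, a22 = M[0][1], M[1][1]
det = a11 * a22 - a12 * a21  # nonzero since M' is invertible
lam1 = (grad_x_f[0] * a22 - a12 * grad_x_f[1]) / det
lam2 = (a11 * grad_x_f[1] - grad_x_f[0] * a21) / det

# Check (flat_1): d_i f = lam1 * d_i g_1 + lam2 * d_i g_2 for i = 1, 2.
for i in range(2):
    assert abs(grad_x_f[i] - (lam1 * M[0][i] + lam2 * M[1][i])) < 1e-12
print(lam1, lam2)  # 2.0 1.0
```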

Without loss of generality assume that the given extremum of $f$ subject to $g=\mathbf{0}$ at $(\mathbf{a},\mathbf{b})$ is a maximum. So, there exists some open neighborhood $V\subset E$ of $(\mathbf{a},\mathbf{b})$ such that $f(\mathbf{x})\le f(\mathbf{a},\mathbf{b})$ for all $\mathbf{x}\in V$ such that $g(\mathbf{x})=\mathbf{0}.$ $(\dagger)$

Applying the Implicit Function Theorem on $g|_{V}:V\to\mathbb{R}^m$ using $(\star)$ and the fact that $\big(Dg|_{V}\big)|_{(\mathbf{a},\mathbf{b})}=Dg|_{(\mathbf{a},\mathbf{b})}=A$, we can find open sets $U\subset V$ and $W\subset \mathbb{R}^n$ with $(\mathbf{a},\mathbf{b})\in U$ and $\mathbf{b}\in W$ and a function $\phi\in\mathcal{C}^1(W,\mathbb{R}^m)$ such that statements (i), (ii), and (iii) in the said theorem are satisfied.

Note that the function $F:W\to\mathbb{R}$ defined by $F:\mathbf{y}\mapsto f(\phi(\mathbf{y}),\mathbf{y})$ has a local maximum at $\mathbf{b}.$ This is because for all $\mathbf{y}\in W,$ we have $(\phi(\mathbf{y}),\mathbf{y})\in U\subset V$ by (i). So, $(\phi(\mathbf{y}),\mathbf{y})\in V$ and $g(\phi(\mathbf{y}),\mathbf{y})=\mathbf{0}$ by (ii). So, by $(\dagger),$ $F(\mathbf{y})=f(\phi(\mathbf{y}),\mathbf{y})\le f(\mathbf{a},\mathbf{b})=f(\phi(\mathbf{b}),\mathbf{b})=F(\mathbf{b}).$

Note that $F$ is $\mathcal{C}^1.$ Since $F$ is differentiable and has a local maximum at $\mathbf{b},$ we have $\nabla F(\mathbf{b})=\mathbf{0}.$ This is easy to prove by applying Theorem 5.8 (Baby Rudin) to the function $F_i:[-\delta,\delta]\to \mathbb{R}$ defined by $F_i:t\mapsto F(\mathbf{b}+t\mathbf{e}_i)$ for sufficiently small $\delta>0,$ where $\mathbf{e}_i$ is the $i^\text{th}$ standard basis vector of $\mathbb{R}^n,$ for each $i\in\{1,2,\dots,n\}.$
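The stationarity of $F$ at $\mathbf{b}$ can be seen numerically on a toy case of my own (hypothetical, with $m=n=1$, not part of the proof): take $f(x,y)=xy$ and $g(x,y)=x+y-1$, so $\phi(y)=1-y$ and the constrained maximum sits at $(\mathbf{a},\mathbf{b})=(1/2,1/2)$; a finite difference of $F(y)=f(\phi(y),y)$ then vanishes at $y=1/2$.

```python
# Toy data: f(x, y) = x*y, g(x, y) = x + y - 1, so phi(y) = 1 - y and
# the constrained maximum is at (a, b) = (1/2, 1/2).
def f(x, y):
    return x * y

def phi(y):
    return 1.0 - y

def F(y):
    # F(y) = f(phi(y), y), exactly as defined in the proof.
    return f(phi(y), y)

b, h = 0.5, 1e-6
dF = (F(b + h) - F(b - h)) / (2 * h)  # central finite difference at y = b
print(abs(dF) < 1e-9)  # True: F is stationary at y = b
```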

Let $\psi:W\to\mathbb{R}^{m+n}$ be defined as $\psi:\mathbf{y}\mapsto (\phi(\mathbf{y}),\mathbf{y}).$ Using the chain rule, we get $DF|_{\mathbf{b}}=Df|_{\psi(\mathbf{b})}D\psi|_{\mathbf{b}}.$ Let $\mathbf{e}_j$ denote the $j^\text{th}$ standard basis vector of $\mathbb{R}^n$ for some $j\in\{1,2,\dots,n\}.$ Then, $$D\psi|_{\mathbf{b}}\mathbf{e}_j=[\partial_j\psi_1(\mathbf{b}),\dots, \partial_j\psi_{m+n}(\mathbf{b})]^T=[\partial_j\phi_1(\mathbf{b}),\dots, \partial_j\phi_{m}(\mathbf{b}),\delta_{1j},\dots, \delta_{nj}]^T$$

Hence, $$DF|_{\mathbf{b}}\mathbf{e}_j=Df|_{\psi(\mathbf{b})}[\partial_j\phi_1(\mathbf{b}),\dots, \partial_j\phi_{m}(\mathbf{b}),\delta_{1j},\dots, \delta_{nj}]^T$$$$=\partial_{m+j}f(\psi(\mathbf{b}))+\sum_{i=1}^m\partial_if(\psi(\mathbf{b}))\partial_j\phi_i(\mathbf{b})=\partial_{m+j}f(\mathbf{a},\mathbf{b})+\sum_{i=1}^m\partial_if(\mathbf{a},\mathbf{b})\partial_j\phi_i(\mathbf{b})$$

Since $\nabla F(\mathbf{b})=[DF|_{\mathbf{b}}]^T=\mathbf{0},$ we conclude that $$\partial_{m+j}f(\mathbf{a},\mathbf{b})+\sum_{i=1}^m\partial_if(\mathbf{a},\mathbf{b})\partial_j\phi_i(\mathbf{b})=0\quad\text{ for all }\quad j\in\{1,2,\dots,n\}\quad(\flat_2)$$

Again, note that $g(\phi(\mathbf{y}),\mathbf{y})=\mathbf{0}$ for all $\mathbf{y}\in W.$ Define $G:W\to\mathbb{R}^{m}$ as $G:\mathbf{y}\mapsto g(\phi(\mathbf{y}),\mathbf{y}).$ Since $G$ is constant on $W,$ $DG|_{\mathbf{b}}$ is the zero transformation. Hence, for any $k\in\{1,2,\dots,m\},$ we have $\nabla G_k(\mathbf{b})=\mathbf{0}.$ Using the same arguments as we applied to $F$ before, $$\partial_{m+j}g_k(\mathbf{a},\mathbf{b})+\sum_{i=1}^m\partial_ig_k(\mathbf{a},\mathbf{b})\partial_j\phi_i(\mathbf{b})=0\quad\text{ for all }\quad j\in\{1,2,\dots,n\}\quad(\flat_3)$$

Hence, for any $j\in\{1,2,\dots,n\},$ we have

$$\partial_{m+j}f(\mathbf{a},\mathbf{b})=-\sum_{i=1}^m\partial_if(\mathbf{a},\mathbf{b})\partial_j\phi_i(\mathbf{b})\quad\text{ by }\quad(\flat_2)$$

$$=-\sum_{i=1}^m\left(\sum_{k=1}^m\lambda_k \partial_i g_k(\mathbf{a},\mathbf{b})\right)\partial_j\phi_i(\mathbf{b})=-\sum_{k=1}^m\lambda_k\left(\sum_{i=1}^m \partial_i g_k(\mathbf{a},\mathbf{b})\partial_j\phi_i(\mathbf{b})\right)\quad\text{ by }\quad(\flat_1)$$

$$=-\sum_{k=1}^m\lambda_k\Big(-\partial_{m+j}g_k(\mathbf{a},\mathbf{b})\Big)=\sum_{k=1}^m\lambda_k\partial_{m+j}g_k(\mathbf{a},\mathbf{b})\quad\text{ by }\quad(\flat_3)$$

This, in combination with $(\flat_1),$ gives $\nabla f(\mathbf{a},\mathbf{b})=\sum_{k=1}^m\lambda_k\nabla g_k(\mathbf{a},\mathbf{b}).$ Hence, proved. $\blacksquare$
