On Optimal Policies for Finite Horizon Markov Decision Problems
This post is in reference to sections 4.3 and 4.4 of Martin L. Puterman’s “Markov Decision Processes: Discrete Stochastic Dynamic Programming.” At the end of section 4.3, the author mentions two points:
1. “The Principle of Optimality”
2. Why it’s okay to only worry about deterministic policies (or, if an optimal history-dependent randomized policy exists, does a Markov deterministic optimal policy also exist?)
Here, I will expand upon them and formalize them as two theorems. The book has theorems labeled 4.3.1 through 4.3.4 in section 4.3 and theorems 4.4.1 through 4.4.3 in section 4.4. So, just because I can, I’ll label the two new theorems 4.3.5 and 4.4.4. Also, the two results are quite easy to prove; I do not claim to make a novel, mind-blowing discovery. This is just for completeness.
Throughout this post, I’ll use the same notation as the book. We will also assume that the state space and the action space are discrete. Before formalizing the first point, I will state two other theorems from section 4.3, which we will use in our proofs.
As a reminder, note that the Bellman optimality equations are
$$u_t(h_t)=\sup_{a\in A_{s_t}}\Big\{r_t(s_t,a)+\sum_{j\in S}p_t(j|s_t,a)u_{t+1}(h_t,a,j)\Big\}\text{ for all }t=1,\dots,N-1\text{ and all }h_t=(h_{t-1},a_{t-1},s_t)\in H_t$$ with the boundary condition $$u_N(h_N)=r_N(s_N)\text{ for all }h_N=(h_{N-1},a_{N-1},s_N)\in H_N$$
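To make the recursion concrete, here is a minimal Python sketch that evaluates these equations by backward recursion over histories. The toy MDP (states, actions, rewards, transition probabilities, and the horizon) is an arbitrary example of my own, not one from the book.

```python
# A toy finite-horizon MDP, chosen arbitrarily for illustration.
S = [0, 1]                       # state space
A = {0: ["a", "b"], 1: ["a"]}    # admissible actions A_s
N = 4                            # decisions are made at t = 1, ..., N-1

def r(t, s, a):                  # stage reward r_t(s, a)
    return 1.0 if (s == 0 and a == "a") else 0.5

def r_terminal(s):               # terminal reward r_N(s)
    return 0.0 if s == 0 else 2.0

def p(t, j, s, a):               # transition probability p_t(j | s, a)
    return 0.9 if j == s else 0.1

def u_star(t, h):
    """Optimal reward-to-go u*_t(h_t), where h = (s_1, a_1, ..., s_t)."""
    s = h[-1]
    if t == N:
        return r_terminal(s)     # boundary condition u_N(h_N) = r_N(s_N)
    # Optimality equation: the sup over A_s is a max here since A_s is finite.
    return max(
        r(t, s, a) + sum(p(t, j, s, a) * u_star(t + 1, h + (a, j)) for j in S)
        for a in A[s]
    )

print({s: u_star(1, (s,)) for s in S})   # u*_1 at each initial state
```

Because the toy rewards and transitions depend on the history only through the current state, the computed values end up depending only on $s_t$ as well, which is exactly what theorem 4.4.2 below makes precise.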
Theorem 4.3.2: Suppose $\{u^*_t\}_{t=1}^N$ is a solution to the optimality equations subject to the boundary conditions. Then
1. $u^*_t(h_t)=\sup_{\pi\in \Pi^\text{HR}} u^{\pi}_{t}(h_t)$ for all $t=1,\dots,N-1$ and all $h_t\in H_t.$
2. $u^*_1(s)=\sup_{\pi\in \Pi^\text{HR}} v^\pi_N(s)$ for all $s\in S.$
Theorem 4.3.3: Suppose $\{u^*_t\}_{t=1}^N$ is a solution to the optimality equations subject to the boundary conditions and that the policy $\pi^*=(d^*_1,d^*_2,\dots,d^*_{N-1})\in \Pi^{\text{HD}}$ satisfies
$$r_t(s_t,d^*_t(h_t))+\sum_{j\in S}p_t(j|s_t,d^*_t(h_t))u^*_{t+1}(h_t,d^*_t(h_t),j)=\sup_{a\in A_{s_t}}\Big\{r_t(s_t,a)+\sum_{j\in S}p_t(j|s_t,a)u^*_{t+1}(h_t,a,j)\Big\}$$
for all $t=1,\dots,N-1$ and all $h_t=(h_{t-1},a_{t-1},s_t)\in H_t.$ Then
1. For each $t=1,\dots,N$, we have $u^{\pi^*}_t(h_t)=u^*_t(h_t)$ for all $h_t\in H_t.$
2. $\pi^*$ is an optimal policy or, equivalently, $v^{\pi^*}_N(s)=v^*_N(s)$ for all $s\in S.$
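Continuing the toy sketch from above (again my own illustration, not an example from the book), a decision rule satisfying the hypothesis of theorem 4.3.3 can be read off by picking an action attaining the supremum at every history; the supremum is attained here because each $A_s$ is finite.

```python
def greedy_decision(t, h):
    """A decision rule d*_t(h_t) attaining the supremum in the optimality
    equation at history h; it exists here because each A_s is finite."""
    s = h[-1]
    return max(
        A[s],
        key=lambda a: r(t, s, a)
        + sum(p(t, j, s, a) * u_star(t + 1, h + (a, j)) for j in S),
    )
```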
Theorem 4.3.5: If $\pi^*\in\Pi^{\text{HR}}$ is an optimal policy and $\{u^*_t\}_{t=1}^N$ is a solution to the Bellman optimality equations subject to the boundary conditions, then $u^{\pi^*}_t(h_t)=u^*_t(h_t)$ for all $t=1,\dots,N$ and all $h_t\in H_t.$
Note: This is different from theorem 4.3.3. In theorem 4.3.5, we wish to show that any policy satisfying conclusion (2) of theorem 4.3.3 (i.e., any optimal policy) must also satisfy conclusion (1). In other words, an optimal policy is composed of optimal sub-policies.
Proof: We will show by induction on $t$ that $u^{\pi^*}_t(h_t)=u^*_t(h_t)$ for all $h_t\in H_t.$ Since $\pi^*=(d_1,\dots, d_{N-1})$ is optimal, the claim obviously holds for $t=1$ (by definition of optimality and theorem 4.3.2 (2)).
Let the claim hold for some $t\in\{1,\dots,N-1\}.$ This is our inductive hypothesis.
Assume, for a contradiction, that the claim does not hold for $t+1.$ Then, there exists some $h_{t+1}=(h_t,a_t,s_{t+1})\in H_{t+1}$ such that $u^{\pi^*}_{t+1}(h_{t+1})<u^*_{t+1}(h_{t+1})$ with $q_{d_t(h_t)}(a_t)\neq 0$ and $p_t(s_{t+1}|s_t,a_t)\neq 0,$ where $h_t=(h_{t-1},a_{t-1},s_t)\in H_t.$
We know that $$u^{\pi^*}_t(h_t)=\sum_{a\in A_{s_t}}q_{d_t(h_t)}(a)\Big(r_t(s_t,a)+\sum_{j\in S}p_t(j|s_t,a)u^{\pi^*}_{t+1}(h_t,a,j)\Big)$$
Now, for all $a\in A_{s_t}$ and all $j\in S,$ we have $u^{\pi^*}_{t+1}(h_t,a,j)\le\sup_{\pi\in \Pi^\text{HR}} u^{\pi}_{t+1}(h_t,a,j)=u^*_{t+1}(h_t,a,j)$ using theorem 4.3.2 (1). In particular, $u^{\pi^*}_{t+1}(h_t,a_t,s_{t+1})=u^{\pi^*}_{t+1}(h_{t+1})<u^*_{t+1}(h_{t+1}).$ Hence, noting that $q_{d_t(h_t)}(a_t)p_t(s_{t+1}|s_t,a_t)\neq 0,$ we have
$$u^{\pi^*}_t(h_t)<\sum_{a\in A_{s_t}}q_{d_t(h_t)}(a)\Big(r_t(s_t,a)+\sum_{j\in S}p_t(j|s_t,a)u^*_{t+1}(h_t,a,j)\Big)$$
$$\le \sup_{a\in A_{s_t}}\Big\{r_t(s_t,a)+\sum_{j\in S}p_t(j|s_t,a)u^*_{t+1}(h_t,a,j)\Big\}=u^*_t(h_t)$$
The second inequality holds because $q_{d_t(h_t)}(\cdot)$ is a probability distribution on $A_{s_t}$ (non-negative weights with $\sum_{a\in A_{s_t}}q_{d_t(h_t)}(a)=1$), so the weighted average cannot exceed the supremum. The chain gives $u^{\pi^*}_t(h_t)<u^*_t(h_t),$ which contradicts our inductive hypothesis. Hence, our assumption is wrong, and the claim must hold for $t+1$ as well.
The theorem thus follows by induction.
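As a numerical sanity check on the toy example above (illustrative only), evaluating the greedy deterministic policy and comparing it with $u^*$ reproduces the equalities asserted by theorems 4.3.3 and 4.3.5:

```python
def u_pi(t, h, d):
    """Reward-to-go u^pi_t(h_t) of a deterministic history-dependent policy,
    given as a function d(t, h) -> action."""
    s = h[-1]
    if t == N:
        return r_terminal(s)
    a = d(t, h)
    return r(t, s, a) + sum(p(t, j, s, a) * u_pi(t + 1, h + (a, j), d) for j in S)

# The greedy policy is optimal (theorem 4.3.3), and its reward-to-go matches
# u*_1 at every initial state (the t = 1 case of theorem 4.3.5).
assert all(abs(u_pi(1, (s,), greedy_decision) - u_star(1, (s,))) < 1e-12 for s in S)
```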
Before formalizing the next point, I will state another theorem from the next section (4.4).
Theorem 4.4.2: Let $\{u^*_t\}_{t=1}^N$ be a solution to the Bellman optimality equations subject to the boundary conditions. Then
1. For all $t=1,\dots,N$, $u^*_t$ depends on $h_t$ only through $s_t$, for all $h_t=(h_{t-1},a_{t-1},s_{t})\in H_t.$
2. For any $\epsilon>0$, there exists an $\epsilon$-optimal policy which is deterministic and Markov.
3. If $\max_{a\in A_{s_t}}\Big\{r_t(s_t,a)+\sum_{j\in S}p_t(j|s_t,a)u^*_{t+1}(h_t,a,j)\Big\}$ exists for all $t=1,\dots,N-1$ and all $h_t=(h_{t-1},a_{t-1},s_{t})\in H_t,$ then there exists an optimal policy which is deterministic and Markov.
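In the toy example (still my own illustration), conclusion (1) justifies tabulating values by $(t, s_t)$ alone, and conclusion (3) then yields a Markov deterministic decision rule by taking maximizing actions over this table:

```python
# Backward induction over (t, state) only, justified by theorem 4.4.2 (1);
# the maximizing actions form a Markov deterministic policy (theorem 4.4.2 (3)).
u = {(N, s): r_terminal(s) for s in S}
d_md = {}                                    # d_md[(t, s)] = action to take
for t in range(N - 1, 0, -1):
    for s in S:
        action_values = {a: r(t, s, a) + sum(p(t, j, s, a) * u[(t + 1, j)] for j in S)
                         for a in A[s]}
        d_md[(t, s)] = max(action_values, key=action_values.get)
        u[(t, s)] = action_values[d_md[(t, s)]]

# Matches the history-based computation from the first sketch.
assert all(abs(u[(1, s)] - u_star(1, (s,))) < 1e-12 for s in S)
```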
Theorem 4.4.4: If there exists an optimal policy, then there also exists a Markov deterministic optimal policy.
Note: All we need to show is that if an optimal $\pi^*\in \Pi^{\text{HR}}$ exists, then an optimal $\pi_*\in \Pi^{\text{MD}}$ exists as well. Our theorem 4.3.5 helps us verify the hypothesis of theorem 4.4.2 (3), which then gives theorem 4.4.4.
Proof: First, note the following fact: if $\lambda_j\in [0,1]$ for all $j\in\mathbb{N}$ with $\sum_{j=1}^\infty \lambda_j =1,$ and $\sum_{j=1}^\infty \lambda_j a_j=\sup_{j\in\mathbb{N}} a_j,$ then $\max_{j\in\mathbb{N}} a_j$ exists, i.e., the supremum is attained. Indeed, writing $s=\sup_{j\in\mathbb{N}} a_j$ (which is finite here), we have $\sum_{j=1}^\infty \lambda_j a_j = s=\sum_{j=1}^\infty \lambda_j s,$ so $\sum_{j=1}^\infty \lambda_j (s-a_j)=0.$ Since every term is non-negative, we must have $\lambda_j=0$ or $a_j=s$ for each $j\in\mathbb{N}.$ Since $\sum_{j=1}^\infty \lambda_j =1,$ there is at least one $k\in\mathbb{N}$ with $\lambda_k>0,$ and hence $a_k=s.$
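As a quick illustration of this fact (an example of my own, not from the book): take $a_j=1-\frac{1}{j},$ whose supremum $1$ is not attained; then for any weights $\lambda_j$ as above, $$\sum_{j=1}^\infty \lambda_j a_j = 1-\sum_{j=1}^\infty \frac{\lambda_j}{j} < 1 = \sup_{j\in\mathbb{N}} a_j,$$ so a weighted average can equal the supremum only when the supremum is attained.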
Let $\pi^*=(d_1,\dots, d_{N-1})\in \Pi^{\text{HR}}$ be an optimal policy. Then, using theorem 4.3.5, we have $u^{\pi^*}_t(h_t)=u^*_t(h_t)$ for all $t=1,\dots,N$ and all $h_t\in H_t.$ Hence, for each $t=1,\dots,N-1$ and $h_t\in H_t$, expanding the left-hand side by the policy evaluation equations and the right-hand side by the optimality equations, $$\sum_{a\in A_{s_t}}q_{d_t(h_t)}(a)\Big(r_t(s_t,a)+\sum_{j\in S}p_t(j|s_t,a)u^{\pi^*}_{t+1}(h_t,a,j)\Big)=\sup_{a\in A_{s_t}}\Big\{r_t(s_t,a)+\sum_{j\in S}p_t(j|s_t,a)u^*_{t+1}(h_t,a,j)\Big\}$$
Again, using theorem 4.3.5, we have $u^{\pi^*}_{t+1}(h_t,a,j)=u^*_{t+1}(h_t,a,j)$ for all $a\in A_{s_t}$ and $j\in S,$ so $$\sum_{a\in A_{s_t}}q_{d_t(h_t)}(a)\Big(r_t(s_t,a)+\sum_{j\in S}p_t(j|s_t,a)u^*_{t+1}(h_t,a,j)\Big)=\sup_{a\in A_{s_t}}\Big\{r_t(s_t,a)+\sum_{j\in S}p_t(j|s_t,a)u^*_{t+1}(h_t,a,j)\Big\}$$
Hence, applying the fact from the beginning of the proof with weights $\lambda_a=q_{d_t(h_t)}(a)$, the supremum on the right is attained, so the hypothesis of theorem 4.4.2 (3) is satisfied. Theorem 4.4.2 (3) therefore gives the existence of an optimal Markov deterministic policy.