Dynamic Programming Proofs
1 Probability Spaces and Expectation
In this section, we define the basic probability concepts on finite sets.
1.1 Definitions
This document omits basic intuitive definitions, such as comparison and arithmetic operations on random variables; arithmetic operations are performed element-wise on the sample set. Please see the Lean file for complete details.
A finite probability measure \(p\colon \Omega \to \mathbb {R}_+\) on a finite set \(\Omega \) is any function that satisfies
The set of finite probability measures \(\Delta (\Omega )\) for a finite \(\Omega \) is defined as
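For reference, these conditions and the set \(\Delta (\Omega )\) take the following standard form (a reconstruction consistent with the surrounding text; the Lean file is authoritative):

```latex
p(\omega) \ge 0 \quad \forall \omega \in \Omega,
\qquad
\sum_{\omega \in \Omega} p(\omega) = 1,
\qquad
\Delta(\Omega) := \Bigl\{ p \colon \Omega \to \mathbb{R}_+ \Bigm| \sum_{\omega \in \Omega} p(\omega) = 1 \Bigr\}.
```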
A finite probability space is \(P = (\Omega , p)\), where \(\Omega \) is a finite set referred to as the sample set, \(p\in \Delta (\Omega )\), and the \(\sigma \)-algebra is \(2^{\Omega }\).
A random variable defined on a finite probability space \(P\) is a mapping \(\tilde{x}\colon \Omega \to \mathbb {R}\).
For the remainder of section 1, we assume that \(P = (\Omega , p)\) is a finite probability space. All random variables are defined on the space \(P\) unless specified otherwise.
A boolean set is \(\mathcal{B} = \left\{ \operatorname{false}, \operatorname{true}\right\} \).
The expectation of a random variable \(\tilde{x} \colon \Omega \to \mathbb {R}\) is
An indicator function \(\mathbb {I}\colon \mathcal{B} \to \left\{ 0, 1 \right\} \) is defined for \(b\in \mathcal{B}\) as
The probability of \(\tilde{b}\colon \Omega \to \mathcal{B}\) is defined as
The conditional expectation of \(\tilde{x} \colon \Omega \to \mathbb {R}\) conditioned on \(\tilde{b} \colon \Omega \to \mathcal{B}\) is defined as
where we define that \(x / 0 = 0\) for each \(x\in \mathbb {R}\).
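Collected in one place, the expectation, indicator, probability, and conditional expectation above take the standard finite-sample-space form (a reconstruction; the Lean file is authoritative):

```latex
\mathbb{E}[\tilde{x}] := \sum_{\omega \in \Omega} p(\omega)\, \tilde{x}(\omega),
\qquad
\mathbb{I}(b) := \begin{cases} 1 & \text{if } b = \operatorname{true}, \\ 0 & \text{if } b = \operatorname{false}, \end{cases}
\qquad
\mathbb{P}[\tilde{b}] := \mathbb{E}[\mathbb{I} \circ \tilde{b}],
\qquad
\mathbb{E}[\tilde{x} \mid \tilde{b}] := \frac{\mathbb{E}[\tilde{x} \cdot (\mathbb{I} \circ \tilde{b})]}{\mathbb{P}[\tilde{b}]}.
```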
The conditional probability of \(\tilde{b}\colon \Omega \to \mathcal{B}\) on \(\tilde{c}\colon \Omega \to \mathcal{B}\) is defined as
It is common to prohibit conditioning on a zero probability event both for expectation and probabilities. In this document, we follow the Lean convention, where the division by \(0\) is \(0\); see div_zero. However, even some basic probability and expectation results may require that we assume that the conditioned event does not have probability zero for it to hold.
The random conditional expectation of a random variable \(\tilde{x} \colon \Omega \to \mathbb {R}\) conditioned on \(\tilde{y} \colon \Omega \to \mathcal{Y}\) for a finite set \(\mathcal{Y}\) is the random variable \(\mathbb {E}\left[ \tilde{x} \mid \tilde{y} \right]\colon \Omega \to \mathbb {R}\) defined as
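Concretely, one standard pointwise form of this random variable is:

```latex
\mathbb{E}[\tilde{x} \mid \tilde{y}](\omega) := \mathbb{E}\bigl[\tilde{x} \mid \tilde{y} = \tilde{y}(\omega)\bigr],
\qquad \omega \in \Omega,
```

where \(\tilde{y} = \tilde{y}(\omega )\) denotes the boolean random variable \(\omega ' \mapsto (\tilde{y}(\omega ') = \tilde{y}(\omega ))\).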
The Lean file defines expectations more broadly for a data type \(\rho \), which is more general than just \(\mathbb {R}\). The main reason is to support both \(\mathbb {R}\) and \(\mathbb {R}_+\). However, in principle, the definitions could be used to reason about expectations beyond real numbers, including other algebras such as vectors or matrices.
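A minimal Lean sketch of such a generalized expectation; the names here are illustrative and need not match the accompanying Lean file:

```lean
import Mathlib

variable {Ω ρ : Type*} [Fintype Ω]

/-- Expectation of `x` under the weights `p`, stated for any type `ρ` with
addition and real scalar multiplication, so it applies to `ℝ` as well as to
vectors or matrices. -/
noncomputable def expect [AddCommMonoid ρ] [SMul ℝ ρ]
    (p : Ω → ℝ) (x : Ω → ρ) : ρ :=
  ∑ ω, p ω • x ω
```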
1.2 Basic Properties
Suppose that \(\tilde{b}, \tilde{c} \colon \Omega \to \mathcal{B}\). Then:
where the equality applies for all \(\omega \in \Omega \).
Suppose that \(\tilde{c} \colon \Omega \to \mathcal{B}\) such that \(\mathbb {P}\left[ \tilde{c} \right] = 0\). Then for any \(\tilde{x} \colon \Omega \to \mathbb {R}\):
Immediate from the definition and the fact that \(0 \cdot x = 0\) for \(x\in \mathbb {R}\).
Suppose that \(\tilde{c} \colon \Omega \to \mathcal{B}\) such that \(\mathbb {P}\left[ \tilde{c} \right] = 0\). Then for any \(\tilde{b} \colon \Omega \to \mathcal{B}\):
Immediate from theorem 1.2.2.
Suppose that \(\tilde{b}, \tilde{c} \colon \Omega \to \mathcal{B}\), then
The property holds immediately when \(\mathbb {P}\left[ \tilde{c} \right] = 0\). Assume that \(\mathbb {P}\left[ \tilde{c} \right] > 0\). Then:
Let \(\tilde{y}\colon \Omega \to \mathcal{Y}\) with a finite \(\mathcal{Y}\). Then
Theorem 1.2.12 shows the equivalence of expectations for surely equal random variables.
Random variables \(\tilde{x}, \tilde{y} \colon \Omega \to \mathbb {R}\) satisfy that
From the distributive property of sums.
A random variable \(\tilde{x}\colon \Omega \to \mathbb {R}\) and \(c\in \mathbb {R}\) satisfy that
Suppose that \(\tilde{x} \colon \Omega \to \mathbb {R}\) and \(c\in \mathbb {R}\). Then
From ??.
Suppose that \(\tilde{x}, \tilde{y} \colon \Omega \to \mathbb {R}\) and \(\tilde{z}\colon \Omega \to \mathcal{V}\) are random variables and \(c\in \mathbb {R}\), such that \(\tilde{y}(\omega ) = c + \tilde{x}(\omega )\). Then
From theorem 1.2.9.
Suppose that \(\tilde{x}, \tilde{y} \colon \Omega \to \mathbb {R}\) satisfy that
Then
Suppose that \(\tilde{x}, \tilde{z} \colon \Omega \to \mathbb {R}\) satisfy that
Then
Immediately from the congruence of sums.
1.4 The Laws of the Unconscious Statistician
Let \(\tilde{x} \colon \Omega \to \mathbb {R}\) be a random variable. Then:
Let \(\mathcal{X} := \tilde{x}(\Omega )\), which is a finite set. Then:
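Written out, the claim and the proof step are as follows (a standard reconstruction, where \(\mathbb {P}[\tilde{x} = \alpha ]\) is the probability of the event \(\omega \mapsto (\tilde{x}(\omega ) = \alpha )\)):

```latex
\mathbb{E}[\tilde{x}]
= \sum_{\omega \in \Omega} p(\omega)\, \tilde{x}(\omega)
= \sum_{\alpha \in \mathcal{X}} \; \sum_{\substack{\omega \in \Omega \\ \tilde{x}(\omega) = \alpha}} p(\omega)\, \alpha
= \sum_{\alpha \in \mathcal{X}} \alpha\, \mathbb{P}[\tilde{x} = \alpha].
```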
The following theorem generalizes the theorem above.
Let \(\tilde{x} \colon \Omega \to \mathbb {R}\) and \(\tilde{y} \colon \Omega \to \mathcal{Y}\) be random variables. Then:
Let \(\tilde{x} \colon \Omega \to \mathbb {R}\) and \(\tilde{y} \colon \Omega \to \mathcal{Y}\) be random variables with \(\mathcal{Y}\) finite. Then:
1.4 Total Expectation and Probability
Let \(\tilde{b} \colon \Omega \to \mathcal{B}\) and \(\tilde{y} \colon \Omega \to \mathcal{Y}\) be random variables with a finite set \(\mathcal{Y}\). Then:
Let \(\tilde{x} \colon \Omega \to \mathcal{X}\) and \(\tilde{y} \colon \Omega \to \mathcal{Y}\) be random variables with a finite set \(\mathcal{Y}\). Then:
Recall that we allow division by 0 and assume that \(x / 0 = 0\).
Above, we used the fact that
which follows by analyzing two cases. First, when \(p(\omega ') = 0\), the equality holds immediately. If \(p(\omega ') > 0\), then by lemma 1.2.5 we have \(\mathbb {P}\left[ \tilde{y} = \tilde{y}(\omega ')\right] > 0\), and we get from definition 1.1.9 that
which completes the step.
2 Formal Decision Framework
2.1 Markov Decision Process
A Markov decision process \(M := (\mathcal{S}, \mathcal{A}, p, r)\) consists of a finite nonempty set of states \(\mathcal{S}\), a finite nonempty set of actions \(\mathcal{A}\), a transition function \(p\colon \mathcal{S} \times \mathcal{A} \to \Delta (\mathcal{S})\), and a reward function \(r \colon \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to \mathbb {R}\).
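A minimal Lean sketch of this structure (field names are illustrative; the accompanying Lean file may differ):

```lean
import Mathlib

/-- A finite MDP. `P s a s'` is the probability of reaching `s'` after taking
action `a` in state `s`, and `r s a s'` is the corresponding reward. -/
structure MDP (S A : Type*) [Fintype S] [Fintype A] where
  P : S → A → S → ℝ
  r : S → A → S → ℝ
  P_nonneg : ∀ s a s', 0 ≤ P s a s'
  P_sum_one : ∀ s a, ∑ s', P s a s' = 1
```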
2.2 Histories
We implicitly assume an MDP \(M = (\mathcal{S}, \mathcal{A}, p, r)\) throughout the remainder of the section.
A history \(h\) in a set of histories \(\mathcal{H}\) is a sequence of states and actions defined for \(M\) recursively as
where \(s \in \mathcal{S}\), \(a\in \mathcal{A}\), and \(h'\in \mathcal{H}\).
The length \(l\colon \mathcal{H} \to \mathbb {N}\) of a history \(h\) is defined as
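The recursive definitions of histories and their length can be sketched in Lean as follows (illustrative names):

```lean
import Mathlib

/-- A history is a single initial state, or a shorter history extended by an
action and a successor state. -/
inductive Hist (S A : Type*)
  | init : S → Hist S A
  | ext  : Hist S A → A → S → Hist S A

/-- Length of a history: the number of action-state extensions, so a bare
initial state has length `0`. -/
def Hist.length {S A : Type*} : Hist S A → ℕ
  | .init _ => 0
  | .ext h' _ _ => h'.length + 1
```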
The set \(\mathcal{H}_{\mathrm{NE}}\) of non-empty histories is
Following histories \(\mathcal{H}(h,t) \subseteq \mathcal{H}\) for \(h\in \mathcal{H}\) of length \(t\in \mathbb {N}\) are defined recursively as
The set of histories \(\mathcal{H}_{t}\) of length \(t \in \mathbb {N}\) is defined recursively as
For \(h \in \mathcal{H}\):
The theorem follows by induction on \(t\) from the definition.
We use \(\tilde{s}_k\colon \mathcal{H} \to \mathcal{S}\) to denote the 0-based \(k\)-th state of each history.
We use \(\tilde{a}_k\colon \mathcal{H} \to \mathcal{A}\) to denote the 0-based \(k\)-th action of each history.
The history-reward random variable \(\tilde{r}^{\mathrm{h}}\colon \mathcal{H} \to \mathbb {R}\) is defined recursively for \(h = \langle h', a, s \rangle \in \mathcal{H}\), where \(h'\in \mathcal{H}\), \(a\in \mathcal{A}\), and \(s\in \mathcal{S}\), as
The history-reward random variable \(\tilde{r}^{\mathrm{h}}_k \colon \mathcal{H} \to \mathbb {R}\), for \(h = \langle h', a, s \rangle \in \mathcal{H}\) with \(h'\in \mathcal{H}\), \(a\in \mathcal{A}\), and \(s\in \mathcal{S}\), is defined as the \(k\)-th reward (0-based) of a history.
The history-reward random variable \(\tilde{r}^{\mathrm{h}}_{\le k} \colon \mathcal{H} \to \mathbb {R}\), for \(h = \langle h', a, s \rangle \in \mathcal{H}\) with \(h'\in \mathcal{H}\), \(a\in \mathcal{A}\), and \(s\in \mathcal{S}\), is defined as the sum of the \(k\)-th and all earlier rewards (0-based) of a history.
The history-reward random variable \(\tilde{r}^{\mathrm{h}}_{\ge k}\colon \mathcal{H} \to \mathbb {R}\), for \(h = \langle h', a, s \rangle \in \mathcal{H}\) with \(h'\in \mathcal{H}\), \(a\in \mathcal{A}\), and \(s\in \mathcal{S}\), is defined as the sum of the \(k\)-th and all later rewards (0-based) of a history.
2.3 Policies
The set of decision rules \(\mathcal{D}\) is defined as \(\mathcal{D} := \mathcal{A}^{\mathcal{S}}. \) A single action \(a \in \mathcal{A}\) can also be interpreted as the decision rule \(d := s \mapsto a\).
The set of history-dependent policies is \(\Pi _{\mathrm{HR}}:= \Delta (\mathcal{A})^{\mathcal{H}}. \)
The set of Markov deterministic policies \(\Pi _{\mathrm{MD}}\) is \(\Pi _{\mathrm{MD}}:= \mathcal{D}^{\mathbb {N}}. \) A Markov deterministic policy \(\pi \in \Pi _{\mathrm{MD}}\) can also be interpreted as \(\bar{\pi } \in \Pi _{\mathrm{HR}}\):
where \(\delta \) is the Dirac distribution, \(l\) is the length defined in definition 2.2.2, and \(s_{\mathrm{l}}\) is the history’s last state.
The set of stationary deterministic policies \(\Pi _{\mathrm{SD}}\) is defined as \(\Pi _{\mathrm{SD}} := \mathcal{D}. \) A stationary policy \(\pi \in \Pi _{\mathrm{SD}}\) can be interpreted as \(\bar{\pi } \in \Pi _{\mathrm{HR}}\):
where \(\delta \) is the Dirac distribution and \(s_{\mathrm{l}}\) is the history’s last state.
2.4 Distribution
The history probability distribution \(p^{\mathrm{h}}_T \colon \Pi _{\mathrm{HR}}\to \Delta (\mathcal{H}(\hat{h},t))\) is defined for each \(T \in \mathbb {N}\), \(\pi \in \Pi _{\mathrm{HR}}\), and \(h\in \mathcal{H}(\hat{h},t)\) as
Moreover, the function \(p^{\mathrm{h}}\) maps each policy to a valid probability distribution.
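One standard recursive form of this distribution, consistent with the definitions above (the exact form in the Lean file may differ):

```latex
p^{\mathrm{h}}_T(\pi)(h) =
\begin{cases}
1 & \text{if } h = \hat{h}, \\
p^{\mathrm{h}}_{T-1}(\pi)(h') \cdot \pi(h')(a) \cdot p(s_{\mathrm{l}}(h'), a)(s)
  & \text{if } h = \langle h', a, s \rangle,
\end{cases}
```

where \(s_{\mathrm{l}}(h')\) is the last state of \(h'\).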
The history-dependent expectation is defined for each \(t\in \mathbb {N}\), \(\pi \in \Pi _{\mathrm{HR}}\), \(\hat{h}\in \mathcal{H}\) and a \(\tilde{x}\colon \mathcal{H} \to \mathbb {R}\) as
In the \(\mathbb {E}\) operator above, the random variable \(\tilde{x}\) lives in a probability space \((\Omega , p)\) where \(\Omega = \mathcal{H}(\hat{h}, t)\) and \(p(h) = p^{\mathrm{h}}(h, \pi ), \forall h\in \Omega \). Moreover, if \(\hat{h}\) is a state, then it is interpreted as a history with the single initial state.
The history-dependent expectation is defined for each \(t\in \mathbb {N}\), \(\pi \in \Pi _{\mathrm{HR}}\), \(\hat{h}\in \mathcal{H}\), \(\tilde{x}\colon \mathcal{H} \to \mathbb {R}\), and \(\tilde{b}\colon \mathcal{H} \to \mathcal{B}\) as
In the \(\mathbb {E}\) operator above, the random variables \(\tilde{x}\) and \(\tilde{b}\) live in a probability space \((\Omega , p)\) where \(\Omega = \mathcal{H}(\hat{h}, t)\) and \(p(h) = p^{\mathrm{h}}(h, \pi ), \forall h\in \Omega \). Moreover, if \(\hat{h}\) is a state, then it is interpreted as a history with the single initial state.
The history-dependent expectation is defined for each \(t\in \mathbb {N}\), \(\pi \in \Pi _{\mathrm{HR}}\), \(\hat{h}\in \mathcal{H}\), \(\tilde{x}\colon \mathcal{H} \to \mathbb {R}\), and \(\tilde{y}\colon \mathcal{H} \to \mathcal{V}\) as
In the \(\mathbb {E}\) operator above, the random variables \(\tilde{x}\) and \(\tilde{y}\) live in a probability space \((\Omega , p)\) where \(\Omega = \mathcal{H}(\hat{h}, t)\) and \(p(h) = p^{\mathrm{h}}(h, \pi ), \forall h\in \Omega \). Moreover, if \(\hat{h}\) is a state, then it is interpreted as a history with the single initial state.
2.5 Basic Properties
Assume \(\tilde{x} \colon \mathcal{H} \to \mathbb {R}\) and \(c \in \mathbb {R}\). Then \(\forall h\in \mathcal{H}, \pi \in \Pi _{\mathrm{HR}}, t \in \mathbb {N}\):
Directly from theorem 1.2.12.
Suppose that \(\tilde{x}, \tilde{y} \colon \mathcal{H} \to \mathbb {R}\). Then \(\forall h\in \mathcal{H}, \pi \in \Pi _{\mathrm{HR}}, t \in \mathbb {N}\):
From theorem 1.2.7.
Suppose that \(c\in \mathbb {R}\). Then \(\forall h\in \mathcal{H}, \pi \in \Pi _{\mathrm{HR}}, t \in \mathbb {N}\):
From theorem 1.2.8.
Suppose that \(\tilde{x}, \tilde{y} \colon \mathcal{H} \to \mathbb {R}\) satisfy that \(\tilde{x}(h) = \tilde{y}(h), \forall h \in \mathcal{H}\). Then \(\forall h\in \mathcal{H}, \pi \in \Pi _{\mathrm{HR}}, t \in \mathbb {N}\):
From theorem 1.2.9.
For each \(\hat{h}\in \mathcal{H}\), \(\pi \in \Pi _{\mathrm{HR}}\), and \(t\in \mathbb {N}\):
where \(\tilde{\operatorname{id}}\) is the identity function, \(l_{\mathrm{h}}\) is the (0-based) length of a history, and \(\tilde{s}_k\colon \mathcal{H} \to \mathcal{S}\) and \(\tilde{a}_k\colon \mathcal{H} \to \mathcal{A}\) are the 0-based \(k\)-th state and action, respectively, of each history.
Follows from theorem 2.5.1 and the equality of the reward function \(\tilde{r}^{\mathrm{h}}\) and the sum in the expectation.
For each \(h\in \mathcal{H}\), \(\pi \in \Pi _{\mathrm{HR}}\), and \(t\in \mathbb {N}\):
where \(k_0:=l_{\mathrm{h}}(h)\).
Follows from theorem 2.5.4.
For each \(\hat{h}\in \mathcal{H}\), \(\pi \in \Pi _{\mathrm{HR}}\), \(t\in \mathbb {N}\), \(h\in \mathcal{H}\):
where \(k_0:=l_{\mathrm{h}}(\hat{h})\).
From theorem 1.2.8.
Assume \(h \in \mathcal{H}\) and \(f \colon \mathcal{H} \to \mathbb {R}\), where \(s_0 := s_{\mathrm{l}}(h)\).
Then
Directly from the definition of the expectation.
2.6 Total Expectation
For each \(h\in \mathcal{H}\), \(\pi \in \Pi _{\mathrm{HR}}\), \(t\in \mathbb {N}\), \(\tilde{x}\colon \mathcal{H} \to \mathbb {R}\) and \(\tilde{y}\colon \mathcal{H} \to \mathcal{V}\):
From theorem 1.4.2.
Suppose that the random variable \(\tilde{x}\colon \mathcal{H} \to \mathbb {R}\) satisfies for some \(k, t \in \mathbb {N}\), with \(k \le t\), that
where \(h_{\le k}\) is the prefix of \(h\) of length \(k\). Then for each \(h\in \mathcal{H}, \pi \in \Pi _{\mathrm{HR}}\):
2.7 Conditional Properties
For each \(h\in \mathcal{H}\), \(k_0 := l_{\mathrm{h}}(h)\), \(\pi \in \Pi _{\mathrm{HR}}\), \(t\in \mathbb {N}\), \(\tilde{x}\colon \mathcal{H} \to \mathbb {R}\), \(s\in \mathcal{S}\), \(a\in \mathcal{A}\):
This should follow by algebraic manipulation.
3 Dynamic Program: History-Dependent Finite Horizon
In this section, we derive dynamic programming equations for histories. We assume an MDP \(M = (\mathcal{S}, \mathcal{A}, p, r)\) throughout this section.
The main idea of the proof is to:
Derive (exponential-size) dynamic programming equations for the history-dependent value function of history-dependent policies
Define the value function
Define an optimal value function
Show that value functions decompose to equivalence classes
Show that the value function for the equivalence classes can be computed efficiently
3.1 Definitions
A finite horizon objective definition is given by \(O := (s_0, T)\) where \(s_0\in \mathcal{S}\) is the initial state and \(T\in \mathbb {N}\) is the horizon.
In the remainder of the section, we assume an objective \(O = (s_0, T)\).
The finite horizon objective function for an objective \(O\) and \(\pi \in \Pi _{\mathrm{HR}}\) is defined as
A policy \(\pi ^{\star }\in \Pi _{\mathrm{HR}}\) is return optimal for an objective \(O\) if
The set of history-dependent value functions \(\mathcal{U}\) is defined as
A history-dependent policy value function \(\hat{u}_t\colon \mathcal{H} \times \Pi _{\mathrm{HR}}\to \mathbb {R}\) for each \(h\in \mathcal{H}\), \(\pi \in \Pi _{\mathrm{HR}}\), and \(t\in \mathbb {N}\) is defined as
where \(k_0 := l_{\mathrm{h}}(h)\). Note that this definition includes also the rewards accrued in the history \(h\).
The optimal history-dependent value function \(\hat{u}_t^{\star }\colon \mathcal{H} \to \mathbb {R}\) is defined for a horizon \(t\in \mathbb {N}\) as
[It is not clear how to define this in Lean because the set of policies is not finite and it is not clear how to define the supremum or maximum.]
The following definition is another way of defining an optimal policy. Note that defining this value does not imply that such a policy exists.
An optimal history-dependent policy \(\pi ^{\star }\in \Pi _{\mathrm{HR}}\) is a policy that satisfies
A policy \(\pi ^{\star }\in \Pi _{\mathrm{HR}}\) that is optimal in the sense of definition 3.1.7 is also return optimal for any initial state \(s_0\) and horizon \(T\).
3.2 History-dependent Dynamic Program
The following definitions of history-dependent value functions use a dynamic program formulation.
The history-dependent policy Bellman operator \(L_{\mathrm{h}}^{\pi } \colon \mathcal{U} \to \mathcal{U}\) is defined for \(\pi \in \Pi _{\mathrm{HR}}\) as
where the value function \(\tilde{u}\) is interpreted as a random variable defined on the sample space \(\Omega = \mathcal{H}\).
The history-dependent optimal Bellman operator \(L_{\mathrm{h}}^{\star }\colon \mathcal{U} \to \mathcal{U}\) is defined as
where the value function \(\tilde{u}\) is interpreted as a random variable defined on the sample space \(\Omega = \mathcal{H}\).
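In a standard formulation consistent with these definitions, the two operators act on \(u \in \mathcal{U}\) as follows (the exact form in the Lean file may differ):

```latex
(L_{\mathrm{h}}^{\pi} u)(h)
= \sum_{a \in \mathcal{A}} \pi(h)(a) \sum_{s' \in \mathcal{S}}
  p(s_{\mathrm{l}}(h), a)(s') \bigl( r(s_{\mathrm{l}}(h), a, s') + u(\langle h, a, s' \rangle) \bigr),
\qquad
(L_{\mathrm{h}}^{\star} u)(h)
= \max_{a \in \mathcal{A}} \sum_{s' \in \mathcal{S}}
  p(s_{\mathrm{l}}(h), a)(s') \bigl( r(s_{\mathrm{l}}(h), a, s') + u(\langle h, a, s' \rangle) \bigr).
```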
The history-dependent DP value function \(u_t^{\pi } \in \mathcal{U}\) for a policy \(\pi \in \Pi _{\mathrm{HR}}\) and \(t\in \mathbb {N}\) is defined as
The history-dependent DP value function \(u_t^{\star }\in \mathcal{U}\) for \(t\in \mathbb {N}\) is defined as
Suppose that \(u^1, u^2 \in \mathcal{U}\) satisfy that \(u^1(h) \ge u^2(h), \forall h\in \mathcal{H}\). Then
From theorem 1.2.11.
The following theorem shows that the history-dependent value function can be computed by the dynamic program; it is akin to (Puterman 2005, theorem 4.2.1).
For each \(\pi \in \Pi _{\mathrm{HR}}\) and \(t\in \mathbb {N}\):
By induction on \(t\). The base case for \(t=0\) follows from the definition. The inductive case for \(t + 1\) follows for each \(h\in \mathcal{H}\) when \(l_{\mathrm{h}}(h) = k_0\) as
The following theorem is akin to (Puterman 2005, theorem 4.3.2).
For each \(t\in \mathbb {N}\):
By induction on \(t\). The base case is immediate. The inductive case follows for \(t+1\) as follows. For each \(\pi \in \Pi _{\mathrm{HR}}\):
4 Expected Dynamic Program: Markov Policy
4.1 Optimality
We discuss results needed to prove the optimality of Markov policies.
The set of independent value functions is defined as \(\mathcal{V} := \mathbb {R}^{\mathcal{S}}\).
A Markov Bellman operator \(L^{\star }\colon \mathcal{V} \to \mathcal{V}\) is defined as
The optimal value function \(v^{\star }_t\in \mathcal{V}, t\in \mathbb {N}\) is defined as
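In standard form, the operator and the resulting value iteration read (a reconstruction; the Lean file is authoritative):

```latex
(L^{\star} v)(s) = \max_{a \in \mathcal{A}} \sum_{s' \in \mathcal{S}}
  p(s, a)(s') \bigl( r(s, a, s') + v(s') \bigr),
\qquad
v^{\star}_0 := 0, \quad v^{\star}_{t+1} := L^{\star} v^{\star}_t.
```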
Suppose that \(t\in \mathbb {N}\). Then:
By induction on \(t\). The base case follows immediately from the definition. The inductive step for \(t+1\) follows as:
The optimal finite-horizon policy \(\pi ^{\star }_t, t\in \mathbb {N}\) is defined as
where \(a_0\) is an arbitrary action.
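One plausible reading, consistent with \(v^{\star }\) above; the time indexing \(t - k - 1\) is an assumption, not taken from the source:

```latex
\pi^{\star}_t(k, s) \in \operatorname*{arg\,max}_{a \in \mathcal{A}}
  \sum_{s' \in \mathcal{S}} p(s, a)(s') \bigl( r(s, a, s') + v^{\star}_{t-k-1}(s') \bigr)
  \quad \text{for } k < t,
\qquad
\pi^{\star}_t(k, s) := a_0 \quad \text{for } k \ge t.
```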
Assume a horizon \(T\in \mathbb {N}\). Then:
where \(k := l_{\mathrm{h}}(h)\).
Fix some \(T \in \mathbb {N}\). By induction on \(k\) from \(k = T\) to \(k=0\). The base case is immediate from the definition. We prove the inductive case for \(k-1\) from \(k\) as
Here, \(k := l_{\mathrm{h}}(h)\), \(a^{\star }:= \pi ^{\star }_T(k, s_0)\), and \(s_0 := s_{\mathrm{l}}(h)\).
4.2 Evaluation
We discuss results pertinent to the evaluation of Markov policies.
Markov value functions depend on the length of the history.
The set of Markov value functions is defined as \(\mathcal{V}_{\mathrm{M}} := \mathbb {R}^{\mathbb {N}\times \mathcal{S}}\).
A Markov policy Bellman operator \(L^{\pi }\colon \mathcal{V}_{\mathrm{M}} \to \mathcal{V}_{\mathrm{M}}\) for \(\pi \in \Pi _{\mathrm{MD}}\) is defined as
The Markov policy value function \(v^{\pi }_t\in \mathcal{V}_{\mathrm{M}}, t \in \mathbb {N}\) for \(\pi \in \Pi _{\mathrm{MD}}\) is defined as
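In standard form, with the first index tracking the length of the history so far (an assumption consistent with section 4.2):

```latex
(L^{\pi} v)(k, s) = \sum_{s' \in \mathcal{S}}
  p(s, \pi_k(s))(s') \bigl( r(s, \pi_k(s), s') + v(k+1, s') \bigr),
\qquad
v^{\pi}_0 := 0, \quad v^{\pi}_{t+1} := L^{\pi} v^{\pi}_t.
```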
Bibliography
- M. L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. Wiley-Interscience, 2005.