Equations
- I.argmax' H f = match List.argmax f I.attach.toList, ⋯ with | some x, x_1 => x
Instances For
Objective function
Equations
- MDPs.objective_fh O π = MDPs.expect_h (MDPs.Hist.init ↑O.s₀) π O.T MDPs.reward
Instances For
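In standard MDP notation, and reading `MDPs.expect_h h π T r` as the expected cumulative reward `r` over `T` steps of policy `π` starting from history `h` (an assumption about the library's semantics, not stated here), the objective unfolds to:

```latex
\rho(\pi) \;=\; \mathbb{E}^{\pi}\!\left[\,\sum_{t=0}^{T-1} r(h_t) \;\middle|\; h_0 = (s_0)\,\right]
```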
An optimal policy πopt
Equations
- MDPs.Optimal_fh O πopt = ∀ (π : MDPs.PolicyHR M), MDPs.objective_fh O πopt ≥ MDPs.objective_fh O π
Instances For
Value function type for history-dependent value functions
Equations
- MDPs.ValuesH M = (MDPs.Hist M → ℝ)
Instances For
History-dependent value function
Equations
- MDPs.hvalue_π π Nat.zero = fun (x : MDPs.Hist M) => 0
- MDPs.hvalue_π π t.succ = fun (h : MDPs.Hist M) => MDPs.expect_h h π t.succ (MDPs.reward_from h.length)
Instances For
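A hedged reading of these two equations: with zero steps to go the value is `0`, and with `t+1` steps to go it is the expected reward-to-go under `π`, where `reward_from h.length` indicates that rewards are counted starting at the current history's length:

```latex
v^{\pi}_{0}(h) = 0,
\qquad
v^{\pi}_{t+1}(h) \;=\; \mathbb{E}^{\pi}\!\left[\,\sum_{k=0}^{t} r(h_{k}) \;\middle|\; h_{0} = h\,\right]
```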
A value-optimal policy πopt
Equations
- MDPs.OptimalVF_fh t πopt = ∀ (π : MDPs.PolicyHR M) (h : MDPs.Hist M), MDPs.hvalue_π πopt t h ≥ MDPs.hvalue_π π t h
Instances For
Bellman operator on history-dependent value functions
Equations
- MDPs.DPhπ π v x✝ = MDPs.expect_h x✝ π 1 fun (h' : MDPs.Hist M) => MDPs.reward h' + v h'
Instances For
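In operator notation this is the usual one-step Bellman backup: the expectation ranges over one-step extensions `h'` of the history `h` under `π` (a sketch of the intended semantics, assuming `expect_h h π 1` is a one-step expectation):

```latex
(T^{\pi} v)(h) \;=\; \mathbb{E}^{\pi}_{\,h' \,\sim\, h}\bigl[\, r(h') + v(h') \,\bigr]
```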
Optimal Bellman operator on history-dependent value functions
Equations
- One or more equations did not get rendered due to their size.
Instances For
Dynamic program value function, finite-horizon history dependent
Equations
- MDPs.u_dp_π π Nat.zero = fun (x : MDPs.Hist M) => 0
- MDPs.u_dp_π π t.succ = MDPs.DPhπ π (MDPs.u_dp_π π t)
Instances For
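Writing `T^π` for the operator `MDPs.DPhπ π`, the two equations are the standard backward-induction recursion for policy evaluation:

```latex
u^{\pi}_{0} = 0,
\qquad
u^{\pi}_{t+1} = T^{\pi}\, u^{\pi}_{t}
```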
Optimal dynamic program value function, finite-horizon history dependent
Equations
- MDPs.u_dp_opt Nat.zero = fun (x : MDPs.Hist M) => 0
- MDPs.u_dp_opt t.succ = MDPs.DPhopt (MDPs.u_dp_opt t)
Instances For
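Writing `T*` for the optimal operator `MDPs.DPhopt`, this is the same recursion with the maximizing backup in place of the policy backup:

```latex
u^{*}_{0} = 0,
\qquad
u^{*}_{t+1} = T^{*}\, u^{*}_{t}
```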
A deterministic Markov policy. It depends on the time step but does not depend on the horizon.
Equations
- MDPs.PolicyMD M = (ℕ → MDPs.DecisionRule M)
Instances For
Coerces a deterministic Markov policy to a history-dependent randomized policy: the decision rule is applied to the length and last state of the history, and the chosen action is lifted to a Dirac (point-mass) distribution.
Equations
- MDPs.instCoePolicyMDPolicyHROfDecidableEq = { coe := fun (d : MDPs.PolicyMD M) (h : MDPs.Hist M) => Finprob.dirac_dist M.A (d h.length h.last) }
History-independent value function. Note that the optimal value function is history-independent, while the value function of a Markov policy depends on the time step.
Equations
- MDPs.Values x✝ = (σ → ℝ)
Instances For
Markov q function
Equations
- MDPs.q_of_v s a v = MDPs.expect_h (MDPs.Hist.init s) (fun (x : MDPs.Hist M) => Finprob.dirac_dist M.A a) 1 fun (h : MDPs.Hist M) => MDPs.reward h + v h.last
Instances For
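A hedged reading: the q-value of `(s, a)` with respect to a state value function `v` is a one-step expectation from the singleton history at `s` under the Dirac policy at `a`, with `v` evaluated at the resulting last state:

```latex
q(s, a; v) \;=\; \mathbb{E}\bigl[\, r(s, a, s') + v(s') \,\bigr]
```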
Optimal Bellman operator on state-dependent value functions
Equations
- MDPs.DPMopt v x✝ = M.A.attach.sup' ⋯ fun (a : { x : α // x ∈ M.A }) => MDPs.q_of_v x✝ a v
Instances For
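Since `M.A.attach.sup'` takes a supremum over the (nonempty, finite) action set, this is the familiar optimal Bellman backup:

```latex
(T^{*} v)(s) \;=\; \max_{a \in \mathcal{A}} \; q(s, a; v)
```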
Optimal value function
Equations
- MDPs.v_dp_opt Nat.zero = fun (x : σ) => 0
- MDPs.v_dp_opt t.succ = MDPs.DPMopt (MDPs.v_dp_opt t)
Instances For
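Combining the two equations with the definition of `MDPs.DPMopt` gives finite-horizon value iteration, sketched below:

```latex
v^{*}_{0}(s) = 0,
\qquad
v^{*}_{t+1}(s) \;=\; \max_{a \in \mathcal{A}} \; q\bigl(s, a; v^{*}_{t}\bigr)
```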
The Markov dynamic program is optimal
Optimal policy for horizon t
Equations
- MDPs.πopt t x✝¹ x✝ = if t ≥ x✝¹ then M.A.argmax' ⋯ fun (a : { x : α // x ∈ M.A }) => MDPs.q_of_v x✝ a (MDPs.v_dp_opt (t - x✝¹)) else Classical.indefiniteDescription (fun (a : α) => a ∈ M.A) ⋯
Instances For
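A hedged reading of the two branches: at time step `k ≤ t` the policy acts greedily with respect to the optimal value function for the remaining `t - k` steps, and past the horizon (`k > t`) it returns an arbitrary action from `M.A` (via `Classical.indefiniteDescription`):

```latex
\pi^{*}_{t}(k, s) \;\in\; \operatorname*{arg\,max}_{a \in \mathcal{A}} \; q\bigl(s, a; v^{*}_{t-k}\bigr)
\qquad (k \le t)
```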
The greedy policy with respect to the optimal value function is an optimal policy
Value function type for time-dependent (Markov) value functions
Equations
- MDPs.ValuesM x✝ = (ℕ → σ → ℝ)
Instances For
Bellman operator for a Markov policy on state-dependent value functions. Also includes the prior history's reward.
Equations
- MDPs.DPMπ π v x✝¹ x✝ = MDPs.expect_h (MDPs.Hist.init x✝) (fun (h : MDPs.Hist M) => Finprob.dirac_dist M.A (π h.length h.last)) 1 fun (h : MDPs.Hist M) => MDPs.reward h + v (x✝¹ + 1) h.last
Instances For
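In standard notation, with the action fixed to the one chosen by the deterministic Markov policy at the current time step (a sketch under the same one-step-expectation reading of `expect_h` as above):

```latex
(T^{\pi} v)(t, s) \;=\; \mathbb{E}\bigl[\, r\bigl(s, \pi(t, s), s'\bigr) + v(t+1, s') \,\bigr]
```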
Value function of a Markov policy: maps a horizon to the corresponding value function.
Equations
- MDPs.v_dp_π π Nat.zero = fun (x : ℕ) => 0
- MDPs.v_dp_π π t.succ = MDPs.DPMπ π (MDPs.v_dp_π π t)
Instances For