 
  
  
   
Dynamic programming is a method for valuing American style options and other financial instruments that allow the holder to make decisions that affect the ultimate payout.  The idea is to define the appropriate value function, f(x,t), that satisfies a nonlinear version of the backwards evolution equation (7).  I will explain the idea in a simple but somewhat abstract situation.  As in the previous section, it is possible to use these ideas to treat other related problems.
We have a Markov chain as before, but now the transition probabilities
depend on a ``control parameter'', $\xi$.  That is
\[
P(x, x', \xi) = \Pr\bigl( X(t+1) = x' \mid X(t) = x ,\ \xi \bigr) .
\]
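To make the notation concrete, here is a small sketch in Python (the three-state chain and the two control values are invented for illustration, not taken from the text): the controlled transition probabilities are stored as one row-stochastic matrix for each value of the control parameter.

\begin{verbatim}
import numpy as np

# P[xi, x, xp] = Pr( X(t+1) = xp | X(t) = x, control parameter = xi ).
# Toy example: 3 states, 2 possible control values (all numbers invented).
P = np.array([
    [[0.9, 0.1, 0.0],    # xi = 0
     [0.2, 0.6, 0.2],
     [0.0, 0.1, 0.9]],
    [[0.5, 0.5, 0.0],    # xi = 1
     [0.1, 0.3, 0.6],
     [0.0, 0.4, 0.6]],
])

# Whatever control is chosen, each row is a probability distribution over xp.
assert np.allclose(P.sum(axis=2), 1.0)
\end{verbatim}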
In the ``stochastic control problem'', we are allowed to choose the
control parameter at time t, $\xi(t)$, knowing the value of X(t)
but not any more about the future than the transition probabilities.
Because the system is a Markov chain, knowledge of earlier values,
X(t-1), X(t-2), \ldots, will not help predict or control the future.
Choosing $\xi$ as a function of X(t) and t is called ``feedback
control'' or a ``decision strategy''.  The point here is
that the optimal control policy is a feedback control.  That is,
instead of trying to choose a whole control trajectory, $\xi(t)$ for
$t = 0, 1, \ldots, T$, we instead try to choose the feedback functions
$\xi(x,t)$.  We will write $u$ for such a decision strategy.
Any given strategy has an expected payout, which we write
\[
E_u\bigl[\, V(X(T)) \,\bigr] .
\]
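To see what this expectation means operationally, the sketch below estimates $E_u[V(X(T))]$ by straightforward Monte Carlo for one particular feedback strategy.  Everything in it is invented for illustration: the chain is the same toy three-state example as above (redefined so the snippet runs on its own), and the horizon T, the payout V, and the strategy u are made up.

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(seed=1)

# Toy controlled chain P[xi, x, xp], as in the earlier sketch (numbers invented).
P = np.array([
    [[0.9, 0.1, 0.0], [0.2, 0.6, 0.2], [0.0, 0.1, 0.9]],
    [[0.5, 0.5, 0.0], [0.1, 0.3, 0.6], [0.0, 0.4, 0.6]],
])
T = 5                              # final time (invented)
V = np.array([0.0, 1.0, 3.0])      # invented payout V(x) at time T

def u(x, t):
    """An invented feedback strategy: here it depends only on the current
    state, but in general it may depend on t as well."""
    return 1 if x < 2 else 0

def expected_payout(x0, n_paths=50_000):
    """Monte Carlo estimate of E_u[ V(X(T)) ] starting from X(0) = x0."""
    states = np.arange(P.shape[1])
    total = 0.0
    for _ in range(n_paths):
        x = x0
        for t in range(T):
            xi = u(x, t)
            x = rng.choice(states, p=P[xi, x])   # one controlled transition
        total += V[x]
    return total / n_paths

print(expected_payout(x0=0))
\end{verbatim}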
Our object is to compute the value of the financial instrument under the optimal decision strategy:
\[
f = \max_u \, E_u\bigl[\, V(X(T)) \,\bigr] , \qquad (11)
\]
and the optimal strategy that achieves this.
The appropriate collection of values for this is the ``cost to go'' function
\[
f(x,t) = \max_u \, E_u\bigl[\, V(X(T)) \mid X(t) = x \,\bigr] . \qquad (12)
\]
As before, we have ``initial data'' $f(x,T) = V(x)$.  We need to
compute the values f(x,t) in terms of already computed values f(x,t+1).
For this, we suppose that the optimal decision strategy at time t is not yet
known but those at later times are already computed.  If we use control
variable $\xi$ at time t, and the optimal control thereafter, we get
payout depending on the state at time t+1:
\[
E\bigl[\, f(X(t+1), t+1) \mid X(t) = x ,\ \xi \,\bigr]
   = \sum_{x'} P(x, x', \xi)\, f(x', t+1) .
\]
Maximizing this expected payout over $\xi$ gives the optimal expected
payout at time t:
\[
f(x,t) = \max_{\xi} \sum_{x'} P(x, x', \xi)\, f(x', t+1) . \qquad (13)
\]
This is the principle of dynamic programming. We replace the ``multiperiod optimization problem'' (11) with a sequence of hopefully simpler ``single period'' optimization problems (13) for the cost to go function.
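As a minimal sketch of how the recursion (13) is used, the Python code below runs the backward sweep on the same invented toy chain as in the earlier sketches: it starts from the initial data f(x,T) = V(x), and each backward step maximizes the one-period expected value over the control, recording the maximizing control as the optimal feedback strategy.

\begin{verbatim}
import numpy as np

# Toy controlled chain and payout, as in the earlier sketches (numbers invented).
P = np.array([
    [[0.9, 0.1, 0.0], [0.2, 0.6, 0.2], [0.0, 0.1, 0.9]],
    [[0.5, 0.5, 0.0], [0.1, 0.3, 0.6], [0.0, 0.4, 0.6]],
])
T = 5
V = np.array([0.0, 1.0, 3.0])

n_controls, n_states, _ = P.shape
f = np.zeros((n_states, T + 1))               # cost to go f(x, t)
xi_opt = np.zeros((n_states, T), dtype=int)   # optimal feedback control

f[:, T] = V                      # ``initial data'' f(x, T) = V(x)
for t in range(T - 1, -1, -1):
    # One-period expected payouts: q[xi, x] = sum_xp P(x, xp, xi) * f(xp, t+1).
    q = P @ f[:, t + 1]          # shape (n_controls, n_states)
    f[:, t] = q.max(axis=0)      # equation (13): maximize over the control
    xi_opt[:, t] = q.argmax(axis=0)   # the maximizing control at (x, t)

print(f[:, 0])    # optimal expected payout from each starting state
print(xi_opt)     # the optimal decision strategy
\end{verbatim}

For an American style option, the control at each time is just the decision to exercise or to hold, so the maximization in (13) amounts to comparing the immediate exercise value with the expected value of continuing.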
 
  
 