\documentclass{article}
\usepackage{ifthen}
\begin{document}
\newcounter{OldSection}
\newcounter{ParCount}
\newcommand{\para}{
\vspace{.4cm}
\ifthenelse { \value{OldSection} < \value{section} }
{ \setcounter{OldSection}{ \value{section} }
\setcounter{ParCount}{ 0 } }
{}
\stepcounter{ParCount}
\noindent
\bf \arabic{section}.\arabic{ParCount}. \rm \hspace{.2cm}
}
\Large \begin{center}
Stochastic Calculus Notes, Lecture 1 \\
\normalsize
Last modified September 12, 2004
\end{center} \normalsize
\section{Overture}
\para Introduction:
The term {\em stochastic} means ``random''.
Because it usually occurs together with ``process'' (stochastic process),
it makes people think of something that changes in a random way
over time. The term {\em calculus} refers to ways to calculate
things or find things that can be calculated (e.g. derivatives in the
differential calculus). Stochastic calculus is the study of stochastic
processes through a collection of powerful ways to calculate things.
Whenever we have a question about the behavior of a stochastic process,
we will try to find an expected value or probability that we can calculate
that answers our question.
\para Organization:
We start in the {\em discrete} setting in which there is a finite or
countable (definitions below) set of possible outcomes. The tools are
summations and matrix multiplication. The main concepts can be displayed
clearly and concretely in this setting. We then move to continuous
processes in continuous time where things are calculated using integrals,
either ordinary integrals in $R^n$ or abstract integrals in probability
space. It is impossible (and beside the point if it were possible) to
treat these matters with full mathematical rigor in these notes. The reader
should get enough to distinguish mathematical right from wrong
in cases that occur in practical applications.
\para Backward and forward equations:
Backward equations and forward equations are perhaps the most useful
tools for getting information about stochastic processes. Roughly
speaking, there is some number, $f$, that we want to know. For example
$f$ could be the expected value of a portfolio after following a
proposed trading strategy. Rather than compute $f$ directly, we
define an array of related expected values, $f(x,t)$. The
{\em tower property} implies relationships, backward equations
or forward equations, among these
values that allow us to compute some of them in terms of others.
Proceeding from the few known values ({\em initial conditions} and
{\em boundary conditions}), we eventually find the $f$ we first
wanted. For discrete time and space, the equations are matrix
equations or recurrence relations. For continuous time and space,
they are partial differential equations of {\em diffusion} type.
\para Diffusions and Ito calculus:
The Ito calculus is a tool for studying continuous stochastic processes in
continuous time. If $X(t)$ is a differentiable function of time, then
$\Delta X = X(t+\Delta t) - X(t)$ is of the order
of\footnote{This means that there is a $C$ so that
$\left| X(t+\Delta t) - X(t)\right| \leq C\left|\Delta t\right|$ for
small $\Delta t$.} $\Delta t$. Therefore
$\Delta f(X(t)) = f(X(t+\Delta t)) - f(X(t)) \approx f^{\prime} \Delta X$
to this accuracy. For an Ito process, $\Delta X$ is of the order of
$\sqrt{\Delta t}$, so $\Delta f \approx f^{\prime} \Delta X +
\frac{1}{2} f^{\prime \prime}\Delta X^2$ has an error smaller than
$\Delta t$. In the special case where $X(t)$ is Brownian motion,
it is often permissible (and the basis of the Ito calculus) to
replace $\Delta X^2$ by its mean value, $\Delta t$.
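The heuristic that $\Delta X^2$ has mean value $\Delta t$ can be checked numerically. The following Python sketch (an illustration added here, not part of the notes' machinery) samples increments $\Delta X \sim N(0,\Delta t)$ and averages $\Delta X^2$:

```python
import random

# Illustrative Monte Carlo sketch (not from the notes): a Brownian
# increment Delta X is Normal with mean 0 and variance Delta t, so the
# sample average of (Delta X)^2 should be close to Delta t.
random.seed(0)
dt = 0.01
n = 100_000
mean_dx2 = sum(random.gauss(0.0, dt ** 0.5) ** 2 for _ in range(n)) / n
print(mean_dx2)  # close to dt = 0.01
```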
\section{Discrete probability}
Here are some basic definitions and ideas of probability. These might
seem dry without examples. Be patient. Examples are coming in later
sections. Although the topic is elementary, the notation is taken from
more advanced probability so some of it might be unfamiliar. The terminology
is not always helpful for simple problems but it is just the
thing for describing stochastic processes and decision problems under
incomplete information.
\para Probability space:
Do an ``experiment'' or ``trial'', get an ``outcome'', $\omega$. The
set of all possible outcomes is $\Omega$, which is the
{\em probability space}. The set $\Omega$ is {\em discrete}
if it is finite or countable (able to be listed in a single infinite
numbered list). The outcome $\omega$ is often called a {\em random
variable}. I avoid that term because I (and most other people) want
to call functions $X(\omega)$ random variables, see below.
\para Probability:
The probability of a specific outcome is $P(\omega)$. We always assume
that $P(\omega)\geq 0$ for any $\omega \in \Omega$ and that
$\displaystyle \sum_{\omega \in \Omega} P(\omega) = 1$. The interpretation
of probability is a matter for philosophers, but we might say that
$P(\omega)$ is the probability of outcome $\omega$ happening, or the
fraction of times event $\omega$ would happen in a large number of independent
trials. The philosophical problem is that it may be impossible actually to
perform a large number of independent trials. People also sometimes say that
probabilities represent our often subjective (lack of) knowledge of future
events. Probability 1 means something that is certain to happen while probability
0 is for something that cannot happen. ``Probability zero
$\Rightarrow$ impossible'' is only strictly true for discrete probability.
\para Event:
An {\em event} is a set of outcomes, a subset of $\Omega$. The probability
of an event is the sum of the probabilities of the outcomes that make up
the event
\begin{equation}
P(A) = \sum_{\omega \in A}P(\omega) \;.
\label{discrete} \end{equation}
Usually, we specify an event in some way other than listing all
the outcomes in it (see below). We do not distinguish
between the outcome $\omega$ and the event that that outcome occurred
$A=\left\{\omega\right\}$. That is, we write $P(\omega)$ for
$P(\left\{\omega\right\})$ or vice versa. This is called ``abuse of
notation'': we use notation in a way that is not absolutely correct but
whose meaning is clear. It's the mathematical version of saying
``I could care less'' to mean the opposite.
\para Countable and uncountable (technical detail):
A probability space (or any set) that is not countable is called
``uncountable''. This distinction was formalized by the late nineteenth
century mathematician Georg Cantor, who showed that the set of (real)
numbers in the interval $[0,1]$ is not countable. Under the uniform
probability density, $P(\omega) = 0$ for any $\omega \in [0,1]$. It is
hard to imagine that the probability formula (\ref{discrete}) is useful
in this case, since
every term in the sum is zero. The difference between continuous
and discrete probability is the difference between integrals and sums.
\para Example:
Toss a coin 4 times. Each toss yields either H (heads) or T (tails).
There are 16 possible outcomes, TTTT, TTTH, TTHT, TTHH, THTT, $\ldots$, HHHH.
The number of outcomes is $\#(\Omega) = \left|\Omega\right|=16$. We
suppose that
each outcome is equally likely, so $P(\omega) = \frac{1}{16}$ for each
$\omega \in \Omega$. If $A$ is the event that the first two tosses are
H, then
$$
A=\left\{ \mbox{HHHH, HHHT, HHTH, HHTT}\right\} \; .
$$
There are 4 elements (outcomes) in $A$, each having probability $\frac{1}{16}$.
Therefore
$$
\displaystyle P(\mbox{first two H}) = P(A) = \sum_{\omega\in A}P(\omega)
= \sum_{\omega \in A} \frac{1}{16} = \frac{4}{16} = \frac{1}{4} \;\; .
$$
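This calculation is easy to confirm by brute-force enumeration. Here is a short Python sketch (purely illustrative, not part of the notes; the names `Omega`, `P`, `A` are my own):

```python
from itertools import product
from fractions import Fraction

# Illustrative enumeration (not from the notes) of the 16 outcomes.
Omega = ["".join(w) for w in product("HT", repeat=4)]
P = {w: Fraction(1, 16) for w in Omega}

A = [w for w in Omega if w.startswith("HH")]   # first two tosses are H
prob_A = sum(P[w] for w in A)
print(len(Omega), len(A), prob_A)              # 16 4 1/4
```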
\para Set operations:
Events are sets, so set operations apply to events.
If $A$ and $B$ are events, the event ``$A$ and $B$'' is the set of outcomes
in both $A$ and $B$. This is the set intersection $A\cap B$, because
the outcomes that make both $A$ and $B$ happen are those that are in
both events.
The union $A\cup B$ is the set of outcomes in $A$ or in $B$ (or in both).
The {\em complement} of $A$, $A^c$, is the event ``not $A$'', the set of
outcomes not in $A$.
The empty event is the empty set, the set with no elements, $\emptyset$.
The probability of $\emptyset$ should be zero because the sum that defines
it has no terms: $P(\emptyset)=0$. The complement of $\emptyset$ is $\Omega$.
Events $A$ and $B$ are disjoint if $A\cap B = \emptyset$. Event $A$ is
contained in event $B$, $A \subseteq B$, if every outcome in $A$ is also in
$B$. For example, if the event $A$ is as above and $B$ is the event that
the first toss is H, then $A\subseteq B$.
\para Basic facts:
Each of these facts is a consequence of the representation
$P(A) = \sum_{\omega \in A}P(\omega)$. First $P(A) \leq P(B)$ if
$A\subseteq B$. Also, $P(A)+P(B) = P(A\cup B)$ if $P(A\cap B) = 0$, but
not otherwise.
If $P(\omega) \neq 0$ for all $\omega \in \Omega$, then $P(A\cap B) = 0$
only when $A$ and $B$ are disjoint.
Clearly, $P(A) + P(A^c) = P(\Omega) = 1$.
\para Conditional probability:
The probability of event $A$ given that $B$
has occurred is the {\em conditional probability} (read
``the probability of $A$ given $B$''),
\begin{equation}
P(A\mid B) = \frac{P(A\cap B)}{P(B)} \; .
\label{br} \end{equation}
This is the fraction of $B$ outcomes that are also $A$ outcomes. The formula
is called {\em Bayes' rule}. It is often used to calculate $P(A\cap B)$
once we know $P(B)$ and $P(A\mid B)$. The formula for that is
$P(A\cap B) = P(A\mid B)P(B)$.
\para Independence:
Events $A$ and $B$ are {\em independent} if $P(A\mid B)=P(A)$.
That is, knowing whether or not $B$ occurred does not change the probability of
$A$. In view of Bayes' rule, this is expressed as
\begin{equation}
P(A\cap B)=P(A) \cdot P(B) \; .
\label{indep} \end{equation}
For example, suppose $A$ is the event that two of the four tosses are H, and
$B$ is the event that the first toss is $H$. Then $A$ has 6 elements
(outcomes), $B$ has 8, and, as you can check by listing them, $A\cap B$
has 3 elements. Since each element has probability $\frac{1}{16}$, this
gives $P(A\cap B) = \frac{3}{16}$ while $P(A) = \frac{6}{16}$ and
$P(B) = \frac{8}{16} = \frac{1}{2}$. We might say ``duh'' for the last
calculation since we started the example with the hypothesis that H and T
were equally likely. Anyway, this shows that (\ref{indep}) is indeed
satisfied in this case. This example is supposed to show that while some
pairs of events, such as the first and second tosses, are ``obviously''
independent, others are independent as the result of a calculation. Note
that if $C$ is the event that 3 of the 4 tosses are $H$ (instead of 2
for $A$), then $P(C) = \frac{4}{16}=\frac{1}{4}$ and
$P(B\cap C) =\frac{3}{16}$, because
$$
B\cap C = \{\mbox{HHHT, HHTH, HTHH}\}
$$
has three elements. Bayes' rule (\ref{br}) gives
$P(B\mid C) = \frac{3}{16}/\frac{1}{4} = \frac{3}{4}$. Knowing that there
are 3 heads in all raises the probability that the first toss is $H$ from
$\frac{1}{2}$ to $\frac{3}{4}$.
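Both calculations can be confirmed by enumeration. The Python sketch below (illustrative only, not from the notes) checks the independence of $A$ and $B$ and the Bayes' rule value $P(B\mid C)=3/4$:

```python
from itertools import product
from fractions import Fraction

# Illustrative check (not from the notes) of the two calculations.
Omega = ["".join(w) for w in product("HT", repeat=4)]
P = {w: Fraction(1, 16) for w in Omega}

def prob(event):
    return sum(P[w] for w in event)

A = {w for w in Omega if w.count("H") == 2}    # exactly two H
B = {w for w in Omega if w[0] == "H"}          # first toss is H
C = {w for w in Omega if w.count("H") == 3}    # exactly three H

# A and B are independent: P(A and B) = P(A) P(B) = 3/16.
assert prob(A & B) == prob(A) * prob(B) == Fraction(3, 16)

# B and C are not: Bayes' rule gives P(B | C) = (3/16)/(1/4) = 3/4.
p_B_given_C = prob(B & C) / prob(C)
print(p_B_given_C)                             # 3/4
```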
\para Working with conditional probability:
Let us fix the event $B$, and discuss the conditional probability
$\widetilde{P}(\omega) = P(\omega \mid B)$, which also is a probability
(assuming $P(B)>0$).
There are two slightly different ways to discuss $\widetilde{P}$.
One way is to take $B$ to be the probability space and define
$$
\widetilde{P}(\omega) = \frac{P(\omega)}{P(B)}
$$
for all $\omega \in B$. Since $B$ is the probability space for
$\widetilde{P}$, we do not have to define $\widetilde{P}$ for
$\omega \notin B$. This $\widetilde{P}$ is a probability because
$\widetilde{P}(\omega) \geq 0$ for all $\omega \in B$ and
$\sum_{\omega \in B}\widetilde{P}(\omega) = 1$. The other way
is to keep $\Omega$ as the probability space and set the conditional
probabilities to zero for $\omega \notin B$. If we know the event
$B$ happened, then the probability of an outcome not in $B$ is zero.
\begin{equation}
P(\omega \mid B) = \left\{ \begin{array}{ll}
\frac{P(\omega)}{P(B)} & \mbox{for $\omega \in B$,} \\
0 & \mbox{for $\omega \notin B$.}
\end{array} \right.
\label{CondProb} \end{equation}
Either way, we restrict to outcomes in $B$ and
``renormalize'' the probabilities by dividing by $P(B)$ so that they
again sum to one.
Note that (\ref{CondProb}) is just the general
conditional probability formula (\ref{br}) applied to the event
$A = \left\{\omega \right\}$.
We can condition a second time by conditioning $\widetilde{P}$
on another event, $C$. It seems natural that
$\widetilde{P}(\omega \mid C)$, which is the conditional
probability of $\omega$ given that $C$ occurred, given that $B$ occurred,
should be the $P$ conditional probability of $\omega$ given that
both $B$ and $C$ occurred. Bayes' rule verifies this intuition:
\begin{eqnarray*}
\widetilde{P}(\omega\mid C) & = & \frac{\widetilde{P}(\omega)}{\widetilde{P}(C)} \\
& = & \frac{P(\omega\mid B)}{P(C\mid B)} \\
& = & \frac{P(\omega)}
{\displaystyle P(B) \frac{P(C\cap B)}{P(B)}} \\
& = & \frac{P(\omega)}{P(B\cap C)} \\
& = & P(\omega \mid B\cap C) \; .
\end{eqnarray*}
The conclusion is that conditioning on $B$ and then on $C$ is the same as
conditioning on $B\cap C$ ($B$ and $C$) all at once.
This {\em tower property}
underlies the many recurrence relations that allow us to get answers in
practical situations.
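The tower property can be verified outcome by outcome on the coin-toss space. In this Python sketch (an illustration, not from the notes; the particular events $B$ and $C$ are my own choices), conditioning on $B$ and then on $C$ agrees with conditioning on $B\cap C$ all at once:

```python
from itertools import product
from fractions import Fraction

# Illustrative check (not from the notes): conditioning on B and then
# on C gives the same probabilities as conditioning on B and C at once.
Omega = ["".join(w) for w in product("HT", repeat=4)]
P = {w: Fraction(1, 16) for w in Omega}

def cond(w, event):
    """P(w | event), with conditional probability zero outside the event."""
    p_event = sum(P[x] for x in event)
    return P[w] / p_event if w in event else Fraction(0)

B = {w for w in Omega if w[0] == "H"}          # first toss is H
C = {w for w in Omega if w.count("H") >= 2}    # at least two H

Ptilde = {w: cond(w, B) for w in Omega}        # condition on B first
Ptilde_C = sum(Ptilde[x] for x in C)           # = P(C | B)
for w in Omega:
    lhs = Ptilde[w] / Ptilde_C if w in C else Fraction(0)
    rhs = cond(w, B & C)                       # condition on both at once
    assert lhs == rhs
print("tower property verified")
```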
\para Algebra of sets and incomplete information:
A set of events, $\cal F$, is an
{\em algebra} if
\begin{description}
\item[\it i:] $\displaystyle A\in \cal F$ implies that $A^c \in \cal F$.
\item[\it ii:]$A\in \cal F$ and $B\in \cal F$ implies that $A\cup B \in \cal F$
and $A\cap B \in \cal F$.
\item[\it iii:] $\Omega \in \cal F$ and $\emptyset \in \cal F$.
\end{description}
We interpret $\cal F$ as representing a state of partial information. We know
whether any of the events in $\cal F$ occurred but we do not have enough
information to determine whether an event not in $\cal F$ occurred. The above
axioms are natural in light of this interpretation. If we know whether $A$
happened, we surely know whether ``not $A$'' happened. If we know whether
$A$ happened and whether $B$ happened, then we can tell whether
``$A$ and $B$'' happened. We definitely know whether $\emptyset$ happened
(it did not) and whether $\Omega$ happened (it did). Events in $\cal F$
are called {\em measurable} or {\em determined in $\cal F$}.
\para Example 1 of an $\cal F$:
Suppose we learn the outcomes of the first two tosses.
One event measurable in $\cal F$ is (with some abuse of notation)
$$
\left\{\mbox{HH}\right\} = \left\{\mbox{HHHH, HHHT, HHTH, HHTT}\right\} \; .
$$
An example of an
event not determined by this $\cal F$ is the event of no more than one H:
$$
A = \left\{\mbox{TTTT, TTTH, TTHT, THTT, HTTT}\right\} \; .
$$
Knowing just the first two tosses does not tell you with certainty whether
the total number of heads is less than two.
\para Example 2 of an $\cal F$:
Suppose we know only the results of the tosses but not the
order. This might happen if we toss 4 identical coins at the same time. In
this case, we know only the number of H coins. Some measurable sets are
(with an abuse of notation)
\begin{eqnarray*}
\left\{4\right\} & = & \left\{ \mbox{HHHH}\right\} \\
\left\{3\right\} & = & \left\{ \mbox{HHHT, HHTH, HTHH, THHH}\right\} \\
\vdots & & \\
\left\{0\right\} & = & \left\{\mbox{TTTT}\right\}
\end{eqnarray*}
The event $\{2\}$ has $6$ outcomes (list them), so its probability is
$\displaystyle 6 \cdot \frac{1}{16} = \frac{3}{8}$. There are other
events measurable in this algebra, such as ``less than 3 H'', but, in
some sense, the events listed {\em generate} the algebra.
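The sizes of these generating events are the binomial coefficients $1, 4, 6, 4, 1$, which a short enumeration confirms (Python, illustrative only, not from the notes):

```python
from itertools import product
from fractions import Fraction

# Illustrative enumeration (not from the notes): partition the 16
# outcomes by the number of H, as in Example 2.
Omega = ["".join(w) for w in product("HT", repeat=4)]
P = {w: Fraction(1, 16) for w in Omega}

classes = {k: [w for w in Omega if w.count("H") == k] for k in range(5)}
sizes = {k: len(c) for k, c in classes.items()}
print(sizes)                                   # {0: 1, 1: 4, 2: 6, 3: 4, 4: 1}
prob_two = sum(P[w] for w in classes[2])
print(prob_two)                                # 3/8
```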
\para $\sigma-$algebra:
An algebra of sets is a $\sigma-$algebra (pronounced ``sigma algebra'')
if it is {\em closed under countable intersections}, which means the
following.
Suppose $A_n \in {\cal F}$ is a countable
family of events measurable in ${\cal F}$, and $A = \cap_{n}A_n$ is the
set of outcomes in all of the $A_n$, then $A \in {\cal F}$, too.
The reader can check that an algebra closed under countable intersections
is also closed under countable unions, and conversely.
An algebra is automatically a $\sigma-$algebra if $\Omega$ is finite.
If $\Omega$ is infinite, an algebra might or might not be a
$\sigma-$algebra.\footnote{Let
$\Omega$ be the set of integers and $A \in {\cal F}$ if $A$ is finite
or $A^c$ is finite. This $\cal F$ is an algebra (check), but not
a $\sigma-$algebra. For example, if $A_n$ leaves out only the first
$n$ odd integers, then $A$ is the set of even integers, and neither
$A$ nor $A^c$ is finite.} In a $\sigma-$algebra, it is possible to
take limits of infinite sequences of events, just as it is possible
to take limits of sequences of real numbers. We will never (again)
refer to an algebra of events that is not a $\sigma-$algebra.
\para Terminology:
What we call ``outcome'' is usually called ``random variable''.
I did not use this terminology because it can be confusing, in that we
often think of ``variables'' as real (or complex) numbers.
A ``real valued function'' of the random variable
$\omega$ is a real number $X$ for each $\omega$, written $X(\omega)$. The
most common abuse of notation in probability is to write $X$ instead of
$X(\omega)$. We will do this most of the time, but not just yet. We often
think of $X$ as a random number whose value is determined by the outcome
(random variable) $\omega$. A common convention is to use upper case letters
for random numbers and lower case letters for specific values of that variable.
For example, the ``cumulative distribution function'' (CDF), $F(x)$, is the
probability that $X\leq x$, that is:
$\displaystyle F(x) = \sum_{X(\omega)\leq x}P(\omega)$.
\para Informal event terminology:
We often describe events in words. For example, we might write
$P(X\leq x)$ where, strictly speaking, we should define
$A_x = \left\{ \omega \mid X(\omega) \leq x \right\}$ and then write
$P(X \leq x) = P(A_x)$. For example, if there are two functions,
$X_1$ and $X_2$, we might try to calculate the probability that they
are equal, $P(X_1 = X_2)$. Strictly speaking, this is
the probability of the set of $\omega$ so that $X_1(\omega) = X_2(\omega)$.
\para Measurable:
A function (of a random variable) $X(\omega)$ is measurable
with respect to the algebra $\cal F$ if the value of $X$ is completely
determined by the information in $\cal F$.
To give a mathematical definition, for any number, $x$, we can consider
the event that $X=x$, which is $B_x = \{\omega \; : \; X(\omega) = x\}$.
In discrete probability, $B_x$
will be the empty set for almost all $x$ values and will not be empty only
for those values of $x$ actually taken by $X(\omega)$ for one of the
outcomes $\omega$.
The function $X(\omega)$ is ``measurable with respect
to $\cal F$'' if the sets $B_x$ are all measurable. People often write
$X \in \cal F$ (an abuse of notation) to indicate that $X$ is measurable
with respect to $\cal F$. In Example 2 above, the function
$X = \mbox{number of H minus number of T}$ is measurable, while the
function $X=\mbox{number of T before the first H}$ is not (find an
$x$ so that $B_x\notin \cal F$ to show this).
\para Generating an algebra of sets:
Suppose there are events $A_1$, $\ldots$, $A_k$
that you know. The algebra, $\cal F$, generated by these sets is the algebra
that expresses the information about the outcome you gain by knowing these
events. One definition of $\cal F$ is that an event $A$ is in $\cal F$ if
$A$ can be expressed in terms of the known events $A_j$ using the set operations
intersection, union, and complement a number of times.
For example, we
could define an event $A$ by saying ``$\omega$ is in $A_1$ and ($A_2$ or $A_3$)
but not in $A_4$ or $A_5$'', which would be written
$A = (A_1 \cap (A_2\cup A_3)) \cap (A_4 \cup A_5)^c$.
This is the same as saying that $\cal F$ is the
smallest algebra of sets that contains the known events $A_j$. Obviously
(think about this!) any algebra that contains the $A_j$ contains any event
described by set operations on the $A_j$; that is the definition of an
algebra of sets. Also, the sets defined by set operations on the $A_j$
form an algebra of sets.
For example, if $A_1$ is the event that the first toss
is H and $A_2$ is the event that both the first two are $H$, then $A_1$ and
$A_2$ generate the algebra of events determined by knowing the results of
the first two tosses. This is Example 1 above. To generate a
$\sigma-$algebra, we may have to allow infinitely many set operations,
but a precise discussion of this would be ``off message''.
\para Generating by a function:
A function $X(\omega)$ defines an algebra of sets
generated by the sets $B_x$. This is the smallest algebra, $\cal F$, so
that $X$ is measurable with respect to $\cal F$. Example 2 above has
this form. We can think of $\cal F$ as being the algebra of sets defined
by statements about the values of $X(\omega)$. For example, one
$A \in \cal F$ would be the set of $\omega$ with $X$ either between 1 and 3
or greater than 4.
We write ${\cal F}_X$ for the algebra of sets generated by $X$ and ask
what it means that another function of $\omega$, $Y(\omega)$, is measurable
with respect to ${\cal F}_X$. The information interpretation of ${\cal F}_X$
says that $Y\in {\cal F}_X$ if knowing the value of $X(\omega)$ determines
the value of $Y(\omega)$. This means that if $\omega_1$ and $\omega_2$
have the same $X$ value ($X(\omega_1) = X(\omega_2)$) then they also have
the same $Y$ value. Said another way, if $B_x$ is not empty, then there
is some number, $u(x)$, so that $Y(\omega) = u(x)$ for every $\omega \in B_x$.
This means that $Y(\omega) = u(X(\omega))$ for all $\omega \in \Omega$.
Altogether, saying $Y\in {\cal F}_X$ is a fancy way of saying
that $Y$ is a function of $X$. Of course, $u(x)$ only needs to be defined
for those values of $x$ actually taken by the random variable $X$.
For example, if $X$ is the number of $H$ in 4 tosses, and $Y$ is the
number of $H$ minus the number of $T$, then, for any 4 tosses, $\omega$,
$Y(\omega)=2X(\omega) - 4$. That is, $u(x) = 2x-4$.
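The relation $Y = u(X)$ can be checked over all 16 outcomes (Python sketch, illustrative only, not from the notes):

```python
from itertools import product

# Illustrative check (not from the notes) that Y is a function of X:
# Y = u(X) with u(x) = 2x - 4 on every outcome.
Omega = ["".join(w) for w in product("HT", repeat=4)]
X = {w: w.count("H") for w in Omega}                 # number of H
Y = {w: w.count("H") - w.count("T") for w in Omega}  # H minus T

u = lambda x: 2 * x - 4
assert all(Y[w] == u(X[w]) for w in Omega)
print("Y = u(X) on all 16 outcomes")
```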
\para Equivalence relation:
A $\sigma-$algebra, $\cal F$, determines an {\em equivalence relation}.
Outcomes $\omega_1$ and $\omega_2$ are equivalent, written
$\omega_1 \sim \omega_2$, if the information in $\cal F$ does not distinguish
$\omega_1$ from $\omega_2$. More formally, $\omega_1 \sim \omega_2$ if
$\omega_1 \in A \Rightarrow \omega_2\in A$ for every $A\in \cal F$.
For example, in Example 2 above, $\mbox{THTT} \sim \mbox{TTTH}$.
Because $\cal F$ is an algebra, $\omega_1 \sim \omega_2$ also implies
that $\omega_1 \notin A \Rightarrow \omega_2 \notin A$ (think this through).
The {\em equivalence class} of outcome $\omega$ is the set of outcomes
equivalent to $\omega$ in $\cal F$, indistinguishable from $\omega$ using
the information available in $\cal F$. If $A_{\omega}$ is the
equivalence class of $\omega$, then $A_{\omega} \in \cal F$.
Note that it is possible that $A_{\omega} = A_{\omega^{\prime}}$
while $\omega \neq \omega^{\prime}$; this happens when
$\omega \sim \omega^{\prime}$.
(Proof: for any $\omega^{\prime}$ not equivalent to $\omega$ in
$\cal F$, there is at least one $B_{\omega^{\prime}} \in \cal F$ with
$\omega \in B_{\omega^{\prime}}$ but
$\omega^{\prime} \notin B_{\omega^{\prime}}$.
Since there are (at most) countably many $\omega^{\prime}$,
and $\cal F$ is a $\sigma-$algebra,
$A_{\omega} = \cap_{\omega^{\prime}}B_{\omega^{\prime}} \in \cal F$.
This $A_{\omega}$ contains every $\omega_1$ that is equivalent to $\omega$
(why?) and only those.)
In Example 2, the equivalence class of THTT is the event
$\left\{\mbox{HTTT, THTT, TTHT, TTTH}\right\}$.
\para Partition:
A {\em partition} of $\Omega$ is a collection of events,
${\cal P} = \left\{B_1, B_2, \ldots\right\}$ so that every outcome
$\omega \in \Omega$ is in exactly one of the events $B_k$.
The $\sigma-$algebra generated by $\cal P$, which we call $\cal F_{\cal P}$,
consists of events that are unions of events in $\cal P$ (Why are complements
and intersections not needed?). For any partition $\cal P$, the equivalence
classes of $\cal F_{\cal P}$ are the events in $\cal P$ (think this
through).
Conversely, if $\cal P$ is the partition of $\Omega$ into equivalence
classes for $\cal F$, then $\cal P$ generates $\cal F$.
In Example 2 above, the sets $B_k = \left\{k\right\}$ form the
partition corresponding to $\cal F$. More generally, the
sets $B_x = \left\{ \omega \mid X(\omega) = x \right\}$
that are not empty are the partition corresponding to ${\cal F}_X$.
In discrete probability, partitions are a convenient way
to understand conditional expectation (below).
The information in ${\cal F}_{\cal P}$ is the knowledge of which of
the $B_j$ happened.
The remaining uncertainty is which of the $\omega \in B_j$ happened.
\para Expected value:
A random variable (actually, a function of a random variable)
$X(\omega)$ has expected value
$$
E[X] = \sum_{\omega \in \Omega} X(\omega) P(\omega) \; .
$$
(Note that we do not write $\omega$ on the left. We think of $X$ as simply
a random number and $\omega$ as a story telling how $X$ was generated.)
This is the ``average'' value in the sense that if you could perform the
``experiment'' of sampling $X$ many times then average the resulting
numbers, you would get roughly $E[X]$. This is because $P(\omega)$ is the
fraction of the time you would get $\omega$ and $X(\omega)$ is the number
you get for $\omega$. If $X_1(\omega)$ and $X_2(\omega)$ are two random
variables, then $E[X_1+X_2] = E[X_1] + E[X_2]$. Also, $E[cX] = cE[X]$
if $c$ is a constant (not random).
\para Best approximation property:
If we wanted to approximate a random variable, $X$, (function $X(\omega)$
with $\omega$ not written) by a single non random number, $x$, what
value would we pick? That would depend on the sense of ``best''. One
such sense is {\em least squares}, choosing $x$ to minimize the expected
value of $(X-x)^2$. A calculation, which uses the above properties of
expected value, gives
\begin{eqnarray*}
E\left[\left(X-x\right)^2\right] & = & E[X^2 - 2Xx + x^2 ] \\
& = & E[X^2] - 2xE[X] + x^2 \; .
\end{eqnarray*}
Minimizing this over $x$ gives the optimal value
\begin{equation}
x_{\mbox{\scriptsize opt}} = E[X] \; .
\label{exp} \end{equation}
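For a concrete check (Python, illustrative only, not from the notes; I take $X$ to be the number of H in 4 fair tosses, so $E[X]=2$), the expected squared error is indeed minimized at $x = E[X]$:

```python
from itertools import product
from fractions import Fraction

# Illustrative check (not from the notes): the constant x that
# minimizes E[(X - x)^2] is E[X]. Here X = number of H in 4 fair tosses.
Omega = ["".join(w) for w in product("HT", repeat=4)]
P = {w: Fraction(1, 16) for w in Omega}
X = {w: w.count("H") for w in Omega}

EX = sum(X[w] * P[w] for w in Omega)                 # = 2

def mse(x):
    return sum((X[w] - x) ** 2 * P[w] for w in Omega)

print(EX, mse(EX))                                   # 2 1
# scanning a grid of other constants never does better
assert all(mse(Fraction(k, 2)) >= mse(EX) for k in range(-4, 13))
```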
\para Classical conditional expectation:
There are two senses of the term {\em conditional expectation}. We start
with the original {\em classical} sense then turn to the related but different
{\em modern} sense often
used in stochastic processes. Conditional expectation is defined from
conditional probability in the obvious way
\begin{equation}
E[X|B] = \sum_{\omega \in B} X(\omega)P(\omega|B) \; .
\label{Classic} \end{equation}
For example, we can calculate
$$
E[\# \mbox{of H in 4 tosses } \mid \mbox{ at least one H}] \; .
$$
Write $B$ for the event $\{\mbox{at least one H}\}$. Since only
$\omega =$TTTT does not have at least one H, $\left|B\right| = 15$ and
$P(\omega\mid B) = \frac{1}{15}$ for any $\omega \in B$.
Let $X(\omega)$ be the number of H in $\omega$.
Unconditionally, $E[X] = 2$, which means
$$
\frac{1}{16} \sum_{\omega\in\Omega} X(\omega) = 2 \; .
$$
Note that $X(\omega) = 0$ for all $\omega \notin B$ (only TTTT), so
$$
\sum_{\omega \in \Omega} X(\omega) P(\omega)
= \sum_{\omega \in B} X(\omega)P(\omega)\; ,
$$
and therefore
\begin{eqnarray*}
\frac{1}{16}\sum_{\omega \in B}X(\omega) & = & 2 \\
\frac{15}{16} \cdot \frac{1}{15} \sum_{\omega \in B}X(\omega)
& = & 2 \\
\frac{1}{15} \sum_{\omega \in B}X(\omega) & = & \frac{2 \cdot 16}{15} \\
E[X\mid B] & = & \frac{32}{15} \;\;\;=\;\; 2 + .133\ldots\;\; .
\end{eqnarray*}
Knowing that there was at least one H increases the expected number of H
by $.133\ldots$.
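The value $32/15$ also comes straight from the definition (\ref{Classic}) by enumeration (Python sketch, illustrative only, not from the notes):

```python
from itertools import product
from fractions import Fraction

# Illustrative check (not from the notes) of E[X | at least one H].
Omega = ["".join(w) for w in product("HT", repeat=4)]
P = {w: Fraction(1, 16) for w in Omega}
X = {w: w.count("H") for w in Omega}

B = [w for w in Omega if "H" in w]        # all outcomes except TTTT
pB = sum(P[w] for w in B)                 # 15/16
E_X_given_B = sum(X[w] * P[w] / pB for w in B)
print(E_X_given_B)                        # 32/15
```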
\para Law of total probability:
Suppose ${\cal P} = \left\{B_1, B_2, \ldots \right\}$ is a partition of
$\Omega$. The {\em law of total probability} is the formula
\begin{equation}
E[X] = \sum_k E[X \mid B_k] P(B_k) \; .
\label{TotProb} \end{equation}
This is easy to understand: exactly one of the events $B_k$ happens.
The expected
value of $X$ is the sum over each of the events $B_k$ of the expected
value of $X$ given that $B_k$ happened, multiplied by the probability
that $B_k$ did happen. The derivation is a simple combination of the
definitions of conditional expectation (\ref{Classic}) and conditional
probability (\ref{CondProb}):
\begin{eqnarray*}
E[X] & = & \sum_{\omega \in \Omega} X(\omega) P(\omega) \\
& = & \sum_k \left( \sum_{\omega \in B_k} X(\omega) P(\omega) \right) \\
& = & \sum_k \left( \sum_{\omega \in B_k}
X(\omega) \frac{P(\omega)}{P(B_k)} \right) P(B_k) \\
& = & \sum_k E[X \mid B_k] P(B_k) \; .
\end{eqnarray*}
This fact underlies the recurrence relations that are among the primary tools
of stochastic calculus. It will be reformulated below as the
{\em tower property} when we discuss the modern view of conditional
probability.
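Here is a direct numerical check of (\ref{TotProb}) on the coin-toss space (Python, illustrative only, not from the notes; the choice of $X$ and of the partition are mine):

```python
from itertools import product
from fractions import Fraction

# Illustrative check (not from the notes) of the law of total
# probability, partitioning by the number of H as in Example 2.
Omega = ["".join(w) for w in product("HT", repeat=4)]
P = {w: Fraction(1, 16) for w in Omega}
X = {w: w.count("T") for w in Omega}      # X = number of T, so E[X] = 2

parts = [[w for w in Omega if w.count("H") == k] for k in range(5)]
EX = sum(X[w] * P[w] for w in Omega)
total = Fraction(0)
for Bk in parts:
    pBk = sum(P[w] for w in Bk)
    E_given = sum(X[w] * P[w] / pBk for w in Bk)   # E[X | B_k]
    total += E_given * pBk
assert EX == total == 2
print(EX)                                 # 2
```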
\para Modern conditional expectation:
The modern conditional expectation starts with an algebra, $\cal F$, rather
than just the set $B$.
It defines a (function of a) random variable, $Y(\omega) = E[X\mid {\cal F}]$,
that is measurable with respect to $\cal F$ even though $X$ is not.
This function represents the best prediction (in the least squares sense)
of $X$ given the information in $\cal F$.
If $X \in \cal F$, then the value of $X(\omega)$ is determined
by the information in $\cal F$, so $Y = X$.
In the classical case, the information is
the occurrence or non-occurrence of a single event, $B$. That is,
the algebra, ${\cal F}_B$, consists only of the sets $B$, $B^c$, $\emptyset$,
and $\Omega$.
For this ${\cal F}_B$, the modern definition gives a function
$Y(\omega)$ so that
$$
Y(\omega) = \left\{ \begin{array}{ll}
E[X\mid B] & \mbox{if $\omega \in B$,}\\
E[X\mid B^c] & \mbox{if $\omega \notin B$.}\\ \end{array}\right.
$$
Make sure you understand the fact that this two valued function $Y$ is
measurable with respect to ${\cal F}_B$.
Only slightly more complicated is the case where $\cal F$ is generated by
a partition, ${\cal P} = \left\{B_1, B_2, \ldots \right\}$, of $\Omega$.
The conditional expectation $Y(\omega) = E[X\mid {\cal F}]$
is defined to be
\begin{equation}
Y(\omega) = E[X\mid B_j] \; \mbox{ if $\omega \in B_j$ } \; ,
\label{ModCondProb} \end{equation}
where $E[X\mid B_j]$ is classical conditional expectation (\ref{Classic}).
A single set $B$ defines a partition: $B_1 = B$, $B_2 = B^c$, so this agrees
with the earlier definition in that case.
The information in $\cal F$ is only which of the $B_j$ occurred.
The modern conditional expectation replaces $X$ with its expected value over
the set that occurred. This is the expected value of $X$ given the
information in $\cal F$.
\para Example of modern conditional expectation:
Take $\Omega$ to be sequences of 4 coin tosses. Take $\cal F$ to
be the algebra of Example 2 determined by the number of H tosses. Take
$X(\omega)$ to be the number of H tosses before the first T
(e.g. $X(\mbox{HHTH}) = 2$, $X(\mbox{TTTT}) = 0$, $X(\mbox{HHHH}) = 4$,
etc.). With the usual abuse of notation, we calculate (below):
$Y(\left\{0\right\}) = 0$, $Y(\left\{1\right\}) = 1/4$,
$Y(\left\{2\right\}) = 2/3$, $Y(\left\{3\right\}) = 3/2$,
$Y(\left\{4\right\}) = 4$. Note, for example, that because
HHTT and HTHT are equivalent in $\cal F$ (in the equivalence class
$\left\{2\right\}$), $Y(\mbox{HHTT}) = Y(\mbox{HTHT})=2/3$ even though
$X(\mbox{HHTT}) \neq X(\mbox{HTHT})$. The common value of $Y$ is
the average value of $X$ over the outcomes in the equivalence class.
$$
\begin{array}{cc}
\hline \\
\left\{0\right\} & \begin{array}{c}
\mbox{TTTT} \\
0 \end{array} \\
& \mbox{expected value } = 0 \\
& \\
\hline \\
\left\{1\right\} & \begin{array}{cccc}
\mbox{HTTT} & \mbox{THTT} & \mbox{TTHT} & \mbox{TTTH} \\
1 & 0 & 0 & 0 \end{array} \\
& \mbox{expected value } = (1 + 0 + 0 + 0) / 4 = 1/4 \\
& \\
\hline \\
\left\{2\right\} & \begin{array}{cccccc}
\mbox{HHTT} & \mbox{HTHT} & \mbox{HTTH} & \mbox{THHT} & \mbox{THTH} & \mbox{TTHH}\\
2 & 1 & 1 & 0 & 0 & 0 \end{array} \\
& \mbox{expected value } = (2 + 1 + 1 + 0 + 0 + 0) / 6 = 2/3 \\
& \\
\hline \\
\left\{3\right\} & \begin{array}{cccc}
\mbox{HHHT} & \mbox{HHTH} & \mbox{HTHH} & \mbox{THHH} \\
3 & 2 & 1 & 0 \end{array} \\
& \mbox{expected value } = (3 + 2 + 1 + 0) / 4 = 3/2 \\
& \\
\hline \\
\left\{4\right\} & \begin{array}{c}
\mbox{HHHH} \\
4 \end{array} \\
& \mbox{expected value } = 4 \\
& \\
\hline \\
\end{array}
$$
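\para Numerical check of the example:
The table above is easy to verify by direct computation. The following
sketch (in Python; the helper names are my own) builds the uniform
probability space of the 16 toss sequences, computes $X$, and averages it
over each equivalence class of $\cal F$:

```python
from itertools import product

# All 16 equally likely outcomes of four coin tosses (P = 1/16 each).
outcomes = ["".join(seq) for seq in product("HT", repeat=4)]

def X(omega):
    """Number of H tosses before the first T."""
    n = 0
    for c in omega:
        if c == "T":
            break
        n += 1
    return n

def n_heads(omega):
    """The function generating the algebra F: total number of H tosses."""
    return omega.count("H")

def Y(omega):
    """E[X | F]: the average of X over the equivalence class of omega."""
    block = [w for w in outcomes if n_heads(w) == n_heads(omega)]
    return sum(X(w) for w in block) / len(block)

for k in range(5):
    rep = next(w for w in outcomes if n_heads(w) == k)
    print(k, Y(rep))    # 0, 1/4, 2/3, 3/2, 4, as in the table
```

The printed averages agree with the expected values in the table:
$0$, $1/4$, $2/3$, $3/2$, $4$.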
\para Best approximation property:
Suppose we have a random variable, $X(\omega)$, that is not measurable with
respect to the $\sigma$-algebra $\cal F$. That is, the information in
$\cal F$ does not completely determine the value of $X$. The conditional
expectation, $Y(\omega) = E[X\mid {\cal F}]$, is, among all functions
measurable with respect to $\cal F$, the closest to $X$ in the least
squares sense.
That is, if $Z \in \cal F$, then
$$
E\left[ ( Z - X )^2\right] \geq E\left[ ( Y - X )^2\right] \; .
$$
In fact, this best approximation property will be the definition of
conditional expectation in
situations where the partition definition is not directly applicable.
The best approximation property for modern conditional expectation
is a consequence of the best approximation for classical conditional
expectation. The least squares error is the sum of the least squares
errors over each $B_k$ in the partition defined by $\cal F$. We
minimize the least squares error in $B_k$ by choosing
$Y(B_k)$ to be the average of $X$ over $B_k$ (weighted by the
probabilities $P(\omega)$ for $\omega \in B_k$). By choosing the best
approximation in each $B_k$, we get the best approximation overall.
This can be expressed in the terminology of linear algebra. The set
of functions (random variables) $X$ is a vector space (Hilbert space)
with inner product
$$
\langle X,Y\rangle = \sum_{\omega\in \Omega} X(\omega)Y(\omega)P(\omega)
= E\left[XY\right] \; ,
$$
so $\left\|X - Y \right\|^2 = E\left[(X-Y)^2\right]$.
The set of functions measurable with respect to $\cal F$ is a subspace,
which we call $\cal S_F$. The conditional expectation, $Y$, is the
orthogonal projection of $X$ onto $\cal S_F$, which is the element of
$\cal S_F$ that is closest to $X$ in the norm just given.
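\para Numerical check of the best approximation property:
The projection property can be tested directly in the coin toss example.
The sketch below (Python; the helper names are my own) computes
$Y = E[X\mid{\cal F}]$ and checks that randomly chosen $\cal F$-measurable
functions $Z$ never beat it in the least squares sense:

```python
import random
from itertools import product

outcomes = ["".join(s) for s in product("HT", repeat=4)]  # uniform, P = 1/16

def X(w):
    """Number of H tosses before the first T."""
    n = 0
    for c in w:
        if c == "T":
            break
        n += 1
    return n

def key(w):
    """Label of the equivalence class of w in F: the number of H tosses."""
    return w.count("H")

# Conditional expectation Y = E[X | F]: the average of X over each block.
blocks = {k: [w for w in outcomes if key(w) == k] for k in range(5)}
Y = {w: sum(X(v) for v in blocks[key(w)]) / len(blocks[key(w)])
     for w in outcomes}

def mse(Z):
    """E[(Z - X)^2] under the uniform measure."""
    return sum((Z[w] - X(w)) ** 2 for w in outcomes) / len(outcomes)

best = mse(Y)
random.seed(0)
for _ in range(1000):
    # Any F-measurable Z assigns one arbitrary value per block.
    vals = {k: random.uniform(-5.0, 5.0) for k in range(5)}
    Z = {w: vals[key(w)] for w in outcomes}
    assert mse(Z) >= best            # Y is never beaten
```

Shifting $Y$ by a constant $c$ raises the error by exactly $c^2$, since the
cross term $2cE[Y-X]$ vanishes; this is the orthogonality of the projection.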
\para Tower property:
Suppose $\cal G$ is a $\sigma$-algebra that has less information
than $\cal F$.
That is, every event in $\cal G$ is also in $\cal F$, but
events in $\cal F$ need not be in $\cal G$.
This is expressed simply
(without abuse of notation) as ${\cal G} \subseteq \cal F$.
Consider the (modern) conditional expectations $Y = E[X\mid {\cal F}]$
and $Z = E[X\mid {\cal G}]$.
The {\em tower property} is the fact that $Z = E[Y\mid{\cal G}]$. That is,
conditioning in one step gives the same result as conditioning in
two steps. As we said before, the tower property underlies the backward
equations that are among the most useful tools of stochastic calculus.
The tower property is an application of the law of total probability to
conditional expectation. Suppose $\cal P$ and $\cal Q$ are the partitions
of $\Omega$ corresponding to $\cal F$ and $\cal G$ respectively.
The partition $\cal P$ is a {\em refinement} of $\cal Q$, which
means that each $C_k \in \cal Q$ itself is partitioned into events
$\left\{B_{k,1}, B_{k,2}, \ldots \right\}$, where the $B_{k,j}$ are
elements of $\cal P$. Then (see ``Working with conditional probability'')
for $\omega \in C_k$, we want to show that $Z(\omega) = E[Y\mid C_k]$:
\begin{eqnarray*}
Z(\omega) & = & E[X \mid C_k] \\
& = & \sum_j E[X \mid B_{k,j}] P(B_{k,j}\mid C_k ) \\
& = & \sum_j Y(B_{k,j}) P(B_{k,j}\mid C_k) \\
& = & E[Y\mid C_k] \; .
\end{eqnarray*}
The linear algebra projection interpretation makes the tower property
seem obvious.
Any function measurable with respect to $\cal G$ is also measurable
with respect to $\cal F$, which means that the subspace $\cal S_G$
is contained in $\cal S_F$. If you project $X$ onto
$\cal S_F$ then project the projection onto $\cal S_G$, you get
the same thing as projecting $X$ directly onto $\cal S_G$ (always
orthogonal projections).
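\para Numerical check of the tower property:
The identity $Z = E[Y\mid{\cal G}]$ can be checked on the four toss space
with, say, $\cal F$ generated by the first two tosses and $\cal G$ by the
first toss alone, so that ${\cal G} \subseteq {\cal F}$. A sketch in Python
(the helper names are my own):

```python
from itertools import product

outcomes = ["".join(s) for s in product("HT", repeat=4)]  # uniform, P = 1/16

def X(w):
    """Number of H tosses before the first T."""
    n = 0
    for c in w:
        if c == "T":
            break
        n += 1
    return n

def cond_exp(f, key):
    """E[f | algebra generated by key]: average f over each equivalence class."""
    blocks = {}
    for w in outcomes:
        blocks.setdefault(key(w), []).append(w)
    return {w: sum(f(v) for v in blocks[key(w)]) / len(blocks[key(w)])
            for w in outcomes}

F_key = lambda w: w[:2]    # F: the first two tosses are known (finer)
G_key = lambda w: w[:1]    # G: only the first toss is known (coarser)

Y = cond_exp(X, F_key)                  # Y = E[X | F]
Z = cond_exp(X, G_key)                  # Z = E[X | G]
Z2 = cond_exp(lambda w: Y[w], G_key)    # E[Y | G]

# Tower property: conditioning in two steps equals conditioning in one.
assert all(abs(Z[w] - Z2[w]) < 1e-12 for w in outcomes)
```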
\para Modern conditional probability:
Probabilities can be defined as expected values of characteristic functions
(see below).
Therefore, the modern definition of conditional expectation gives a
modern definition of conditional probability. For any event, $A$, the
{\em indicator function}, ${\bf 1}_A(\omega)$, (also written
$\chi_A(\omega)$, for ``characteristic function'', terminology
less used by probabilists because characteristic function means
something else to them) is defined by ${\bf 1}_A(\omega) = 1$
if $\omega \in A$, and ${\bf 1}_A(\omega) = 0$ if $\omega \notin A$.
The obvious formula $P(A) = E[{\bf 1}_A]$ is the representation of the
probability as an expected value. The modern conditional probability
then is $P(A\mid {\cal F}) = E[{\bf 1}_A \mid {\cal F}]$. Unraveling
the definitions, this is a function, $Y_A(\omega)$, that takes the value
$P(A\mid B_k)$ whenever $\omega \in B_k$.
A related statement, given for practice with notation, is
$$
P(A\mid {\cal F})(\omega) =
\sum_{B_k \in \cal P_{\cal F}} P(A\mid B_k) {\bf 1}_{B_k}(\omega) \; .
$$
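\para Numerical check of modern conditional probability:
The last formula can be verified by averaging the indicator function over
each block of the partition. A sketch (Python; the choice of event $A$ =
``the first toss is H'' and the names are my own):

```python
from itertools import product

outcomes = ["".join(s) for s in product("HT", repeat=4)]  # uniform, P = 1/16

# Event A: the first toss is H; ind_A is its indicator function 1_A.
ind_A = {w: 1 if w[0] == "H" else 0 for w in outcomes}

# Partition of F: blocks B_k labeled by the number of H tosses.
blocks = {}
for w in outcomes:
    blocks.setdefault(w.count("H"), []).append(w)

def P_A_given_F(w):
    """P(A | F)(w) = E[1_A | F](w) = P(A | B_k) for w in B_k."""
    b = blocks[w.count("H")]
    return sum(ind_A[v] for v in b) / len(b)

for k, b in sorted(blocks.items()):
    print(k, P_A_given_F(b[0]))   # 0, 1/4, 1/2, 3/4, 1
```

As expected, $P(A\mid{\cal F})$ is constant on each block, and its values
$0$, $1/4$, $1/2$, $3/4$, $1$ are the classical probabilities
$P(A\mid B_k)$.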
\section{Markov Chains, I}
\para Introduction:
Discrete time Markov\footnote{The Russian mathematician A.\ A.\ Markov
was active in the last decades of the $19^{th}$ century.
He is known for his path breaking work in number theory
as well as in probability.}
chains are a simple abstract class of discrete random processes.
Many practical models are Markov chains.
Here we discuss Markov chains having a finite {\em state space} (see below).
Many of the general concepts above come into play here. The probability
space $\Omega$ is the space of paths. The natural states of partial
information are described by the algebras ${\cal F}_t$, which represent
the information obtained by observing the chain up to time $t$.
The tower property applied to the ${\cal F}_t$ leads to backward and
forward equations. This section is mostly definitions. The good
stuff is in the next section.
\para Time:
The time variable, $t$, will be an integer representing the number of
time units from a starting time. The actual time to go from $t$ to $t+1$
could be a nanosecond (for modeling computer communication networks) or a
month (for modeling bond rating changes), or whatever. To be specific,
we usually start with $t=0$ and consider only nonnegative times.
\para State space:
At time $t$ the system will be in one of a finite list of
states. This set of states is the {\em state space}, $\cal S$. To be a
Markov chain, the state should be a complete description of the
actual state of the system at time $t$. This means that it should contain
any information about the system at time $t$ that helps predict the state
at future times $t+1$, $t+2$, $\ldots$.
This is illustrated with the hidden Markov model below.
The state at time $t$ will be called $X(t)$ or $X_t$. Eventually, there may
be an $\omega$ also, so that the state is a function of $t$ and $\omega$:
$X(t,\omega)$ or $X_t(\omega)$. The states may be called $s_1$, $\ldots$,
$s_m$, or simply $1, 2, \ldots, m$, depending on the context.
\para Path space:
The sequence of states $X_0$, $X_1$, $\ldots$, $X_T$, is a {\em path}.
The set of paths is {\em path space}.
It is possible and often convenient to use the set of paths as the probability
space, $\Omega$.
When we do this, the path
$X = (X_0, X_1, \ldots,X_T) = (X(0), X(1), \ldots, X(T))$ plays the role
that was played by the outcome $\omega$ in the general theory above.
We will soon have a formula for $P(X)$, the probability of path $X$,
in terms of {\em transition probabilities}.
In principle, it should be possible to calculate the
probability of any event (such as $\left\{X(2) \neq s\right\}$, or
$\left\{\mbox{$X(t)= s_1$ for some $t\leq T$}\right\}$) by listing
all the paths (outcomes) in that event and summing their probabilities.
This is rarely the easiest way.
For one thing, the path space, while finite, tends to be enormous.
For example, if there are $m = \left|{\cal S}\right| = 7$ states and $T=50$
times, then the number of paths is $\left|\Omega\right| = m^T = 7^{50}$,
which is about $1.8\times 10^{42}$. This number is far beyond the reach
of any computer.
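\para Aside on the path count:
The figure quoted above is a one line computation (Python):

```python
m, T = 7, 50           # 7 states and 50 times, as in the text
n_paths = m ** T       # size of the path space
print(f"{n_paths:.2e}")
```

The printed value is about $1.8\times 10^{42}$, matching the estimate in
the text.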
\para Algebras ${\cal F}_t$ and ${\cal G}_t$:
The information learned by observing a Markov chain up to and including
time $t$ is ${\cal F}_t$.
Paths $X$ and $\widetilde{X}$ are equivalent in ${\cal F}_t$ if
$X(s) = \widetilde{X}(s)$ for $0 \leq s \leq t$.
Said only slightly differently, the equivalence class of path $X$ is the
set of paths $X^{\prime}$ with $X^{\prime}(s) = X(s)$ for $0 \leq s \leq t$.
The ${\cal F}_t$ form an increasing family of algebras:
${\cal F}_t \subseteq {\cal F}_{t+1}$. (Event $A$ is in ${\cal F}_t$
if we can tell whether $A$ occurred by knowing $X(s)$ for $0\leq s\leq t$.
In this case, we also can tell whether $A$ occurred by knowing $X(s)$ for
$0 \leq s \leq t+1$, which is what it means for $A$ to be in
${\cal F}_{t+1}$.)
The algebra ${\cal G}_t$ is generated by $X(t)$ only. It encodes the
information learned by observing $X$ at time $t$ only, not at earlier
times. Clearly ${\cal G}_t \subseteq {\cal F}_t$, but
${\cal G}_t$ is not contained in ${\cal G}_{t+1}$, because $X(t+1)$
does not determine $X(t)$.
\para Nonanticipating (adapted) functions:
The underlying outcome, which was called $\omega$, is now called $X$.
A function of the outcome, or function of a random variable, will
now be called $F(X)$ instead of $X(\omega)$.
Over and over in stochastic processes, we deal with functions that
depend on both $X$ and $t$.
Such a function will be called $F(X,t)$.
The simplest such function is $F(X,t) = X(t)$.
More complicated functions are: ({\em i}) $F(X,t) = 1$ if $X(s) = 1$
for some $s \leq t$, and $F(X,t) = 0$ otherwise, and ({\em ii})
$F(X,t) = \min\left\{ s > t : X(s) = 1 \right\}$, or $F(X,t) = T$ if
$X(s) \neq 1$ for $t < s \leq T$.
A function $F(X,t)$ is {\em nonanticipating} (also called {\em adapted},
though the notions are slightly different in more sophisticated situations)
if, for each $t$, the function of $X$ given by $F(X,t)$ is measurable
with respect to ${\cal F}_t$. This is the same as saying that $F(X,t)$
is determined by the values $X(s)$ for $s \leq t$. The function ({\em i})
above has this property but ({\em ii}) does not.
Nonanticipating functions are important for several reasons. In time, we
will see that the Ito integral makes sense only for nonanticipating
functions. Moreover, functions $F(X,t)$ are a model of decision making
under uncertainty. That $F$ is nonanticipating means that the decision
at time $t$ is made based on information available at time $t$ and does
not depend on future information.
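\para Numerical check of adaptedness:
The difference between the functions ({\em i}) and ({\em ii}) can be made
concrete: $F(X,t)$ is nonanticipating exactly when it takes equal values on
any two paths that agree up to time $t$. A sketch over a toy two state
chain (Python; the helper names are my own):

```python
from itertools import product

T = 4
paths = list(product([1, 2], repeat=T + 1))   # toy state space {1,2}, times 0..T

def F_i(X, t):
    """(i) Indicator that state 1 was visited by time t -- nonanticipating."""
    return 1 if any(X[s] == 1 for s in range(t + 1)) else 0

def F_ii(X, t):
    """(ii) First time after t that state 1 is hit (T if never) -- looks ahead."""
    for s in range(t + 1, T + 1):
        if X[s] == 1:
            return s
    return T

def is_adapted(F):
    """True iff F(X,t) agrees on all paths sharing the prefix X[0..t]."""
    for t in range(T + 1):
        seen = {}
        for X in paths:
            prefix = X[:t + 1]
            if prefix in seen and seen[prefix] != F(X, t):
                return False
            seen[prefix] = F(X, t)
    return True

print(is_adapted(F_i), is_adapted(F_ii))   # True False
```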
\para Markov property:
Informally, the {\em Markov property} is that $X(t)$ is all the information
about the past that is helpful in predicting the future. In classical
terms, for example,
$$
P(X(t+1) = k| X(t) = j ) = P(X(t+1) = k | X(t) = j, X(t-1) = l,
\mbox{etc.}) \; .
$$
In modern notation, this may be stated
\begin{equation}
P(X(t+1) = k \mid {\cal F}_t) = P(X(t+1) = k \mid {\cal G}_t ) \; .
\label{MarkProp} \end{equation}
Recall that both sides are functions of the outcome, $X$.
The function on the right
side, to be measurable with respect to ${\cal G}_t$, must be a function
of $X(t)$ only (see ``Generating by a function'' in the previous section).
The left side also is a function, but in general could depend on all
the values $X(s)$ for $s\leq t$. The equality (\ref{MarkProp})
states that this function depends on $X(t)$ only.
This may be interpreted as the absence of hidden variables, variables
that influence the evolution of the Markov chain but are not observable
or included in the state description.
If there
were hidden variables, observing the chain for a long period might
help identify them and therefore change our prediction of the future
state. The Markov property (\ref{MarkProp}) states, on the contrary,
that observing $X(s)$ for $s