\documentclass{article}
\usepackage{ifthen}
\begin{document}
\newcounter{OldSection}
\newcounter{ParCount}
\newcommand{\para}{%
  \vspace{.4cm}%
  \ifthenelse{ \value{OldSection} < \value{section} }%
    { \setcounter{OldSection}{ \value{section} }%
      \setcounter{ParCount}{ 0 } }%
    {}%
  \stepcounter{ParCount}%
  \noindent%
  \textbf{\arabic{section}.\arabic{ParCount}.}\hspace{.2cm}%
}
\Large \begin{center}
Stochastic Calculus Notes, Lecture 4 \\
\normalsize
Last modified \today
\end{center} \normalsize
\section{Continuous probability}
\para Introduction:
Recall that a set $\Omega$ is {\em discrete} if it is finite or countable.
We will call a set {\em continuous} if it is not discrete.
Many of the probability spaces used in stochastic calculus are continuous
in this sense (examples below).
Kolmogorov\footnote{The Russian mathematician Kolmogorov was active in
the middle of the $20^{th}$ century.
Among his many lasting contributions to mathematics are the modern axioms of
probability and some of its most important theorems.
His theories of turbulent fluid flow anticipated modern fractals by
several decades.}
suggested a general framework for continuous probability based on abstract
integration with respect to abstract probability measures.
The theory makes it possible to discuss general constructions such as
conditional expectation in a way that applies to a remarkably diverse set of
examples.
The difference between continuous and discrete probability is the difference
between integration and summation. Continuous probability cannot be based on
the formula
\begin{equation}
P(A) = \sum_{\omega \in A} P(\omega)\; .
\label{probSum} \end{equation}
Indeed, the typical situation in continuous probability is that any event
consisting of a single outcome has probability zero:
$P(\left\{\omega\right\}) = 0$ for all $\omega \in \Omega$.
As we explain below, the classical formalism of probability densities
also does not apply in many of the situations we are interested in.
Abstract probability measures give a framework for working with
probability in path space, as well as more traditional discrete probability
and probabilities given by densities on $R^n$.
These notes outline Kolmogorov's formalism of probability measures
for continuous probability.
We leave out a great number of details and mathematical proofs.
Attention to all these details would be impossible within our time
constraints.
In some cases we indicate where a precise definition or a complete proof
is missing, but sometimes we just leave it out.
If it seems like something is missing, it probably is.
\para Examples of continuous probability spaces:
By definition, a {\em probability space} is a set, $\Omega$, of possible
outcomes, together with a $\sigma-$algebra, $\cal F$, of measurable events.
This section discusses only the sets $\Omega$.
The corresponding algebras are discussed below.
\begin{description}
\item $R$, the real numbers. If $x_0$ is a real number and $u(x)$ is a
probability density, then the probability of the event
$B_r(x_0) = \left\{x_0 - r \leq X \leq x_0 + r\right\}$
is
$$
P([x_0 - r,x_0+r]) =
\int_{x_0 - r}^{x_0 + r} u(x) dx
\rightarrow 0 \;\; \mbox{as $r \rightarrow 0$.}
$$
Thus the probability of any individual outcome is zero.
An event with positive probability ($P(A)>0$) is made up entirely of
outcomes $x_0 \in A$, each with $P(\left\{x_0\right\})=0$.
Because of countable additivity (see below), this is only possible
when $\Omega$ is uncountable.
\item $R^n$, sequences of $n$ numbers (possibly viewed as a row or
column vector depending on the context): $X=(X_1\ldots,X_n)$.
Here too if there is a probability density then the probability
of any given outcome is zero.
\item ${\cal S}^{\cal N}$. Let $\cal S$ be the discrete state space of
a Markov chain.
The space ${\cal S}^T$ is the set of sequences of length $T$ of elements
of $\cal S$.
An element of ${\cal S}^T$ may be written $x = (x(0), x(1), \cdots, x(T-1))$,
with each of the $x(t)$ in $\cal S$.
It is common to write $x_t$ for $x(t)$.
An element of ${\cal S}^{\cal N}$ is an infinite sequence of elements of
$\cal S$.
The ``exponent'' $\cal N$ stands for ``natural numbers''.
We misuse this notation slightly because our sequences start with $t=0$
while the natural numbers are usually taken to start with $t=1$.
We use ${\cal S}^{\cal N}$ when we ask questions about an entire infinite
trajectory.
For example, the probability of never hitting state 1 is
$P(X(t) \neq 1\mbox{ for all } t \geq 0)$.
Cantor proved that ${\cal S}^{\cal N}$ is not countable whenever
the state space has more than one element.
Generally, the probability of any particular infinite sequence is zero.
For example, suppose the transition matrix has $P_{11} = .6$ and
$u_0(1) = 1$. Let $x$ be the infinite sequence that never leaves
state 1: $x = (1,1,1,\cdots)$.
Then $P(x) = u_0(1) \cdot .6 \cdot .6 \cdots$.
Multiplying together an infinite number of $.6$ factors should give
the answer $P(x) = 0$.
More generally, if the transition matrix has $P_{jk} \leq r < 1$ for all
$(j,k)$, then $P(x) = 0$ for any single infinite path.
\item $C([0,T]\rightarrow R)$, the path space for Brownian motion. The
$C$ stands for ``continuous''. The $[0,T]$ is the time interval
$0 \leq t \leq T$; the square brackets tell us to include the endpoints
(0 and $T$ in this case). Round parentheses $(0,T)$ would mean to
leave out 0 and $T$.
The final $R$ is the ``target'' space, the real numbers in this case.
An element of $\Omega$ is a continuous function from the interval
$[0,T]$ to $R$.
This function could be called $X(t)$ or $X_t$ (for $0 \leq t \leq T$).
In this space we can ask questions such as
$P(\int_0^T X(t) dt > 4)$.
\end{description}
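The geometric decay of single-path probabilities in the Markov chain example
above can be checked directly. A minimal sketch (the function name is mine),
assuming $P_{11} = .6$ and $u_0(1) = 1$ as in the text:

```python
# Sketch (not from the notes' code): with P_11 = 0.6 and u_0(1) = 1, the
# probability that the chain stays in state 1 for its first T steps is
# 0.6**T.  These numbers illustrate why any single infinite path has
# probability zero.
def stay_prob(T, p11=0.6, u0=1.0):
    """Probability of the finite event {X(0) = X(1) = ... = X(T) = 1}."""
    return u0 * p11 ** T

probs = [stay_prob(T) for T in (1, 10, 50)]
print(probs)  # a decreasing sequence tending to 0
```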
\para Probability measures:
Let $\cal F$ be a $\sigma-$algebra of subsets of $\Omega$.
A {\em probability measure} is a way to assign a probability to each
event $A \in \cal F$.
In discrete probability, this is done using (\ref{probSum}).
In $R^n$ a probability density leads to a probability measure by integration
\begin{equation}
P(A) = \int_A u(x) dx \; .
\label{probInt} \end{equation}
There are still other ways to specify probabilities of events in path space.
All of these probability measures satisfy the same basic axioms.
Suppose that for each $A\in \cal F$ we have a number $P(A)$.
The numbers $P(A)$ are a {\em probability measure} if
\begin{description}
\item[i.] If $A\in \cal F$ and $B \in \cal F$ are disjoint events, then
$P(A \cup B) = P(A) + P(B)$.
\item[ii.] $P(A)\geq 0$ for any event $A\in \cal F$.
\item[iii.] $P(\Omega) = 1$.
\item[iv.] If $A_n\in \cal F$ is a sequence of events each disjoint from all
the others and $\cup_{n=1}^{\infty}A_n = A$, then
$\sum_{n=1}^{\infty}P(A_n) = P(A)$.
\end{description}
The last property is called {\em countable additivity}.
It is possible to consider probability measures that are not countably
additive, but this is rarely useful.
\para Example 1, discrete probability:
If $\Omega$ is discrete, we may take $\cal F$ to be the set of all
events (i.e.\ all subsets of $\Omega$).
If we know the probabilities of each individual outcome, then the formula
(\ref{probSum}) defines a probability measure.
The axioms (i), (ii), and (iii) are clear.
The last, countable additivity, can be verified given a solid
undergraduate analysis course.
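As a small illustration (the outcome probabilities here are my own choices,
not from the notes), one can check axioms (i)--(iii) for the measure defined
by (\ref{probSum}) on a three point space:

```python
from fractions import Fraction

# Illustrative check: on a discrete Omega, P(A) = sum of P(omega) over
# omega in A defines a probability measure.  The outcome probabilities
# below are invented for the example.
p = {"a": Fraction(1, 2), "b": Fraction(1, 3), "c": Fraction(1, 6)}

def P(A):
    """P(A) by summing outcome probabilities, as in (probSum)."""
    return sum(p[w] for w in A)

Omega = set(p)
A, B = {"a"}, {"b", "c"}                 # disjoint events
assert P(A | B) == P(A) + P(B)           # axiom (i): additivity
assert all(P({w}) >= 0 for w in Omega)   # axiom (ii): nonnegativity
assert P(Omega) == 1                     # axiom (iii): total mass one
```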
\para Borel sets:
It is rare that one can define $P(A)$ for all $A\subseteq \Omega$.
Usually, there are {\em non measurable} events whose probability one does
not try to define (see below).
This is not related to partial information, but is an intrinsic
aspect of continuous probability.
Events that are not measurable are quite artificial, but they are
impossible to get rid of.
In most applications in stochastic calculus, it is convenient to take
the largest $\sigma-$algebra to be the
{\em Borel sets}.\footnote{The larger $\sigma-$algebra of {\em Lebesgue
sets} seems to be more of a nuisance than a help, particularly in discussing
convergence of probability measures in path space.}
In a previous lecture we discussed how to generate a $\sigma-$algebra
from a collection of sets.
The Borel algebra is the $\sigma-$algebra that is generated by all {\em balls}.
The {\em open ball} with center $x_0$ and radius $r>0$ in $n$ dimensional
space is $B_r(x_0) = \{x \mid |x-x_0| < r\}$.
\para A non measurable set:
Take $\Omega = [0,2\pi)$, regarded as a circle with the uniform probability
measure, so that probability is invariant under rotations:
$P(A+\theta) = P(A)$ for any event $A$ and any angle $\theta$
(addition mod $2\pi$).
We construct a set $B$ and a countable family of angles $\theta_n$ so that
the rotated copies $B+\theta_n$ are pairwise disjoint and together
cover $\Omega$.
By rotation invariance, all the numbers $P(B+\theta_n)$ are equal, so
either $P(B) = 0$ or $P(B) > 0$.
In the former case we would have
$P(\Omega) = \sum_n P(B+\theta_n) = \sum_n 0 = 0$, which is not what we want.
In the latter case, again using countable additivity, we would get
$P(\Omega) = \infty$.
Either way we contradict $P(\Omega) = 1$, so no value of $P(B)$ is
consistent with the axioms: $B$ is not measurable.
The construction of the set $B$ starts with a description of the $\theta_n$.
Write $n$ in base ten, flip over the decimal point to get a number between
0 and 1, then multiply by $2\pi$.
For example for $n=130$, we get $\theta_n = \theta_{130} = 2\pi\cdot .031$.
Now use the $\theta_n$ to create an equivalence relation and partition of
$\Omega$ by setting $x \sim y$ if $x = y + \theta_n$ (mod $2\pi$) for some $n$.
The reader should check that this is an equivalence relation
($x \sim y \rightarrow y \sim x$,
and $x \sim y$ and $y \sim z \rightarrow x \sim z$).
Now, let $B$ be a set that has exactly one representative from each
of the equivalence classes in the partition.
Any $x \in \Omega$ is in one of the equivalence classes, which means that
there is a $y \in B$ (the representative of the $x$ equivalence class)
and an $n$ so that $y + \theta_n = x$.
That means that any $x \in \Omega$ has $x \in B+\theta_n$ for some $n$, which
is to say that $\bigcup_n B+\theta_n = \Omega$.
To see that $B+\theta_k$ is disjoint from $B+\theta_n$ when $k \neq n$,
suppose that $x \in B+\theta_k$ and $x \in B+\theta_n$.
Then $x = y + \theta_k$ and $x = z + \theta_n$ for $y\in B$ and $z \in B$.
But (and this is the punch line) this would mean $y \sim z$, which is
impossible because $B$ has only one representative from each equivalence class.
The possibility of selecting a single element from each partition element
without having to say how it is to be done is the {\em axiom of choice}.
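The angles $\theta_n$ are easy to compute. A sketch of the base ten
flipping rule described above (the function name is mine):

```python
import math

# The angles theta_n from the construction above: write n in base ten,
# flip it over the decimal point to get a number between 0 and 1, then
# multiply by 2*pi.
def theta(n):
    """theta_n = 2*pi * (digits of n reversed, read after a decimal point)."""
    flipped = float("0." + str(n)[::-1])  # e.g. 130 -> "031" -> 0.031
    return 2 * math.pi * flipped

print(theta(130))  # = 2*pi*0.031, matching the example in the text
```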
\para Probability densities in $R^n$:
Suppose $u(x)$ is a probability density in $R^n$.
If $A$ is an event made from finitely many balls (or rectangles) by
set operations, we can define $P(A)$ by integrating, as in (\ref{probInt}).
This leads to a probability measure on Borel sets corresponding to the
density $u$.
Deriving the probability measure from a probability density does not
seem to work in path space because there is nothing like the
Riemann integral to use
in\footnote{The {\em Feynman integral} in path space
has some properties of true integrals but lacks others.
The probabilist Mark Kac (pronounced ``cats'') discovered that
Feynman's ideas applied to the heat equation rather than the Schr\"odinger
equation can be interpreted as integration with respect to Wiener measure.
This is now called the {\em Feynman--Kac formula}.} (\ref{probInt}).
Therefore, we describe path space probability measures directly
rather than through probability densities.
\para Measurable functions:
Let $\Omega$ be a probability space with a $\sigma-$algebra $\cal F$.
Let $f(\omega)$ be a function defined on $\Omega$.
In discrete probability, $f$ was measurable with respect to $\cal F$ if the
sets $B_a = \left\{\omega \mid f(\omega) = a\right\}$ all were measurable.
In continuous probability, this definition is replaced by the condition
that the sets $A_{ab} = \left\{\omega \mid a \leq f(\omega) \leq b \right\}$
are measurable.
Because $\cal F$ is closed under countable set operations, and because the
event $a < f$ is the (countable) union of the events $a+\frac{1}{n} \leq f$,
this is the same as requiring all the sets
$\widetilde{A}_{ab} = \left\{\omega \mid a < f(\omega) < b \right\}$
to be measurable.
If $\Omega$ is discrete (finite or countable), then the two definitions
of measurable function agree.
In continuous probability, the notion of measurability of a function with
respect to a $\sigma-$algebra plays two roles.
The first, which is purely technical, is that $f$ is sufficiently
``regular'' (meaning not crazy) that abstract integrals (defined below)
make sense for it.
The second, particularly for smaller algebras ${\cal G} \subset {\cal F}$,
again involves incomplete information.
A function that is measurable with respect to $\cal G$ not only needs to
be regular, but also must depend on fewer variables (possibly in some
abstract sense).
\para Integration with respect to a measure:
The definition of integration with respect to a general probability measure
is easier than the definition of the Riemann integral.
The integral is written
$$
E[f] = \int_{\omega \in \Omega} f(\omega) dP(\omega) \; .
$$
We will see that in $R^n$ with a density $u$, this agrees with the
classical definition
$$
E[f] = \int_{R^n} f(x) u(x) dx \; ,
$$
if we write $dP(x) = u(x) dx$.
Note that the abstract variable $\omega$ is replaced by the concrete
variable, $x$, in this more concrete situation. The general definition
is forced on us once we make the natural requirements
\begin{description}
\item[i.] If $A\in \cal F$ is any event, then $E[1_A] = P(A)$. The integral
of the indicator function of an event is the probability of that event.
\item[ii.] If $f_1$ and $f_2$ have $f_1(\omega)\leq f_2(\omega)$ for all
$\omega \in \Omega$, then $E[f_1] \leq E[f_2]$. ``Integration is monotone''.
\item[iii.] For any reasonable functions $f_1$ and $f_2$ (e.g.\ bounded),
we have $E[af_1+bf_2] = aE[f_1]+bE[f_2]$. ({\em Linearity} of integration).
\item[iv.] If $f_n(\omega)$ is an increasing family of positive functions
converging {\em pointwise} to $f$ ($f_n(\omega) \geq 0$ and
$f_{n+1}(\omega) \geq f_n(\omega)$ for all $n$,
and $f_n(\omega) \rightarrow f(\omega)$ as $n \rightarrow \infty$
for all $\omega$),
then $E[f_n] \rightarrow E[f]$ as $n \rightarrow \infty$.
(This form of countable additivity for abstract probability integrals
is called the {\em monotone convergence theorem}.)
\end{description}
A function is a {\em simple function} if there are finitely many events
$A_k$, and weights $w_k$, so that $f = \sum_k w_k 1_{A_k}$.
Properties (i) and (iii) imply that the expectation of a simple function is
$$
E[f] = \sum_k w_k P(A_k) \; .
$$
We can approximate general functions by simple functions to determine
their expectations.
Suppose $f$ is a nonnegative bounded function: $0 \leq f(\omega) \leq M$
for all $\omega \in \Omega$.
Choose a small number $\epsilon = 2^{-n}$ and define
the\footnote{Take $f = f(x,y) = x^2 + y^2$ in the plane to see why
we call them ring sets.}
``ring sets'' $A_k = \{(k-1)\epsilon \leq f < k\epsilon\}$.
The $A_k$ depend on $\epsilon$ but we do not indicate that.
Although the events $A_k$ might be complicated, fractal, or whatever,
each of them is measurable.
A simple function that approximates $f$ is
$f_n(\omega) = \sum_k (k-1)\epsilon 1_{A_k}$.
This $f_n$ takes the value $(k-1)\epsilon$ on the sets $A_k$.
The sum defining $f_n$ is finite because $f$ is bounded; the number
of terms is about $M/\epsilon$.
Also, $f_n(\omega) \leq f(\omega)$ for each $\omega \in \Omega$
(though by at most $\epsilon$).
Property (ii) implies that
$$
E[f] \geq E[f_n] = \sum_k (k-1)\epsilon P(A_k) \; .
$$
In the same way, we can consider the upper function
$g_n = \sum_k k\epsilon 1_{A_k}$ and have
$$
E[f] \leq E[g_n] = \sum_k k\epsilon P(A_k) \; .
$$
The reader can check that $f_n \leq f_{n+1} \leq f \leq g_{n+1} \leq g_n$
and that $g_n - f_n \leq \epsilon$.
Therefore, the numbers $E[f_n]$ form an increasing sequence and the
$E[g_n]$ a decreasing sequence, both converging to the same number, which
is the only possible value of $E[f]$ consistent with (i), (ii), and (iii).
It is sometimes said that the difference between classical (Riemann)
integration and abstract integration (here) is that the Riemann integral
cuts the $x$ axis into little pieces, while the abstarct integral cuts
the $y$ axis (which is what the simple function approximations amount to).
If the function $f$ is positive but not bounded, it might happen that
$E[f] = \infty$. The ``cut off'' functions,
$f_M(\omega) = \min(f(\omega),M)$, might have $E[f_M] \rightarrow \infty$
as $M \rightarrow \infty$.
If so, we say $E[f] = \infty$.
Otherwise, property (iv) implies that
$E[f] = \lim _{M \rightarrow \infty} E[f_M]$.
If $f$ is both positive and negative (for
different $\omega$), we integrate the positive part,
$f_+(\omega) = \max(f(\omega),0)$, and the negative part
$f_-(\omega) = \min(f(\omega),0)$ separately and subtract the results.
We do not attempt a definition if $E[f_+]=\infty$ and $E[f_-]=-\infty$.
We omit the long process of showing that these definitions lead to an
integral that actually has the properties (i) - (iv).
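The ring set construction can be carried out concretely. A sketch, using my
own example $f(\omega)=\omega^2$ with $\omega$ uniform on $[0,1]$, where
$E[f]=1/3$ and the sets $A_k$ are intervals whose probabilities are their
lengths:

```python
import math

# Numerical sketch of "cutting the y axis": f(omega) = omega**2 with omega
# uniform on [0,1], so E[f] = 1/3 and M = 1.  The ring set
# A_k = {(k-1)*eps <= f < k*eps} is the interval
# [sqrt((k-1)*eps), sqrt(k*eps)), whose probability is its length.
def lower_upper(n):
    eps = 2.0 ** (-n)
    M = 1.0
    lo = hi = 0.0
    k = 1
    while (k - 1) * eps < M:
        PAk = math.sqrt(min(k * eps, M)) - math.sqrt((k - 1) * eps)
        lo += (k - 1) * eps * PAk   # E[f_n], the lower simple function
        hi += k * eps * PAk         # E[g_n], the upper simple function
        k += 1
    return lo, hi

lo, hi = lower_upper(10)
print(lo, hi)  # both close to 1/3, and hi - lo is at most eps
```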
\para Markov chain probability measures on ${\cal S}^{\cal N}$:
Let $\cal A = \cup_{t \geq 0} {\cal F}_t$ as before.
The probability of any $A \in \cal A$ is given by the probability
of that event in ${\cal F}_t$ if $A\in {\cal F}_t$.
Therefore $P(A)$ is given by a formula like (\ref{probSum}) for
any $A \in \cal A$.
A theorem of Kolmogorov states that the {\em completion} of this measure
to all of $\cal F$ makes sense and is countably additive.
\para Conditional expectation:
We have a random variable $X(\omega)$ that is measurable with respect
to the $\sigma-$algebra, $\cal F$.
We also have a $\sigma-$algebra that is a sub-algebra: ${\cal G} \subset \cal F$.
We want to define the conditional expectation $Y = E[X\mid {\cal G}]$.
In discrete probability this is done using the partition defined by $\cal G$.
In continuous probability the partition is less useful, because it typically
is uncountable, and because each partition element,
$B(\omega) = \cap A$ (the intersection being over
all $A\in \cal G$ with $\omega \in A$), may have $P(B(\omega)) = 0$
(examples below).
This means that we cannot apply Bayes' rule directly.
The definition is that $Y(\omega)$ is the random
variable measurable with respect to $\cal G$ that best approximates $X$
in the least squares sense
$$
E[(Y-X)^2] = \min_{Z \in \cal G} E[(Z-X)^2] \; .
$$
This is one of the definitions we gave before, the one that works for
continuous and discrete probability. In the theory, it is possible to
show that there is a minimizer and that it is unique.
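On a discrete example (mine, chosen for illustration) one can verify that
the least squares minimizer is the average of $X$ over each partition block
of $\cal G$:

```python
# Discrete sketch: the G-measurable function minimizing E[(Z - X)^2] is
# the average of X over each partition block of G.  Outcomes, X, and the
# partition below are invented for the example.
Omega = [0, 1, 2, 3, 4, 5]            # six equally likely outcomes
X = {w: float(w * w) for w in Omega}  # a random variable X(omega)
blocks = [{0, 1, 2}, {3, 4, 5}]       # the partition defining G

def block_average(w):
    """Y(omega): average of X over the block containing omega."""
    B = next(b for b in blocks if w in b)
    return sum(X[v] for v in B) / len(B)

def mse(Z):
    """E[(Z - X)^2] under the uniform measure on Omega."""
    return sum((Z(w) - X[w]) ** 2 for w in Omega) / len(Omega)

best = mse(block_average)
# Any other G-measurable Z (constant c1 on one block, c2 on the other)
# does at least as badly:
for c1 in range(30):
    for c2 in range(30):
        assert best <= mse(lambda w, a=c1, b=c2: float(a) if w < 3 else float(b)) + 1e-9
```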
\para Generating a $\sigma-$algebra:
When the probability space, $\Omega$, is finite, we can understand an algebra
of sets by using the partition of $\Omega$ that generates the algebra. This
is not possible for continuous probability spaces. Another way to specify
an algebra for finite $\Omega$ was to give a function $X(\omega)$, or a
collection of functions $X_k(\omega)$ that are supposed to be measurable
with respect to $\cal F$. We noted that any function measurable with
respect to the algebra generated by functions $X_k$ is actually a function
of the $X_k$. That is, if $F\in \cal F$ (abuse of notation), then
there is some function $u(x_1,\ldots,x_n)$ so that
\begin{equation}
F(\omega) = u(X_1(\omega), \ldots,X_n(\omega)) \:.
\label{F} \end{equation}
The intuition was that $\cal F$ contains the information you get by knowing
the values of the functions $X_k$. Any function measurable with respect
to this algebra is determined by knowing the values of these functions,
which is precisely what (\ref{F}) says. This approach using functions is
often convenient in continuous probability.
If $\Omega$ is a continuous probability space, we may again specify
functions $X_k$ that we want to be measurable. Again, these functions
generate an algebra, a $\sigma-$algebra, $\cal F$. If $F$ is measurable
with respect to this algebra then there is a (Borel measurable) function
$u(x_1,\ldots)$ so that $F(\omega) = u(X_1, \ldots)$, as before. In fact,
it is possible to define $\cal F$ in this way. Saying that $A \in \cal F$
is the same as saying that ${\bf 1}_A$ is measurable with respect to $\cal F$.
If $u(x_1,\ldots)$ is a Borel measurable function that takes values only
$0$ or $1$, then the function $F$ defined by (\ref{F}) defines a function
that also takes only $0$ or $1$. The event $A = \{\omega \mid F(\omega)=1\}$
has (obviously) $F={\bf 1}_A$. The $\sigma-$algebra generated by the
$X_k$ is the set of events that may be defined in this way. A complete
proof of this would take a few pages.
\para Example in two dimensions: Suppose $\Omega$ is the unit square
in two dimensions: $(x,y) \in \Omega$ if $0 \leq x \leq 1$ and
$0 \leq y \leq 1$. The ``$x$ coordinate function'' is
$X(x,y) = x$. The information in this is the value of the $x$ coordinate,
but not the $y$ coordinate. An event measurable with respect to this
$\cal F$ will be any event determined by the $x$ coordinate alone.
I call such sets ``bar code'' sets. You can see why by drawing some.
\para Marginal density and total probability:
The abstract situation is that we have a probability space, $\Omega$
with generic outcome $\omega \in \Omega$. We have some functions
$(X_1(\omega),\ldots,X_n(\omega)) = X(\omega)$. With $\Omega$ in the
background, we can ask for the joint PDF of $(X_1,\ldots,X_n)$, written
$u(x_1,\ldots,x_n)$. A formal definition of $u$ would be that if
$A\subseteq R^n$, then
\begin{equation}
P(X(\omega) \in A) = \int_{x\in A} u(x)dx \; .
\label{P(A)} \end{equation}
Suppose we neglect the last variable, $X_n$, and consider the reduced
vector $\tilde{X}(\omega) = (X_1,\ldots,X_{n-1})$ with probability
density $\tilde{u}(x_1,\ldots,x_{n-1})$. This $\tilde{u}$ is the
``marginal density'' and is given by integrating $u$ over the forgotten
variable:
\begin{equation}
\tilde{u}(x_1,\ldots,x_{n-1})=\int_{-\infty}^{\infty}u(x_1,\ldots,x_n)dx_n \; .
\label{marg} \end{equation}
This is a continuous probability analogue of the law of total probability:
integrate (or sum) over a complete set of possibilities, all values of
$x_n$ in this case.
We can prove (\ref{marg}) from (\ref{P(A)}) by considering a set
$B\subseteq R^{n-1}$ and the corresponding set $A\subseteq R^n$
given by $A=B \times R$ (i.e.\ $A$ is the set of all pairs
$(\tilde{x},x_n)$ with $\tilde{x} = (x_1,\ldots,x_{n-1})\in B$).
The definition of $A$ from $B$ is designed so that
$P(X\in A) = P(\tilde{X} \in B)$. With this notation,
\begin{eqnarray*}
P(\tilde{X} \in B) & = & P(X\in A) \\
& = & \int_A u(x)dx \\
& = & \int_{\tilde{x}\in B} \int_{x_n=-\infty}^{\infty}
u(\tilde{x},x_n) dx_n d\tilde{x} \\
P(\tilde{X} \in B) & = & \int_B \tilde{u}(\tilde{x})d\tilde{x} \; .
\end{eqnarray*}
This is exactly what it means for $\tilde{u}$ to be the PDF for $\tilde{X}$.
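A numerical sketch of (\ref{marg}), using my own example of two independent
standard normals, where integrating out the second variable should recover
the one dimensional normal density:

```python
import math

# Sketch: for u(x, y) = phi(x)*phi(y) (independent standard normals),
# integrating out y as in the marginal density formula recovers phi(x).
def phi(t):
    return math.exp(-t * t / 2) / math.sqrt(2 * math.pi)

def u(x, y):
    return phi(x) * phi(y)

def marginal(x, h=0.01, L=8.0):
    """tilde_u(x) = integral of u(x, y) dy, by the trapezoid rule."""
    n = int(2 * L / h)
    ys = [-L + i * h for i in range(n + 1)]
    vals = [u(x, y) for y in ys]
    return h * (sum(vals) - 0.5 * (vals[0] + vals[-1]))

print(abs(marginal(0.7) - phi(0.7)))  # small integration error
```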
\para Classical conditional expectation:
Again in the abstract setting $\omega \in \Omega$, suppose we have
random variables $(X_1(\omega), \ldots,X_n(\omega))$. Now consider a function
$f(x_1,\ldots,x_n)$, its expected value $E[f(X)]$, and the conditional
expectations
$$
v(x_n) = E[f(X) \mid X_n = x_n ] \; .
$$
The Bayes' rule definition of $v(x_n)$ has some trouble because both
the denominator, $P(X_n = x_n)$, and the numerator,
$$
E[f(X) \cdot {\bf 1}_{X_n = x_n}] \; ,
$$
are zero.
The classical solution to this problem is to replace the exact condition
$X_n = x_n$ with an approximate condition having positive (though small)
probability: $x_n \leq X_n \leq x_n + \epsilon$. We use the approximation
$$
\int_{x_n}^{x_n + \epsilon} g(\tilde{x},\xi_n) d\xi_n
\approx \epsilon g(\tilde{x},x_n) \; .
$$
The error is roughly proportional to $\epsilon^2$, much smaller than
either of the terms above. With this approximation the numerator in
Bayes' rule is
\begin{eqnarray*}
E[f(X)\cdot {\bf 1}_{x_n \leq X_n \leq x_n + \epsilon}]
& = &
\int_{\tilde{x}\in R^{n-1}}\int_{\xi_n = x_n}^{\xi_n = x_n + \epsilon}
f(\tilde{x},\xi_n) u(\tilde{x},\xi_n) d\xi_n d\tilde{x} \\
& \approx &
\epsilon \int_{\tilde{x}} f(\tilde{x},x_n)u(\tilde{x},x_n) d\tilde{x} \; .
\end{eqnarray*}
Similarly, the denominator is
$$
P(x_n \leq X_n \leq x_n + \epsilon)
\approx \epsilon \int_{\tilde{x}} u(\tilde{x},x_n) d\tilde{x} \; .
$$
If we take the Bayes' rule quotient and let $\epsilon \rightarrow 0$,
we get the classical formula
\begin{equation}
E[f(X) \mid X_n = x_n ] =
\frac{\int_{\tilde{x}} f(\tilde{x},x_n)u(\tilde{x},x_n) d\tilde{x}}
{\int_{\tilde{x}} u(\tilde{x},x_n) d\tilde{x}} \;\; .
\label{BR} \end{equation}
By taking $f$ to be the characteristic function of an event (all possible
events) we get a formula for the probability density of $\tilde{X}$ given
that $X_n = x_n$, namely
\begin{equation}
\tilde{u}(\tilde{x} \mid X_n = x_n ) =
\frac{ u(\tilde{x},x_n)}
{\int_{\tilde{x}} u(\tilde{x},x_n) d\tilde{x}} \;\; .
\label{condu} \end{equation}
This is the classical formula for conditional probability density. The
integral in the denominator insures that, for each $x_n$, $\tilde{u}$
is a probability density as a function of $\tilde{x}$, that is
$$
\int \tilde{u}(\tilde{x} \mid X_n = x_n) d\tilde{x} = 1 \; ,
$$
for any value of $x_n$. It is very useful to notice that, as a function
of $\tilde{x}$, $u$ and $\tilde{u}$ are almost the same. They differ only
by a constant normalization. For example, this is why conditioning
Gaussians gives Gaussians.
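A numerical check of (\ref{condu}) on my own example, a bivariate normal
with correlation $\rho$, where conditioning on the second coordinate should
give a Gaussian with mean $\rho x_n$ and variance $1-\rho^2$:

```python
import math

# Sketch: for a standard bivariate normal with correlation rho, the
# conditional density formula gives a Gaussian with mean rho*y and
# variance 1 - rho**2.  The value rho = 0.5 is an arbitrary choice.
rho = 0.5

def u(x, y):
    """Standard bivariate normal density with correlation rho."""
    z = (x * x - 2 * rho * x * y + y * y) / (1 - rho * rho)
    return math.exp(-z / 2) / (2 * math.pi * math.sqrt(1 - rho * rho))

def cond_density(x, y, h=0.01, L=8.0):
    """u(x | Y = y) = u(x, y) / integral of u(x', y) dx'."""
    n = int(2 * L / h)
    xs = [-L + i * h for i in range(n + 1)]
    vals = [u(t, y) for t in xs]
    denom = h * (sum(vals) - 0.5 * (vals[0] + vals[-1]))
    return u(x, y) / denom

def gaussian(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

err = abs(cond_density(0.3, 1.2) - gaussian(0.3, rho * 1.2, 1 - rho * rho))
print(err)  # small
```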
\para Modern conditional expectation:
The classical conditional expectation (\ref{BR}) and conditional probability
(\ref{condu}) formulas are the same as what comes from the ``modern''
definition from paragraph 1.6. Suppose $X = (X_1,\ldots,X_n)$ has density
$u(x)$, $\cal F$ is the $\sigma-$algebra of Borel sets, and
$\cal G$ is the $\sigma-$algebra generated by $X_n$ (which might be
written $X_n(X)$, thinking of $X$ as $\omega$ in the abstract notation).
For any $f(x)$, we have $\tilde{f}(x_n) = E[f\mid {\cal G}]$. Since
$\cal G$ is generated by $X_n$, the function $\tilde{f}$ being measurable
with respect to $\cal G$ is the same as its being a function of $x_n$.
The modern definition of $\tilde{f}(x_n)$ is that it minimizes
\begin{equation}
\int_{R^n} \left(f(x) - \tilde{f}(x_n) \right)^2 u(x) dx \; ,
\label{minCond} \end{equation}
over all functions that depend only on $x_n$ (measurable in $\cal G$).
To see the formula (\ref{BR}) emerge, again write $x = (\tilde{x},x_n)$,
so that $f(x) = f(\tilde{x},x_n)$, and $u(x) = u(\tilde{x},x_n)$.
The integral (\ref{minCond}) is then
$$
\int_{x_n = -\infty}^{\infty} \int_{\tilde{x} \in R^{n-1}}
\left(f(\tilde{x},x_n) - \tilde{f}(x_n) \right)^2
u(\tilde{x},x_n) d\tilde{x} dx_n \; .
$$
In the inner integral:
$$
R(x_n) = \int_{\tilde{x} \in R^{n-1}}
\left(f(\tilde{x},x_n) - \tilde{f}(x_n) \right)^2
u(\tilde{x},x_n) d\tilde{x} \; ,
$$
$\tilde{f}(x_n)$ is just a constant. We find the value of $\tilde{f}(x_n)$
that minimizes $R(x_n)$ by minimizing the quantity
\begin{eqnarray*}
\lefteqn{ \int_{\tilde{x} \in R^{n-1}} \left(f(\tilde{x},x_n) - g \right)^2
u(\tilde{x},x_n) d\tilde{x} = } \\
& &\int f(\tilde{x},x_n)^2 u(\tilde{x},x_n) d\tilde{x}
- 2 g \int f(\tilde{x},x_n) u(\tilde{x},x_n) d\tilde{x}
+ g^2 \int u(\tilde{x},x_n) d\tilde{x} \; .
\end{eqnarray*}
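Explicitly, the right side is a quadratic polynomial in $g$, and setting
its derivative with respect to $g$ to zero gives
$$
-2\int f(\tilde{x},x_n)u(\tilde{x},x_n) d\tilde{x}
+ 2 g \int u(\tilde{x},x_n) d\tilde{x} = 0 \; ,
$$
that is,
$$
g = \frac{\int f(\tilde{x},x_n)u(\tilde{x},x_n) d\tilde{x}}
{\int u(\tilde{x},x_n) d\tilde{x}} \; .
$$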
The optimal $g$ is given by the classical formula (\ref{BR}).
\para Modern conditional probability:
We already saw that the modern approach to conditional probability
for ${\cal G} \subset \cal F$ is
through conditional expectation. In its most general form, for every
(or almost every)
$\omega\in \Omega$, there should be a probability measure $P_{\omega}$
on $\Omega$ so that the mapping $\omega \rightarrow P_{\omega}$ is
measurable with respect to $\cal G$. The measurability condition
means that for every event $A\in \cal F$ the function
$p_A(\omega) = P_{\omega}(A)$ is a $\cal G$ measurable function of $\omega$.
In terms
of these measures, the conditional expectation
$\tilde{f} = E[f\mid {\cal G}]$ would be
$\tilde{f}(\omega) = E_{\omega}[f]$. Here $E_{\omega}$ means the expected
value using the probability measure $P_{\omega}$. There are many
such subscripted expectations coming.
A subtle point here is that the conditional probability measures are defined
on the original probability space, $\Omega$. This forces the measures to
``live'' on tiny (generally measure zero) subsets of $\Omega$. For example,
if $\Omega = R^n$ and $\cal G$ is generated by $x_n$, then the conditional
expectation value $\tilde{f}(x_n)$ is an average of $f$ (using density
$u$) only over the hyperplane $X_n = x_n$. Thus, the conditional probability
measures $P_{\omega}$ depend only on $x_n$, leading us to write $P_{x_n}$.
Since $\tilde{f}(x_n) = \int f(x) dP_{x_n}(x)$, and $\tilde{f}(x_n)$
depends only on values of $f(\tilde{x},x_n)$ with the last coordinate
fixed, the measure $dP_{x_n}$ is some kind of $\delta$ measure on that
hyperplane. This point of view is useful in many advanced problems, but
we will not need it in this course (I sincerely hope).
\para Semimodern conditional probability:
Here is an intermediate ``semimodern'' version of conditional probability
density. We have $\Omega = R^n$, and $\tilde{\Omega} = R^{n-1}$ with
elements $\tilde{x} = (x_1,\ldots,x_{n-1})$. For each $x_n$, there will
be a (conditional) probability density function $\tilde{u}_{x_n}$.
Saying that $\tilde{u}$ depends only on $x_n$ is the same as saying
that the function $x \rightarrow \tilde{u}_{x_n}$ is measurable with respect
to $\cal G$. The conditional expectation formula (\ref{BR}) may be
written
$$
E[f\mid {\cal G}](x_n) = \int_{R^{n-1}}
f(\tilde{x},x_n) \tilde{u}_{x_n}(\tilde{x}) d\tilde{x} \; .
$$
In other words, the classical $u(\tilde{x}\mid X_n = x_n)$ of
(\ref{condu}) is the same as the semimodern $\tilde{u}_{x_n}(\tilde{x})$.
\section{Gaussian Random Variables}
The central limit theorem (CLT) makes Gaussian random variables important.
A generalization of the CLT is Donsker's ``invariance principle'' that
gives Brownian motion as a limit of random walk. In many ways Brownian
motion is a multivariate Gaussian random variable. We review multivariate
normal random variables and the corresponding linear algebra as a prelude
to Brownian motion.
\para Gaussian random variables, scalar:
The one dimensional ``standard normal'', or Gaussian, random variable
is a scalar with probability density
$$
u(x) = \frac{1}{\sqrt{2\pi}}e^{-x^2/2} \; .
$$
The normalization factor $\frac{1}{\sqrt{2\pi}}$ makes
$\int_{-\infty}^{\infty}u(x)dx = 1$ (a famous fact).
The mean value is $E[X] = 0$ (the integrand $xe^{-x^2/2}$ is antisymmetric
about $x=0$). The variance is (using integration by parts)
\begin{eqnarray*}
E[X^2] & = & \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} x^2 e^{-x^2/2} dx \\
& = & \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} x
\left( xe^{-x^2/2}\right) dx \\
& = & - \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} x
\left( \frac{d}{dx}e^{-x^2/2}\right) dx \\
& = & - \left. \frac{1}{\sqrt{2\pi}} \left( x e^{-x^2/2}\right)
\right|_{-\infty}^{\infty} +
\frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty}e^{-x^2/2} dx \\
& = & 0 + 1 \; .
\end{eqnarray*}
Similar calculations give $E[X^4] = 3$, $E[X^6] = 15$, and so on. I will
often write $Z$ for a standard normal random variable. A one dimensional
Gaussian random variable with mean $E[X]=\mu$ and variance
$\mbox{var}(X) = E[(X-\mu)^2] = \sigma^2$ has density
$$
u(x) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(x-\mu)^2}{2\sigma^2}} \; .
$$
It is often more convenient to think of $Z$ as the random variable (like
$\omega$) and write $X=\mu + \sigma Z$. We write
$X\sim {\cal N}(\mu,\sigma^2)$ to express the fact that $X$ is normal
(Gaussian) with mean $\mu$ and variance $\sigma^2$. The standard normal
random variable is $Z \sim {\cal N}(0,1)$.
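The moment calculations above can be checked numerically. A sketch (my own
check, not part of the notes) integrating $x^k u(x)$ by the trapezoid rule:

```python
import math

# Numerical check of the standard normal moments: E[X^2] = 1, E[X^4] = 3,
# E[X^6] = 15, computed by integrating x^k * phi(x) on a grid.
def phi(x):
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

def moment(k, h=0.01, L=10.0):
    """E[X^k] for X ~ N(0, 1), by the trapezoid rule on [-L, L]."""
    n = int(2 * L / h)
    xs = [-L + i * h for i in range(n + 1)]
    vals = [x ** k * phi(x) for x in xs]
    return h * (sum(vals) - 0.5 * (vals[0] + vals[-1]))

print(round(moment(2), 6), round(moment(4), 6))  # 1.0 3.0
```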
\para Multivariate normal random variables:
The $n \times n$ matrix, $H$, is positive definite if $x^*Hx>0$ for any
$n$ component column vector $x\neq 0$. It is symmetric if $H^*=H$.
A symmetric matrix is positive definite if and only if all its eigenvalues
are positive. Since the inverse of a symmetric matrix is symmetric, the
inverse of a symmetric positive definite (SPD) matrix is also SPD. An
$n$ component random variable is a mean zero multivariate normal if it
has a probability density of the form
$$
u(x) = \frac{1}{z} e^{-\frac{1}{2} x^*Hx} \; ,
$$
for some SPD matrix, $H$; here $z$ is the normalization constant that makes
$\int u(x)\,dx = 1$. We can get mean
$\mu = (\mu_1,\ldots,\mu_n)^*$
either by taking $X+\mu$ where $X$ has mean zero, or by using the
density with $x^*Hx$ replaced by $(x-\mu)^*H(x-\mu)$.
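The normalization constant can be made explicit:
$z = (2\pi)^{n/2}/\sqrt{\det H}$, a standard Gaussian integral.
A numerical sketch for one $2\times 2$ SPD matrix of my own choosing:

```python
import math

# Sketch: for an SPD matrix H, the normalization constant in the density
# above is z = (2*pi)^(n/2) / sqrt(det H).  We check this for one 2x2
# example by summing exp(-x^T H x / 2) on a grid.
H = [[2.0, 0.5], [0.5, 1.0]]                 # symmetric, positive definite
detH = H[0][0] * H[1][1] - H[0][1] * H[1][0]

def quad(x, y):
    """x^T H x for the vector (x, y)."""
    return H[0][0] * x * x + 2 * H[0][1] * x * y + H[1][1] * y * y

h, L = 0.02, 8.0
n = int(2 * L / h)
total = 0.0
for i in range(n + 1):
    for j in range(n + 1):
        x, y = -L + i * h, -L + j * h
        total += math.exp(-quad(x, y) / 2) * h * h

z = (2 * math.pi) / math.sqrt(detH)          # (2*pi)^(n/2)/sqrt(det H), n = 2
print(abs(total - z))  # small
```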
If $X\in R^n$ is multivariate normal and if $A$ is an $m \times n$
matrix with rank $m$, then $Y\in R^m$ given by $Y=AX$ is also multivariate
normal. Both the cases $m=n$ (same number of $X$ and $Y$ variables) and
$m