Nomenclature for stochastic processes and Bayesian nonparametric statistics

Stochastic processes are now a key part of mainstream Bayesian statistics and probabilistic machine learning. Yet many authors do not have a solid foundation in probability theory, and so many papers commit basic errors when attempting to talk precisely about stochastic processes. This article tackles a few key ideas and misconceptions. Ideas for new sections are welcome.

Nomenclature for stochastic processes

In probability theory, a stochastic process is an indexed collection of random variables defined on the same probability space. That is, a stochastic process is a collection \langle X_j \rangle_{j \in J} of random variables taking values in a (measurable) space S: a collection of \mathcal F/\mathcal S-measurable functions X_j : \Omega \to S, for  j \in J, where (\Omega,\mathcal F,P) is a probability space and (S,\mathcal S) is a measurable space. You can alternatively think of a stochastic process as a function X : J \times \Omega \to S, but, crucially, this function is not necessarily measurable itself: only the functions  \omega \mapsto X(j,\omega) are presumed to be measurable. Indeed, we have not even specified a \sigma-algebra on the index set J, and so we cannot even speak formally about the joint measurability of X. That said, the index set J often has some structure (say, the real line and its Borel structure), and we may want X to be measurable. In that case, we are interested in the existence, and construction, of a measurable version. But this is beyond the scope of this article. (See Shalizi's course notes [1] for a relatively gentle introduction.)
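
To make this concrete, here is a minimal sketch in Python (illustrative only; the function names and the use of a seed to stand in for \omega are my own conventions): a Gaussian random walk viewed as a collection of functions X_j : \Omega \to \mathbb R. A single \omega fixes all the randomness at the outset, and each X_j is just a deterministic evaluation at that \omega.

 import numpy as np

 def X(j, omega):
     # The j-th random variable, evaluated at the sample point omega.
     # Here omega is represented by a seed that determines ALL the randomness,
     # so X(j, .) and X(j+1, .) share the same first j+1 increments.
     rng = np.random.default_rng(omega)
     increments = rng.standard_normal(j + 1)  # increments for indices 0..j
     return increments.sum()                  # the random walk at index j

 omega = 12345                                # one point of the sample space
 path = [X(j, omega) for j in range(5)]       # the same omega indexes every X_j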

Confusion with meaning of "stochastic process" in English

A lot of confusion in machine learning comes from reading the term "stochastic process" according to its English meaning: a "process" (i.e., a series of actions or steps) that unfolds in a "stochastic" (i.e., randomly determined) way. Indeed, a Markov chain in discrete or continuous time is a stochastic process modeling a process unfolding in a random way. But when we move from stochastic processes indexed by time, i.e., J=\mathbb N or J=\mathbb R_+, to stochastic processes indexed by, say, the collection J = \mathcal B(\mathbb R) of all Borel measurable subsets of the real line, the intuitive English meaning becomes misleading.

Confusingly, many stochastic processes with more exotic index sets are defined in terms of stochastic processes indexed by J=\mathbb N. But, mathematically, an indexed collection of random variables is simply a collection of (measurable) functions, and so, while they may be defined recursively, or in terms of a stochastic process modeling a "process" unfolding in time, these functions simply exist at the outset: they do not appear individually out of thin air when some random event happens.

Random measures

A random measure on a measurable space (S,\mathcal S) is a stochastic process G taking values in (\mathbb R_+,\mathcal B(\mathbb R_+)), with index set \mathcal S, such that

 P \{ G(\emptyset) = 0 \} = 1

and, for every countable sequence B_1,B_2,\ldots \in \mathcal S of disjoint measurable sets,

 \textstyle P \{ \sum_i G(B_i) = G(\bigcup_i B_i) \} = 1,

where G(B) denotes G_B, as usual.

(A random probability measure also satisfies  P \{ G(S) = 1 \} = 1.)

One often demands more measurability of G: for example, enough to ensure that countable additivity holds for all countable collections of disjoint sets simultaneously. Alternatively, one can think of a random measure as a random element in the space of measures, where the \sigma-algebra is the one generated by the functions of the form \mu \mapsto \mu(B), for B \in \mathcal S. It is also typical to demand that there exist some measurable partition \langle B_i \rangle of S such that P \{ G(B_i) < \infty \} = 1 for every i.
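
As a concrete illustration, here is a rough Python sketch (the discrete representation, the names, and the use of a seed for \omega are assumptions for illustration, not a general construction): a random discrete measure G = \sum_i w_i \delta_{s_i} on the real line. Fixing \omega fixes the atoms and weights; each B \mapsto G(B) is then an ordinary function, and both G(\emptyset) = 0 and countable additivity over disjoint sets hold by construction.

 import numpy as np

 def random_measure(omega, n_atoms=50):
     rng = np.random.default_rng(omega)        # omega fixes atoms and weights
     locations = rng.standard_normal(n_atoms)  # atom locations s_i
     weights = rng.gamma(1.0, 1.0, n_atoms)    # nonnegative weights w_i
     def G(indicator):
         # G(B), with the set B passed in as its indicator function.
         return weights[indicator(locations)].sum()
     return G

 G = random_measure(omega=0)
 total = G(lambda s: np.ones_like(s, dtype=bool))  # G(R)
 parts = G(lambda s: s < 0) + G(lambda s: s >= 0)  # equals total, by additivity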

Priors versus processes

A very common misstatement found in machine learning papers is "The Dirichlet process is a distribution on the space of probability measures". Confusingly, this statement could be true, but it is probably false. Consider the standard setup:


\begin{align}
G &\sim \mathrm{DP}(\alpha G_0)  \\
X_n \mid G &\sim G \qquad \text{for } n\in \{1,2,\ldots\}.
\end{align}

(To be precise, we must say that the X_n are conditionally i.i.d. given G. Alternatively, we could have written, e.g., X_{n+1} \mid G, X_1, \ldots, X_n \sim G, which implies this statement.)

Let us assume that G_0 is a probability measure on the real line. The way to read the first statement is: "G is a Dirichlet process." Emphatically, G is NOT a "sample from a Dirichlet process". (The X_n would fit this description, though.) The confusion comes from the English meaning discussed above and the fact that one often first encounters Dirichlet processes by way of stick-breaking constructions, which are themselves stochastic processes indexed by \mathbb N.
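
To make the distinction concrete, here is a rough Python sketch using a truncated stick-breaking construction (the truncation level, the names, and the choice G_0 = N(0,1) are illustrative assumptions): the realized weights and atoms represent G itself, a (truncated) draw of the random probability measure, while the X_n are draws from G.

 import numpy as np

 rng = np.random.default_rng(0)
 alpha, K = 2.0, 1000                  # concentration; truncation level

 # One realization of G ~ DP(alpha * G_0), taking G_0 = N(0, 1):
 betas = rng.beta(1.0, alpha, K)       # stick-breaking proportions
 weights = betas * np.concatenate(([1.0], np.cumprod(1.0 - betas)[:-1]))
 atoms = rng.standard_normal(K)        # atom locations, i.i.d. from G_0

 # X_n | G ~ G: conditionally i.i.d. draws from the realized measure G.
 # (Weights are renormalized to absorb the truncation error.)
 X = rng.choice(atoms, size=10, p=weights / weights.sum())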

In the setting of a Bayesian statistics paper, the easiest way to fix the statement at the start of the section is to replace "Dirichlet process" with "Dirichlet process prior", although perhaps a better statement would pluralize "prior" to "priors" and "distribution" to "distributions". A Dirichlet process is a random probability measure. (Note that a vector in \mathbb R^n with a finite-dimensional Dirichlet distribution is also a Dirichlet process when the vector is viewed as a distribution on \{1,\ldots,n\}.)
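
The parenthetical remark is easy to see concretely (a small illustration using NumPy's finite-dimensional Dirichlet sampler):

 import numpy as np

 rng = np.random.default_rng(0)
 p = rng.dirichlet(alpha=[1.0, 2.0, 3.0])  # a random point in the simplex
 # Viewed as a measure on {1, 2, 3}, p is a random probability measure:
 G = lambda B: sum(p[i - 1] for i in B)    # B a subset of {1, 2, 3}
 assert abs(G({1, 2, 3}) - 1.0) < 1e-12    # G is a probability measure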

As I mentioned above, there is also the real possibility of ambiguity in the statement at the start of the section: a Dirichlet process could be a random probability measure on the space of probability measures. Let H_0 = \mathrm{DP}(\alpha G_0) be a Dirichlet process prior. Then


\begin{align}
H &\sim \mathrm{DP}(\alpha' H_0)
\end{align}

is a random probability measure on the space of probability measures. This actually appears in the literature.
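
A rough sketch of this two-level construction (the truncations, names, and nested stick-breaking representation are illustrative assumptions): each atom of H is itself an entire probability measure, namely a realization of \mathrm{DP}(\alpha G_0).

 import numpy as np

 rng = np.random.default_rng(1)

 def stick_breaking_weights(alpha, K, rng):
     # Weights of a truncated stick-breaking construction.
     betas = rng.beta(1.0, alpha, K)
     return betas * np.concatenate(([1.0], np.cumprod(1.0 - betas)[:-1]))

 K = 200
 outer_weights = stick_breaking_weights(1.0, K, rng)  # weights of H (alpha' = 1)
 # Each atom of H is a draw from H_0 = DP(alpha * G_0) -- itself a probability
 # measure, represented here by its own (weights, atom locations) pair.
 atoms_of_H = [(stick_breaking_weights(2.0, K, rng), rng.standard_normal(K))
               for _ in range(K)]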
