Graduate Student / Postdoc Seminar

Next-Symbol Prediction Without Mixing: Optimal Rates, Algorithms, and Hardness

Speaker: Yanjun Han, Courant Institute and Center for Data Science, New York University

Location: Warren Weaver Hall 1302

Date: Friday, February 28, 2025, 1 p.m.

Synopsis:

Consider the following "ChatGPT-style" problem: having observed a time series \(X_1, \dots, X_n\), one is tasked with predicting the next (unseen) symbol \(X_{n+1}\). Mathematically, this boils down to estimating the conditional distribution \(P_{X_{n+1} \mid X_1, \dots, X_n}\), which informs downstream tasks such as sampling for text generation or estimating the most likely realizations for autocomplete. This is a well-defined but non-standard statistical problem, in that the estimand is random and data-dependent, unless the data are i.i.d., in which case the problem reduces to density estimation. For dependent data, most of the existing literature focuses on parameter estimation, which relies on favorable mixing-time (spectral-gap) conditions that are in fact not needed for prediction. (Indeed, a chain that moves at glacial speed is easy to predict, but estimating its transition probabilities is impossible.) This observation allows for an assumption-free framework for studying next-symbol prediction, but it also brings new technical challenges due to the lack of concentration for sample-path statistics.
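As an illustrative example (a standard add-constant predictor in the spirit of universal compression, not necessarily the estimator analyzed in the talk), for a first-order Markov chain on a \(k\)-symbol alphabet one may predict the next symbol by smoothing the observed transition counts:

\[
\widehat{P}\bigl(X_{n+1} = j \mid X_1, \dots, X_n\bigr) \;=\; \frac{N_n(X_n, j) + 1/2}{\sum_{j'=1}^{k} \bigl(N_n(X_n, j') + 1/2\bigr)},
\]

where \(N_n(i, j)\) is the number of transitions from state \(i\) to state \(j\) observed along \(X_1, \dots, X_n\). Such a predictor requires no mixing assumptions and remains well defined even when some states are rarely visited.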

In this talk, we introduce information-theoretic techniques, in particular those rooted in universal compression, to determine the optimal rate of the prediction risk for Markov chains, achieved by computationally efficient algorithms. Extensions to models with long-term memory, including hidden Markov models, and the associated computational challenges will also be discussed. This is based on joint work with Soham Jana (Notre Dame), Tianze Jiang (Princeton), and Yihong Wu (Yale).