Graduate Student / Postdoc Seminar

Covariate-Corrected Biclustering Methods for Gene-Expression and GWAS Data

Speaker: Aaditya Rangan, Courant

Location: Warren Weaver Hall 1302

Date: Friday, October 13, 2017, 1 p.m.

Synopsis:

A common goal in data-analysis is to sift through a large matrix and detect any significant submatrices (i.e., biclusters) that have a low numerical rank. To give an example from genomics, one might imagine a data-matrix involving several genetic-measurements taken across many patients. In this context a 'bicluster' would correspond to a subset of genetic-measurents that are correlated across a subset of the patients. While some biclusters might extend across most (or all) of the patients, it is also possible for biclusters to involve only a small subset of patients. Detecting biclusters such as these provides a first step towards unraveling the physiological mechanisms underlying the heterogeneity within a patient population.

In this talk I'll describe a simple algorithm for tackling this biclustering problem - i.e., for detecting low-rank submatrices from within a larger data-matrix. The basic method itself is very straightforward (c.f. Rangan 2012), and involves examining the 'loops' (i.e., 2-x-2 submatrices) within the data-set. Importantly, this method can easily be modified to account for many considerations which commonly arise in practice. For example, by counting loops slightly differently, we can correct for controls: finding biclusters that manifest only within a 'case'-population without manifesting within a 'control'-population. Moreover, this methodology can also correct for categorical- and continuous-covariates, as well as sparsity within the data. I'll illustrate these practical features with two examples; the first drawn from gene-expression analysis and the second drawn from a much larger genome-wide-association-study (GWAS).

In addition to being quite practical, this loop-counting method exhibits a few interesting mathematical features, which I'll mention towards the second half of the talk. To begin with, I'll point out some of the well known phase-transitions for the biclustering problem (i.e., parameters which determine whether or not the problem is 'easy' or 'hard'). I'll also briefly explain why the loop-counting method is 'close to optimal' within a certain subset of local algorithms. Finally, I'll demonstrate that the loop-counting method actually outperforms what are called 'spectral-methods' (based on the singular-value-decomposition) near the computational phase-transition, and explain why this is the case.

Notes:

Pizza and drinks at 12:45 p.m.