Wednesday, September 26, 2018, 3:30pm, WWH 1314

Studying the approximation-theoretic properties of neural networks with smooth activation functions is a classical topic. The networks used in practice, however, most often use the non-smooth ReLU activation function. Despite the remarkable recent performance of such networks in many classification tasks, a solid theoretical explanation of this success is still missing.

In this talk, we will present recent results on the approximation-theoretic properties of deep ReLU neural networks that help to explain some characteristics of such networks. In particular, we will see that deeper networks can approximate certain classification functions much more efficiently than shallow networks, which is not the case for most smooth activation functions. We emphasize, though, that these approximation-theoretic properties do not explain why simple algorithms like stochastic gradient descent work so well in practice, or why deep neural networks tend to generalize so well; we focus purely on the expressive power of such networks.
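
As a toy illustration of how depth can buy expressive power for ReLU networks, the following sketch composes a two-neuron "hat" layer with itself. This is a classical one-dimensional example and is not the construction presented in the talk; the function names `hat` and `sawtooth` are chosen here for illustration only. Composing the hat k times gives a sawtooth with 2^k linear pieces using only 2k hidden neurons, whereas a ReLU network with one hidden layer of N neurons on the real line has at most N + 1 linear pieces.

```python
def relu(x):
    return max(0.0, x)


def hat(x):
    """One hidden layer with two ReLU neurons.

    On [0, 1] this is the hat function: 0 at x = 0, 1 at x = 1/2, 0 at x = 1.
    """
    return 2.0 * relu(x) - 4.0 * relu(x - 0.5)


def sawtooth(x, k):
    """Compose the hat layer k times (a depth-k network with 2k hidden neurons).

    On [0, 1] the result oscillates between 0 and 1 and has 2**k linear pieces.
    """
    for _ in range(k):
        x = hat(x)
    return x


if __name__ == "__main__":
    k = 3
    # Sample at the breakpoints j / 2**k, where the sawtooth alternates 0, 1, 0, 1, ...
    for j in range(2**k + 1):
        print(j / 2**k, sawtooth(j / 2**k, k))
```

The number of linear pieces doubles with every additional layer, so a shallow network matching this function exactly would need a number of neurons that grows exponentially in k.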

As a model class for classifier functions, we consider the class of (possibly discontinuous) piecewise smooth functions for which the different "smooth regions" are separated by smooth hypersurfaces. Given such a function and a desired approximation accuracy, we construct a neural network that achieves this accuracy, where the error is measured in L^p. We give precise bounds on the required size (in terms of the number of weights) and depth of the network, depending on the approximation accuracy, on the smoothness parameters of the given function, and on the dimension of its domain. Finally, we show that this network size is optimal, and that networks of smaller depth would need significantly more weights than the deep networks that we construct in order to achieve the desired approximation accuracy.
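
To see why the error is naturally measured in L^p for discontinuous targets, consider the simplest case: a step function on [-1, 1] approximated by a two-neuron ReLU network. The sketch below is a one-dimensional toy example under our own assumptions, not the construction from the talk; the names `step_net`, `lp_error` and `lp_error_exact` are hypothetical. The network is continuous, so its uniform error against the jump never drops below 1/2, while its L^p error equals (eps / (p + 1))^(1/p) and tends to zero as the ramp width eps shrinks.

```python
def relu(x):
    return max(0.0, x)


def heaviside(x):
    """Target classifier: 0 for x <= 0, 1 for x > 0."""
    return 1.0 if x > 0 else 0.0


def step_net(x, eps):
    """Two-neuron ReLU network: 0 for x <= 0, 1 for x >= eps, linear ramp in between."""
    return relu(x / eps) - relu(x / eps - 1.0)


def lp_error(eps, p, n=20_000):
    """Midpoint-rule estimate of the L^p distance between the two on [-1, 1]."""
    h = 2.0 / n
    total = 0.0
    for i in range(n):
        x = -1.0 + (i + 0.5) * h
        total += abs(heaviside(x) - step_net(x, eps)) ** p * h
    return total ** (1.0 / p)


def lp_error_exact(eps, p):
    """Closed form: the p-th power of the error is the integral of (1 - x/eps)^p over [0, eps]."""
    return (eps / (p + 1)) ** (1.0 / p)


if __name__ == "__main__":
    for eps in (0.1, 0.01):
        print(eps, lp_error(eps, p=2), lp_error_exact(eps, p=2))
```

Halving eps keeps the number of weights fixed and only makes them larger in magnitude; in higher dimensions, where the jump lies along a curved hypersurface, resolving the interface is what drives the size and depth requirements.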