Universal Loss and Gaussian Learning Bounds
In this talk I consider two fundamental problems in statistical learning theory: how to choose a loss function, and how to address non-linear problems with linear tools.
A loss function quantifies the difference between the true values and the estimated fits for a given instance of data. Different loss functions have different merits, and the choice of a "correct" loss can be debatable. Here, I show that for binary classification problems, the Bernoulli log-likelihood loss (log-loss) is universal with respect to practical alternatives. In other words, by minimizing the log-loss we minimize an upper bound on any smooth, convex and unbiased binary loss function. This property justifies the broad use of the log-loss in regression, decision trees, and other domains. In addition, it provides important universality guarantees for many applications, including data clustering, sample complexity analysis and divergence inequalities.
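As a toy numerical illustration of the upper-bound property (not the general proof), the sketch below checks pointwise that the per-sample log-loss dominates one familiar smooth, convex alternative, the squared (Brier) loss, for a positive label; the grid and variable names are my own:

```python
import numpy as np

# Predicted probabilities of the positive class on a grid in (0, 1).
p = np.linspace(0.01, 0.999, 500)

log_loss = -np.log(p)        # Bernoulli log-likelihood loss for y = 1
brier = (1.0 - p) ** 2       # squared (Brier) loss for y = 1

# Pointwise: -log p >= 1 - p >= (1 - p)^2 on (0, 1], so the log-loss
# upper-bounds the squared loss at every grid point.
assert np.all(log_loss >= brier)
print("log-loss dominates the squared loss on the grid")
```

Since minimizing an upper bound controls the bounded quantity, driving the log-loss down controls the squared loss as well, which is the spirit of the universality claim.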
Second, I address a Gaussian representation problem which utilizes the log-loss. In this problem we look for an embedding of arbitrary data that maximizes its "Gaussian part". This embedding provides an efficient (and practical) representation, as it allows us to exploit the favorable properties of a Gaussian distribution, which depends only on second-order statistics. I introduce different methods and show that the optimal Gaussian embedding is governed by the non-linear canonical correlations of the data. This result establishes a fundamental limit on our ability to Gaussianize arbitrary data sets and to solve complex problems, such as deep neural networks, with linear means.
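To make the role of canonical correlations concrete, here is a minimal sketch of the linear special case: classical CCA computed as the singular values of the cross-covariance between two whitened views. The synthetic two-view data, the QR-based whitening, and all names are illustrative assumptions, not the talk's actual construction:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000
z = rng.standard_normal((n, 1))  # shared latent variable

# Two views: first coordinate carries z plus noise, second is pure noise.
X = np.hstack([z + 0.1 * rng.standard_normal((n, 1)),
               rng.standard_normal((n, 1))])
Y = np.hstack([z + 0.1 * rng.standard_normal((n, 1)),
               rng.standard_normal((n, 1))])

def cca_correlations(X, Y):
    """Linear canonical correlations of two centered views via QR + SVD."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    qx, _ = np.linalg.qr(X)   # orthonormal basis = whitened view
    qy, _ = np.linalg.qr(Y)
    # Singular values of the whitened cross-covariance are the correlations.
    return np.linalg.svd(qx.T @ qy, compute_uv=False)

rho = cca_correlations(X, Y)
# The shared latent z yields one canonical correlation near 1; the
# independent noise coordinates yield one near 0.
print(rho)
```

In this linear setting the leading correlation measures how much of the two views lives in a common (here, Gaussian) direction; the non-linear canonical correlations mentioned above play the analogous role for the general embedding problem.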