Learning faster from easy data

NIPS 2013 Workshop, Lake Tahoe, Nevada, United States
Tuesday December 10, 2013
Harrah's Tahoe A

Overview

Most existing theory in both online and statistical learning is centered around a worst-case analysis. For instance, in online learning data are assumed to be generated by an adversary and the goal is to minimize regret. In statistical learning the majority of theoretical results consider risk bounds for the worst-case i.i.d. data generating distribution. In both cases the worst case convergence rates (for regret/n and risk) for 0/1-type and absolute loss functions are O(1/sqrt{n}). Yet in practice simple heuristics like Follow-the-Leader (FTL) often empirically exhibit faster rates.

It has long been known that under Vovk's (1990) mixability condition on the loss function, faster rates are possible. Even without mixability or the closely related exp-concavity (Cesa-Bianchi and Lugosi 2006), in the statistical setting there exist conditions on the distribution under which faster learning rates can be obtained; the main example being Tsybakov's (2004) margin condition, which was recently shown to be intimately connected to mixability (Van Erven et al., 2012).

In practice, even if the loss is not mixable and no distributional assumptions apply, the data are nevertheless often easy enough to allow accelerated learning. Initial promising steps in this direction have been made recently, including parameterless algorithms that combine worst-case O(1/sqrt{n}) regret guarantees for the adversarial setting with

fast rates in the stochastic bandit setting (Bubeck and Slivkins, COLT 2012)
exploitation of observably sub-adversarial data (Rakhlin, Shamir and Sridharan, AISTATS 2013)
learning as fast as FTL whenever FTL works well (De Rooij, Van Erven, Grünwald and Koolen, JMLR 2013)

It remains a huge challenge however, to characterize the types of data for which faster learning is possible, to define `easy data' in a generic way, let alone to design algorithms that automatically adapt to exploit it.

Goals of the workshop

The aim of this day-long workshop is threefold

to map, by means of a series of invited talks and poster presentations, the existing landscape of "easiness criteria" in relation to the efficiency of their corresponding algorithms,
to identify, by means of a panel discussion led by the organizers, obstacles and promising directions,
and through interaction foster partnerships for future research.

Discussion will be centered around the so-far elusive concept of easy data. Can the existing characterizations based on variances, mixability gaps, FTL etc. be brought under a common umbrella? Can ideas and approaches from statistical learning theory be transported to online learning (and vice versa)?

Call for papers

We invite the submission of abstracts to the workshop. Abstracts should be at most 4 pages long (not including references) in PDF format following the NIPS 2013 style. Submissions should not be anonymous. The organizers will review all submissions. Selected abstracts will be presented as ~~contributed talks~~ posters during the workshop. Please send contributions by email to wmkoolen@gmail.com.

Submission of previously published work or work under review is allowed, in particular NIPS-2013 submissions. However, preference will be given to novel work or work that was not yet presented elsewhere (for example, recent journal publications or NIPS posters). All double submissions must be clearly declared as such!

Important dates

Submission deadline: October ~~9th~~ 23rd, 2013.
Notification of acceptance: October ~~23rd~~ 30th, 2013.
Workshop: Tuesday December 10, 2013.

Invited speakers

Nathan Srebro, Toyota Technological Institute at Chicago
Optimistic Rates [slides]
7:55-8:40

Worst-case sample complexity bounds generally scale quadratically with the excess error. However, when the target error is small, the dependence on the excess error is more similar to a linear dependence rather than quadratic. In this talk I will discuss when and how such optimistic rates are possible, in particular in the non-parametric scale-sensitive case, and when they are not possible, argue that the "optimistic" regime better captures the relevant regime to learning, and show examples of how an analysis based on such optimistic rates is necessary in order to understand various learning phenomena.

Alekh Agarwal, Microsoft Research New York
Selective sampling algorithms for cost-sensitive multiclass prediction [slides]
9:30-10:00

In this talk, we study the problem of active learning for cost-sensitive multiclass classification. We propose selective sampling algorithms, which process the data in a streaming fashion, querying only a subset of the labels. For these algorithms, we analyze the regret and label complexity when the labels are generated according to a generalized linear model. We establish that the gains of active learning over passive learning can range from none to exponentially large, based on a natural notion of margin. We also present a safety guarantee to guard against model mismatch. Numerical simulations show that our algorithms indeed obtain a low regret with a small number of queries.

Karthik Sridharan, University of Pennsylvania
Localization and Adaptation in Online Learning Through Relaxations
10:00-10:30

The traditional worst-case analysis for online learning problems is often too pessimistic for real world applications. We would like to design adaptive online learning algorithms that enjoy much better (faster rates) regret bounds against "nicer" data sequences while still preserving the worst-case bounds against the worst case data sequences. While in previous works such algorithms have been designed for specific problems, in this talk I shall describe a generic methodology for designing adaptive algorithms for general online learning problems. Specifically I shall introduce the idea of adaptive relaxation and the concept of localization in online learning. Using these concepts I shall provide a general recipe for designing adaptive online learning algorithms for problems. Through examples I shall illustrate the utility of the introduced concepts on several problems. These examples include new adaptive online learning algorithms against iid adversaries and algorithms that can adapt to data geometry.

Tim van Erven, University Paris-Sud
Follow the leader if you can, Hedge if you must [slides]
15:30-16:15

In sequential prediction with expert advice, the intuitive Follow-the-Leader (FTL) algorithm predicts like the expert that has performed best in the past. Although this algorithm performs very well on stochastic data or in other `easy' cases where there are few leader changes, it is known to fail dramatically when the data are adversarial. On the other hand, other algorithms (like exponential weights with conservative parameter tuning) perform well on adversarial data, but learn much slower than FTL when we are indeed lucky enough to encounter the 'easy' data for which FTL performs well. This observation raises the question of whether it is possible to get the best of both worlds: is there a single algorithm that is robust to adversarial data, but exploits `easy' data to learn faster when possible?

I will discuss how previous methods that satisfy so-called second-order bounds get us part of the way: they work both for adversarial and for stochastic data, but do not exploit other `easy' data for which FTL works well. Then I will present a new method, called FlipFlop, which satisfies the same second-order bounds as previous methods, but in addition is provably as good as FTL on any data.

This is joint work with Steven de Rooij, Peter Grünwald and Wouter Koolen. The paper "Follow the Leader If You Can, Hedge If You Must" is available from www.timvanerven.nl/publications.

Sébastien Bubeck, Princeton University
Fast rates for the multi-armed bandit [slide]
16:15-17:00

Since the seminal work of Lai and Robbins (1985) we know bandit strategies with normalized regret of order (i) 1/sqrt(T) for any stochastic bandit, and (ii) log(T) / T for 'benign' distributions. In Bubeck and Slivkins (2012) we designed a new strategy which extends property (i) to adversarial bandits while still having the fast rate given in (ii). I will present this algorithm and I will also discuss the possibility of even faster rates of order 1/T when extra information is available.

Posters

Maria Florina Balcan and Yingyu Liang, Georgia Institute of Technology
Clustering Perturbation Resilient k-Median Instances [paper] [slides]

Kamalika Chaudhuri and Chicheng Zhang, University of California San Diego
Improved Algorithms for Confidence-Rated Prediction with Error Guarantees [paper] [slides]

Maria-Florina Balcan and Christopher Berlind, Georgia Institute of Technology
On Learning Linear Separators with Large L_∞L₁ Margins

Ruth Urner and Shai Ben-David, University of Waterloo
Probabilistic Lipschitzness: A niceness assumption for deterministic labels [paper] [slides]

Schedule

From	To	What
7.30	7.55	Overview by organizers [slides]
7.55	8.40	Invited talk: Nathan Srebro [slides]
8.40	9.00	Poster spotlights
9.00	9.30	COFFEE BREAK
9.30	10.00	Invited talk: Alekh Agarwal [slides]
10.00	10.30	Invited talk: Karthik Sridharan
10.30	15.30	BREAK
15.30	16.15	Invited talk: Tim van Erven [slides]
16.15	17.00	Invited talk: Sébastien Bubeck [slide]
17.00	17.30	COFFEE BREAK
17.30	18.00	Poster session
18.00	18.30	Panel discussion

Recording

The workshop is recorded courtesy of Microsoft.

Organizers

Peter Grünwald, Centrum Wiskunde & Informatica and Leiden University

Wouter M. Koolen, Queensland University of Technology (formerly at CWI)

Alexander Rakhlin, University of Pennsylvania