The last decade has brought rapid advances in the performance of machine-learning systems: computers can now beat the best human players in Chess or Go, the seminal protein-folding problem has been solved computationally, and convolutional neural networks run on our smart phones. While this practical progress continues to happen and intelligent systems begin to appear at all places of society, it is especially important to understand how these systems work. Unfortunately, many ML systems are still comparably badly understood.
This lack of understanding has multiple bad effects. Unintended erratic behavior can sometimes not be excluded; costly and cumbersome manual tuning is often necessary; and engineering efforts can be directed at infeasible problems. Since modern ML stystems builds on techniques from multiple disciplines (statistics, optimization, numerical analysis, etc.), the interdisciplinary collaboration between expert is especially important to solve these challenges.
This workshop organized by the ELLIS Theory, Algorithms and Computations of Modern Learning Systems group is designed as a forum of exchange between some of Europe's leading researchers in the theory of ML. It aims to convey a comprehensive picture of ongoing researchers efforts to analyze ML systems and give orientation to researchers in the rapidly changing ML landscape.
The workshop agenda thus features talks from upcoming and established researches from diverse backgrounds (including Bayesian statistics, optimization, kernel methods, neural networks, etc.). In addition, there will be discussions and panels on the future of the ELLIS Theory, Algorithms and Computations of Modern Learning Systems group and on scientific topics of shared interest.
There will be ample time for brainstorming, networking and collaborating.
The 2022 ELLIS Theory Program Workshop is sponsored by ELISE, the European Network of AI Excellence Centres.
NOTE: The 2022 ELLIS Theory Program Workshop is an invitation-only event, and it is not possible to transfer this invitation to others. No fee is required.
The Workshop will take place from the 20th to the 22nd of June 2022. Activities will start on Monday the 20th afternoon and end, with a lunch, on Wednesday the 22nd.
The workshop will take place at Grand Hotel Arenzano, in Arenzano. The spaces provided by the Hotel follow the highest standards of safety precautions against Covid19, and will guarantee a safe and comfortable experience for all participants.
The booking of single rooms at a reduced price at Grand Hotel Arenzano has been handled via a custom link, which has now expired. Please contact the organizers if you still need to book a room, so that we can advise on the best solution. Options for reimbursement depend on each participant's affiliation and have been therefore shared individually.
Within the H2020 ELISE project, the Theory Program has funds to cover reasonable travel costs of the workshop participants.
Rules for reimbursements depend on the status of your university/company towards the ELISE project.
Centre de Recherche INRIA - SIERRA project-team
Departement d'Informatique de l'Ecole Normale Superieure
PSL Research University
University of Tübingen
Max Planck Institute for Intelligent Systems
Università di Genova
Istituto Italiano di Tecnologia
Massachusetts Institute of Technology
|12:30–14:00||Self-organized lunch - meet in the lobby at 12:30 if company desired|
|14:00–14:20||Welcome by Lorenzo Rosasco|
|14:20–16:20||Talks 1 chaired by Ingo Steinwart|
Sensor AI — Indoor localisation using the magnetic field
Sensor AI focuses on combining physics-based models with models learned from data in order to extract more information from the available sensor data. In this talk, I will exemplify this with our work on indoor localisation using the magnetic field. Indoor localisation is an active research area since GPS does not work well indoors as the signals are blocked by the building. We use the strongly position-dependent indoor magnetic field as a source of position information for indoor localisation. Here, we learn a map of the magnetic field using Gaussian processes and combine this with physics- based models for simultaneous localisation and mapping.
Neural tangent kernel analysis of shallow Stable ReLU neural networks
There is a recent and growing literature on large-width properties of Gaussian neural networks (NNs), namely NNs whose weighs are distributed or initialized according to Gaussian distributions. Two popular problems are: i) the study of the large-width asymptotic behaviour of NNs, which provided a characterization of the infinitely wide limit of a rescaled NN in terms of a Gaussian process; ii) the study of the the large-width training dynamics of NNs, which set forth an equivalence between the rescaled NN and a kernel regression with respect to a deterministic kernel referred to as the neural tangent kernel (NTK). In this paper, we consider these problems for Stable NNs, which generalize Gaussian NNs by assuming that the NN's weights are distributed as Stable distributions, i.e. distributions with heavy-tails or infinite variance. For shallow Stable NNs with a ReLU activation function, we show that if the NN's width goes to infinity then a suitable rescaled NN converges weakly to a Stable process, namely a stochastic process with Stable (finite-dimensional) marginal distributions. It turns out, as a novelty with respect to the Gaussian setting, that in the Stable setting the choice of the activation function affects the scaling of the NN, that is: to achieve the infinitely wide Stable process, the ReLU function requires an additional logarithmic scaling with respect to sub- linear functions. Then, our main contribution is the NTK analysis of shallow Stable ReLU NNs, which leads to an equivalence between the rescaled NN and a kernel regression with respect to a Stable random kernel. The randomness of such a kernel is a further novelty with respect to the Gaussian setting, that is: in the Stable setting, the randomness of the initialization does not vanish in the NTK analysis, thus determining a random initialization for the underlying kernel regression.
is interpolation benign for random forests?
Statistical wisdom suggests that very complex models, interpolating training data, will be poor at prediction on unseen examples. Yet, this aphorism has been recently challenged by the identification of benign overfitting regimes, specially studied in the case of parametric models: generalization capabilities may be preserved despite model high complexity. While it is widely known that fully-grown decision trees interpolate and, in turn, have bad predictive performances, the same behavior is yet to be analyzed for random forests. In this paper, we study the trade-off between interpolation and consistency for several types of random forest algorithms. Theoretically, we prove that interpolation regimes and consistency cannot be achieved for non-adaptive random forests. Since adaptivity seems to be the cornerstone to bring together interpolation and consistency, we introduce and study interpolating Adaptive Median Forests, which are proved to be consistent in a noiseless scenario. Numerical experiments show that Breiman's random forests are consistent while exactly interpolating, when no bootstrap step is involved. We theoretically control the size of the interpolation area, which converges fast enough to zero, so that exact interpolation and consistency occur in conjunction.
|16:20–16:45||setting up of posters, with aperitivo|
|16:45–18:30||poster session (PhD students and Postdocs)|
Felix Dangel, Andrea Della Vecchia, Raffaello Camoriano, Annabelle Carrell, Ella Tamir, Emilia Magnani, Francesco Montagna, Frederik Künstner, Gaspard Beugnot, Junchi Yang, Giulia Luise, Luc Motte, Marco Rando, Nicholas Krämer, Siyuan Guo, Antoine Chatalic, Antonio Ribeiro, Hans Kersting, Martin Trapp, Nicolas Schreuder, Pierre Laforgue
|9:30–12:40||Talks 2 chaired by Arno Solin|
Nonconvex Min-Max Optimization: fundamental limits, acceleration, and adaptivity
Min-max optimization plays a critical role in emerging machine learning applications from training GANs to robust reinforcement learning, and from adversarial training to fairness. In this talk, we discuss some recent results on min-max optimization algorithms with a special focus on nonconvex-strongly-concave problems, including their fundamental limits, acceleration, and adaptivity. We introduce the first accelerated algorithms that achieve near-optimal complexity bounds as well as a family of adaptive algorithms with parameter-agnostic adaptation.
The role of stochasticity in learning algorithms
It has been observed that noise induced by Stochastic Gradient Descent when training neural networks generally enhances generalisation performance in comparison to full-batch training. In this talk, we will try to understand how SGD noise biases the training dynamics towards specific prediction functions for regression tasks. More precisely, we will first show that the dynamics of SGD over diagonal linear networks converges towards a sparser linear estimator than the one retrieved by GD. Then, we will show that adding label noise similarly biases the dynamics towards implicitly solving a Lasso program. Our findings highlight the fact that structured noise can induce better generalisation and they help explain the greater performances of stochastic dynamics over deterministic ones, as observed in practice.
Tradeoffs between test error and robustness in different learning regimes for two-layer neural networks
Neural networks are known to be highly sensitive to adversarial examples. These may arise due to different factors, such as random initialization, or spurious correlations in the learning problem. To better understand these factors, we provide a precise study of the adversarial robustness in different scenarios, from initialization to the end of training in different regimes, as well as intermediate scenarios, where initialization still plays a role due to “lazy” training. We consider over- parameterized networks in high dimensions with quadratic targets and infinite samples. Our analysis allows us to identify new tradeoffs between test error and robustness, whereby robustness can only get worse when test error improves, and vice versa. This is joint work with Alberto Bietto (NYU).
Generalization Bounds via Convex Analysis
Since the celebrated works of Russo and Zou (2016,2019) and Xu and Raginsky (2017), it has been well known that the generalization error of supervised learning algorithms can be bounded in terms of the mutual information between their input and the output, given that the loss of any fixed hypothesis has a subgaussian tail. In this work, we generalize this result beyond the standard choice of Shannon's mutual information to measure the dependence between the input and the output. Our main result shows that it is indeed possible to replace the mutual information by any strongly convex function of the joint input-output distribution, with the subgaussianity condition on the losses replaced by a bound on an appropriately chosen norm capturing the geometry of the dependence measure. This allows us to derive a range of generalization bounds that are either entirely new or strengthen previously known ones. Examples include bounds stated in terms of p- norm divergences and the Wasserstein-2 distance, which are respectively applicable for heavy-tailed loss distributions and highly smooth loss functions. Our analysis is entirely based on elementary tools from convex analysis by tracking the growth of a potential function associated with the dependence measure and the loss function.
|15:30–19:00||Discussion, Panels & Keynotes chaired by Francis Bach|
|15:30–16:10||Keynote: Florence d’Alché|
Learning to predict complex outputs: a kernel view
Motivated by real-world complex output prediction tasks such as link prediction, molecule identification or functional output regression, we give a short overview of how in practise algorithms can be developed for learning functions with values in infinite-dimensional spaces. In particular, we show how to leverage the notion of output kernel to take into account the nature of output variables whether they be discrete structures or functions. We highlight how the two main advantages of kernel methods - representation theorems - and - regularization tools are even enriched in the vector-valued setting. Eventually we take a step back and evocate a line of research based on Optimal Transport that extends this family of works.
|16:10–16:50||Plenary discussion on ELLIS Theory group|
|17:20–18:00||Keynote: Nicolo Cesa-Bianchi|
The power of cooperation in networks of learning agents
We study the power of cooperation in a network of agents that exchange information with each other to learn faster. In the talk, we show the extent to which cooperation allows to prove performance bounds that are strictly better than the known bounds for noncooperating agents. Our results are formulated within the online learning setting and hold for various types of feedback models.
|18:00–19:00||Panel: One model to rule them all: Is deep learning making other ML models obsolete? (Florence D’Alché, Arthur Gretton, Ferenc Huszar, Julien Mairal, Simo Särkkä; moderation: Hans Kersting), with aperitivo|
|9:30–9:40||Information from organizers Lynn Anthonissen and Giulia Casu|
|9:40–12:40||Talks 3 chaired by Thomas Schön|
Data splitting improves statistical performance in overparameterized regimes
While large training datasets generally offer improvement in model performance, the training process becomes computationally expensive and time consuming. Distributed Learning is a common strategy to reduce the overall training time by exploiting multiple computing devices. Recently, it has been observed in the single machine setting that overparameterization is essential for benign overfitting in ridgeless regression in Hilbert spaces. We show that in this regime, data splitting has a regularizing effect, hence improving statistical performance and computational complexity at the same time. We further provide a unified framework that allows us to analyze both the finite and infinite dimensional setting.
Representing non-negative functions, with applications to non-convex optimization and beyond
Many problems in applied mathematics admit a natural representation in terms of non- negative functions, e.g. probability representation and inference, optimal transport, optimal control, non-convex optimization, to name a few. While linear models are well suited to represent functions with output in R or C, being at the same time very expressive and flexible, the situation is different for the case of non- negative functions where the existing models lack one of these good properties. In this talk we present a model for non-negative functions that promises to bring to these problems, the same benefits that linear models brought to interpolation, approximation, quadrature and supervised learning, leading to a new class of adaptive algorithms with provably fast convergence. In particular, we will show direct applications in numerical methods for probability representation and non-convex optimization. We will see more in detail that the model allows to derive an algorithm for non- convex optimization that is adaptive to the degree of differentiability of the objective This Research Workshop has received funding from the European Union’s Horizon 2020 research and innovation programme under ELISE Grant Agreement No. 951847. function and achieves optimal rates of convergence. Finally, we show how to apply the same technique to other interesting problems in applied mathematics that can be easily expressed in terms of inequalities.
Accurate Quantization of Measures via Interacting Particle-based Optimization
Approximating a target probability distribution can be cast as an optimization problem where the objective functional measures the dissimilarity to the target. This optimization can be addressed by approximating Wasserstein and related gradient flows. In practice, these are simulated by interacting particle systems, whose stationary states define an empirical measure approximating the target distribution. This approach has been popularized recently to design sampling algorithms, e.g. Stein Variational Gradient Descent, or by minimizing the Maximum Mean or Kernel Stein Discrepancy. However, little is known about quantization properties of these approaches, i.e. how well is the target approximated by a finite number particles. We investigate this question theoretically and numerically. In particular, we prove general upper bounds on the quantization error of MMD and KSD at rates which significantly outperform quantization by i.i.d. samples. We conduct experiments which show that the particle systems at study achieve fast rates in practice, and notably outperform greedy algorithms, such as kernel herding. We compare different gradient flows and highlight their quantization rates. Furthermore we introduce a Normalized Stein Variational Gradient Descent and argue in favor of adaptive kernels, which exhibit faster convergence.
Prediction in the presence of performative dynamics
When predictions inform decisions they may influence the target they aim to predict. We call such predictions performative. Performativity is a well-studied phenomenon in social sciences that is often neglected in the theories and practices of supervised learning. In this talk, I will introduce and discuss a risk minimization framework for performative prediction that allows for a predictive model to influence the distribution over future data. The non-stationary nature of the problem gives rise to new solution concepts, and entails interesting considerations and challenges for common machine learning practices. A conceptual novelty is an equilibrium notion we call performative stability. Performative stability implies that the predictions are calibrated not against historical data, but against the future data distribution that manifests from acting on the prediction. I will present conditions under which retraining heuristics provably converges to stable predictors. Then, I contrast performative stability with performative optimality that requires a more fine-grained understanding of the performative effects. Taking inspiration from the literature on causal inference and multi-armed bandits I present a novel algorithm that performs targeted exploration to learn about performative effects and to find the optimal predictor ex-post. Finally, I will conclude the talk with a broader discussion about the role of performativity in socio-technical systems.
|12:40–13:00||Closing remarks by Philipp Hennig|
ELLIS PhD & Postdoc Program Coordinator, ELISE Coordinator Tübingen
lynn (dot) anthonissen (at) uni-tuebingen (dot) de
Lab Manager at MaLGa Center, Coordinator of the ELLIS Genoa Unit
gcasu (dot) malga (at) gmail (dot) com