IS22: Probabilistic foundations of machine learning
date: 7/15/2025, time: 14:00-15:30, room: IM HS
Organizer: Arnaud Guillin (Université Clermont Auvergne)
Chair: Eulalia Nualart (Pompeu Fabra University)
Convergence of continuous-time stochastic gradient descent with applications to linear deep neural networks
Eulalia Nualart (Pompeu Fabra University)
We study a continuous-time approximation of the stochastic gradient descent process for minimizing the expected loss in learning problems. The main results establish general sufficient conditions for convergence, extending the results of Chatterjee (2022) established for (non-stochastic) gradient descent. We show how the main result can be applied to the training of overparametrized linear neural networks. Joint work with Gabor Lugosi.
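As a purely illustrative companion to this abstract (not the authors' construction; the network size, step size and data below are arbitrary choices), the following Python sketch runs minibatch stochastic gradient descent on an overparametrized two-layer linear network. For small learning rates, iterates of this kind stay close to a continuous-time approximation of the sort analysed in the talk.

```python
import numpy as np

# Illustrative sketch only: minibatch SGD on a two-layer *linear* network
# f(x) = W2 @ W1 @ x.  For small learning rates the iterates track a
# continuous-time (gradient-flow type) approximation of the dynamics.

rng = np.random.default_rng(0)
d, h, n = 5, 20, 200                       # input dim, hidden width, samples
X = rng.normal(size=(n, d))
A_true = rng.normal(size=(1, d))
y = X @ A_true.T                           # targets from a linear teacher

W1 = rng.normal(size=(h, d)) / np.sqrt(d)  # overparametrized factorisation
W2 = rng.normal(size=(1, h)) / np.sqrt(h)

def grads(W1, W2, Xb, yb):
    """Gradients of the half mean-squared loss on a (mini)batch."""
    err = Xb @ W1.T @ W2.T - yb            # residuals, shape (batch, 1)
    gW2 = err.T @ (Xb @ W1.T) / len(Xb)
    gW1 = W2.T @ err.T @ Xb / len(Xb)
    return gW1, gW2

lr, steps, batch = 1e-2, 2000, 16
for t in range(steps):
    idx = rng.choice(n, size=batch, replace=False)   # minibatch noise
    gW1, gW2 = grads(W1, W2, X[idx], y[idx])
    W1 -= lr * gW1
    W2 -= lr * gW2

print("train MSE:", float(np.mean((X @ W1.T @ W2.T - y) ** 2)))
```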
Semiconcavity of entropic potentials and exponential convergence of Sinkhorn’s algorithm
Giovanni Conforti (Università degli Studi di Padova)
The entropic optimal transport problem, also known as the Schrödinger problem, is a regularised version of the classical optimal transport problem which consists in minimising the relative entropy with respect to a reference distribution over all couplings of two given marginals. In this talk, we study the stability of optimisers and the exponential convergence of Sinkhorn's algorithm, which is at the heart of the success of entropic regularisation in machine learning and engineering applications. In the first part of the talk, we will illustrate how semiconcavity of the dual optimal variables, known as entropic potentials, plays a key role in establishing tight stability bounds and the convergence of Sinkhorn iterates with sharp rates. In the second part of the talk, we discuss how to establish semiconcavity bounds in examples of interest, such as log-concave marginals or marginals with bounded support.
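For readers unfamiliar with the algorithm, here is a minimal discrete Sinkhorn sketch in Python (illustrative only; the marginals, cost and regularisation strength are arbitrary choices). The scaling vectors u and v are exponentials of the dual variables, i.e. of discrete analogues of the entropic potentials discussed above, and the loop is the alternating marginal-matching iteration whose exponential convergence is at stake.

```python
import numpy as np

# Minimal discrete Sinkhorn iteration (illustrative only).  The scalings
# u, v are exponentials of the dual potentials; the alternating updates
# match the two prescribed marginals in turn.

rng = np.random.default_rng(1)
n, eps = 50, 0.1                                        # support size, regularisation
x = np.sort(rng.normal(size=n))
z = np.sort(rng.normal(size=n))
mu = np.full(n, 1.0 / n)                                # first marginal
nu = np.full(n, 1.0 / n)                                # second marginal
C = (x[:, None] - z[None, :]) ** 2                      # quadratic cost
K = np.exp(-C / eps)                                    # Gibbs kernel

u = np.ones(n)
v = np.ones(n)
for it in range(500):
    u = mu / (K @ v)                                    # fit the first marginal
    v = nu / (K.T @ u)                                  # fit the second marginal

pi = u[:, None] * K * v[None, :]                        # entropic coupling
print("marginal errors:",
      np.abs(pi.sum(axis=1) - mu).max(),
      np.abs(pi.sum(axis=0) - nu).max())
```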
Bibliography
\([1]\) A. Chiarini, G. Conforti, G. Greco and L. Tamanini. "A semiconcavity approach to stability of entropic plans and exponential convergence of Sinkhorn's algorithm." arXiv preprint, 2024. https://arxiv.org/abs/2412.09235.
Linear convergence of proximal descent schemes on the Wasserstein space
Mateusz Majka (Heriot-Watt University)
In this talk, I will first introduce the problem of training mean-field neural networks as an infinite-dimensional entropy-regularized optimization problem on the Wasserstein space, i.e., \[\min_{\mu \in \mathcal{P}_2(\mathbb R^d)} F^{\sigma}(\mu), \text{ with } F^{\sigma}(\mu) := F(\mu) + \sigma\operatorname{KL}(\mu|\pi).\] I will then explain how to solve this problem by utilizing the Wasserstein gradient flow and, in particular, how to obtain the convergence rate of the flow under different notions of convexity. I will then proceed to discuss the results from \([2]\) on proximal descent methods, inspired by the minimizing movement scheme introduced by Jordan, Kinderlehrer and Otto (JKO), for optimizing entropy-regularized functionals on the Wasserstein space. In particular, the focus will be on the splitting scheme defined by the following update rule: given a step size \(\tau > 0\) and starting from \(\mu^0 \in \mathcal{P}_2(\mathbb R^d)\), perform a gradient descent step on \(F\): \[\nu^{n+1} = \left(I_d-\tau\nabla \frac{\delta F}{\delta \mu}(\mu^n, \cdot)\right)_{\#}\mu^n,\] and then perform a JKO step starting from \(\nu^{n+1}\): \[\mu^{n+1} = \operatorname{argmin}_{\mu \in \mathcal{P}_2(\mathbb R^d)} \left\{\sigma \operatorname{KL}(\mu|\pi) + \frac{1}{2\tau}\mathcal{W}_2^2(\mu, \nu^{n+1})\right\}\,.\] I will explain how to establish linear convergence of such schemes under flat convexity assumptions, thereby relaxing the common reliance on geodesic convexity. The analysis from \([2]\) circumvents the need for discrete-time adaptations of the Evolution Variational Inequality (EVI) such as in \([4]\). Instead, we leverage a uniform logarithmic Sobolev inequality (LSI) and the entropy "sandwich" lemma, extending the analysis of the Wasserstein gradient flow from \([3]\) and \([1]\). The major challenge in the proof via the LSI is to show that the relative Fisher information is well defined at every step of the scheme. Since the relative entropy is not Wasserstein differentiable, we need to prove that the iterates of the scheme belong to a certain Sobolev regularity class, so that the relative entropy admits a unique Wasserstein sub-gradient and the relative Fisher information is indeed finite.
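To make the alternation concrete, here is a small particle-based Python sketch. It is illustrative only: the exact JKO step is an implicit minimization over probability measures with no closed form in general, so the sketch replaces it with one explicit Euler-Maruyama step of the Langevin dynamics targeting \(\pi\); this surrogate is not the scheme analyzed in \([2]\), and the functional \(F\), the reference \(\pi\) and all numerical parameters are arbitrary choices.

```python
import numpy as np

# Particle sketch of the *structure* of the splitting scheme (illustrative
# only).  The exact JKO step is an implicit minimization over measures and
# has no closed form in general; here it is replaced by a single explicit
# Euler-Maruyama step of the Langevin dynamics targeting pi, which is NOT
# the scheme analyzed in [2] but mimics the alternation
#   explicit gradient step on F  ->  entropic (KL) step towards pi.
# We take the simplest flat-convex example F(mu) = \int V d(mu) with
# V(theta) = |theta - 1|^2 / 2 and reference measure pi = N(0, I).

rng = np.random.default_rng(2)
d, N = 2, 5000                       # dimension, number of particles
sigma, tau, steps = 0.5, 0.05, 200   # regularization, step size, iterations
theta = rng.normal(size=(N, d))      # particles approximating mu^0

def grad_V(th):                      # nabla V; equals nabla(delta F / delta mu) here
    return th - 1.0

def grad_U(th):                      # pi = N(0, I)  =>  U(theta) = |theta|^2 / 2
    return th

for n in range(steps):
    # explicit gradient step on F (pushforward of the particle cloud)
    theta = theta - tau * grad_V(theta)
    # surrogate for the JKO / KL step: one Langevin step towards pi
    theta = (theta - tau * sigma * grad_U(theta)
             + np.sqrt(2.0 * sigma * tau) * rng.normal(size=theta.shape))

print("particle mean:", theta.mean(axis=0))   # lies between the minimizer of V (= 1)
                                              # and the mean of pi (= 0)
```

The update rule quoted in the abstract keeps the exact proximal (JKO) step; the Langevin replacement above is only a device to make the alternation runnable in a few lines.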
Bibliography
\([1]\) L. Chizat. "Mean-field Langevin dynamics: Exponential convergence and annealing." Transactions on Machine Learning Research, 2022.
\([2]\) R.-A. Lascu, M. B. Majka, D. Šiška and Ł. Szpruch. "Linear convergence of proximal descent schemes on the Wasserstein space." arXiv:2411.15067, 2024.
\([3]\) A. Nitanda, D. Wu, and T. Suzuki. "Convex analysis of the mean field Langevin dynamics." In Proceedings of The 25th International Conference on Artificial Intelligence and Statistics, 2022.
\([4]\) A. Salim, A. Korba and G. Luise. "The Wasserstein proximal gradient algorithm." In Advances in Neural Information Processing Systems, volume 33, pages 12356-12366. Curran Associates, Inc., 2020.