CS47: Dynamical Systems Modelling II
date: 7/14/2025, time: 16:00-17:30, room: ICS 119
Organizer: Anna Jaśkiewicz (Wroclaw University of Science and Technology)
Chair: Anna Jaśkiewicz (Wroclaw University of Science and Technology)
Quantitative Bounds for Kernel-Based Q-Learning in Continuous Spaces
Athanasios Vasileiadis (Karlsruhe Institute of Technology (KIT))
In this talk, we discuss a new proof of convergence, with quantitative bounds, for Q-learning in continuous spaces using kernel methods. We adapt the structure of Watkins's original argument, which introduced the Action Replay Process (ARP) as a proof device, and give precise bounds relating the transitions of the ARP to those of the Markov decision process (MDP) we want to learn.
Suppose we are given an MDP \((\overline{S}, \overline{A}, \overline{\mathsf{P}}, R, \gamma)\) with \(\overline{S}\) and \(\overline{A}\) compact continuous spaces, the transition kernel \(\overline{\mathsf{P}}\) nondegenerate and spreading mass over the whole space, and the reward \(R\) bounded and smooth as a function of state and action. The value of a state \(s\) under a policy \(\pi\) is \[V^\pi(s) := \mathbb{E}\Bigl[\sum_{k=0}^\infty \gamma^k R\bigl(\mathrm{s}_k,\pi(\mathrm{s}_k)\bigr) \bigm\vert \mathrm{s}_0=s \Bigr],\] and, with \(V^\star := \sup_\pi V^\pi\), the optimal action-value function satisfies \[Q^\star(s,a) = R(s,a) + \gamma \int_{\overline{S}} V^\star(s') \, \overline{\mathsf{P}}\bigl((s,a),ds'\bigr), \quad \forall (s,a) \in \overline{S} \times \overline{A}.\]
Our goal is to learn \(Q^\star\) using kernels. To explain the idea, suppose we interact with the (unknown) environment and obtain a sample \((\underline{\mathrm{s}},\underline{\mathrm{a}})_n = \{\mathrm{s}_0,\mathrm{a}_0,\mathrm{s}_1,\dots,\mathrm{s}_n,\mathrm{a}_n\}\) of size \(n\). Given a smooth, non-negative, compactly supported function \(K : \mathbb{R}^{d_S} \times \mathbb{R}^{d_A} \rightarrow \mathbb{R}\) and writing \(K_h(\cdot) := K(\cdot/h)\) for a bandwidth \(h>0\), we define, for any function \(f: \overline{S}\times \overline{A} \to \mathbb{R}\), a point \((s,a)\), and depth \(n\), the smoothing operator \[\mathcal{A}_{h,n}f(s,a) := \frac{\sum_{k=0}^n K_h\bigl(s-\mathrm{s}_k,a-\mathrm{a}_k\bigr)\, f(\mathrm{s}_k,\mathrm{a}_k)}{\sum_{j=0}^n K_h\bigl(s-\mathrm{s}_j,a-\mathrm{a}_j\bigr)}.\]
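As a concrete illustration, here is a minimal numerical sketch of the smoothing operator \(\mathcal{A}_{h,n}\), a Nadaraya–Watson-type weighted average over the visited state–action pairs; the compactly supported kernel and all names below are illustrative choices, not specifics from the talk.

```python
import numpy as np

def K(x):
    """Non-negative, compactly supported kernel (illustrative choice):
    (1 - |x|^2)^2 on the unit ball, 0 outside."""
    return np.maximum(1.0 - np.sum(x**2, axis=-1), 0.0) ** 2

def A_hn(f, s, a, samples_s, samples_a, h):
    """Smoothing operator A_{h,n} f(s, a) over the sample (s_k, a_k)_{k<=n}."""
    z = np.concatenate([(s - samples_s) / h, (a - samples_a) / h], axis=1)
    w = K(z)                                  # weights K_h(s - s_k, a - a_k)
    if w.sum() == 0.0:                        # no sample within bandwidth h
        return 0.0
    vals = np.array([f(sk, ak) for sk, ak in zip(samples_s, samples_a)])
    return float(w @ vals / w.sum())
```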
To apply the kernel approximation to the action-value function, we mimic the temporal-difference (TD) update and set, for \(n \geq 1\), \[\widehat{Q}_{h,n}(s,a) = \widehat{Q}_{h,n-1}(s,a) + \alpha_{h,n-1}(s,a) \Bigl( R(\mathrm{s}_{n-1},\mathrm{a}_{n-1}) + \gamma \sup_{a' \in \overline{A}} \widehat{Q}_{h,n-1}(\mathrm{s}_{n},a') - \widehat{Q}_{h,n-1}(s,a)\Bigr),\] with \[\alpha_{h,n}(s,a) = \begin{cases} \dfrac{K_h\bigl(s-\mathrm{s}_n,a-\mathrm{a}_n\bigr)}{\sum_{j=0}^n K_h\bigl(s-\mathrm{s}_j,a-\mathrm{a}_j\bigr)} & \text{if } \displaystyle\sum_{j=0}^n K_h\bigl(s-\mathrm{s}_j,a-\mathrm{a}_j\bigr) > 0, \\ 0 & \text{otherwise.} \end{cases}\]
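A minimal sketch of the resulting online scheme follows, assuming the estimate is maintained on a finite grid of evaluation points, the sup over \(a'\) is taken over the action grid, and the next state is snapped to its nearest grid point; these discretizations, the kernel, and all names are illustrative assumptions, not the talk's construction.

```python
import numpy as np

class KernelQ:
    """Sketch of the recursive kernel Q-update on a grid of (s, a) points."""

    def __init__(self, grid_s, grid_a, h, gamma):
        self.grid_s, self.grid_a = grid_s, grid_a     # shapes (Ns, dS), (Na, dA)
        self.h, self.gamma = h, gamma
        self.Q = np.zeros((len(grid_s), len(grid_a)))
        self.denom = np.zeros_like(self.Q)            # running sum of kernel weights

    def _kh(self, s_n, a_n):
        # K_h(s - s_n, a - a_n) at every grid point, compactly supported kernel
        ds = np.sum(((self.grid_s - s_n) / self.h) ** 2, axis=1)[:, None]
        da = np.sum(((self.grid_a - a_n) / self.h) ** 2, axis=1)[None, :]
        return np.maximum(1.0 - ds - da, 0.0) ** 2

    def update(self, s_prev, a_prev, r, s_next):
        """One TD step: Q <- Q + alpha_{h,n}(s,a) (r + gamma sup_a' Q(s',a') - Q)."""
        w = self._kh(s_prev, a_prev)
        self.denom += w
        alpha = np.divide(w, self.denom, out=np.zeros_like(w),
                          where=self.denom > 0)       # learning rate alpha_{h,n}
        i_next = np.argmin(np.sum((self.grid_s - s_next) ** 2, axis=1))
        target = r + self.gamma * self.Q[i_next].max()  # sup over the action grid
        self.Q += alpha * (target - self.Q)
```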
Our main result is a rate of convergence of \(\widehat{Q}_{h,n}(s,a)\) to \(Q^\star(s,a)\) in probability as \(n\) becomes large and \(h\) small, uniformly over \((s,a) \in \overline{S} \times \overline{A}\).
Theorem
There exists a constant \(C\), depending on the various parameters underpinning our assumptions, such that for every error threshold \(\varepsilon>0\) we can find \(h_{\varepsilon}\) and \(n_{\varepsilon}\) such that the sup-norm distance between \(\widehat{Q}_{h_\varepsilon,n_\varepsilon}\) and \(Q^\star\) is less than \(C(1+\vert \ln \varepsilon\vert)\varepsilon\) on an event of probability greater than \(1-C\varepsilon^D\). The number \(n_{\varepsilon}\) needed to achieve this is at most \(\exp(C\vert \ln \varepsilon\vert^3)\).
A principle of one big jump for the branching random walk
Jakob Stonner (JLU Gießen)
We prove a version of Nagaev’s theorem for the branching random walk with heavy-tailed associated random walk. For a heavy-tailed random walk \((S_n)_{n \in \mathbb{N}}\) on \(\mathbb{R}\), this theorem roughly states that \(\mathbb{P}(S_n > n\mathbb{E}[S_1] + t) = n \mathbb{P}(S_1 > t)(1 + o(1))\) as \(n \to \infty\) uniformly in large \(t > 0\), which is explained probabilistically by the principle of one big jump. For a branching random walk on \(\mathbb{R}\) we consider the random measure \(\widehat{Z}_n = \sum_{|u|=n} \mathrm{e}^{-V_u} \delta_{V_u}\), where \(n \in \mathbb{N}\) and \((V_u)_{|u|=n}\) denote the positions of the particles in the \(n\)-th generation. Our result then states that \(\widehat{Z}_n([n\mathbb{E}[S_1] + t_n, \infty)) = W n \mathbb{P}(S_1 > t_n)(1 + o(1))\) in probability as \(n \to \infty\), where \(W\) is a non-degenerate random variable, \(t_n \uparrow \infty\) grows fast enough, and \((S_n)_{n \in \mathbb{N}}\) is the associated random walk.
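The classical (non-branching) Nagaev asymptotic quoted above can be checked numerically; below is a minimal Monte Carlo sketch for a plain heavy-tailed walk, with Pareto increments and all parameters as illustrative assumptions rather than the talk's setting.

```python
import numpy as np

rng = np.random.default_rng(1)
ALPHA, N, T, TRIALS = 2.5, 50, 50.0, 10**5    # illustrative parameters

# Pareto(ALPHA) increments on [1, inf): heavy-tailed, finite mean for ALPHA > 1.
steps = rng.pareto(ALPHA, (TRIALS, N)) + 1.0
mean = ALPHA / (ALPHA - 1.0)                  # E[S_1] for this increment law

# Empirical P(S_n > n E[S_1] + t) ...
lhs = np.mean(steps.sum(axis=1) > N * mean + T)
# ... against the one-big-jump prediction n P(S_1 > t) = n t^(-ALPHA).
rhs = N * T ** (-ALPHA)
print(f"empirical {lhs:.2e}  vs  one-big-jump {rhs:.2e}")
```

The two numbers agree up to the \(1+o(1)\) factor when \(t\) is large relative to the typical fluctuations of \(S_n\), which is the regime of the theorem.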
Dynamic Server Scheduling with Interruptible Set-up Times in a Network
Dongnuan Tian (Lancaster University)
We consider a dynamic scheduling problem in which a single server operates over a network to process jobs of different types arriving at designated demand points according to independent Poisson processes. Jobs accumulate waiting costs while queued at their respective locations, and the server dynamically travels around the network to process them. The travel time between nodes represents the sequence-dependent, interruptible set-up time required to switch between job types. This network-based formulation allows us to capture complex relationships in switching efforts between different tasks. We model the problem as a Markov decision process (MDP) with the objective of minimizing the long-run average holding cost. We prove the existence of a stationary policy that ensures system stability under a workload condition. Due to the intractability of exact solutions in large state spaces, we propose a class of index-based heuristic policies with desirable structural properties. We also discuss how these heuristics can be modified to scale effectively with problem size and demonstrate their performance through extensive numerical experiments against suitable benchmarks.
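Since the abstract does not spell out the proposed indices, the sketch below only illustrates the general shape of an index-based rule in this setting: a hypothetical c-mu-style greedy index that trades queued holding cost against travel-plus-service time, not the talk's actual policy.

```python
import numpy as np

def greedy_index_policy(queues, current, travel, hold_cost, serv_rate):
    """Pick the next node to visit by a greedy index (illustrative stand-in).

    queues    : queue length q_j at each node
    current   : node where the server currently is
    travel    : travel[i][j] = interruptible set-up (travel) time from i to j
    hold_cost : holding cost rate c_j per job per unit time at node j
    serv_rate : service rate mu_j at node j
    """
    best, best_idx = None, -np.inf
    for j, q in enumerate(queues):
        if q == 0:
            continue
        # holding cost addressed per unit of time invested (travel + one service)
        idx = hold_cost[j] * q / (travel[current][j] + 1.0 / serv_rate[j])
        if idx > best_idx:
            best, best_idx = j, idx
    return best  # None if all queues are empty
```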