
Non-asymptotic analysis of Langevin-type Monte Carlo algorithms

Published online by Cambridge University Press:  10 December 2025

Shogo Nakakita*
Affiliation:
University of Tokyo
*
Postal address: Komaba Institute for Science, University of Tokyo, 3-8-1 Komaba, Meguro, Tokyo 153-8902, Japan. Email: nakakita@g.ecc.u-tokyo.ac.jp

Abstract

We study Langevin-type algorithms for sampling from Gibbs distributions such that the potentials are dissipative and their weak gradients have finite moduli of continuity not necessarily convergent to zero. Our main result is a non-asymptotic upper bound on the 2-Wasserstein distance between a Gibbs distribution and the law of general Langevin-type algorithms based on a Liptser–Shiryaev-type condition for change of measures and Poincaré inequalities. We apply this bound to show that the Langevin Monte Carlo algorithm can approximate Gibbs distributions with arbitrary accuracy if the potentials are dissipative and their gradients are uniformly continuous. We also propose Langevin-type algorithms with spherical smoothing for distributions whose potentials are not convex or continuously differentiable and show their polynomial complexities.

Information

Type
Original Article
Creative Commons
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted re-use, distribution and reproduction, provided the original article is properly cited.
Copyright
© The Author(s), 2025. Published by Cambridge University Press on behalf of Applied Probability Trust

1. Introduction

We consider the problem of sampling from a Gibbs distribution $\pi(\text{d} x)\propto \mathrm{e}^{-\beta U(x)}\text{d} x$ on $(\mathbb{R}^{d},\mathcal{B}(\mathbb{R}^{d}))$, where $U \, : \, \mathbb{R}^{d}\to[0,\infty)$ is a non-negative potential function and $\beta>0$ is the inverse temperature. In particular, we focus on approximate sampling from a distribution $\pi_{\epsilon}$ satisfying $d(\pi_{\epsilon},\pi)\le \epsilon$ for a given tolerance $\epsilon\in(0,\infty)$ and distance $d(\cdot,\cdot)$ between probability distributions. The algorithms for this problem can be classified into two types: algorithms whose invariant distributions exactly match $\pi$ (e.g., Metropolis–Hastings algorithms, Gibbs samplers, and piecewise deterministic Markov process Monte Carlo methods) and algorithms whose invariant distributions are not equal to $\pi$ but approximate it with favourable computational complexity. In this study, we examine algorithms of the latter type and their computational complexities.

One of the most widely used classes of algorithms for the approximate sampling problem is the Langevin type, motivated by the Langevin dynamics, that is, the solution of the following d-dimensional stochastic differential equation (SDE):

(1) \begin{align} \text{d} X_{t}=-\nabla U\left(X_{t}\right)\text{d} t+\sqrt{2\beta^{-1}}\text{d} B_{t},\quad X_{0}=\xi,\end{align}

where $\{B_{t}\}_{t\ge0}$ is a d-dimensional Brownian motion and $\xi$ is a d-dimensional random vector with $|\xi|<\infty$ almost surely. Since the 2-Wasserstein or total variation distance between $\pi$ and the law of $X_{t}$ converges to zero under mild conditions, we expect that the laws of Langevin-type algorithms inspired by $X_{t}$ should approximate $\pi$ well. However, most of the theoretical guarantees for these algorithms are based on the convexity of U, the twice continuous differentiability of U, or the Lipschitz continuity of the gradient $\nabla U$, which do not hold in some models in statistics and machine learning. The main interest of this study is a unified approach to analysing and proposing Langevin-type algorithms under minimal assumptions.

The stochastic gradient Langevin Monte Carlo (SG-LMC) algorithm or stochastic gradient Langevin dynamics with a constant step size $\eta>0$ is given by the discrete observations $\{Y_{i\eta}\}_{i=0,\ldots,k}$ of the solution of the following d-dimensional SDE:

(2) \begin{align} \text{d} Y_{t}=-{G}\left(Y_{\lfloor t/\eta\rfloor\eta},a_{\lfloor t/\eta\rfloor\eta}\right)\text{d} t+\sqrt{2\beta^{-1}}\text{d} B_{t},\quad Y_{0}=\xi,\end{align}

where $\{a_{i\eta}\}_{i=0,\ldots,k}$ is a sequence of independent and identically distributed (i.i.d.) random variables on a measurable space $(A,\mathcal{A})$ and G is an $\mathbb{R}^{d}$-valued measurable function. We assume that $\{a_{i\eta}\}$ , $\{B_{t}\}$ , and $\xi$ are independent. Note that the Langevin Monte Carlo (LMC) algorithm is a special case of SG-LMC; it is represented by the discrete observations $\{Y_{i\eta}\}_{i=0,\ldots,k}$ of the solution of the following diffusion-type SDE:

(3) \begin{align} \text{d} Y_{t}=-\nabla U\left(Y_{\lfloor t/\eta\rfloor\eta}\right)\text{d} t+\sqrt{2\beta^{-1}}\text{d} B_{t},\quad Y_{0}=\xi,\end{align}

which corresponds to the case ${G}=\nabla U$ .
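As a concrete illustration, the recursion read off from (2) (equivalently (3) when ${G}=\nabla U$) is a one-line Euler–Maruyama update. The following minimal sketch is purely illustrative: the separable double-well potential $U(x)=\sum_{i}(x_{i}^{2}-1)^{2}/4$ and all parameter values are our own demonstration choices, not taken from the analysis below.

```python
import numpy as np

def sg_lmc(grad_estimate, xi, k, eta, beta, rng):
    """Run k steps of the SG-LMC recursion from Eq. (2):
    Y_{(i+1)eta} = Y_{i eta} - eta * G(Y_{i eta}, a_{i eta}) + sqrt(2*eta/beta) * N(0, I_d)."""
    y = np.asarray(xi, dtype=float).copy()
    d = y.shape[0]
    for _ in range(k):
        y = y - eta * grad_estimate(y, rng) + np.sqrt(2.0 * eta / beta) * rng.standard_normal(d)
    return y

# LMC special case G = grad U (no gradient noise), with the illustrative
# non-convex double-well potential U(x) = sum_i (x_i^2 - 1)^2 / 4, grad U = x^3 - x.
grad_U = lambda y, rng: y ** 3 - y
rng = np.random.default_rng(0)
y_final = sg_lmc(grad_U, np.zeros(2), k=2000, eta=0.01, beta=5.0, rng=rng)
```

A stochastic-gradient variant is obtained by passing a `grad_estimate` that uses the auxiliary randomness `rng`, e.g. a minibatch gradient; the scheme is unchanged.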

To see what difficulties we need to deal with, we review a typical analysis [Reference Raginsky, Rakhlin and Telgarsky31] based on the smoothness of U, that is, the twice continuous differentiability of U and the Lipschitz continuity of $\nabla U$ . Firstly, the twice continuous differentiability simplifies the discussion and plays a significant role in studies of functional inequalities such as Poincaré inequalities and logarithmic Sobolev inequalities (e.g., [Reference Bakry, Barthe, Cattiaux and Guillin4, Reference Cattiaux, Guillin and Wu9]). Since the functional inequalities for $\pi$ are essential in the analysis of Langevin algorithms, the assumption that U is of class $\mathcal{C}^{2}$ frequently appears in previous studies. Secondly, the Lipschitz continuity combined with weak conditions ensures the representation of the likelihood ratio between $\{X_{t}\}$ and $\{Y_{t}\}$ , which is critical when we bound the Kullback–Leibler divergence. Liptser and Shiryaev [Reference Liptser and Shiryaev24] exhibit conditions much weaker than Novikov’s or Kazamaki’s for the explicit representation if (1) has a unique strong solution. Since the Lipschitz continuity of $\nabla U$ is sufficient for the existence and uniqueness of the strong solution of (1), the framework of [Reference Liptser and Shiryaev24] is applicable.

Our approaches to overcoming the non-smoothness of U are mollification, a classical approach to dealing with non-smoothness in differential equations (e.g., see [Reference Menozzi, Pesce and Zhang26, Reference Menozzi and Zhang27]), and the abuse of moduli of continuity for possibly discontinuous functions. We consider the convolution ${\bar{U}}_{r} \,:\!=\,U\ast \rho_{r}$ of U, which has a weak gradient, with a sufficiently smooth non-negative function $\rho_{r}$ whose compact support lies in a ball of centre $\mathbf{0}$ and radius $r\in(0,1]$ . We can let $\bar{U}_{r}$ be of class $\mathcal{C}^{2}$ and obtain bounds for the constant of Poincaré inequalities for $\bar{\pi}^{r}(\text{d} x)\propto \exp(-\beta\bar{U}_{r}(x))\text{d} x$ , which suffice to show the convergence of the law of the mollified dynamics $\{\bar{X}_{t}^{r}\}$ defined by the SDE

\begin{align*} \text{d} \bar{X}_{t}^{r}=-\nabla \bar{U}_{r}\left(\bar{X}_{t}^{r}\right)\text{d} t +\sqrt{2\beta^{-1}}\text{d} B_{t},\quad \bar{X}_{0}^{r}=\xi,\end{align*}

to the corresponding Gibbs distribution $\bar{\pi}^{r}$ in 2-Wasserstein distance owing to [Reference Bakry, Barthe, Cattiaux and Guillin4, Reference Lehec21, Reference Liu25]. Since the convolution $\nabla \bar{U}_{r}$ is Lipschitz continuous if the modulus of continuity of a representative $\nabla U$ is finite (the convergence to zero is unnecessary), a concise representation of the likelihood ratios between the mollified dynamics $\{\bar{X}_{t}^{r}\}$ and $\{Y_{t}\}$ is available, and we can evaluate the Kullback–Leibler divergence under weak assumptions.

As our analysis relies on mollification, the bias–variance decomposition of G with respect to $\nabla \bar{U}_{r}$ rather than $\nabla U$ is crucial. This decomposition gives us a unified approach to analysing well-known Langevin-type algorithms and proposing new algorithms for U without continuous differentiability. Concretely speaking, we show that the sampling error of LMC, under the dissipativity of U of class $\mathcal{C}^{1}$ with uniformly continuous $\nabla U$ , can be made arbitrarily small by controlling k, $\eta$ , and r carefully and letting the bias converge to zero. We also propose new algorithms, called the spherically smoothed Langevin Monte Carlo (SS-LMC) algorithm and the spherically smoothed stochastic gradient Langevin Monte Carlo (SS-SG-LMC) algorithm, whose errors can be made arbitrarily small under the dissipativity of U and the boundedness of the moduli of continuity of weak gradients. In addition, we discuss zeroth-order versions of these algorithms, which are naturally obtained via integration by parts.
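To illustrate how integration by parts yields a zeroth-order (function-value-only) gradient estimator, consider the standard identity for uniform smoothing: if $\widehat{U}_{r}(x)=E[U(x+rV)]$ with V uniform on the unit ball, then $\nabla \widehat{U}_{r}(x)=(d/r)\,E[U(x+rS)S]$ with S uniform on the unit sphere. The sketch below is our own illustration of this device with this uniform kernel; the precise smoothing used by SS-LMC is the polynomial mollifier defined in Section 2.

```python
import numpy as np

def zeroth_order_grad(U, x, r, n_samples, rng):
    """Monte Carlo estimate of (d/r) * E[U(x + r*S) * S], S uniform on the unit sphere.
    By integration by parts this equals the gradient of the spherically
    smoothed potential E[U(x + r*V)], V uniform on the unit ball."""
    d = x.shape[0]
    g = rng.standard_normal((n_samples, d))
    s = g / np.linalg.norm(g, axis=1, keepdims=True)   # uniform on the sphere
    vals = np.array([U(x + r * si) for si in s])
    return (d / r) * (vals[:, None] * s).mean(axis=0)

# In d = 1 the sphere is {-1, +1}, so averaging the two points gives the
# central difference (U(x+r) - U(x-r)) / (2r) exactly; for the quadratic
# U(x) = |x|^2 this recovers grad U(x) = 2x with no smoothing bias.
U = lambda z: float(z @ z)
x, r = np.array([1.5]), 0.1
est = (U(x + r) - U(x - r)) / (2 * r)                  # -> 3.0
rng = np.random.default_rng(1)
mc = zeroth_order_grad(U, x, r, 100000, rng)           # close to [3.0]
```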

1.1. Related works

While the problem of sampling from distributions with non-smooth and non-convex potentials has been extensively studied, this study focuses on Langevin-based algorithms. For studies from different viewpoints, see, for example, [Reference Fan, Yuan, Liang and Chen20, Reference Liang and Chen22].

Non-asymptotic analysis of Langevin-based algorithms under convex potentials has been a subject of much attention and intense research [Reference Dalalyan13, Reference Durmus and Moulines15, Reference Durmus and Moulines16], and the case without convexity has also attracted keen interest [Reference Erdogdu, Mackey and Shamir18, Reference Raginsky, Rakhlin and Telgarsky31, Reference Xu, Chen, Zou and Gu34]. While most previous studies are based on the Lipschitz continuity of the gradients of potentials, several studies extend the settings to those without global Lipschitz continuity. We can classify the settings of potentials in those studies into three types: (1) potentials with convexity but without smoothness [Reference Chatterji, Diakonikolas, Jordan and Bartlett10, Reference Durmus, Majewski and Miasojedow14, Reference Lehec21, Reference Pereyra30]; (2) potentials with Hölder continuous gradients and degenerate convexity at infinity or outside a ball [Reference Chewi, Erdogdu, Li, Shen and Zhang11, Reference Erdogdu and Hosseinzadeh17, Reference Nguyen29]; and (3) potentials with locally Lipschitz gradients [Reference Brosse, Durmus, Moulines and Sabanis8, Reference Zhang, Akyildiz, Damoulas and Sabanis35]. We review (1) and (2), as our study gives error estimates for LMC with uniformly continuous gradients and for Langevin-type algorithms with gradients whose discontinuity is uniformly bounded.

Pereyra [Reference Pereyra30], Chatterji et al. [Reference Chatterji, Diakonikolas, Jordan and Bartlett10], and Lehec [Reference Lehec21] study Langevin-type algorithms under the convexity and non-smoothness of potentials. Pereyra [Reference Pereyra30] presents proximal Langevin-type algorithms for potentials with convexity but without smoothness, which use Moreau approximations and proximity mappings instead of gradients. The algorithms are stable in the sense that they are exponentially ergodic for arbitrary step sizes. Chatterji et al. [Reference Chatterji, Diakonikolas, Jordan and Bartlett10] propose the perturbed Langevin Monte Carlo algorithm for non-smooth potential functions and show its performance in approximating Gibbs distributions. The difference between perturbed LMC and ordinary LMC lies in the inputs of the gradients: Gaussian noise is added not only to the gradients but also to their inputs. The main idea of the algorithm is Gaussian smoothing of potential functions: the expectation of a non-smooth convex potential whose input is perturbed by a Gaussian random vector is smoother than the potential itself. Lehec [Reference Lehec21] investigates the projected LMC for potentials with convexity, global Lipschitz continuity, and discontinuous bounded gradients. The analysis is based on convexity and estimates for local times of diffusion processes with reflecting boundaries. The study also generalizes the result to potentials with local Lipschitz continuity by considering a ball as the support of the algorithm and letting the radius diverge.

Erdogdu and Hosseinzadeh [Reference Erdogdu and Hosseinzadeh17], Chewi et al. [Reference Chewi, Erdogdu, Li, Shen and Zhang11], and Nguyen [Reference Nguyen29] estimate the error of LMC under non-convex potentials with degenerate convexity, weak smoothness, and weak dissipativity. Erdogdu and Hosseinzadeh [Reference Erdogdu and Hosseinzadeh17] show convergence guarantees of LMC under the degenerate convexity at infinity and weak dissipativity of potentials with Hölder continuous gradients, which are the assumptions for modified logarithmic Sobolev inequalities. Nguyen [Reference Nguyen29] relaxes the condition of [Reference Erdogdu and Hosseinzadeh17] by considering the degenerate convexity outside a large ball and the mixture weak smoothness of potential functions. Chewi et al. [Reference Chewi, Erdogdu, Li, Shen and Zhang11] analyse the convergence with respect to the Rényi divergence under either Latała–Oleszkiewicz inequalities or modified logarithmic Sobolev inequalities.

Note that our proof of the results uses approaches similar to the smoothing of [Reference Chatterji, Diakonikolas, Jordan and Bartlett10] and the control of the radius of [Reference Lehec21], while our motivations and settings are close to those of the studies under non-convexity.

1.2. Contributions

Theorem 2.1, the main theoretical result of this paper, gives an upper bound for the 2-Wasserstein distance between the law of general SG-LMC given by Eq. (2) and the target distribution $\pi$ under weak conditions. We assume the weak differentiability of U combined with the boundedness of the modulus of continuity of a weak gradient $\nabla U$ rather than the twice continuous differentiability of U and the Lipschitz continuity of $\nabla U$ . The generality of the assumptions results in a concise and general framework for the analysis of Langevin-type algorithms. We demonstrate the strength of this framework through the analysis of LMC under weak smoothness and the proposal of new Langevin-type algorithms without the continuous differentiability or convexity of U.

Our contribution to the analysis of LMC is to show a theoretical guarantee for LMC under non-convexity and weak smoothness in a direction different from those of previous studies. The main difference between our assumptions and those of other studies under non-convex potentials is whether to assume (a) the strong dissipativity of the potentials and the uniform continuity of the gradients or (b) the degenerate convexity of the potentials and the Hölder continuity of the gradients. Our analysis needs neither degenerate convexity nor Hölder continuity, while we need stronger dissipativity than assumed in previous studies. Since assumptions (a) and (b) do not imply each other, our main contribution in the analysis of LMC is not to strengthen the previous studies but to broaden the theoretical guarantees of LMC under weak smoothness in a different direction.

Moreover, deriving polynomial sampling complexities with first-order or zeroth-order oracles for potentials without convexity, continuous differentiability, or bounded gradients is also a significant contribution. By proposing and analysing Langevin-type algorithms, we show that the complexity of sampling from some posterior distributions whose potentials are dissipative and weakly differentiable but neither convex nor continuously differentiable (e.g., some losses with elastic net regularization in nonlinear regression and robust regression) is, with a first-order oracle for the potentials, of polynomial order in the dimension d, the tolerance $\epsilon$ , and the Poincaré constant. Furthermore, we also develop zeroth-order algorithms inspired by the recent study of [Reference Roy, Shen, Balasubramanian and Ghadimi32] for black-box sampling and show that the sampling complexity with a zeroth-order oracle for potentials without convexity or smoothness is also of polynomial order.

1.3. Outline

This paper is organized as follows. Section 2 displays the main theorem and its concise representation. In Section 3, we apply the result to analysis of LMC and our proposal for Langevin-type algorithms. Section 4 is devoted to preliminary results. Finally, we present the proof of the main theorem in Section 5.

1.4. Notation

Let $|\cdot|$ denote the Euclidean norm of $\mathbb{R}^{\ell}$ for all $\ell\in\mathbb{N}$ . $\langle \cdot,\cdot\rangle $ is the Euclidean inner product of $\mathbb{R}^{\ell}$ . $\|\cdot\|_{2}$ is the spectral norm of matrices, which equals the largest singular value. For an arbitrary matrix A, $A^{\top}$ denotes the transpose of A. For all $x\in\mathbb{R}^{\ell}$ and $R>0$ , let $B_{R}(x)$ and $\bar{B}_{R}(x)$ respectively be an open ball and a closed ball of centre x and radius R with respect to the Euclidean metric. We write $\|f\|_{\infty}\,:\!=\,\sup_{x\in\mathbb{R}^{d}}|f(x)|$ for arbitrary $f:\mathbb{R}^{d}\to\mathbb{R}^{\ell}$ and $d,\ell\in\mathbb{N}$ .

For two arbitrary probability measures $\mu$ and $\nu$ on $(\mathbb{R}^{d},\mathcal{B}(\mathbb{R}^{d}))$ and $p\ge1$ , we define the p-Wasserstein distance between $\mu$ and $\nu$ such that

\begin{align*} \mathcal{W}_{p}\left(\mu,\nu\right)\,:\!=\,\left(\inf_{\pi\in\Pi(\mu,\nu)}\int_{\mathbb{R}^{d}\times\mathbb{R}^{d}}\left|x-y\right|^{p}\text{d} \pi\left(x,y\right)\right)^{1/p},\end{align*}

where $\Pi(\mu,\nu)$ is the set of all couplings for $\mu$ and $\nu$ . We also define $D\left(\mu\|\nu\right)$ and $\chi^{2}\left(\mu\|\nu\right)$ , the Kullback–Leibler divergence and the $\chi^{2}$ -divergence of $\mu$ from $\nu$ with $\mu \ll \nu$ respectively such that

\begin{align*} D\left(\mu\|\nu\right)=\int_{\mathbb{R}^{d}} \log\left(\frac{\text{d} \mu}{\text{d} \nu}\right)\text{d} \mu,\quad \chi^{2}\left(\mu\|\nu\right)=\int_{\mathbb{R}^{d}} \left(\frac{\text{d} \mu}{\text{d} \nu}-1\right)^{2}\text{d} \nu.\end{align*}
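For intuition about $\mathcal{W}_{p}$ : in one dimension, the optimal coupling between two empirical measures with the same number of equally weighted atoms is the monotone pairing of order statistics, which gives a closed form. The following sketch is our own illustration and is not part of the analysis below.

```python
import numpy as np

def wasserstein_p_1d(x, y, p=2):
    """W_p between the empirical laws of two equal-size samples on R.
    In one dimension the optimal coupling pairs order statistics, so
    W_p^p = (1/n) * sum_i |x_(i) - y_(i)|^p."""
    xs, ys = np.sort(np.asarray(x, float)), np.sort(np.asarray(y, float))
    return float(np.mean(np.abs(xs - ys) ** p) ** (1.0 / p))

# Translating a sample by c moves its empirical law exactly distance |c| in every W_p.
x = np.array([0.0, 1.0, 2.0, 5.0])
w = wasserstein_p_1d(x, x + 2.0, p=2)   # -> 2.0
```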

2. Main results

This section gives the main theorem for non-asymptotic estimates of the error of general SG-LMC algorithms in 2-Wasserstein distance, which is one of the standard criteria for evaluation of sampling algorithms (e.g., [Reference Durmus, Majewski and Miasojedow14, Reference Durmus and Moulines16, Reference Roy, Shen, Balasubramanian and Ghadimi32]).

2.1. Estimate of the errors of general SG-LMC

We consider a compact polynomial mollifier [Reference Anderson3] $\rho:\mathbb{R}^{d}\to[0,\infty)$ of class $\mathcal{C}^{1}$ as follows:

(4) \begin{align} \rho(x)=\begin{cases} \left(\frac{\pi^{d/2}\text{B}(d/2,3)}{\Gamma(d/2)}\right)^{-1}\left(1-|x|^{2}\right)^{2}&\text{if }|x|\le 1,\\[5pt] 0 & \text{otherwise,} \end{cases}\end{align}

where $\text{B}(\cdot,\cdot)$ is the beta function and $\Gamma(\! \cdot \!)$ is the gamma function. Note that $\nabla \rho$ has an explicit $L^{1}$ -bound, which is the reason to adopt $\rho$ as the mollifier in our analysis; we give more detailed discussions on $\rho$ in Section 4.2. Let $\rho_{r}(x)=r^{-d}\rho(x/r)$ with $r>0$ .
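For intuition, in dimension $d=1$ the normalizing constant in (4) reduces to $\mathrm{B}(1/2,3)^{-1}=15/16$ (since $\pi^{1/2}=\Gamma(1/2)$), and the normalization $\int\rho=1$ can be verified by direct quadrature. The following check is our own illustration, restricted to $d=1$.

```python
import numpy as np
from math import gamma, pi

def rho_1d(x):
    """The compact polynomial mollifier (4) in dimension d = 1:
    rho(x) = c * (1 - x^2)^2 on [-1, 1], c = Gamma(1/2) / (pi^(1/2) * B(1/2, 3))."""
    B = gamma(0.5) * gamma(3.0) / gamma(3.5)        # Beta(1/2, 3) = 16/15
    c = gamma(0.5) / (pi ** 0.5 * B)                # = 15/16 since Gamma(1/2) = sqrt(pi)
    x = np.asarray(x, float)
    return np.where(np.abs(x) <= 1.0, c * (1.0 - x ** 2) ** 2, 0.0)

# A Riemann sum over the support confirms that rho integrates to one.
xs = np.linspace(-1.0, 1.0, 200001)
total = float(np.sum(rho_1d(xs)) * (xs[1] - xs[0]))   # approximately 1.0
```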

We define $\widetilde{G}(x)$ such that for each $x\in\mathbb{R}^{d}$ ,

\begin{align*} \widetilde{G}(x)\,:\!=\,E\left[{G}\left(x,a_{0}\right)\right],\end{align*}

whose measurability is given by Tonelli’s theorem.

We set the following assumptions on U and G.

  1. (A1) $U\in W_{\text{loc}}^{1,\infty}(\mathbb{R}^{d})$ .

  2. (A2) For each $a\in A$ and $x\in\mathbb{R}^{d}$ , $|G(x,a)|<\infty$ .

Under (A1), we fix a representative $\nabla U$ and consider the assumptions on $\nabla U$ and $\widetilde{G}$ .

  1. (A3) $|\nabla U(\mathbf{0})|<\infty$ and $|\widetilde{G}(\mathbf{0})|<\infty$ , and the moduli of continuity of $\nabla U$ and $\widetilde{G}$ are bounded, that is,

    \begin{align*} \omega_{\nabla U}(r)&\,:\!=\,\sup\nolimits_{x,y\in\mathbb{R}^{d}:|x-y|\le r}\left|\nabla U(x)-\nabla U(y)\right|<\infty,\\ \omega_{\widetilde{G}}(r)&\,:\!=\,\sup\nolimits_{x,y\in\mathbb{R}^{d}:|x-y|\le r}\left|\widetilde{G}(x)-\widetilde{G}(y)\right|<\infty \end{align*}
    for some $r\in(0,1]$ .
  2. (A4) For some $m,\tilde{m},b,\tilde{b}>0$ , for all $x\in\mathbb{R}^{d}$ ,

    \begin{align*} \left\langle x,\nabla U\left(x\right)\right\rangle \ge m\left|x\right|^{2}-b,\ \left\langle x,\widetilde{G}\left(x\right)\right\rangle \ge \tilde{m}\left|x\right|^{2}-\tilde{b}. \end{align*}

Remark 2.1. The boundedness of the moduli of continuity in Assumption (A3) is equivalent to the boundedness for all $r>0$ or for some $r>0$ ; see Lemma 4.3. Note that we allow $\lim_{r\downarrow0}\omega_{\nabla U}(r)\neq0$ and $\lim_{r\downarrow0}\omega_{\widetilde{G}}(r)\neq0$ .

Under (A1) and (A3), we can define the mollification

\begin{align*} \nabla \bar{U}_{r}(x)\,:\!=\,\nabla \left(U\ast \rho_{r}\right)(x)=\nabla \int_{\mathbb{R}^{d}} U\left(y\right)\rho_{r}\left(x-y\right)\text{d} y=\left(\nabla U\ast \rho_{r}\right)(x),\end{align*}

where the last equality holds since (A1) gives the essential boundedness of U and $\nabla U$ on any compact sets and we can approximate $\rho_{r}\in C_{0}^{1}(\mathbb{R}^{d})\cap W^{1,1}(\mathbb{R}^{d})$ with some $\{\varphi_{n}\}\subset\mathcal{C}_{0}^{\infty}(\mathbb{R}^{d})$ . Note that $\bar{U}_{r}\in \mathcal{C}^{2}(\mathbb{R}^{d})$ with $\nabla^{2}\bar{U}_{r}=\nabla U\ast \nabla\rho_{r}$ by this discussion (see Lemma 4.7). We assume quadratic growth of the bias of G with respect to $\nabla \bar{U}_{r}$ and the variance as well.

  1. (A5) For some $\bar{\delta}>0$ and $\delta_{r}\,:\!=\,(\delta_{\mathbf{b},r,0},\delta_{\mathbf{b},r,2},\delta_{\mathbf{v},0},\delta_{\mathbf{v},2})\in[0,\bar{\delta}]^{4}$ , for almost all $x\in\mathbb{R}^{d}$ ,

    \begin{align*} \left|\widetilde{G}(x)-\nabla \bar{U}_{r}(x)\right|^{2}&\le 2\delta_{\mathbf{b},r,2}\left|x\right|^{2}+2\delta_{\mathbf{b},r,0},\\ E\left[\left|{G}\left(x,a_{0}\right)-\widetilde{G}(x)\right|^{2}\right]&\le 2\delta_{\mathbf{v},2}\left|x\right|^{2}+2\delta_{\mathbf{v},0}. \end{align*}

For brevity, we use the notation $\delta_{r,i}=\delta_{\mathbf{b},r,i}+\delta_{\mathbf{v},i}$ for both $i=0,2$ . We also give the condition on the initial value $\xi$ .

  1. (A6) The initial value $\xi$ has the law $\mu_{0}(\text{d} x)=\left(\int_{\mathbb{R}^{d}} \exp(-\Psi(x))\text{d} x\right)^{-1}\exp(\! - \! \Psi(x))\text{d} x$ with $\Psi:\mathbb{R}^{d}\to[0,\infty)$ and $\psi_{0},\psi_{2}>0$ such that $(2\vee \beta(|\nabla U(\mathbf{0})|+\omega_{\nabla U}(1)))|x|^{2}-\psi_{0}\le \Psi(x)\le \psi_{2}|x|^{2}+\psi_{0}$ for all $x\in\mathbb{R}^{d}$ .

Assumption (A6) yields

\begin{align*} \kappa_{0}\,:\!=\,\log\int_{\mathbb{R}^{d}}\mathrm{e}^{|x|^{2}}\mu_{0}(\text{d} x)<\infty.\end{align*}

Let $\mu_{t}$ , $t\ge0$ , denote the probability measure of $Y_{t}$ . The following theorem gives an upper bound for the 2-Wasserstein distance between $\mu_{k\eta}$ and $\pi$ ; its proof is given in Section 5.

Theorem 2.1 (Error estimate of general SG-LMC.) Assume (A1)–(A6) and $\eta\in(0,1\wedge(\tilde{m}/2((\omega_{\widetilde{G}}(1))^{2}+\delta_{\mathbf{v},2}))]$ . Then for any $r\in(0,1]$ and $k\in\mathbb{N}$ ,

\begin{align*} \mathcal{W}_{2}\left(\mu_{k\eta},\pi\right)\le &2C_{1}\left.\left(x^{1/2}+ x^{1/4}\right)\right|_{x=f(\delta_{r},r,k,\eta)}+\sqrt{2C_{2}c_\textrm{P}(\bar{\pi}^{r})}\mathrm{e}^{-k\eta/2\beta c_\textrm{P}\left(\bar{\pi}^{r}\right)}, \end{align*}

where f is the function defined as

\begin{align*} f(\delta_{r},r,k,\eta)\,:\!=\,\left(C_{0}\frac{\omega_{\nabla U}(r)}{r}\eta+\beta\left(\delta_{r,2}\kappa_{\infty}+\delta_{r,0}\right)\right)k\eta+\beta r\omega_{\nabla U}(r), \end{align*}

$C_{0},C_{1},C_{2},\kappa_{\infty}>0$ are the positive constants defined as

\begin{align*} C_{0}&\,:\!=\,\left(d+2\right)\left(\frac{\beta}{3}\left(\left\|\widetilde{G}\right\|_{\mathbb{M}}^{2}+\delta_{\mathbf{v},0}+\left((\omega_{\widetilde{G}}(1))^{2}+\delta_{\mathbf{v},2}\right)\kappa_{\infty}\right)+\frac{d}{2}\right),\\ C_{1}&\,:\!=\,2\sqrt{4\kappa_{0}+\frac{16(\bar{b}+d/\beta)}{m}+\frac{10}{1\wedge (\beta m/4)}},\\ C_{2}&\,:\!=\,3^{\beta b/2}\left(\frac{3\psi_{2}}{m\beta}\right)^{d/2}\exp\left(\beta\left(2\left\|\nabla U\right\|_{\mathbb{M}}+U_{0}\right)+2\psi_{0}\right),\\ \kappa_{\infty}&\,:\!=\,\kappa_{0}+2\left(1\vee \frac{1}{\tilde{m}}\right)\left(\tilde{b}+\left\|\widetilde{G}\right\|_{\mathbb{M}}^{2}+\delta_{\mathbf{v},0}+\frac{d}{\beta}\right), \end{align*}

$\bar{b}\,:\!=\,b+\omega_{\nabla U}(1)$ , $U_{0}\,:\!=\,\left\|U\right\|_{L^{\infty}(B_{1}(\mathbf{0}))}$ , $\|\nabla U\|_{\mathbb{M}},\|\widetilde{G}\|_{\mathbb{M}}>0$ are the positive constants defined as

\begin{align*} \left\|\nabla U\right\|_{\mathbb{M}}\,:\!=\,\left|\nabla U(\mathbf{0})\right|+\omega_{\nabla U}(1),\quad \left\|\widetilde{G}\right\|_{\mathbb{M}}\,:\!=\,\left|\widetilde{G}(\mathbf{0})\right|+\omega_{\widetilde{G}}(1), \end{align*}

$c_\textrm{P}(\bar{\pi}^{r})>0$ is the constant of a Poincaré inequality for $\bar{\pi}^{r}(\text{d} x)\propto \exp(\! - \! \beta\bar{U}_{r}(x))\text{d} x$ such that

\begin{align*} c_\textrm{P}(\bar{\pi}^{r}) &\le \frac{2}{m\beta\left(d+\bar{b}\beta\right)}+\frac{4a\left(d+\bar{b}\beta\right)}{m\beta}\exp\left(\beta\left(\frac{3}{2}\left\|\nabla U\right\|_{\mathbb{M}}\left(1+\frac{4\left(d+\bar{b}\beta\right)}{m\beta}\right)+U_{0}\right)\right), \end{align*}

and $a>0$ is a positive absolute constant.

2.2. Concise representation of Theorem 2.1

Since the constants and the upper bounds for some of them in Theorem 2.1 depend on various parameters, we give a concise representation of the result for the error analyses below. Assume that $f(\delta_{r},r,k,\eta)\le 1$ and $\eta\in(0,1\wedge (\tilde{m}/2((\omega_{\widetilde{G}}(1))^{2}+\delta_{\mathbf{v},2}))]$ and note that Lemma 4.10 and the perturbation theory [Reference Bakry, Gentil and Ledoux5] yield that $\exp(\! -2 \! \omega_{\nabla U}(1))c_\textrm{P}(\bar{\pi}^{r})\le c_{\text{P}}(\pi)\le \exp(2\omega_{\nabla U}(1))c_\textrm{P}(\bar{\pi}^{r})$ for any $r\in(0,1]$ , where $c_{\text{P}}(\pi)$ is the Poincaré constant of $\pi$ . We then obtain that for some $C\ge 1$ independent of $\delta_{r},r,k,\eta,d,c_{\text{P}}(\pi)$ ,

\begin{align*} \mathcal{W}_{2}\left(\mu_{k\eta},\pi\right)&\le C\sqrt{d}\sqrt[4]{\left(d^{2}\frac{\omega_{\nabla U}(r)}{r}\eta+d\delta_{r,2}+\delta_{r,0}\right)k\eta+r\omega_{\nabla U}(r)}\\ &\quad+\mathrm{e}^{Cd}\exp\left(-\frac{k\eta}{Cc_{\text{P}}(\pi)}\right).\end{align*}

Remark 2.2. While $c_{\text{P}}(\pi)=\mathcal{O}(\exp(\mathcal{O}(d)))$ in general, there are some known structures to relax the dependence on dimension.

  1. (i) If U is $\lambda$ -strongly convex with $\lambda>0$ , then $c_{\text{P}}(\pi)\le 1/(\beta\lambda)$ .

  2. (ii) (Perturbation theory; see [Reference Bakry, Gentil and Ledoux5]) If $U=F+V$ with essentially bounded F and $\lambda$ -strongly convex V with $\lambda>0$ , then $c_{\text{P}}(\pi)\le \exp(2\beta \|F\|_{L^{\infty}})/(\beta\lambda)=\mathcal{O}(1)$ .

  3. (iii) (Miclo’s trick; see [Reference Bardet, Gozlan, Malrieu and Zitt6]) If $U=U_{l}+U_{c}$ with M-Lipschitz continuous $U_{l}$ with $M>0$ and $\lambda$-strongly convex $U_{c}\in\mathcal{C}^{2}$ with $\lambda>0$ , then $c_{\text{P}}(\pi)\le (4/(\beta\lambda))\exp(4\beta M^{2}\sqrt{2d}/(\lambda\sqrt{\pi}))=\mathcal{O}(\exp(\mathcal{O}(\sqrt{d})))$ .

2.3. Discussion on Theorem 2.1

We note a potential direction to improve the bound. Some recent studies (e.g., [Reference Altschuler and Talwar1, Reference Vempala and Wibisono33]) directly evaluate the divergence of algorithms from the target distribution or their invariant distributions and successfully yield tight bounds. Hence, extension of their analysis to our problem can improve our results; for instance, if we obtain a tight bound for the Kullback–Leibler divergence of SG-LMC from $\pi$ , and $\pi$ satisfies a logarithmic Sobolev inequality, then we can combine them with Talagrand’s transportation-entropy inequality and yield a tighter bound for the 2-Wasserstein distance. To extend the method of [Reference Vempala and Wibisono33], which is suitable for our non-convex set-up, we need to rigorously validate the exchange of limits and integration by parts. We leave it as an open problem as its reasonable sufficient conditions are not very obvious (in particular, we need to examine the growth of $E[G(Y_{\lfloor t/\eta\rfloor\eta},a_{\lfloor t/\eta\rfloor\eta})|Y_{t}]$ and the density of the law of $Y_{t}$ in $|Y_{t}|$ or sufficient conditions on $a_{i\eta}$ to control their growth).

Our proof strategy, taken mainly from [Reference Raginsky, Rakhlin and Telgarsky31], successfully avoids the technical difficulty appearing in the direct comparison. We obtain our bound by the triangle inequality for the 2-Wasserstein distance, separately evaluating the distances between (i) the laws of SG-LMC and a Langevin process, (ii) the law of the Langevin process and its invariant measure, and (iii) the invariant measure and the target measure. A dependence on $f^{1/4}$ is inevitable as long as we compare SG-LMC and the Langevin process via the Kullback–Leibler divergence and use the weighted Csiszár–Kullback–Pinsker inequality for the 2-Wasserstein distance [Reference Bolley and Villani7], which are tight for general purposes. This dependence could be largely relaxed if we could give a direct comparison of the algorithm with the target or invariant measure as in [Reference Vempala and Wibisono33] or [Reference Altschuler and Talwar1].

3. Sampling complexities of Langevin-type algorithms

We analyse LMC and the SS-LMC and SS-SG-LMC algorithms to show their sampling complexities for achieving $\mathcal{W}_{2}(\mu_{k\eta},\pi)\le \epsilon$ with arbitrary $\epsilon>0$ . We also discuss zeroth-order versions of SS-LMC and SS-SG-LMC.

3.1. Analysis of the LMC algorithm for U of class $\mathcal{C}^{1}$ with uniformly continuous gradient

We examine the LMC algorithm for U with uniformly continuous gradient, that is, $\omega_{\nabla U}(r)\to 0$ as $r\to0$ . Under the LMC algorithm, we use $G=\nabla U$ and thus $\widetilde{G}=\nabla U$ . Therefore, the bias–variance decomposition in (A5) is given as $\delta_{\mathbf{b},r,0}=(\omega_{\nabla U}(r))^{2}/2$ , $\delta_{\mathbf{b},r,2}=\delta_{\mathbf{v},0}=\delta_{\mathbf{v},2}=0$ by Lemma 4.9 below.

In this section we make the following assumptions.

  1. (B1) $U\in\mathcal{C}^{1}(\mathbb{R}^{d})$ .

  2. (B2) $\nabla U$ is uniformly continuous, that is, the modulus of continuity $\omega_{\nabla U}$ defined as

    \begin{align*} \omega_{\nabla U}\left(r\right)\,:\!=\,\sup_{x,y\in\mathbb{R}^{d}:|x-y|\le r}\left|\nabla U(x)-\nabla U(y)\right|<\infty \end{align*}
    with $r\ge0$ is continuous at zero.
  3. (B3) There exist $m,b>0$ such that for all $x\in\mathbb{R}^{d}$ ,

    \begin{align*} \left\langle x,\nabla U(x)\right\rangle \ge m\left|x\right|^{2}-b. \end{align*}

(A1)–(A4) follow immediately from (B1)–(B3); therefore, we obtain the following corollary.

Corollary 3.1. (Error estimate of LMC.) Under (B1)–(B3) and (A6), there exists a constant $C\ge1$ independent of $r,k,\eta,d,c_{\text{P}}(\pi)$ such that for all $k\in\mathbb{N}$ , $\eta\in(0,1\wedge (m/2(\omega_{\nabla U}(1))^{2})]$ , and $r\in(0,1]$ with $\left(d^{2}(\omega_{\nabla U}(r)/r)\eta+(\omega_{\nabla U}(r))^{2}\right)k\eta+r\omega_{\nabla U}(r)\le 1$ ,

\begin{align*} \mathcal{W}_{2}\left(\mu_{k\eta},\pi\right)&\le C\sqrt{d}\sqrt[4]{\left(d^{2}\frac{\omega_{\nabla U}(r)}{r}\eta+\left(\omega_{\nabla U}\left(r\right)\right)^{2}\right)k\eta+r\omega_{\nabla U}(r)}\\ &\quad+\mathrm{e}^{Cd}\exp\left(-\frac{k\eta}{Cc_{\text{P}}(\pi)}\right).\end{align*}

3.1.1. The sampling complexity of the LMC algorithm.

We present propositions on the sampling complexity needed to achieve the approximation $\mathcal{W}_{2}(\mu_{k\eta},\pi)\le \epsilon$ for arbitrary $\epsilon>0$ . Define a generalized inverse of $\omega_{\nabla U}$ as follows: for any $s>0$ ,

\begin{align*} \omega_{\nabla U}^{\dagger}\left(s\right)\,:\!=\,\sup\left\{r\ge 0:\omega_{\nabla U}(r)\le s\right\}.\end{align*}

The continuity of $\nabla U$ under (B2) along with the monotonicity of $\omega_{\nabla U}$ gives $\omega_{\nabla U}(\omega_{\nabla U}^{\dagger}(s))= s$ . We also define a generalized inverse of $r\mapsto r\omega_{\nabla U}(r)$ as follows: for all $s>0$ ,

\begin{align*} \iota\left(s\right)\,:\!=\,\sup\left\{r\ge 0:r\omega_{\nabla U}(r)\le s\right\}.\end{align*}

The following proposition yields the sampling complexity using this generalized inverse.

Proposition 3.1. Assume that (B1)–(B3) and (A6) hold and fix $\epsilon\in(0,1]$ . We set $\bar{r}_{1},\bar{r}_{2}>0$ such that

\begin{align*} \bar{r}_{1}\,:\!=\,\omega_{\nabla U}^{\dagger}\left(\sqrt{\frac{\epsilon^{4}}{48C^{4}d^{2}\left(Cc_{\text{P}}(\pi)\left(\log\left(2/\epsilon\right)+Cd\right)+1\right)}}\right),\quad \bar{r}_{2}\,:\!=\,\iota\left(\frac{\epsilon^{4}}{48C^{4}d^{2}}\right). \end{align*}

If $r= \bar{r}_{1}\wedge \bar{r}_{2}$ and

\begin{align*} \eta\le 1&\wedge \frac{m}{2\left(\omega_{\nabla U}(1)\right)^{2}}\wedge \left(\frac{r}{\omega_{\nabla U}(r)}\frac{\epsilon^{4}}{48C^{4}d^{4}\left(Cc_{\text{P}}(\pi)\left(\log\left(2/\epsilon\right)+Cd\right)+1\right)}\right), \end{align*}

then $\mathcal{W}_{2}\left(\mu_{k\eta},\pi\right)\le \epsilon$ with $k=\lceil Cc_{\text{P}}(\pi)\left(\log(2/\epsilon)+Cd\right)/\eta\rceil$ .

Proof. We just need to confirm

\begin{align*} \max\left\{d^{2}\frac{\omega_{\nabla U}\left(r\right)}{r}k\eta^{2},\left(\omega_{\nabla U}\left(r\right)\right)^{2}k\eta,r\omega_{\nabla U}\left(r\right)\right\}\le \frac{\epsilon^{4}}{48C^{4}d^{2}},\quad \mathrm{e}^{Cd}\exp\left(-\frac{k\eta}{Cc_{\text{P}}(\pi)}\right)\le \frac{\epsilon}{2}.\end{align*}

The bound $r\omega_{\nabla U}\left(r\right)\le \epsilon^{4}/48C^{4}d^{2}$ is immediate. Since $\eta\le 1$ , we have

\begin{align*} Cc_{\text{P}}(\pi)\left(\log\left(2/\epsilon\right)+Cd\right)\le k\eta\le Cc_{\text{P}}(\pi)\left(\log\left(2/\epsilon\right)+Cd\right)+1,\end{align*}

and the other bounds also hold.

We can apply Proposition 3.1 to the analysis of the sampling complexity of LMC with $\alpha$ -mixture weakly smooth gradients [Reference Chatterji, Diakonikolas, Jordan and Bartlett10, Reference Nguyen29]. Assume that there exist $M>0$ and $\alpha\in(0,1]$ such that for all $x,y\in\mathbb{R}^{d}$ ,

\begin{align*} \left|\nabla U(x)-\nabla U(y)\right|\le M\left(\left|x-y\right|^{\alpha}\vee\left|x-y\right|\right),\end{align*}

which is a weaker assumption than both $\alpha$ -Hölder continuity and Lipschitz continuity. This allows the gradient $\nabla U(x)$ to be at most of linear growth, while $\alpha$ -Hölder continuity with $\alpha\in(0,1)$ lets the gradient be at most of sublinear growth. Since $\omega_{\nabla U}(r)\le M(r^{\alpha}\vee r)$ for all $r\ge0$ , we have $\omega_{\nabla U}^{\dagger}(s)\ge (s/M)^{1/\alpha}$ for $s\in(0,1/M]$ . Rough estimates of $r/\omega_{\nabla U}(r)$ by the inequalities $r/\omega_{\nabla U}(r)\ge r^{1-\alpha}/M$ , $\omega_{\nabla U}^{\dagger}(s)/\omega_{\nabla U}(\omega_{\nabla U}^{\dagger}(s))= \omega_{\nabla U}^{\dagger}(s)/s\ge s^{1/\alpha-1}/M^{1/\alpha}$ , $\iota(s)\ge(s/M)^{1/(1+\alpha)}$ , and $\iota(s)/\omega_{\nabla U}(\iota(s))\ge \iota(s)^{1-\alpha}/M\ge s^{(1-\alpha)/(1+\alpha)}/M^{2/(1+\alpha)}$ for sufficiently small $r,s>0$ yield the sampling complexity

\begin{align*} k=\mathcal{O}\left(\frac{d^{4}c_{\text{P}}(\pi)^{2}\left(\log\epsilon^{-1}+d\right)^{2}}{\epsilon^{4}}\left(\left(\frac{d^{2}c_{\text{P}}(\pi)\left(\log\epsilon^{-1}+d\right)}{\epsilon^{4}}\right)^{\frac{1-\alpha}{2\alpha}}\vee \left(\frac{d^{2}}{\epsilon^{4}}\right)^{\frac{1-\alpha}{1+\alpha}}\right)\right).\end{align*}
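As a numerical sanity check of the estimates above, the generalized inverses can be computed by bisection for the modulus $\omega_{\nabla U}(r)=M(r^{\alpha}\vee r)$ . The sketch below (with an illustrative helper `gen_inverse` and illustrative values of M and $\alpha$ ) verifies the stated lower bounds $\omega_{\nabla U}^{\dagger}(s)\ge(s/M)^{1/\alpha}$ and $\iota(s)\ge(s/M)^{1/(1+\alpha)}$ .

```python
def gen_inverse(f, s, hi=10.0):
    """Largest r in [0, hi] with f(r) <= s, for nondecreasing f (bisection sketch)."""
    lo = 0.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if f(mid) <= s:
            lo = mid
        else:
            hi = mid
    return lo

M, alpha = 2.0, 0.5
omega = lambda r: M * max(r ** alpha, r)      # omega_{grad U}(r) = M (r^alpha v r)
s = 0.1
r_dagger = gen_inverse(omega, s)              # approximates omega^dagger(s)
r_iota = gen_inverse(lambda r: r * omega(r), s)   # approximates iota(s)
```

Both bisection outputs converge to the true generalized inverses from below, so they respect the lower bounds up to numerical tolerance.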

3.2. The spherically smoothed Langevin Monte Carlo algorithm

We consider a stochastic gradient G unbiased for $\nabla \bar{U}_{r}$ with fixed $r\in(0,1]$ such that the sampling error can be sufficiently small.

Note that $\rho$ is the density of a random variable which we can generate as the product of a random variable following the uniform distribution on $S^{d-1}=\{x\in\mathbb{R}^{d} \, : \, |x|=1\}$ and the square root of an independent random variable following the beta distribution $\textrm{Beta}(d/2,3)$ . Therefore, we can consider spherical smoothing with random variables whose density is $\rho_{r}$ as an analogue of the Gaussian smoothing of [Reference Chatterji, Diakonikolas, Jordan and Bartlett10].
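The generation procedure just described can be sketched as follows (standard-library Python; the helper name is ours): a uniform direction on $S^{d-1}$ is obtained by normalizing a standard Gaussian vector, and the radial part is the square root of an independent $\textrm{Beta}(d/2,3)$ variable, scaled by r.

```python
import math
import random

def sample_rho_r(d, r, rng=random):
    """Draw from rho_r: r * (uniform direction on S^{d-1}) * sqrt(Beta(d/2, 3))."""
    g = [rng.gauss(0.0, 1.0) for _ in range(d)]        # isotropic Gaussian vector
    norm = math.sqrt(sum(v * v for v in g))
    direction = [v / norm for v in g]                  # uniform on the unit sphere
    radius = math.sqrt(rng.betavariate(d / 2.0, 3.0))  # radial part in [0, 1]
    return [r * radius * v for v in direction]

z = sample_rho_r(d=5, r=0.1, rng=random.Random(1))
```

By construction the output is supported in the closed ball of radius r.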

Set the stochastic gradient

\begin{align*} {G}\left(x,a_{i\eta}\right)=\frac{1}{N_{B}}\sum_{j=1}^{N_{B}}\nabla U\left(x+r'\zeta_{i,j}\right),\end{align*}

where $N_{B}\in\mathbb{N}$ , $r'\in(0,1]$ , $a_{i\eta}=[\zeta_{i,1},\ldots,\zeta_{i,N_{B}}]$ , and $\{\zeta_{i,j}\}$ is a sequence of i.i.d. random variables with the density $\rho$ . Then for any $x\in\mathbb{R}^{d}$ , $E[{G}(x,a_{i\eta})]=\nabla \bar{U}_{r'}(x)$ ,

\begin{align*} &E\left[\left|{G}\left(x,a_{i\eta}\right)-\nabla \bar{U}_{r'}(x)\right|^{2}\right]\\ &\quad=\frac{1}{N_{B}}\int _{\mathbb{R}^{d}}\left|\nabla U(x-y)-\nabla \bar{U}_{r'}(x)\right|^{2}\rho_{r'}(y)\text{d} y\\ &\quad\le \frac{1}{N_{B}}\int_{\mathbb{R}^{d}}\int_{\mathbb{R}^{d}} \left|\nabla U(x-y)-\nabla U(x-z)\right|^{2}\rho_{r'}(y)\rho_{r'}(z)\text{d} y\text{d} z\\ &\quad\le \frac{1}{N_{B}}\int_{\mathbb{R}^{d}}\int_{\mathbb{R}^{d}} \left(\left|\nabla U(x-y)-\nabla U(x)\right|+\left|\nabla U(x-z)-\nabla U(x)\right|\right)^{2}\rho_{r'}(y)\rho_{r'}(z)\text{d} y\text{d} z\\ &\quad\le \frac{\left(2\omega_{\nabla U}(r')\right)^{2}}{N_{B}}\end{align*}

by Jensen's inequality, and (A5) holds if $\nabla \bar{U}_{r'}(x)$ is well defined and $\omega_{\nabla U}(r')$ is finite.

The main idea is to let $r'=r$ , where r is the radius of the implicit mollification and $r'$ is the radius of the support of the random noise, which we control. Hence, the stochastic gradient G with $r'=r$ is an unbiased estimator of the mollified gradient $\nabla \bar{U}_{r}(x)$ . We call the algorithm with this G the spherically smoothed Langevin Monte Carlo algorithm. We can make the sampling error of SS-LMC arbitrarily close to zero by taking a sufficiently small r.
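Combining the noise sampler with the mini-batch stochastic gradient, one SS-LMC iteration might look like the sketch below (assuming the standard Langevin update with $\beta=1$ ; function and variable names are illustrative, not from the paper).

```python
import math
import random

def ss_lmc_step(x, grad_U, eta, r, n_batch, rng):
    """One SS-LMC step: average grad U at points perturbed by rho_r noise,
    then apply the (assumed) Langevin update with beta = 1."""
    d = len(x)
    g = [0.0] * d
    for _ in range(n_batch):
        gauss = [rng.gauss(0.0, 1.0) for _ in range(d)]
        norm = math.sqrt(sum(v * v for v in gauss))
        radius = math.sqrt(rng.betavariate(d / 2.0, 3.0))
        zeta = [r * radius * v / norm for v in gauss]      # noise with density rho_r
        for i, gi in enumerate(grad_U([xi + zi for xi, zi in zip(x, zeta)])):
            g[i] += gi / n_batch
    scale = math.sqrt(2.0 * eta)
    return [xi - eta * gi + scale * rng.gauss(0.0, 1.0) for xi, gi in zip(x, g)]

# Toy run with U(x) = |x|^2 / 2, whose smoothed gradient is again linear.
rng = random.Random(2)
x = [3.0, 3.0]
for _ in range(1000):
    x = ss_lmc_step(x, lambda v: list(v), eta=0.05, r=0.1, n_batch=4, rng=rng)
```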

3.2.1. Regularity conditions.

Let us set the following assumptions.

  1. (C1) $U\in W_{\text{loc}}^{1,\infty}(\mathbb{R}^{d})$ .

  2. (C2) $|\nabla U(\mathbf{0})|<\infty$ and the modulus of continuity of $\nabla U$ is bounded, that is,

    \begin{align*} \omega_{\nabla U}(r)\,:\!=\,\sup\nolimits_{x,y\in\mathbb{R}^{d}:|x-y|\le r}\left|\nabla U(x)-\nabla U(y)\right|<\infty \end{align*}
    for some $r\in(0,1]$ .
  3. (C3) There exist $m,b>0$ such that for all $x\in\mathbb{R}^{d}$ ,

    \begin{align*} \left\langle x,\nabla U\left(x\right)\right\rangle &\ge m\left|x\right|^{2}-b. \end{align*}

Let us observe that (C1)–(C3) yield (A1)–(A5). Assumption (A1) is the same as (C1). Assumption (C2) yields (A2) by Lemma 4.4 and (A3) by $|\nabla \bar{U}_{r}(\mathbf{0})|\le |\nabla U(\mathbf{0})|+\omega_{\nabla U}(1)$ and $\omega_{\nabla \bar{U}_{r}}(1)\le 3\omega_{\nabla U}(1)<\infty$ . Assumption (A4) also holds since

\begin{align*} \left\langle x,\nabla \bar{U}_{r}\left(x\right)\right\rangle&\ge m|x|^{2}-(b+\omega_{\nabla U}(1));\end{align*}

Section 5 gives the detailed derivation of this inequality. Assumption (A5) is given by (C2) and the discussion above.

3.2.2. Examples of distributions with the regularity conditions.

We show a simple class of potential functions satisfying (C1)–(C3) and some examples in Bayesian inference; assume $\beta=1$ for simplicity of interpretation. Let us consider a possibly non-convex loss with elastic net regularization such that

\begin{align*} U\left(x\right)=L\left(x\right)+\frac{\lambda_{1}}{\sqrt{d}}R_{1}\left(x\right)+\lambda_{2}R_{2}\left(x\right),\end{align*}

where $L \, : \, \mathbb{R}^{d}\to[0,\infty)$ is in $W_{\text{loc}}^{1,\infty}(\mathbb{R}^{d})$ with a weak gradient $\nabla L$ satisfying $\|\nabla L\|_{\infty}<\infty$ , $\lambda_{1}\ge 0$ , $\lambda_{2}>0$ , $R_{1}(x)=\sum_{i=1}^{d}|x^{(i)}|$ with $x^{(i)}$ indicating the ith component of x, and $R_{2}(x)=|x|^{2}$ . Fix a weak gradient of $R_{1}$ as $\nabla R_{1}(x)=\left(\text{sgn}(x^{(1)}),\ldots,\text{sgn}(x^{(d)})\right)$ ; then $\omega_{\nabla U}(1)\le 2(\|\nabla L\|_{\infty}+\lambda_{1}+\lambda_{2})<\infty$ and $\langle x,\nabla U(x)\rangle\ge \lambda_{2}|x|^{2}-\|\nabla L\|_{\infty}^{2}/4\lambda_{2}$ since $\langle x,\nabla R_{1}(x)\rangle \ge 0$ for all $x\in\mathbb{R}^{d}$ . Note that regularization corresponds to the potentials of prior distributions in Bayesian inference; we can regard L(x) as a negative quasi-log-likelihood and $(\lambda_{1}/\sqrt{d})R_{1}(x)+\lambda_{2}R_{2}(x)$ as a combination of a Laplace prior and a Gaussian prior [Reference Cui, Tong and Zahm12].
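As a concrete numerical check of the dissipativity bound claimed above, the following sketch uses an illustrative bounded-gradient loss $L(x)=\sum_{i}\sqrt{1+(x^{(i)})^{2}}$ (so $\|\nabla L\|_{\infty}\le\sqrt{d}$ ) and verifies $\langle x,\nabla U(x)\rangle\ge \lambda_{2}|x|^{2}-\|\nabla L\|_{\infty}^{2}/4\lambda_{2}$ at random points; all names and parameter values are illustrative.

```python
import math
import random

def grad_U(x, lam1, lam2):
    """Weak gradient of U = L + (lam1/sqrt(d)) R1 + lam2 R2 for the illustrative
    loss L(x) = sum_i sqrt(1 + x_i^2), whose gradient has Euclidean norm <= sqrt(d)."""
    d = len(x)
    gL = [xi / math.sqrt(1.0 + xi * xi) for xi in x]
    sgn = [(xi > 0) - (xi < 0) for xi in x]   # a fixed weak gradient of the l1 term
    return [gl + (lam1 / math.sqrt(d)) * s + 2.0 * lam2 * xi
            for gl, s, xi in zip(gL, sgn, x)]

lam1, lam2, d = 0.5, 1.0, 3
sup_grad_L = math.sqrt(d)
rng = random.Random(3)
for _ in range(1000):
    x = [rng.uniform(-10, 10) for _ in range(d)]
    inner = sum(xi * gi for xi, gi in zip(x, grad_U(x, lam1, lam2)))
    assert inner >= lam2 * sum(xi * xi for xi in x) - sup_grad_L ** 2 / (4 * lam2) - 1e-9
```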

Non-convex losses with bounded weak gradients often appear in nonlinear and robust regression. We first examine a squared loss for nonlinear regression (or equivalently nonlinear regression with Gaussian errors) such that $L_\textrm{NLR}(x)=\frac{1}{2\sigma^{2}}\sum_{\ell=1}^{N}\left(y_{\ell}-\phi_{\ell}\left(x\right)\right)^{2}$ , where $N\in\mathbb{N}$ , $\sigma>0$ is fixed, $y_{\ell}\in\mathbb{R}$ , and $\phi_{\ell}\in W_{\text{loc}}^{1,\infty}(\mathbb{R}^{d})$ with $\|\phi_{\ell}\|_{\infty}+\|\nabla \phi_{\ell}\|_{\infty}<\infty$ for some $\nabla \phi_{\ell}$ (e.g., a two-layer neural network with clipped rectified linear unit activation such that $\phi_{\ell}(x)=(1/W)\sum_{w=1}^{W}a_{w}\varphi_{[0,c]}(\langle x_{w},f_{\ell}\rangle)$ , where $\varphi_{[0,c]}(t)=(0\vee t)\wedge c$ with $t\in\mathbb{R}$ , $a_{w}\in\{-1,1\}$ and $c>0$ are fixed, $f_{\ell}\in\mathbb{R}^{F}$ , $x=(x_{1},\ldots,x_{W})\in \mathbb{R}^{FW}$ , $F,W\in\mathbb{N}$ , and $d=FW$ ). This $L_\textrm{NLR}$ indeed satisfies $\|\nabla L_\textrm{NLR}\|_{\infty}\le \sum_{\ell=1}^{N}(|y_{\ell}|+\|\phi_{\ell}\|_{\infty})\|\nabla \phi_{\ell}\|_{\infty}/\sigma^{2}<\infty$ . Another example is a Cauchy loss for robust linear regression (or equivalently linear regression with Cauchy errors) such that $L_\textrm{RLR}(x)=\sum_{\ell=1}^{N}\log(1+|y_{\ell}-\langle f_{\ell},x\rangle|^{2}/\sigma^{2})$ , where $N\in\mathbb{N}$ , $\sigma>0$ is fixed, $y_{\ell}\in\mathbb{R}$ , and $f_{\ell}\in\mathbb{R}^{d}$ . The fact $|\frac{\text{d}}{\text{d} t}\log(1+t^{2}/\sigma^{2})|=|2t/(t^{2}+\sigma^{2})|\le 1/\sigma$ for all $t\in\mathbb{R}$ yields $\|\nabla L_\textrm{RLR}\|_{\infty}\le \sum_{\ell=1}^{N}|f_{\ell}|/\sigma<\infty$ .
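The derivative bound used for the Cauchy loss, $|\frac{\text{d}}{\text{d} t}\log(1+t^{2}/\sigma^{2})|=|2t/(t^{2}+\sigma^{2})|\le 1/\sigma$ with the maximum attained at $t=\pm\sigma$ , is easy to check numerically (the value of $\sigma$ below is illustrative).

```python
sigma = 0.7
# d/dt log(1 + t^2/sigma^2) = 2t / (t^2 + sigma^2)
dlog = lambda t: 2.0 * t / (t * t + sigma * sigma)
max_abs = max(abs(dlog(0.01 * k)) for k in range(-10000, 10001))
```

The grid contains $t=\pm\sigma$ , so the maximum matches $1/\sigma$ to floating-point accuracy.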

3.2.3. Error estimate and sampling complexity of SS-LMC.

We now give our error estimate for SS-LMC.

Corollary 3.2. (Error estimate of SS-LMC.) Under (C1)–(C3) and (A6), there exists a constant $C\ge 1$ independent of $N_{B},r,k,\eta,d,c_{\text{P}}(\pi)$ such that for all $k\in\mathbb{N}$ , $\eta\in(0,1\wedge (m/(4(\omega_{\nabla U}(1))^{2}))]$ , $r\in(0,1]$ , and $N_{B}\in\mathbb{N}$ with $(d^{2}(\omega_{\nabla U}(r)/r)\eta+(\omega_{\nabla U}(r))^{2}/N_{B})k\eta+r\omega_{\nabla U}(r)\le 1$ ,

\begin{align*} \mathcal{W}_{2}\left(\mu_{k\eta},\pi\right)&\le C\sqrt{d}\sqrt[4]{\left(d^{2}\frac{\omega_{\nabla U}(r)}{r}\eta+\frac{(\omega_{\nabla U}(r))^{2}}{N_{B}}\right)k\eta+r\omega_{\nabla U}(r)}\\ &\quad+\mathrm{e}^{Cd}\exp\left(-\frac{k\eta}{Cc_{\text{P}}(\pi)}\right). \end{align*}

For convenience, we also give a rougher version of the error estimate, obtained by replacing $\omega_{\nabla U}(r)$ with the constant $\omega_{\nabla U}(1)$ ; it shows that the convergence $\omega_{\nabla U}(r)\downarrow 0$ is unnecessary.

Corollary 3.3. Under (C1)–(C3) and (A6), there exists a constant $C\ge1$ independent of $N_{B}$ , r, k, $\eta$ , d, and $c_{\text{P}}(\pi)$ such that for all $N_{B}\in\mathbb{N}$ , $k\in\mathbb{N}$ , $\eta\in(0,1\wedge (m/(4(\omega_{\nabla U}(1))^{2}))]$ , and $r\in(0,1]$ with $\left(d^{2}r^{-1}\eta+N_{B}^{-1}\right)k\eta+r\le 1$ ,

\begin{align*} \mathcal{W}_{2}\left(\mu_{k\eta},\pi\right)&\le C\sqrt{d}\sqrt[4]{\left(d^{2}r^{-1}\eta+N_{B}^{-1}\right)k\eta+r}+\mathrm{e}^{Cd}\exp\left(-\frac{k\eta}{Cc_{\text{P}}(\pi)}\right).\end{align*}

We obtain the following estimate of the sampling complexity; the proof is identical to that of Proposition 3.1.

Proposition 3.2. Assume (C1)–(C3) and (A6) and fix $\epsilon\in(0,1]$ . If $r=\epsilon^{4}/48C^{4}d^{2}$ , $N_{B}\ge 48C^{4}d^{2}(Cc_{\text{P}}(\pi)(\log(2/\epsilon)+Cd)+1)/\epsilon^{4}$ , and $\eta$ satisfies

\begin{align*} \eta\le 1&\wedge \frac{m}{4\left(\omega_{\nabla U}(1)\right)^{2}}\wedge \frac{r\epsilon^{4}}{48C^{4}d^{4}(Cc_{\text{P}}(\pi)(\log(2/\epsilon)+Cd)+1)}, \end{align*}

then $\mathcal{W}_{2}\left(\mu_{k\eta},\pi\right)\le \epsilon$ for $k=\lceil Cc_{\text{P}}(\pi)(\log(2/\epsilon)+Cd)/\eta\rceil$ .

Since the complexities of $N_{B}$ and k are given as $N_{B}=\mathcal{O}(d^{2}c_{\text{P}}(\pi)(\log\epsilon^{-1}+d)/\epsilon^{4})$ and $k=\mathcal{O}(d^{6}c_{\text{P}}(\pi)^{2}(\log\epsilon^{-1}+d)^{2}/\epsilon^{8})$ , we obtain the sampling complexity of SS-LMC as $N_{B}k=\mathcal{O}(d^{8}c_{\text{P}}(\pi)^{3}(\log\epsilon^{-1}+d)^{3}/\epsilon^{12})$ or $N_{B}k=\widetilde{\mathcal{O}}(d^{11}c_{\text{P}}(\pi)^{3}/\epsilon^{12})$ , where $\widetilde{\mathcal{O}}$ ignores logarithmic factors.

3.3. The spherically smoothed stochastic gradient Langevin Monte Carlo algorithm

We consider a sampling algorithm for potentials such that for some $N\in\mathbb{N}$ ,

(5) \begin{align} U\left(x\right)=\frac{1}{N}\sum_{\ell=1}^{N}U_{\ell}\left(x\right),\end{align}

where $U_{\ell}(x)$ are non-negative functions with the following assumptions.

  1. (D1) For all $\ell=1,\ldots,N$ , $U_{\ell}\in W_{\text{loc}}^{1,\infty}(\mathbb{R}^{d})$ .

  2. (D2) $|\nabla U_{\ell}(\mathbf{0})|<\infty$ for all $\ell=1,\ldots,N$ and there exists a function $\hat{\omega} \, : \, [0,\infty)\to[0,\infty)$ such that for all $r\in(0,1]$ and $\ell=1,\ldots,N$ ,

    \begin{align*} \sup_{x,y\in\mathbb{R}^{d}:|x-y|\le r}\left|\nabla U_{\ell}(x)-\nabla U_{\ell}(y)\right|\le \hat{\omega}\left(r\right)<\infty. \end{align*}
  3. (D3) There exist $m,b>0$ such that for all $x\in\mathbb{R}^{d}$ ,

    \begin{align*} \left\langle x,\nabla U\left(x\right)\right\rangle &\ge m\left|x\right|^{2}-b. \end{align*}

We define the stochastic gradient

\begin{align*} {G}\left(x,a_{i\eta}\right)=\frac{1}{N_{B}}\sum_{j=1}^{N_{B}}\nabla U_{\lambda_{i,j}}\left(x+r'\zeta_{i,j}\right),\end{align*}

where $N_{B}\in\mathbb{N}$ , $r'\in(0,1]$ , $a_{i\eta}=[\lambda_{i,1},\ldots,\lambda_{i,N_{B}},\zeta_{i,1},\ldots,\zeta_{i,N_{B}}]$ , $\{\lambda_{i,j}\}$ is a sequence of i.i.d. random variables with the discrete uniform distribution on the integers $1,\ldots,N$ , and $\{\zeta_{i,j}\}$ is a sequence of i.i.d. random variables with the density $\rho$ , independent of $\{\lambda_{i,j}\}$ . Then for any $x\in\mathbb{R}^{d}$ , we have

$$E[{G}(x,a_{i\eta})]=\nabla \bar{U}_{r'}(x)$$

and

\begin{align*} E\left[\left|{G}\left(x,a_{i\eta}\right)-\nabla \bar{U}_{r'}(x)\right|^{2}\right]&= \frac{1}{N_{B}}E\left[\left|\nabla U_{\lambda_{i,j}}\left(x+r'\zeta_{i,j}\right)-\nabla \bar{U}_{r'}(x)\right|^{2}\right]\\ &\le \frac{1}{N_{B}}E\left[\left|\nabla U_{\lambda_{i,j}}\left(x+r'\zeta_{i,j}\right)\right|^{2}\right]\\ &\le \frac{2}{N_{B}}\max_{\ell=1,\ldots,N}((\omega_{\nabla U_{\ell}}(1))^{2}|x|^{2}+(|\nabla U_{\ell}(\mathbf{0})|+2\omega_{\nabla U_{\ell}}(1))^{2})\end{align*}

by Lemma 4.4. We obtain (A5) with $\delta_{\mathbf{b},r',0}=\delta_{\mathbf{b},r',2}=0$ , $\delta_{\mathbf{v},0}=(\max_{\ell}|\nabla U_{\ell}(\mathbf{0})|+2\hat{\omega}(1))^{2}/N_{B}$ , and $\delta_{\mathbf{v},2}=(\hat{\omega}(1))^{2}/N_{B}$ . We call this algorithm the spherically smoothed stochastic gradient Langevin Monte Carlo algorithm.
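The stochastic gradient above can be sketched as follows (component gradients $\nabla U_{\ell}$ supplied as a list of callables; all names are illustrative). In the toy check, $U=(U_{1}+U_{2})/2$ has $\nabla U(x)=x$ , and the spherical mollification of a linear map is the map itself, so the estimator should average to roughly x.

```python
import math
import random

def ss_sg_gradient(x, grad_U_list, r, n_batch, rng):
    """Mini-batch estimator: for each term pick a uniformly random component
    gradient and an independent rho_r-distributed perturbation, then average."""
    d, n = len(x), len(grad_U_list)
    g = [0.0] * d
    for _ in range(n_batch):
        lam = rng.randrange(n)                              # index in {0, ..., n-1}
        gauss = [rng.gauss(0.0, 1.0) for _ in range(d)]
        norm = math.sqrt(sum(v * v for v in gauss))
        radius = math.sqrt(rng.betavariate(d / 2.0, 3.0))
        point = [xi + r * radius * v / norm for xi, v in zip(x, gauss)]
        for i, gi in enumerate(grad_U_list[lam](point)):
            g[i] += gi / n_batch
    return g

# Two quadratic components; the full gradient of U = (U_1 + U_2)/2 is x itself.
grads = [lambda v: [2.0 * vi for vi in v], lambda v: [0.0 for _ in v]]
rng = random.Random(4)
g = ss_sg_gradient([1.0, 1.0], grads, r=0.05, n_batch=2000, rng=rng)
```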

Assumptions (D1)–(D3) yield (A1)–(A5) with the same discussion as for SS-LMC.

Corollary 3.4. (Error estimate of SS-SG-LMC.) Under (D1)–(D3) and (A6), there exists a constant $C\ge 1$ independent of $N_{B},r,k,\eta,d,c_{\text{P}}(\pi)$ such that for all $k\in\mathbb{N}$ , $\eta\in(0,1\wedge (m/(8(\hat{\omega}(1))^{2}))]$ , $r\in(0,1]$ , and $N_{B}\in\mathbb{N}$ with $(d^{2}(\hat{\omega}(r)/r)\eta+d/N_{B})k\eta+r\hat{\omega}(r)\le 1$ ,

\begin{align*} \mathcal{W}_{2}\left(\mu_{k\eta},\pi\right)\le C\sqrt{d}\sqrt[4]{\left(d^{2}\frac{\hat{\omega}(r)}{r}\eta+\frac{d}{N_{B}}\right)k\eta+r\hat{\omega}(r)}+\mathrm{e}^{Cd}\exp\left(-\frac{k\eta}{Cc_{\text{P}}(\pi)}\right). \end{align*}

We give a rough upper bound by replacing $\hat{\omega}(r)$ with the constant $\hat{\omega}(1)$ as in the discussion on SS-LMC.

Corollary 3.5. Under (D1)–(D3) and (A6), there exists a constant $C\ge1$ independent of $N_{B}$ , r, k, $\eta$ , d, and $c_{\text{P}}(\pi)$ such that for all $N_{B}\in\mathbb{N}$ , $k\in\mathbb{N}$ , $\eta\in(0,1\wedge (m/(8(\hat{\omega}(1))^{2}))]$ , and $r\in(0,1]$ with $\left(d^{2}r^{-1}\eta+dN_{B}^{-1}\right)k\eta+r\le 1$ ,

\begin{align*} \mathcal{W}_{2}\left(\mu_{k\eta},\pi\right)&\le C\sqrt{d}\sqrt[4]{\left(d^{2}r^{-1}\eta+dN_{B}^{-1}\right)k\eta+r}+\mathrm{e}^{Cd}\exp\left(-\frac{k\eta}{Cc_{\text{P}}(\pi)}\right).\end{align*}

Using this estimate, we obtain the following bound on the sampling complexity, which is lower than that of SS-LMC for U given by Eq. (5) if $N>d$ : the cost of computing G in SS-LMC for this U increases by a factor of N, whereas the sampling complexity of SS-SG-LMC deteriorates only by a factor of d in comparison to that of SS-LMC.

Proposition 3.3. Assume (D1)–(D3) and (A6) and fix $\epsilon\in(0,1]$ . If $r=\epsilon^{4}/48C^{4}d^{2}$ , $N_{B}\ge 48C^{4}d^{3}(Cc_{\text{P}}(\pi)(\log(2/\epsilon)+Cd)+1)/\epsilon^{4}$ , and $\eta$ satisfies

\begin{align*} \eta\le 1&\wedge \frac{m}{8\left(\hat{\omega}(1)\right)^{2}}\wedge \frac{r\epsilon^{4}}{48C^{4}d^{4}(Cc_{\text{P}}(\pi)(\log(2/\epsilon)+Cd)+1)}, \end{align*}

then $\mathcal{W}_{2}\left(\mu_{k\eta},\pi\right)\le \epsilon$ for $k=\lceil Cc_{\text{P}}(\pi)(\log(2/\epsilon)+Cd)/\eta\rceil$ .

3.4. Zeroth-order Langevin algorithms

Let us consider a zeroth-order version of SS-LMC as an analogue to [Reference Roy, Shen, Balasubramanian and Ghadimi32] with the following G under (C1)–(C3) and the assumption $|U(x)|<\infty$ for all $x\in\mathbb{R}^{d}$ :

\begin{align*} G(x,a_{i\eta})=\frac{1}{N_{B}}\sum_{j=1}^{N_{B}}G_{j}(x,a_{i\eta})\,:\!=\,\frac{1}{N_{B}}\sum_{j=1}^{N_{B}}\frac{U\left(x+r'\zeta_{i,j}\right)-U\left(x\right)}{r'}\frac{4\zeta_{i,j}}{\left(1-|\zeta_{i,j}|^{2}\right)},\end{align*}

where $N_{B}\in\mathbb{N}$ , $r'\in(0,1]$ , and $\{\zeta_{i,j}\}$ is an i.i.d. sequence of random variables with the density $\rho$ . The fact that

\begin{align*} \frac{U\left(x+r'\zeta_{i,j}\right)-U\left(x\right)}{r'}\frac{4\zeta_{i,j}}{\left(1-|\zeta_{i,j}|^{2}\right)}=\frac{U\left(x+r'\zeta_{i,j}\right)-U\left(x\right)}{r'}\frac{-\nabla \rho\left(\zeta_{i,j}\right)}{\rho\left(\zeta_{i,j}\right)},\end{align*}

the symmetry of $\rho$ , approximation of $\rho\in \mathcal{C}_{0}^{1}(\mathbb{R}^{d})\cap W^{1,1}(\mathbb{R}^{d})$ , and the essential boundedness of U and $\nabla U$ on compact sets by Lemmas 4.4 and 4.6 yield that for all $x\in\mathbb{R}^{d}$ ,

\begin{align*} E\left[G_{j}(x,a_{i\eta})\right] &=\int_{\mathbb{R}^{d}}\frac{U\left(x+r'z\right)-U\left(x\right)}{r'}\frac{-\nabla \rho\left(z\right)}{\rho\left(z\right)}\rho\left(z\right)\text{d} z\\ &=-\int_{\mathbb{R}^{d}}\frac{U\left(x+r'z\right)-U\left(x\right)}{r'}\nabla \rho\left(z\right)\text{d} z\\ &=-\int_{\mathbb{R}^{d}}\left(U\left(x+y\right)-U\left(x\right)\right)\left(\frac{1}{(r')^{d+1}}\nabla \rho\left(\frac{y}{r'}\right)\right)\text{d} y\\ &=\int_{\mathbb{R}^{d}}\nabla U\left(x+y\right)\rho_{r'}\left(y\right)\text{d} y\\ &=\nabla \bar{U}_{r'}\left(x\right).\end{align*}

Lemma 4.5, the convexity of $f(a)=a^{2}$ with $a\in\mathbb{R}$ , and the equality

\begin{align*} \int_{B_{1}(\mathbf{0})}\frac{\left|\nabla \rho\left(z\right)\right|^{2}}{\rho\left(z\right)}\text{d} z &=\frac{16\Gamma(d/2)}{\pi^{d/2}\text{B}(d/2,3)}\int_{B_{1}(\mathbf{0})}\left|z\right|^{2}\text{d} z =\frac{32}{\text{B}(d/2,3)}\int_{0}^{1}r^{d+1}\text{d} r\\ &=\frac{32\Gamma(d/2+3)}{\Gamma(d/2)\Gamma(3)(d+2)} =\frac{16(d/2+2)(d/2+1)(d/2)}{(d+2)}\\ &=2d(d+4)\end{align*}

give that for almost all $x\in\mathbb{R}^{d}$ ,

\begin{align*} E\left[\left|G_{1}(x,a_{i\eta})\right|^{2}\right] &=\int_{B_{1}(\mathbf{0})}\left(\frac{U\left(x+r'z\right)-U\left(x\right)}{r'}\right)^{2}\frac{\left|\nabla \rho\left(z\right)\right|^{2}}{\rho\left(z\right)}\text{d} z\\ &\le \left(\frac{3}{2}\left\|\nabla U\right\|_{\mathbb{M}}+\omega_{\nabla U}\left(1\right)\left|x\right|\right)^{2}\int_{B_{1}(\mathbf{0})}\frac{\left|\nabla \rho\left(z\right)\right|^{2}}{\rho\left(z\right)}\text{d} z\\ &= 2d(d+4)\left(\frac{3}{2}\left\|\nabla U\right\|_{\mathbb{M}}+\omega_{\nabla U}\left(1\right)\left|x\right|\right)^{2}\\ &\le d(d+4)\left(9\left\|\nabla U\right\|_{\mathbb{M}}^{2}+4\left(\omega_{\nabla U}(1)\right)^{2}\left|x\right|^{2}\right).\end{align*}

These properties along with

\begin{align*} E\left[\left|G(x,a_{i\eta})-\nabla \bar{U}_{r}(x)\right|^{2}\right]=\frac{1}{N_{B}}E\left[\left|G_{1}(x,a_{i\eta})-\nabla \bar{U}_{r}(x)\right|^{2}\right]\le \frac{1}{N_{B}}E\left[\left|G_{1}(x,a_{i\eta})\right|^{2}\right]\end{align*}

yield (A5) with $\delta_{\mathbf{b},r,0}=\delta_{\mathbf{b},r,2}=0$ , $\delta_{\mathbf{v},0}=9d(d+4)\left\|\nabla U\right\|_{\mathbb{M}}^{2}/2N_{B}$ , and $\delta_{\mathbf{v},2}=2d(d+4)(\omega_{\nabla U}(1))^{2}/N_{B}$ if $r=r'$ . Hence, SG-LMC with this G can also achieve $\mathcal{W}_{2}(\mu_{k\eta},\pi)\le \epsilon$ for arbitrary $\epsilon>0$ . Note that the complexity deteriorates by a factor of $\mathcal{O}(d^{3})$ in comparison to SS-LMC; the batch size $N_{B}$ needed to achieve $\mathcal{W}_{2}(\mu_{k\eta},\pi)\le \epsilon$ is of order $\mathcal{O}(d^{5}c_{\text{P}}(\pi)(\log\epsilon^{-1}+d)/\epsilon^{4})$ since $\delta_{\mathbf{v},2}=0$ no longer holds and both $\delta_{\mathbf{v},0}$ and $\delta_{\mathbf{v},2}$ are of order $\mathcal{O}(d^{2}/N_{B})$ .
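The closed-form constant $2d(d+4)$ obtained in the integral computation above can be checked against the intermediate gamma-function expression:

```python
import math

def score_integral(d):
    """32 * Gamma(d/2 + 3) / (Gamma(d/2) * Gamma(3) * (d + 2)),
    the intermediate form of the integral of |grad rho|^2 / rho over B_1(0)."""
    return 32.0 * math.gamma(d / 2.0 + 3.0) / (
        math.gamma(d / 2.0) * math.gamma(3.0) * (d + 2.0))

# The derivation simplifies this to 2d(d+4) for every dimension d.
checks = [abs(score_integral(d) - 2.0 * d * (d + 4.0)) for d in range(1, 30)]
```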

We can also consider a zeroth-order version of SS-SG-LMC with the potential U in Eq. (5) and the following G under (D1)–(D3) and the assumption $|U_{\ell}(x)|<\infty$ for all $\ell=1,\ldots,N$ and $x\in\mathbb{R}^{d}$ :

\begin{align*} G(x,a_{i\eta})=\frac{1}{N_{B}}\sum_{j=1}^{N_{B}}G_{j}(x,a_{i\eta})\,:\!=\,\frac{1}{N_{B}}\sum_{j=1}^{N_{B}}\frac{U_{\lambda_{i,j}}\left(x+r'\zeta_{i,j}\right)-U_{\lambda_{i,j}}\left(x\right)}{r'}\frac{4\zeta_{i,j}}{1-|\zeta_{i,j}|^{2}},\end{align*}

where $N_{B}\in\mathbb{N}$ , $r'\in(0,1]$ , $a_{i\eta}=[\lambda_{i,1},\ldots,\lambda_{i,N_{B}},\zeta_{i,1},\ldots,\zeta_{i,N_{B}}]$ , $\{\lambda_{i,j}\}$ is a sequence of i.i.d. random variables with the discrete uniform distribution on $\{1,\ldots,N\}$ , and $\{\zeta_{i,j}\}$ is a sequence of i.i.d. random variables with the density $\rho$ , independent of $\{\lambda_{i,j}\}$ . We see that for all $x\in\mathbb{R}^{d}$ ,

\begin{align*} E\left[G(x,a_{i\eta})\right]=\frac{1}{NN_{B}}\sum_{j=1}^{N_{B}}\sum_{\ell=1}^{N}\int\frac{U_{\ell}\left(x+r'z\right)-U_{\ell}\left(x\right)}{r'}\left(-\nabla \rho(z)\right)\text{d} z =\nabla \bar{U}_{r'}(x),\end{align*}

and for almost all $x\in\mathbb{R}^{d}$ ,

\begin{align*} E\left[\left|G(x,a_{i\eta})-\nabla \bar{U}_{r'}(x)\right|^{2}\right] &\le \frac{1}{N_{B}}E\left[\left|\frac{U_{\lambda_{i,j}}\left(x+r'\zeta_{i,j}\right)-U_{\lambda_{i,j}}\left(x\right)}{r'}\frac{4\zeta_{i,j}}{(1-|\zeta_{i,j}|^{2})}\right|^{2}\right]\\ &=\frac{1}{NN_{B}}\sum_{\ell=1}^{N}\int\left|\frac{U_{\ell}\left(x+r'z\right)-U_{\ell}\left(x\right)}{r'}\right|^{2}\frac{\left|\nabla \rho(z)\right|^{2}}{\rho(z)}\text{d} z\\ &\le \frac{1}{N_{B}}d(d+4)\max_{\ell=1,\ldots,N}\left(9\left\|\nabla U_{\ell}\right\|_{\mathbb{M}}^{2}+4\left(\hat{\omega}(1)\right)^{2}\left|x\right|^{2}\right).\end{align*}

Hence, assumption (A5) for this G holds with $\delta_{\mathbf{b},r',0}=\delta_{\mathbf{b},r',2}=0$ , $\delta_{\mathbf{v},0}=9d(d+4)(\max_{\ell}\left|\nabla U_{\ell}(\mathbf{0})\right|+\hat{\omega}(1))^{2}/2N_{B}$ , and $\delta_{\mathbf{v},2}=2d(d+4)(\hat{\omega}(1))^{2}/N_{B}$ if $r=r'$ . Therefore, this SG-LMC can achieve $\mathcal{W}_{2}(\mu_{k\eta},\pi)\le \epsilon$ for any $\epsilon>0$ , though the complexity is worse than that of SS-SG-LMC by a factor of $\mathcal{O}(d^{2})$ .

4. Preliminary results

We give preliminary results on the compact polynomial mollifier, mollification of functions with finite moduli of continuity, and the representation of the likelihood ratio between the solutions of SDEs via a Liptser–Shiryaev-type condition for change of measures. We also introduce the fundamental theorem of calculus for weakly differentiable functions, a well-known sufficient condition for Poincaré inequalities together with convergence in $\mathcal{W}_{2}$ under such inequalities, and upper bounds for Wasserstein distances.

4.1. The fundamental theorem of calculus for weakly differentiable functions

We use the following result on the fundamental theorem of calculus for functions in $W_{\text{loc}}^{1,\infty}(\mathbb{R}^{d})$ .

Proposition 4.1. (Lieb and Loss [Reference Lieb and Loss23], Anastassiou [Reference Anastassiou2].) For each $f\in W_{\text{loc}}^{1,\infty}(\mathbb{R}^{d})$ , for almost all $x,y\in\mathbb{R}^{d}$ ,

(6) \begin{align} f(y)-f(x)=\int_{0}^{1}\left\langle \nabla f\left(x+t\left(y-x\right)\right),y-x\right\rangle \text{d} t. \end{align}

4.2. Properties of the compact polynomial mollifier

We analyse the mollifier $\rho$ proposed in Eq. (4). Note that our non-asymptotic analysis needs mollifiers of class $\mathcal{C}^{1}$ whose gradients have explicit $L^{1}$ -bounds and whose supports are in the unit ball of $\mathbb{R}^{d}$ , and it is non-trivial to obtain explicit $L^{1}$ -bounds for the gradients of well-known $\mathcal{C}^{\infty}$ mollifiers.

Remark 4.1. We need mollifiers of class $\mathcal{C}^{1}$ to let $U\ast\rho$ with $U\in W_{\text{loc}}^{1,\infty}(\mathbb{R}^{d})$ be of class $\mathcal{C}^{2}$ and give a bound for the constant of a Poincaré inequality by [Reference Bakry, Barthe, Cattiaux and Guillin4]; see Lemma 4.7 and Proposition 4.4.

The following lemma gives some properties of $\rho$ .

Lemma 4.1.

  1. (1) $\rho\in\mathcal{C}^{1}(\mathbb{R}^{d})$ .

  2. (2) $\int \rho(x)\text{d} x=1$ .

  3. (3) $\int |\nabla \rho(x)|\text{d} x\le d+2$ .

Proof.

  1. (1) We check the behaviour of $\nabla \rho$ on a neighbourhood of $\{x\in\mathbb{R}^{d} \, : \, |x|=1\}$ . For all $x\in\mathbb{R}^{d}$ with $|x|<1$ ,

    \begin{align*} \nabla \rho(x)&=\left(\frac{\pi^{d/2}\text{B}(d/2,3)}{\Gamma(d/2)}\right)^{-1}\left(-4\right)\left(1-|x|^{2}\right)x \end{align*}
    and thus $\nabla \rho$ is continuous at every $x\in\mathbb{R}^{d}$ , since $\nabla \rho(x)=\mathbf{0}$ for all $x\in\mathbb{R}^{d}$ with $|x|\ge1$ and the right-hand side tends to $\mathbf{0}$ as $|x|\to1$ .
  2. (2) We have

    \begin{align*} \int \rho(x)\text{d} x&=\frac{2}{\text{B}(d/2,3)}\int_{0}^{1} r^{d-1}\left(1-r^{2}\right)^{2}\text{d} r=\frac{1}{\text{B}(d/2,3)}\int_{0}^{1} s^{d/2-1}\left(1-s\right)^{2}\text{d} s=1 \end{align*}
    with a change of coordinates from Euclidean to hyperspherical, followed by the change of variables $s=r^{2}$ , so that $\text{d} r=\text{d} s/(2\sqrt{s})$ .
  3. (3) With respect to the $L^{1}$ -norm of the gradient, we have

    \begin{align*} \int |\nabla \rho(x)|\text{d} x&=\int_{|x|\le 1}\left(\frac{\pi^{d/2}\text{B}(d/2,3)}{\Gamma(d/2)}\right)^{-1}4\left(1-|x|^{2}\right)|x|\text{d} x\\ &=\frac{8}{\text{B}(d/2,3)}\int_{0}^{1}r^{d}\left(1-r^{2}\right)\text{d} r =\frac{4}{\text{B}(d/2,3)}\int_{0}^{1}s^{d/2-1/2}\left(1-s\right)\text{d} s\\ &=\frac{4\text{B}(d/2+1/2,2)}{\text{B}(d/2,3)} =\frac{4\Gamma(d/2+1/2)\Gamma(2)\Gamma(d/2+3)}{\Gamma(d/2+5/2)\Gamma(d/2)\Gamma(3)}\\ &=\frac{(d+4)(d+2)d}{(d+3)(d+1)} \le d+2 \end{align*}
    because $(d+4)d\le(d+3)(d+1)$ . Therefore, the statement holds true.
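The beta-function identities in the proof can be verified numerically. The sketch below checks the closed form of $\int|\nabla\rho(x)|\,\text{d} x$ , the bound $d+2$ of Lemma 4.1(3), and its consistency with the lower bound d established in Lemma 4.2 (helper names are ours).

```python
import math

def beta_fn(a, b):
    """Euler beta function B(a, b) = Gamma(a) Gamma(b) / Gamma(a + b)."""
    return math.gamma(a) * math.gamma(b) / math.gamma(a + b)

def grad_l1_norm(d):
    """4 * B(d/2 + 1/2, 2) / B(d/2, 3), which the proof reduces to
    (d+4)(d+2)d / ((d+3)(d+1))."""
    return 4.0 * beta_fn(d / 2.0 + 0.5, 2.0) / beta_fn(d / 2.0, 3.0)

vals = [(d, grad_l1_norm(d)) for d in range(1, 40)]
```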

We show the optimality of the compact polynomial mollifier; the $L^{1}$ -norms of the gradients of $\mathcal{C}^{1}$ non-negative mollifiers with supports in $B_{1}(\mathbf{0})$ are bounded below by d.

Lemma 4.2. Assume that $p \, : \, \mathbb{R}^{d}\to[0,\infty)$ is a continuously differentiable non-negative function whose support is in the unit ball of $\mathbb{R}^{d}$ such that $\int p(x)\text{d} x=1$ . Then

\begin{align*} \int_{\mathbb{R}^{d}}\left|\nabla p(x)\right|\text{d} x\ge d.\end{align*}

Proof. Since $p\in \mathcal{C}^{1}(\mathbb{R}^{d})$ , the $L^{1}$ -norm of the gradient equals the total variation; that is, for arbitrary $R> 1$ ,

\begin{align*} \int_{\mathbb{R}^{d}}\left|\nabla p(x)\right|\text{d} x&=\int_{B_{R}\left(\mathbf{0}\right)}\left|\nabla p(x)\right|\text{d} x\\ &=\sup\left\{\int_{B_{R}\left(\mathbf{0}\right)}p(x)\mathrm{div}\varphi(x)\text{d} x\left|\varphi\in\mathcal{C}_{0}^{1}\left(B_{R}\left(\mathbf{0}\right) \, ; \, \mathbb{R}^{d}\right),\left\|\varphi\right\|_{\infty}\le 1\right.\right\}, \end{align*}

where $\mathcal{C}_{0}^{1}(B_{R}(\mathbf{0});\mathbb{R}^{d})$ is a class of continuously differentiable functions $\varphi \, : \, \mathbb{R}^{d}\to\mathbb{R}^{d}$ with compact supports in $B_{R}(\mathbf{0})\subset\mathbb{R}^{d}$ . For all $\delta\in(0,1]$ , by fixing $\varphi_{\delta}\in\mathcal{C}_{0}^{1}(B_{R}(\mathbf{0});\mathbb{R}^{d})$ such that $\varphi_{\delta}(x)=(1-\delta)x$ for all $x\in B_{1}(\mathbf{0})$ and $\|\varphi_{\delta}\|_{\infty}\le 1$ , we have

\begin{align*} \int_{\mathbb{R}^{d}}\left|\nabla p(x)\right|\text{d} x&\ge \int_{B_{R}\left(\mathbf{0}\right)}p(x)\mathrm{div}\varphi_{\delta}(x)\text{d} x=\int_{B_{1}\left(\mathbf{0}\right)}p(x)\mathrm{div}\varphi_{\delta}(x)\text{d} x=(1-\delta)d. \end{align*}

We obtain the conclusion by taking the limit as $\delta\to0$ .

4.3. Functions with the finite moduli of continuity and their properties

We consider a class of possibly discontinuous functions containing $\nabla U$ and $\widetilde{G}$ , and show lemmas on this class that are useful for the analysis of SG-LMC.

Let $\mathbb{M}=\mathbb{M}(\mathbb{R}^{d};\mathbb{R}^{\ell})$ with fixed $d,\ell\in\mathbb{N}$ denote a class of measurable functions $\phi \, : \, (\mathbb{R}^{d},\mathcal{B}(\mathbb{R}^{d}))\to(\mathbb{R}^{\ell},\mathcal{B}(\mathbb{R}^{\ell}))$ with (1) $|\phi(\mathbf{0})|<\infty$ and (2) $\omega_{\phi}(1)<\infty$ , where $\omega_{\phi}(\! \cdot \!)$ is the well-known modulus of continuity defined as

\begin{align*} \omega_{\phi}(r)\,:\!=\,\sup\nolimits_{x,y\in\mathbb{R}^{d}:|x-y|\le r}\left|\phi(x)-\phi(y)\right|,\end{align*}

where $r>0$ . Note that we use the modulus of continuity not to measure the continuity of $\phi$ , but to measure the fluctuation of $\phi$ within $\bar{B}_{r}(x)$ for all $x\in\mathbb{R}^{d}$ . An intuitive element of $\mathbb{M}$ is $\mathbb{I}_{A}$ for an arbitrary measurable set $A\in\mathcal{B}(\mathbb{R}^{d})$ because $\omega_{\mathbb{I}_{A}}(r)\le 1$ for any A and $r>0$ . In the rest of the paper, we sometimes use the notation $\|\phi\|_{\mathbb{M}}\,:\!=\,|\phi(\mathbf{0})|+\omega_{\phi}(1)$ with $\phi\in\mathbb{M}$ just for brevity (it is easy to see that $\mathbb{M}$ equipped with $\|\cdot\|_{\mathbb{M}}$ is a Banach space).

We introduce the following lemma; it ensures that we can change $r>0$ arbitrarily if $\omega_{\phi}(r)<\infty$ for some $r>0$ , and reveals that considering $r=1$ suffices to capture the large-scale behaviour, since the lemma leads to $\omega_{\phi}(n)\le n\omega_{\phi}(1)$ for any $n\in\mathbb{N}$ .

Lemma 4.3. For any $r>0$ and $\phi\in\mathbb{M}$ , $\omega_{\phi}(r)=\sup_{t>0}\lceil t\rceil^{-1}\omega_{\phi}(rt)$ .

Proof. The inequalities $\omega_{\phi}(r)\le \sup_{t>0}\lceil t\rceil^{-1}\omega_{\phi}(rt)$ and $\omega_{\phi}(r)\ge \lceil t\rceil^{-1}\omega_{\phi}(rt)$ for $t\in(0,1]$ hold immediately. Thus, we only need to show $\omega_{\phi}(r)\ge \lceil t\rceil^{-1}\omega_{\phi}(rt)$ for all $t>1$ .

We fix $t>1$ . For any $x,y\in \mathbb{R}^{d}$ with $|x-y|\le rt$ ,

\begin{align*} \left|\phi\left(x\right)-\phi\left(y\right)\right|&\le \sum_{i=1}^{\lceil t\rceil}\left|\phi\left(\frac{(\lceil t\rceil-i+1)x+(i-1)y}{\lceil t\rceil}\right)-\phi\left(\frac{(\lceil t\rceil-i)x+iy}{\lceil t\rceil}\right)\right|\\ &\le \lceil t\rceil \omega_{\phi}(r) \end{align*}

because $|((\lceil t\rceil-i+1)x+(i-1)y)/\lceil t\rceil-((\lceil t\rceil-i)x+iy)/\lceil t\rceil|=|x-y|/\lceil t\rceil\le r$ . Hence $\omega_{\phi}(rt)\le \lceil t\rceil\omega_{\phi}(r)$ , which completes the proof.

Remark 4.2. Note that continuity and finiteness of the modulus of continuity do not imply each other. For example, $f(x)=x\sin x$ with $x\in\mathbb{R}$ is a continuous function without a finite modulus of continuity. On the other hand, $f(x)=\mathbb{I}_{\mathbb{Q}}\left(x\right)$ with $x\in\mathbb{R}$ is a trivial example of a function with a finite modulus of continuity but without continuity.

Moreover, continuity together with finiteness of the modulus of continuity does not imply uniform continuity, as $f(x)=\sin(x^{2})$ with $x\in\mathbb{R}$ shows.

Figure 1. The chains of implications.
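The examples of Remark 4.2 can be illustrated numerically (illustrative only; `window_osc` is a hypothetical helper whose value lower-bounds the modulus of continuity): the windowed oscillation of $x\sin x$ grows without bound as the window moves away from the origin, while that of $\sin(x^{2})$ stays below $2$ everywhere:

```python
import numpy as np

def window_osc(f, center, half_width=0.5, n=4001):
    """Oscillation of f over [center - h, center + h]; any such value
    lower-bounds omega_f(2 * half_width)."""
    xs = np.linspace(center - half_width, center + half_width, n)
    ys = f(xs)
    return float(ys.max() - ys.min())

f1 = lambda x: x * np.sin(x)   # continuous, but omega_{f1}(1) = +infinity
f2 = lambda x: np.sin(x**2)    # omega_{f2}(1) <= 2, yet not uniformly continuous

osc1 = [window_osc(f1, c) for c in (10.0, 100.0, 1000.0)]
osc2 = [window_osc(f2, c) for c in (10.0, 100.0, 1000.0)]
```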

Lemma 4.4. (Linear growth of functions with the finite moduli of continuity.) For any $\phi\in\mathbb{M}(\mathbb{R}^{d};\mathbb{R}^{\ell})$ , we have that for all $x\in\mathbb{R}^{d}$ ,

\begin{align*} \left|\phi\left(x\right)\right| \le \left|\phi(\mathbf{0})\right|+\omega_{\phi}(1)+\omega_{\phi}(1)|x|. \end{align*}

Proof. Fix $x\in\mathbb{R}^{d}$ . Lemma 4.3 gives

\begin{align*} \left|\phi\left(x\right)\right|-\left|\phi(\mathbf{0})\right|\le \left|\phi\left(x\right)-\phi(\mathbf{0})\right|\le \omega_{\phi}(|x|)\le \lceil |x|\rceil\omega_{\phi}(1)\le (1+|x|)\omega_{\phi}(1). \end{align*}

Therefore, the statement holds.

Lemma 4.5. (Local Lipschitz continuity by gradients with the finite moduli of continuity.) Assume that $\Phi\in W_{\text{loc}}^{1,\infty}(\mathbb{R}^{d})$ and a representative weak gradient $\nabla \Phi$ is in $\mathbb{M}(\mathbb{R}^{d};\mathbb{R}^{d})$ . Then for almost all $x,y\in\mathbb{R}^{d}$ ,

\begin{align*} \left|\Phi\left(x\right)-\Phi\left(y\right)\right|\le \left(\left|\nabla\Phi(\mathbf{0})\right|+\omega_{\nabla\Phi}(1)\left(1+\frac{\left|x\right|+\left|y\right|}{2}\right)\right)\left|y-x\right|. \end{align*}

Proof. Proposition 4.1 and Lemma 4.4 yield that for almost all $x,y\in\mathbb{R}^{d}$ ,

\begin{align*} |\Phi(x)-\Phi(y)|&= \left|\int_{0}^{1}\left\langle\nabla \Phi\left(x+t(y-x)\right),y-x\right\rangle\text{d} t\right|\\ &\le \int_{0}^{1}\left|\nabla \Phi\left(x+t(y-x)\right)\right|\text{d} t\left|y-x\right|\\ &\le \int_{0}^{1}\left(\left|\nabla \Phi\left(\mathbf{0}\right)\right|+\omega_{\nabla \Phi}\left(1\right)\left(1+\left|(1-t)x+ty\right|\right)\right)\text{d} t\left|y-x\right|\\ &\le \left(\left|\nabla \Phi\left(\mathbf{0}\right)\right|+\omega_{\nabla\Phi}(1)\left(1+\frac{|x|+|y|}{2}\right)\right)\left|y-x\right|.\end{align*}

Hence, the lemma is proved.

Lemma 4.6. (Quadratic growth by gradients with the finite moduli of continuity.) Assume that $\Phi\in W_{\text{loc}}^{1,\infty}(\mathbb{R}^{d})$ and a representative weak gradient $\nabla \Phi$ is in $\mathbb{M}(\mathbb{R}^{d};\mathbb{R}^{d})$ . Then we have that $\|\Phi\|_{L^{\infty}(B_{1}(\mathbf{0}))}<\infty$ and for almost all $x\in\mathbb{R}^{d}$ ,

\begin{align*} \Phi\left(x\right)&\le \frac{\omega_{\nabla \Phi}(1)}{2}\left|x\right|^{2}+\left(\left|\nabla \Phi\left(\mathbf{0}\right)\right|+\frac{3}{2}\omega_{\nabla \Phi}(1)\right)\left|x\right|+\left\|\Phi\right\|_{L^{\infty}\left(B_{1}\left(\mathbf{0}\right)\right)}. \end{align*}

Moreover, for all $x\in\mathbb{R}^{d}$ and $r\in(0,1]$ ,

\begin{align*} \bar{\Phi}_{r}(x)&\le \frac{\omega_{\nabla \Phi}(1)}{2}\left|x\right|^{2}+\left(\left|\nabla \Phi\left(\mathbf{0}\right)\right|+2\omega_{\nabla \Phi}(1)\right)\left|x\right|+\left\|\Phi\right\|_{L^{\infty}\left(B_{1}\left(\mathbf{0}\right)\right)} \end{align*}

with $\bar{\Phi}_{r}(x)=(\Phi\ast\rho_{r})(x)$ .

Proof. Lemma 4.5 gives that for almost all $x\in\mathbb{R}^{d}$ and $y \in B_{1}(\mathbf{0})\cap B_{|x|}(x)$ ,

\begin{align*} \Phi\left(x\right)&\le \frac{\omega_{\nabla \Phi}(1)}{2}\left|x\right|\left|x-y\right|+\left(\left|\nabla \Phi\left(\mathbf{0}\right)\right|+\frac{3}{2}\omega_{\nabla \Phi}(1)\right)\left|x-y\right|+\Phi\left(y\right)\\ &\le \frac{\omega_{\nabla \Phi}(1)}{2}\left|x\right|^{2}+\left(\left|\nabla \Phi\left(\mathbf{0}\right)\right|+\frac{3}{2}\omega_{\nabla \Phi}(1)\right)\left|x\right|+\left\|\Phi\right\|_{L^{\infty}\left(B_{1}\left(\mathbf{0}\right)\right)}. \end{align*}

Regarding the second statement, we have that

\begin{align*} \bar{\Phi}_{r}(x)&=\int_{\mathbb{R}^{d}} \Phi\left(x-y\right)\rho_{r}\left(y\right)\text{d} y\\ &=\int_{\mathbb{R}^{d}} \left(\Phi\left(-y\right)+\int_{0}^{1}\left\langle\nabla \Phi\left(-y+tx\right),x\right\rangle \text{d} t\right)\rho_{r}\left(y\right)\text{d} y\\ &\le \int_{\mathbb{R}^{d}} \left(\Phi\left(-y\right)+\int_{0}^{1}\left(\left|\nabla \Phi(\mathbf{0})\right|+\omega_{\nabla \Phi}\left(1\right)\left(1+\left|-y+tx\right|\right)\right)\left|x\right|\text{d} t\right)\rho_{r}\left(y\right)\text{d} y\\ &\le \left\|\Phi\right\|_{L^{\infty}(B_{1}(\mathbf{0}))}+\int_{0}^{1}\left(\left|\nabla \Phi(\mathbf{0})\right|+\omega_{\nabla \Phi}\left(1\right)\left(2+t\left|x\right|\right)\right)\left|x\right|\text{d} t\\ &\le \frac{\omega_{\nabla \Phi}(1)}{2}\left|x\right|^{2}+\left(\left|\nabla \Phi\left(\mathbf{0}\right)\right|+2\omega_{\nabla \Phi}(1)\right)\left|x\right|+\left\|\Phi\right\|_{L^{\infty}(B_{1}(\mathbf{0}))}. \end{align*}

The lemma is proved.

Lemma 4.7. (Smoothness of convolution.) Assume that $\Phi\in W_{\text{loc}}^{1,\infty}(\mathbb{R}^{d})$ and a representative weak gradient $\nabla \Phi$ is in $\mathbb{M}(\mathbb{R}^{d};\mathbb{R}^{d})$ . Then $\bar{\Phi}_{r}\,:\!=\,(\Phi\ast\rho_{r})\in\mathcal{C}^{2}(\mathbb{R}^{d})$ and $\nabla^{2}\bar{\Phi}_{r}=(\nabla\Phi\ast\nabla\rho_{r})$ .

Proof. Since $\Phi$ and $\nabla \Phi$ are essentially bounded on any compact sets, for some $\{\varphi_{n}\}\subset\mathcal{C}_{0}^{\infty}(\mathbb{R}^{d})$ approximating $\rho_{r}\in C_{0}^{1}(\mathbb{R}^{d})\cap W^{1,1}(\mathbb{R}^{d})$ , $\nabla (\Phi\ast\rho_{r})=\Phi\ast\nabla \rho_{r}=\lim_{n}\Phi\ast\nabla \varphi_{n}=\lim_{n}\nabla\Phi\ast\varphi_{n}=\nabla\Phi\ast\rho_{r}$ and thus $\bar{\Phi}_{r}\in\mathcal{C}^{2}(\mathbb{R}^{d})$ with $\nabla^{2}\bar{\Phi}_{r}=(\nabla\Phi\ast\nabla\rho_{r})$ .

Lemma 4.8. (Bounded gradients of convolution.) For all $\phi\in\mathbb{M}(\mathbb{R}^{d};\mathbb{R}^{\ell})$ , $r>0$ , and $x\in\mathbb{R}^{d}$ , we have that

\begin{align*} \left\|\nabla \bar{\phi}_{r}(x)\right\|_{2}\le \left(d+2\right)\frac{\omega_{\phi}(r)}{r},\end{align*}

where $\bar{\phi}_{r}(x)=(\phi\ast\rho_{r})(x)$ .

Proof. We obtain

\begin{align*} \nabla \left(\bar{\phi}_{r}\right)(x) &=\int_{\mathbb{R}^{d}}\phi\left(y\right)\left(\nabla \rho_{r}\right)\left(x-y\right)\text{d} y =\int_{\mathbb{R}^{d}}\left(\phi\left(y\right)-\phi\left(x\right)\right)\left(\nabla \rho_{r}\right)\left(x-y\right)\text{d} y\end{align*}

by using $\int\nabla \rho_{r}(x)\text{d} x=0$ , and thus

\begin{align*} \left\|\nabla \left(\bar{\phi}_{r}\right)(x)\right\|_{2}&\le \int_{\mathbb{R}^{d}}\left\|\left(\phi\left(y\right)-\phi\left(x\right)\right)\left(\nabla \rho_{r}\right)\left(x-y\right)\right\|_{2}\text{d} y\\ &= \int_{\mathbb{R}^{d}}\left|\phi\left(y\right)-\phi\left(x\right)\right|\left|\nabla \rho_{r}\left(x-y\right)\right|\text{d} y\le \omega_{\phi}(r) \int_{\mathbb{R}^{d}}\left|\nabla \rho_{r}\left(y\right)\right|\text{d} y\\ &= \omega_{\phi}(r)\int_{\mathbb{R}^{d}}\left| \frac{1}{r^{d+1}}\nabla\rho\left(\frac{y}{r}\right)\right|\text{d} y = \omega_{\phi}(r)\int_{\mathbb{R}^{d}}\left|\frac{1}{r^{d+1}}\nabla \rho\left(z\right)\right|r^{d}\text{d} z\\ &\le \frac{(d+2)\omega_{\phi}(r)}{r}\end{align*}

by the change of variables $z=y/r$ with $r^{d}\text{d} z=\text{d} y$ and Lemma 4.1.

Lemma 4.9. (1-Lipschitz mapping to $\ell^{\infty}$ .) For all $\phi\in\mathbb{M}(\mathbb{R}^{d};\mathbb{R}^{\ell})$ and $r>0$ ,

\begin{align*} \left\|\phi\ast \rho_{r}-\phi\right\|_{\infty}\le \omega_{\phi}(r). \end{align*}

Proof. Since $\int\rho_{r}(x)\text{d} x=1$ , for all $x\in\mathbb{R}^{d}$ ,

\begin{align*} \left|\phi\ast\rho_{r}(x)-\phi(x)\right|&=\left|\int_{\mathbb{R}^{d}}\phi(y)\rho_{r}(x-y)\text{d} y-\phi(x)\right|\\ &=\left|\int_{\mathbb{R}^{d}}\phi(y)\rho_{r}(x-y)\text{d} y-\int_{\mathbb{R}^{d}}\phi(x)\rho_{r}(x-y)\text{d} y\right|\\ &=\left|\int_{\mathbb{R}^{d}}\left(\phi(y)-\phi(x)\right)\rho_{r}(x-y)\text{d} y\right|\\ &\le \int_{\mathbb{R}^{d}}\left|\phi(y)-\phi(x)\right|\rho_{r}(x-y)\text{d} y\\ &\le \omega_{\phi}(r). \end{align*}

This is the desired conclusion.
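The bound of Lemma 4.9 can be checked numerically. The sketch below is illustrative only: a discretely normalised one-dimensional bump function stands in for $\rho_{r}$ , and the test function is the 1-Lipschitz $\phi(x)=|x|$ , for which $\omega_{\phi}(r)=r$ :

```python
import numpy as np

def mollify(phi, xs, r, m=1001):
    """(phi * rho_r)(x) on a grid, with rho_r a bump function supported
    on [-r, r] and normalised discretely so the weights sum to one."""
    ys = np.linspace(-r, r, m)
    u = 1.0 - (ys / r) ** 2
    w = np.zeros_like(ys)
    w[u > 0] = np.exp(-1.0 / u[u > 0])
    w /= w.sum()
    return np.array([float(np.dot(phi(x - ys), w)) for x in xs])

phi = np.abs                 # 1-Lipschitz, so omega_phi(r) = r
r = 0.3
xs = np.linspace(-2.0, 2.0, 401)
dev = float(np.max(np.abs(mollify(phi, xs, r) - phi(xs))))
```

Since the weights sum to one and are supported on $[-r,r]$ , the deviation is bounded by $\omega_{\phi}(r)=r$ exactly as in the proof above.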

Lemma 4.10. (Essential supremum of deviations by convolution.) Assume that $\Phi\in W_{\text{loc}}^{1,\infty}(\mathbb{R}^{d})$ and a representative weak gradient $\nabla \Phi$ is in $\mathbb{M}(\mathbb{R}^{d};\mathbb{R}^{d})$ . For all $r>0$ ,

\begin{align*} \left\|\bar{\Phi}_{r}-\Phi\right\|_{L^{\infty}(\mathbb{R}^{d})}\le r\omega_{\nabla \Phi}(r) \end{align*}

with $\bar{\Phi}_{r}(x)\,:\!=\,(\Phi\ast\rho_{r})(x)$ .

Proof. By Proposition 4.1 and $\int_{\mathbb{R}^{d}}\langle y,z\rangle \rho_{r}(y)\text{d} y=0$ for any $z\in\mathbb{R}^{d}$ , for almost all $x\in\mathbb{R}^{d}$ ,

\begin{align*} \left|\bar{\Phi}_{r}(x)-\Phi(x)\right|&=\left|\int_{\mathbb{R}^{d}}\left(\Phi(x-y)-\Phi(x)\right)\rho_{r}(y)\text{d} y\right|\\ &=\left|\int_{\mathbb{R}^{d}}\left(\int_{0}^{1}\left\langle \nabla \Phi(x-ty),y\right\rangle\text{d} t\right)\rho_{r}(y)\text{d} y\right|\\ &=\left|\int_{\mathbb{R}^{d}}\left(\int_{0}^{1}\left\langle \nabla \Phi(x-ty)-\nabla \Phi(x),y\right\rangle\text{d} t\right)\rho_{r}(y)\text{d} y\right|\\ &\le \omega_{\nabla \Phi}(r)\int_{\mathbb{R}^{d}}|y|\rho_{r}(y)\text{d} y\\ &\le r\omega_{\nabla \Phi}(r) \end{align*}

and thus the statement holds.

4.4. Liptser–Shiryaev-type condition for change of measures

We show the existence of explicit likelihood ratios of diffusion-type processes based on Theorem 7.19 and Lemma 7.6 of [Reference Liptser and Shiryaev24]. We fix $T>0$ throughout this section. Let $(W_{T},\mathcal{W}_{T})$ be a measurable space of $\mathbb{R}^{d}$ -valued continuous functions $w_{t}$ with $t\in[0,T]$ and $\mathcal{W}_{T}=\sigma(w_{s} \, : \, w\in W_{T},s\le T)$ . We also use the notation $\mathcal{W}_{t}=\sigma(w_{s} \, : \, w\in W_{T}, s\le t)$ for $t\in[0,T]$ . Let $(\Omega, \mathcal{F}, \mu)$ be a complete probability space and $(\tilde{\Omega},\tilde{\mathcal{F}},\tilde{\mu})$ be an identical copy of it. We assume that the filtration $\{\mathcal{F}_{t}\}_{t\in[0,T]}$ satisfies the usual conditions. Let $(B_{t},\mathcal{F}_{t})$ with $t\in[0,T]$ be a d-dimensional Brownian motion and $\xi$ be an $\mathcal{F}_{0}$ -measurable d-dimensional random vector such that $|\xi|<\infty $ $\mu$ -almost surely. We set $\{a_{t}\}_{t\in[0,T]}$ , an $\mathcal{F}_{t}$ -adapted random process such that its trajectory $\{a_{s}(\omega)\}_{s\in[0,t]}$ with $\omega\in\Omega$ for each $t\in[0,T]$ is in a measurable space $(A_{t},\mathcal{A}_{t})$ . Assume that $a=\{a_{t}\}_{t\in[0,T]}$ , $B=\{B_{t}\}_{t\in[0,T]}$ , and $\xi$ are independent of each other. $\mu_{a},\mu_{B}$ , and $\mu_{\xi}$ denote the probability measures induced by a, B, and $\xi$ on $(A_{T},\mathcal{A}_{T})$ , $(W_{T},\mathcal{W}_{T})$ , and $(\mathbb{R}^{d},\mathcal{B}(\mathbb{R}^{d}))$ , respectively.

Consider the solutions $X^{P}=\{X_{t}^{P}\}_{t\in[0,T]}$ and $X^{Q}=\{X_{t}^{Q}\}_{t\in[0,T]}$ of the following SDEs:

(7) \begin{align} \text{d} X_{t}^{P}&=b^{P}\left(t,a,X^{P}\right)\text{d} t+\sqrt{2\beta^{-1}}\text{d} B_{t},\ X_{0}^{P}=\xi, \end{align}

(8) \begin{align} \text{d} X_{t}^{Q}&=b^{Q}\left(X_{t}^{Q}\right)\text{d} t+\sqrt{2\beta^{-1}}\text{d} B_{t},\ X_{0}^{Q}=\xi.\end{align}
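Diffusions of the form (7)–(8) can be simulated by Euler–Maruyama discretisation. The following sketch is illustrative only; the drift $b^{Q}(x)=-x$ (the Ornstein–Uhlenbeck case, potential $|x|^{2}/2$ ) and all parameter values are hypothetical choices, not taken from the paper:

```python
import numpy as np

def euler_maruyama(bQ, x0, beta, eta, n_steps, rng):
    """Euler-Maruyama discretisation of dX = bQ(X) dt + sqrt(2/beta) dB,
    the form of Eq. (8), with step size eta."""
    x = np.asarray(x0, dtype=float)
    path = [x.copy()]
    for _ in range(n_steps):
        x = x + eta * bQ(x) + np.sqrt(2.0 * eta / beta) * rng.standard_normal(x.shape)
        path.append(x.copy())
    return np.array(path)

# For bQ(x) = -x the invariant law is N(0, beta^{-1} I), so the long-run
# second moment should be close to d / beta.
rng = np.random.default_rng(0)
beta, eta, d = 1.0, 0.01, 2
path = euler_maruyama(lambda x: -x, np.zeros(d), beta, eta, 20000, rng)
second_moment = float(np.mean(np.sum(path[10000:] ** 2, axis=1)))
```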

We set the following assumptions, partially adapted from [Reference Liptser and Shiryaev24] but containing some differences in $\xi$ and the structure of $X^{Q}$ .

  1. (LS1) $X_{t}^{P}$ is a strong solution of Eq. (7), that is, there exists a measurable functional $F_{t}$ for each t such that

    \begin{align*} X_{t}^{P}(\omega)=F_{t}(a(\omega),B(\omega),\xi(\omega)) \end{align*}
    $\mu$ -almost surely.
  2. (LS2) $b^{P}$ is non-anticipative, that is, $\mathcal{A}_{t}\times \mathcal{W}_{t}$ -measurable for each $t\in[0,T]$ , and for fixed $a\in A_{T}$ and $w\in W_{T}$ ,

    \begin{align*} \int_{0}^{T}\left|b^{P}(t,a,w)\right|\text{d} t<\infty. \end{align*}
  3. (LS3) $b^{Q} \, : \, \mathbb{R}^{d}\to\mathbb{R}^{d}$ is Lipschitz continuous, so that $X^{Q}$ is the unique strong solution of Eq. (8).

  4. (LS4) We have that

    \begin{align*} &\mu\left(\int_{0}^{T}\left(\left|b^{P}\left(t,a,X^{P}\right)\right|^{2}+\left|b^{Q}\left(X_{t}^{P}\right)\right|^{2}\right)\text{d} t<\infty\right)\\ &\quad=\mu\left(\int_{0}^{T}\left(\left|b^{P}\left(t,a,X^{Q}\right)\right|^{2}+\left|b^{Q}\left(X_{t}^{Q}\right)\right|^{2}\right)\text{d} t<\infty\right)=1. \end{align*}

We consider a variant of (7) with fixed $a\in A_{T}$ :

\begin{align*} \text{d} X_{t}^{P|a}=b^{P}\left(t,a,X^{P|a}\right)\text{d} t+\sqrt{2\beta^{-1}}\text{d} B_{t},\ X_{0}^{P|a}=\xi.\end{align*}

Then assumption (LS1) yields that

(9) \begin{align} X_{t}^{P|a}(\omega)=F_{t}(a,B(\omega),\xi(\omega))\end{align}

$\mu_{a}\times \mu$ -almost surely. We assume that $\Omega=A_{T}\times W_{T}\times \mathbb{R}^{d}$ , $\mathcal{F}=\mathcal{A}_{T}\times \mathcal{W}_{T}\times\mathcal{B}(\mathbb{R}^{d})$ , and $\mu=\mu_{a}\times\mu_{B}\times\mu_{\xi}$ without loss of generality. Then each $\omega\in\Omega$ has the form $\omega=(a,B,\xi)$ and we can assume that a, B, and $\xi$ are the coordinate projections, that is, $a(\omega)=a$ , $B(\omega)=B$ , and $\xi(\omega)=\xi$ ; therefore, Eq. (9) holds $\mu_{a}\times\mu_{B}\times\mu_{\xi}$ -almost surely.

We consider a process on the product space $(\Omega\times\tilde{\Omega},\mathcal{F}\times\tilde{\mathcal{F}},\mu\times \tilde{\mu})$ :

\begin{align*} \text{d} X_{t}^{P|a(\omega)}(\tilde{\omega})=b^{P}\left(t,a(\omega),X^{P|a(\omega)}(\tilde{\omega})\right)\text{d} t+\sqrt{2\beta^{-1}}\text{d} B_{t}(\tilde{\omega}),\ X_{0}^{P|a(\omega)}=\xi(\tilde{\omega}).\end{align*}

Assumption (LS1) gives that

\begin{align*} X_{t}^{P|a(\omega)}(\tilde{\omega})=F_{t}\left(a(\omega),B(\tilde{\omega}),\xi(\tilde{\omega})\right)\end{align*}

$\mu\times\tilde{\mu}$ -almost surely.

Lemma 4.11. Under (LS1), for any $C\in\mathcal{W}_{T}$ ,

\begin{align*} \mu\left(X^{P}(a,B,\xi)\in C|\sigma(a)\right)=\tilde{\mu}\left(X^{P}\left(a,\tilde{B},\tilde{\xi}\right)\in C\right) \end{align*}

$\mu$ -almost surely.

Proof. The proof is essentially identical to that of Lemma 7.5 of [Reference Liptser and Shiryaev24] except for the randomness of $\xi$ . We first show that for fixed $t\in[0,T]$ and $C_{t}\in\mathcal{B}(\mathbb{R}^{d})$ ,

\begin{align*} \mu\left(F_{t}(a,B,\xi)\in C_{t}|\sigma(a)\right)=\tilde{\mu}\left(F_{t}\left(a,\tilde{B},\tilde{\xi}\right)\in C_{t}\right) \end{align*}

$\mu$ -almost surely. Note that the following probability for fixed a is $\mathcal{A}_{T}$ -measurable owing to (LS1) and Fubini’s theorem:

\begin{align*} \tilde{\mu}\left(F_{t}\left(a,\tilde{B},\tilde{\xi}\right)\in C_{t}\right)=\left(\mu_{B}\times\mu_{\xi}\right)\left(F_{t}(a,B,\xi)\in C_{t}\right). \end{align*}

Let $f(a(\omega))$ be a $\sigma(a)$ -measurable bounded random variable. Again Fubini’s theorem gives that

\begin{align*} E\left[f(a(\omega))\mathbb{I}_{F_{t}(a,B,\xi)\in C_{t}}\right]&=\int_{A_{T}}\int_{W_{T}}\int_{\mathbb{R}^{d}}f(a)\mathbb{I}_{F_{t}(a,w,x)\in C_{t}}\mu_{a}(\text{d} a)\mu_{B}(\text{d} w)\mu_{\xi}(\text{d} x)\\ &=\int_{A_{T}}f(a)\left(\int_{W_{T}}\int_{\mathbb{R}^{d}}\mathbb{I}_{F_{t}(a,w,x)\in C_{t}}\mu_{B}(\text{d} w)\mu_{\xi}(\text{d} x)\right)\mu_{a}(\text{d} a)\\ &=\int_{A_{T}}f(a)\left(\mu_{B}\times \mu_{\xi}\right)\left(F_{t}(a,B,\xi)\in C_{t}\right)\mu_{a}(\text{d} a)\\ &=\int_{A_{T}}f(a)\tilde{\mu}\left(F_{t}\left(a,\tilde{B},\tilde{\xi}\right)\in C_{t}\right)\mu_{a}(\text{d} a)\\ &=E\left[f(a)\tilde{\mu}\left(F_{t}\left(a,\tilde{B},\tilde{\xi}\right)\in C_{t}\right)\right] \end{align*}

and thus the definition of conditional expectation yields the result. Similarly, we obtain that for all $n\in\mathbb{N}$ , $0\le t_{1} < \cdots < t_{n}\le T$ , and $C_{t_{i}}\in \mathcal{B}(\mathbb{R}^{d})$ , $i=1,\ldots,n$ ,

\begin{align*} &\mu\left(F_{t_{1}}(a,B,\xi)\in C_{t_{1}},\ldots,F_{t_{n}}(a,B,\xi)\in C_{t_{n}}|\sigma(a)\right)\\ &\quad=\tilde{\mu}\left(F_{t_{1}}\left(a,\tilde{B},\tilde{\xi}\right)\in C_{t_{1}},\ldots,F_{t_{n}}\left(a,\tilde{B},\tilde{\xi}\right)\in C_{t_{n}}\right). \end{align*}

Therefore, the statement holds true.

Let $P_{T}$ and $Q_{T}$ denote the laws of $\{(a_{t},X_{t}^{P}) \, : \, t\in[0,T]\}$ and $\{(a_{t},X_{t}^{Q}) \, : \, t\in[0,T]\}$ . Note that $a_{t}$ and $X_{t}^{Q}$ are independent of each other by the assumptions. The following proposition gives the equivalence and the representation of the likelihood ratio.

Proposition 4.2. Under (LS1)–(LS4), we have that

\begin{align*} &\frac{\text{d} Q_{T}}{\text{d} P_{T}}\left(a,X^{P}\right)\\ &\quad=\exp\left(-\sqrt{\frac{\beta}{2}}\int_{0}^{T}\left\langle \left(b^{P}-b^{Q}\right)\left(t,a,X^{P}\right),\text{d} B_{t}\right\rangle -\frac{\beta}{4}\int_{0}^{T}\left|\left(b^{P}-b^{Q}\right)\left(t,a,X^{P}\right)\right|^{2}\text{d} t\right). \end{align*}

Proof. The proof parallels that of Lemma 7.6 of [Reference Liptser and Shiryaev24]. For an arbitrary set $\Gamma=\Gamma_{1}\times \Gamma_{2}$ with $\Gamma_{1}\in\mathcal{A}_{T}$ and $\Gamma_{2}\in\mathcal{W}_{T}$ , by Lemma 4.11,

\begin{align*} \mu\left(\left(a,X^{P}\right)\in\Gamma\right)&=\int_{A_{T}\times W_{T}\times \mathbb{R}^{d}}\mathbb{I}_{a\in \Gamma_{1}}\mathbb{I}_{X^{P}(a,w,x)\in\Gamma_{2}}\mu_{a}(\text{d} a)\mu_{B}\left(\text{d} w\right)\mu_{\xi}\left(\text{d} x\right)\\ &=\int_{a\in \Gamma_{1}}\mu\left(X^{P}(a,B,\xi)\in \Gamma_{2}|\sigma(a)\right)\mu_{a}(\text{d} a)\\ &=\int_{a\in \Gamma_{1}}\tilde{\mu}\left(X^{P}\left(a,\tilde{B},\tilde{\xi}\right)\in \Gamma_{2}\right)\mu_{a}(\text{d} a)\\ &=\int_{a\in \Gamma_{1}}\left(P|a\right)_{T}(\Gamma_{2})\mu_{a}(\text{d} a),\end{align*}

where $(P|a)_{T}$ is the law of (9). Let $(Q|a)_{T}$ denote the law of $X^{Q}$ . For $\mu_{a}$ -almost all a, under (LS1)–(LS4) and Theorem 7.19 of [Reference Liptser and Shiryaev24], $(P|a)_{T}\sim (Q|a)_{T}$ and the likelihood ratio is given as

\begin{align*} &\frac{\text{d} (P|a)_{T}}{\text{d} (Q|a)_{T}}\left(X^{Q}\right)\\ &\quad=\exp\left(\sqrt{\frac{\beta}{2}}\int_{0}^{T}\left\langle \left(b^{P}-b^{Q}\right)\left(t,a,X^{Q}\right),\text{d} B_{t}\right\rangle -\frac{\beta}{4}\int_{0}^{T}\left|\left(b^{P}-b^{Q}\right)\left(t,a,X^{Q}\right)\right|^{2}\text{d} t\right).\end{align*}

Therefore, we have

\begin{align*} \mu\left(\left(a,X^{P}\right)\in\Gamma\right)&=\int_{\Gamma_{1}}\int_{\Gamma_{2}}\left(\frac{\text{d} (P|a)_{T}}{\text{d} (Q|a)_{T}}(w)(Q|a)_{T}(\text{d} w)\right)\mu_{a}(\text{d} a)\\ &=\int_{\Gamma_{1}}\int_{\Gamma_{2}}\frac{\text{d} (P|a)_{T}}{\text{d} (Q|a)_{T}}(w)\left(\mu_{a}\times (Q|a)_{T}\right)(\text{d} a \text{d} w)\\ &=\int_{\Gamma}\frac{\text{d} (P|a)_{T}}{\text{d} (Q|a)_{T}}(w)Q_{T}(\text{d} a \text{d} w).\end{align*}

Since $Q_{T}(a,w \, : \, (\text{d} (P|a)_{T})/(\text{d} (Q|a)_{T})(w)=0)=0$ , Lemma 6.8 of [Reference Liptser and Shiryaev24] yields the desired conclusion.

We obtain the following result.

Proposition 4.3. (Kullback–Leibler divergence.) Under (LS1)–(LS4) and the assumption

\begin{equation*} E\left[\int_{0}^{T}\left|\left(b^{P}-b^{Q}\right)\left(s,a,X^{P}\right)\right|^{2}\text{d} s\right]<\infty, \end{equation*}

we obtain

\begin{align*} D\left(P_{T}\left\|Q_{T}\right.\right)= \frac{\beta}{4}E\left[\int_{0}^{T}\left|\left(b^{P}-b^{Q}\right)\left(s,a,X^{P}\right)\right|^{2}\text{d} s\right]. \end{align*}

Proof. Using Proposition 4.2, we obtain

\begin{align*} &D\left(P_{T}\left\|Q_{T}\right.\right)\\ &\quad=E\left[\log\left(\frac{\text{d} P_{T}}{\text{d} Q_{T}}\right)(a,X^{P})\right]\\ &\quad=E\left[\frac{\beta}{4}\int_{0}^{T}\left|\left(b^{P}-b^{Q}\right)\left(s,a,X^{P}\right)\right|^{2}\text{d} s+\sqrt{\frac{\beta}{2}}\int_{0}^{T}\left\langle \left(b^{P}-b^{Q}\right)\left(s,a,X^{P}\right),\text{d} B_{s}\right\rangle\right]\\ &\quad=\frac{\beta}{4}E\left[\int_{0}^{T}\left|\left(b^{P}-b^{Q}\right)\left(s,a,X^{P}\right)\right|^{2}\text{d} s\right], \end{align*}

since the local martingale term is a martingale by the assumption. Hence, the proposition holds.

4.5. Poincaré inequalities

Let us consider Poincaré inequalities for a probability measure $P_{\Phi}$ whose density is $\left(\int \mathrm{e}^{-\Phi(x)}\text{d} x\right)^{-1}\mathrm{e}^{-\Phi(x)}$ with lower-bounded $\Phi\in\mathcal{C}^{2}(\mathbb{R}^{d})$ such that $\int \mathrm{e}^{-\Phi(x)}\text{d} x<\infty$ . Let $L\,:\!=\,\Delta -\langle \nabla \Phi,\nabla \rangle$ , which is $P_{\Phi}$ -symmetric, $P_{t}$ be the Markov semigroup with the infinitesimal generator L, and $\mathcal{E}$ denote the Dirichlet form

\begin{align*} \mathcal{E}(g)\,:\!=\,\lim_{t\to0}\frac{1}{t}\int_{\mathbb{R}^{d}}g \left(g-P_{t}g\right) \text{d} P_{\Phi},\end{align*}

where $g\in L^{2}(P_{\Phi})$ such that the limit exists. Here, we say that a probability measure $P_{\Phi}$ satisfies a Poincaré inequality with constant $c_\textrm{P}(P_{\Phi})$ (the Poincaré constant) if for any $Q\ll P_{\Phi}$ ,

\begin{align*} \chi^{2}\left(Q\|P_{\Phi}\right)\le c_\textrm{P}(P_{\Phi})\mathcal{E}\left(\sqrt{\frac{\text{d} Q}{\text{d} P_{\Phi}}}\right).\end{align*}

We adopt the following statement from [Reference Raginsky, Rakhlin and Telgarsky31]; although it is different from the original discussion of [Reference Bakry, Barthe, Cattiaux and Guillin4], the difference is negligible because Eq. (2.3) of [Reference Bakry, Barthe, Cattiaux and Guillin4] yields the same upper bound.

Proposition 4.4. (Bakry et al. [Reference Bakry, Barthe, Cattiaux and Guillin4].) Assume that there exists a Lyapunov function $V\in\mathcal{C}^{2}(\mathbb{R}^{d})$ with $V:\mathbb{R}^{d}\to[1,\infty)$ such that

\begin{align*} \frac{LV\left(x\right)}{V\left(x\right)}\le -\lambda_{0}+\kappa_{0}\mathbb{I}_{B_{\tilde{R}}(\mathbf{0})}\left(x\right) \end{align*}

for some $\lambda_{0}>0$ , $\kappa_{0}\ge 0$ , and $\tilde{R}>0$ , where $LV(x)=\Delta V-\langle \nabla \Phi,\nabla V\rangle$ . Then $P_{\Phi}$ satisfies a Poincaré inequality with constant $c_\textrm{P}(P_{\Phi})$ such that

\begin{align*} c_\textrm{P}(P_{\Phi})\le \frac{1}{\lambda_{0}}\left(1+a\kappa_{0}\tilde{R}^{2}\mathrm{e}^{\mathrm{Osc}_{\tilde{R}}}\right), \end{align*}

where $a>0$ is an absolute constant and $\mathrm{Osc}_{\tilde{R}}\,:\!=\,\max_{x:|x|\le \tilde{R}}\Phi(x)-\min_{x:|x|\le \tilde{R}}\Phi(x)$ .

The next proposition shows convergence in $\mathcal{W}_{2}$ in terms of the $\chi^{2}$ -divergence, building on the recent study [Reference Liu25].

Proposition 4.5. (Lehec [Reference Lehec21], Lemma 9.) Assume that $P_{\Phi}$ satisfies a Poincaré inequality with constant $c_\textrm{P}(P_{\Phi})$ and $\nabla \Phi$ is at most of linear growth. Then for any probability measure $\nu$ on $(\mathbb{R}^{d},\mathcal{B}(\mathbb{R}^{d}))$ with $\nu\ll P_{\Phi}$ and every $t>0$ ,

\begin{align*} \mathcal{W}_{2}\left(\nu P_{t},P_{\Phi}\right)\le \sqrt{2c_\textrm{ P}(P_{\Phi})\chi^{2}\left(\nu\|P_{\Phi}\right)}\exp\left(-\frac{t}{2c_\textrm{P}(P_{\Phi})}\right), \end{align*}

where $\nu P_{t}$ is the law of the unique weak solution $Z_{t}$ of the SDE

\begin{align*} \text{d} Z_{t}=-\nabla \Phi\left(Z_{t}\right)\text{d} t+\sqrt{2}\text{d} B_{t},\quad Z_{0}\sim \nu. \end{align*}

4.6. A bound for the 2-Wasserstein distance by KL divergence

The next proposition is an immediate consequence of [Reference Bolley and Villani7].

Proposition 4.6. (Bolley and Villani [Reference Bolley and Villani7].) Let $\mu,\nu$ be probability measures on $(\mathbb{R}^{d},\mathcal{B}(\mathbb{R}^{d}))$ . Assume that there exists a constant $\lambda>0$ such that $\int \exp(\lambda |x|^{2})\nu(\text{d} x)<\infty$ . Then for any $\mu$ ,

\begin{align*} \mathcal{W}_{2}\left(\mu,\nu\right)\le C_{\nu}\left(D(\mu\|\nu)^{1/2}+\left(\frac{D(\mu\|\nu)}{2}\right)^{1/4}\right), \end{align*}

where

\begin{align*} C_{\nu}\,:\!=\,2\inf_{\lambda>0}\left(\frac{1}{\lambda}\left(\frac{3}{2}+\log\int_{\mathbb{R}^{d}}\mathrm{e}^{\lambda\left|x\right|^{2}}\nu\left(\text{d} x\right)\right)\right)^{1/2}. \end{align*}
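For a Gaussian reference measure, $C_{\nu}$ is explicitly computable: if $\nu=N(0,\sigma^{2}I_{d})$ then $\int \mathrm{e}^{\lambda|x|^{2}}\nu(\text{d} x)=(1-2\lambda\sigma^{2})^{-d/2}$ for $\lambda<1/(2\sigma^{2})$ , reducing the infimum to a one-dimensional minimisation. The sketch below is illustrative only; the helper name and the grid search over $\lambda$ are our own choices:

```python
import numpy as np

def bolley_villani_constant(d, sigma2, n_grid=10000):
    """C_nu for nu = N(0, sigma2 * I_d), using the closed-form Gaussian
    moment E exp(lam |x|^2) = (1 - 2 lam sigma2)^{-d/2}, lam < 1/(2 sigma2);
    the infimum over lam is approximated by a grid search."""
    lams = np.linspace(1e-6, 1.0 / (2.0 * sigma2) - 1e-6, n_grid)
    log_mgf = -(d / 2.0) * np.log(1.0 - 2.0 * lams * sigma2)
    vals = (1.5 + log_mgf) / lams
    return float(2.0 * np.sqrt(vals.min()))

C = bolley_villani_constant(d=2, sigma2=1.0)
```

As expected from the $\log$ of the Gaussian moment, the constant grows with the dimension $d$ .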

5. Proof of the main theorem

In this section, we use the notation $\|\nabla U\|_{\mathbb{M}}\,:\!=\,|\nabla U(\mathbf{0})|+\omega_{\nabla U}(1)$ and $\|\widetilde{G}\|_{\mathbb{M}}\,:\!=\,|\widetilde{G}(\mathbf{0})|+\omega_{\widetilde{G}}(1)$ under (A3). We denote by $\bar{X}_{t}^{r}$ the unique strong solution of the following SDE under (A3) (Lemma 4.8 gives the existence and uniqueness):

(10) \begin{align} \text{d} \bar{X}_{t}^{r}=-\nabla \bar{U}_{r}\left(\bar{X}_{t}^{r}\right)\text{d} t+\sqrt{2\beta^{-1}}\text{d} B_{t},\ \bar{X}_{0}^{r}=\xi,\end{align}

and $\bar{\nu}_{t}^{r}$ represents the probability measure of $\bar{X}_{t}^{r}$ . We use the notation $\pi$ and $\bar{\pi}^{r}$ , probability measures on $(\mathbb{R}^{d},\mathcal{B}(\mathbb{R}^{d}))$ , as

\begin{align*} \pi\left(\text{d} x\right)=\frac{1}{\mathcal{Z}\left(\beta\right)}\exp\left(-\beta U\left(x\right)\right)\text{d} x,\quad \bar{\pi}^{r}\left(\text{d} x\right)\,:\!=\,\frac{1}{\bar{\mathcal{Z}}^{r}\left(\beta\right)}\exp\left(-\beta \bar{U}_{r}\left(x\right)\right)\text{d} x,\end{align*}

where $\mathcal{Z}(\beta)=\int\exp(\! - \! \beta U(x))\text{d} x$ and $\bar{\mathcal{Z}}^{r}(\beta)=\int\exp(\! - \! \beta \bar{U}_{r}(x))\text{d} x$ . Note that $\bar{U}_{r}$ is $(\bar{m},\bar{b})$ -dissipative with $\bar{m}\,:\!=\,m,\bar{b}\,:\!=\,b+\omega_{\nabla U}(1)$ as

\begin{align*} \left\langle x,\nabla \bar{U}_{r}\left(x\right)\right\rangle&=\int_{\mathbb{R}^{d}}\left\langle x,\nabla U\left(x-y\right)\right\rangle \rho_{r}(y)\text{d} y\\ &= \int_{\mathbb{R}^{d}}\left\langle x-y,\nabla U\left(x-y\right)\right\rangle \rho_{r}(y)\text{d} y\\ &\quad+\int_{\mathbb{R}^{d}}\left\langle y,\nabla U\left(x-y\right)-\nabla U\left(x\right)\right\rangle \rho_{r}(y)\text{d} y\\ &\ge \int_{\mathbb{R}^{d}}\left(m\left| x-y\right|^{2}-b\right) \rho_{r}(y)\text{d} y-\omega_{\nabla U}(r) \int_{\mathbb{R}^{d}}\left| y\right|\rho_{r}(y)\text{d} y\\ &\ge m\left| x\right|^{2}-b+m\int_{\mathbb{R}^{d}}\left|y\right|^{2} \rho_{r}(y)\text{d} y-r\omega_{\nabla U}(r)\\ &\ge m|x|^{2}-(b+\omega_{\nabla U}(1))\end{align*}

owing to $r\le 1$ and $\int_{\mathbb{R}^{d}}\langle y,z\rangle \rho_{r}(y)\text{d} y=0$ for each $z\in\mathbb{R}^{d}$ .

5.1. Moments of SG-LMC algorithms

Lemma 5.1. (Uniform $L^{2}$ moments.) Assume that (A1)–(A6) hold.

  1. (1) For all $k\in\mathbb{N}$ and $0<\eta\le 1\wedge (\tilde{m}/(2((\omega_{\widetilde{G}}(1))^{2}+\delta_{\mathbf{v},2})))$ , $Y_{k\eta},G(Y_{k\eta},a_{k\eta})\in L^{2}$ . Moreover,

    \begin{align*} \sup_{k\ge 0}E\left[\left|Y_{k\eta}\right|^{2}\right]&\le \kappa_{0}+2\left(1\vee \frac{1}{\tilde{m}}\right)\left(\tilde{b}+\left\|\widetilde{G}\right\|_{\mathbb{M}}^{2}+\delta_{\mathbf{v},0}+\frac{d}{\beta}\right)= \! : \, \kappa_{\infty}. \end{align*}
  2. (2) For any $t\ge0$ and $r\in(0,1]$ ,

    \begin{align*} E\left[\left|\bar{X}_{t}^{r}\right|^{2}\right]&\le \kappa_{0}\mathrm{e}^{-2\bar{m}t}+\frac{\bar{b}+d/\beta}{\bar{m}}\left(1-\mathrm{e}^{-2\bar{m}t}\right). \end{align*}

Proof. The proof is adapted from Lemma 3 of [Reference Raginsky, Rakhlin and Telgarsky31].

  1. (1) We first show $Y_{k\eta}\in L^{2}$ for each $k\in\mathbb{N}$ since

    \begin{align*} E\left[\left.\left|G(Y_{k\eta},a_{k\eta})\right|^{2}\right|Y_{k\eta}\right]&\le 2E\left[\left.\left|G(Y_{k\eta},a_{k\eta})-\widetilde{G}(Y_{k\eta})\right|^{2}\right|Y_{k\eta}\right]+2\left|\widetilde{G}(Y_{k\eta})\right|^{2}\\ &\le 4\delta_{\mathbf{v},2}\left|Y_{k\eta}\right|^{2}+4\delta_{\mathbf{v},0}+\left(4\left\|\widetilde{G}\right\|_{\mathbb{M}}^{2}+4\left(\omega_{\widetilde{G}}\left(1\right)\right)^{2}\left|Y_{k\eta}\right|^{2}\right) \end{align*}
    almost surely and thus $Y_{k\eta}\in L^{2}$ implies $G(Y_{k\eta},a_{k\eta})\in L^{2}$ . Assumptions (A3), (A5), and (A6) together with Lemma 4.4 give
    \begin{align*} E\left[\left|Y_{(k+1)\eta}\right|^{2}\right] &=E\left[\left|Y_{k\eta}-\eta{G}\left(Y_{k\eta},a_{k\eta}\right)+\sqrt{2\beta^{-1}}\left(B_{(k+1)\eta}-B_{k\eta}\right)\right|^{2}\right]\\ &\le 2E\left[\left|Y_{k\eta}-\eta{G}\left(Y_{k\eta},a_{k\eta}\right)\right|^{2}\right]+2E\left[\left|\sqrt{2\beta^{-1}}\left(B_{(k+1)\eta}-B_{k\eta}\right)\right|^{2}\right]\\ &\le 4E\left[\left|Y_{k\eta}-\eta \widetilde{G}(Y_{k\eta})\right|^{2}\right]+4\eta^{2}E\left[\left|\widetilde{G}(Y_{k\eta})-{G}\left(Y_{k\eta},a_{k\eta}\right)\right|^{2}\right]+\frac{4\eta d}{\beta}\\ &\le \left(8+16(\omega_{\widetilde{G}}(1))^{2}+8\delta_{\mathbf{v},2}\right)E\left[\left|Y_{k\eta}\right|^{2}\right]+\left(16\left\|\widetilde{G}\right\|_{\mathbb{M}}^{2}+8\delta_{\mathbf{v},0}+\frac{4d}{\beta}\right). \end{align*}
    Hence, $Y_{k\eta}\in L^{2}$ as there exist $\gamma_{2},\gamma_{0}>1$ such that
    \begin{align*} E[|Y_{(k+1)\eta}|^{2}] &\le \gamma_{2}E[|Y_{k\eta}|^{2}]+\gamma_{0}\\ &\le \gamma_{2}^{k+1}E[|\xi|^{2}]+\gamma_{0}(\gamma_{2}^{k+1}-1)/(\gamma_{2}-1)\\ &\le \gamma_{2}^{k+1}(\log E[\exp(|\xi|^{2})]+\gamma_{0}/(\gamma_{2}-1))\\ &\le \gamma_{2}^{k+1}(\kappa_{0}+\gamma_{0}/(\gamma_{2}-1))<\infty \end{align*}
    for arbitrary $k\in\mathbb{N}$ by Jensen’s inequality.

The independence among $Y_{k\eta}$ , $a_{k\eta}$ , and $B_{(k+1)\eta}-B_{k\eta}$ and the square integrability of $Y_{k\eta}$ and $G(Y_{k\eta},a_{k\eta})$ lead to

\begin{align*} E\left[\left|Y_{(k+1)\eta}\right|^{2}\right] &=E\left[\left|Y_{k\eta}-\eta \widetilde{G}(Y_{k\eta})\right|^{2}\right]+\eta^{2}E\left[\left|\widetilde{G}(Y_{k\eta})-{G}\left(Y_{k\eta},a_{k\eta}\right)\right|^{2}\right]+\frac{2\eta d}{\beta}. \end{align*}

Lemma 4.4 gives

\begin{align*} &E\left[\left|Y_{k\eta}-\eta \widetilde{G}(Y_{k\eta})\right|^{2}\right]\\ &\quad=E\left[\left|Y_{k\eta}\right|^{2}\right]-2\eta E\left[\left\langle Y_{k\eta}, \widetilde{G}(Y_{k\eta})\right\rangle \right]+\eta^{2}E\left[\left|\widetilde{G}(Y_{k\eta})\right|^{2}\right]\\ &\quad\le E\left[\left|Y_{k\eta}\right|^{2}\right]+2\eta\left(\tilde{b}-\tilde{m}E\left[\left|Y_{k\eta}\right|^{2}\right]\right)+2\eta^{2}\left(\left\|\widetilde{G}\right\|_{\mathbb{M}}^{2}+\left(\omega_{\widetilde{G}}(1)\right)^{2}E\left[\left|Y_{k\eta}\right|^{2}\right]\right)\\ &\quad=\left(1-2\eta \tilde{m}+2\eta^{2}\left(\omega_{\widetilde{G}}(1)\right)^{2}\right)E\left[\left|Y_{k\eta}\right|^{2}\right]+2\eta \tilde{b}+2\eta^{2}\left\|\widetilde{G}\right\|_{\mathbb{M}}^{2}. \end{align*}

By assumption (A5) and the independence between $a_{k\eta}$ and $Y_{k\eta}$ , we also have

\begin{align*} E\left[\left|\widetilde{G}(Y_{k\eta})-{G}\left(Y_{k\eta},a_{k\eta}\right)\right|^{2}\right]\le 2\delta_{\mathbf{v},2}E\left[\left|Y_{k\eta}\right|^{2}\right]+2\delta_{\mathbf{v},0}. \end{align*}

Hence, for $\gamma\,:\!=\,1-2\eta \tilde{m}+2\eta^{2}((\omega_{\widetilde{G}}(1))^{2}+\delta_{\mathbf{v},2})<1$ , we have that

\begin{align*} E\left[\left|Y_{(k+1)\eta}\right|^{2}\right] &\le \gamma E\left[\left|Y_{k\eta}\right|^{2}\right]+2\eta \tilde{b}+2\eta^{2}\left\|\widetilde{G}\right\|_{\mathbb{M}}^{2}+2\eta^{2}\delta_{\mathbf{v},0}+\frac{2\eta d}{\beta}. \end{align*}

If $\gamma\le 0$ , then it is obvious that

\begin{align*} E\left[\left|Y_{(k+1)\eta}\right|^{2}\right]\le 2\tilde{b}+2\left\|\widetilde{G}\right\|_{\mathbb{M}}^{2}+2\delta_{\mathbf{v},0}+\frac{2d}{\beta}, \end{align*}

and if $\gamma\in(0,1)$ ,

\begin{align*} E\left[\left|Y_{k\eta}\right|^{2}\right]&\le \gamma^{k}E\left[\left|Y_{0}\right|^{2}\right]+\frac{2\eta \tilde{b}+2\eta^{2}\left\|\widetilde{G}\right\|_{\mathbb{M}}^{2}+2\eta^{2}\delta_{\mathbf{v},0}+\frac{2\eta d}{\beta}}{2\eta \tilde{m}-2\eta^{2}\left(\left(\omega_{\widetilde{G}}(1)\right)^{2}+\delta_{\mathbf{v},2}\right)}\\ &\le E\left[\left|Y_{0}\right|^{2}\right]+\frac{2\tilde{b}+2\eta\left\|\widetilde{G}\right\|_{\mathbb{M}}^{2}+2\eta \delta_{\mathbf{v},0}+\frac{2d}{\beta}}{2\tilde{m}-2\eta\left(\left(\omega_{\widetilde{G}}(1)\right)^{2}+\delta_{\mathbf{v},2}\right)}\\ &\le \kappa_{0}+\frac{2}{\tilde{m}}\left(\tilde{b}+\left\|\widetilde{G}\right\|_{\mathbb{M}}^{2}+\delta_{\mathbf{v},0}+\frac{d}{\beta}\right) \end{align*}

since Jensen’s inequality yields $E[|\xi|^{2}]\le \log E[\exp(|\xi|^{2})]= \kappa_{0}$ .
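As a numerical sanity check (not part of the proof), the discrete Gronwall-type step above can be illustrated directly: iterating a contraction $x_{k+1}=\gamma x_{k}+c$ with $\gamma\in(0,1)$, which mimics the recursion for $E[|Y_{k\eta}|^{2}]$, stays below the uniform-in-$k$ bound $x_{0}+c/(1-\gamma)$. The constants below are arbitrary illustrative values, not quantities from the paper.

```python
# Illustrative constants (not from the paper): a contraction factor
# gamma in (0, 1) and a per-step increment c > 0, mimicking
# E|Y_{(k+1)eta}|^2 <= gamma * E|Y_{k eta}|^2 + c.
gamma, c, x0 = 0.97, 0.5, 3.0

x = x0
for k in range(10_000):
    x = gamma * x + c
    # Unrolling the recursion gives x_k <= gamma^k * x0 + c / (1 - gamma),
    # hence the uniform-in-k bound used in the proof of Lemma 5.1.
    assert x <= x0 + c / (1.0 - gamma) + 1e-12
```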

(2) Itô’s formula yields

\begin{align*} \mathrm{e}^{2\bar{m}t}\left|\bar{X}_{t}^{r}\right|^{2}&=\left|\bar{X}_{0}^{r}\right|^{2}+\int_{0}^{t}\left(\mathrm{e}^{2\bar{m}s}\left\langle -\nabla \bar{U}_{r}\left(\bar{X}_{s}^{r}\right),2\bar{X}_{s}^{r}\right\rangle+\mathrm{e}^{2\bar{m}s}\frac{2d}{\beta}+2\bar{m}\mathrm{e}^{2\bar{m}s}\left|\bar{X}_{s}^{r}\right|^{2}\right)\text{d} s\\ &\quad+\sqrt{2\beta^{-1}}\int_{0}^{t}\mathrm{e}^{2\bar{m}s}\left\langle 2\bar{X}_{s}^{r},\text{d} B_{s}\right\rangle. \end{align*}

The dissipativity and the martingale property of the last term lead to

\begin{align*} E\left[\left|\bar{X}_{t}^{r}\right|^{2}\right]&=\mathrm{e}^{-2\bar{m}t}E\left[\left|\xi\right|^{2}\right]\\ &\quad+2\int_{0}^{t}\mathrm{e}^{2\bar{m}(s-t)}\left(E\left[\left\langle -\nabla \bar{U}_{r}\left(\bar{X}_{s}^{r}\right),\bar{X}_{s}^{r}\right\rangle+\bar{m}\left|\bar{X}_{s}^{r}\right|^{2}\right]+\frac{d}{\beta}\right)\text{d} s\\ &\le \mathrm{e}^{-2\bar{m}t}E\left[\left| \xi\right|^{2}\right]+2\int_{0}^{t}\mathrm{e}^{2\bar{m}(s-t)}\left(E\left[-\bar{m}\left|\bar{X}_{s}^{r}\right|^{2}+\bar{b}+\bar{m}\left|\bar{X}_{s}^{r}\right|^{2}\right]+\frac{d}{\beta}\right)\text{d} s\\ &\le \mathrm{e}^{-2\bar{m}t}\kappa_{0}+\frac{\bar{b}+d/\beta}{\bar{m}}\left(1-\mathrm{e}^{-2\bar{m}t}\right), \end{align*}

and we are done.
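The time integral in the last display evaluates in closed form, $2\int_{0}^{t}\mathrm{e}^{2\bar{m}(s-t)}c\,\text{d} s=(c/\bar{m})(1-\mathrm{e}^{-2\bar{m}t})$; the following quick numerical check uses arbitrary illustrative constants standing in for $\bar{m}$ and $\bar{b}+d/\beta$.

```python
import math

# Illustrative constants (not from the paper): m plays the role of \bar{m},
# c plays the role of \bar{b} + d/beta.
m, c, t = 0.8, 2.5, 3.0

# Midpoint Riemann sum of 2 * int_0^t exp(2m(s - t)) * c ds.
n = 100_000
h = t / n
num = sum(2.0 * math.exp(2.0 * m * ((i + 0.5) * h - t)) * c * h
          for i in range(n))
closed = (c / m) * (1.0 - math.exp(-2.0 * m * t))
assert abs(num - closed) < 1e-6
```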

Lemma 5.2. (Exponential integrability of mollified Langevin dynamics.) Assume (A1)–(A4) and (A6). For all $r\in(0,1]$ and $\alpha\in(0,\beta \bar{m}/2)$ such that $E[\exp(\alpha|\xi|^{2})]<\infty$ ,

\begin{align*} E\left[\exp\left(\alpha |\bar{X}_{t}^{r}|^{2}\right)\right]&\le E\left[\mathrm{e}^{\alpha |\xi|^{2}}\right]\mathrm{e}^{-2\alpha \left(\bar{b}+d/\beta\right)t}+2 \mathrm{e}^{\frac{2\alpha(\bar{b}+d/\beta)}{\bar{m}-2\alpha/\beta}}\left(1- \mathrm{e}^{-2\alpha \left(\bar{b}+d/\beta\right)t}\right). \end{align*}

In particular, for $\alpha=1\wedge (\beta \bar{m}/4)$ ,

\begin{align*} \sup_{t\ge0}\log E\left[\mathrm{e}^{\alpha|\bar{X}_{t}^{r}|^{2}}\right]&\le \log\left(\mathrm{e}^{\alpha\kappa_{0}}\vee 2\mathrm{e}^{4\alpha(\bar{b}+d/\beta)/\bar{m}}\right)\le \alpha\kappa_{0}+\frac{4\alpha(\bar{b}+d/\beta)}{\bar{m}}+1. \end{align*}

Proof. Let $V_{\alpha}(x)\,:\!=\,\exp(\alpha|x|^{2})$ . Note that

\begin{align*} \nabla V_{\alpha}(x)= 2\alpha V_{\alpha}(x)x,\quad \nabla^{2} V_{\alpha}(x)=4\alpha^{2} V_{\alpha}(x)xx^{\top}+2\alpha V_{\alpha}(x)I_{d}. \end{align*}

Let $\bar{\mathcal{L}}^{r}$ denote the extended generator of $\bar{X}_{t}^{r}$ such that $\bar{\mathcal{L}}^{r}f\,:\!=\,\beta^{-1}\Delta f-\langle \nabla \bar{U}_{r}, \nabla f\rangle$ for $f\in\mathcal{C}^{2}(\mathbb{R}^{d})$ . We have that

\begin{align*} \bar{\mathcal{L}}^{r}V_{\alpha}(x)&\le -2\alpha V_{\alpha}(x)\left\langle \nabla \bar{U}_{r}(x),x\right\rangle +2(\alpha/\beta)V_{\alpha}(x)\left(2\alpha\left|x\right|^{2}+d\right)\\ &\le 2\alpha V_{\alpha}(x)\left(\left(-\bar{m}\left|x\right|^{2}+\bar{b}\right) +\left(2\alpha\left|x\right|^{2}+d\right)/\beta\right)\\ &=2\alpha V_{\alpha}(x)\left(\left(2\alpha/\beta-\bar{m}\right)\left|x\right|^{2}+\bar{b}+ d/\beta\right). \end{align*}

Let $R^{2}= 2(\bar{b}+d/\beta)/(\bar{m}-2\alpha/\beta)$ be a fixed constant. We obtain that for all $x\in\mathbb{R}^{d}$ with $|x|\ge R$ ,

\begin{align*} \bar{\mathcal{L}}^{r}V_{\alpha}(x)&\le -2\alpha \left(\bar{b}+ d/\beta\right)V_{\alpha}(x), \end{align*}

and trivially for all $x\in\mathbb{R}^{d}$ with $|x|<R$ ,

\begin{align*} \bar{\mathcal{L}}^{r}V_{\alpha}(x)&\le 2\alpha \mathrm{e}^{2\alpha(\bar{b}+d/\beta)/(\bar{m}-2\alpha/\beta)}\left(\bar{b}+d/\beta\right)\\ &\le 4\alpha \mathrm{e}^{2\alpha(\bar{b}+d/\beta)/(\bar{m}-2\alpha/\beta)}\left(\bar{b}+d/\beta\right)-2\alpha \left(\bar{b}+ d/\beta\right)V_{\alpha}(x). \end{align*}

Thus, we have for all $x\in\mathbb{R}^{d}$ ,

\begin{align*} \bar{\mathcal{L}}^{r}V_{\alpha}(x)\le -2\alpha \left(\bar{b}+ d/\beta\right)V_{\alpha}(x)+4\alpha \mathrm{e}^{2\alpha(\bar{b}+d/\beta)/(\bar{m}-2\alpha/\beta)}\left(\bar{b}+d/\beta\right). \end{align*}

By Itô’s formula, there exists a sequence of stopping times $\{\sigma_{n}\in[0,\infty)\}_{n\in\mathbb{N}}$ with $\sigma_{n}<\sigma_{n+1}$ for all $n\in\mathbb{N}$ and $\sigma_{n}\uparrow \infty$ as $n\to\infty$ almost surely such that for all $n\in\mathbb{N}$ and $t\ge0$ ,

\begin{align*} &E\left[\mathrm{e}^{2\alpha \left(\bar{b}+ d/\beta\right)(t\wedge\sigma_{n})}V_{\alpha}\left(\bar{X}_{t\wedge\sigma_{n}}^{r}\right)\right]\\ &\quad= E\left[V_{\alpha}\left(\bar{X}_{0}^{r}\right)\right]\\ &\qquad+E\left[\int_{0}^{t\wedge\sigma_{n}}\left(\mathrm{e}^{2\alpha \left(\bar{b}+ d/\beta\right)s}\bar{\mathcal{L}}^{r}V_{\alpha}\left(\bar{X}_{s}^{r}\right)+2\alpha \left(\bar{b}+ d/\beta\right)\mathrm{e}^{2\alpha \left(\bar{b}+ d/\beta\right)s}V_{\alpha}\left(\bar{X}_{s}^{r}\right)\right)\text{d} s\right]. \end{align*}

We have that

\begin{align*} &E\left[\int_{0}^{t\wedge\sigma_{n}}\left(\mathrm{e}^{2\alpha \left(\bar{b}+ d/\beta\right)s}\bar{\mathcal{L}}^{r}V_{\alpha}\left(\bar{X}_{s}^{r}\right)+2\alpha \left(\bar{b}+ d/\beta\right)\mathrm{e}^{2\alpha \left(\bar{b}+ d/\beta\right)s}V_{\alpha}\left(\bar{X}_{s}^{r}\right)\right)\text{d} s\right]\\ &\quad\le 4\alpha \mathrm{e}^{2\alpha(\bar{b}+d/\beta)/(\bar{m}-2\alpha/\beta)}\left(\bar{b}+d/\beta\right)E\left[\int_{0}^{t\wedge\sigma_{n}}\mathrm{e}^{2\alpha \left(\bar{b}+ d/\beta\right)s}\text{d} s\right]\\ &\quad\le 4\alpha \mathrm{e}^{2\alpha(\bar{b}+d/\beta)/(\bar{m}-2\alpha/\beta)}\left(\bar{b}+d/\beta\right)\int_{0}^{t}\mathrm{e}^{2\alpha \left(\bar{b}+ d/\beta\right)s}\text{d} s, \end{align*}

and thus Fatou’s lemma gives

\begin{align*} &E\left[\mathrm{e}^{2\alpha \left(\bar{b}+ d/\beta\right)t}V_{\alpha}\left(\bar{X}_{t}^{r}\right)\right]\\ &\quad=E\left[\lim_{n\to\infty} \mathrm{e}^{2\alpha \left(\bar{b}+ d/\beta\right)(t\wedge\sigma_{n})}V_{\alpha}\left(\bar{X}_{t\wedge\sigma_{n}}^{r}\right)\right]\\ &\quad\le \liminf_{n\to\infty}E\left[\mathrm{e}^{2\alpha \left(\bar{b}+ d/\beta\right)(t\wedge\sigma_{n})}V_{\alpha}\left(\bar{X}_{t\wedge\sigma_{n}}^{r}\right)\right]\\ &\quad\le E\left[V_{\alpha}\left(\bar{X}_{0}^{r}\right)\right]+4\alpha \mathrm{e}^{2\alpha(\bar{b}+d/\beta)/(\bar{m}-2\alpha/\beta)}\left(\bar{b}+d/\beta\right)\int_{0}^{t}\mathrm{e}^{2\alpha \left(\bar{b}+ d/\beta\right)s}\text{d} s. \end{align*}

Therefore,

\begin{align*} E\left[V_{\alpha}\left(\bar{X}_{t}^{r}\right)\right]&\le E\left[\mathrm{e}^{\alpha \left|\bar{X}_{0}^{r}\right|^{2}}\right]\mathrm{e}^{-2\alpha \left(\bar{b}+ d/\beta\right)t}+2 \mathrm{e}^{2\alpha(\bar{b}+d/\beta)/(\bar{m}-2\alpha/\beta)}\left(1- \mathrm{e}^{-2\alpha \left(\bar{b}+ d/\beta\right)t}\right) \end{align*}

and we obtain the desired conclusion.
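The elementary estimate behind the "in particular" bound of Lemma 5.2, namely $\log(\mathrm{e}^{u}\vee 2\mathrm{e}^{v})=\max\{u,v+\log 2\}\le u+v+1$ for $u,v\ge 0$, admits a quick numerical sanity check; the sampled values of $u$ and $v$ are arbitrary and merely stand in for $\alpha\kappa_{0}$ and $4\alpha(\bar{b}+d/\beta)/\bar{m}$.

```python
import math
import random

random.seed(0)
for _ in range(1000):
    u = random.uniform(0.0, 50.0)   # stands in for alpha * kappa_0
    v = random.uniform(0.0, 50.0)   # stands in for 4*alpha*(b + d/beta)/m
    lhs = math.log(max(math.exp(u), 2.0 * math.exp(v)))
    # log of the larger of e^u and 2e^v equals max(u, v + log 2),
    # which is at most u + v + 1 since log 2 < 1 and u, v >= 0.
    assert lhs <= u + v + 1.0
```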

5.2. Poincaré inequalities for distributions with mollified potentials

Let $\bar{L}^{r}$ be an operator such that $\bar{L}^{r}f\,:\!=\,\Delta f-\beta\langle \nabla \bar{U}_{r},\nabla f\rangle $ for all $f\in\mathcal{C}^{2}(\mathbb{R}^{d})$ . Note that Lemma 4.7 yields $\bar{U}_{r}\in\mathcal{C}^{2}(\mathbb{R}^{d})$ .

Lemma 5.3. (A bound for the constant of a Poincaré inequality for $\bar{\pi}^{r}$ .) Under (A1)–(A4), for some absolute constant $a>0$ , for all $r\in(0,1]$ ,

\begin{align*} c_\textrm{P}(\bar{\pi}^{r})&\le \frac{2}{\bar{m}\beta\left(d+\bar{b}\beta\right)}+\frac{4a\left(d+\bar{b}\beta\right)}{\bar{m}\beta}\exp\left(\beta\left(\frac{3}{2}\left\|\nabla U\right\|_{\mathbb{M}}\left(1+\frac{4\left(d+\bar{b}\beta\right)}{\bar{m}\beta}\right)+U_{0}\right)\right), \end{align*}

where $U_{0}\,:\!=\,\left\|U\right\|_{L^{\infty}(B_{1}(\mathbf{0}))}<\infty$ .

Remark 5.1. Note that this upper bound is independent of $r$ .

Proof. We adapt the discussion of [Reference Raginsky, Rakhlin and Telgarsky31]. We set a Lyapunov function $V(x)=\mathrm{e}^{\bar{m}\beta|x|^{2}/4}$ . Since $-\langle \nabla \bar{U}_{r}(x),x\rangle \le -\bar{m}|x|^{2}+\bar{b}$ for all $x\in\mathbb{R}^{d}$ , we have that

\begin{align*} \bar{L}^{r}V(x)&=\left(\frac{d\bar{m}\beta}{2}+\frac{\left(\bar{m}\beta\right)^{2}}{4}\left|x\right|^{2}-\frac{\bar{m}\beta^{2}}{2}\left\langle \nabla \bar{U}_{r}\left(x\right),x\right\rangle \right)V(x)\\ &\le \left(\frac{d\bar{m}\beta}{2}+\frac{\left(\bar{m}\beta\right)^{2}}{4}\left|x\right|^{2}-\frac{\bar{m}^{2}\beta^{2}}{2}\left|x\right|^{2}+\frac{\bar{m}\beta^{2}\bar{b}}{2}\right)V(x)\\ &= \left(\frac{\bar{m}\beta\left(d+\bar{b}\beta\right)}{2}-\frac{\bar{m}^{2}\beta^{2}}{4}|x|^{2}\right)V(x). \end{align*}

We fix the constants

\begin{align*} \kappa=\frac{\bar{m}\beta\left(d+\bar{b}\beta\right)}{2},\quad \gamma=\frac{\bar{m}^{2}\beta^{2}}{4},\quad \tilde{R}^{2}=\frac{2\kappa}{\gamma}=\frac{4\left(d+\bar{b}\beta\right)}{\bar{m}\beta}. \end{align*}

Lemma 4.6, $2t\le t^{2}+1$ for $t>0$ , and $U(x)\ge 0$ give $\left\|U\right\|_{L^{\infty}(B_{1}(\mathbf{0}))}<\infty$ and

\begin{align*} \text{Osc}_{\tilde{R}}\left(\beta \bar{U}_{r}\right)&\le \beta\left(\frac{\omega_{\nabla U}(1)}{2}\tilde{R}^{2}+2\left\|\nabla U\right\|_{\mathbb{M}}\tilde{R}+\left\|U\right\|_{L^{\infty}(B_{1}(\mathbf{0}))}\right)\\ &\le \beta\left(\left(\left\|\nabla U\right\|_{\mathbb{M}}+\frac{\omega_{\nabla U}(1)}{2}\right)\tilde{R}^{2}+\left\|\nabla U\right\|_{\mathbb{M}}+\left\|U\right\|_{L^{\infty}(B_{1}(\mathbf{0}))}\right)\\ &\le \beta\left(\frac{3}{2}\left\|\nabla U\right\|_{\mathbb{M}}\tilde{R}^{2}+\left\|\nabla U\right\|_{\mathbb{M}}+\left\|U\right\|_{L^{\infty}(B_{1}(\mathbf{0}))}\right)\\ &\le \beta\left(\frac{3}{2}\left\|\nabla U\right\|_{\mathbb{M}}\left(1+\tilde{R}^{2}\right)+\left\|U\right\|_{L^{\infty}(B_{1}(\mathbf{0}))}\right). \end{align*}

Proposition 4.4 with $\lambda_{0}=\kappa_{0}=\kappa$ yields that for some absolute constant $a>0$ ,

\begin{align*} c_\textrm{P}\left(\bar{\pi}^{r}\right)&\le \frac{2}{\bar{m}\beta\left(d+\bar{b}\beta\right)}\left(1+a\frac{\bar{m}\beta\left(d+\bar{b}\beta\right)}{2}\frac{4\left(d+\bar{b}\beta\right)}{\bar{m}\beta}\mathrm{e}^{\text{Osc}_{\tilde{R}}\left(\beta \bar{U}_{r}\right)}\right)\\ &=\frac{2}{\bar{m}\beta\left(d+\bar{b}\beta\right)}+\frac{4a\left(d+\bar{b}\beta\right)}{\bar{m}\beta}\exp\left(\beta\left(\frac{3}{2}\left\|\nabla U\right\|_{\mathbb{M}}\left(1+\frac{4\left(d+\bar{b}\beta\right)}{\bar{m}\beta}\right)+U_{0}\right)\right). \end{align*}

Hence, the statement holds true.

5.3. Kullback–Leibler and $\chi^{2}$ -divergences

Lemma 5.4. Under (A1)–(A6), for any $k\in\mathbb{N}$ and $\eta\in(0,1\wedge \tilde{m}/(2((\omega_{\widetilde{G}}(1))^{2}+\delta_{\mathbf{v},2}))]$ , we have that

\begin{align*} D(\mu_{k\eta}\|\bar{\nu}_{k\eta}^{r})\le \left(C_{0}\frac{\omega_{\nabla U}(r)}{r}\eta+\beta\left(\delta_{r,2}\kappa_{\infty}+\delta_{r,0}\right)\right)k\eta, \end{align*}

where $C_{0}$ is a positive constant such that

\begin{align*} C_{0}&=\left(d+2\right)\left(\frac{\beta}{3}\left(\left\|\widetilde{G}\right\|_{\mathbb{M}}^{2}+\delta_{\mathbf{v},0}+\left((\omega_{\widetilde{G}}(1))^{2}+\delta_{\mathbf{v},2}\right)\kappa_{\infty}\right)+\frac{d}{2}\right). \end{align*}

Proof. We set $A_{t}\,:\!=\,\{\mathbf{a}_{s}=a_{\lfloor s/\eta\rfloor \eta} \, : \, a_{i\eta}\in A,i=0,\ldots,\lfloor t/\eta\rfloor,s\le t\}$ with $t\le k\eta$ , and $\mathcal{A}_{t}\,:\!=\,\sigma(\{\mathbf{a}\in A_{t} \, : \, \mathbf{a}_{s_{j}}\in C_{j},j=1,\ldots,n\} \, : \, s_{j}\in[0,t],C_{j}\in\mathcal{A},n\in\mathbb{N})$ . Let $P_{k\eta}$ and $Q_{k\eta}$ denote the probability measures on $(A_{k\eta}\times W_{k\eta},\mathcal{A}_{k\eta}\times \mathcal{W}_{k\eta})$ of $\{(a_{\lfloor t/\eta\rfloor \eta},Y_{t}) \, : \, 0\le t\le k\eta\}$ and $\{(a_{\lfloor t/\eta\rfloor \eta},\bar{X}_{t}^{r}) \, : \, 0\le t\le k\eta\}$ , respectively. Note that $\bar{X}_{t}^{r}$ is the unique strong solution to Eq. (10), which exists for any $r>0$ since $\nabla \bar{U}_{r}$ is Lipschitz continuous by Lemma 4.8. We obtain

\begin{align*} &\frac{\beta}{4}E\left[\int_{0}^{k\eta}\left|\nabla \bar{U}_{r}\left(Y_{t}\right)-G\left(Y_{\lfloor t/\eta\rfloor \eta},a_{\lfloor t/\eta\rfloor\eta}\right)\right|^{2}\text{d} t\right]\\ &\quad\le \frac{\beta}{2}\sum_{j=0}^{k-1}E\left[\int_{j\eta}^{(j+1)\eta}\left|\nabla \bar{U}_{r}\left(Y_{t}\right)-\nabla \bar{U}_{r}\left(Y_{\lfloor t/\eta\rfloor \eta}\right)\right|^{2}\text{d} t\right]\\ &\qquad+\frac{\beta}{2}\sum_{j=0}^{k-1}E\left[\int_{j\eta}^{(j+1)\eta}\left|\nabla \bar{U}_{r}\left(Y_{\lfloor t/\eta\rfloor \eta}\right)-G\left(Y_{\lfloor t/\eta\rfloor \eta},a_{\lfloor t/\eta\rfloor\eta}\right)\right|^{2}\text{d} t\right]\\ &\quad\le \frac{\beta}{2}\frac{\left(d+2\right)\omega_{\nabla U}(r)}{r}\sum_{j=0}^{k-1}E\left[\int_{j\eta}^{(j+1)\eta}\left|Y_{t}-Y_{\lfloor t/\eta\rfloor \eta}\right|^{2}\text{d} t\right]\\ &\qquad+\frac{\beta}{2}\sum_{j=0}^{k-1}E\left[\eta\left|\nabla \bar{U}_{r}\left(Y_{j \eta}\right)-G\left(Y_{j\eta},a_{j\eta}\right)\right|^{2}\right]. \end{align*}

Note that $E[\langle G(Y_{j\eta},a_{j\eta})-\widetilde{G}(Y_{j\eta}), f(Y_{j\eta})\rangle ]=0$ for any measurable $f:\mathbb{R}^{d}\to\mathbb{R}^{d}$ of linear growth since $Y_{j\eta}$ is square integrable by Lemma 5.1 and $\sigma(Y_{(j-1)\eta},a_{(j-1)\eta},B_{j\eta}-B_{(j-1)\eta})$ -measurable, and $a_{j\eta}$ is independent of this $\sigma$ -algebra. For all $t\in[j\eta,(j+1)\eta)$ , by Lemmas 4.4 and 5.1,

\begin{align*} &E\left[\left|Y_{t}-Y_{\lfloor t/\eta\rfloor \eta}\right|^{2}\right]\\ &\quad=E\left[\left|-\left(t-j\eta\right)G\left(Y_{j\eta},a_{j\eta}\right)+\sqrt{2\beta^{-1}}\left(B_{t}-B_{j\eta}\right)\right|^{2}\right]\\ &\quad=\left(t-j\eta\right)^{2}E\left[\left|G\left(Y_{j\eta},a_{j\eta}\right)-\widetilde{G}\left(Y_{j\eta}\right)+\widetilde{G}\left(Y_{j\eta}\right)\right|^{2}\right]+2\beta^{-1}E\left[\left|B_{t}-B_{j\eta}\right|^{2}\right]\\ &\quad\le \left(t-j\eta\right)^{2}\left(2\delta_{\mathbf{v},2}E\left[\left|Y_{j\eta}\right|^{2}\right]+2\delta_{\mathbf{v},0}+E\left[\left|\widetilde{G}\left(Y_{j\eta}\right)\right|^{2}\right]\right)+2\beta^{-1}d\left(t-j\eta\right)\\ &\quad\le 2\left(t-j\eta\right)^{2}\left(\left\|\widetilde{G}\right\|_{\mathbb{M}}^{2}+\delta_{\mathbf{v},0}+\left((\omega_{\widetilde{G}}(1))^{2}+\delta_{\mathbf{v},2}\right)E\left[\left|Y_{j\eta}\right|^{2}\right]\right)+2\beta^{-1}d\left(t-j\eta\right)\\ &\quad\le 2\left(t-j\eta\right)^{2}\left(\left\|\widetilde{G}\right\|_{\mathbb{M}}^{2}+\delta_{\mathbf{v},0}+\left((\omega_{\widetilde{G}}(1))^{2}+\delta_{\mathbf{v},2}\right)\kappa_{\infty}\right)+2\beta^{-1}d\left(t-j\eta\right)\\ &\quad = \! : \, 2\left(t-j\eta\right)^{2}C'+2\beta^{-1}d\left(t-j\eta\right) \end{align*}

and thus

\begin{align*} \sum_{j=0}^{k-1}E\left[\int_{j\eta}^{(j+1)\eta}\left|Y_{t}-Y_{\lfloor t/\eta\rfloor \eta}\right|^{2}\text{d} t\right]\le \left(\frac{2C'\eta^{3}}{3}+\frac{d\eta^{2}}{\beta}\right)k\le \left(\frac{2C'}{3}+\frac{d}{\beta}\right)k\eta^{2}. \end{align*}

We have that

\begin{align*} E\left[\left|\nabla \bar{U}_{r}\left(Y_{j\eta}\right)-G\left(Y_{j\eta},a_{j\eta}\right)\right|^{2}\right] &\le E\left[2\left(\delta_{\mathbf{b},r,2}+\delta_{\mathbf{v},2}\right)\left|Y_{j\eta}\right|^{2}+2\left(\delta_{\mathbf{b},r,0}+\delta_{\mathbf{v},0}\right)\right]\\ &\le 2\delta_{r,2}\kappa_{\infty}+2\delta_{r,0}. \end{align*}

Assumptions (LS1)–(LS4) of Propositions 4.2 and 4.3 are satisfied owing to (A1)–(A6), Lemma 5.1, and the linear growth of $\widetilde{G}(w_{\lfloor \cdot /\eta\rfloor \eta})$ in $\max_{i=0,\ldots,k}|w_{i\eta}|$ and of $\nabla \bar{U}_{r}(w_{t})$ in $|w_{t}|$ . Therefore, the data-processing inequality and Proposition 4.3 give

\begin{align*} D(\mu_{k\eta}\|\bar{\nu}_{k\eta}^{r}) &\le \int\log\left(\frac{\text{d} P_{k\eta}}{\text{d} Q_{k\eta}}\right)\text{d} P_{k\eta}\\ &= \frac{\beta}{4}E\left[\int_{0}^{k\eta}\left|\nabla \bar{U}_{r}\left(Y_{t}\right)-G\left(Y_{\lfloor t/\eta\rfloor \eta},a_{\lfloor t/\eta\rfloor\eta}\right)\right|^{2}\text{d} t\right]\\ &\le \frac{(d+2)\omega_{\nabla U}(r)}{r}\left(\frac{C'\beta}{3}+\frac{d}{2}\right)k\eta^{2}+\beta\left(\delta_{r,2}\kappa_{\infty}+\delta_{r,0}\right)k\eta\\ &=\left(\frac{(d+2)\omega_{\nabla U}(r)}{r}\left(\frac{C'\beta}{3}+\frac{d}{2}\right)\eta+\beta\left(\delta_{r,2}\kappa_{\infty}+\delta_{r,0}\right)\right)k\eta\\ &=\left(C_{0}\frac{\omega_{\nabla U}(r)}{r}\eta+\beta\left(\delta_{r,2}\kappa_{\infty}+\delta_{r,0}\right)\right)k\eta. \end{align*}

This is the desired conclusion.

Lemma 5.5. (Lemma 2 of [Reference Raginsky, Rakhlin and Telgarsky31].) Under (A1) and (A4), for almost all $x\in\mathbb{R}^{d}$ ,

\begin{align*} U(x)\ge \frac{m}{3}|x|^{2}-\frac{b}{2}\log3. \end{align*}

Proof. The proof is adapted from Lemma 2 of [Reference Raginsky, Rakhlin and Telgarsky31]. We first fix $c\in(0,1]$ . Since the set $\{x\in\mathbb{R}^{d} \, : \, \text{Eq.~(6) fails at }x\text{ or at }cx\}$ is null, for almost all $x\in\mathbb{R}^{d}$ ,

\begin{align*} U(x)&=U(cx)+\int_{0}^{1}\left\langle \nabla U(cx+t(x-cx)),x-cx\right\rangle \text{d} t\\ &\ge \int_{0}^{1}\left\langle \nabla U((c+t(1-c))x),(1-c)x\right\rangle \text{d} t\\ &=\int_{0}^{1}\frac{1-c}{c+t(1-c)}\left\langle \nabla U((c+t(1-c))x),(c+t(1-c))x\right\rangle \text{d} t\\ &\ge \int_{0}^{1}\frac{1-c}{c+t(1-c)}\left(m(c+t(1-c))^{2}|x|^{2}-b\right) \text{d} t\\ &=\int_{c}^{1}\frac{1}{s}\left(ms^{2}|x|^{2}-b\right) \text{d} s\\ &=\frac{1-c^{2}}{2}m|x|^{2}+b\log c. \end{align*}

Here, $s=c+t(1-c)$ and thus $\text{d} t=(1-c)^{-1}\text{d} s$ . Then $c=1/\sqrt{3}$ yields the conclusion.
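The change of variables $s=c+t(1-c)$ can be checked numerically: a crude Riemann sum of the integrand above, with arbitrary illustrative values of $m$, $b$, and $|x|^{2}$ (none taken from a specific model) and $c=1/\sqrt{3}$, reproduces the closed form $\tfrac{1-c^{2}}{2}m|x|^{2}+b\log c$ and hence the constants of Lemma 5.5.

```python
import math

# Illustrative values (not from the paper); x2 plays the role of |x|^2.
m, b, x2 = 1.3, 0.7, 4.0
c = 1.0 / math.sqrt(3.0)

# Midpoint Riemann sum of int_c^1 (m s^2 x2 - b) / s ds.
n = 200_000
h = (1.0 - c) / n
num = sum((m * (c + (i + 0.5) * h) ** 2 * x2 - b) / (c + (i + 0.5) * h) * h
          for i in range(n))

closed = 0.5 * (1.0 - c ** 2) * m * x2 + b * math.log(c)
assert abs(num - closed) < 1e-8
# With c = 1/sqrt(3): (1 - c^2)/2 = 1/3 and log c = -(log 3)/2,
# which gives the constants m/3 and (b/2) log 3 in Lemma 5.5.
assert abs(closed - (m * x2 / 3.0 - 0.5 * b * math.log(3.0))) < 1e-12
```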

Lemma 5.6. Under (A1)–(A4) and (A6), for all $r\in(0,1]$ , we have that

\begin{align*} \chi^{2}(\mu_{0}\|\bar{\pi}^{r})\le 3^{\beta b/2}\left(\frac{3\psi_{2}}{m\beta}\right)^{d/2}\exp\left(\beta\left(2\left\|\nabla U\right\|_{\mathbb{M}}+U_{0}\right)+2\psi_{0}\right). \end{align*}

Proof. The density of $\bar{\pi}^{r}$ is given as $(\text{d}\bar{\pi}^{r}/\text{d} x)(x)=\bar{\mathcal{Z}}^{r}(\beta)^{-1}\mathrm{e}^{-\beta \bar{U}_{r}\left(x\right)}$ and

\begin{align*} \chi^{2}(\mu_{0}\|\bar{\pi}^{r})&=\int_{\mathbb{R}^{d}}\left(\left(\frac{\left(\int_{\mathbb{R}^{d}}\mathrm{e}^{-\Psi(x)}\text{d} x\right)^{-1}\mathrm{e}^{-\Psi(x)}}{\bar{\mathcal{Z}}^{r}(\beta)^{-1}\mathrm{e}^{-\beta \bar{U}_{r}\left(x\right)}}\right)^{2}-1\right)\bar{\mathcal{Z}}^{r}(\beta)^{-1}\mathrm{e}^{-\beta \bar{U}_{r}\left(x\right)}\text{d} x\\ &=\frac{\bar{\mathcal{Z}}^{r}(\beta)}{\int_{\mathbb{R}^{d}}\mathrm{e}^{-\Psi(x)}\text{d} x}\int_{\mathbb{R}^{d}}\mathrm{e}^{\beta \bar{U}_{r}\left(x\right)-\Psi\left(x\right)}\mu_{0}(\text{d} x)-1\\ &\le \frac{\mathrm{e}^{\psi_{0}}\bar{\mathcal{Z}}^{r}(\beta)}{\int_{\mathbb{R}^{d}}\mathrm{e}^{-\psi_{2}|x|^{2}}\text{d} x}\int_{\mathbb{R}^{d}}\mathrm{e}^{\beta\left(\frac{\omega_{\nabla U}(1)}{2}\left|x\right|^{2}+2\left\|\nabla U\right\|_{\mathbb{M}}\left|x\right|+\left\|U\right\|_{L^{\infty}\left(B_{1}\left(\mathbf{0}\right)\right)}\right)-\Psi(x)}\mu_{0}(\text{d} x)\\ &\le \frac{\mathrm{e}^{\psi_{0}}\bar{\mathcal{Z}}^{r}(\beta)}{(\pi/\psi_{2})^{d/2}}\int_{\mathbb{R}^{d}}\mathrm{e}^{\beta\left(\left\|\nabla U\right\|_{\mathbb{M}}\left|x\right|^{2}+2\left\|\nabla U\right\|_{\mathbb{M}}+\left\|U\right\|_{L^{\infty}\left(B_{1}\left(\mathbf{0}\right)\right)}\right)-\Psi(x)}\mu_{0}(\text{d} x)\\ &\le \frac{\bar{\mathcal{Z}}^{r}(\beta)}{(\pi/\psi_{2})^{d/2}}\mathrm{e}^{\beta\left(2\left\|\nabla U\right\|_{\mathbb{M}}+\left\|U\right\|_{L^{\infty}\left(B_{1}\left(\mathbf{0}\right)\right)}\right)+2\psi_{0}} \end{align*}

by Lemma 4.6 and $2|x|\le |x|^{2}/2+2$ . Lemma 5.5, Jensen’s inequality, and the convexity of $\mathrm{e}^{-x}$ yield

\begin{align*} \bar{\mathcal{Z}}^{r}(\beta)&=\int_{\mathbb{R}^{d}}\mathrm{e}^{-\beta \bar{U}_{r}\left(x\right)}\text{d} x\\ &=\int_{\mathbb{R}^{d}}\mathrm{e}^{-\beta \int _{\mathbb{R}^{d}}U(x-y)\rho_{r}(y)\text{d} y}\text{d} x\\ &\le \int_{\mathbb{R}^{d}}\int_{\mathbb{R}^{d}}\mathrm{e}^{-\beta U(x-y)}\text{d} x\rho_{r}(y)\text{d} y\\ &\le \mathrm{e}^{\frac{1}{2}\beta b\log 3}\int_{\mathbb{R}^{d}}\int_{\mathbb{R}^{d}}\mathrm{e}^{-m\beta|x-y|^{2}/3}\text{d} x\rho_{r}(y)\text{d} y\\ &=3^{\beta b/2}\left(3\pi/m\beta\right)^{d/2}, \end{align*}

and we are done.
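The last equality only uses the Gaussian integral $\int_{\mathbb{R}^{d}}\mathrm{e}^{-a|x|^{2}}\text{d} x=(\pi/a)^{d/2}$; a one-dimensional numerical check, with an arbitrary illustrative value standing in for $m\beta/3$, is below.

```python
import math

a = 1.3 / 3.0   # illustrative m*beta/3 (with m*beta = 1.3; not from the paper)

# Riemann approximation of int_R exp(-a x^2) dx over a wide window;
# the integrand is negligible outside [-L, L].
n, L = 200_000, 25.0
h = 2.0 * L / n
approx = sum(math.exp(-a * (-L + i * h) ** 2) * h for i in range(n + 1))
assert abs(approx - math.sqrt(math.pi / a)) < 1e-6
```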

Lemma 5.7. (Kullback–Leibler divergence of Gibbs distributions.) Under (A1)–(A4) and (A6), we have that

\begin{align*} D\left(\pi\|\bar{\pi}^{r}\right)\le \beta r\omega_{\nabla U}(r). \end{align*}

Proof. The divergence of $\pi$ from $\bar{\pi}^{r}$ is

\begin{align*} D\left(\pi\|\bar{\pi}^{r}\right)&=\frac{1}{\mathcal{Z}(\beta)}\int \exp\left(-\beta U\left(x\right)\right)\log\left[\frac{\bar{\mathcal{Z}}^{r}(\beta)\exp\left(-\beta U\left(x\right)\right)}{\mathcal{Z}(\beta)\exp\left(-\beta \bar{U}_{r}\left(x\right)\right)}\right]\text{d} x\\ &=\frac{\beta}{\mathcal{Z}(\beta)}\int \exp\left(-\beta U\left(x\right)\right)\left(\bar{U}_{r}(x)-U\left(x\right)\right)\text{d} x+\left(\log\bar{\mathcal{Z}}^{r}(\beta)-\log\mathcal{Z}(\beta)\right).\end{align*}

Lemma 4.10 yields

\begin{align*} \frac{\beta}{\mathcal{Z}(\beta)}\int_{\mathbb{R}^{d}} \left(\bar{U}_{r}(x)-U\left(x\right)\right)\mathrm{e}^{-\beta U\left(x\right)}\text{d} x &\le \frac{\beta}{\mathcal{Z}(\beta)}\int_{\mathbb{R}^{d}} \left|\bar{U}_{r}(x)-U\left(x\right)\right|\mathrm{e}^{-\beta U\left(x\right)}\text{d} x\\ &\le \frac{\beta }{\mathcal{Z}(\beta)}\int_{\mathbb{R}^{d}} r\omega_{\nabla U}(r)\mathrm{e}^{-\beta U\left(x\right)}\text{d} x\\ &\le \beta r\omega_{\nabla U}(r).\end{align*}

Jensen’s inequality and Fubini’s theorem give

\begin{align*} \bar{\mathcal{Z}}^{r}\left(\beta\right)&=\int_{\mathbb{R}^{d}}\exp\left(-\beta \int_{\mathbb{R}^{d}}U\left(x-y\right)\rho_{r}\left(y\right)\text{d} y\right)\text{d} x\\ &\le \int_{\mathbb{R}^{d}}\int_{\mathbb{R}^{d}}\exp\left(-\beta U\left(x-y\right)\right)\rho_{r}\left(y\right)\text{d} y\text{d} x\\ &=\int_{\mathbb{R}^{d}}\int_{\mathbb{R}^{d}}\exp\left(-\beta U\left(x-y\right)\right)\text{d} x\rho_{r}\left(y\right)\text{d} y\\ &= \mathcal{Z}\left(\beta\right)\end{align*}

and thus $\log\bar{\mathcal{Z}}^{r}(\beta)-\log\mathcal{Z}(\beta)\le 0$ .
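The Jensen step above, $\mathrm{e}^{-\beta\int U(x-y)\rho_{r}(y)\text{d} y}\le \int \mathrm{e}^{-\beta U(x-y)}\rho_{r}(y)\text{d} y$, is the convexity of $t\mapsto \mathrm{e}^{-t}$; its discrete analogue can be sanity-checked with arbitrary values $u_{i}$ and nonnegative weights summing to one (both sampled at random, purely for illustration).

```python
import math
import random

random.seed(1)
# Arbitrary illustrative potential values and mollifier weights.
u = [random.uniform(0.0, 5.0) for _ in range(50)]
w = [random.uniform(0.0, 1.0) for _ in range(50)]
s = sum(w)
w = [wi / s for wi in w]           # weights sum to 1, like rho_r

beta = 2.0
lhs = math.exp(-beta * sum(wi * ui for wi, ui in zip(w, u)))
rhs = sum(wi * math.exp(-beta * ui) for wi, ui in zip(w, u))
assert lhs <= rhs                  # Jensen for the convex map t -> exp(-t)
```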

5.4. Proof of Theorem 2.1

We complete the proof of Theorem 2.1.

Proof of Theorem 2.1. We decompose the 2-Wasserstein distance as follows:

\begin{align*} \mathcal{W}_{2}(\mu_{k\eta},\pi)\le \underbrace{\mathcal{W}_{2}(\mu_{k\eta},\bar{\nu}_{k\eta}^{r})}_\textrm{(1)}+\underbrace{\mathcal{W}_{2}(\bar{\nu}_{k\eta}^{r},\bar{\pi}^{r})}_\textrm{(2)}+\underbrace{\mathcal{W}_{2}(\bar{\pi}^{r},\pi)}_\textrm{(3)}.\end{align*}
  (1) We first consider an upper bound for $\mathcal{W}_{2}(\mu_{k\eta},\bar{\nu}_{k\eta}^{r})$ . Proposition 4.6 gives

    \begin{align*} \mathcal{W}_{2}(\mu_{k\eta},\bar{\nu}_{k\eta}^{r})\le C_{\bar{\nu}_{k\eta}^{r}}\left(D\left(\mu_{k\eta}\|\bar{\nu}_{k\eta}^{r}\right)^{1/2}+\left(\frac{D\left(\mu_{k\eta}\|\bar{\nu}_{k\eta}^{r}\right)}{2}\right)^{1/4}\right),\end{align*}
    where
    \begin{align*} C_{\bar{\nu}_{k\eta}^{r}}\,:\!=\,2\inf_{\lambda>0}\left(\frac{1}{\lambda}\left(\frac{3}{2}+\log\int_{\mathbb{R}^{d}}\mathrm{e}^{\lambda\left|x\right|^{2}}\bar{\nu}_{k\eta}^{r}\left(\text{d} x\right)\right)\right)^{1/2}.\end{align*}
    We fix $\lambda=1\wedge (\beta \bar{m}/4)$ and then Lemma 5.2 leads to
    \begin{align*} C_{\bar{\nu}_{k\eta}^{r}}&\le \frac{1}{\lambda^{1/2}}\left(6+4\log\int_{\mathbb{R}^{d}}\mathrm{e}^{\lambda\left|x\right|^{2}}\bar{\nu}_{k\eta}^{r}\left(\text{d} x\right)\right)^{1/2}\\ &\le \frac{1}{\lambda^{1/2}}\left(6+4\left(\lambda\kappa_{0}+\frac{4\lambda(\bar{b}+d/\beta)}{\bar{m}}+1\right)\right)^{1/2}\\ &\le \left(4\kappa_{0}+\frac{16(\bar{b}+d/\beta)}{\bar{m}}+\frac{10}{1\wedge (\beta \bar{m}/4)}\right)^{1/2}.\end{align*}
    Hence, Lemma 5.4 gives the following bound:
    \begin{align*} \mathcal{W}_{2}(\mu_{k\eta},\bar{\nu}_{k\eta}^{r}) &\le C_{1}\max\left.\left\{x^{1/2},x^{1/4}\right\}\right|_{x=\left(C_{0}\frac{\omega_{\nabla U}(r)}{r}\eta+\beta\left(\delta_{r,2}\kappa_{\infty}+\delta_{r,0}\right)\right)k\eta}.\end{align*}
  (2) Let us now give a bound for $\mathcal{W}_{2}(\bar{\nu}_{k\eta}^{r},\bar{\pi}^{r})$ . Proposition 4.5 and Lemma 5.6 yield

    \begin{align*} \mathcal{W}_{2}(\bar{\nu}_{k\eta}^{r},\bar{\pi}^{r})&\le \sqrt{2c_\textrm{P}(\bar{\pi}^{r})\chi^{2}\left(\mu_{0}\|\bar{\pi}^{r}\right)}\exp\left(-\frac{k\eta}{2\beta c_\textrm{P}(\bar{\pi}^{r})}\right)\\ &\le \sqrt{2c_\textrm{P}(\bar{\pi}^{r})C_{2}}\exp\left(-\frac{k\eta}{2\beta c_\textrm{P}(\bar{\pi}^{r})}\right).\end{align*}
  (3) Next, we consider a bound for $\mathcal{W}_{2}(\bar{\pi}^{r},\pi)$ . Proposition 4.6 gives

    \begin{align*} \mathcal{W}_{2}(\bar{\pi}^{r},\pi)\le C_{\bar{\pi}^{r}}\left(D\left(\pi\|\bar{\pi}^{r}\right)^{1/2}+\left(\frac{D\left(\pi\|\bar{\pi}^{r}\right)}{2}\right)^{1/4}\right),\end{align*}
    where $C_{\bar{\pi}^{r}}\,:\!=\,2\inf_{\lambda>0}\left(\frac{1}{\lambda}\left(\frac{3}{2}+\log\int_{\mathbb{R}^{d}}\mathrm{e}^{\lambda\left|x\right|^{2}}\bar{\pi}^{r}\left(\text{d} x\right)\right)\right)^{1/2}$ . We fix $\lambda=1\wedge (\beta \bar{m}/4)$ and then Lemma 5.2 together with Fatou’s lemma leads to
    \begin{align*} C_{\bar{\pi}^{r}} &\le \left(\frac{16(\bar{b}+d/\beta)}{\bar{m}}+\frac{10}{1\wedge (\beta \bar{m}/4)}\right)^{1/2}.\end{align*}
    Lemma 5.7 yields the bound
    \begin{align*} \mathcal{W}_{2}(\bar{\pi}^{r},\pi)\le C_{1}\max\left.\left\{y^{1/2},y^{1/4}\right\}\right|_{y=\beta r\omega_{\nabla U}(r)}.\end{align*}
  (4) By (1) and (3),

    \begin{align*} \mathcal{W}_{2}(\mu_{k\eta},\bar{\nu}_{k\eta}^{r})+\mathcal{W}_{2}(\bar{\pi}^{r},\pi)&\le C_{1}\left(\max\left\{x^{1/2},x^{1/4}\right\}+\max\left\{y^{1/2},y^{1/4}\right\}\right)\\ &\le C_{1}\left(2\left(x+y\right)^{1/2}+2\left(x+y\right)^{1/4}\right),\end{align*}
    where
    \begin{align*} x=\left(C_{0}\frac{\omega_{\nabla U}(r)}{r}\eta+\beta\left(\delta_{r,2}\kappa_{\infty}+\delta_{r,0}\right)\right)k\eta,\ {y=\beta r\omega_{\nabla U}(r)}.\end{align*}
    Hence, we obtain the desired conclusion.
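For concreteness, the iteration analyzed throughout, $Y_{(k+1)\eta}=Y_{k\eta}-\eta G(Y_{k\eta},a_{k\eta})+\sqrt{2\beta^{-1}}(B_{(k+1)\eta}-B_{k\eta})$, can be sketched as follows. The quadratic potential, dimension, step size, and inverse temperature below are arbitrary illustrative choices, and the exact gradient stands in for a stochastic gradient estimate $G$.

```python
import math
import random

random.seed(42)

d, beta, eta = 2, 1.0, 0.01        # illustrative dimension, temperature, step

def grad_U(y):
    # U(x) = |x|^2 / 2, so the Gibbs target is N(0, beta^{-1} I_d);
    # this plays the role of G(y, a) with a noise-free gradient.
    return y

y = [0.0] * d
second_moments = []
for k in range(200_000):
    noise = [random.gauss(0.0, 1.0) for _ in range(d)]
    g = grad_U(y)
    # One Langevin Monte Carlo step: gradient descent plus Gaussian noise.
    y = [y[i] - eta * g[i] + math.sqrt(2.0 * eta / beta) * noise[i]
         for i in range(d)]
    if k >= 50_000:                # discard burn-in
        second_moments.append(sum(yi * yi for yi in y))

est = sum(second_moments) / len(second_moments)
# For the target N(0, I_2), E|x|^2 = d = 2; the estimate carries an
# O(eta) discretization bias, consistent with the non-asymptotic bounds.
assert abs(est - d) < 0.3
```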

Acknowledgements

The author thanks the two anonymous reviewers for their constructive feedback.

Funding information

The author was supported by JSPS KAKENHI Grant No. JP21K20318 and JST CREST Grants Nos. JPMJCR21D2 and JPMJCR2115.

Competing interests

There were no competing interests to declare which arose during the preparation or publication process of this article.

References

Altschuler, J. and Talwar, K. (2023). Resolving the mixing time of the Langevin algorithm to its stationary distribution for log-concave sampling. In The Thirty Sixth Annual Conference on Learning Theory, pp. 2509–2510. PMLR.
Anastassiou, G. A. (2009). Distributional Taylor formula. Nonlinear Anal. Theory Methods Appl. 70(9), 3195–3202.
Anderson, C. R. (2014). Compact polynomial mollifiers for Poisson’s equation. Technical report, Department of Mathematics, UCLA, Los Angeles, California.
Bakry, D., Barthe, F., Cattiaux, P. and Guillin, A. (2008). A simple proof of the Poincaré inequality for a large class of probability measures. Electron. Commun. Probab. 13, 60–66.
Bakry, D., Gentil, I. and Ledoux, M. (2014). Analysis and Geometry of Markov Diffusion Operators. Springer.
Bardet, J.-B., Gozlan, N., Malrieu, F. and Zitt, P.-A. (2018). Functional inequalities for Gaussian convolutions of compactly supported measures: explicit bounds and dimension dependence. Bernoulli 24(1), 333–353.
Bolley, F. and Villani, C. (2005). Weighted Csiszár–Kullback–Pinsker inequalities and applications to transportation inequalities. Annales de la Faculté des Sciences de Toulouse: Mathématiques 14(3), 331–352.
Brosse, N., Durmus, A., Moulines, É. and Sabanis, S. (2019). The tamed unadjusted Langevin algorithm. Stochastic Process. Appl. 129(10), 3638–3663.
Cattiaux, P., Guillin, A. and Wu, L.-M. (2010). A note on Talagrand’s transportation inequality and logarithmic Sobolev inequality. Probab. Theory Related Fields 148, 285–304.
Chatterji, N., Diakonikolas, J., Jordan, M. I. and Bartlett, P. (2020). Langevin Monte Carlo without smoothness. In Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics, pp. 1716–1726.
Chewi, S., Erdogdu, M. A., Li, M., Shen, R. and Zhang, S. (2022). Analysis of Langevin Monte Carlo from Poincaré to log-Sobolev. In Proceedings of Thirty Fifth Conference on Learning Theory, pp. 1–2.
Cui, T., Tong, X. T. and Zahm, O. (2022). Prior normalization for certified likelihood-informed subspace detection of Bayesian inverse problems. Inverse Probl. 38(12), 124002.
Dalalyan, A. S. (2017). Theoretical guarantees for approximate sampling from smooth and log-concave densities. J. R. Stat. Soc. Ser. B (Stat. Methodol.) 79(3), 651–676.
Durmus, A., Majewski, S. and Miasojedow, B. (2019). Analysis of Langevin Monte Carlo via convex optimization. J. Mach. Learn. Res. 20(73), 1–46.
Durmus, A. and Moulines, E. (2017). Nonasymptotic convergence analysis for the unadjusted Langevin algorithm. Ann. Appl. Probab. 27(3), 1551–1587.
Durmus, A. and Moulines, E. (2019). High-dimensional Bayesian inference via the unadjusted Langevin algorithm. Bernoulli 25(4A), 2854–2882.
Erdogdu, M. A. and Hosseinzadeh, R. (2021). On the convergence of Langevin Monte Carlo: the interplay between tail growth and smoothness. In Proceedings of Thirty Fourth Conference on Learning Theory, pp. 1776–1822.
Erdogdu, M. A., Mackey, L. and Shamir, O. (2018). Global non-convex optimization with discretized diffusions. In 32nd Conference on Neural Information Processing Systems.
Fan, J., Yuan, B. and Chen, Y. (2023a). Improved dimension dependence of a proximal algorithm for sampling. In The Thirty Sixth Annual Conference on Learning Theory, pp. 1473–1521. PMLR.
Fan, J., Yuan, B., Liang, J. and Chen, Y. (2023b). Nesterov smoothing for sampling without smoothness. In 2023 62nd IEEE Conference on Decision and Control (CDC), pp. 5313–5318. IEEE.
Lehec, J. (2023). The Langevin Monte Carlo algorithm in the non-smooth log-concave case. Ann. Appl. Probab. 33(6A), 4858–4874.
Liang, J. and Chen, Y. (2023). A proximal algorithm for sampling. Trans. Mach. Learn. Res.
Lieb, E. H. and Loss, M. (2001). Analysis, 2nd edition. American Mathematical Society.
Liptser, R. S. and Shiryaev, A. N. (2001). Statistics of Random Processes: I. General Theory, 2nd edition. Springer.
Liu, Y. (2020). The Poincaré inequality and quadratic transportation-variance inequalities. Electron. J. Probab. 25(1), 1–16.
Menozzi, S., Pesce, A. and Zhang, X. (2021). Density and gradient estimates for non degenerate Brownian SDEs with unbounded measurable drift. J. Differ. Equations 272, 330–369.
Menozzi, S. and Zhang, X. (2022). Heat kernel of supercritical nonlocal operators with unbounded drifts. J. l’École Polytech.–Math. 9, 537–579.
Nesterov, Y. and Spokoiny, V. (2017). Random gradient-free minimization of convex functions. Found. Comput. Math. 17, 527–566.
Nguyen, D. (2022). Unadjusted Langevin algorithm for sampling a mixture of weakly smooth potentials. Braz. J. Probab. Stat. 36(3), 504–539.
Pereyra, M. (2016). Proximal Markov chain Monte Carlo algorithms. Stat. Comput. 26, 745–760.
Raginsky, M., Rakhlin, A. and Telgarsky, M. (2017). Non-convex learning via stochastic gradient Langevin dynamics: a nonasymptotic analysis. In Proceedings of the 2017 Conference on Learning Theory, pp. 1674–1703.
Roy, A., Shen, L., Balasubramanian, K. and Ghadimi, S. (2022). Stochastic zeroth-order discretizations of Langevin diffusions for Bayesian inference. Bernoulli 28(3), 1810–1834.
Vempala, S. and Wibisono, A. (2019). Rapid convergence of the unadjusted Langevin algorithm: isoperimetry suffices. In Advances in Neural Information Processing Systems, 32.
Xu, P., Chen, J., Zou, D. and Gu, Q. (2018). Global convergence of Langevin dynamics based algorithms for nonconvex optimization. In 32nd Conference on Neural Information Processing Systems.
Zhang, Y., Akyildiz, Ö. D., Damoulas, T. and Sabanis, S. (2023). Nonasymptotic estimates for stochastic gradient Langevin dynamics under local conditions in nonconvex optimization. Appl. Math. Optim. 87(2), 25.
Figure 1. The chains of implications.