1. Introduction
We consider the problem of sampling from a Gibbs distribution
$\pi(\text{d} x)\propto \mathrm{e}^{-\beta U(x)}\text{d} x$
on
$(\mathbb{R}^{d},\mathcal{B}(\mathbb{R}^{d}))$
, where
$U \, : \, \mathbb{R}^{d}\to[0,\infty)$
is a non-negative potential function and
$\beta>0$
is the inverse temperature. In particular, we focus on approximate sampling from a distribution
$\pi_{\epsilon}$
satisfying
$d(\pi_{\epsilon},\pi)\le \epsilon$
for given tolerance
$\epsilon\in(0,\infty)$
and distance between probability distributions
$d(\cdot,\cdot)$
. The algorithms for this problem can be classified into two types: algorithms whose invariant distributions exactly match
$\pi$
(e.g., Metropolis–Hastings algorithms, Gibbs samplers, and piecewise deterministic Markov process Monte Carlo methods) and algorithms whose invariant distributions are not equal to
$\pi$
but approximate it with favourable computational complexity. In this study, we examine the algorithms of the latter type and their computational complexities.
One of the extensively used types of algorithms for the approximate sampling problem is the Langevin type, motivated by the Langevin dynamics given by the solution of the following d-dimensional stochastic differential equation (SDE):
\begin{align} \text{d} X_{t}=-\nabla U\left(X_{t}\right)\text{d} t+\sqrt{2\beta^{-1}}\text{d} B_{t},\quad X_{0}=\xi,\end{align}
where
$\{B_{t}\}_{t\ge0}$
is a d-dimensional Brownian motion and
$\xi$
is a d-dimensional random vector with
$|\xi|<\infty$
almost surely. Since the 2-Wasserstein or total variation distance between
$\pi$
and the law of
$X_{t}$
converges to zero under mild conditions, we expect that the laws of Langevin-type algorithms inspired by
$X_{t}$
should approximate
$\pi$
well. However, most of the theoretical guarantees for the algorithms are based on the convexity of U, the twice continuous differentiability of U, or the Lipschitz continuity of the gradient
$\nabla U$
, which do not hold for some models in statistics and machine learning. The main interest of this study is a unified approach to analysing and proposing Langevin-type algorithms under minimal assumptions.
The stochastic gradient Langevin Monte Carlo (SG-LMC) algorithm or stochastic gradient Langevin dynamics with a constant step size
$\eta>0$
is given by the discrete observations
$\{Y_{i\eta}\}_{i=0,\ldots,k}$
of the solution of the following d-dimensional SDE:
\begin{align} \text{d} Y_{t}=-{G}\left(Y_{\lfloor t/\eta\rfloor\eta},a_{\lfloor t/\eta\rfloor\eta}\right)\text{d} t+\sqrt{2\beta^{-1}}\text{d} B_{t},\quad Y_{0}=\xi,\end{align}
where
$\{a_{i\eta}\}_{i=0,\ldots,k}$
is a sequence of independent and identically distributed (i.i.d.) random variables on a measurable space
$(A,\mathcal{A})$
and G is a
$\mathbb{R}^{d}$
-valued measurable function. We assume that
$\{a_{i\eta}\}$
,
$\{B_{t}\}$
, and
$\xi$
are independent. Note that the Langevin Monte Carlo (LMC) algorithm is a special case of SG-LMC; it is represented by the discrete observations
$\{Y_{i\eta}\}_{i=0,\ldots,k}$
of the solution of the following diffusion-type SDE:
\begin{align} \text{d} Y_{t}=-\nabla U\left(Y_{\lfloor t/\eta\rfloor\eta}\right)\text{d} t+\sqrt{2\beta^{-1}}\text{d} B_{t},\quad Y_{0}=\xi,\end{align}
which corresponds to the case
${G}=\nabla U$
.
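For concreteness, the following is a minimal Python sketch of the Euler–Maruyama recursion behind these discretized dynamics, $Y_{(i+1)\eta}=Y_{i\eta}-\eta G(Y_{i\eta},a_{i\eta})+\sqrt{2\eta\beta^{-1}}\varepsilon_{i}$ with i.i.d. standard Gaussian $\varepsilon_{i}$, assuming the SDE forms above; the quadratic potential and all parameter values are illustrative assumptions, not choices made in this paper.

```python
import numpy as np

def sg_lmc(stoch_grad, xi, beta=1.0, eta=1e-2, k=5_000, seed=0):
    """Discrete observations Y_{i eta} of SG-LMC: one Euler-Maruyama step
    per iteration, drawing a fresh stochastic-gradient sample each time."""
    rng = np.random.default_rng(seed)
    y = np.array(xi, dtype=float)
    for _ in range(k):
        noise = rng.standard_normal(y.shape)
        y = y - eta * stoch_grad(y, rng) + np.sqrt(2.0 * eta / beta) * noise
    return y

# LMC is the special case G = grad U; here U(x) = |x|^2 / 2, so grad U(x) = x.
sample = sg_lmc(lambda x, rng: x, xi=np.zeros(2))
```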
To see what difficulties we need to deal with, we review a typical analysis [Reference Raginsky, Rakhlin and Telgarsky31] based on the smoothness of U, that is, the twice continuous differentiability of U and the Lipschitz continuity of
$\nabla U$
. Firstly, the twice continuous differentiability simplifies discussions or plays significant roles in studies of functional inequalities such as Poincaré inequalities and logarithmic Sobolev inequalities (e.g., [Reference Bakry, Barthe, Cattiaux and Guillin4, Reference Cattiaux, Guillin and Wu9]). Since the functional inequalities for
$\pi$
are essential in analysis of Langevin algorithms, the assumption that U is of class
$\mathcal{C}^{2}$
frequently appears in previous studies. Secondly, the Lipschitz continuity combined with weak conditions ensures the representation of the likelihood ratio between
$\{X_{t}\}$
and
$\{Y_{t}\}$
, which is critical when we bound the Kullback–Leibler divergence. Liptser and Shiryaev [Reference Liptser and Shiryaev24] exhibit much weaker conditions than Novikov’s or Kazamaki’s condition for the explicit representation if (1) has a unique strong solution.
$\nabla U$
is sufficient for the existence and the uniqueness of the strong solution of (1), the framework of [Reference Liptser and Shiryaev24] is applicable.
Our approaches to overcome the non-smoothness of U are mollification, a classical approach to dealing with non-smoothness in differential equations (e.g., see [Reference Menozzi, Pesce and Zhang26, Reference Menozzi and Zhang27]), and the abuse of moduli of continuity for possibly discontinuous functions. We consider the convolution
${\bar{U}}_{r} \,:\!=\,U\ast \rho_{r}$
of U, which has a weak gradient, with a sufficiently smooth non-negative function
$\rho_{r}$
with compact support in a ball of centre
$\mathbf{0}$
and radius
$r\in(0,1]$
. We can let
$\bar{U}_{r}$
be of class
$\mathcal{C}^{2}$
and obtain bounds for the constant of Poincaré inequalities for
$\bar{\pi}^{r}(\text{d} x)\propto \exp(-\beta\bar{U}_{r}(x))\text{d} x$
, which suffice to show the convergence of the law of the mollified dynamics
$\{\bar{X}_{t}^{r}\}$
defined by the SDE
\begin{align*} \text{d} \bar{X}_{t}^{r}=-\nabla \bar{U}_{r}\left(\bar{X}_{t}^{r}\right)\text{d} t+\sqrt{2\beta^{-1}}\text{d} B_{t},\quad \bar{X}_{0}^{r}=\xi, \end{align*}
to the corresponding Gibbs distribution
$\bar{\pi}^{r}$
in 2-Wasserstein distance owing to [Reference Bakry, Barthe, Cattiaux and Guillin4, Reference Lehec21, Reference Liu25]. Since the convolution
$\nabla \bar{U}_{r}$
is Lipschitz continuous if the modulus of continuity of a representative
$\nabla U$
is finite (the convergence to zero is unnecessary), a concise representation of the likelihood ratios between the mollified dynamics
$\{\bar{X}_{t}^{r}\}$
and
$\{Y_{t}\}$
is available, and we can evaluate the Kullback–Leibler divergence under weak assumptions.
As our analysis relies on mollification, the bias–variance decomposition of G with respect to
$\nabla \bar{U}_{r}$
rather than
$\nabla U$
is crucial. This decomposition gives us a unified approach to analyse well-known Langevin-type algorithms and propose new algorithms for U without continuous differentiability. Concretely speaking, we show that the sampling error of LMC under the dissipativity of U of class
$\mathcal{C}^{1}$
and uniformly continuous
$\nabla U$
can be arbitrarily small by controlling k,
$\eta$
, and r carefully and letting the bias converge. We also propose new algorithms, called the spherically smoothed Langevin Monte Carlo (SS-LMC) algorithm and the spherically smoothed stochastic gradient Langevin Monte Carlo (SS-SG-LMC) algorithm, whose errors can be arbitrarily small under the dissipativity of U and the boundedness of the modulus of continuity of weak gradients. In addition, we argue zeroth-order versions of these algorithms which are naturally obtained via integration by parts.
1.1. Related works
While the sampling problem from distributions with non-smooth and non-convex potentials has been extensively studied, this study focuses on Langevin-based algorithms. For studies from different viewpoints, see, for example, [Reference Fan, Yuan, Liang and Chen20, Reference Liang and Chen22].
Non-asymptotic analysis of Langevin-based algorithms under convex potentials has been a subject of much attention and intense research [Reference Dalalyan13, Reference Durmus and Moulines15, Reference Durmus and Moulines16], and analysis without convexity has also attracted keen interest [Reference Erdogdu, Mackey and Shamir18, Reference Raginsky, Rakhlin and Telgarsky31, Reference Xu, Chen, Zou and Gu34]. While most previous studies are based on the Lipschitz continuity of the gradients of potentials, several studies extend the settings to those without global Lipschitz continuity. We can classify the settings of potentials in those studies into three types: (1) potentials with convexity but without smoothness [Reference Chatterji, Diakonikolas, Jordan and Bartlett10, Reference Durmus, Majewski and Miasojedow14, Reference Lehec21, Reference Pereyra30]; (2) potentials with Hölder continuous gradients and degenerate convexity at infinity or outside a ball [Reference Chewi, Erdogdu, Li, Shen and Zhang11, Reference Erdogdu and Hosseinzadeh17, Reference Nguyen29]; and (3) potentials with locally Lipschitz gradients [Reference Brosse, Durmus, Moulines and Sabanis8, Reference Zhang, Akyildiz, Damoulas and Sabanis35]. We review (1) and (2), as our study gives error estimates for LMC with uniformly continuous gradients and for Langevin-type algorithms with gradients whose discontinuity is uniformly bounded.
Pereyra [Reference Pereyra30], Chatterji et al. [Reference Chatterji, Diakonikolas, Jordan and Bartlett10], and Lehec [Reference Lehec21] study Langevin-type algorithms under the convexity and non-smoothness of potentials. Pereyra [Reference Pereyra30] presents proximal Langevin-type algorithms for potentials with convexity but without smoothness, which use Moreau approximations and proximity mappings instead of gradients. The algorithms are stable in the sense that they have exponential ergodicity for arbitrary step sizes. Chatterji et al. [Reference Chatterji, Diakonikolas, Jordan and Bartlett10] propose the perturbed Langevin Monte Carlo algorithm for non-smooth potential functions and show its performance to approximate Gibbs distributions. The difference between perturbed LMC and ordinary LMC is the inputs of the gradients; we need to add Gaussian noise not only to the gradients but also to their inputs. The main idea of the algorithm is Gaussian smoothing of potential functions: the expectation of a non-smooth convex potential whose input is perturbed by a Gaussian random vector is smoother than the potential itself. Lehec [Reference Lehec21] investigates the projected LMC for potentials with convexity, global Lipschitz continuity, and discontinuous bounded gradients. The analysis is based on convexity and estimates for local times of diffusion processes with reflecting boundaries. The study also generalizes the result to potentials with local Lipschitz continuity by considering a ball as the support of the algorithm and letting the radius diverge.
Erdogdu and Hosseinzadeh [Reference Erdogdu and Hosseinzadeh17], Chewi et al. [Reference Chewi, Erdogdu, Li, Shen and Zhang11], and Nguyen [Reference Nguyen29] estimate the error of LMC under non-convex potentials with degenerate convexity, weak smoothness, and weak dissipativity. Erdogdu and Hosseinzadeh [Reference Erdogdu and Hosseinzadeh17] show convergence guarantees of LMC under the degenerate convexity at infinity and weak dissipativity of potentials with Hölder continuous gradients, which are the assumptions for modified logarithmic Sobolev inequalities. Nguyen [Reference Nguyen29] relaxes the condition of [Reference Erdogdu and Hosseinzadeh17] by considering the degenerate convexity outside a large ball and the mixture weak smoothness of potential functions. Chewi et al. [Reference Chewi, Erdogdu, Li, Shen and Zhang11] analyse the convergence with respect to the Rényi divergence under either Latała–Oleszkiewicz inequalities or modified logarithmic Sobolev inequalities.
Note that our proof of the results uses approaches similar to the smoothing of [Reference Chatterji, Diakonikolas, Jordan and Bartlett10] and the control of the radius of [Reference Lehec21], while our motivations and settings are close to those of the studies under non-convexity.
1.2. Contributions
Theorem 2.1, the main theoretical result of this paper, gives an upper bound for the 2-Wasserstein distance between the law of general SG-LMC given by Eq. (2) and the target distribution
$\pi$
under weak conditions. We assume the weak differentiability of U combined with the boundedness of the modulus of continuity of a weak gradient
$\nabla U$
rather than the twice continuous differentiability of U and the Lipschitz continuity of
$\nabla U$
. The generality of the assumptions results in a concise and general framework for analysis of Langevin-type algorithms. We demonstrate the strength of this framework through analysis of LMC under weak smoothness and proposal for new Langevin-type algorithms without the continuous differentiability or convexity of U.
Our contribution to the analysis of LMC is to show a theoretical guarantee of LMC under non-convexity and weak smoothness in a direction different from those of previous studies. The main difference between our assumptions and those of the other studies under non-convex potentials is whether to assume (a) the strong dissipativity of the potentials and the uniform continuity of the gradients or (b) the degenerate convexity of the potentials and the Hölder continuity of the gradients. Our analysis needs neither degenerate convexity nor Hölder continuity, while we need stronger dissipativity than assumed in previous studies. Since assumptions (a) and (b) do not imply each other, our main contribution in the analysis of LMC is not to strengthen the previous studies but to broaden the theoretical guarantees of LMC under weak smoothness in a different direction.
Moreover, deriving polynomial sampling complexities with first-order or zeroth-order oracles for potentials without convexity, continuous differentiability, or bounded gradients is also a significant contribution. By proposing Langevin-type algorithms and analysing them, we show that the complexity of sampling from some posterior distributions whose potentials are dissipative and weakly differentiable but neither convex nor continuously differentiable (e.g., some losses with elastic net regularization in nonlinear regression and robust regression) with a first-order oracle for the potentials is of polynomial order in the dimension d, the tolerance
$\epsilon$
, and the Poincaré constant. Furthermore, we develop zeroth-order algorithms inspired by the recent study [Reference Roy, Shen, Balasubramanian and Ghadimi32] on black-box sampling and show that the sampling complexity with a zeroth-order oracle for potentials without convexity or smoothness is also of polynomial order.
1.3. Outline
This paper is organized as follows. Section 2 displays the main theorem and its concise representation. In Section 3, we apply the result to analysis of LMC and our proposal for Langevin-type algorithms. Section 4 is devoted to preliminary results. Finally, we present the proof of the main theorem in Section 5.
1.4. Notation
Let
$|\cdot|$
denote the Euclidean norm of
$\mathbb{R}^{\ell}$
for all
$\ell\in\mathbb{N}$
.
$\langle \cdot,\cdot\rangle $
is the Euclidean inner product of
$\mathbb{R}^{\ell}$
.
$\|\cdot\|_{2}$
is the spectral norm of matrices, which equals the largest singular value. For an arbitrary matrix A,
$A^{\top}$
denotes the transpose of A. For all
$x\in\mathbb{R}^{\ell}$
and
$R>0$
, let
$B_{R}(x)$
and
$\bar{B}_{R}(x)$
respectively be an open ball and a closed ball of centre x and radius R with respect to the Euclidean metric. We write
$\|f\|_{\infty}\,:\!=\,\sup_{x\in\mathbb{R}^{d}}|f(x)|$
for arbitrary
$f:\mathbb{R}^{d}\to\mathbb{R}^{\ell}$
and
$d,\ell\in\mathbb{N}$
.
For two arbitrary probability measures
$\mu$
and
$\nu$
on
$(\mathbb{R}^{d},\mathcal{B}(\mathbb{R}^{d}))$
and
$p\ge1$
, we define the p-Wasserstein distance between
$\mu$
and
$\nu$
such that
\begin{align*} \mathcal{W}_{p}\left(\mu,\nu\right)\,:\!=\,\left(\inf_{\gamma\in\Pi(\mu,\nu)}\int_{\mathbb{R}^{d}\times\mathbb{R}^{d}}\left|x-y\right|^{p}\gamma\left(\text{d} x,\text{d} y\right)\right)^{1/p},\end{align*}
where
$\Pi(\mu,\nu)$
is the set of all couplings for
$\mu$
and
$\nu$
. We also define
$D\left(\mu\|\nu\right)$
and
$\chi^{2}\left(\mu\|\nu\right)$
, the Kullback–Leibler divergence and the
$\chi^{2}$
-divergence of
$\mu$
from
$\nu$
with
$\mu \ll \nu$
respectively such that
\begin{align*} D\left(\mu\|\nu\right)\,:\!=\,\int_{\mathbb{R}^{d}}\log\left(\frac{\text{d}\mu}{\text{d}\nu}\right)\text{d}\mu,\qquad \chi^{2}\left(\mu\|\nu\right)\,:\!=\,\int_{\mathbb{R}^{d}}\left(\frac{\text{d}\mu}{\text{d}\nu}\right)^{2}\text{d}\nu-1.\end{align*}
2. Main results
This section gives the main theorem for non-asymptotic estimates of the error of general SG-LMC algorithms in 2-Wasserstein distance, which is one of the standard criteria for evaluation of sampling algorithms (e.g., [Reference Durmus, Majewski and Miasojedow14, Reference Durmus and Moulines16, Reference Roy, Shen, Balasubramanian and Ghadimi32]).
2.1. Estimate of the errors of general SG-LMC
We consider a compact polynomial mollifier [Reference Anderson3]
$\rho:\mathbb{R}^{d}\to[0,\infty)$
of class
$\mathcal{C}^{1}$
as follows:
\begin{align} \rho(x)=\begin{cases} \left(\frac{\pi^{d/2}\text{B}(d/2,3)}{\Gamma(d/2)}\right)^{-1}\left(1-|x|^{2}\right)^{2}&\text{if }|x|\le 1,\\[5pt] 0 & \text{otherwise,} \end{cases}\end{align}
where
$\text{B}(\cdot,\cdot)$
is the beta function and
$\Gamma(\! \cdot \!)$
is the gamma function. Note that
$\nabla \rho$
has an explicit
$L^{1}$
-bound, which is the reason to adopt
$\rho$
as the mollifier in our analysis; we give more detailed discussions on
$\rho$
in Section 4.2. Let
$\rho_{r}(x)=r^{-d}\rho(x/r)$
with
$r>0$
.
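For reference, a small Python sketch evaluating $\rho$ and $\rho_{r}$ exactly as in Eq. (4); using scipy for the beta and gamma functions is an implementation choice.

```python
import numpy as np
from scipy.special import beta as beta_fn, gamma as gamma_fn

def rho(x):
    """Compact polynomial mollifier of Eq. (4)."""
    x = np.asarray(x, dtype=float)
    d = x.size
    const = gamma_fn(d / 2) / (np.pi ** (d / 2) * beta_fn(d / 2, 3))
    s = float(x @ x)
    return const * (1.0 - s) ** 2 if s <= 1.0 else 0.0

def rho_r(x, r):
    """Rescaled mollifier rho_r(x) = r^{-d} * rho(x / r)."""
    x = np.asarray(x, dtype=float)
    return rho(x / r) / r ** x.size
```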
We define
$\widetilde{G}(x)$
such that for each
$x\in\mathbb{R}^{d}$
,
\begin{align*} \widetilde{G}\left(x\right)\,:\!=\,E\left[{G}\left(x,a_{0}\right)\right],\end{align*}
whose measurability is given by Tonelli’s theorem.
We set the following assumptions on U and G.
-
(A1)
$U\in W_{\text{loc}}^{1,\infty}(\mathbb{R}^{d})$
. -
(A2) For each
$a\in A$
and
$x\in\mathbb{R}^{d}$
,
$|G(x,a)|<\infty$
.
Under (A1), we fix a representative
$\nabla U$
and consider the assumptions on
$\nabla U$
and
$\widetilde{G}$
.
-
(A3)
$|\nabla U(\mathbf{0})|<\infty$
and
$|\widetilde{G}(\mathbf{0})|<\infty$
, and the moduli of continuity of
$\nabla U$
and
$\widetilde{G}$
are bounded, that is, for some $r\in(0,1]$,
\begin{align*} \omega_{\nabla U}(r)&\,:\!=\,\sup\nolimits_{x,y\in\mathbb{R}^{d}:|x-y|\le r}\left|\nabla U(x)-\nabla U(y)\right|<\infty,\\ \omega_{\widetilde{G}}(r)&\,:\!=\,\sup\nolimits_{x,y\in\mathbb{R}^{d}:|x-y|\le r}\left|\widetilde{G}(x)-\widetilde{G}(y)\right|<\infty. \end{align*}
-
(A4) For some
$m,\tilde{m},b,\tilde{b}>0$
, for all
$x\in\mathbb{R}^{d}$
,
\begin{align*} \left\langle x,\nabla U\left(x\right)\right\rangle \ge m\left|x\right|^{2}-b,\ \left\langle x,\widetilde{G}\left(x\right)\right\rangle \ge \tilde{m}\left|x\right|^{2}-\tilde{b}. \end{align*}
Remark 2.1. The boundedness of the moduli of continuity in Assumption (A3) is equivalent to the boundedness for all
$r>0$
or for some
$r>0$
; see Lemma 4.3. Note that we allow
$\lim_{r\downarrow0}\omega_{\nabla U}(r)\neq0$
and
$\lim_{r\downarrow0}\omega_{\widetilde{G}}(r)\neq0$
.
Under (A1) and (A3), we can define the mollification
\begin{align*} \bar{U}_{r}\,:\!=\,U\ast\rho_{r},\qquad \nabla \bar{U}_{r}=\nabla\left(U\ast \rho_{r}\right)=\nabla U\ast\rho_{r},\end{align*}
where the last equality holds since (A1) gives the essential boundedness of U and
$\nabla U$
on any compact sets and we can approximate
$\rho_{r}\in C_{0}^{1}(\mathbb{R}^{d})\cap W^{1,1}(\mathbb{R}^{d})$
with some
$\{\varphi_{n}\}\subset\mathcal{C}_{0}^{\infty}(\mathbb{R}^{d})$
. Note that
$\bar{U}_{r}\in \mathcal{C}^{2}(\mathbb{R}^{d})$
with
$\nabla^{2}\bar{U}_{r}=\nabla U\ast \nabla\rho_{r}$
by this discussion (see Lemma 4.7). We assume quadratic growth of the bias of G with respect to
$\nabla \bar{U}_{r}$
and the variance as well.
-
(A5) For some
$\bar{\delta}>0$
and
$\delta_{r}\,:\!=\,(\delta_{\mathbf{b},r,0},\delta_{\mathbf{b},r,2},\delta_{\mathbf{v},0},\delta_{\mathbf{v},2})\in[0,\bar{\delta}]^{4}$
, for almost all
$x\in\mathbb{R}^{d}$
,
\begin{align*} \left|\widetilde{G}(x)-\nabla \bar{U}_{r}(x)\right|^{2}&\le 2\delta_{\mathbf{b},r,2}\left|x\right|^{2}+2\delta_{\mathbf{b},r,0},\\ E\left[\left|{G}\left(x,a_{0}\right)-\widetilde{G}(x)\right|^{2}\right]&\le 2\delta_{\mathbf{v},2}\left|x\right|^{2}+2\delta_{\mathbf{v},0}. \end{align*}
For brevity, we use the notation
$\delta_{r,i}=\delta_{\mathbf{b},r,i}+\delta_{\mathbf{v},i}$
for both
$i=0,2$
. We also give the condition on the initial value
$\xi$
.
-
(A6) The initial value
$\xi$
has the law
$\mu_{0}(\text{d} x)=\left(\int_{\mathbb{R}^{d}} \exp(-\Psi(x))\text{d} x\right)^{-1}\exp(\! - \! \Psi(x))\text{d} x$
with
$\Psi:\mathbb{R}^{d}\to[0,\infty)$
and
$\psi_{0},\psi_{2}>0$
such that
$(2\vee \beta(|\nabla U(\mathbf{0})|+\omega_{\nabla U}(1)))|x|^{2}-\psi_{0}\le \Psi(x)\le \psi_{2}|x|^{2}+\psi_{0}$
for all
$x\in\mathbb{R}^{d}$
.
Assumption (A6) yields the finiteness of the second moment $\kappa_{0}\,:\!=\,E[|\xi|^{2}]$ of the initial value.
Let
$\mu_{t}$
,
$t\ge0$
, denote the probability measure of
$Y_{t}$
. The following theorem gives an upper bound for the 2-Wasserstein distance between
$\mu_{k\eta}$
and
$\pi$
; its proof is given in Section 5.
Theorem 2.1 (Error estimate of general SG-LMC.) Assume (A1)–(A6) and
$\eta\in(0,1\wedge(\tilde{m}/2((\omega_{\widetilde{G}}(1))^{2}+\delta_{\mathbf{v},2}))]$
. Then for any
$r\in(0,1]$
and
$k\in\mathbb{N}$
,
where f is the function defined as
$C_{0},C_{1},C_{2},\kappa_{\infty}>0$
are the positive constants defined as
\begin{align*} C_{0}&\,:\!=\,\left(d+2\right)\left(\frac{\beta}{3}\left(\left\|\widetilde{G}\right\|_{\mathbb{M}}^{2}+\delta_{\mathbf{v},0}+\left((\omega_{\widetilde{G}}(1))^{2}+\delta_{\mathbf{v},2}\right)\kappa_{\infty}\right)+\frac{d}{2}\right),\\ C_{1}&\,:\!=\,2\sqrt{4\kappa_{0}+\frac{16(\bar{b}+d/\beta)}{m}+\frac{10}{1\wedge (\beta m/4)}},\\ C_{2}&\,:\!=\,3^{\beta b/2}\left(\frac{3\psi_{2}}{m\beta}\right)^{d/2}\exp\left(\beta\left(2\left\|\nabla U\right\|_{\mathbb{M}}+U_{0}\right)+2\psi_{0}\right),\\ \kappa_{\infty}&\,:\!=\,\kappa_{0}+2\left(1\vee \frac{1}{\tilde{m}}\right)\left(\tilde{b}+\left\|\widetilde{G}\right\|_{\mathbb{M}}^{2}+\delta_{\mathbf{v},0}+\frac{d}{\beta}\right), \end{align*}
$\bar{b}\,:\!=\,b+\omega_{\nabla U}(1)$
,
$U_{0}\,:\!=\,\left\|U\right\|_{L^{\infty}(B_{1}(\mathbf{0}))}$
,
$\|\nabla U\|_{\mathbb{M}},\|\widetilde{G}\|_{\mathbb{M}}>0$
are the positive constants defined as
\begin{align*} \left\|\nabla U\right\|_{\mathbb{M}}\,:\!=\,\left|\nabla U\left(\mathbf{0}\right)\right|+\omega_{\nabla U}\left(1\right),\qquad \left\|\widetilde{G}\right\|_{\mathbb{M}}\,:\!=\,\left|\widetilde{G}\left(\mathbf{0}\right)\right|+\omega_{\widetilde{G}}\left(1\right),\end{align*}
$c_\textrm{P}(\bar{\pi}^{r})>0$
is the constant of a Poincaré inequality for
$\bar{\pi}^{r}(\text{d} x)\propto \exp(\! - \! \beta\bar{U}_{r}(x))\text{d} x$
such that
\begin{align*} c_\textrm{P}(\bar{\pi}^{r}) &\le \frac{2}{m\beta\left(d+\bar{b}\beta\right)}+\frac{4a\left(d+\bar{b}\beta\right)}{m\beta}\exp\left(\beta\left(\frac{3}{2}\left\|\nabla U\right\|_{\mathbb{M}}\left(1+\frac{4\left(d+\bar{b}\beta\right)}{m\beta}\right)+U_{0}\right)\right), \end{align*}
and
$a>0$
is a positive absolute constant.
2.2. Concise representation of Theorem 2.1
Since the constants and the upper bounds for some of them in Theorem 2.1 depend on various parameters, we give a concise representation of the result for the error analyses below. Assume that
$f(\delta_{r},r,k,\eta)\le 1$
and
$\eta\in(0,1\wedge (\tilde{m}/2((\omega_{\widetilde{G}}(1))^{2}+\delta_{\mathbf{v},2}))]$
and note that Lemma 4.10 and the perturbation theory [Reference Bakry, Gentil and Ledoux5] yield that
$\exp(\! -2 \! \omega_{\nabla U}(1))c_\textrm{P}(\bar{\pi}^{r})\le c_{\text{P}}(\pi)\le \exp(2\omega_{\nabla U}(1))c_\textrm{P}(\bar{\pi}^{r})$
for any
$r\in(0,1]$
, where
$c_{\text{P}}(\pi)$
is the Poincaré constant of
$\pi$
. We then obtain that for some
$C\ge 1$
independent of
$\delta_{r},r,k,\eta,d,c_{\text{P}}(\pi)$
,
\begin{align*} \mathcal{W}_{2}\left(\mu_{k\eta},\pi\right)&\le C\sqrt{d}\sqrt[4]{\left(d^{2}\frac{\omega_{\nabla U}(r)}{r}\eta+d\delta_{r,2}+\delta_{r,0}\right)k\eta+r\omega_{\nabla U}(r)}\\ &\quad+\mathrm{e}^{Cd}\exp\left(-\frac{k\eta}{Cc_{\text{P}}(\pi)}\right).\end{align*}
Remark 2.2. While
$c_{\text{P}}(\pi)=\mathcal{O}(\exp(\mathcal{O}(d)))$
in general, there are some known structures to relax the dependence on dimension.
-
(i) If U is
$\lambda$
-strongly convex with
$\lambda>0$
, then
$c_{\text{P}}(\pi)\le 1/(\beta\lambda)$
. -
(ii) (Perturbation theory; see [Reference Bakry, Gentil and Ledoux5]) If
$U=F+V$
with essentially bounded F and
$\lambda$
-strongly convex V with
$\lambda>0$
, then
$c_{\text{P}}(\pi)\le \exp(2\beta \|F\|_{L^{\infty}})/(\beta\lambda)=\mathcal{O}(1)$
. -
(iii) (Miclo’s trick; see [Reference Bardet, Gozlan, Malrieu and Zitt6]) If
$U=U_{l}+U_{c}$
with M-Lipschitz continuous
$U_{l}$
with
$M>0$
and
$\lambda$
-strongly convex
$U_{c}\in\mathcal{C}^{2}$
with
$\lambda>0$
, then
$c_{\text{P}}(\pi)\le (4/(\beta\lambda))\exp(4\beta M^{2}\sqrt{2d}/(\lambda\sqrt{\pi}))=\mathcal{O}(\exp(\mathcal{O}(\sqrt{d})))$
.
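The three bounds in Remark 2.2 are straightforward to evaluate; the following minimal Python sketch computes them from the displayed formulas, with all inputs treated as illustrative.

```python
import numpy as np

def poincare_bounds(beta, lam, F_inf=None, M=None, d=None):
    """Upper bounds on c_P(pi) from Remark 2.2: (i) strong convexity,
    (ii) bounded perturbation of a strongly convex potential, and
    (iii) Miclo's trick for a Lipschitz + strongly convex decomposition."""
    bounds = {"strongly_convex": 1.0 / (beta * lam)}                        # (i)
    if F_inf is not None:
        bounds["perturbation"] = np.exp(2.0 * beta * F_inf) / (beta * lam)  # (ii)
    if M is not None and d is not None:
        bounds["miclo"] = (4.0 / (beta * lam)) * np.exp(
            4.0 * beta * M ** 2 * np.sqrt(2.0 * d) / (lam * np.sqrt(np.pi)))  # (iii)
    return bounds

print(poincare_bounds(beta=1.0, lam=1.0, F_inf=0.5, M=1.0, d=10))
```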
2.3. Discussion on Theorem 2.1
We note a potential direction to improve the bound. Some recent studies (e.g., [Reference Altschuler and Talwar1, Reference Vempala and Wibisono33]) directly evaluate the divergence of algorithms from the target distribution or their invariant distributions and successfully yield tight bounds. Hence, extension of their analysis to our problem can improve our results; for instance, if we obtain a tight bound for the Kullback–Leibler divergence of SG-LMC from
$\pi$
, and
$\pi$
satisfies a logarithmic Sobolev inequality, then we can combine them with Talagrand’s transportation-entropy inequality and yield a tighter bound for the 2-Wasserstein distance. To extend the method of [Reference Vempala and Wibisono33], which is suitable for our non-convex set-up, we need to rigorously validate the exchange of limits and integration by parts. We leave it as an open problem as its reasonable sufficient conditions are not very obvious (in particular, we need to examine the growth of
$E[G(Y_{\lfloor t/\eta\rfloor\eta},a_{\lfloor t/\eta\rfloor\eta})|Y_{t}]$
and the density of the law of
$Y_{t}$
in
$|Y_{t}|$
or sufficient conditions on
$a_{i\eta}$
to control their growth).
Our proof strategy, taken mainly from [Reference Raginsky, Rakhlin and Telgarsky31], successfully avoids the technical difficulty appearing in the direct comparison. We obtain our bound by the triangle inequality for the 2-Wasserstein distance, separately evaluating the distances between (i) the laws of SG-LMC and a Langevin process, (ii) the law of the Langevin process and its invariant measure, and (iii) the invariant measure and the target measure. A dependence on
$f^{1/4}$
is inevitable as long as we compare the SG-LMC and the Langevin process via Kullback–Leibler divergence and use the weighted Csiszár–Kullback–Pinsker inequality for the 2-Wasserstein distance [Reference Bolley and Villani7], which are tight for general purposes. This dependence would be largely relaxed if we can give a direct comparison of the algorithm with the target or invariant measure as in [Reference Vempala and Wibisono33] or [Reference Altschuler and Talwar1].
3. Sampling complexities of Langevin-type algorithms
We analyse LMC and the SS-LMC and SS-SG-LMC algorithms to show their sampling complexities for achieving
$\mathcal{W}_{2}(\mu_{k\eta},\pi)\le \epsilon$
with arbitrary
$\epsilon>0$
. We also discuss zeroth-order versions of SS-LMC and SS-SG-LMC.
3.1. Analysis of the LMC algorithm for U of class
$\mathcal{C}^{1}$
with uniformly continuous gradient
We examine the LMC algorithm for U with uniformly continuous gradient, that is,
$\omega_{\nabla U}(r)\to 0$
as
$r\to0$
. Under the LMC algorithm, we use
$G=\nabla U$
and thus
$\widetilde{G}=\nabla U$
. Therefore, the bias–variance decomposition in (A5) is given as
$\delta_{\mathbf{b},r,0}=(\omega_{\nabla U}(r))^{2}/2$
,
$\delta_{\mathbf{b},r,2}=\delta_{\mathbf{v},0}=\delta_{\mathbf{v},2}=0$
by Lemma 4.9 below.
In this section we make the following assumptions.
-
(B1)
$U\in\mathcal{C}^{1}(\mathbb{R}^{d})$
. -
(B2)
$\nabla U$
is uniformly continuous, that is, the modulus of continuity
$\omega_{\nabla U}$
defined as
\begin{align*} \omega_{\nabla U}\left(r\right)\,:\!=\,\sup_{x,y\in\mathbb{R}^{d}:|x-y|\le r}\left|\nabla U(x)-\nabla U(y)\right|<\infty \end{align*}
with $r\ge0$ is continuous at zero.
-
(B3) There exist
$m,b>0$
such that for all
$x\in\mathbb{R}^{d}$
,
\begin{align*} \left\langle x,\nabla U(x)\right\rangle \ge m\left|x\right|^{2}-b. \end{align*}
(A1)–(A4) hold immediately by (B1)–(B3); therefore, we obtain the following corollary.
Corollary 3.1. (Error estimate of LMC.) Under (B1)–(B3) and (A6), there exists a constant
$C\ge1$
independent of
$r,\alpha,k,\eta,d,c_{\text{P}}(\pi)$
such that for all
$k\in\mathbb{N}$
,
$\eta\in(0,1\wedge (m/2(\omega_{\nabla U}(1))^{2})]$
, and
$r\in(0,1]$
with
$\left(d^{2}(\omega_{\nabla U}(r)/r)\eta+(\omega_{\nabla U}(r))^{2}\right)k\eta+r\omega_{\nabla U}(r)\le 1$
,
\begin{align*} \mathcal{W}_{2}\left(\mu_{k\eta},\pi\right)&\le C\sqrt{d}\sqrt[4]{\left(d^{2}\frac{\omega_{\nabla U}(r)}{r}\eta+\left(\omega_{\nabla U}\left(r\right)\right)^{2}\right)k\eta+r\omega_{\nabla U}(r)}\\ &\quad+\mathrm{e}^{Cd}\exp\left(-\frac{k\eta}{Cc_{\text{P}}(\pi)}\right).\end{align*}
3.1.1. The sampling complexity of the LMC algorithm.
We present propositions regarding the sampling complexity to achieve the approximation
$\mathcal{W}_{2}(\mu_{k\eta},\pi)\le \epsilon$
for arbitrary
$\epsilon>0$
. Define generalized inverses of
$\omega_{\nabla U}$
as follows: for any
$s>0$
,
\begin{align*} \omega_{\nabla U}^{\dagger}\left(s\right)\,:\!=\,\sup\left\{r>0 \, : \, \omega_{\nabla U}\left(r\right)\le s\right\}.\end{align*}
The continuity of $\omega_{\nabla U}$ (implied by (B2) and the subadditivity of moduli of continuity) along with its monotonicity gives
$\omega_{\nabla U}(\omega_{\nabla U}^{\dagger}(s))= s$
. We also define a generalized inverse of
$r\omega_{\nabla U}(r)$
such that for all
$s>0$
,
\begin{align*} \iota\left(s\right)\,:\!=\,\sup\left\{r>0 \, : \, r\omega_{\nabla U}\left(r\right)\le s\right\}.\end{align*}
The following proposition yields the sampling complexity using this generalized inverse.
Proposition 3.1. Assume that (B1)–(B3) and (A6) hold and fix
$\epsilon\in(0,1]$
. We set
$\bar{r}_{1},\bar{r}_{2}>0$
such that
\begin{align*} \bar{r}_{1}\,:\!=\,\omega_{\nabla U}^{\dagger}\left(\sqrt{\frac{\epsilon^{4}}{48C^{4}d^{2}\left(Cc_{\text{P}}(\pi)\left(\log\left(2/\epsilon\right)+Cd\right)+1\right)}}\right),\quad \bar{r}_{2}\,:\!=\,\iota\left(\frac{\epsilon^{4}}{48C^{4}d^{2}}\right). \end{align*}
If
$r= \bar{r}_{1}\wedge \bar{r}_{2}$
and $\eta$ satisfies
\begin{align*} \eta\le 1\wedge \frac{m}{2\left(\omega_{\nabla U}(1)\right)^{2}}\wedge \frac{r\epsilon^{4}}{48C^{4}d^{4}\omega_{\nabla U}(r)\left(Cc_{\text{P}}(\pi)\left(\log\left(2/\epsilon\right)+Cd\right)+1\right)},\end{align*}
then
$\mathcal{W}_{2}\left(\mu_{k\eta},\pi\right)\le \epsilon$
with
$k=\lceil Cc_{\text{P}}(\pi)\left(\log(2/\epsilon)+Cd\right)/\eta\rceil$
.
Proof. We need only confirm the conditions of Corollary 3.1. The bound $r\omega_{\nabla U}\left(r\right)\le \epsilon^{4}/48C^{4}d^{2}$ is immediate. Since $\eta\le 1$, we have $k\eta\le Cc_{\text{P}}(\pi)\left(\log(2/\epsilon)+Cd\right)+1$, and the other bounds also hold.
We can apply Proposition 3.1 to analysis of the sampling complexity of LMC with
$\alpha$
-mixture weakly smooth gradients [Reference Chatterji, Diakonikolas, Jordan and Bartlett10, Reference Nguyen29]. Assume that there exist
$M>0$
and
$\alpha\in(0,1]$
such that for all
$x,y\in\mathbb{R}^{d}$
,
\begin{align*} \left|\nabla U(x)-\nabla U(y)\right|\le M\left(\left|x-y\right|^{\alpha}\vee\left|x-y\right|\right),\end{align*}
which is a weaker assumption than both
$\alpha$
-Hölder continuity and Lipschitz continuity. This allows the gradient
$\nabla U(x)$
to be at most of linear growth, while
$\alpha$
-Hölder continuity with
$\alpha\in(0,1)$
lets the gradient be at most of sublinear growth. Since
$\omega_{\nabla U}(r)\le M(r^{\alpha}\vee r)$
for all
$r\ge0$
, we have
$\omega_{\nabla U}^{\dagger}(s)\ge (s/M)^{1/\alpha}$
for
$s\in(0,1/M]$
. Rough estimates of
$r/\omega_{\nabla U}(r)$
by the inequalities
$r/\omega_{\nabla U}(r)\ge r^{1-\alpha}/M$
,
$\omega_{\nabla U}^{\dagger}(s)/\omega_{\nabla U}(\omega_{\nabla U}^{\dagger}(s))= \omega_{\nabla U}^{\dagger}(s)/s\ge s^{1/\alpha-1}/M^{1/\alpha}$
,
$\iota(s)\ge(s/M)^{1/(1+\alpha)}$
, and
$\iota(s)/\omega_{\nabla U}(\iota(s))\ge \iota(s)^{1-\alpha}/M\ge s^{(1-\alpha)/(1+\alpha)}/M^{2/(1+\alpha)}$
for sufficiently small
$r,s>0$
yield the sampling complexity
\begin{align*} k=\mathcal{O}\left(\frac{d^{4}c_{\text{P}}(\pi)^{2}\left(\log\epsilon^{-1}+d\right)^{2}}{\epsilon^{4}}\left(\left(\frac{d^{2}c_{\text{P}}(\pi)\left(\log\epsilon^{-1}+d\right)}{\epsilon^{4}}\right)^{\frac{1-\alpha}{2\alpha}}\vee \left(\frac{d^{2}}{\epsilon^{4}}\right)^{\frac{1-\alpha}{1+\alpha}}\right)\right).\end{align*}
3.2. The spherically smoothed Langevin Monte Carlo algorithm
We consider a stochastic gradient G unbiased for
$\nabla \bar{U}_{r}$
with fixed
$r\in(0,1]$
such that the sampling error can be sufficiently small.
Note that
$\rho$
is the density of a random vector which we can generate as the product of a random variable following the uniform distribution on
$S^{d-1}=\{x\in\mathbb{R}^{d} \, : \, |x|=1\}$
and the square root of a random variable following the beta distribution
$\textrm{Beta}(d/2,3)$
independently. Therefore, we can consider spherical smoothing with the random variables whose density is
$\rho_{r}$
as an analogue to Gaussian smoothing of [Reference Chatterji, Diakonikolas, Jordan and Bartlett10].
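A minimal sketch of this sampler, assuming numpy's random generators: a uniform direction on $S^{d-1}$ is obtained by normalizing a standard Gaussian vector, and the radius is the square root of an independent $\textrm{Beta}(d/2,3)$ variable.

```python
import numpy as np

def sample_rho(d, n, rng):
    """Draw n vectors with density rho: a uniform direction on S^{d-1}
    times the square root of an independent Beta(d/2, 3) radius."""
    g = rng.standard_normal((n, d))
    directions = g / np.linalg.norm(g, axis=1, keepdims=True)
    radii = np.sqrt(rng.beta(d / 2, 3.0, size=(n, 1)))
    return radii * directions

zetas = sample_rho(d=10, n=4, rng=np.random.default_rng(0))
```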
Set the stochastic gradient
\begin{align*} {G}\left(x,a_{i\eta}\right)=\frac{1}{N_{B}}\sum_{j=1}^{N_{B}}\nabla U\left(x+r'\zeta_{i,j}\right),\end{align*}
where
$N_{B}\in\mathbb{N}$
,
$r'\in(0,1]$
,
$a_{i\eta}=[\zeta_{i,1},\ldots,\zeta_{i,N_{B}}]$
, and
$\{\zeta_{i,j}\}$
is a sequence of i.i.d. random variables with the density
$\rho$
. Then for any
$x\in\mathbb{R}^{d}$
,
$E[{G}(x,a_{i\eta})]=\nabla \bar{U}_{r'}(x)$
,
\begin{align*} &E\left[\left|{G}\left(x,a_{i\eta}\right)-\nabla \bar{U}_{r'}(x)\right|^{2}\right]\\ &\quad=\frac{1}{N_{B}}\int _{\mathbb{R}^{d}}\left|\nabla U(x-y)-\nabla \bar{U}_{r'}(x)\right|^{2}\rho_{r'}(y)\text{d} y\\ &\quad\le \frac{1}{N_{B}}\int_{\mathbb{R}^{d}}\int_{\mathbb{R}^{d}} \left|\nabla U(x-y)-\nabla U(x-z)\right|^{2}\rho_{r'}(y)\rho_{r'}(z)\text{d} y\text{d} z\\ &\quad\le \frac{1}{N_{B}}\int_{\mathbb{R}^{d}}\int_{\mathbb{R}^{d}} \left(\left|\nabla U(x-y)-\nabla U(x)\right|+\left|\nabla U(x-z)-\nabla U(x)\right|\right)^{2}\rho_{r'}(y)\rho_{r'}(z)\text{d} y\text{d} z\\ &\quad\le \frac{\left(2\omega_{\nabla U}(r')\right)^{2}}{N_{B}}\end{align*}
by Jensen’s inequality, and (A5) holds if
$\nabla \bar{U}_{r'}(x)$
is well defined and $\omega_{\nabla U}(r')$ is finite.
The main idea is to let
$r'=r$
, where r is the radius of the implicit mollification and $r'$ is the radius of the support of the random noise which we control. Hence, the stochastic gradient G with
$r'=r$
is an unbiased estimator of the mollified gradient
$\nabla \bar{U}_{r}(x)$
. We call the algorithm with this G the spherically smoothed Langevin Monte Carlo algorithm. We can control the sampling error of SS-LMC to be close to zero by taking a sufficiently small r.
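Combining the sampler above with this stochastic gradient, a minimal sketch of SS-LMC reads as follows; it assumes the discretized dynamics of Eq. (2), and all parameter values are illustrative.

```python
import numpy as np

def sample_rho(d, n, rng):
    # As in the sketch above: uniform direction times sqrt(Beta(d/2, 3)).
    g = rng.standard_normal((n, d))
    g /= np.linalg.norm(g, axis=1, keepdims=True)
    return np.sqrt(rng.beta(d / 2, 3.0, size=(n, 1))) * g

def ss_lmc(grad_U, xi, beta, eta, k, r, n_batch, seed=0):
    """SS-LMC: the drift averages grad U at n_batch inputs perturbed by
    r * zeta with zeta ~ rho, an unbiased estimate of grad (U * rho_r)."""
    rng = np.random.default_rng(seed)
    y = np.array(xi, dtype=float)
    d = y.size
    for _ in range(k):
        zetas = sample_rho(d, n_batch, rng)
        g = np.mean([grad_U(y + r * z) for z in zetas], axis=0)
        y = y - eta * g + np.sqrt(2.0 * eta / beta) * rng.standard_normal(d)
    return y
```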
3.2.1. Regularity conditions.
Let us set the following assumptions.
-
(C1)
$U\in W_{\text{loc}}^{1,\infty}(\mathbb{R}^{d})$
. -
(C2)
$|\nabla U(\mathbf{0})|<\infty$
and the modulus of continuity of
$\nabla U$
is bounded, that is, for some
\begin{align*} \omega_{\nabla U}(r)\,:\!=\,\sup\nolimits_{x,y\in\mathbb{R}^{d}:|x-y|\le r}\left|\nabla U(x)-\nabla U(y)\right|<\infty \end{align*}
$r\in(0,1]$
.
-
(C3) There exist
$m,b>0$
such that for all
$x\in\mathbb{R}^{d}$
,
\begin{align*} \left\langle x,\nabla U\left(x\right)\right\rangle &\ge m\left|x\right|^{2}-b. \end{align*}
Let us observe that (C1)–(C3) yield (A1)–(A5). Assumption (A1) is the same as (C1). Assumption (C2) yields (A2) by Lemma 4.4 and (A3) by
$|\nabla \bar{U}_{r}(\mathbf{0})|\le |\nabla U(\mathbf{0})|+\omega_{\nabla U}(1)$
and
$\omega_{\nabla \bar{U}_{r}}(1)\le 3\omega_{\nabla U}(1)<\infty$
Assumption (A4) also holds; Section 5 gives the detailed derivation of the corresponding dissipativity inequality.
3.2.2. Examples of distributions with the regularity conditions.
We show a simple class of potential functions satisfying (C1)–(C3) and some examples in Bayesian inference; assume
$\beta=1$
for simplicity of interpretation. Let us consider a possibly non-convex loss with elastic net regularization such that
where
$L \, : \, \mathbb{R}^{d}\to[0,\infty)$
is in
$W_{\text{loc}}^{1,\infty}(\mathbb{R}^{d})$
with a weak gradient
$\nabla L$
satisfying
$\|\nabla L\|_{\infty}<\infty$
,
$\lambda_{1}\ge 0$
,
$\lambda_{2}>0$
,
$R_{1}(x)=\sum_{i=1}^{d}|x^{(i)}|$
with
$x^{(i)}$
indicating the ith component of x, and
$R_{2}(x)=|x|^{2}$
. Fix a weak gradient of
$R_{1}$
as
$\nabla R_{1}(x)=\left(\text{sgn}(x^{(1)}),\ldots,\text{sgn}(x^{(d)})\right)$
; then
$\omega_{\nabla U}(1)\le 2(\|\nabla L\|_{\infty}+\lambda_{1}+\lambda_{2})<\infty$
and
$\langle x,\nabla U(x)\rangle\ge \lambda_{2}|x|^{2}-\|\nabla L\|_{\infty}^{2}/4\lambda_{2}$
since
$\langle x,\nabla R_{1}(x)\rangle \ge 0$
for all
$x\in\mathbb{R}^{d}$
. Note that regularization corresponds to the potentials of prior distributions in Bayesian inference; we can regard L(x) as a negative quasi-log-likelihood and
$(\lambda_{1}/\sqrt{d})R_{1}(x)+\lambda_{2}R_{2}(x)$
as a combination of Laplace prior and Gaussian prior [Reference Cui, Tong and Zahm12].
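For illustration, a weak gradient of this elastic-net potential can be coded directly; numpy's sign (with $\text{sgn}(0)=0$) is one admissible choice of representative, and grad_L is a hypothetical stand-in for a user-supplied $\nabla L$.

```python
import numpy as np

def make_grad_U(grad_L, lam1, lam2):
    """Weak gradient of U = L + (lam1 / sqrt(d)) R1 + lam2 R2, where
    R1(x) = sum_i |x^(i)| and R2(x) = |x|^2."""
    def grad_U(x):
        d = x.size
        return grad_L(x) + (lam1 / np.sqrt(d)) * np.sign(x) + 2.0 * lam2 * x
    return grad_U

# Example: grad_L = 0 reduces U to the pure elastic-net regularizer.
grad_U = make_grad_U(lambda x: np.zeros_like(x), lam1=0.1, lam2=0.5)
```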
Non-convex losses with bounded weak gradients often appear in nonlinear and robust regression. We first examine a squared loss for nonlinear regression (or equivalently nonlinear regression with Gaussian errors) such that
$L_\textrm{NLR}(x)=\frac{1}{2\sigma^{2}}\sum_{\ell=1}^{N}\left(y_{\ell}-\phi_{\ell}\left(x\right)\right)^{2}$
, where
$N\in\mathbb{N}$
,
$\sigma>0$
is fixed,
$y_{\ell}\in\mathbb{R}$
, and
$\phi_{\ell}\in W_{\text{loc}}^{1,\infty}(\mathbb{R}^{d})$
with
$\|\phi_{\ell}\|_{\infty}+\|\nabla \phi_{\ell}\|_{\infty}<\infty$
for some
$\nabla \phi_{\ell}$
(e.g., a two-layer neural network with clipped rectified linear unit activation such that
$\phi_{\ell}(x)=(1/W)\sum_{w=1}^{W}a_{w}\varphi_{[0,c]}(\langle x_{w},f_{\ell}\rangle)$
, where
$\varphi_{[0,c]}(t)=(0\vee t)\wedge c$
with
$t\in\mathbb{R}$
,
$a_{w}\in\{-1,1\}$
and
$c>0$
are fixed,
$f_{\ell}\in\mathbb{R}^{F}$
,
$x=(x_{1},\ldots,x_{W})\in \mathbb{R}^{FW}$
,
$F,W\in\mathbb{N}$
, and
$d=FW$
). This
$L_\textrm{NLR}$
indeed satisfies
$\|\nabla L_\textrm{NLR}\|_{\infty}\le \sum_{\ell=1}^{N}(|y_{\ell}|+\|\phi_{\ell}\|_{\infty})\|\nabla \phi_{\ell}\|_{\infty}/\sigma^{2}<\infty$
. Another example is a Cauchy loss for robust linear regression (or equivalently linear regression with Cauchy errors) such that
$L_\textrm{RLR}(x)=\sum_{\ell=1}^{N}\log(1+|y_{\ell}-\langle f_{\ell},x\rangle|^{2}/\sigma^{2})$
, where
$N\in\mathbb{N}$
,
$\sigma>0$
is fixed,
$y_{\ell}\in\mathbb{R}$
, and
$f_{\ell}\in\mathbb{R}^{d}$
. The fact
$|\frac{\text{d}}{\text{d} t}\log(1+t^{2}/\sigma^{2})|=|2t/(t^{2}+\sigma^{2})|\le 1/\sigma$
for all
$t\in\mathbb{R}$
yields
$\|\nabla L_\textrm{RLR}\|_{\infty}\le \sum_{\ell=1}^{N}|f_{\ell}|/\sigma<\infty$
.
3.2.3. Error estimate and sampling complexity of SS-LMC.
We now give our error estimate for SS-LMC.
Corollary 3.2. (Error estimate of SS-LMC.) Under (C1)–(C3) and (A6), there exists a constant
$C\ge 1$
independent of
$N_{B},r,k,\eta,d,c_{\text{P}}(\pi)$
such that for all
$k\in\mathbb{N}$
,
$\eta\in(0,1\wedge (m/(4(\omega_{\nabla U}(1))^{2}))]$
,
$r\in(0,1]$
, and
$N_{B}\in\mathbb{N}$
with
$(d^{2}(\omega_{\nabla U}(r)/r)\eta+(\omega_{\nabla U}(r))^{2}/N_{B})k\eta+r\omega_{\nabla U}(r)\le 1$
,
\begin{align*} \mathcal{W}_{2}\left(\mu_{k\eta},\pi\right)&\le C\sqrt{d}\sqrt[4]{\left(d^{2}\frac{\omega_{\nabla U}(r)}{r}\eta+\frac{(\omega_{\nabla U}(r))^{2}}{N_{B}}\right)k\eta+r\omega_{\nabla U}(r)}\\ &\quad+\mathrm{e}^{Cd}\exp\left(-\frac{k\eta}{Cc_{\text{P}}(\pi)}\right). \end{align*}
We give a version of the error estimate for convenience; to see that the convergence
$\omega_{\nabla U}(r)\downarrow 0$
is unnecessary, we consider a rough version of the upper bound by replacing
$\omega_{\nabla U}(r)$
with the constant
$\omega_{\nabla U}(1)$
.
Corollary 3.3. Under (C1)–(C3) and (A6), there exists a constant
$C\ge1$
independent of
$N_{B}$
, r, k,
$\eta$
, d, and
$c_{\text{P}}(\pi)$
such that for all
$N_{B}\in\mathbb{N}$
,
$k\in\mathbb{N}$
,
$\eta\in(0,1\wedge (m/(4(\omega_{\nabla U}(1))^{2}))]$
, and
$r\in(0,1]$
with
$\left(d^{2}r^{-1}\eta+N_{B}^{-1}\right)k\eta+r\le 1$
,
\begin{align*} \mathcal{W}_{2}\left(\mu_{k\eta},\pi\right)\le C\sqrt{d}\sqrt[4]{\left(d^{2}r^{-1}\eta+N_{B}^{-1}\right)k\eta+r}+\mathrm{e}^{Cd}\exp\left(-\frac{k\eta}{Cc_{\text{P}}(\pi)}\right).\end{align*}
We obtain the following estimate of the sampling complexity; the proof is identical to that of Proposition 3.1.
Proposition 3.2. Assume (C1)–(C3) and (A6) and fix
$\epsilon\in(0,1]$
. If
$r=\epsilon^{4}/48C^{4}d^{2}$
,
$N_{B}\ge 48C^{4}d^{2}(Cc_{\text{P}}(\pi)(\log(2/\epsilon)+Cd)+1)/\epsilon^{4}$
, and
$\eta$
satisfies
\begin{align*} \eta\le 1\wedge \frac{m}{4\left(\omega_{\nabla U}(1)\right)^{2}}\wedge \frac{r\epsilon^{4}}{48C^{4}d^{4}\left(Cc_{\text{P}}(\pi)\left(\log(2/\epsilon)+Cd\right)+1\right)},\end{align*}
then
$\mathcal{W}_{2}\left(\mu_{k\eta},\pi\right)\le \epsilon$
for
$k=\lceil Cc_{\text{P}}(\pi)(\log(2/\epsilon)+Cd)/\eta\rceil$
.
Since the complexities of
$N_{B}$
and k are given as
$N_{B}=\mathcal{O}(d^{2}c_{\text{P}}(\pi)(\log\epsilon^{-1}+d)/\epsilon^{4})$
and
$k=\mathcal{O}(d^{6}c_{\text{P}}(\pi)^{2}(\log\epsilon^{-1}+d)^{2}/\epsilon^{8})$
, we obtain the sampling complexity of SS-LMC as
$N_{B}k=\mathcal{O}(d^{8}c_{\text{P}}(\pi)^{3}(\log\epsilon^{-1}+d)^{3}/\epsilon^{12})$
or
$N_{B}k=\widetilde{\mathcal{O}}(d^{11}c_{\text{P}}(\pi)^{3}/\epsilon^{12})$
, where
$\widetilde{\mathcal{O}}$
ignores logarithmic factors.
3.3. The spherically smoothed stochastic gradient Langevin Monte Carlo algorithm
We consider a sampling algorithm for potentials such that for some
$N\in\mathbb{N}$
,
\begin{align} U\left(x\right)=\frac{1}{N}\sum_{\ell=1}^{N}U_{\ell}\left(x\right),\end{align}
where
$U_{\ell}(x)$
are non-negative functions with the following assumptions.
-
(D1) For all
$\ell=1,\ldots,N$
,
$U_{\ell}\in W_{\text{loc}}^{1,\infty}(\mathbb{R}^{d})$
. -
(D2)
$|\nabla U_{\ell}(\mathbf{0})|<\infty$
for all
$\ell=1,\ldots,N$
and there exists a function
$\hat{\omega} \, : \, [0,\infty)\to[0,\infty)$
such that for all
$r\in(0,1]$
and
$\ell=1,\ldots,N$
,
\begin{align*} \sup_{x,y\in\mathbb{R}^{d}:|x-y|\le r}\left|\nabla U_{\ell}(x)-\nabla U_{\ell}(y)\right|\le \hat{\omega}\left(r\right)<\infty. \end{align*}
-
(D3) There exist
$m,b>0$
such that for all
$x\in\mathbb{R}^{d}$
,
\begin{align*} \left\langle x,\nabla U\left(x\right)\right\rangle &\ge m\left|x\right|^{2}-b. \end{align*}
We define the stochastic gradient
\begin{align*} {G}\left(x,a_{i\eta}\right)=\frac{1}{N_{B}}\sum_{j=1}^{N_{B}}\nabla U_{\lambda_{i,j}}\left(x+r'\zeta_{i,j}\right),\end{align*}
where
$N_{B}\in\mathbb{N}$
,
$r'\in(0,1]$
,
$a_{i\eta}=[\lambda_{i,1},\ldots,\lambda_{i,N_{B}},\zeta_{i,1},\ldots,\zeta_{i,N_{B}}]$
,
$\{\lambda_{i,j}\}$
is a sequence of i.i.d. random variables with the discrete uniform distribution on the integers
$1,\ldots,N$
, and
$\{\zeta_{i,j}\}$
is a sequence of i.i.d. random variables with the density
$\rho$
and independence of
$\{\lambda_{i,j}\}$
. Then for any
$x\in\mathbb{R}^{d}$
, we have
\begin{align*} E\left[{G}\left(x,a_{i\eta}\right)\right]=\nabla \bar{U}_{r'}\left(x\right)\end{align*}
and
\begin{align*} E\left[\left|{G}\left(x,a_{i\eta}\right)-\nabla \bar{U}_{r'}(x)\right|^{2}\right]&= \frac{1}{N_{B}}E\left[\left|\nabla U_{\lambda_{i,j}}\left(x+r'\zeta_{i,j}\right)-\nabla \bar{U}_{r'}(x)\right|^{2}\right]\\ &\le \frac{1}{N_{B}}E\left[\left|\nabla U_{\lambda_{i,j}}\left(x+r'\zeta_{i,j}\right)\right|^{2}\right]\\ &\le \frac{2}{N_{B}}\max_{\ell=1,\ldots,N}((\omega_{\nabla U_{\ell}}(1))^{2}|x|^{2}+(|\nabla U_{\ell}(\mathbf{0})|+2\omega_{\nabla U_{\ell}}(1))^{2})\end{align*}
by Lemma 4.4. We obtain (A5) with
$\delta_{\mathbf{b},r',0}=\delta_{\mathbf{b},r',2}=0$
,
$\delta_{\mathbf{v},0}=(\max_{\ell}|\nabla U_{\ell}(\mathbf{0})|+2\hat{\omega}(1))^{2}/N_{B}$
, and
$\delta_{\mathbf{v},2}=(\hat{\omega}(1))^{2}/N_{B}$
. We call this algorithm the spherically smoothed stochastic gradient Langevin Monte Carlo algorithm.
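A minimal sketch of this stochastic gradient, assuming a hypothetical interface grad_U_ell(l, x) that evaluates $\nabla U_{\ell}$ at x, with $\ell$ indexed from 0 for convenience:

```python
import numpy as np

def ss_sg_lmc_grad(grad_U_ell, N, x, r, n_batch, rng):
    """SS-SG-LMC stochastic gradient: average grad U_l at spherically
    perturbed inputs, with l drawn uniformly from {0, ..., N-1}."""
    d = x.size
    labels = rng.integers(0, N, size=n_batch)                     # lambda_{i,j}
    g = rng.standard_normal((n_batch, d))
    g /= np.linalg.norm(g, axis=1, keepdims=True)
    zetas = np.sqrt(rng.beta(d / 2, 3.0, size=(n_batch, 1))) * g  # zeta_{i,j}
    return np.mean([grad_U_ell(l, x + r * z) for l, z in zip(labels, zetas)],
                   axis=0)
```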
Assumptions (D1)–(D3) yield (A1)–(A5) with the same discussion as for SS-LMC.
Corollary 3.4 (Error estimate of SS-SG-LMC.) Under (D1)–(D3) and (A6), there exists a constant
$C\ge 1$
independent of
$N_{B},r,k,\eta,d,c_{\text{P}}(\pi)$
such that for all
$k\in\mathbb{N}$
,
$\eta\in(0,1\wedge (m/(8(\hat{\omega}(1))^{2}))]$
,
$r\in(0,1]$
, and
$N_{B}\in\mathbb{N}$
with
$(d^{2}(\hat{\omega}(r)/r)\eta+d/N_{B})k\eta+r\hat{\omega}(r)\le 1$
,
\begin{align*} \mathcal{W}_{2}\left(\mu_{k\eta},\pi\right)\le C\sqrt{d}\sqrt[4]{\left(d^{2}\frac{\hat{\omega}(r)}{r}\eta+\frac{d}{N_{B}}\right)k\eta+r\hat{\omega}(r)}+\mathrm{e}^{Cd}\exp\left(-\frac{k\eta}{Cc_{\text{P}}(\pi)}\right). \end{align*}
We give a rough upper bound by replacing
$\hat{\omega}(r)$
with the constant
$\hat{\omega}(1)$
as in the discussion on SS-LMC.
Corollary 3.5. Under (D1)–(D3) and (A6), there exists a constant
$C\ge1$
independent of
$N_{B}$
, r, k,
$\eta$
, d, and
$c_{\text{P}}(\pi)$
such that for all
$N_{B}\in\mathbb{N}$
,
$k\in\mathbb{N}$
,
$\eta\in(0,1\wedge (m/(8(\hat{\omega}(1))^{2}))]$
, and
$r\in(0,1]$
with
$\left(d^{2}r^{-1}\eta+dN_{B}^{-1}\right)k\eta+r\le 1$
,
\begin{align*} \mathcal{W}_{2}\left(\mu_{k\eta},\pi\right)\le C\sqrt{d}\sqrt[4]{\left(d^{2}r^{-1}\eta+dN_{B}^{-1}\right)k\eta+r}+\mathrm{e}^{Cd}\exp\left(-\frac{k\eta}{Cc_{\text{P}}(\pi)}\right).\end{align*}
Using this estimate, we obtain the following estimate of the sampling complexity, which is lower than that of SS-LMC for U given by Eq. (5) if $N>d$, since the complexity of computing G in SS-LMC for this U increases by a factor of N while the sampling complexity of SS-SG-LMC deteriorates only by a factor of d in comparison to that of SS-LMC.
$N>d$
since the complexity to compute G in SS-LMC for this U increases by a factor of N and the sampling complexity of SS-SG-LMC deteriorates by a factor of d in comparison to that of SS-LMC.
Proposition 3.3. Assume (D1)–(D3) and (A6) and fix
$\epsilon\in(0,1]$
. If
$r=\epsilon^{4}/48C^{4}d^{2}$
,
$N_{B}\ge 48C^{4}d^{3}(Cc_{\text{P}}(\pi)(\log(2/\epsilon)+Cd)+1)/\epsilon^{4}$
, and
$\eta$
satisfies
\begin{align*} \eta\le 1&\wedge \frac{m}{8\left(\hat{\omega}(1)\right)^{2}}\wedge \frac{r\epsilon^{4}}{48C^{4}d^{4}(Cc_{\text{P}}(\pi)(\log(2/\epsilon)+Cd)+1)}, \end{align*}
then
$\mathcal{W}_{2}\left(\mu_{k\eta},\pi\right)\le \epsilon$
for
$k=\lceil Cc_{\text{P}}(\pi)(\log(2/\epsilon)+Cd)/\eta\rceil$
.
3.4. Zeroth-order Langevin algorithms
Let us consider a zeroth-order version of SS-LMC as an analogue to [Reference Roy, Shen, Balasubramanian and Ghadimi32] with the following G under (C1)–(C3) and the assumption
$|U(x)|<\infty$
for all
$x\in\mathbb{R}^{d}$
:
\begin{align*} G(x,a_{i\eta})=\frac{1}{N_{B}}\sum_{j=1}^{N_{B}}G_{j}(x,a_{i\eta})\,:\!=\,\frac{1}{N_{B}}\sum_{j=1}^{N_{B}}\frac{U\left(x+r'\zeta_{i,j}\right)-U\left(x\right)}{r'}\frac{4\zeta_{i,j}}{\left(1-|\zeta_{i,j}|^{2}\right)},\end{align*}
where
$N_{B}\in\mathbb{N}$
,
$r'\in(0,1]$
, and
$\{\zeta_{i,j}\}$
is an i.i.d. sequence of random variables with the density
$\rho$
The fact that
\begin{align*} \frac{4z}{1-\left|z\right|^{2}}=-\frac{\nabla \rho\left(z\right)}{\rho\left(z\right)}\quad\text{for all }z\in B_{1}\left(\mathbf{0}\right),\end{align*}
the symmetry of
$\rho$
, approximation of
$\rho\in \mathcal{C}_{0}^{1}(\mathbb{R}^{d})\cap W^{1,1}(\mathbb{R}^{d})$
, and the essential boundedness of U and
$\nabla U$
on compact sets by Lemmas 4.4 and 4.6 yield that for all
$x\in\mathbb{R}^{d}$
,
\begin{align*} E\left[G_{j}(x,a_{i\eta})\right] &=\int_{\mathbb{R}^{d}}\frac{U\left(x+r'z\right)-U\left(x\right)}{r'}\frac{-\nabla \rho\left(z\right)}{\rho\left(z\right)}\rho\left(z\right)\text{d} z\\ &=-\int_{\mathbb{R}^{d}}\frac{U\left(x+r'z\right)-U\left(x\right)}{r'}\nabla \rho\left(z\right)\text{d} z\\ &=-\int_{\mathbb{R}^{d}}\left(U\left(x+y\right)-U\left(x\right)\right)\left(\frac{1}{(r')^{d+1}}\nabla \rho\left(\frac{y}{r'}\right)\right)\text{d} y\\ &=\int_{\mathbb{R}^{d}}\nabla U\left(x+y\right)\rho_{r'}\left(y\right)\text{d} y\\ &=\nabla \bar{U}_{r'}\left(x\right).\end{align*}
Lemma 4.5, the convexity of
$f(a)=a^{2}$
with
$a\in\mathbb{R}$
, and the equality
\begin{align*} \int_{B_{1}(\mathbf{0})}\frac{\left|\nabla \rho\left(z\right)\right|^{2}}{\rho\left(z\right)}\text{d} z &=\frac{16\Gamma(d/2)}{\pi^{d/2}\text{B}(d/2,3)}\int_{B_{1}(\mathbf{0})}\left|z\right|^{2}\text{d} z =\frac{32}{\text{B}(d/2,3)}\int_{0}^{1}r^{d+1}\text{d} r\\ &=\frac{32\Gamma(d/2+3)}{\Gamma(d/2)\Gamma(3)(d+2)} =\frac{16(d/2+2)(d/2+1)(d/2)}{(d+2)}\\ &=2d(d+4)\end{align*}
give that for almost all
$x\in\mathbb{R}^{d}$
,
\begin{align*} E\left[\left|G_{1}(x,a_{i\eta})\right|^{2}\right] &=\int_{B_{1}(\mathbf{0})}\left(\frac{U\left(x+r'z\right)-U\left(x\right)}{r'}\right)^{2}\frac{\left|\nabla \rho\left(z\right)\right|^{2}}{\rho\left(z\right)}\text{d} z\\ &\le \left(\frac{3}{2}\left\|\nabla U\right\|_{\mathbb{M}}+\omega_{\nabla U}\left(1\right)\left|x\right|\right)^{2}\int_{B_{1}(\mathbf{0})}\frac{\left|\nabla \rho\left(z\right)\right|^{2}}{\rho\left(z\right)}\text{d} z\\ &= 2d(d+4)\left(\frac{3}{2}\left\|\nabla U\right\|_{\mathbb{M}}+\omega_{\nabla U}\left(1\right)\left|x\right|\right)^{2}\\ &\le d(d+4)\left(9\left\|\nabla U\right\|_{\mathbb{M}}^{2}+4\left(\omega_{\nabla U}(1)\right)^{2}\left|x\right|^{2}\right).\end{align*}
These properties along with
\begin{align*} E\left[\left|G\left(x,a_{i\eta}\right)-\nabla \bar{U}_{r'}\left(x\right)\right|^{2}\right]\le\frac{1}{N_{B}}E\left[\left|G_{1}\left(x,a_{i\eta}\right)\right|^{2}\right]\end{align*}
yield (A5) with
$\delta_{\mathbf{b},r,0}=\delta_{\mathbf{b},r,2}=0$
,
$\delta_{\mathbf{v},0}=9d(d+4)\left\|\nabla U\right\|_{\mathbb{M}}^{2}/2N_{B}$
, and
$\delta_{\mathbf{v},2}=2d(d+4)(\omega_{\nabla U}(1))^{2}/N_{B}$
if
$r=r'$
. Hence, SG-LMC with this G can also achieve
$\mathcal{W}_{2}(\mu_{k\eta},\pi)\le \epsilon$
for arbitrary
$\epsilon>0$
. Note that the complexity deteriorates by a factor of
$\mathcal{O}(d^{3})$
in comparison to SS-LMC; the batch size
$N_{B}$
to achieve
$\mathcal{W}_{2}(\mu_{k\eta},\pi)\le \epsilon$
is of order
$\mathcal{O}(d^{5}c_{\text{P}}(\pi)(\log\epsilon^{-1}+d)/\epsilon^{4})$
since
$\delta_{\mathbf{v},2}=0$
does not hold and both
$\delta_{\mathbf{v},0}$
and
$\delta_{\mathbf{v},2}$
are of order
$\mathcal{O}(d^{2}/N_{B})$
.
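A minimal sketch of this zeroth-order estimator, assuming U is a callable potential; note that $1-|\zeta_{i,j}|^{2}>0$ almost surely since $|\zeta_{i,j}|^{2}$ follows $\textrm{Beta}(d/2,3)$.

```python
import numpy as np

def zeroth_order_grad(U, x, r, n_batch, rng):
    """Zeroth-order estimate of grad (U * rho_r)(x): finite differences
    of U weighted by 4 * zeta / (1 - |zeta|^2), with zeta ~ rho."""
    d = x.size
    g = rng.standard_normal((n_batch, d))
    g /= np.linalg.norm(g, axis=1, keepdims=True)
    zetas = np.sqrt(rng.beta(d / 2, 3.0, size=(n_batch, 1))) * g
    weights = 4.0 * zetas / (1.0 - np.sum(zetas ** 2, axis=1, keepdims=True))
    diffs = np.array([(U(x + r * z) - U(x)) / r for z in zetas])
    return np.mean(diffs[:, None] * weights, axis=0)
```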
We can also consider a zeroth-order version of SS-SG-LMC with the potential U in Eq. (5) and the following G under (D1)–(D3) and the assumption
$|U_{\ell}(x)|<\infty$
for all
$\ell=1,\ldots,N$
and
$x\in\mathbb{R}^{d}$
:
\begin{align*} G(x,a_{i\eta})=\frac{1}{N_{B}}\sum_{j=1}^{N_{B}}G_{j}(x,a_{i\eta})\,:\!=\,\frac{1}{N_{B}}\sum_{j=1}^{N_{B}}\frac{U_{\lambda_{i,j}}\left(x+r'\zeta_{i,j}\right)-U_{\lambda_{i,j}}\left(x\right)}{r'}\frac{4\zeta_{i,j}}{1-|\zeta_{i,j}|^{2}},\end{align*}
where
$N_{B}\in\mathbb{N}$
,
$r'\in(0,1]$
,
$a_{i\eta}=[\lambda_{i,1},\ldots,\lambda_{i,N_{B}},\zeta_{i,1},\ldots,\zeta_{i,N_{B}}]$
,
$\{\lambda_{i,j}\}$
is a sequence of i.i.d. random variables with the discrete uniform distribution on
$\{1,\ldots,N\}$
, and
$\{\zeta_{i,j}\}$
is a sequence of i.i.d. random variables with the density
$\rho$
and independence of
$\{\lambda_{i,j}\}$
. We see that for all
$x\in\mathbb{R}^{d}$
,
\begin{align*} E\left[G(x,a_{i\eta})\right]=\frac{1}{NN_{B}}\sum_{j=1}^{N_{B}}\sum_{\ell=1}^{N}\int\frac{U_{\ell}\left(x+r'z\right)-U_{\ell}\left(x\right)}{r'}(\! - \! \nabla \rho(z))\text{d} z =\nabla \bar{U}_{r'}(x),\end{align*}
and for almost all
$x\in\mathbb{R}^{d}$
,
\begin{align*} E\left[\left|G(x,a_{i\eta})-\nabla \bar{U}_{r'}(x)\right|^{2}\right] &\le \frac{1}{N_{B}}E\left[\left|\frac{U_{\lambda_{i,j}}\left(x+r'\zeta_{i,j}\right)-U_{\lambda_{i,j}}\left(x\right)}{r'}\frac{4\zeta_{i,j}}{(1-|\zeta_{i,j}|^{2})}\right|^{2}\right]\\ &=\frac{1}{NN_{B}}\sum_{\ell=1}^{N}\int\left|\frac{U_{\ell}\left(x+r'z\right)-U_{\ell}\left(x\right)}{r'}\right|^{2}\frac{\left|\nabla \rho(z)\right|^{2}}{\rho(z)}\text{d} z\\ &\le \frac{1}{N_{B}}d(d+4)\max_{\ell=1,\ldots,N}\left(9\left\|\nabla U_{\ell}\right\|_{\mathbb{M}}^{2}+4\left(\hat{\omega}(1)\right)^{2}\left|x\right|^{2}\right).\end{align*}
Hence, assumption (A5) for this G holds with
$\delta_{\mathbf{b},r',0}=\delta_{\mathbf{b},r',2}=0$
,
$\delta_{\mathbf{v},0}=9d(d+4)(\max_{\ell}\left|\nabla U_{\ell}(\mathbf{0})\right|+\hat{\omega}(1))^{2}/2N_{B}$
, and
$\delta_{\mathbf{v},2}=2d(d+4)(\hat{\omega}(1))^{2}/N_{B}$
if
$r=r'$
. Therefore, this SG-LMC can achieve
$\mathcal{W}_{2}(\mu_{k\eta},\pi)\le \epsilon$
for any
$\epsilon>0$
, though the complexity is worse than that of SS-SG-LMC by a factor of
$\mathcal{O}(d^{2})$
.
4. Preliminary results
We give preliminary results on the compact polynomial mollifier, mollification of functions with the finite moduli of continuity, and the representation of the likelihood ratio between the solutions of SDEs via a Liptser–Shiryaev-type condition for change of measures. We also introduce the fundamental theorem of calculus for weakly differentiable functions, a well-known sufficient condition of Poincaré inequalities and convergence in
$\mathcal{W}_{2}$
with the inequalities, and upper bounds for Wasserstein distances.
4.1. The fundamental theorem of calculus for weakly differentiable functions
We use the following result on the fundamental theorem of calculus for functions in
$W_{\text{loc}}^{1,\infty}(\mathbb{R}^{d})$
.
Proposition 4.1. (Lieb and Loss [Reference Lieb and Loss23], Anastassiou [Reference Anastassiou2].) For each
$f\in W_{\text{loc}}^{1,\infty}(\mathbb{R}^{d})$
, for almost all
$x,y\in\mathbb{R}^{d}$
,
\begin{align*} f\left(y\right)-f\left(x\right)=\int_{0}^{1}\left\langle\nabla f\left(x+t\left(y-x\right)\right),y-x\right\rangle\text{d} t.\end{align*}
4.2. Properties of the compact polynomial mollifier
We analyse the mollifier
$\rho$
proposed in Eq. (4). Note that our non-asymptotic analysis needs mollifiers of class
$\mathcal{C}^{1}$
whose gradients have explicit
$L^{1}$
-bounds and whose supports are in the unit ball of
$\mathbb{R}^{d}$
, and it is non-trivial to obtain explicit
$L^{1}$
-bounds for the gradients of well-known
$\mathcal{C}^{\infty}$
mollifiers.
Remark 4.1. We need mollifiers of class
$\mathcal{C}^{1}$
to let
$U\ast\rho$
with
$U\in W_{\text{loc}}^{1,\infty}(\mathbb{R}^{d})$
be of class
$\mathcal{C}^{2}$
and give a bound for the constant of a Poincaré inequality by [Reference Bakry, Barthe, Cattiaux and Guillin4]; see Lemma 4.7 and Proposition 4.4.
The following lemma gives some properties of
$\rho$
.
Lemma 4.1.
-
(1)
$\rho\in\mathcal{C}^{1}(\mathbb{R}^{d})$
. -
(2)
$\int \rho(x)\text{d} x=1$
. -
(3)
$\int |\nabla \rho(x)|\text{d} x\le d+2$
.
Proof.
-
(1) We check the behaviour of
$\nabla \rho$
on a neighbourhood of
$\{x\in\mathbb{R}^{d} \, : \, |x|=1\}$
. For all $x\in\mathbb{R}^{d}$ with $|x|<1$,
\begin{align*} \nabla \rho(x)&=\left(\frac{\pi^{d/2}\text{B}(d/2,3)}{\Gamma(d/2)}\right)^{-1}\left(-4\right)\left(1-|x|^{2}\right)x, \end{align*}
and thus $\nabla \rho(x)$ is continuous at any $x\in\mathbb{R}^{d}$ by $\nabla \rho(x)=\mathbf{0}$ for all $x\in\mathbb{R}^{d}$ with $|x|=1$.
-
(2) With the change of coordinates from Euclidean to hyperspherical, and the change of variables such that $\sqrt{s}=r$ and $(1/2\sqrt{s})\text{d} s=\text{d} r$, we have
\begin{align*} \int \rho(x)\text{d} x&=\frac{2}{\text{B}(d/2,3)}\int_{0}^{1} r^{d-1}\left(1-r^{2}\right)^{2}\text{d} r=\frac{1}{\text{B}(d/2,3)}\int_{0}^{1} s^{d/2-1}\left(1-s\right)^{2}\text{d} s=1. \end{align*}
-
(3) With respect to the
$L^{1}$
-norm of the gradient, we have
\begin{align*} \int |\nabla \rho(x)|\text{d} x&=\int_{|x|\le 1}\left(\frac{\pi^{d/2}\text{B}(d/2,3)}{\Gamma(d/2)}\right)^{-1}4\left(1-|x|^{2}\right)|x|\text{d} x\\ &=\frac{8}{\text{B}(d/2,3)}\int_{0}^{1}r^{d}\left(1-r^{2}\right)\text{d} r =\frac{4}{\text{B}(d/2,3)}\int_{0}^{1}s^{d/2-1/2}\left(1-s\right)\text{d} s\\ &=\frac{4\text{B}(d/2+1/2,2)}{\text{B}(d/2,3)} =\frac{4\Gamma(d/2+1/2)\Gamma(2)\Gamma(d/2+3)}{\Gamma(d/2+5/2)\Gamma(d/2)\Gamma(3)}\\ &=\frac{(d+4)(d+2)d}{(d+3)(d+1)} \le d+2 \end{align*}
because $(d+4)d\le(d+3)(d+1)$. Therefore, the statement holds true.
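These closed forms are easy to sanity-check numerically; the following sketch Monte Carlo integrates $\rho$ and $|\nabla\rho|$ over the unit ball (uniform sampling from the ball is an implementation choice).

```python
import numpy as np
from scipy.special import beta as beta_fn, gamma as gamma_fn

def check_lemma_4_1(d, n=200_000, seed=0):
    """Monte Carlo check of int rho = 1 and
    int |grad rho| = (d+4)(d+2)d / ((d+3)(d+1)) <= d + 2."""
    rng = np.random.default_rng(seed)
    g = rng.standard_normal((n, d))
    g /= np.linalg.norm(g, axis=1, keepdims=True)
    pts = g * rng.uniform(size=(n, 1)) ** (1.0 / d)      # uniform on B_1(0)
    vol = np.pi ** (d / 2) / gamma_fn(d / 2 + 1)         # volume of B_1(0)
    const = gamma_fn(d / 2) / (np.pi ** (d / 2) * beta_fn(d / 2, 3))
    s = np.sum(pts ** 2, axis=1)
    mass = vol * np.mean(const * (1.0 - s) ** 2)
    grad_l1 = vol * np.mean(const * 4.0 * (1.0 - s) * np.sqrt(s))
    exact = (d + 4) * (d + 2) * d / ((d + 3) * (d + 1))
    return mass, grad_l1, exact

print(check_lemma_4_1(3))  # approximately (1.0, 4.375, 4.375)
```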
We show the optimality of the compact polynomial mollifier; the
$L^{1}$
-norms of the gradients of
$\mathcal{C}^{1}$
non-negative mollifiers with supports in
$B_{1}(\mathbf{0})$
are bounded below by d.
Lemma 4.2. Assume that
$p \, : \, \mathbb{R}^{d}\to[0,\infty)$
is a continuously differentiable non-negative function whose support is in the unit ball of
$\mathbb{R}^{d}$
such that
$\int p(x)\text{d} x=1$
. Then
\begin{align*} \int_{\mathbb{R}^{d}}\left|\nabla p(x)\right|\text{d} x\ge d.\end{align*}
Proof. Since
$p\in \mathcal{C}^{1}(\mathbb{R}^{d})$
, the
$L^{1}$
-norm of the gradient equals the total variation; that is, for arbitrary
$R> 1$
,
\begin{align*} \int_{\mathbb{R}^{d}}\left|\nabla p(x)\right|\text{d} x&=\int_{B_{R}\left(\mathbf{0}\right)}\left|\nabla p(x)\right|\text{d} x\\ &=\sup\left\{\int_{B_{R}\left(\mathbf{0}\right)}p(x)\mathrm{div}\varphi(x)\text{d} x\left|\varphi\in\mathcal{C}_{0}^{1}\left(B_{R}\left(\mathbf{0}\right) \, ; \, \mathbb{R}^{d}\right),\left\|\varphi\right\|_{\infty}\le 1\right.\right\}, \end{align*}
where
$\mathcal{C}_{0}^{1}(B_{R}(\mathbf{0});\mathbb{R}^{d})$
is a class of continuously differentiable functions
$\varphi \, : \, \mathbb{R}^{d}\to\mathbb{R}^{d}$
with compact supports in
$B_{R}(\mathbf{0})\subset\mathbb{R}^{d}$
. For all
$\delta\in(0,1]$
, by fixing
$\varphi_{\delta}\in\mathcal{C}_{0}^{1}(B_{R}(\mathbf{0});\mathbb{R}^{d})$
such that
$\varphi_{\delta}(x)=(1-\delta)x$
for all
$x\in B_{1}(\mathbf{0})$
and
$\|\varphi_{\delta}\|_{\infty}\le 1$
, we have
\begin{align*} \int_{B_{R}\left(\mathbf{0}\right)}p(x)\,\mathrm{div}\,\varphi_{\delta}(x)\text{d} x=\left(1-\delta\right)d\int_{\mathbb{R}^{d}}p(x)\text{d} x=\left(1-\delta\right)d.\end{align*}
We obtain the conclusion by taking the limit as
$\delta\to0$
.
4.3. Functions with the finite moduli of continuity and their properties
We consider a class of possibly discontinuous functions containing $\nabla U$ and $\widetilde{G}$, and show lemmas useful for the analysis of SG-LMC.
Let
$\mathbb{M}=\mathbb{M}(\mathbb{R}^{d};\mathbb{R}^{\ell})$
with fixed
$d,\ell\in\mathbb{N}$
denote a class of measurable functions
$\phi \, : \, (\mathbb{R}^{d},\mathcal{B}(\mathbb{R}^{d}))\to(\mathbb{R}^{\ell},\mathcal{B}(\mathbb{R}^{\ell}))$
with (1)
$|\phi(\mathbf{0})|<\infty$
and (2)
$\omega_{\phi}(1)<\infty$
, where
$\omega_{\phi}(\! \cdot \!)$
is the well-known modulus of continuity defined as
\begin{align*} \omega_{\phi}\left(r\right)\,:\!=\,\sup\nolimits_{x,y\in\mathbb{R}^{d}:|x-y|\le r}\left|\phi\left(x\right)-\phi\left(y\right)\right|,\end{align*}
where
$r>0$
. Note that we use the modulus of continuity not to measure the continuity of
$\phi$
, but to measure the fluctuation of
$\phi$
within
$\bar{B}_{r}(x)$
for all
$x\in\mathbb{R}^{d}$
. An intuitive element of
$\mathbb{M}$
is
$\mathbb{I}_{A}$
for an arbitrary measurable set
$A\in\mathcal{B}(\mathbb{R}^{d})$
because
$\omega_{\mathbb{I}_{A}}(r)\le 1$
for any A and
$r>0$
. In the rest of the paper, we sometimes use the notation
$\|\phi\|_{\mathbb{M}}\,:\!=\,|\phi(\mathbf{0})|+\omega_{\phi}(1)$
with
$\phi\in\mathbb{M}$
just for brevity (it is easy to see that
$\mathbb{M}$
equipped with
$\|\cdot\|_{\mathbb{M}}$
is a Banach space).
We introduce the following lemma; this ensures that we can change
$r>0$
arbitrarily if
$\omega_{\phi}(r)<\infty$
with some
$r>0$
, and reveals that considering
$r=1$
is sufficient to capture the large-scale behaviour since the lemma leads to
$\omega_{\phi}(n)\le n\omega_{\phi}(1)$
for any
$n\in\mathbb{N}$
.
Lemma 4.3. For any
$r>0$
and
$\phi\in\mathbb{M}$
,
$\omega_{\phi}(r)=\sup_{t>0}\lceil t\rceil^{-1}\omega_{\phi}(rt)$
.
Proof.
$\omega_{\phi}(r)\le \sup_{t>0}\lceil t\rceil^{-1}\omega_{\phi}(rt)$
and
$\omega_{\phi}(r)\ge \lceil t\rceil^{-1}\omega_{\phi}(rt)$
with
$t\in(0,1]$
hold immediately. Thus, we only examine
$\omega_{\phi}(r)\ge \lceil t\rceil^{-1}\omega_{\phi}(rt)$
for all
$t>1$
.
We fix
$t>1$
. For any
$x,y\in \mathbb{R}^{d}$
with
$|x-y|\le rt$
,
\begin{align*} \left|\phi\left(x\right)-\phi\left(y\right)\right|&\le \sum_{i=1}^{\lceil t\rceil}\left|\phi\left(\frac{(\lceil t\rceil-i+1)x+(i-1)y}{\lceil t\rceil}\right)-\phi\left(\frac{(\lceil t\rceil-i)x+iy}{\lceil t\rceil}\right)\right|\\ &\le \lceil t\rceil \omega_{\phi}(r) \end{align*}
because
$|((\lceil t\rceil-i+1)x+(i-1)y)/\lceil t\rceil-((\lceil t\rceil-i)x+iy)/\lceil t\rceil|=|x-y|/\lceil t\rceil\le r$
.
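Although the proof is elementary, the scaling $\omega_{\phi}(n)\le n\omega_{\phi}(1)$ is easy to check numerically. The following Python sketch is our own illustration (the grid-based estimator and the two test functions are not part of the paper): it estimates the modulus of continuity on a one-dimensional grid and verifies the bound for a discontinuous indicator and for the Lipschitz function $\sin$.
\begin{verbatim}
import numpy as np

def modulus_of_continuity(phi, r, xs):
    # Estimate omega_phi(r) = sup_{|x - y| <= r} |phi(x) - phi(y)| on the grid xs.
    vals = phi(xs)
    best = 0.0
    for i in range(xs.size):
        near = np.abs(xs - xs[i]) <= r
        best = max(best, float(np.max(np.abs(vals[near] - vals[i]))))
    return best

xs = np.linspace(-5.0, 5.0, 1001)
examples = {
    "indicator": lambda x: (x >= 0).astype(float),  # discontinuous, omega(r) <= 1
    "sin": np.sin,                                  # 1-Lipschitz, omega(r) <= min(r, 2)
}
for name, phi in examples.items():
    w1 = modulus_of_continuity(phi, 1.0, xs)
    for n in (2, 3, 4):
        wn = modulus_of_continuity(phi, float(n), xs)
        assert wn <= n * w1 + 1e-9  # Lemma 4.3: omega_phi(n) <= n * omega_phi(1)
        print(name, n, round(wn, 3), "<=", round(n * w1, 3))
\end{verbatim}
The discontinuous indicator passes the check exactly as the Lipschitz example does, illustrating that membership in $\mathbb{M}$ does not require continuity.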
Remark 4.2. Note that continuity and finiteness of the modulus of continuity do not imply each other. For example,
$f(x)=x\sin x$
with
$x\in\mathbb{R}$
is a continuous function without the finite modulus of continuity. On the other hand,
$f(x)=\mathbb{I}_{\mathbb{Q}}\left(x\right)$
with
$x\in\mathbb{R}$
is a trivial example of a function with the finite modulus of continuity and without continuity.
Moreover, continuity along with finiteness of the modulus of continuity does not imply uniform continuity, which we can easily observe from
$f(x)=\sin(x^{2})$
with
$x\in\mathbb{R}$
.

Figure 1. The chains of implications.
Lemma 4.4. (Linear growth of functions with the finite moduli of continuity.) For any
$\phi\in\mathbb{M}(\mathbb{R}^{d};\mathbb{R}^{\ell})$
, we have that for all $x\in\mathbb{R}^{d}$,
\begin{align*} \left|\phi(x)\right|\le \left|\phi(\mathbf{0})\right|+\omega_{\phi}(1)\left(1+\left|x\right|\right). \end{align*}
Proof. Fix $x\in\mathbb{R}^{d}$. Lemma 4.3 gives
\begin{align*} \left|\phi(x)\right|\le \left|\phi(\mathbf{0})\right|+\left|\phi(x)-\phi(\mathbf{0})\right|\le \left|\phi(\mathbf{0})\right|+\omega_{\phi}\left(\left|x\right|\right)\le \left|\phi(\mathbf{0})\right|+\lceil \left|x\right|\rceil\omega_{\phi}(1)\le \left|\phi(\mathbf{0})\right|+\left(1+\left|x\right|\right)\omega_{\phi}(1). \end{align*}
Therefore, the statement holds.
Lemma 4.5. (Local Lipschitz continuity by gradients with the finite moduli of continuity.) Assume that $\Phi\in W_{\text{loc}}^{1,\infty}(\mathbb{R}^{d})$ and a representative weak gradient $\nabla \Phi$ is in $\mathbb{M}(\mathbb{R}^{d};\mathbb{R}^{d})$. Then for almost all $x,y\in\mathbb{R}^{d}$,
\begin{align*} \left|\Phi(x)-\Phi(y)\right|\le \left(\left|\nabla \Phi\left(\mathbf{0}\right)\right|+\omega_{\nabla\Phi}(1)\left(1+\frac{|x|+|y|}{2}\right)\right)\left|y-x\right|. \end{align*}
Proof. Proposition 4.1 and Lemma 4.4 yield that for almost all
$x,y\in\mathbb{R}^{d}$
,
\begin{align*} |\Phi(x)-\Phi(y)|&= \left|\int_{0}^{1}\left\langle\nabla \Phi\left(x+t(y-x)\right),y-x\right\rangle\text{d} t\right|\\ &\le \int_{0}^{1}\left|\nabla \Phi\left(x+t(y-x)\right)\right|\text{d} t\left|y-x\right|\\ &\le \int_{0}^{1}\left(\left|\nabla \Phi\left(\mathbf{0}\right)\right|+\omega_{\nabla \Phi}\left(1\right)\left(1+\left|(1-t)x+ty\right|\right)\right)\text{d} t\left|y-x\right|\\ &\le \left(\left|\nabla \Phi\left(\mathbf{0}\right)\right|+\omega_{\nabla\Phi}(1)\left(1+\frac{|x|+|y|}{2}\right)\right)\left|y-x\right|.\end{align*}
Hence, the lemma is proved.
Lemma 4.6. (Quadratic growth by gradients with the finite moduli of continuity.) Assume that
$\Phi\in W_{\text{loc}}^{1,\infty}(\mathbb{R}^{d})$
and a representative weak gradient
$\nabla \Phi$
is in
$\mathbb{M}(\mathbb{R}^{d};\mathbb{R}^{d})$
. Then we have that $\left\|\Phi\right\|_{L^{\infty}(B_{1}(\mathbf{0}))}<\infty$ and for almost all $x\in\mathbb{R}^{d}$,
\begin{align*} \Phi\left(x\right)\le \frac{\omega_{\nabla \Phi}(1)}{2}\left|x\right|^{2}+\left(\left|\nabla \Phi\left(\mathbf{0}\right)\right|+\frac{3}{2}\omega_{\nabla \Phi}(1)\right)\left|x\right|+\left\|\Phi\right\|_{L^{\infty}\left(B_{1}\left(\mathbf{0}\right)\right)}. \end{align*}
Moreover, for all $x\in\mathbb{R}^{d}$ and $r\in(0,1]$,
\begin{align*} \bar{\Phi}_{r}\left(x\right)\le \frac{\omega_{\nabla \Phi}(1)}{2}\left|x\right|^{2}+\left(\left|\nabla \Phi\left(\mathbf{0}\right)\right|+2\omega_{\nabla \Phi}(1)\right)\left|x\right|+\left\|\Phi\right\|_{L^{\infty}(B_{1}(\mathbf{0}))} \end{align*}
with $\bar{\Phi}_{r}(x)=(\Phi\ast\rho_{r})(x)$.
Proof. Lemma 4.5 gives that for almost all
$x\in\mathbb{R}^{d}$
and
$y \in B_{1}(\mathbf{0})\cap B_{|x|}(x)$
,
\begin{align*} \Phi\left(x\right)&\le \frac{\omega_{\nabla \Phi}(1)}{2}\left|x\right|\left|x-y\right|+\left(\left|\nabla \Phi\left(\mathbf{0}\right)\right|+\frac{3}{2}\omega_{\nabla \Phi}(1)\right)\left|x-y\right|+\Phi\left(y\right)\\ &\le \frac{\omega_{\nabla \Phi}(1)}{2}\left|x\right|^{2}+\left(\left|\nabla \Phi\left(\mathbf{0}\right)\right|+\frac{3}{2}\omega_{\nabla \Phi}(1)\right)\left|x\right|+\left\|\Phi\right\|_{L^{\infty}\left(B_{1}\left(\mathbf{0}\right)\right)}. \end{align*}
Regarding the second statement, we have that
\begin{align*} \bar{\Phi}_{r}(x)&=\int_{\mathbb{R}^{d}} \Phi\left(x-y\right)\rho_{r}\left(y\right)\text{d} y\\ &=\int_{\mathbb{R}^{d}} \left(\Phi\left(-y\right)+\int_{0}^{1}\left\langle\nabla \Phi\left(-y+tx\right),x\right\rangle \text{d} t\right)\rho_{r}\left(y\right)\text{d} y\\ &\le \int_{\mathbb{R}^{d}} \left(\Phi\left(-y\right)+\int_{0}^{1}\left(\left|\nabla \Phi(\mathbf{0})\right|+\omega_{\nabla \Phi}\left(1\right)\left(1+\left|-y+tx\right|\right)\right)\left|x\right|\text{d} t\right)\rho_{r}\left(y\right)\text{d} y\\ &\le \left\|\Phi\right\|_{L^{\infty}(B_{1}(\mathbf{0}))}+\int_{0}^{1}\left(\left|\nabla \Phi(\mathbf{0})\right|+\omega_{\nabla \Phi}\left(1\right)\left(2+t\left|x\right|\right)\right)\left|x\right|\text{d} t\\ &\le \frac{\omega_{\nabla \Phi}(1)}{2}\left|x\right|^{2}+\left(\left|\nabla \Phi\left(\mathbf{0}\right)\right|+2\omega_{\nabla \Phi}(1)\right)\left|x\right|+\left\|\Phi\right\|_{L^{\infty}(B_{1}(\mathbf{0}))}. \end{align*}
The lemma is proved.
Lemma 4.7. (Smoothness of convolution.) Assume that
$\Phi\in W_{\text{loc}}^{1,\infty}(\mathbb{R}^{d})$
and a representative weak gradient
$\nabla \Phi$
is in
$\mathbb{M}(\mathbb{R}^{d};\mathbb{R}^{d})$
. Then
$\bar{\Phi}_{r}\,:\!=\,(\Phi\ast\rho_{r})\in\mathcal{C}^{2}(\mathbb{R}^{d})$
and
$\nabla^{2}\bar{\Phi}_{r}=(\nabla\Phi\ast\nabla\rho_{r})$
.
Proof. Since
$\Phi$
and
$\nabla \Phi$
are essentially bounded on any compact sets, for some
$\{\varphi_{n}\}\subset\mathcal{C}_{0}^{\infty}(\mathbb{R}^{d})$
approximating
$\rho_{r}\in C_{0}^{1}(\mathbb{R}^{d})\cap W^{1,1}(\mathbb{R}^{d})$
,
$\nabla (\Phi\ast\rho_{r})=\Phi\ast\nabla \rho_{r}=\lim_{n}\Phi\ast\nabla \varphi_{n}=\lim_{n}\nabla\Phi\ast\varphi_{n}=\nabla\Phi\ast\rho_{r}$
and thus
$\bar{\Phi}_{r}\in\mathcal{C}^{2}(\mathbb{R}^{d})$
with
$\nabla^{2}\bar{\Phi}_{r}=(\nabla\Phi\ast\nabla\rho_{r})$
.
Lemma 4.8. (Bounded gradients of convolution.) For all $\phi\in\mathbb{M}(\mathbb{R}^{d};\mathbb{R}^{\ell})$, $r>0$, and $x\in\mathbb{R}^{d}$, we have that
\begin{align*} \left\|\nabla \left(\bar{\phi}_{r}\right)(x)\right\|_{2}\le \frac{(d+2)\omega_{\phi}(r)}{r}, \end{align*}
where $\bar{\phi}_{r}(x)=(\phi\ast\rho_{r})(x)$.
Proof. We obtain
\begin{align*} \nabla \left(\bar{\phi}_{r}\right)(x)=\int_{\mathbb{R}^{d}}\phi\left(y\right)\left(\nabla \rho_{r}\right)\left(x-y\right)\text{d} y=\int_{\mathbb{R}^{d}}\left(\phi\left(y\right)-\phi\left(x\right)\right)\left(\nabla \rho_{r}\right)\left(x-y\right)\text{d} y \end{align*}
by using $\int\nabla \rho_{r}(x)\text{d} x=0$, and thus
\begin{align*} \left\|\nabla \left(\bar{\phi}_{r}\right)(x)\right\|_{2}&\le \int_{\mathbb{R}^{d}}\left\|\left(\phi\left(y\right)-\phi\left(x\right)\right)\left(\nabla \rho_{r}\right)\left(x-y\right)\right\|_{2}\text{d} y\\ &= \int_{\mathbb{R}^{d}}\left|\phi\left(y\right)-\phi\left(x\right)\right|\left|\nabla \rho_{r}\left(x-y\right)\right|\text{d} y\le \omega_{\phi}(r) \int_{\mathbb{R}^{d}}\left|\nabla \rho_{r}\left(y\right)\right|\text{d} y\\ &= \omega_{\phi}(r)\int_{\mathbb{R}^{d}}\left| \frac{1}{r^{d+1}}\nabla\rho\left(\frac{y}{r}\right)\right|\text{d} y = \omega_{\phi}(r)\int_{\mathbb{R}^{d}}\left|\frac{1}{r^{d+1}}\nabla \rho\left(z\right)\right|r^{d}\text{d} z\\ &\le \frac{(d+2)\omega_{\phi}(r)}{r}\end{align*}
by the change of variables
$z=y/r$
with
$r^{d}\text{d} z=\text{d} y$
and Lemma 4.1.
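Lemmas 4.7–4.9 together describe the trade-off behind mollification: $\bar{\phi}_{r}$ is smooth with a gradient of order $\omega_{\phi}(r)/r$, while remaining within $\omega_{\phi}(r)$ of $\phi$ in the uniform norm. A minimal one-dimensional Python sketch of this trade-off follows; the bump function below is an assumed stand-in for the mollifier $\rho$ (the paper fixes a specific $\rho$ elsewhere), so the constants are illustrative only.
\begin{verbatim}
import numpy as np

def rho(z):
    # A compactly supported C^1 bump on (-1, 1), used here as a stand-in mollifier.
    out = np.zeros_like(z)
    inside = np.abs(z) < 1.0
    out[inside] = np.exp(-1.0 / (1.0 - z[inside] ** 2))
    return out

ys = np.linspace(-1.0, 1.0, 4001)
dy = ys[1] - ys[0]
Z = rho(ys).sum() * dy                         # normalizing constant of rho

def mollified(phi, xs, r):
    # Evaluate (phi * rho_r)(x) with rho_r(y) = rho(y / r) / (Z * r).
    yr = r * ys                                # support of rho_r is [-r, r]
    w = rho(ys) / Z * dy                       # quadrature weights, summing to 1
    return np.array([(phi(x - yr) * w).sum() for x in xs])

step = lambda x: (x >= 0).astype(float)        # omega_step(r) <= 1 for every r
xs = np.linspace(-2.0, 2.0, 801)
for r in (0.5, 0.25, 0.125):
    sm = mollified(step, xs, r)
    grad = np.gradient(sm, xs)
    print(r, float(np.abs(grad).max()), "vs O(1/r) =", 1.0 / r)
\end{verbatim}
Halving r roughly doubles the maximal gradient while the mollified step stays within $\omega_{\text{step}}(r)\le 1$ of the original, matching the rates in Lemmas 4.8 and 4.9.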
Lemma 4.9. (1-Lipschitz mapping to
$\ell^{\infty}$
.) For all $\phi\in\mathbb{M}(\mathbb{R}^{d};\mathbb{R}^{\ell})$ and $r>0$,
\begin{align*} \sup_{x\in\mathbb{R}^{d}}\left|\phi\ast\rho_{r}(x)-\phi(x)\right|\le \omega_{\phi}(r). \end{align*}
Proof. Since
$\int\rho_{r}(x)\text{d} x=1$
, for all
$x\in\mathbb{R}^{d}$
,
\begin{align*} \left|\phi\ast\rho_{r}(x)-\phi(x)\right|&=\left|\int_{\mathbb{R}^{d}}\phi(y)\rho_{r}(x-y)\text{d} y-\phi(x)\right|\\ &=\left|\int_{\mathbb{R}^{d}}\phi(y)\rho_{r}(x-y)\text{d} y-\int_{\mathbb{R}^{d}}\phi(x)\rho_{r}(x-y)\text{d} y\right|\\ &=\left|\int_{\mathbb{R}^{d}}\left(\phi(y)-\phi(x)\right)\rho_{r}(x-y)\text{d} y\right|\\ &\le \int_{\mathbb{R}^{d}}\left|\phi(y)-\phi(x)\right|\rho_{r}(x-y)\text{d} y\\ &\le \omega_{\phi}(r). \end{align*}
This is the desired conclusion.
Lemma 4.10. (Essential supremum of deviations by convolution.) Assume that
$\Phi\in W_{\text{loc}}^{1,\infty}(\mathbb{R}^{d})$
and a representative weak gradient
$\nabla \Phi$
is in
$\mathbb{M}(\mathbb{R}^{d};\mathbb{R}^{d})$
. For all $r>0$,
\begin{align*} \mathop{\mathrm{ess\,sup}}_{x\in\mathbb{R}^{d}}\left|\bar{\Phi}_{r}(x)-\Phi(x)\right|\le r\omega_{\nabla \Phi}(r) \end{align*}
with $\bar{\Phi}_{r}(x)\,:\!=\,(\Phi\ast\rho_{r})(x)$.
Proof. By Proposition 4.1 and
$\int_{\mathbb{R}^{d}}\langle y,z\rangle \rho_{r}(y)\text{d} y=0$
for any
$z\in\mathbb{R}^{d}$
, for almost all
$x\in\mathbb{R}^{d}$
,
\begin{align*} \left|\bar{\Phi}_{r}(x)-\Phi(x)\right|&=\left|\int_{\mathbb{R}^{d}}\left(\Phi(x-y)-\Phi(x)\right)\rho_{r}(y)\text{d} y\right|\\ &=\left|\int_{\mathbb{R}^{d}}\left(\int_{0}^{1}\left\langle \nabla \Phi(x-ty),y\right\rangle\text{d} t\right)\rho_{r}(y)\text{d} y\right|\\ &=\left|\int_{\mathbb{R}^{d}}\left(\int_{0}^{1}\left\langle \nabla \Phi(x-ty)-\nabla \Phi(x),y\right\rangle\text{d} t\right)\rho_{r}(y)\text{d} y\right|\\ &\le \omega_{\nabla \Phi}(r)\int_{\mathbb{R}^{d}}|y|\rho_{r}(y)\text{d} y\\ &\le r\omega_{\nabla \Phi}(r) \end{align*}
and thus the statement holds.
4.4. Liptser–Shiryaev-type condition for change of measures
We show the existence of explicit likelihood ratios of diffusion-type processes based on Theorem 7.19 and Lemma 7.6 of [Reference Liptser and Shiryaev24]. We fix
$T>0$
throughout this section. Let
$(W_{T},\mathcal{W}_{T})$
be a measurable space of
$\mathbb{R}^{d}$
-valued continuous functions
$w_{t}$
with
$t\in[0,T]$
and
$\mathcal{W}_{T}=\sigma(w_{s} \, : \, w\in W_{T},s\le T)$
. We also use the notation
$\mathcal{W}_{t}=\sigma(w_{s} \, : \, w\in W_{T}, s\le t)$
for
$t\in[0,T]$
. Let
$(\Omega, \mathcal{F}, \mu)$
be a complete probability space and
$(\tilde{\Omega},\tilde{\mathcal{F}},\tilde{\mu})$
be an identical copy of it. We assume that the filtration
$\{\mathcal{F}_{t}\}_{t\in[0,T]}$
satisfies the usual conditions. Let
$(B_{t},\mathcal{F}_{t})$
with
$t\in[0,T]$
be a d-dimensional Brownian motion and
$\xi$
be an
$\mathcal{F}_{0}$
-measurable d-dimensional random vector such that
$|\xi|<\infty $
$\mu$
-almost surely. We set
$\{a_{t}\}_{t\in[0,T]}$
, an
$\mathcal{F}_{t}$
-adapted random process such that its trajectory
$\{a_{s}(\omega)\}_{s\in[0,t]}$
with
$\omega\in\Omega$
for each
$t\in[0,T]$
is in a measurable space
$(A_{t},\mathcal{A}_{t})$
. Assume that
$a=\{a_{t}\}_{t\in[0,T]}$
,
$B=\{B_{t}\}_{t\in[0,T]}$
, and
$\xi$
are independent of each other.
Let $\mu_{a},\mu_{B}$
, and
$\mu_{\xi}$
denote the probability measures induced by a, B, and
$\xi$
on
$(A_{T},\mathcal{A}_{T})$
,
$(W_{T},\mathcal{W}_{T})$
, and
$(\mathbb{R}^{d},\mathcal{B}(\mathbb{R}^{d}))$
, respectively.
Consider the solutions
$X^{P}=\{X_{t}^{P}\}_{t\in[0,T]}$
and
$X^{Q}=\{X_{t}^{Q}\}_{t\in[0,T]}$
of the following SDEs:
\begin{align*} \text{d} X_{t}^{P}&=b^{P}\left(t,a,X^{P}\right)\text{d} t+\sqrt{2\beta^{-1}}\text{d} B_{t},\quad X_{0}^{P}=\xi, \tag{7}\\ \text{d} X_{t}^{Q}&=b^{Q}\left(X_{t}^{Q}\right)\text{d} t+\sqrt{2\beta^{-1}}\text{d} B_{t},\quad X_{0}^{Q}=\xi. \tag{8} \end{align*}
We set the following assumptions, partially adapted from [Reference Liptser and Shiryaev24] but containing some differences in
$\xi$
and the structure of
$X^{Q}$
.
-
(LS1)
$X_{t}^{P}$
is a strong solution of Eq. (7), that is, there exists a measurable functional
$F_{t}$
for each t such that
\begin{align*} X_{t}^{P}(\omega)=F_{t}(a(\omega),B(\omega),\xi(\omega)) \end{align*}
$\mu$
-almost surely.
-
(LS2)
$b^{P}$
is non-anticipative, that is,
$\mathcal{A}_{t}\times \mathcal{W}_{t}$
-measurable for each
$t\in[0,T]$
, and for fixed
$a\in A_{T}$
and
$w\in W_{T}$
,
\begin{align*} \int_{0}^{T}\left|b^{P}(t,a,w)\right|\text{d} t<\infty. \end{align*}
-
(LS3)
$b^{Q} \, : \, \mathbb{R}^{d}\to\mathbb{R}^{d}$
is Lipschitz continuous, so that
$X^{Q}$
is the unique strong solution of Eq. (8). -
(LS4) We have that
\begin{align*} &\mu\left(\int_{0}^{T}\left(\left|b^{P}\left(t,a,X^{P}\right)\right|^{2}+\left|b^{Q}\left(X_{t}^{P}\right)\right|^{2}\right)\text{d} t<\infty\right)\\ &\quad=\mu\left(\int_{0}^{T}\left(\left|b^{P}\left(t,a,X^{Q}\right)\right|^{2}+\left|b^{Q}\left(X_{t}^{Q}\right)\right|^{2}\right)\text{d} t<\infty\right)=1. \end{align*}
We consider a variant of (7) with fixed
$a\in A_{T}$
:
\begin{align*} \text{d} X_{t}^{P|a}=b^{P}\left(t,a,X^{P|a}\right)\text{d} t+\sqrt{2\beta^{-1}}\text{d} B_{t},\quad X_{0}^{P|a}=\xi. \tag{9} \end{align*}
Then assumption (LS1) yields that $X_{t}^{P|a}(\omega)=F_{t}(a,B(\omega),\xi(\omega))$ holds $\mu_{a}\times \mu$-almost surely. We assume that
$\Omega=A_{T}\times W_{T}\times \mathbb{R}^{d}$
,
$\mathcal{F}=\mathcal{A}_{T}\times \mathcal{W}_{T}\times\mathcal{B}(\mathbb{R}^{d})$
, and
$\mu=\mu_{a}\times\mu_{B}\times\mu_{\xi}$
without loss of generality. Then each
$\omega\in\Omega$
has the form
$\omega=(a,B,\xi)$
and we can assume that a, B, and
$\xi$
are projections such as
$a(\omega)=a$
,
$B(\omega)=B$
, and
$\xi(\omega)=\xi$
; therefore, Eq. (9) holds
$\mu_{a}\times\mu_{B}\times\mu_{\xi}$
-almost surely.
We consider a process on the product space
$(\Omega\times\tilde{\Omega},\mathcal{F}\times\tilde{\mathcal{F}},\mu\times \tilde{\mu})$
:
\begin{align*} \tilde{X}_{t}^{P}\left(\omega,\tilde{\omega}\right)\,:\!=\,F_{t}\left(a(\omega),\tilde{B}(\tilde{\omega}),\tilde{\xi}(\tilde{\omega})\right),\qquad t\in[0,T]. \end{align*}
Assumption (LS1) gives that $\tilde{X}^{P}$ solves Eq. (9) driven by $\tilde{B}$ with initial value $\tilde{\xi}$, $\mu\times\tilde{\mu}$-almost surely.
Lemma 4.11. Under (LS1), for any
$C\in\mathcal{W}_{T}$,
\begin{align*} \mu\left(\left.X^{P}\in C\right|\sigma(a)\right)=\tilde{\mu}\left(X^{P}\left(a,\tilde{B},\tilde{\xi}\right)\in C\right) \end{align*}
$\mu$-almost surely.
Proof. The proof is essentially identical to that of Lemma 7.5 of [Reference Liptser and Shiryaev24] except for the randomness of
$\xi$
. We first show that for fixed $t\in[0,T]$ and $C_{t}\in\mathcal{B}(\mathbb{R}^{d})$,
\begin{align*} \mu\left(\left.F_{t}\left(a,B,\xi\right)\in C_{t}\right|\sigma(a)\right)=\tilde{\mu}\left(F_{t}\left(a,\tilde{B},\tilde{\xi}\right)\in C_{t}\right) \end{align*}
$\mu$-almost surely. Note that the following probability for fixed a is $\mathcal{A}_{T}$-measurable owing to (LS1) and Fubini’s theorem:
\begin{align*} a\mapsto\left(\mu_{B}\times \mu_{\xi}\right)\left(F_{t}\left(a,B,\xi\right)\in C_{t}\right)=\int_{W_{T}}\int_{\mathbb{R}^{d}}\mathbb{I}_{F_{t}(a,w,x)\in C_{t}}\mu_{B}(\text{d} w)\mu_{\xi}(\text{d} x). \end{align*}
Let
$f(a(\omega))$
be a
$\sigma(a)$
-measurable bounded random variable. Again Fubini’s theorem gives that
\begin{align*} E\left[f(a(\omega))\mathbb{I}_{F_{t}(a,B,\xi)\in C_{t}}\right]&=\int_{A_{T}}\int_{W_{T}}\int_{\mathbb{R}^{d}}f(a)\mathbb{I}_{F_{t}(a,w,x)\in C_{t}}\mu_{a}(\text{d} a)\mu_{B}(\text{d} w)\mu_{\xi}(\text{d} x)\\ &=\int_{A_{T}}f(a)\left(\int_{W_{T}}\int_{\mathbb{R}^{d}}\mathbb{I}_{F_{t}(a,w,x)\in C_{t}}\mu_{B}(\text{d} w)\mu_{\xi}(\text{d} x)\right)\mu_{a}(\text{d} a)\\ &=\int_{A_{T}}f(a)\left(\mu_{B}\times \mu_{\xi}\right)\left(F_{t}(a,B,\xi)\in C_{t}\right)\mu_{a}(\text{d} a)\\ &=\int_{A_{T}}f(a)\tilde{\mu}\left(F_{t}\left(a,\tilde{B},\tilde{\xi}\right)\in C_{t}\right)\mu_{a}(\text{d} a)\\ &=E\left[f(a)\tilde{\mu}\left(F_{t}\left(a,\tilde{B},\tilde{\xi}\right)\in C_{t}\right)\right] \end{align*}
and thus the definition of conditional expectation yields the result. Similarly, we obtain that for all
$n\in\mathbb{N}$
,
$0\le t_{1} < \cdots < t_{n}\le T$
, and
$C_{t_{i}}\in \mathcal{B}(\mathbb{R}^{d})$
,
$i=1,\ldots,n$
,
\begin{align*} &\mu\left(F_{t_{1}}(a,B,\xi)\in C_{t_{1}},\ldots,F_{t_{n}}(a,B,\xi)\in C_{t_{n}}|\sigma(a)\right)\\ &\quad=\tilde{\mu}\left(F_{t_{1}}\left(a,\tilde{B},\tilde{\xi}\right)\in C_{t_{1}},\ldots,F_{t_{n}}\left(a,\tilde{B},\tilde{\xi}\right)\in C_{t_{n}}\right). \end{align*}
Therefore, the statement holds true.
Let
$P_{T}$
and
$Q_{T}$
denote the laws of
$\{(a_{t},X_{t}^{P}) \, : \, t\in[0,T]\}$
and
$\{(a_{t},X_{t}^{Q}) \, : \, t\in[0,T]\}$
. Note that
$a_{t}$
and
$X_{t}^{Q}$
are independent of each other by the assumptions. The following proposition gives the equivalence and the representation of the likelihood ratio.
Proposition 4.2. Under (LS1)–(LS4), we have that
\begin{align*} &\frac{\text{d} Q_{T}}{\text{d} P_{T}}\left(a,X^{P}\right)\\ &\quad=\exp\left(-\sqrt{\frac{\beta}{2}}\int_{0}^{T}\left\langle \left(b^{P}-b^{Q}\right)\left(t,a,X^{P}\right),\text{d} B_{t}\right\rangle -\frac{\beta}{4}\int_{0}^{T}\left|\left(b^{P}-b^{Q}\right)\left(t,a,X^{P}\right)\right|^{2}\text{d} t\right). \end{align*}
Proof. The proof is quite parallel to that of Lemma 7.6 of [Reference Liptser and Shiryaev24]. For arbitrary set
$\Gamma=\Gamma_{1}\times \Gamma_{2}$
,
$\Gamma_{1}\in\mathcal{A}_{T}$
and
$\Gamma_{2}\in\mathcal{W}_{T}$
, by Lemma 4.11,
\begin{align*} \mu\left(\left(a,X^{P}\right)\in\Gamma\right)&=\int_{A_{T}\times W_{T}\times \mathbb{R}^{d}}\mathbb{I}_{a\in \Gamma_{1}}\mathbb{I}_{X^{P}(a,w,x)\in\Gamma_{2}}\mu_{a}(\text{d} a)\mu_{B}\left(\text{d} w\right)\mu_{\xi}\left(\text{d} x\right)\\ &=\int_{a\in \Gamma_{1}}\mu\left(X^{P}(a,B,\xi)\in \Gamma_{2}|\sigma(a)\right)\mu_{a}(\text{d} a)\\ &=\int_{a\in \Gamma_{1}}\tilde{\mu}\left(X^{P}\left(a,\tilde{B},\tilde{\xi}\right)\in \Gamma_{2}\right)\mu_{a}(\text{d} a)\\ &=\int_{a\in \Gamma_{1}}\left(P|a\right)_{T}(\Gamma_{2})\mu_{a}(\text{d} a),\end{align*}
where
$(P|a)_{T}$
is the law of (9). Let
$(Q|a)_{T}$
denote the law of
$X^{Q}$
. For
$\mu_{a}$
-almost all a, under (LS1)–(LS4) and Theorem 7.19 of [Reference Liptser and Shiryaev24],
$(P|a)_{T}\sim (Q|a)_{T}$
and the likelihood ratio is given as
\begin{align*} &\frac{\text{d} (P|a)_{T}}{\text{d} (Q|a)_{T}}\left(X^{Q}\right)\\ &\quad=\exp\left(\sqrt{\frac{\beta}{2}}\int_{0}^{T}\left\langle \left(b^{P}-b^{Q}\right)\left(t,a,X^{Q}\right),\text{d} B_{t}\right\rangle -\frac{\beta}{4}\int_{0}^{T}\left|\left(b^{P}-b^{Q}\right)\left(t,a,X^{Q}\right)\right|^{2}\text{d} t\right).\end{align*}
Therefore, we have
\begin{align*} \mu\left(\left(a,X^{P}\right)\in\Gamma\right)&=\int_{\Gamma_{1}}\int_{\Gamma_{2}}\left(\frac{\text{d} (P|a)_{T}}{\text{d} (Q|a)_{T}}(w)(Q|a)_{T}(\text{d} w)\right)\mu_{a}(\text{d} a)\\ &=\int_{\Gamma_{1}}\int_{\Gamma_{2}}\frac{\text{d} (P|a)_{T}}{\text{d} (Q|a)_{T}}(w)\left(\mu_{a}\times (Q|a)_{T}\right)(\text{d} a \text{d} w)\\ &=\int_{\Gamma}\frac{\text{d} (P|a)_{T}}{\text{d} (Q|a)_{T}}(w)Q_{T}(\text{d} a \text{d} w).\end{align*}
Since
$Q_{T}(a,w \, : \, (\text{d} (P|a)_{T})/(\text{d} (Q|a)_{T})(w)=0)=0$
, Lemma 6.8 of [Reference Liptser and Shiryaev24] yields the desired conclusion.
We obtain the following result.
Proposition 4.3. (Kullback–Leibler divergence.) Under (LS1)–(LS4) and the assumption
\begin{align*} E\left[\int_{0}^{T}\left|\left(b^{P}-b^{Q}\right)\left(t,a,X^{P}\right)\right|^{2}\text{d} t\right]<\infty, \end{align*}
we obtain
\begin{align*} D\left(P_{T}\left\|Q_{T}\right.\right)=\frac{\beta}{4}E\left[\int_{0}^{T}\left|\left(b^{P}-b^{Q}\right)\left(t,a,X^{P}\right)\right|^{2}\text{d} t\right]. \end{align*}
Proof. Using Proposition 4.2, we obtain
\begin{align*} &D\left(P_{T}\left\|Q_{T}\right.\right)\\ &\quad=E\left[\log\left(\frac{\text{d} P_{T}}{\text{d} Q_{T}}\right)(a,X^{P})\right]\\ &\quad=E\left[\frac{\beta}{4}\int_{0}^{T}\left|\left(b^{P}-b^{Q}\right)\left(s,a,X^{P}\right)\right|^{2}\text{d} s+\sqrt{\frac{\beta}{2}}\int_{0}^{T}\left\langle \left(b^{P}-b^{Q}\right)\left(s,a,X^{P}\right),\text{d} B_{s}\right\rangle\right]\\ &\quad=\frac{\beta}{4}E\left[\int_{0}^{T}\left|\left(b^{P}-b^{Q}\right)\left(s,a,X^{P}\right)\right|^{2}\text{d} s\right], \end{align*}
since the local martingale term is a martingale by the assumption. Hence, the proposition holds.
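Proposition 4.3 turns the Kullback–Leibler divergence between the two path laws into an expected squared drift difference along $X^{P}$, which is straightforward to estimate by simulation. The following Python sketch is illustrative only: it assumes Markovian drifts, takes two Ornstein–Uhlenbeck processes, and replaces the exact solution by an Euler–Maruyama discretization.
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(1)

def kl_estimate(bP, bQ, x0, beta, T, n_steps, n_paths):
    # Monte Carlo estimate of D(P_T || Q_T)
    #   = (beta / 4) * E int_0^T |(bP - bQ)(X^P_t)|^2 dt,
    # with X^P simulated by Euler-Maruyama.
    dt = T / n_steps
    total = 0.0
    for _ in range(n_paths):
        x = np.array(x0, dtype=float)
        acc = 0.0
        for _ in range(n_steps):
            acc += float(np.sum((bP(x) - bQ(x)) ** 2)) * dt
            x = x + bP(x) * dt + np.sqrt(2.0 * dt / beta) * rng.standard_normal(x.size)
        total += acc
    return 0.25 * beta * total / n_paths

# Two Ornstein-Uhlenbeck drifts as a toy example.
print(kl_estimate(lambda x: -x, lambda x: -2.0 * x, np.zeros(1),
                  beta=1.0, T=1.0, n_steps=200, n_paths=200))
\end{verbatim}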
4.5. Poincaré inequalities
Let us consider Poincaré inequalities for a probability measure
$P_{\Phi}$
whose density is
$\left(\int \mathrm{e}^{-\Phi(x)}\text{d} x\right)^{-1}\mathrm{e}^{-\Phi(x)}$
with lower-bounded
$\Phi\in\mathcal{C}^{2}(\mathbb{R}^{d})$
such that
$\int \mathrm{e}^{-\Phi(x)}\text{d} x<\infty$
. Let
$L\,:\!=\,\Delta -\langle \nabla \Phi,\nabla \rangle$
, which is
$P_{\Phi}$
-symmetric,
$P_{t}$
be the Markov semigroup with the infinitesimal generator L, and
$\mathcal{E}$
denote the Dirichlet form
\begin{align*} \mathcal{E}(g)\,:\!=\,\lim_{t\downarrow 0}\frac{1}{t}\int_{\mathbb{R}^{d}} g\left(g-P_{t}g\right)\text{d} P_{\Phi}, \end{align*}
where
$g\in L^{2}(P_{\Phi})$
such that the limit exists. Here, we say that a probability measure
$P_{\Phi}$
satisfies a Poincaré inequality with constant
$c_\textrm{P}(P_{\Phi})$
(the Poincaré constant) if for any
$Q\ll P_{\Phi}$
,
\begin{align*} \chi^{2}\left(Q\|P_{\Phi}\right)\le c_\textrm{P}(P_{\Phi})\mathcal{E}\left(\sqrt{\frac{\text{d} Q}{\text{d} P_{\Phi}}}\right).\end{align*}
We adopt the following statement from [Reference Raginsky, Rakhlin and Telgarsky31]; although it is different from the original discussion of [Reference Bakry, Barthe, Cattiaux and Guillin4], the difference is negligible because Eq. (2.3) of [Reference Bakry, Barthe, Cattiaux and Guillin4] yields the same upper bound.
Proposition 4.4. (Bakry et al. [Reference Bakry, Barthe, Cattiaux and Guillin4].) Assume that there exists a Lyapunov function
$V\in\mathcal{C}^{2}(\mathbb{R}^{d})$
with
$V:\mathbb{R}^{d}\to[1,\infty)$
such that
\begin{align*} LV(x)\le \left(-\lambda_{0}\mathbb{I}_{\left\{\left|x\right|>\tilde{R}\right\}}(x)+\kappa_{0}\mathbb{I}_{\left\{\left|x\right|\le \tilde{R}\right\}}(x)\right)V(x)\quad \text{for all } x\in\mathbb{R}^{d}, \end{align*}
for some
$\lambda_{0}>0$
,
$\kappa_{0}\ge 0$
, and
$\tilde{R}>0$
, where
$LV(x)=\Delta V-\langle \nabla \Phi,\nabla V\rangle$
. Then
$P_{\Phi}$
satisfies a Poincaré inequality with constant
$c_\textrm{P}(P_{\Phi})$
such that
\begin{align*} c_\textrm{P}(P_{\Phi})\le \frac{1}{\lambda_{0}}\left(1+a\kappa_{0}\tilde{R}^{2}\mathrm{e}^{\mathrm{Osc}_{\tilde{R}}}\right), \end{align*}
where
$a>0$
is an absolute constant and
$\mathrm{Osc}_{\tilde{R}}\,:\!=\,\max_{x:|x|\le \tilde{R}}\Phi(x)-\min_{x:|x|\le \tilde{R}}\Phi(x)$
.
The next proposition shows convergence in $\mathcal{W}_{2}$ in terms of the $\chi^{2}$-divergence, using the recent study [Reference Liu25].
Proposition 4.5. (Lehec [Reference Lehec21], Lemma 9.) Assume that
$P_{\Phi}$
satisfies Poincaré inequalities with constant
$c_\textrm{P}(P_{\Phi})$
and
$\nabla \Phi$
is at most of linear growth. Then for any probability measure
$\nu$
on
$(\mathbb{R}^{d},\mathcal{B}(\mathbb{R}^{d}))$
with
$\nu\ll P_{\Phi}$
and every
$t>0$
,
\begin{align*} \mathcal{W}_{2}\left(\nu P_{t},P_{\Phi}\right)\le \sqrt{2c_\textrm{P}(P_{\Phi})\chi^{2}\left(\nu\|P_{\Phi}\right)}\exp\left(-\frac{t}{2c_\textrm{P}(P_{\Phi})}\right), \end{align*}
where $\nu P_{t}$ is the law of the unique weak solution $Z_{t}$ of the SDE
\begin{align*} \text{d} Z_{t}=-\nabla \Phi\left(Z_{t}\right)\text{d} t+\sqrt{2}\,\text{d} B_{t},\qquad Z_{0}\sim \nu. \end{align*}
4.6. A bound for the 2-Wasserstein distance by KL divergence
The next proposition is an immediate result by [Reference Bolley and Villani7].
Proposition 4.6. (Bolley and Villani [Reference Bolley and Villani7].) Let
$\mu,\nu$
be probability measures on
$(\mathbb{R}^{d},\mathcal{B}(\mathbb{R}^{d}))$
. Assume that there exists a constant
$\lambda>0$
such that
$\int \exp(\lambda |x|^{2})\nu(\text{d} x)<\infty$
. Then for any
$\mu$
,
\begin{align*} \mathcal{W}_{2}\left(\mu,\nu\right)\le C_{\nu}\left(D(\mu\|\nu)^{1/2}+\left(\frac{D(\mu\|\nu)}{2}\right)^{1/4}\right), \end{align*}
where
\begin{align*} C_{\nu}\,:\!=\,2\inf_{\lambda>0}\left(\frac{1}{\lambda}\left(\frac{3}{2}+\log\int_{\mathbb{R}^{d}}\mathrm{e}^{\lambda\left|x\right|^{2}}\nu\left(\text{d} x\right)\right)\right)^{1/2}. \end{align*}
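Evaluating $C_{\nu}$ amounts to a one-dimensional optimization over $\lambda$ once the exponential moment is available. As an illustration (not part of the paper), for a Gaussian $\nu=\mathcal{N}(\mathbf{0},\sigma^{2}I_{d})$ we have $\log\int\mathrm{e}^{\lambda|x|^{2}}\nu(\text{d} x)=-\frac{d}{2}\log(1-2\lambda\sigma^{2})$ for $\lambda<1/(2\sigma^{2})$, and $C_{\nu}$ can be computed numerically:
\begin{verbatim}
import numpy as np
from scipy.optimize import minimize_scalar

def c_nu_gaussian(sigma2, d):
    # C_nu = 2 inf_{lambda > 0} sqrt((3/2 + log E[exp(lambda |x|^2)]) / lambda)
    # for nu = N(0, sigma2 * I_d).
    def objective(lam):
        log_mgf = -0.5 * d * np.log(1.0 - 2.0 * lam * sigma2)
        return np.sqrt((1.5 + log_mgf) / lam)
    upper = 0.5 / sigma2
    res = minimize_scalar(objective, bounds=(1e-8, upper - 1e-8), method="bounded")
    return 2.0 * res.fun

print(c_nu_gaussian(sigma2=1.0, d=2))
\end{verbatim}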
5. Proof of the main theorem
In this section, we use the notation
$\|\nabla U\|_{\mathbb{M}}\,:\!=\,|\nabla U(\mathbf{0})|+\omega_{\nabla U}(1)$
and
$\|\widetilde{G}\|_{\mathbb{M}}\,:\!=\,|\widetilde{G}(\mathbf{0})|+\omega_{\widetilde{G}}(1)$
under (A3). We denote by
$\bar{X}_{t}^{r}$
the unique strong solution of the following SDE under (A3) (Lemma 4.8 gives the existence and uniqueness):
\begin{align*} \text{d} \bar{X}_{t}^{r}=-\nabla \bar{U}_{r}\left(\bar{X}_{t}^{r}\right)\text{d} t+\sqrt{2\beta^{-1}}\text{d} B_{t},\quad \bar{X}_{0}^{r}=\xi, \tag{10} \end{align*}
and
$\bar{\nu}_{t}^{r}$
represents the probability measure of
$\bar{X}_{t}^{r}$
. We use the notation
$\pi$
and
$\bar{\pi}^{r}$
, probability measures on
$(\mathbb{R}^{d},\mathcal{B}(\mathbb{R}^{d}))$
, as
\begin{align*} \pi\left(\text{d} x\right)=\frac{\mathrm{e}^{-\beta U\left(x\right)}}{\mathcal{Z}(\beta)}\text{d} x,\qquad \bar{\pi}^{r}\left(\text{d} x\right)=\frac{\mathrm{e}^{-\beta \bar{U}_{r}\left(x\right)}}{\bar{\mathcal{Z}}^{r}(\beta)}\text{d} x, \end{align*}
where
$\mathcal{Z}(\beta)=\int\exp(\! - \! \beta U(x))\text{d} x$
and
$\bar{\mathcal{Z}}^{r}(\beta)=\int\exp(\! - \! \beta \bar{U}_{r}(x))\text{d} x$
. Note that
$\bar{U}_{r}$
is
$(\bar{m},\bar{b})$
-dissipative with
$\bar{m}\,:\!=\,m,\bar{b}\,:\!=\,b+\omega_{\nabla U}(1)$
as
\begin{align*} \left\langle x,\nabla \bar{U}_{r}\left(x\right)\right\rangle&=\int_{\mathbb{R}^{d}}\left\langle x,\nabla U\left(x-y\right)\right\rangle \rho_{r}(y)\text{d} y\\ &= \int_{\mathbb{R}^{d}}\left\langle x-y,\nabla U\left(x-y\right)\right\rangle \rho_{r}(y)\text{d} y\\ &\quad+\int_{\mathbb{R}^{d}}\left\langle y,\nabla U\left(x-y\right)-\nabla U\left(x\right)\right\rangle \rho_{r}(y)\text{d} y\\ &\ge \int_{\mathbb{R}^{d}}\left(m\left| x-y\right|^{2}-b\right) \rho_{r}(y)\text{d} y-\omega_{\nabla U}(r) \int_{\mathbb{R}^{d}}\left| y\right|\rho_{r}(y)\text{d} y\\ &\ge m\left| x\right|^{2}-b+m\int_{\mathbb{R}^{d}}\left|y\right|^{2} \rho_{r}(y)\text{d} y-r\omega_{\nabla U}(r)\\ &\ge m|x|^{2}-(b+\omega_{\nabla U}(1))\end{align*}
owing to
$r\le 1$
and
$\int_{\mathbb{R}^{d}}\langle y,z\rangle \rho_{r}(y)\text{d} y=0$
for each
$z\in\mathbb{R}^{d}$
.
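Before turning to the moment bounds, we record a minimal implementation sketch of the SG-LMC recursion $Y_{(k+1)\eta}=Y_{k\eta}-\eta G(Y_{k\eta},a_{k\eta})+\sqrt{2\beta^{-1}}(B_{(k+1)\eta}-B_{k\eta})$ analysed below. The quadratic potential and the additive-noise gradient estimate are illustrative assumptions standing in for $G$ and are not the paper's setting.
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)

def sg_lmc(grad_est, y0, eta, beta, n_steps):
    # One SG-LMC trajectory:
    #   Y_{(k+1)eta} = Y_{k eta} - eta * G(Y, a) + sqrt(2 eta / beta) * N(0, I_d).
    y = np.array(y0, dtype=float)
    d = y.size
    for _ in range(n_steps):
        y = y - eta * grad_est(y) + np.sqrt(2.0 * eta / beta) * rng.standard_normal(d)
    return y

# U(x) = |x|^2 / 2 with an unbiased noisy gradient as a stand-in for G(x, a).
noisy_grad = lambda x: x + 0.1 * rng.standard_normal(x.size)

samples = np.stack([sg_lmc(noisy_grad, np.zeros(2), eta=0.01, beta=1.0, n_steps=2000)
                    for _ in range(500)])
print(np.mean(np.sum(samples ** 2, axis=1)))   # approximately d / beta = 2 here
\end{verbatim}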
5.1. Moments of SG-LMC algorithms
Lemma 5.1. (Uniform
$L^{2}$
moments.) Assume that (A1)–(A6) hold.
-
(1) For all
$k\in\mathbb{N}$
and
$0<\eta\le 1\wedge (\tilde{m}/2((\omega_{\widetilde{G}}(1))^{2}+\delta_{\mathbf{v},2}))$
,
$Y_{k\eta},G(Y_{k\eta},a_{k\eta})\in L^{2}$
. Moreover,
\begin{align*} \sup_{k\ge 0}E\left[\left|Y_{k\eta}\right|^{2}\right]&\le \kappa_{0}+2\left(1\vee \frac{1}{\tilde{m}}\right)\left(\tilde{b}+\left\|\widetilde{G}\right\|_{\mathbb{M}}^{2}+\delta_{\mathbf{v},0}+\frac{d}{\beta}\right)= \! : \, \kappa_{\infty}. \end{align*}
-
(2) For any
$t\ge0$
and
$r\in(0,1]$
,
\begin{align*} E\left[\left|\bar{X}_{t}^{r}\right|^{2}\right]&\le \kappa_{0}\mathrm{e}^{-2\bar{m}t}+\frac{\bar{b}+d/\beta}{\bar{m}}\left(1-\mathrm{e}^{-2\bar{m}t}\right). \end{align*}
Proof. The proof is adapted from Lemma 3 of [Reference Raginsky, Rakhlin and Telgarsky31].
-
(1) We first show $Y_{k\eta}\in L^{2}$ for each $k\in\mathbb{N}$. Assumptions (A3), (A5), and (A6) together with Lemma 4.4 give, almost surely,
\begin{align*} E\left[\left.\left|G(Y_{k\eta},a_{k\eta})\right|^{2}\right|Y_{k\eta}\right]&\le 2E\left[\left.\left|G(Y_{k\eta},a_{k\eta})-\widetilde{G}(Y_{k\eta})\right|^{2}\right|Y_{k\eta}\right]+2\left|\widetilde{G}(Y_{k\eta})\right|^{2}\\ &\le 4\delta_{\mathbf{v},2}\left|Y_{k\eta}\right|^{2}+4\delta_{\mathbf{v},0}+\left(4\left\|\widetilde{G}\right\|_{\mathbb{M}}^{2}+4\left(\omega_{\widetilde{G}}\left(1\right)\right)^{2}\left|Y_{k\eta}\right|^{2}\right), \end{align*}
and thus $Y_{k\eta}\in L^{2}$ implies $G(Y_{k\eta},a_{k\eta})\in L^{2}$. Hence,
\begin{align*} E\left[\left|Y_{(k+1)\eta}\right|^{2}\right] &=E\left[\left|Y_{k\eta}-\eta{G}\left(Y_{k\eta},a_{k\eta}\right)+\sqrt{2\beta^{-1}}\left(B_{(k+1)\eta}-B_{k\eta}\right)\right|^{2}\right]\\ &\le 2E\left[\left|Y_{k\eta}-\eta{G}\left(Y_{k\eta},a_{k\eta}\right)\right|^{2}\right]+2E\left[\left|\sqrt{2\beta^{-1}}\left(B_{(k+1)\eta}-B_{k\eta}\right)\right|^{2}\right]\\ &\le 4E\left[\left|Y_{k\eta}-\eta \widetilde{G}(Y_{k\eta})\right|^{2}\right]+4\eta^{2}E\left[\left|\widetilde{G}(Y_{k\eta})-{G}\left(Y_{k\eta},a_{k\eta}\right)\right|^{2}\right]+\frac{4\eta d}{\beta}\\ &\le \left(8+16(\omega_{\widetilde{G}}(1))^{2}+8\delta_{\mathbf{v},2}\right)E\left[\left|Y_{k\eta}\right|^{2}\right]+\left(16\left\|\widetilde{G}\right\|_{\mathbb{M}}^{2}+8\delta_{\mathbf{v},0}+\frac{4d}{\beta}\right). \end{align*}
Therefore, $Y_{k\eta}\in L^{2}$ for every $k$, as there exist $\gamma_{2},\gamma_{0}>1$ such that for arbitrary $k\in\mathbb{N}$,
\begin{align*} E[|Y_{(k+1)\eta}|^{2}] &\le \gamma_{2}E[|Y_{k\eta}|^{2}]+\gamma_{0}\\ &\le \gamma_{2}^{k+1}E[|\xi|^{2}]+\gamma_{0}(\gamma_{2}^{k+1}-1)/(\gamma_{2}-1)\\ &\le \gamma_{2}^{k+1}(\log E[\exp(|\xi|^{2})]+\gamma_{0}/(\gamma_{2}-1))\\ &\le \gamma_{2}^{k+1}(\kappa_{0}+\gamma_{0}/(\gamma_{2}-1))<\infty \end{align*}
by Jensen’s inequality.
The independence among
$Y_{k\eta}$
,
$a_{k\eta}$
, and
$B_{(k+1)\eta}-B_{k\eta}$
and the square integrability of
$Y_{k\eta}$
and
$G(Y_{k\eta},a_{k\eta})$
lead to
\begin{align*} E\left[\left|Y_{(k+1)\eta}\right|^{2}\right]=E\left[\left|Y_{k\eta}-\eta G\left(Y_{k\eta},a_{k\eta}\right)\right|^{2}\right]+\frac{2\eta d}{\beta}. \end{align*}
Lemma 4.4 gives
\begin{align*} &E\left[\left|Y_{k\eta}-\eta \widetilde{G}(Y_{k\eta})\right|^{2}\right]\\ &\quad=E\left[\left|Y_{k\eta}\right|^{2}\right]-2\eta E\left[\left\langle Y_{k\eta}, \widetilde{G}(Y_{k\eta})\right\rangle \right]+\eta^{2}E\left[\left|\widetilde{G}(Y_{k\eta})\right|^{2}\right]\\ &\quad\le E\left[\left|Y_{k\eta}\right|^{2}\right]+2\eta\left(\tilde{b}-\tilde{m}E\left[\left|Y_{k\eta}\right|^{2}\right]\right)+2\eta^{2}\left(\left\|\widetilde{G}\right\|_{\mathbb{M}}^{2}+\left(\omega_{\widetilde{G}}(1)\right)^{2}E\left[\left|Y_{k\eta}\right|^{2}\right]\right)\\ &\quad=\left(1-2\eta \tilde{m}+2\eta^{2}\left(\omega_{\widetilde{G}}(1)\right)^{2}\right)E\left[\left|Y_{k\eta}\right|^{2}\right]+2\eta \tilde{b}+2\eta^{2}\left\|\widetilde{G}\right\|_{\mathbb{M}}^{2}. \end{align*}
By assumption (A5) and the independence between
$a_{k\eta}$
and
$Y_{k\eta}$
, we also have
\begin{align*} E\left[\left|Y_{k\eta}-\eta G\left(Y_{k\eta},a_{k\eta}\right)\right|^{2}\right]\le E\left[\left|Y_{k\eta}-\eta \widetilde{G}(Y_{k\eta})\right|^{2}\right]+2\eta^{2}\left(\delta_{\mathbf{v},2}E\left[\left|Y_{k\eta}\right|^{2}\right]+\delta_{\mathbf{v},0}\right). \end{align*}
Hence, for $\gamma\,:\!=\,1-2\eta \tilde{m}+2\eta^{2}((\omega_{\widetilde{G}}(1))^{2}+\delta_{\mathbf{v},2})<1$, we have that
\begin{align*} E\left[\left|Y_{(k+1)\eta}\right|^{2}\right]\le \gamma E\left[\left|Y_{k\eta}\right|^{2}\right]+2\eta \tilde{b}+2\eta^{2}\left\|\widetilde{G}\right\|_{\mathbb{M}}^{2}+2\eta^{2}\delta_{\mathbf{v},0}+\frac{2\eta d}{\beta}. \end{align*}
If $\gamma\le 0$, then it is obvious that
\begin{align*} E\left[\left|Y_{(k+1)\eta}\right|^{2}\right]\le 2\eta \tilde{b}+2\eta^{2}\left\|\widetilde{G}\right\|_{\mathbb{M}}^{2}+2\eta^{2}\delta_{\mathbf{v},0}+\frac{2\eta d}{\beta}\le \kappa_{\infty}, \end{align*}
and if $\gamma\in(0,1)$,
\begin{align*} E\left[\left|Y_{k\eta}\right|^{2}\right]&\le \gamma^{k}E\left[\left|Y_{0}\right|^{2}\right]+\frac{2\eta \tilde{b}+2\eta^{2}\left\|\widetilde{G}\right\|_{\mathbb{M}}^{2}+2\eta^{2}\delta_{\mathbf{v},0}+\frac{2\eta d}{\beta}}{2\eta \tilde{m}-2\eta^{2}\left(\left(\omega_{\widetilde{G}}(1)\right)^{2}+\delta_{\mathbf{v},2}\right)}\\ &\le E\left[\left|Y_{0}\right|^{2}\right]+\frac{2\tilde{b}+2\eta\left\|\widetilde{G}\right\|_{\mathbb{M}}^{2}+2\eta \delta_{\mathbf{v},0}+\frac{2d}{\beta}}{2\tilde{m}-2\eta\left(\left(\omega_{\widetilde{G}}(1)\right)^{2}+\delta_{\mathbf{v},2}\right)}\\ &\le \kappa_{0}+\frac{2}{\tilde{m}}\left(\tilde{b}+\left\|\widetilde{G}\right\|_{\mathbb{M}}^{2}+\delta_{\mathbf{v},0}+\frac{d}{\beta}\right) \end{align*}
since Jensen’s inequality yields
$E[|\xi|^{2}]\le \log E[\exp(|\xi|^{2})]= \kappa_{0}$
.
(2) Itô’s formula yields
\begin{align*} \mathrm{e}^{2\bar{m}t}\left|\bar{X}_{t}^{r}\right|^{2}&=\left|\bar{X}_{0}^{r}\right|^{2}+\int_{0}^{t}\left(\mathrm{e}^{2\bar{m}s}\left\langle -\nabla \bar{U}_{r}\left(\bar{X}_{s}^{r}\right),2\bar{X}_{s}^{r}\right\rangle+\mathrm{e}^{2\bar{m}s}\frac{2d}{\beta}+2\bar{m}\mathrm{e}^{2\bar{m}s}\left|\bar{X}_{s}^{r}\right|^{2}\right)\text{d} s\\ &\quad+\sqrt{2\beta^{-1}}\int_{0}^{t}\mathrm{e}^{2\bar{m}s}\left\langle 2\bar{X}_{s}^{r},\text{d} B_{s}\right\rangle. \end{align*}
The dissipativity and the martingale property of the last term lead to
\begin{align*} E\left[\left|\bar{X}_{t}^{r}\right|^{2}\right]&=\mathrm{e}^{-2\bar{m}t}E\left[\left|\xi\right|^{2}\right]\\ &\quad+2\int_{0}^{t}\mathrm{e}^{2\bar{m}(s-t)}\left(E\left[\left\langle -\nabla \bar{U}_{r}\left(\bar{X}_{s}^{r}\right),\bar{X}_{s}^{r}\right\rangle+\bar{m}\left|\bar{X}_{s}^{r}\right|^{2}\right]+\frac{d}{\beta}\right)\text{d} s\\ &\le \mathrm{e}^{-2\bar{m}t}E\left[\left| \xi\right|^{2}\right]+2\int_{0}^{t}\mathrm{e}^{2\bar{m}(s-t)}\left(E\left[-\bar{m}\left|\bar{X}_{s}^{r}\right|^{2}+\bar{b}+\bar{m}\left|\bar{X}_{s}^{r}\right|^{2}\right]+\frac{d}{\beta}\right)\text{d} s\\ &\le \mathrm{e}^{-2\bar{m}t}\kappa_{0}+\frac{\bar{b}+d/\beta}{\bar{m}}\left(1-\mathrm{e}^{-2\bar{m}t}\right), \end{align*}
and we are done.
Lemma 5.2. (Exponential integrability of mollified Langevin dynamics.) Assume (A1)–(A4) and (A6). For all
$r\in(0,1]$
and
$\alpha\in(0,\beta \bar{m}/2)$
such that
$E[\exp(\alpha|\xi|^{2})]<\infty$
,
\begin{align*} E\left[\exp\left(\alpha\left|\bar{X}_{t}^{r}\right|^{2}\right)\right]\le E\left[\exp\left(\alpha\left|\xi\right|^{2}\right)\right]+2\exp\left(\frac{2\alpha\left(\bar{b}+d/\beta\right)}{\bar{m}-2\alpha/\beta}\right)\quad\text{for all } t\ge 0. \end{align*}
In particular, for $\alpha=1\wedge (\beta \bar{m}/4)$,
\begin{align*} \sup_{t\ge 0}\log E\left[\exp\left(\alpha\left|\bar{X}_{t}^{r}\right|^{2}\right)\right]\le \alpha\kappa_{0}+\frac{4\alpha\left(\bar{b}+d/\beta\right)}{\bar{m}}+1. \end{align*}
Proof. Let
$V_{\alpha}(x)\,:\!=\,\exp(\alpha|x|^{2})$
. Note that
\begin{align*} \nabla V_{\alpha}(x)=2\alpha x V_{\alpha}(x),\qquad \Delta V_{\alpha}(x)=\left(2\alpha d+4\alpha^{2}\left|x\right|^{2}\right)V_{\alpha}(x). \end{align*}
Let
$\bar{\mathcal{L}}^{r}$
denote the extended generator of
$\bar{X}_{t}^{r}$
such that
$\bar{\mathcal{L}}^{r}f\,:\!=\,\beta^{-1}\Delta f-\langle \nabla \bar{U}_{r}, \nabla f\rangle$
for
$f\in\mathcal{C}^{2}(\mathbb{R}^{d})$
. We have that
\begin{align*} \bar{\mathcal{L}}^{r}V_{\alpha}(x)&\le -2\alpha V_{\alpha}(x)\left\langle \nabla \bar{U}_{r}(x),x\right\rangle +2(\alpha/\beta)V_{\alpha}(x)\left(2\alpha\left|x\right|^{2}+d\right)\\ &\le 2\alpha V_{\alpha}(x)\left(\left(-\bar{m}\left|x\right|^{2}+\bar{b}\right) +\left(2\alpha\left|x\right|^{2}+d\right)/\beta\right)\\ &=2\alpha V_{\alpha}(x)\left(\left(2\alpha/\beta-\bar{m}\right)\left|x\right|^{2}+\bar{b}+ d/\beta\right). \end{align*}
Let
$R^{2}= 2(\bar{b}+d/\beta)/(\bar{m}-2\alpha/\beta)$
be a fixed constant. We obtain that for all
$x\in\mathbb{R}^{d}$
with
$|x|\ge R$
,
\begin{align*} \bar{\mathcal{L}}^{r}V_{\alpha}(x)\le -2\alpha\left(\bar{b}+ d/\beta\right)V_{\alpha}(x), \end{align*}
and trivially for all
$x\in\mathbb{R}^{d}$
with
$|x|<R$
,
\begin{align*} \bar{\mathcal{L}}^{r}V_{\alpha}(x)&\le 2\alpha \mathrm{e}^{2\alpha(\bar{b}+d/\beta)/(\bar{m}-2\alpha/\beta)}\left(\bar{b}+d/\beta\right)\\ &\le 4\alpha \mathrm{e}^{2\alpha(\bar{b}+d/\beta)/(\bar{m}-2\alpha/\beta)}\left(\bar{b}+d/\beta\right)-2\alpha \left(\bar{b}+ d/\beta\right)V_{\alpha}(x). \end{align*}
Thus, we have for all
$x\in\mathbb{R}^{d}$
,
\begin{align*} \bar{\mathcal{L}}^{r}V_{\alpha}(x)\le 4\alpha \mathrm{e}^{2\alpha(\bar{b}+d/\beta)/(\bar{m}-2\alpha/\beta)}\left(\bar{b}+d/\beta\right)-2\alpha \left(\bar{b}+ d/\beta\right)V_{\alpha}(x). \end{align*}
By Itô’s formula, there exists a sequence of stopping times
$\{\sigma_{n}\in[0,\infty)\}_{n\in\mathbb{N}}$
with
$\sigma_{n}<\sigma_{n+1}$
for all
$n\in\mathbb{N}$
and
$\sigma_{n}\uparrow \infty$
as
$n\to\infty$
almost surely such that for all
$n\in\mathbb{N}$
and
$t\ge0$
,
\begin{align*} &E\left[\mathrm{e}^{2\alpha \left(\bar{b}+ d/\beta\right)(t\wedge\sigma_{n})}V_{\alpha}\left(\bar{X}_{t\wedge\sigma_{n}}^{r}\right)\right]\\ &\quad= E\left[V_{\alpha}\left(\bar{X}_{0}^{r}\right)\right]\\ &\qquad+E\left[\int_{0}^{t\wedge\sigma_{n}}\left(\mathrm{e}^{2\alpha \left(\bar{b}+ d/\beta\right)s}\bar{\mathcal{L}}^{r}V_{\alpha}\left(\bar{X}_{s}^{r}\right)+2\alpha \left(\bar{b}+ d/\beta\right)\mathrm{e}^{2\alpha \left(\bar{b}+ d/\beta\right)s}V_{\alpha}\left(\bar{X}_{s}^{r}\right)\right)\text{d} s\right]. \end{align*}
We have that
\begin{align*} &E\left[\int_{0}^{t\wedge\sigma_{n}}\left(\mathrm{e}^{2\alpha \left(\bar{b}+ d/\beta\right)s}\bar{\mathcal{L}}^{r}V_{\alpha}\left(\bar{X}_{s}^{r}\right)+2\alpha \left(\bar{b}+ d/\beta\right)\mathrm{e}^{2\alpha \left(\bar{b}+ d/\beta\right)s}V_{\alpha}\left(\bar{X}_{s}^{r}\right)\right)\text{d} s\right]\\ &\quad\le 4\alpha \mathrm{e}^{2\alpha(\bar{b}+d/\beta)/(\bar{m}-2\alpha/\beta)}\left(\bar{b}+d/\beta\right)E\left[\int_{0}^{t\wedge\sigma_{n}}\mathrm{e}^{2\alpha \left(\bar{b}+ d/\beta\right)s}\text{d} s\right]\\ &\quad\le 4\alpha \mathrm{e}^{2\alpha(\bar{b}+d/\beta)/(\bar{m}-2\alpha/\beta)}\left(\bar{b}+d/\beta\right)\int_{0}^{t}\mathrm{e}^{2\alpha \left(\bar{b}+ d/\beta\right)s}\text{d} s, \end{align*}
and thus Fatou’s lemma gives
\begin{align*} &E\left[\mathrm{e}^{2\alpha \left(\bar{b}+ d/\beta\right)t}V_{\alpha}\left(\bar{X}_{t}^{r}\right)\right]\\ &\quad=E\left[\lim_{n\to\infty} \mathrm{e}^{2\alpha \left(\bar{b}+ d/\beta\right)(t\wedge\sigma_{n})}V_{\alpha}\left(\bar{X}_{t\wedge\sigma_{n}}^{r}\right)\right]\\ &\quad\le \liminf_{n\to\infty}E\left[\mathrm{e}^{2\alpha \left(\bar{b}+ d/\beta\right)(t\wedge\sigma_{n})}V_{\alpha}\left(\bar{X}_{t\wedge\sigma_{n}}^{r}\right)\right]\\ &\quad\le E\left[V_{\alpha}\left(\bar{X}_{0}^{r}\right)\right]+4\alpha \mathrm{e}^{2\alpha(\bar{b}+d/\beta)/(\bar{m}-2\alpha/\beta)}\left(\bar{b}+d/\beta\right)\int_{0}^{t}\mathrm{e}^{2\alpha \left(\bar{b}+ d/\beta\right)s}\text{d} s. \end{align*}
Therefore,
\begin{align*} E\left[V_{\alpha}\left(\bar{X}_{t}^{r}\right)\right]\le \mathrm{e}^{-2\alpha\left(\bar{b}+ d/\beta\right)t}E\left[V_{\alpha}\left(\xi\right)\right]+2\mathrm{e}^{2\alpha(\bar{b}+d/\beta)/(\bar{m}-2\alpha/\beta)}\left(1-\mathrm{e}^{-2\alpha\left(\bar{b}+ d/\beta\right)t}\right), \end{align*}
and we obtain the desired conclusion.
5.2. Poincaré inequalities for distributions with mollified potentials
Let
$\bar{L}^{r}$
be an operator such that
$\bar{L}^{r}f\,:\!=\,\Delta f-\beta\langle \nabla \bar{U}_{r},\nabla f\rangle $
for all
$f\in\mathcal{C}^{2}(\mathbb{R}^{d})$
. Note that Lemma 4.7 yields
$\bar{U}_{r}\in\mathcal{C}^{2}(\mathbb{R}^{d})$
.
Lemma 5.3. (A bound for the constant of a Poincaré inequality for
$\bar{\pi}^{r}$
.) Under (A1)–(A4), for some absolute constant
$a>0$
, for all
$r\in(0,1]$
,
\begin{align*} c_\textrm{P}(\bar{\pi}^{r})&\le \frac{2}{\bar{m}\beta\left(d+\bar{b}\beta\right)}+\frac{4a\left(d+\bar{b}\beta\right)}{\bar{m}\beta}\exp\left(\beta\left(\frac{3}{2}\left\|\nabla U\right\|_{\mathbb{M}}\left(1+\frac{4\left(d+\bar{b}\beta\right)}{\bar{m}\beta}\right)+U_{0}\right)\right), \end{align*}
where
$U_{0}\,:\!=\,\left\|U\right\|_{L^{\infty}(B_{1}(\mathbf{0}))}<\infty$
.
Remark 5.1. Note that this upper bound is independent of r.
Proof. We adapt the discussion of [Reference Raginsky, Rakhlin and Telgarsky31]. We set a Lyapunov function
$V(x)=\mathrm{e}^{\bar{m}\beta|x|^{2}/4}$
. Since
$-\langle \nabla \bar{U}_{r}(x),x\rangle \le -\bar{m}|x|^{2}+\bar{b}$
for all
$x\in\mathbb{R}^{d}$
, we have that
\begin{align*} \bar{L}^{r}V(x)&=\left(\frac{d\bar{m}\beta}{2}+\frac{\left(\bar{m}\beta\right)^{2}}{4}\left|x\right|^{2}-\frac{\bar{m}\beta^{2}}{2}\left\langle \nabla \bar{U}_{r}\left(x\right),x\right\rangle \right)V(x)\\ &\le \left(\frac{d\bar{m}\beta}{2}+\frac{\left(\bar{m}\beta\right)^{2}}{4}\left|x\right|^{2}-\frac{\bar{m}^{2}\beta^{2}}{2}\left|x\right|^{2}+\frac{\bar{m}\beta^{2}\bar{b}}{2}\right)V(x)\\ &= \left(\frac{\bar{m}\beta\left(d+\bar{b}\beta\right)}{2}-\frac{\bar{m}^{2}\beta^{2}}{4}|x|^{2}\right)V(x). \end{align*}
We fix the constants
\begin{align*} \kappa\,:\!=\,\frac{\bar{m}\beta\left(d+\bar{b}\beta\right)}{2},\qquad \tilde{R}^{2}\,:\!=\,\frac{4\left(d+\bar{b}\beta\right)}{\bar{m}\beta}. \end{align*}
Lemma 4.6,
$2a\le a^{2}+1$
for
$a>0$
, and
$U(x)\ge 0$
give
$\left\|U\right\|_{L^{\infty}(B_{1}(\mathbf{0}))}<\infty$
and
\begin{align*} \text{Osc}_{\tilde{R}}\left(\beta \bar{U}_{r}\right)&\le \beta\left(\frac{\omega_{\nabla U}(1)}{2}\tilde{R}^{2}+2\left\|\nabla U\right\|_{\mathbb{M}}\tilde{R}+\left\|U\right\|_{L^{\infty}(B_{1}(\mathbf{0}))}\right)\\ &\le \beta\left(\left(\left\|\nabla U\right\|_{\mathbb{M}}+\frac{\omega_{\nabla U}(1)}{2}\right)\tilde{R}^{2}+\left\|\nabla U\right\|_{\mathbb{M}}+\left\|U\right\|_{L^{\infty}(B_{1}(\mathbf{0}))}\right)\\ &\le \beta\left(\frac{3}{2}\left\|\nabla U\right\|_{\mathbb{M}}\tilde{R}^{2}+\left\|\nabla U\right\|_{\mathbb{M}}+\left\|U\right\|_{L^{\infty}(B_{1}(\mathbf{0}))}\right)\\ &\le \beta\left(\frac{3}{2}\left\|\nabla U\right\|_{\mathbb{M}}\left(1+\tilde{R}^{2}\right)+\left\|U\right\|_{L^{\infty}(B_{1}(\mathbf{0}))}\right). \end{align*}
Proposition 4.4 with
$\lambda_{0}=\kappa_{0}=\kappa$
yields that for some absolute constant
$a>0$
,
\begin{align*} c_\textrm{P}\left(\bar{\pi}^{r}\right)&\le \frac{2}{\bar{m}\beta\left(d+\bar{b}\beta\right)}\left(1+a\frac{\bar{m}\beta\left(d+\bar{b}\beta\right)}{2}\frac{4\left(d+\bar{b}\beta\right)}{\bar{m}\beta}\mathrm{e}^{\text{Osc}_{\tilde{R}}\left(\beta \bar{U}_{r}\right)}\right)\\ &=\frac{2}{\bar{m}\beta\left(d+\bar{b}\beta\right)}+\frac{4a\left(d+\bar{b}\beta\right)}{\bar{m}\beta}\exp\left(\beta\left(\frac{3}{2}\left\|\nabla U\right\|_{\mathbb{M}}\left(1+\frac{4\left(d+\bar{b}\beta\right)}{\bar{m}\beta}\right)+U_{0}\right)\right). \end{align*}
Hence, the statement holds true.
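Since the bound is explicit up to the absolute constant a, it can be evaluated directly. The following Python sketch transcribes the formula, with a left as a placeholder (Proposition 4.4 does not specify it) and the remaining inputs to be computed from (A1)–(A4):
\begin{verbatim}
import numpy as np

def poincare_bound(m_bar, b_bar, beta, d, a, grad_U_norm_M, U0):
    # Transcription of the r-independent bound on c_P(pi_bar^r) in Lemma 5.3.
    kappa = d + b_bar * beta
    return (2.0 / (m_bar * beta * kappa)
            + (4.0 * a * kappa / (m_bar * beta))
              * np.exp(beta * (1.5 * grad_U_norm_M
                               * (1.0 + 4.0 * kappa / (m_bar * beta)) + U0)))

# Illustrative values only.
print(poincare_bound(m_bar=1.0, b_bar=1.0, beta=1.0, d=2, a=1.0,
                     grad_U_norm_M=1.0, U0=0.0))
\end{verbatim}
The exponential dependence on d and $\beta$ visible here is the dominant contribution to the overall complexity.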
5.3. Kullback–Leibler and
$\chi^{2}$
-divergences
Lemma 5.4. Under (A1)–(A6), for any
$k\in\mathbb{N}$
and
$\eta\in(0,1\wedge (\tilde{m}/2((\omega_{\widetilde{G}}(1))^{2}+\delta_{\mathbf{v},2}))]$
, we have that
\begin{align*} D\left(\mu_{k\eta}\|\bar{\nu}_{k\eta}^{r}\right)\le \left(C_{0}\frac{\omega_{\nabla U}(r)}{r}\eta+\beta\left(\delta_{r,2}\kappa_{\infty}+\delta_{r,0}\right)\right)k\eta, \end{align*}
where $C_{0}$ is a positive constant such that
\begin{align*} C_{0}=\left(d+2\right)\left(\frac{\beta}{3}\left(\left\|\widetilde{G}\right\|_{\mathbb{M}}^{2}+\delta_{\mathbf{v},0}+\left(\left(\omega_{\widetilde{G}}(1)\right)^{2}+\delta_{\mathbf{v},2}\right)\kappa_{\infty}\right)+\frac{d}{2}\right). \end{align*}
Proof. We set
$A_{t}\,:\!=\,\{\mathbf{a}_{s}=a_{\lfloor s/\eta\rfloor \eta} \, : \, a_{i\eta}\in A,i=0,\ldots,\lfloor t/\eta\rfloor,s\le t\}$
with
$t\le k\eta$
, and
$\mathcal{A}_{t}\,:\!=\,\sigma(\{\mathbf{a}\in A_{t} \, : \, \mathbf{a}_{s_{j}}\in C_{j},j=1,\ldots,n\} \, : \, s_{j}\in[0,t],C_{j}\in\mathcal{A},n\in\mathbb{N})$
. Let
$P_{k\eta}$
and
$Q_{k\eta}$
denote the probability measures on
$(A_{k\eta}\times W_{k\eta},\mathcal{A}_{k\eta}\times \mathcal{W}_{k\eta})$
of
$\{(a_{\lfloor t/\eta\rfloor \eta},Y_{t}) \, : \, 0\le t\le k\eta\}$
and
$\{(a_{\lfloor t/\eta\rfloor \eta},\bar{X}_{t}^{r}) \, : \, 0\le t\le k\eta\}$
, respectively. Note that
$\bar{X}_{t}^{r}$
is the unique strong solution to Eq. (10), which exists for any
$r>0$
since
$\nabla \bar{U}_{r}$
is Lipschitz continuous by Lemma 4.8. We obtain
\begin{align*} &\frac{\beta}{4}E\left[\int_{0}^{k\eta}\left|\nabla \bar{U}_{r}\left(Y_{t}\right)-G\left(Y_{\lfloor t/\eta\rfloor \eta},a_{\lfloor t/\eta\rfloor\eta}\right)\right|^{2}\text{d} t\right]\\ &\quad\le \frac{\beta}{2}\sum_{j=0}^{k-1}E\left[\int_{j\eta}^{(j+1)\eta}\left|\nabla \bar{U}_{r}\left(Y_{t}\right)-\nabla \bar{U}_{r}\left(Y_{\lfloor t/\eta\rfloor \eta}\right)\right|^{2}\text{d} t\right]\\ &\qquad+\frac{\beta}{2}\sum_{j=0}^{k-1}E\left[\int_{j\eta}^{(j+1)\eta}\left|\nabla \bar{U}_{r}\left(Y_{\lfloor t/\eta\rfloor \eta}\right)-G\left(Y_{\lfloor t/\eta\rfloor \eta},a_{\lfloor t/\eta\rfloor\eta}\right)\right|^{2}\text{d} t\right]\\ &\quad\le \frac{\beta}{2}\frac{\left(d+2\right)\omega_{\nabla U}(r)}{r}\sum_{j=0}^{k-1}E\left[\int_{j\eta}^{(j+1)\eta}\left|Y_{t}-Y_{\lfloor t/\eta\rfloor \eta}\right|^{2}\text{d} t\right]\\ &\qquad+\frac{\beta}{2}\sum_{j=0}^{k-1}E\left[\eta\left|\nabla \bar{U}_{r}\left(Y_{j \eta}\right)-G\left(Y_{j\eta},a_{j\eta}\right)\right|^{2}\right]. \end{align*}
Note that
$E[\langle G(Y_{j\eta},a_{j\eta})-\widetilde{G}(Y_{j\eta}), f(Y_{j\eta})\rangle ]=0$
for any measurable
$f:\mathbb{R}^{d}\to\mathbb{R}^{d}$
of linear growth since
$Y_{j\eta}$
is square integrable by Lemma 5.1 and
$\sigma(Y_{(j-1)\eta},a_{(j-1)\eta},B_{j\eta}-B_{(j-1)\eta})$
-measurable, and
$a_{j\eta}$
is independent of this
$\sigma$
-algebra. For all
$t\in[j\eta,(j+1)\eta)$
, by Lemmas 4.4 and 5.1,
\begin{align*} &E\left[\left|Y_{t}-Y_{\lfloor t/\eta\rfloor \eta}\right|^{2}\right]\\ &\quad=E\left[\left|-\left(t-j\eta\right)G\left(Y_{j\eta},a_{j\eta}\right)+\sqrt{2\beta^{-1}}\left(B_{t}-B_{j\eta}\right)\right|^{2}\right]\\ &\quad=\left(t-j\eta\right)^{2}E\left[\left|G\left(Y_{j\eta},a_{j\eta}\right)-\widetilde{G}\left(Y_{j\eta}\right)+\widetilde{G}\left(Y_{j\eta}\right)\right|^{2}\right]+2\beta^{-1}E\left[\left|B_{t}-B_{j\eta}\right|^{2}\right]\\ &\quad\le \left(t-j\eta\right)^{2}\left(2\delta_{\mathbf{v},2}E\left[\left|Y_{j\eta}\right|^{2}\right]+2\delta_{\mathbf{v},0}+E\left[\left|\widetilde{G}\left(Y_{j\eta}\right)\right|^{2}\right]\right)+2\beta^{-1}d\left(t-j\eta\right)\\ &\quad\le 2\left(t-j\eta\right)^{2}\left(\left\|\widetilde{G}\right\|_{\mathbb{M}}^{2}+\delta_{\mathbf{v},0}+\left((\omega_{\widetilde{G}}(1))^{2}+\delta_{\mathbf{v},2}\right)E\left[\left|Y_{j\eta}\right|^{2}\right]\right)+2\beta^{-1}d\left(t-j\eta\right)\\ &\quad\le 2\left(t-j\eta\right)^{2}\left(\left\|\widetilde{G}\right\|_{\mathbb{M}}^{2}+\delta_{\mathbf{v},0}+\left((\omega_{\widetilde{G}}(1))^{2}+\delta_{\mathbf{v},2}\right)\kappa_{\infty}\right)+2\beta^{-1}d\left(t-j\eta\right)\\ &\quad = \! : \, 2\left(t-j\eta\right)^{2}C'+2\beta^{-1}d\left(t-j\eta\right) \end{align*}
and thus
\begin{align*} \sum_{j=0}^{k-1}E\left[\int_{j\eta}^{(j+1)\eta}\left|Y_{t}-Y_{\lfloor t/\eta\rfloor \eta}\right|^{2}\text{d} t\right]\le \left(\frac{2C'\eta^{3}}{3}+\frac{d\eta^{2}}{\beta}\right)k\le \left(\frac{2C'}{3}+\frac{d}{\beta}\right)k\eta^{2}. \end{align*}
We have that
\begin{align*} E\left[\left|\nabla \bar{U}_{r}\left(Y_{j\eta}\right)-G\left(Y_{j\eta},a_{j\eta}\right)\right|^{2}\right] &\le E\left[2\left(\delta_{\mathbf{b},r,2}+\delta_{\mathbf{v},2}\right)\left|Y_{j\eta}\right|^{2}+2\left(\delta_{\mathbf{b},r,0}+\delta_{\mathbf{v},0}\right)\right]\\ &\le 2\delta_{r,2}\kappa_{\infty}+2\delta_{r,0}. \end{align*}
Assumptions (LS1)–(LS4) of Propositions 4.2 and 4.3 are satisfied owing to (A1)–(A6), Lemma 5.1, and the linear growths of
$\widetilde{G}(w_{\lfloor \cdot /\eta\rfloor \eta})$
with respect to
$\max_{i=0,\ldots,k}|w_{i\eta}|$
and
$\nabla \bar{U}_{r}(w_{t})$
with respect to
$|w_{t}|$
. Therefore, the data-processing inequality and Proposition 4.3 give
\begin{align*} D(\mu_{k\eta}\|\bar{\nu}_{k\eta}^{r}) &\le \int\log\left(\frac{\text{d} P_{k\eta}}{\text{d} Q_{k\eta}}\right)\text{d} P_{k\eta}\\ &= \frac{\beta}{4}E\left[\int_{0}^{k\eta}\left|\nabla \bar{U}_{r}\left(Y_{t}\right)-G\left(Y_{\lfloor t/\eta\rfloor \eta},a_{\lfloor t/\eta\rfloor\eta}\right)\right|^{2}\text{d} t\right]\\ &\le \frac{(d+2)\omega_{\nabla U}(r)}{r}\left(\frac{C'\beta}{3}+\frac{d}{2}\right)k\eta^{2}+\beta\left(\delta_{r,2}\kappa_{\infty}+\delta_{r,0}\right)k\eta\\ &=\left(\frac{(d+2)\omega_{\nabla U}(r)}{r}\left(\frac{C'\beta}{3}+\frac{d}{2}\right)\eta+\beta\left(\delta_{r,2}\kappa_{\infty}+\delta_{r,0}\right)\right)k\eta\\ &=\left(C_{0}\frac{\omega_{\nabla U}(r)}{r}\eta+\beta\left(\delta_{r,2}\kappa_{\infty}+\delta_{r,0}\right)\right)k\eta. \end{align*}
This is the desired conclusion.
Lemma 5.5. (Lemma 2 of [Reference Raginsky, Rakhlin and Telgarsky31].) Under (A1) and (A4), for almost all $x\in\mathbb{R}^{d}$,
\begin{align*} U(x)\ge \frac{m}{3}\left|x\right|^{2}-\frac{b}{2}\log 3. \end{align*}
Proof. The proof is adapted from Lemma 2 of [Reference Raginsky, Rakhlin and Telgarsky31]. We first fix
$c\in(0,1]$
. Since
$\{x\in\mathbb{R}^{d} \, : \, x\text{ or }cx\text{ is in the set such that Eq. (6) does not hold}\}$
is null, for almost all
$x\in\mathbb{R}^{d}$
,
\begin{align*} U(x)&=U(cx)+\int_{0}^{1}\left\langle \nabla U(cx+t(x-cx)),x-cx\right\rangle \text{d} t\\ &\ge \int_{0}^{1}\left\langle \nabla U((c+t(1-c))x),(1-c)x\right\rangle \text{d} t\\ &=\int_{0}^{1}\frac{1-c}{c+t(1-c)}\left\langle \nabla U((c+t(1-c))x),(c+t(1-c))x\right\rangle \text{d} t\\ &\ge \int_{0}^{1}\frac{1-c}{c+t(1-c)}\left(m(c+t(1-c))^{2}|x|^{2}-b\right) \text{d} t\\ &=\int_{c}^{1}\frac{1}{s}\left(ms^{2}|x|^{2}-b\right) \text{d} s\\ &=\frac{1-c^{2}}{2}m|x|^{2}+b\log c. \end{align*}
Here,
$s=c+t(1-c)$
and thus
$\text{d} t=(1-c)^{-1}\text{d} s$
Then $c=1/\sqrt{3}$ yields the conclusion since $(1-c^{2})/2=1/3$ and $b\log c=-(b/2)\log 3$.
Lemma 5.6. Under (A1)–(A4) and (A6), for all
$r\in(0,1]$
, we have that
\begin{align*} \chi^{2}\left(\mu_{0}\|\bar{\pi}^{r}\right)\le 3^{\beta b/2}\left(\frac{3\psi_{2}}{m\beta}\right)^{d/2}\mathrm{e}^{\beta\left(2\left\|\nabla U\right\|_{\mathbb{M}}+\left\|U\right\|_{L^{\infty}\left(B_{1}\left(\mathbf{0}\right)\right)}\right)+2\psi_{0}}. \end{align*}
Proof. The density of
$\bar{\pi}^{r}$
is given as
$(\text{d}\bar{\pi}^{r}/\text{d} x)(x)=\bar{\mathcal{Z}}^{r}(\beta)^{-1}\mathrm{e}^{-\beta \bar{U}_{r}\left(x\right)}$
and
\begin{align*} \chi^{2}(\mu_{0}\|\bar{\pi}^{r})&=\int_{\mathbb{R}^{d}}\left(\left(\frac{\left(\int_{\mathbb{R}^{d}}\mathrm{e}^{-\Psi(x)}\text{d} x\right)^{-1}\mathrm{e}^{-\Psi(x)}}{\bar{\mathcal{Z}}^{r}(\beta)^{-1}\mathrm{e}^{-\beta \bar{U}_{r}\left(x\right)}}\right)^{2}-1\right)\bar{\mathcal{Z}}^{r}(\beta)^{-1}\mathrm{e}^{-\beta \bar{U}_{r}\left(x\right)}\text{d} x\\ &=\frac{\bar{\mathcal{Z}}^{r}(\beta)}{\int_{\mathbb{R}^{d}}\mathrm{e}^{-\Psi(x)}\text{d} x}\int_{\mathbb{R}^{d}}\mathrm{e}^{\beta \bar{U}_{r}\left(x\right)-\Psi\left(x\right)}\mu_{0}(\text{d} x)-1\\ &\le \frac{\mathrm{e}^{\psi_{0}}\bar{\mathcal{Z}}^{r}(\beta)}{\int_{\mathbb{R}^{d}}\mathrm{e}^{-\psi_{2}|x|^{2}}\text{d} x}\int_{\mathbb{R}^{d}}\mathrm{e}^{\beta\left(\frac{\omega_{\nabla U}(1)}{2}\left|x\right|^{2}+2\left\|\nabla U\right\|_{\mathbb{M}}\left|x\right|+\left\|U\right\|_{L^{\infty}\left(B_{1}\left(\mathbf{0}\right)\right)}\right)-\Psi(x)}\mu_{0}(\text{d} x)\\ &\le \frac{\mathrm{e}^{\psi_{0}}\bar{\mathcal{Z}}^{r}(\beta)}{(\pi/\psi_{2})^{d/2}}\int_{\mathbb{R}^{d}}\mathrm{e}^{\beta\left(\left\|\nabla U\right\|_{\mathbb{M}}\left|x\right|^{2}+2\left\|\nabla U\right\|_{\mathbb{M}}+\left\|U\right\|_{L^{\infty}\left(B_{1}\left(\mathbf{0}\right)\right)}\right)-\Psi(x)}\mu_{0}(\text{d} x)\\ &\le \frac{\bar{\mathcal{Z}}^{r}(\beta)}{(\pi/\psi_{2})^{d/2}}\mathrm{e}^{\beta\left(2\left\|\nabla U\right\|_{\mathbb{M}}+\left\|U\right\|_{L^{\infty}\left(B_{1}\left(\mathbf{0}\right)\right)}\right)+2\psi_{0}} \end{align*}
by Lemma 4.6 and
$2|x|\le |x|^{2}/2+2$
. Lemma 5.5, Jensen’s inequality, and the convexity of
$\mathrm{e}^{-x}$
yield
\begin{align*} \bar{\mathcal{Z}}^{r}(\beta)&=\int_{\mathbb{R}^{d}}\mathrm{e}^{-\beta \bar{U}_{r}\left(x\right)}\text{d} x\\ &=\int_{\mathbb{R}^{d}}\mathrm{e}^{-\beta \int _{\mathbb{R}^{d}}U(x-y)\rho_{r}(y)\text{d} y}\text{d} x\\ &\le \int_{\mathbb{R}^{d}}\int_{\mathbb{R}^{d}}\mathrm{e}^{-\beta U(x-y)}\text{d} x\rho_{r}(y)\text{d} y\\ &\le \mathrm{e}^{\frac{1}{2}\beta b\log 3}\int_{\mathbb{R}^{d}}\int_{\mathbb{R}^{d}}\mathrm{e}^{-m\beta|x-y|^{2}/3}\text{d} x\rho_{r}(y)\text{d} y\\ &=3^{\beta b/2}\left(3\pi/m\beta\right)^{d/2}, \end{align*}
and we are done.
Lemma 5.7. (Kullback–Leibler divergence of Gibbs distributions.) Under (A1)–(A4) and (A6), we have that
\begin{align*} D\left(\pi\|\bar{\pi}^{r}\right)\le \beta r\omega_{\nabla U}(r). \end{align*}
Proof. The divergence of
$\pi$
from
$\bar{\pi}^{r}$
is
\begin{align*} D\left(\pi\|\bar{\pi}^{r}\right)&=\frac{1}{\mathcal{Z}(\beta)}\int \exp\left(-\beta U\left(x\right)\right)\log\left[\frac{\bar{\mathcal{Z}}^{r}(\beta)\exp\left(-\beta U\left(x\right)\right)}{\mathcal{Z}(\beta)\exp\left(-\beta \bar{U}_{r}\left(x\right)\right)}\right]\text{d} x\\ &=\frac{\beta}{\mathcal{Z}(\beta)}\int \exp\left(-\beta U\left(x\right)\right)\left(\bar{U}_{r}(x)-U\left(x\right)\right)\text{d} x+\left(\log\bar{\mathcal{Z}}^{r}(\beta)-\log\mathcal{Z}(\beta)\right).\end{align*}
Lemma 4.10 yields
\begin{align*} \frac{\beta}{\mathcal{Z}(\beta)}\int_{\mathbb{R}^{d}} \left(\bar{U}_{r}(x)-U\left(x\right)\right)\mathrm{e}^{-\beta U\left(x\right)}\text{d} x &\le \frac{\beta}{\mathcal{Z}(\beta)}\int_{\mathbb{R}^{d}} \left|\bar{U}_{r}(x)-U\left(x\right)\right|\mathrm{e}^{-\beta U\left(x\right)}\text{d} x\\ &\le \frac{\beta }{\mathcal{Z}(\beta)}\int_{\mathbb{R}^{d}} r\omega_{\nabla U}(r)\mathrm{e}^{-\beta U\left(x\right)}\text{d} x\\ &\le \beta r\omega_{\nabla U}(r).\end{align*}
Jensen’s inequality and Fubini’s theorem give
\begin{align*} \bar{\mathcal{Z}}^{r}\left(\beta\right)&=\int_{\mathbb{R}^{d}}\exp\left(-\beta \int_{\mathbb{R}^{d}}U\left(x-y\right)\rho_{r}\left(y\right)\text{d} y\right)\text{d} x\\ &\le \int_{\mathbb{R}^{d}}\int_{\mathbb{R}^{d}}\exp\left(-\beta U\left(x-y\right)\right)\rho_{r}\left(y\right)\text{d} y\text{d} x\\ &=\int_{\mathbb{R}^{d}}\int_{\mathbb{R}^{d}}\exp\left(-\beta U\left(x-y\right)\right)\text{d} x\rho_{r}\left(y\right)\text{d} y\\ &= \mathcal{Z}\left(\beta\right)\end{align*}
and thus
$\log\bar{\mathcal{Z}}^{r}(\beta)-\log\mathcal{Z}(\beta)\le 0$
.
5.4. Proof of Theorem 2.1
We complete the proof of Theorem 2.1.
Proof of Theorem 2.1. We decompose the 2-Wasserstein distance as follows:
\begin{align*} \mathcal{W}_{2}\left(\mu_{k\eta},\pi\right)\le \mathcal{W}_{2}\left(\mu_{k\eta},\bar{\nu}_{k\eta}^{r}\right)+\mathcal{W}_{2}\left(\bar{\nu}_{k\eta}^{r},\bar{\pi}^{r}\right)+\mathcal{W}_{2}\left(\bar{\pi}^{r},\pi\right). \end{align*}
-
(1) We first consider an upper bound for $\mathcal{W}_{2}(\mu_{k\eta},\bar{\nu}_{k\eta}^{r})$. Proposition 4.6 gives
\begin{align*} \mathcal{W}_{2}(\mu_{k\eta},\bar{\nu}_{k\eta}^{r})\le C_{\bar{\nu}_{k\eta}^{r}}\left(D\left(\mu_{k\eta}\|\bar{\nu}_{k\eta}^{r}\right)^{1/2}+\left(\frac{D\left(\mu_{k\eta}\|\bar{\nu}_{k\eta}^{r}\right)}{2}\right)^{1/4}\right),\end{align*}
where
\begin{align*} C_{\bar{\nu}_{k\eta}^{r}}\,:\!=\,2\inf_{\lambda>0}\left(\frac{1}{\lambda}\left(\frac{3}{2}+\log\int_{\mathbb{R}^{d}}\mathrm{e}^{\lambda\left|x\right|^{2}}\bar{\nu}_{k\eta}^{r}\left(\text{d} x\right)\right)\right)^{1/2}.\end{align*}
We fix $\lambda=1\wedge (\beta \bar{m}/4)$; then Lemma 5.2 leads to
\begin{align*} C_{\bar{\nu}_{k\eta}^{r}}&\le \frac{1}{\lambda^{1/2}}\left(6+4\log\int_{\mathbb{R}^{d}}\mathrm{e}^{\lambda\left|x\right|^{2}}\bar{\nu}_{k\eta}^{r}\left(\text{d} x\right)\right)^{1/2}\\ &\le \frac{1}{\lambda^{1/2}}\left(6+4\left(\lambda\kappa_{0}+\frac{4\lambda(\bar{b}+d/\beta)}{\bar{m}}+1\right)\right)^{1/2}\\ &\le \left(4\kappa_{0}+\frac{16(\bar{b}+d/\beta)}{\bar{m}}+\frac{10}{1\wedge (\beta \bar{m}/4)}\right)^{1/2}= \! : \, C_{1}.\end{align*}
Hence, Lemma 5.4 gives the following bound:
\begin{align*} \mathcal{W}_{2}(\mu_{k\eta},\bar{\nu}_{k\eta}^{r}) &\le C_{1}\max\left.\left\{x^{1/2},x^{1/4}\right\}\right|_{x=\left(C_{0}\frac{\omega_{\nabla U}(r)}{r}\eta+\beta\left(\delta_{r,2}\kappa_{\infty}+\delta_{r,0}\right)\right)k\eta}.\end{align*}
-
(2) Let us now give a bound for
$\mathcal{W}_{2}(\bar{\nu}_{k\eta}^{r},\bar{\pi}^{r})$
. Proposition 4.5 and Lemma 5.6 yield
\begin{align*} \mathcal{W}_{2}(\bar{\nu}_{k\eta}^{r},\bar{\pi}^{r})&\le \sqrt{2c_\textrm{P}(\bar{\pi}^{r})\chi^{2}\left(\mu_{0}\|\bar{\pi}^{r}\right)}\exp\left(-\frac{k\eta}{2\beta c_\textrm{P}(\bar{\pi}^{r})}\right)\\ &\le \sqrt{2c_\textrm{P}(\bar{\pi}^{r})C_{2}}\exp\left(-\frac{k\eta}{2\beta c_\textrm{P}(\bar{\pi}^{r})}\right).\end{align*}
-
(3) Next, we consider a bound for
$\mathcal{W}_{2}(\bar{\pi}^{r},\pi)$
. Proposition 4.6 gives
\begin{align*} \mathcal{W}_{2}(\bar{\pi}^{r},\pi)\le C_{\bar{\pi}^{r}}\left(D\left(\pi\|\bar{\pi}^{r}\right)^{1/2}+\left(\frac{D\left(\pi\|\bar{\pi}^{r}\right)}{2}\right)^{1/4}\right),\end{align*}
where $C_{\bar{\pi}^{r}}\,:\!=\,2\inf_{\lambda>0}\left(\frac{1}{\lambda}\left(\frac{3}{2}+\log\int_{\mathbb{R}^{d}}\mathrm{e}^{\lambda\left|x\right|^{2}}\bar{\pi}^{r}\left(\text{d} x\right)\right)\right)^{1/2}$. We fix $\lambda=1\wedge (\beta \bar{m}/4)$; then Lemma 5.2 along with Fatou’s lemma leads to
\begin{align*} C_{\bar{\pi}^{r}} &\le \left(\frac{16(\bar{b}+d/\beta)}{\bar{m}}+\frac{10}{1\wedge (\beta \bar{m}/4)}\right)^{1/2}\le C_{1}.\end{align*}
Lemma 5.7 yields the bound
\begin{align*} \mathcal{W}_{2}(\bar{\pi}^{r},\pi)\le C_{1}\max\left.\left\{y^{1/2},y^{1/4}\right\}\right|_{y=\beta r\omega_{\nabla U}(r)}.\end{align*}
-
Finally, combining the bounds in (1) and (3), we have
\begin{align*} \mathcal{W}_{2}(\mu_{k\eta},\bar{\nu}_{k\eta}^{r})+\mathcal{W}_{2}(\bar{\pi}^{r},\pi)&\le C_{1}\left(\max\left\{x^{1/2},x^{1/4}\right\}+\max\left\{y^{1/2},y^{1/4}\right\}\right)\\ &\le C_{1}\left(2\left(x+y\right)^{1/2}+2\left(x+y\right)^{1/4}\right),\end{align*}
where
\begin{align*} x=\left(C_{0}\frac{\omega_{\nabla U}(r)}{r}\eta+\beta\left(\delta_{r,2}\kappa_{\infty}+\delta_{r,0}\right)\right)k\eta,\qquad {y=\beta r\omega_{\nabla U}(r)}.\end{align*}
Hence, we obtain the desired conclusion.
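The resulting bound is fully explicit, so for given constants one can evaluate it as a function of the step size $\eta$, the mollification radius r, and the iteration count k. The Python sketch below transcribes the three terms of the decomposition; every constant is a placeholder to be computed from (A1)–(A6), and the sample values in the call are illustrative only.
\begin{verbatim}
import math

def w2_upper_bound(eta, r, k, C0, C1, C2, cP, beta,
                   delta_r2, delta_r0, kappa_inf, omega):
    # Steps (1) and (3) combined, plus step (2); omega(r) stands for omega_{grad U}(r).
    x = (C0 * omega(r) / r * eta + beta * (delta_r2 * kappa_inf + delta_r0)) * k * eta
    y = beta * r * omega(r)
    steps_1_and_3 = C1 * (2.0 * (x + y) ** 0.5 + 2.0 * (x + y) ** 0.25)
    step_2 = math.sqrt(2.0 * cP * C2) * math.exp(-k * eta / (2.0 * beta * cP))
    return steps_1_and_3 + step_2

print(w2_upper_bound(eta=1e-4, r=0.1, k=10**6, C0=1.0, C1=1.0, C2=1.0, cP=1.0,
                     beta=1.0, delta_r2=0.0, delta_r0=0.0, kappa_inf=1.0,
                     omega=lambda s: s))
\end{verbatim}
Such a transcription makes the trade-off explicit: decreasing $\eta$ and r shrinks x and y, but the exponential term then forces k upwards.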
Acknowledgements
The author thanks the two anonymous reviewers for their constructive feedback.
Funding information
The author was supported by JSPS KAKENHI Grant No. JP21K20318 and JST CREST Grants Nos. JPMJCR21D2 and JPMJCR2115.
Competing interests
The author declares that no competing interests arose during the preparation or publication of this article.
