Notations
 In this paper, vectors are denoted in boldface and matrix names in uppercase letters. If 
 $\mathbf{y}$
 is a vector and A is a matrix,
$\mathbf{y}$
 is a vector and A is a matrix, 
 $\text{Var}(\mathbf{y})$
 denotes the variance–covariance matrix associated with
$\text{Var}(\mathbf{y})$
 denotes the variance–covariance matrix associated with 
 $\mathbf{y}$
,
$\mathbf{y}$
, 
 $\textbf{diag}(A)$
 represents the diagonal of matrix A, and
$\textbf{diag}(A)$
 represents the diagonal of matrix A, and 
 $\text{Diag}(\mathbf{y})$
 is the diagonal matrix such that
$\text{Diag}(\mathbf{y})$
 is the diagonal matrix such that 
 $\textbf{diag}(\text{Diag}(\mathbf{y})) = \mathbf{y}$
. The sum of the diagonal elements of A is denoted as
$\textbf{diag}(\text{Diag}(\mathbf{y})) = \mathbf{y}$
. The sum of the diagonal elements of A is denoted as 
 $\text{tr}(A)$
 and its transpose as
$\text{tr}(A)$
 and its transpose as 
 $A^{T}$
. In the case where A is invertible,
$A^{T}$
. In the case where A is invertible, 
 $A^{- 1}$
 denotes its inverse and
$A^{- 1}$
 denotes its inverse and 
 $|A|$
 denotes the product of the eigenvalues of A. For a non-invertible matrix A,
$|A|$
 denotes the product of the eigenvalues of A. For a non-invertible matrix A, 
 $A^{-}$
 refers to the Moore–Penrose pseudo-inverse of A, and
$A^{-}$
 refers to the Moore–Penrose pseudo-inverse of A, and 
 $|A|_+$
 denotes the product of the non-zero eigenvalues of A. By writing the eigendecomposition as
$|A|_+$
 denotes the product of the non-zero eigenvalues of A. By writing the eigendecomposition as 
 $A = U\Sigma V^{T}$
, where U and V are orthogonal matrices and
$A = U\Sigma V^{T}$
, where U and V are orthogonal matrices and 
 $\Sigma$
 is a diagonal matrix containing the eigenvalues of A, and by denoting
$\Sigma$
 is a diagonal matrix containing the eigenvalues of A, and by denoting 
 $\Sigma^-$
 as the matrix obtained by replacing the non-zero eigenvalues in
$\Sigma^-$
 as the matrix obtained by replacing the non-zero eigenvalues in 
 $\Sigma$
 with their inverses leaving the zero eigenvalues unchanged, the pseudo-inverse is given by
$\Sigma$
 with their inverses leaving the zero eigenvalues unchanged, the pseudo-inverse is given by 
 $A^{-} = V\Sigma^{-}U^{T}$
. The Kronecker product of two matrices A and B is denoted as
$A^{-} = V\Sigma^{-}U^{T}$
. The Kronecker product of two matrices A and B is denoted as 
 $A \otimes B$
, and their Hadamard (element-wise) product is denoted as
$A \otimes B$
, and their Hadamard (element-wise) product is denoted as 
 $A \odot B$
.
$A \odot B$
. 
 $\lfloor x\rfloor$
 denotes the greatest integer less than or equal to
$\lfloor x\rfloor$
 denotes the greatest integer less than or equal to 
 $x\in\mathbb{R}$
. Finally, the symbol
$x\in\mathbb{R}$
. Finally, the symbol 
 $\propto$
 denotes proportionality between the expressions on both sides.
$\propto$
 denotes proportionality between the expressions on both sides.
1. Introduction
Whittaker–Henderson (WH) smoothing is a graduation method designed to mitigate the effects of sampling fluctuations in a vector of evenly spaced discrete observations. Although this method was originally proposed by Bohlmann (Reference Bohlmann1899), it is named after Whittaker (Reference Whittaker1923), who applied it to graduate mortality tables, and Henderson (Reference Henderson1924), who popularized it among actuaries in the United States. The method was later extended to two dimensions by Knorr (Reference Knorr1984). WH smoothing may be used to build experience tables for a broad spectrum of life insurance risks, such as mortality, disability, long-term care, lapse, mortgage default, and unemployment. We begin with a brief overview of the method before outlining the structure and main contributions of the paper.
1.1. A brief reminder of WH smoothing mathematical formulation
The one-dimensional case
 Let 
 $\mathbf{y}$
 be a vector of observations and
$\mathbf{y}$
 be a vector of observations and 
 $\mathbf{w}$
 a vector of positive weights, both of size n. The estimator associated with WH smoothing is given by
$\mathbf{w}$
 a vector of positive weights, both of size n. The estimator associated with WH smoothing is given by
 \begin{equation}{\hat{\mathbf{y}} = \underset{\boldsymbol{\theta}}{\text{argmin}}\{F(\mathbf{y},\mathbf{w},\boldsymbol{\theta}) + R_{\lambda,q}(\boldsymbol{\theta})\}}\end{equation}
\begin{equation}{\hat{\mathbf{y}} = \underset{\boldsymbol{\theta}}{\text{argmin}}\{F(\mathbf{y},\mathbf{w},\boldsymbol{\theta}) + R_{\lambda,q}(\boldsymbol{\theta})\}}\end{equation}
where:
- 
•  $F(\mathbf{y},\mathbf{w},\boldsymbol{\theta}) = \underset{i = 1}{\overset{n}{\sum}} w_i(y_i - \theta_i)^2$
 represents a fidelity criterion with respect to the observations, $F(\mathbf{y},\mathbf{w},\boldsymbol{\theta}) = \underset{i = 1}{\overset{n}{\sum}} w_i(y_i - \theta_i)^2$
 represents a fidelity criterion with respect to the observations,
- 
•  $R_{\lambda,q}(\boldsymbol{\theta}) = \lambda \underset{i = 1}{\overset{n - q}{\sum}} (\Delta^q\boldsymbol{\theta})_i^2$
 represents a smoothness criterion. $R_{\lambda,q}(\boldsymbol{\theta}) = \lambda \underset{i = 1}{\overset{n - q}{\sum}} (\Delta^q\boldsymbol{\theta})_i^2$
 represents a smoothness criterion.
 In the latter expression, 
 $\lambda \ge 0$
 is a smoothing parameter and
$\lambda \ge 0$
 is a smoothing parameter and 
 $\Delta^q$
 denotes the forward difference operator of order q, such that for any
$\Delta^q$
 denotes the forward difference operator of order q, such that for any 
 $i\in\{1,\ldots,n - q\}$
:
$i\in\{1,\ldots,n - q\}$
:
 \begin{align*}(\Delta^q\boldsymbol{\theta})_i = \underset{k = 0}{\overset{q}{\sum}} \begin{pmatrix}q \\ k\end{pmatrix}(\!-\!1)^{q - k} \theta_{i + k}.\end{align*}
\begin{align*}(\Delta^q\boldsymbol{\theta})_i = \underset{k = 0}{\overset{q}{\sum}} \begin{pmatrix}q \\ k\end{pmatrix}(\!-\!1)^{q - k} \theta_{i + k}.\end{align*} 
 Define 
 $W = \text{Diag}(\mathbf{w})$
, the diagonal matrix of weights, and
$W = \text{Diag}(\mathbf{w})$
, the diagonal matrix of weights, and 
 $D_{n,q}$
 as the order q difference matrix of dimensions
$D_{n,q}$
 as the order q difference matrix of dimensions 
 $(n-q) \times n$
, such that
$(n-q) \times n$
, such that 
 $(D_{n,q}\boldsymbol{\theta})_i = (\Delta^q\boldsymbol{\theta})_i$
 for all
$(D_{n,q}\boldsymbol{\theta})_i = (\Delta^q\boldsymbol{\theta})_i$
 for all 
 $i \in [1, n-q]$
. The first- and second-order difference matrices are given by
$i \in [1, n-q]$
. The first- and second-order difference matrices are given by 
 \begin{align*}D_{n,1} = \begin{bmatrix}-1 &\quad 1 &\quad 0 &\quad \ldots &\quad 0 \\0 &\quad -1 &\quad 1 &\quad \ddots &\quad \vdots \\\vdots &\quad \ddots &\quad \ddots &\quad \ddots &\quad 0 \\0 &\quad \ldots &\quad 0 &\quad -1 &\quad 1 \\\end{bmatrix}\quad\text{and}\quad D_{n,2} = \begin{bmatrix}1 &\quad -2 &\quad 1 &\quad 0 &\quad \ldots &\quad 0 \\0 &\quad 1 &\quad -2 &\quad 1 &\quad \ddots &\quad \vdots \\\vdots &\quad \ddots &\quad \ddots &\quad \ddots &\quad \ddots &\quad 0 \\0 &\quad \ldots &\quad 0 &\quad 1 &\quad -2 &\quad 1 \\\end{bmatrix}.\end{align*}
\begin{align*}D_{n,1} = \begin{bmatrix}-1 &\quad 1 &\quad 0 &\quad \ldots &\quad 0 \\0 &\quad -1 &\quad 1 &\quad \ddots &\quad \vdots \\\vdots &\quad \ddots &\quad \ddots &\quad \ddots &\quad 0 \\0 &\quad \ldots &\quad 0 &\quad -1 &\quad 1 \\\end{bmatrix}\quad\text{and}\quad D_{n,2} = \begin{bmatrix}1 &\quad -2 &\quad 1 &\quad 0 &\quad \ldots &\quad 0 \\0 &\quad 1 &\quad -2 &\quad 1 &\quad \ddots &\quad \vdots \\\vdots &\quad \ddots &\quad \ddots &\quad \ddots &\quad \ddots &\quad 0 \\0 &\quad \ldots &\quad 0 &\quad 1 &\quad -2 &\quad 1 \\\end{bmatrix}.\end{align*} 
while higher-order difference matrices follow the recursive formula 
 $D_{n,q} = D_{n - 1,q - 1}D_{n,1}$
. The fidelity and smoothness criteria can be rewritten with matrix notations as
$D_{n,q} = D_{n - 1,q - 1}D_{n,1}$
. The fidelity and smoothness criteria can be rewritten with matrix notations as
 \begin{align*}F(\mathbf{y},\mathbf{w},\boldsymbol{\theta}) = (\mathbf{y} - \boldsymbol{\theta})^TW(\mathbf{y} - \boldsymbol{\theta}) \quad \text{and} \quad R_{\lambda,q}(\boldsymbol{\theta}) = \lambda\boldsymbol{\theta}^TD_{n,q}^TD_{n,q}\boldsymbol{\theta}\end{align*}
\begin{align*}F(\mathbf{y},\mathbf{w},\boldsymbol{\theta}) = (\mathbf{y} - \boldsymbol{\theta})^TW(\mathbf{y} - \boldsymbol{\theta}) \quad \text{and} \quad R_{\lambda,q}(\boldsymbol{\theta}) = \lambda\boldsymbol{\theta}^TD_{n,q}^TD_{n,q}\boldsymbol{\theta}\end{align*} 
and the WH smoothing estimator thus becomes
 \begin{equation}{\hat{\mathbf{y}} = \underset{\boldsymbol{\theta}}{\text{argmin}} \left\lbrace(\mathbf{y} - \boldsymbol{\theta})^TW(\mathbf{y} - \boldsymbol{\theta}) + \boldsymbol{\theta}^TP_{\lambda}\boldsymbol{\theta}\right\rbrace}\end{equation}
\begin{equation}{\hat{\mathbf{y}} = \underset{\boldsymbol{\theta}}{\text{argmin}} \left\lbrace(\mathbf{y} - \boldsymbol{\theta})^TW(\mathbf{y} - \boldsymbol{\theta}) + \boldsymbol{\theta}^TP_{\lambda}\boldsymbol{\theta}\right\rbrace}\end{equation}
where 
 $P_{\lambda} = \lambda D_{n,q}^TD_{n,q}$
.
$P_{\lambda} = \lambda D_{n,q}^TD_{n,q}$
.
The two-dimensional case
 In the two-dimensional case, consider a matrix Y of observations and a matrix 
 $\Omega$
 of non-negative weights, both of dimensions
$\Omega$
 of non-negative weights, both of dimensions 
 $n_x \times n_z$
. The WH smoothing estimator solves:
$n_x \times n_z$
. The WH smoothing estimator solves:
 \begin{align*}\widehat{Y} = \underset{\Theta}{\text{argmin}}\{F(Y,\Omega, \Theta) + R_{\lambda,q}(\Theta)\}\end{align*}
\begin{align*}\widehat{Y} = \underset{\Theta}{\text{argmin}}\{F(Y,\Omega, \Theta) + R_{\lambda,q}(\Theta)\}\end{align*} 
where:
- 
•  $F(Y,\Omega, \Theta) = \sum_{i = 1}^{n_x}\sum_{j = 1}^{n_z} \Omega_{i,j}(Y_{i,j} - \Theta_{i,j})^2$
 represents a fidelity criterion with respect to the observations, $F(Y,\Omega, \Theta) = \sum_{i = 1}^{n_x}\sum_{j = 1}^{n_z} \Omega_{i,j}(Y_{i,j} - \Theta_{i,j})^2$
 represents a fidelity criterion with respect to the observations,
- 
•  $R_{\lambda,q}(\Theta) = \lambda_x \sum_{j = 1}^{n_z}\sum_{i = 1}^{n_x - q_x} (\Delta^{q_x}\Theta_{\bullet,j})_i^2 + \lambda_z \sum_{i = 1}^{n_x}\sum_{j = 1}^{n_z - q_z} (\Delta^{q_z}\Theta_{i,\bullet})_j^2$
 is a smoothness criterion with $R_{\lambda,q}(\Theta) = \lambda_x \sum_{j = 1}^{n_z}\sum_{i = 1}^{n_x - q_x} (\Delta^{q_x}\Theta_{\bullet,j})_i^2 + \lambda_z \sum_{i = 1}^{n_x}\sum_{j = 1}^{n_z - q_z} (\Delta^{q_z}\Theta_{i,\bullet})_j^2$
 is a smoothness criterion with $\lambda = (\lambda_x,\lambda_z)$
. $\lambda = (\lambda_x,\lambda_z)$
.
 This latter criterion adds row-wise and column-wise regularization criteria to 
 $\Theta$
, with respective orders
$\Theta$
, with respective orders 
 $q_x$
 and
$q_x$
 and 
 $q_z$
, weighted by non-negative smoothing parameters
$q_z$
, weighted by non-negative smoothing parameters 
 $\lambda_x$
 and
$\lambda_x$
 and 
 $\lambda_z$
. In matrix notation, let
$\lambda_z$
. In matrix notation, let 
 $\mathbf{y} = \textbf{vec}(Y)$
,
$\mathbf{y} = \textbf{vec}(Y)$
, 
 $\mathbf{w} = \textbf{vec}(\Omega)$
, and
$\mathbf{w} = \textbf{vec}(\Omega)$
, and 
 $\boldsymbol{\theta} = \textbf{vec}(\Theta)$
 as the vectors obtained by stacking the columns of the matrices Y,
$\boldsymbol{\theta} = \textbf{vec}(\Theta)$
 as the vectors obtained by stacking the columns of the matrices Y, 
 $\Omega$
, and
$\Omega$
, and 
 $\Theta$
, respectively. Additionally, denote
$\Theta$
, respectively. Additionally, denote 
 $W = \text{Diag}(\mathbf{w})$
 and
$W = \text{Diag}(\mathbf{w})$
 and 
 $n = n_x \times n_z$
. The fidelity and smoothness criteria become
$n = n_x \times n_z$
. The fidelity and smoothness criteria become
 \begin{align*}\begin{aligned}F(\mathbf{y},\mathbf{w}, \boldsymbol{\theta}) &= (\mathbf{y} - \boldsymbol{\theta})^TW(\mathbf{y} - \boldsymbol{\theta}) \\R_{\lambda,q}(\boldsymbol{\theta}) &= \boldsymbol{\theta}^{T}(\lambda_x I_{n_z} \otimes D_{n_x,q_x}^{T}D_{n_x,q_x} + \lambda_z D_{n_z,q_z}^{T}D_{n_z,q_z} \otimes I_{n_x}) \boldsymbol{\theta}\end{aligned}\end{align*}
\begin{align*}\begin{aligned}F(\mathbf{y},\mathbf{w}, \boldsymbol{\theta}) &= (\mathbf{y} - \boldsymbol{\theta})^TW(\mathbf{y} - \boldsymbol{\theta}) \\R_{\lambda,q}(\boldsymbol{\theta}) &= \boldsymbol{\theta}^{T}(\lambda_x I_{n_z} \otimes D_{n_x,q_x}^{T}D_{n_x,q_x} + \lambda_z D_{n_z,q_z}^{T}D_{n_z,q_z} \otimes I_{n_x}) \boldsymbol{\theta}\end{aligned}\end{align*} 
and the associated estimator also takes the form of Equation 1.2 except in this case
 \begin{align*}P_{\lambda} = \lambda_x I_{n_z} \otimes D_{n_x,q_x}^{T}D_{n_x,q_x} + \lambda_z D_{n_z,q_z}^{T}D_{n_z,q_z} \otimes I_{n_x}.\end{align*}
\begin{align*}P_{\lambda} = \lambda_x I_{n_z} \otimes D_{n_x,q_x}^{T}D_{n_x,q_x} + \lambda_z D_{n_z,q_z}^{T}D_{n_z,q_z} \otimes I_{n_x}.\end{align*} 
Extension to higher dimensions is straightforward and not discussed here.
An explicit solution
 If 
 $W + P_{\lambda}$
 is invertible, Equation (1.2) admits the closed-form solution:
$W + P_{\lambda}$
 is invertible, Equation (1.2) admits the closed-form solution:
 \begin{equation}{\hat{\mathbf{y}} = (W + P_{\lambda})^{-1}W\mathbf{y}.}\end{equation}
\begin{equation}{\hat{\mathbf{y}} = (W + P_{\lambda})^{-1}W\mathbf{y}.}\end{equation}
 Indeed, as a minimum, 
 $\hat{\mathbf{y}}$
 satisfies:
$\hat{\mathbf{y}}$
 satisfies:
 \begin{align*}0 = \left.\frac{\partial}{\partial \boldsymbol{\theta}}\right|_{\hat{\mathbf{y}}}\left\lbrace(\mathbf{y} - \boldsymbol{\theta})^{T}W(\mathbf{y} - \boldsymbol{\theta}) + \boldsymbol{\theta}^{T}P_{\lambda}\boldsymbol{\theta}\right\rbrace = - 2 W(y - \hat{\mathbf{y}}) +2P_{\lambda} \hat{\mathbf{y}}.\end{align*}
\begin{align*}0 = \left.\frac{\partial}{\partial \boldsymbol{\theta}}\right|_{\hat{\mathbf{y}}}\left\lbrace(\mathbf{y} - \boldsymbol{\theta})^{T}W(\mathbf{y} - \boldsymbol{\theta}) + \boldsymbol{\theta}^{T}P_{\lambda}\boldsymbol{\theta}\right\rbrace = - 2 W(y - \hat{\mathbf{y}}) +2P_{\lambda} \hat{\mathbf{y}}.\end{align*} 
 It follows that 
 $(W + P_{\lambda})\hat{\mathbf{y}} = W\mathbf{y}$
, proving Equation (1.3). If
$(W + P_{\lambda})\hat{\mathbf{y}} = W\mathbf{y}$
, proving Equation (1.3). If 
 $\lambda \neq 0$
,
$\lambda \neq 0$
, 
 $W + P_{\lambda}$
 is invertible as long as
$W + P_{\lambda}$
 is invertible as long as 
 $\mathbf{w}$
 has q non-zero elements in the one-dimensional case and
$\mathbf{w}$
 has q non-zero elements in the one-dimensional case and 
 $\Omega$
 has at least
$\Omega$
 has at least 
 $q_x \times q_z$
 non-zero elements spread across
$q_x \times q_z$
 non-zero elements spread across 
 $q_x$
 different rows and
$q_x$
 different rows and 
 $q_z$
 different columns in the two-dimensional case. These conditions are always met in real datasets.
$q_z$
 different columns in the two-dimensional case. These conditions are always met in real datasets.
1.2. Structure of the paper
Introduced a century ago, WH smoothing remains widely used by actuaries, particularly in France and North America (Canadian Institute of Actuaries, 2017; Society of Actuaries, 2018). Other non-parametric smoothing methods have since emerged, notably spline-based techniques (Reinsch, Reference Reinsch1967), which gained even greater popularity with P-splines (Eilers and Marx, Reference Eilers and Marx1996). A broader overview of alternative smoothers is available in Wood (Reference Wood2017, chap. 5).
For evenly spaced discrete observations, WH smoothing may be considered a particular case of P-splines with degree-zero splines and identity model matrix. Its appeal lies in its simplicity: no selection of knots, parameters equal to fitted values, and shape controlled solely via penalization. However, it involves more parameters than low-rank smoothers, making it more computationally intensive.
Originally proposed as an empirical alternative to polynomial regression and weighted averages, WH smoothing offered key benefits noted by Whittaker (Reference Whittaker1923): first q moment preservation, adjustable smoothing parameters, and robustness at boundaries. While smoothing theory has evolved – particularly via generalized additive models (Hastie and Tibshirani, Reference Hastie and Tibshirani1990), use of WH smoothing by actuaries remains largely unchanged. This paper reinterprets WH within modern statistical theory to bridge that gap and address six practical questions, each discussed in a dedicated section.
How to measure uncertainty in smoothing results?
We propose a method to quantify the uncertainty in WH smoothing based on data volume, a topic that has received little attention in the literature. In a frequentist framework, the WH estimator is biased, which complicates the construction of valid confidence intervals for finite samples. However, under certain conditions, WH smoothing can be viewed as a Bayesian model, enabling the derivation of credible intervals. This Bayesian interpretation was originally suggested by Whittaker (Reference Whittaker1923) as a justification for the method and formally revisited decades later by Taylor (Reference Taylor1992). In this section, we build on that equivalence to derive credible intervals for WH smoothing.
Which observation and weight vectors to use?
 For the Bayesian interpretation of WH smoothing discussed in Section 2 to hold, it must be applied to a vector 
 $\mathbf{y}$
 of independent, normally distributed observations with known variances. The weight vector
$\mathbf{y}$
 of independent, normally distributed observations with known variances. The weight vector 
 $\mathbf{w}$
 should then contain the inverse variances (up to a constant), as noted by Taylor (Reference Taylor1992) and Verrall (Reference Verrall1993). We show that, under piecewise constant transition intensities in duration models, the maximum likelihood estimator of crude rates produces vectors
$\mathbf{w}$
 should then contain the inverse variances (up to a constant), as noted by Taylor (Reference Taylor1992) and Verrall (Reference Verrall1993). We show that, under piecewise constant transition intensities in duration models, the maximum likelihood estimator of crude rates produces vectors 
 $(\mathbf{y}, \mathbf{w})$
 that asymptotically meet these conditions. This, combined with the results from the previous section, offers a statistical foundation for the use of WH smoothing in constructing experience tables for life insurance risks.
$(\mathbf{y}, \mathbf{w})$
 that asymptotically meet these conditions. This, combined with the results from the previous section, offers a statistical foundation for the use of WH smoothing in constructing experience tables for life insurance risks.
How to improve the accuracy of smoothing with limited data volume?
The standard approach applies WH smoothing to crude rate estimates, assuming they are asymptotically normal. However, this assumption often breaks down in practice when data are limited, making the method unreliable in such cases. Following Verrall (Reference Verrall1993), we propose a generalization of WH smoothing that replaces the two-step procedure with the direct maximization of a penalized log-likelihood. Instead of smoothing pre-estimated rates, this method works directly with aggregated event and exposure counts. The estimation is performed iteratively using the PIRLS algorithm. We evaluate both methods on simulated datasets reflecting typical life insurance portfolios. Results show that, in smaller samples, the normal approximation in the traditional method introduces notable bias. This supports the use of the generalized approach – based on penalized log-likelihood – as a more robust alternative when data are limited.
How to select the smoothing parameters?
 We now turn to the crucial choice of the smoothing parameter 
 $\lambda$
, which has long been left to actuarial judgement. Giesecke and Center (1981) suggested choosing
$\lambda$
, which has long been left to actuarial judgement. Giesecke and Center (1981) suggested choosing 
 $\lambda$
 so that the variance of the smoothed results matches the average variance of a Chi-square statistic, but uses
$\lambda$
 so that the variance of the smoothed results matches the average variance of a Chi-square statistic, but uses 
 $n - q$
 as degrees of freedom, thus ignoring the reduction in effective model dimension due to penalization. Brooks et al. (Reference Brooks, Stone, Chan and Chan1988) minimized the global cross-validation criterion introduced by Wahba (Reference Wahba1980), though this can result in severe under-smoothing as noted by Wood (Reference Wood2011).
$n - q$
 as degrees of freedom, thus ignoring the reduction in effective model dimension due to penalization. Brooks et al. (Reference Brooks, Stone, Chan and Chan1988) minimized the global cross-validation criterion introduced by Wahba (Reference Wahba1980), though this can result in severe under-smoothing as noted by Wood (Reference Wood2011).
 We instead propose to select 
 $\lambda$
 by maximizing a marginal likelihood function, a method first introduced by Patterson and Thompson (Reference Patterson and Thompson1971) and later applied to smoothing parameter selection by Anderssen and Bloomfield (Reference Anderssen and Bloomfield1974). This approach is consistent with the Bayesian framework discussed earlier and performs well in small samples, as shown by Reiss and Todd Ogden (Reference Reiss and Todd Ogden2009). This marginal likelihood function has a closed-form expression and can be maximized numerically. For the proposed generalization of WH smoothing, the marginal likelihood is no longer available in closed form. Instead, we rely on the Laplace approximation of the marginal likelihood (LAML), which can be maximized numerically. As both solving likelihood equations and selecting the optimal smoothing parameter are iterative processes, we explore different ways of nesting these iterations. We compare three nesting strategies combined with three numerical optimization algorithms for maximizing the marginal likelihood or LAML. Simulation results show that all strategies have near-optimal accuracy, with the fastest performance achieved using the outer iteration strategy combined with the Newton algorithm
$\lambda$
 by maximizing a marginal likelihood function, a method first introduced by Patterson and Thompson (Reference Patterson and Thompson1971) and later applied to smoothing parameter selection by Anderssen and Bloomfield (Reference Anderssen and Bloomfield1974). This approach is consistent with the Bayesian framework discussed earlier and performs well in small samples, as shown by Reiss and Todd Ogden (Reference Reiss and Todd Ogden2009). This marginal likelihood function has a closed-form expression and can be maximized numerically. For the proposed generalization of WH smoothing, the marginal likelihood is no longer available in closed form. Instead, we rely on the Laplace approximation of the marginal likelihood (LAML), which can be maximized numerically. As both solving likelihood equations and selecting the optimal smoothing parameter are iterative processes, we explore different ways of nesting these iterations. We compare three nesting strategies combined with three numerical optimization algorithms for maximizing the marginal likelihood or LAML. Simulation results show that all strategies have near-optimal accuracy, with the fastest performance achieved using the outer iteration strategy combined with the Newton algorithm
How to improve smoothing computational efficiency?
When the number of observations – and thus parameters – is large, the computational cost of WH smoothing becomes a major challenge. This is particularly relevant in actuarial contexts, such as smoothing two-dimensional tables for disability or long-term care modelling. Beyond actuarial applications, WH smoothing is also widely used in economics for long time series, where it is known as the Hodrick-Prescott filter (Hodrick and Prescott, Reference Hodrick and Prescott1997). Although fast algorithms have been developed to exploit the structure of the penalization matrix (e.g., Weinert, Reference Weinert2007; Cornea-Madeira, Reference Cornea-Madeira2017) they are typically limited to the one-dimensional case and cannot be directly extended to two dimensions.
After briefly outlining the main computational steps of (generalized) WH smoothing-including smoothing parameter selection via marginal likelihood or LAML-and their leading-order costs, we introduce two complementary strategies to reduce the computational burden:
- 
1. Banded matrix exploitation: WH smoothing involves model and penalization matrices with banded structure. Taking advantage of this structure greatly accelerates key computations. 
- 
2. Reduced-rank basis via natural parametrization: Building on the work of Demmler and Reinsch (Reference Demmler and Reinsch1975), we apply an eigendecomposition to the one-dimensional penalization matrices and drop components associated with the largest eigenvalues, which reduces the problem size. In two dimensions, we further improve efficiency using the generalized linear array model (GLAM) framework (Currie et al., Reference Currie, Durban and Eilers2006) which leverages the rectangular shape of the data. 
In the two-dimensional case, we compare these strategies with a cubic P-spline alternative using simulated datasets. Results show that the banded implementation reduces computation time by up to a factor of 25. The reduced-rank approach brings further gains – up to a factor of 250 – at the cost of a slight reduction in accuracy. Its performance is comparable to P-spline smoothing with a cubic basis of similar size.
How to extrapolate smoothing results?
We conclude by addressing how to extrapolate smoothing results. Semi-parametric models like WH and P-splines can extrapolate beyond the observed data – similar to parametric models – but this feature is often overlooked in actuarial practice. The existing literature is limited and mostly focused on mortality forecasting.
Currie et al. (Reference Currie, Durban and Eilers2004) uses P-splines to fit and forecast mortality rates by treating the extrapolated positions as zero-weight observations (see also Delwarde et al., Reference Delwarde, Denuit and Eilers2007; Currie, Reference Currie2013). While this works well in one dimension, Carballo et al. (Reference Carballo, Durban and Lee2021) showed that it distorts the fit in two dimensions. To fix this, they proposed adding constraints to preserve the values that would result from fitting the observed data alone.
However, their approach to confidence intervals overlooks potential innovation error beyond the observed data, effectively treating the extrapolated process as perfectly smooth. In contrast, we propose an approach that derives credible intervals for extrapolated values, accounting for the underlying variability beyond the observed data range.
2. How to measure uncertainty in smoothing results?
 The explicit solution given by Equation (1.3) indicates that 
 $\mathbb{E}(\hat{\mathbf{y}}) = (W + P_{\lambda})^{-1}W\mathbb{E}(\mathbf{y}) \ne \mathbb{E}(\mathbf{y})$
 when
$\mathbb{E}(\hat{\mathbf{y}}) = (W + P_{\lambda})^{-1}W\mathbb{E}(\mathbf{y}) \ne \mathbb{E}(\mathbf{y})$
 when 
 $\lambda \ne 0$
. This implies that penalization introduces a smoothing bias, which prevents the construction of confidence intervals for finite samples centred on
$\lambda \ne 0$
. This implies that penalization introduces a smoothing bias, which prevents the construction of confidence intervals for finite samples centred on 
 $\mathbb{E}(\mathbf{y})$
. Therefore, in this section, we turn to a Bayesian framework where smoothing can be interpreted more naturally.
$\mathbb{E}(\mathbf{y})$
. Therefore, in this section, we turn to a Bayesian framework where smoothing can be interpreted more naturally.
2.1. Maximum a posteriori estimate
 Suppose that 
 $\mathbf{y} \mid \boldsymbol{\theta}\sim \mathcal{N}(\boldsymbol{\theta}, \sigma^2W^{-})$
 and
$\mathbf{y} \mid \boldsymbol{\theta}\sim \mathcal{N}(\boldsymbol{\theta}, \sigma^2W^{-})$
 and 
 $\boldsymbol{\theta} \sim \mathcal{N}(0, \sigma^2P_{\lambda}^{-})$
 for some
$\boldsymbol{\theta} \sim \mathcal{N}(0, \sigma^2P_{\lambda}^{-})$
 for some 
 $\sigma \gt 0$
. The Bayes formula allows us to express the posterior likelihood
$\sigma \gt 0$
. The Bayes formula allows us to express the posterior likelihood 
 $f(\boldsymbol{\theta} \mid \mathbf{y})$
 associated with these choices in the following form:
$f(\boldsymbol{\theta} \mid \mathbf{y})$
 associated with these choices in the following form:
 \begin{align*}f(\boldsymbol{\theta} \mid \mathbf{y}) \propto f(\mathbf{y} \mid \boldsymbol{\theta}) f(\boldsymbol{\theta}) \propto \exp\left(\!-\!\frac{1}{2\sigma^2}\left[(\mathbf{y} - \boldsymbol{\theta})^{T}W(\mathbf{y} - \boldsymbol{\theta}) + \boldsymbol{\theta}^TP_{\lambda}\boldsymbol{\theta}\right]\right).\end{align*}
\begin{align*}f(\boldsymbol{\theta} \mid \mathbf{y}) \propto f(\mathbf{y} \mid \boldsymbol{\theta}) f(\boldsymbol{\theta}) \propto \exp\left(\!-\!\frac{1}{2\sigma^2}\left[(\mathbf{y} - \boldsymbol{\theta})^{T}W(\mathbf{y} - \boldsymbol{\theta}) + \boldsymbol{\theta}^TP_{\lambda}\boldsymbol{\theta}\right]\right).\end{align*} 
 Hence the mode of the posterior distribution, 
 $\hat{\boldsymbol{\theta}} = \text{argmax} [f(\boldsymbol{\theta} \mid \mathbf{y})]$
, also known as the maximum a posteriori (MAP) estimate, coincides with the solution
$\hat{\boldsymbol{\theta}} = \text{argmax} [f(\boldsymbol{\theta} \mid \mathbf{y})]$
, also known as the maximum a posteriori (MAP) estimate, coincides with the solution 
 $\hat{\mathbf{y}}$
 from Equation (1.2), whose explicit form is given by Equation (1.3).
$\hat{\mathbf{y}}$
 from Equation (1.2), whose explicit form is given by Equation (1.3).
2.2. Posterior distribution of
 
 $\boldsymbol{\theta} \mid \mathbf{y}$
$\boldsymbol{\theta} \mid \mathbf{y}$
 A second-order Taylor expansion of the log-posterior likelihood around 
 $\hat{\mathbf{y}} = \hat{\boldsymbol{\theta}}$
 gives us:
$\hat{\mathbf{y}} = \hat{\boldsymbol{\theta}}$
 gives us:
 \begin{equation}{\ln f(\boldsymbol{\theta} \mid \mathbf{y}) = \ln f(\hat{\boldsymbol{\theta}} \mid \mathbf{y}) + \left.\frac{\partial \ln f(\boldsymbol{\theta} \mid \mathbf{y})}{\partial \boldsymbol{\theta}}\right|_{\boldsymbol{\theta} = \hat{\boldsymbol{\theta}}}^{T}(\boldsymbol{\theta} - \hat{\boldsymbol{\theta}}) + \frac{1}{2}(\boldsymbol{\theta} - \hat{\boldsymbol{\theta}})^{T} \left.\frac{\partial^2 \ln f(\boldsymbol{\theta} \mid \mathbf{y})}{\partial \boldsymbol{\theta} \partial \boldsymbol{\theta}^{T}}\right|_{\boldsymbol{\theta} = \hat{\boldsymbol{\theta}}}(\boldsymbol{\theta} - \hat{\boldsymbol{\theta}})}\end{equation}
\begin{equation}{\ln f(\boldsymbol{\theta} \mid \mathbf{y}) = \ln f(\hat{\boldsymbol{\theta}} \mid \mathbf{y}) + \left.\frac{\partial \ln f(\boldsymbol{\theta} \mid \mathbf{y})}{\partial \boldsymbol{\theta}}\right|_{\boldsymbol{\theta} = \hat{\boldsymbol{\theta}}}^{T}(\boldsymbol{\theta} - \hat{\boldsymbol{\theta}}) + \frac{1}{2}(\boldsymbol{\theta} - \hat{\boldsymbol{\theta}})^{T} \left.\frac{\partial^2 \ln f(\boldsymbol{\theta} \mid \mathbf{y})}{\partial \boldsymbol{\theta} \partial \boldsymbol{\theta}^{T}}\right|_{\boldsymbol{\theta} = \hat{\boldsymbol{\theta}}}(\boldsymbol{\theta} - \hat{\boldsymbol{\theta}})}\end{equation}
 \begin{align*}\text{where} \quad \left.\frac{\partial \ln f(\boldsymbol{\theta} \mid \mathbf{y})}{\partial \boldsymbol{\theta}}\right|_{\boldsymbol{\theta} = \hat{\boldsymbol{\theta}}} = 0 \quad \text{and} \quad \left.\frac{\partial^2 \ln f(\boldsymbol{\theta} \mid \mathbf{y})}{\partial \boldsymbol{\theta} \partial \boldsymbol{\theta}^{T}}\right|_{\boldsymbol{\theta} = \hat{\boldsymbol{\theta}}} = - \frac{1}{\sigma^2}(W + P_{\lambda}).\end{align*}
\begin{align*}\text{where} \quad \left.\frac{\partial \ln f(\boldsymbol{\theta} \mid \mathbf{y})}{\partial \boldsymbol{\theta}}\right|_{\boldsymbol{\theta} = \hat{\boldsymbol{\theta}}} = 0 \quad \text{and} \quad \left.\frac{\partial^2 \ln f(\boldsymbol{\theta} \mid \mathbf{y})}{\partial \boldsymbol{\theta} \partial \boldsymbol{\theta}^{T}}\right|_{\boldsymbol{\theta} = \hat{\boldsymbol{\theta}}} = - \frac{1}{\sigma^2}(W + P_{\lambda}).\end{align*} 
 As this last derivative no longer depends on 
 $\boldsymbol{\theta}$
, higher-order derivatives are all zero. The Taylor expansion allows for an exact computation of
$\boldsymbol{\theta}$
, higher-order derivatives are all zero. The Taylor expansion allows for an exact computation of 
 $\ln f(\boldsymbol{\theta} \mid \mathbf{y})$
. Substituting the result back into Equation (2.1) yields:
$\ln f(\boldsymbol{\theta} \mid \mathbf{y})$
. Substituting the result back into Equation (2.1) yields:
 \begin{align*}\begin{aligned}f(\boldsymbol{\theta} \mid \mathbf{y}) &\propto \exp\left[\ln f(\hat{\boldsymbol{\theta}} \mid \mathbf{y}) - \frac{1}{2\sigma^2} (\boldsymbol{\theta} - \hat{\boldsymbol{\theta}})^{T}(W + P_{\lambda})(\boldsymbol{\theta} - \hat{\boldsymbol{\theta}})\right] \\&\propto \exp\left[- \frac{1}{2\sigma^2} (\boldsymbol{\theta} - \hat{\boldsymbol{\theta}})^{T}(W + P_{\lambda})(\boldsymbol{\theta} - \hat{\boldsymbol{\theta}})\right]\end{aligned}\end{align*}
\begin{align*}\begin{aligned}f(\boldsymbol{\theta} \mid \mathbf{y}) &\propto \exp\left[\ln f(\hat{\boldsymbol{\theta}} \mid \mathbf{y}) - \frac{1}{2\sigma^2} (\boldsymbol{\theta} - \hat{\boldsymbol{\theta}})^{T}(W + P_{\lambda})(\boldsymbol{\theta} - \hat{\boldsymbol{\theta}})\right] \\&\propto \exp\left[- \frac{1}{2\sigma^2} (\boldsymbol{\theta} - \hat{\boldsymbol{\theta}})^{T}(W + P_{\lambda})(\boldsymbol{\theta} - \hat{\boldsymbol{\theta}})\right]\end{aligned}\end{align*} 
which can immediately be recognized as the density of the 
 $\mathcal{N}(\hat{\boldsymbol{\theta}},\sigma^2(W + P_{\lambda})^{- 1})$
 distribution.
$\mathcal{N}(\hat{\boldsymbol{\theta}},\sigma^2(W + P_{\lambda})^{- 1})$
 distribution.
2.3. Consequence for the WH smoothing
 The prior 
 $\boldsymbol{\theta} \sim \mathcal{N}(0, \sigma^2P_{\lambda}^{-})$
 provides a Bayesian interpretation of the smoothness penalty, expressing an (improper) prior belief about the structure of
$\boldsymbol{\theta} \sim \mathcal{N}(0, \sigma^2P_{\lambda}^{-})$
 provides a Bayesian interpretation of the smoothness penalty, expressing an (improper) prior belief about the structure of 
 $\mathbf{y}$
.
$\mathbf{y}$
.
 This Bayesian framework and the resulting credible intervals rely on the assumption that 
 $\mathbf{y} \mid \boldsymbol{\theta}\sim \mathcal{N}(\boldsymbol{\theta}, \sigma^2W^{-})$
, meaning that the components of
$\mathbf{y} \mid \boldsymbol{\theta}\sim \mathcal{N}(\boldsymbol{\theta}, \sigma^2W^{-})$
, meaning that the components of 
 $\mathbf{y}$
 are independent with known variances (up to a constant
$\mathbf{y}$
 are independent with known variances (up to a constant 
 $\sigma^2$
). The weight vector
$\sigma^2$
). The weight vector 
 $\mathbf{w}$
 must then be proportional to the inverse variances, not chosen empirically. If
$\mathbf{w}$
 must then be proportional to the inverse variances, not chosen empirically. If 
 $\sigma^2$
 is known,
$\sigma^2$
 is known, 
 $100(1 - \alpha) \%$
 credible intervals take the form:
$100(1 - \alpha) \%$
 credible intervals take the form:
 \begin{equation}{\mathbb{E}(\mathbf{y}) \mid \mathbf{y} \in \left[\hat{\mathbf{y}} \pm \Phi^{- 1}\left(1 - \alpha / 2\right)\sqrt{\sigma^2\textbf{diag}\left\lbrace(W + P_{\lambda})^{-1}\right\rbrace}\right]}\end{equation}
\begin{equation}{\mathbb{E}(\mathbf{y}) \mid \mathbf{y} \in \left[\hat{\mathbf{y}} \pm \Phi^{- 1}\left(1 - \alpha / 2\right)\sqrt{\sigma^2\textbf{diag}\left\lbrace(W + P_{\lambda})^{-1}\right\rbrace}\right]}\end{equation}
where 
 $\hat{\mathbf{y}} = (W + P_{\lambda})^{- 1}W\mathbf{y}$
 and
$\hat{\mathbf{y}} = (W + P_{\lambda})^{- 1}W\mathbf{y}$
 and 
 $\Phi$
 is the cumulative distribution function for the standard normal distribution. According to Marra and Wood (Reference Marra and Wood2012), such intervals have good Frequentist coverage.
$\Phi$
 is the cumulative distribution function for the standard normal distribution. According to Marra and Wood (Reference Marra and Wood2012), such intervals have good Frequentist coverage.
 If 
 $\sigma^2$
 is unknown, it can be estimated as
$\sigma^2$
 is unknown, it can be estimated as
 \begin{align*}\hat{\sigma}^2 = \frac{(\mathbf{y} - \hat{\mathbf{y}})^TW(\mathbf{y} - \hat{\mathbf{y}})}{n - \text{tr}(H)}\quad \text{where}\quad H = (W + P_{\lambda})^{- 1}W.\end{align*}
\begin{align*}\hat{\sigma}^2 = \frac{(\mathbf{y} - \hat{\mathbf{y}})^TW(\mathbf{y} - \hat{\mathbf{y}})}{n - \text{tr}(H)}\quad \text{where}\quad H = (W + P_{\lambda})^{- 1}W.\end{align*} 
 In that case, 
 $\sigma^2$
 is replaced by
$\sigma^2$
 is replaced by 
 $\hat{\sigma}^2$
 and the normal distribution in Equation (2.2) by the Student t - distribution with
$\hat{\sigma}^2$
 and the normal distribution in Equation (2.2) by the Student t - distribution with 
 $n - \text{tr}(H)$
 degrees of freedom.
$n - \text{tr}(H)$
 degrees of freedom.
3. Which observation and weight vectors to use?
 Section 2 highlighted that WH smoothing may be interpreted in a robust statistical framework when applied to a vector 
 $\mathbf{y}$
 of independent, normally distributed observations with known variances, and a weight vector
$\mathbf{y}$
 of independent, normally distributed observations with known variances, and a weight vector 
 $\mathbf{w}$
 proportional to the inverses of those variances. In this section, we propose, within the framework of duration models used for constructing experience tables for life insurance risks, vectors
$\mathbf{w}$
 proportional to the inverses of those variances. In this section, we propose, within the framework of duration models used for constructing experience tables for life insurance risks, vectors 
 $\mathbf{y}$
 and
$\mathbf{y}$
 and 
 $\mathbf{w}$
 that satisfy these conditions.
$\mathbf{w}$
 that satisfy these conditions.
3.1. Survival analysis framework
 We consider a longitudinal follow-up of m individuals, subject to left truncation and non-informative right censoring, and aim to estimate a distribution governed by a continuous explanatory variable x (e.g. age). Let 
 $\mu$
 denote the hazard function, also known as the force of mortality in the study of the death risk. Under standard survival analysis assumptions, the log-likelihood takes the following continuous-time form:
$\mu$
 denote the hazard function, also known as the force of mortality in the study of the death risk. Under standard survival analysis assumptions, the log-likelihood takes the following continuous-time form:
 \begin{equation}{\ell(\boldsymbol{\theta}) = \underset{i = 1}{\overset{m}{\sum}} \left[\delta_i \ln\mu(x_i + t_i,\boldsymbol{\theta}) - \underset{u = 0}{\overset{t_i}{\int}}\mu(x_i + u,\boldsymbol{\theta})\text{d}u\right].}\end{equation}
\begin{equation}{\ell(\boldsymbol{\theta}) = \underset{i = 1}{\overset{m}{\sum}} \left[\delta_i \ln\mu(x_i + t_i,\boldsymbol{\theta}) - \underset{u = 0}{\overset{t_i}{\int}}\mu(x_i + u,\boldsymbol{\theta})\text{d}u\right].}\end{equation}
 Here 
 $x_i$
 is the age at the start of observation,
$x_i$
 is the age at the start of observation, 
 $t_i$
 is the follow-up duration for individual i and
$t_i$
 is the follow-up duration for individual i and 
 $\delta_i$
 is an event indicator: 1 if the event is observed and 0 if censored.
$\delta_i$
 is an event indicator: 1 if the event is observed and 0 if censored.
Although model estimation can be based on direct maximization of Equation (3.1), this approach scales poorly with large m and generally requires numerical integration – except in simple parametric cases. We instead adopt a discrete approximation by assuming the hazard rate is piecewise constant over one-year intervals:
 \begin{align*}\mu(x + \epsilon) = \mu(x) \quad\text{for all}\quad x \in \mathbb{N}, \epsilon \in [0,1].\end{align*}
\begin{align*}\mu(x + \epsilon) = \mu(x) \quad\text{for all}\quad x \in \mathbb{N}, \epsilon \in [0,1].\end{align*} 
Under this assumption, the log-likelihood simplifies to a sum over discrete ages:
 \begin{equation}{\ell(\boldsymbol{\theta}) = \underset{x = x_{\min}}{\overset{x_{\max}}{\sum}} \ln\mu(x,\boldsymbol{\theta}) d(x) - \mu(x,\boldsymbol{\theta}) e_c(x).}\end{equation}
\begin{equation}{\ell(\boldsymbol{\theta}) = \underset{x = x_{\min}}{\overset{x_{\max}}{\sum}} \ln\mu(x,\boldsymbol{\theta}) d(x) - \mu(x,\boldsymbol{\theta}) e_c(x).}\end{equation}
 Here d(x) is the number of observed events at age x and 
 $e_c(x)$
 is the central exposure to risk, that is, the total duration individuals are observed at age x.
$e_c(x)$
 is the central exposure to risk, that is, the total duration individuals are observed at age x.
 This discretization, first introduced by Hoem (Reference Hoem1971), is widely used in actuarial science. Its advantages are underlined for example in Gschlössl et al. (Reference Gschlössl, Schoenmaekers and Denuit2011). It extends naturally to the two-dimensional case by assuming 
 $\mu(x + \epsilon, z + \xi) = \mu(x, z)$
 and summing over (x, z) pairs.
$\mu(x + \epsilon, z + \xi) = \mu(x, z)$
 and summing over (x, z) pairs.
Details on the derivation of Equations (3.1) and (3.2), along with the computation of central exposures and event counts, are provided in Section A of the Supplementary Materials.
3.2. Likelihood equations
 Assuming one parameter per observation and using the exponential link 
 $\boldsymbol{\mu}(\boldsymbol{\theta}) = \boldsymbol{\exp}(\boldsymbol{\theta})$
, we recover the crude rates estimator, which models each age (or age pair) independently. The exponential link ensures positive hazard rates. The log-likelihood, in both one- and two-dimensional cases, takes the vectorized form:
$\boldsymbol{\mu}(\boldsymbol{\theta}) = \boldsymbol{\exp}(\boldsymbol{\theta})$
, we recover the crude rates estimator, which models each age (or age pair) independently. The exponential link ensures positive hazard rates. The log-likelihood, in both one- and two-dimensional cases, takes the vectorized form:
 \begin{equation}{\ell(\boldsymbol{\theta}) = \boldsymbol{\theta}^{T}\mathbf{d} - \boldsymbol{\exp}(\boldsymbol{\theta})^{T}\mathbf{e_c}}\end{equation}
\begin{equation}{\ell(\boldsymbol{\theta}) = \boldsymbol{\theta}^{T}\mathbf{d} - \boldsymbol{\exp}(\boldsymbol{\theta})^{T}\mathbf{e_c}}\end{equation}
where 
 $\mathbf{d}$
 and
$\mathbf{d}$
 and 
 $\mathbf{e}_c$
 are the vectors of observed deaths and central exposures.
$\mathbf{e}_c$
 are the vectors of observed deaths and central exposures.
The derivatives of this likelihood are
 \begin{equation}{\frac{\partial \ell}{\partial \boldsymbol{\theta}} = \mathbf{d} -\boldsymbol{\exp}(\boldsymbol{\theta}) \odot \mathbf{e_c} \quad \text{and} \quad \frac{\partial^2 \ell}{\partial\boldsymbol{\theta}\partial\boldsymbol{\theta}^T} = - \text{Diag}(\boldsymbol{\exp}(\boldsymbol{\theta}) \odot \mathbf{e_c}).}\end{equation}
\begin{equation}{\frac{\partial \ell}{\partial \boldsymbol{\theta}} = \mathbf{d} -\boldsymbol{\exp}(\boldsymbol{\theta}) \odot \mathbf{e_c} \quad \text{and} \quad \frac{\partial^2 \ell}{\partial\boldsymbol{\theta}\partial\boldsymbol{\theta}^T} = - \text{Diag}(\boldsymbol{\exp}(\boldsymbol{\theta}) \odot \mathbf{e_c}).}\end{equation}
 These equations correspond to those of a Poisson GLM (Nelder and Wedderburn, Reference Nelder and Wedderburn1972) with mean 
 $\boldsymbol{\mu}(\boldsymbol{\theta}) \odot \mathbf{e}_c$
, although derived under different assumptions.
$\boldsymbol{\mu}(\boldsymbol{\theta}) \odot \mathbf{e}_c$
, although derived under different assumptions.
 The model admits the closed-form solution 
 $\hat{\boldsymbol{\theta}} = \ln(\mathbf{d} / \mathbf{e}_c)$
. Under standard regularity conditions, the maximum likelihood estimator satisfies
$\hat{\boldsymbol{\theta}} = \ln(\mathbf{d} / \mathbf{e}_c)$
. Under standard regularity conditions, the maximum likelihood estimator satisfies 
 $\hat{\boldsymbol{\theta}} \sim \mathcal{N}(\boldsymbol{\theta}, W_{\hat{\boldsymbol{\theta}}}^{-1})$
, with
$\hat{\boldsymbol{\theta}} \sim \mathcal{N}(\boldsymbol{\theta}, W_{\hat{\boldsymbol{\theta}}}^{-1})$
, with 
 $W_{\hat{\boldsymbol{\theta}}} = \text{Diag}(\mathbf{d})$
.
$W_{\hat{\boldsymbol{\theta}}} = \text{Diag}(\mathbf{d})$
.
Notably, this asymptotic approximation depends on the number of individuals m and not the dimension n of the aggregated vectors.
3.3. Consequence for the WH smoothing
 We conclude that, under the duration model framework and using crude rates, the log-estimate 
 $\ln(\mathbf{d} / \mathbf{e}_c)$
 is asymptotically normal:
$\ln(\mathbf{d} / \mathbf{e}_c)$
 is asymptotically normal:
 \begin{align*}\ln(\mathbf{d} / \mathbf{e}_c) \sim \mathcal{N}(\!\ln\boldsymbol{\mu}, W^{-1}) \quad\text{with}\quad W = \text{Diag}(\mathbf{d}).\end{align*}
\begin{align*}\ln(\mathbf{d} / \mathbf{e}_c) \sim \mathcal{N}(\!\ln\boldsymbol{\mu}, W^{-1}) \quad\text{with}\quad W = \text{Diag}(\mathbf{d}).\end{align*} 
 This justifies applying WH smoothing to the observation vector 
 $\mathbf{y} = \ln(\mathbf{d} / \mathbf{e}_c)$
 with weight vector
$\mathbf{y} = \ln(\mathbf{d} / \mathbf{e}_c)$
 with weight vector 
 $\mathbf{w} = \mathbf{d}$
. Using results from Section 2, and
$\mathbf{w} = \mathbf{d}$
. Using results from Section 2, and 
 $\sigma^2 = 1$
, the credible intervals for
$\sigma^2 = 1$
, the credible intervals for 
 $\ln\boldsymbol{\mu}$
 are
$\ln\boldsymbol{\mu}$
 are
 \begin{align*}\ln\boldsymbol{\mu} \mid \mathbf{d}, \mathbf{e_c} \in \left[\hat{\boldsymbol{\theta}} \pm \Phi^{- 1}\left(1 -\alpha / 2\right)\sqrt{\textbf{diag}\left\lbrace(\text{Diag}(\mathbf{d}) + P_{\lambda})^{-1}\right\rbrace}\right]\end{align*}
\begin{align*}\ln\boldsymbol{\mu} \mid \mathbf{d}, \mathbf{e_c} \in \left[\hat{\boldsymbol{\theta}} \pm \Phi^{- 1}\left(1 -\alpha / 2\right)\sqrt{\textbf{diag}\left\lbrace(\text{Diag}(\mathbf{d}) + P_{\lambda})^{-1}\right\rbrace}\right]\end{align*} 
with 
 $\hat{\boldsymbol{\theta}} = (W + P_{\lambda})^{-1}W(\!\ln\mathbf{d} - \ln \mathbf{e}_c)$
. Credible intervals for
$\hat{\boldsymbol{\theta}} = (W + P_{\lambda})^{-1}W(\!\ln\mathbf{d} - \ln \mathbf{e}_c)$
. Credible intervals for 
 $\boldsymbol{\mu}$
 itself are then obtained by exponentiating the bounds.
$\boldsymbol{\mu}$
 itself are then obtained by exponentiating the bounds.
4. How to improve the accuracy of smoothing with limited data volume?
4.1. Generalized Whittaker–Henderson smoothing
 The approach described in Section 3.2 assumes that the crude rates estimator is asymptotically normal, justifying the application of WH smoothing to its logarithm. However, with limited data, this approximation may introduce significant bias. We therefore propose an alternative based directly on the exact likelihood in Equation (3.3). Applying the Bayesian framework from Section 2 and assuming 
 $\boldsymbol{\theta} \sim \mathcal{N}(0, P_{\lambda}^{-})$
, Bayes’ theorem gives
$\boldsymbol{\theta} \sim \mathcal{N}(0, P_{\lambda}^{-})$
, Bayes’ theorem gives
 \begin{align*}f(\boldsymbol{\theta} \mid \mathbf{d}, \mathbf{e_c})\propto f(\mathbf{d}, \mathbf{e_c} \mid \boldsymbol{\theta}) f(\boldsymbol{\theta})\propto \exp\left[\ell(\boldsymbol{\theta}) - \frac{1}{2}\boldsymbol{\theta}^TP_{\lambda}\boldsymbol{\theta}\right].\end{align*}
\begin{align*}f(\boldsymbol{\theta} \mid \mathbf{d}, \mathbf{e_c})\propto f(\mathbf{d}, \mathbf{e_c} \mid \boldsymbol{\theta}) f(\boldsymbol{\theta})\propto \exp\left[\ell(\boldsymbol{\theta}) - \frac{1}{2}\boldsymbol{\theta}^TP_{\lambda}\boldsymbol{\theta}\right].\end{align*} 
 We define the penalized log-likelihood as 
 $\ell_P(\boldsymbol{\theta}) = \ell(\boldsymbol{\theta}) - \boldsymbol{\theta}^TP_{\lambda}\boldsymbol{\theta}/2$
. The maximum a posteriori estimate is the maximizer of
$\ell_P(\boldsymbol{\theta}) = \ell(\boldsymbol{\theta}) - \boldsymbol{\theta}^TP_{\lambda}\boldsymbol{\theta}/2$
. The maximum a posteriori estimate is the maximizer of 
 $\ell_P$
.
$\ell_P$
.
 Using a second-order Taylor expansion of the posterior log-likelihood around 
 $\hat{\boldsymbol{\theta}}$
 leads to the Laplace approximation:
$\hat{\boldsymbol{\theta}}$
 leads to the Laplace approximation:
 \begin{equation}{f(\boldsymbol{\theta} \mid \mathbf{d}, \mathbf{e_c}) \approx \mathcal{N}(\hat{\boldsymbol{\theta}},(W_{\hat{\boldsymbol{\theta}}} + P_{\lambda})^{- 1})}\end{equation}
\begin{equation}{f(\boldsymbol{\theta} \mid \mathbf{d}, \mathbf{e_c}) \approx \mathcal{N}(\hat{\boldsymbol{\theta}},(W_{\hat{\boldsymbol{\theta}}} + P_{\lambda})^{- 1})}\end{equation}
where 
 $W_{\hat{\boldsymbol{\theta}}} = \text{Diag}(\boldsymbol{\exp}(\hat{\boldsymbol{\theta}}) \odot \mathbf{e_c})$
. Unlike the normal case studied in Section 2, the higher-order derivatives of the posterior log-likelihood are not zero, and Equation (4.1) only provides an approximation of the posterior log-likelihood, which yields asymptotic credible intervals:
$W_{\hat{\boldsymbol{\theta}}} = \text{Diag}(\boldsymbol{\exp}(\hat{\boldsymbol{\theta}}) \odot \mathbf{e_c})$
. Unlike the normal case studied in Section 2, the higher-order derivatives of the posterior log-likelihood are not zero, and Equation (4.1) only provides an approximation of the posterior log-likelihood, which yields asymptotic credible intervals:
 \begin{align*}\ln \boldsymbol{\mu} \mid \mathbf{d}, \mathbf{e_c} \in \left[\hat{\boldsymbol{\theta}} \pm \Phi^{- 1}\left(1 -\alpha / 2\right)\sqrt{\textbf{diag}\left\lbrace(W_{\hat{\boldsymbol{\theta}}} + P_{\lambda})^{-1}\right\rbrace}\right].\end{align*}
\begin{align*}\ln \boldsymbol{\mu} \mid \mathbf{d}, \mathbf{e_c} \in \left[\hat{\boldsymbol{\theta}} \pm \Phi^{- 1}\left(1 -\alpha / 2\right)\sqrt{\textbf{diag}\left\lbrace(W_{\hat{\boldsymbol{\theta}}} + P_{\lambda})^{-1}\right\rbrace}\right].\end{align*} 
 Unlike the closed-form estimator in Equation (3.4), no analytical solution for 
 $\hat{\boldsymbol{\theta}}$
 exists here. We solve numerically using Newton’s algorithm, which iteratively updates:
$\hat{\boldsymbol{\theta}}$
 exists here. We solve numerically using Newton’s algorithm, which iteratively updates:
 \begin{align*}\boldsymbol{\theta}_{k + 1} =\boldsymbol{\theta}_k + (W_k + P_{\lambda})^{- 1} (\mathbf{d} -\boldsymbol{\exp}(\boldsymbol{\theta}_k) \odot \mathbf{e_c} - P_{\lambda} \boldsymbol{\theta}_k)]\end{align*}
\begin{align*}\boldsymbol{\theta}_{k + 1} =\boldsymbol{\theta}_k + (W_k + P_{\lambda})^{- 1} (\mathbf{d} -\boldsymbol{\exp}(\boldsymbol{\theta}_k) \odot \mathbf{e_c} - P_{\lambda} \boldsymbol{\theta}_k)]\end{align*} 
with 
 $W_k = \text{Diag}(\boldsymbol{\exp}(\boldsymbol{\theta}_k) \odot \mathbf{e_c})$
. The update can be rewritten as
$W_k = \text{Diag}(\boldsymbol{\exp}(\boldsymbol{\theta}_k) \odot \mathbf{e_c})$
. The update can be rewritten as
 \begin{align*}\boldsymbol{\theta}_{k + 1} =(W_k + P_{\lambda})^{- 1}W_k\mathbf{z}_k\quad\text{where}\quad \mathbf{z}_k = \boldsymbol{\theta}_k + W_k^{-1}[\mathbf{d} -\boldsymbol{\exp}(\boldsymbol{\theta}_k) \odot \mathbf{e_c}].\end{align*}
\begin{align*}\boldsymbol{\theta}_{k + 1} =(W_k + P_{\lambda})^{- 1}W_k\mathbf{z}_k\quad\text{where}\quad \mathbf{z}_k = \boldsymbol{\theta}_k + W_k^{-1}[\mathbf{d} -\boldsymbol{\exp}(\boldsymbol{\theta}_k) \odot \mathbf{e_c}].\end{align*} 
 Initializing with the crude rates estimator 
 $\boldsymbol{\theta}_0 = \ln(\mathbf{d} / \mathbf{e_c})$
 implies
$\boldsymbol{\theta}_0 = \ln(\mathbf{d} / \mathbf{e_c})$
 implies 
 $W_0 = \text{Diag}(\mathbf{d})$
 and
$W_0 = \text{Diag}(\mathbf{d})$
 and 
 $z_0 = \ln(\mathbf{d} / \mathbf{e_c})$
, so the first iteration recovers the classical WH smoothing result.
$z_0 = \ln(\mathbf{d} / \mathbf{e_c})$
, so the first iteration recovers the classical WH smoothing result.
Subsequent iterations refine the observation and weight vectors. This process can thus be interpreted as an iterative generalization of WH smoothing, akin to how generalized linear models extend linear models.
We refer to this method as generalized WH smoothing. The iterative estimation algorithm described above corresponds to the penalized iteratively reweighted least squares (PIRLS) algorithm, widely used for fitting generalized additive models.
This framework naturally extends to other exponential family distributions, such as the binomial case suggested in Verrall (Reference Verrall1993), by adapting the likelihood, link function, weight matrix, and working vector. However, we advocate for the Poisson-like likelihood of Equation (3.3), which offers several advantages: it generalizes to competing risks, supports multiplicative covariate effects via the log link and allows the use of an external reference table as a multiplicative offset.
4.2. Impact of the normal approximation in the original smoothing
As discussed in Section 3, classical WH smoothing can be viewed as an approximation to a penalized likelihood maximization, relying on a crude rate estimator assumed to be asymptotically normal. To assess the practical consequences of this approximation, we conduct an empirical comparison based on six simulated datasets reflecting the typical structure and volume of real insurance portfolios:
- 
• The first three datasets simulate annuity portfolios with 20,000, 100,000, and 500,000 policyholders. The sole covariate is age, ranging from 50 to 95. 
- 
• The next three mimic long-term care (LTC) portfolios of the same sizes. Modelling of LTC typically relies on the illness-death model (Fix and Neyman, Reference Fix and Neyman1951; Clifford, Reference Clifford1977). To get a two-dimensional illustration we focus on the transition between the disabled and dead states (the two other transitions would provide additional one-dimensional examples). Two covariates are used: age (70–100) and duration in LTC (0–15 years). 
 Each dataset consists of individual-level longitudinal data, from which we derive event counts 
 $\mathbf{d}$
 and exposures
$\mathbf{d}$
 and exposures 
 $\mathbf{e_c}$
, aggregated by age x (for annuities) of by (x, z) pairs (for LTC). All datasets within each group share the same underlying structure and differ only in size. Key dataset statistics are provided in Table 1 and additional details about how those datasets were generated are provided in Section B of the Supplementary Materials.
$\mathbf{e_c}$
, aggregated by age x (for annuities) of by (x, z) pairs (for LTC). All datasets within each group share the same underlying structure and differ only in size. Key dataset statistics are provided in Table 1 and additional details about how those datasets were generated are provided in Section B of the Supplementary Materials.
Table 1. Key figures associated with the 6 simulated datasets.

We apply two methods:
- 
1. Original WH smoothing using  $\mathbf{y} = \ln(\mathbf{d} / \mathbf{e_c})$
 and weights $\mathbf{y} = \ln(\mathbf{d} / \mathbf{e_c})$
 and weights $\mathbf{w} = \mathbf{d}$
 as in Section 3. $\mathbf{w} = \mathbf{d}$
 as in Section 3.
- 
2. Generalized WH smoothing, using the likelihood formulation of Section 4. 
 Both methods use the same smoothing parameter(s) 
 $\lambda$
, to ensure that prior assumptions on
$\lambda$
, to ensure that prior assumptions on 
 $\boldsymbol{\theta} = \ln \boldsymbol{\mu}$
 are held constant. We fix the penalty order at
$\boldsymbol{\theta} = \ln \boldsymbol{\mu}$
 are held constant. We fix the penalty order at 
 $q = 2$
, corresponding to second-order differences.
$q = 2$
, corresponding to second-order differences.
 As both estimators target 
 $\boldsymbol{\theta}$
, we compare them using the following relative error metric:
$\boldsymbol{\theta}$
, we compare them using the following relative error metric:
 \begin{equation}{\Delta(\boldsymbol{\theta}) = \frac{\ell_{P}(\hat{\boldsymbol{\theta}}_{\text{ML}}) - \ell_{P}(\boldsymbol{\theta})\;\;}{\ell_{P}(\hat{\boldsymbol{\theta}}_{\text{ML}}) - \ell_{P}(\hat{\boldsymbol{\theta}}_{\infty})}.}\end{equation}
\begin{equation}{\Delta(\boldsymbol{\theta}) = \frac{\ell_{P}(\hat{\boldsymbol{\theta}}_{\text{ML}}) - \ell_{P}(\boldsymbol{\theta})\;\;}{\ell_{P}(\hat{\boldsymbol{\theta}}_{\text{ML}}) - \ell_{P}(\hat{\boldsymbol{\theta}}_{\infty})}.}\end{equation}
 Here 
 $\hat{\boldsymbol{\theta}}_\infty$
 maximizes the penalized likelihood, while
$\hat{\boldsymbol{\theta}}_\infty$
 maximizes the penalized likelihood, while 
 $\hat{\boldsymbol{\theta}}_\infty$
 corresponds to the solution with
$\hat{\boldsymbol{\theta}}_\infty$
 corresponds to the solution with 
 $\lambda \rightarrow \infty$
, which we later show to be the degree-
$\lambda \rightarrow \infty$
, which we later show to be the degree-
 $(q - 1)$
 polynomial that maximizes the likelihood. By construction:
$(q - 1)$
 polynomial that maximizes the likelihood. By construction:
 \begin{align*}\Delta(\hat{\boldsymbol{\theta}}_{\text{ML}}) = 0, \quad\Delta(\hat{\boldsymbol{\theta}}_{\infty}) = 1, \quad\text{and}\quad \Delta(\boldsymbol{\theta}) \ge 0.\end{align*}
\begin{align*}\Delta(\hat{\boldsymbol{\theta}}_{\text{ML}}) = 0, \quad\Delta(\hat{\boldsymbol{\theta}}_{\infty}) = 1, \quad\text{and}\quad \Delta(\boldsymbol{\theta}) \ge 0.\end{align*} 
 A model with 
 $\Delta(\boldsymbol{\theta}) \gt 1$
 performs worse than a simple polynomial fit under the prior.
$\Delta(\boldsymbol{\theta}) \gt 1$
 performs worse than a simple polynomial fit under the prior.
 Table 2 presents the values of 
 $\Delta(\hat{\boldsymbol{\theta}}_{\text{norm}})$
 across the six datasets. As expected, discrepancies decrease with portfolio size. For annuities, the approximation performs reasonably well even at smaller scales. In contrast, for LTC, it yields substantial errors, except for the largest portfolio.
$\Delta(\hat{\boldsymbol{\theta}}_{\text{norm}})$
 across the six datasets. As expected, discrepancies decrease with portfolio size. For annuities, the approximation performs reasonably well even at smaller scales. In contrast, for LTC, it yields substantial errors, except for the largest portfolio.
Table 2. Impact of the approximation from the original WH smoothing on the 6 simulated datasets.

One explanation, supported by the standardized mortality ratio (SMR) also provided in Table 2, is the positive correlation between observed event counts and their use as weights. This causes high crude rates to be overweighted, and low rates to be underweighted – introducing systematic overestimation of mortality rates. This bias is more severe in the LTC case where the observed deaths by data point is lower. In contrast, generalized WH smoothing preserves total event counts by construction, always yielding an SMR of exactly 100%. These results support adopting generalized WH smoothing in most practical settings. It retains the advantages of the original method while offering improved accuracy – even in small samples – and remains straightforward to implement.
5. How to select the smoothing parameters?
5.1. Impact of smoothing parameter choice
 In the one-dimensional case, WH smoothing involves a single smoothing parameter 
 $\lambda$
; in two dimensions a pair
$\lambda$
; in two dimensions a pair 
 $\lambda = (\lambda_x, \lambda_z)$
. These parameters govern the trade-off between fidelity to the data and smoothness of the estimate, as defined in Equation (1.1).
$\lambda = (\lambda_x, \lambda_z)$
. These parameters govern the trade-off between fidelity to the data and smoothness of the estimate, as defined in Equation (1.1).
 Figure 1 illustrates this effect in a one-dimensional annuity dataset (100,000 policyholders, see Section 4.2), with three values of 
 $\lambda$
. The effective degrees of freedom (edf), computed as the trace of the hat matrix
$\lambda$
. The effective degrees of freedom (edf), computed as the trace of the hat matrix 
 $H = (W + P_{\lambda})^{- 1}W$
, are shown for each curve. This quantity serves as a non-parametric analog of the number of free parameters in classical models and can take fractional values.
$H = (W + P_{\lambda})^{- 1}W$
, are shown for each curve. This quantity serves as a non-parametric analog of the number of free parameters in classical models and can take fractional values.

Figure 1. WH smoothing on a synthetic annuity portfolio with 3 smoothing levels. Dots: crude rates; curves: smoothed estimates; shaded areas: credibility intervals. edf: effective degrees of freedom.
 As shown, a low value 
 $\lambda = 10^1$
 yields an overfitted result that mirrors sampling noise, while a high value
$\lambda = 10^1$
 yields an overfitted result that mirrors sampling noise, while a high value 
 $\lambda = 10^7$
 oversmooths and obscures the underlying trend. A mid-range value
$\lambda = 10^7$
 oversmooths and obscures the underlying trend. A mid-range value 
 $\lambda = 10^4$
 appears visually balanced. However, selecting a smoothing parameter by eye is unreliable: small-sample variability at the extremes of the age range can easily be mistaken for meaningful patterns.
$\lambda = 10^4$
 appears visually balanced. However, selecting a smoothing parameter by eye is unreliable: small-sample variability at the extremes of the age range can easily be mistaken for meaningful patterns.
 The two-dimensional case further illustrates this difficulty. Figure 2 presents the smoothed transition rates from disability to death in an LTC portfolio (100,000 policyholders), using 9 combinations of 
 $(\lambda_x, \lambda_z)$
. Choosing an appropriate parameter pair visually becomes nearly impossible, reinforcing the need for a data-driven statistical selection criterion.
$(\lambda_x, \lambda_z)$
. Choosing an appropriate parameter pair visually becomes nearly impossible, reinforcing the need for a data-driven statistical selection criterion.

Figure 2. WH smoothing applied to disability-to-death transitions in an LTC portfolio, using 9 combinations of smoothing parameters. Contour lines and colours show the smoothed mortality surface by age and LTC duration.
5.2. Statistical criteria for parameter selection
Smoothing parameter selection typically relies on two classes of statistical criteria:
- 
1. Prediction-based criteria, which aim to minimize prediction error, such as the Akaike Information Criterion (AIC) (Akaike, Reference Akaike1973), and generalized cross-validation (GCV) (Wahba, Reference Wahba1980); 
- 
2. Likelihood-based criteria, which maximize the marginal likelihood – an approach introduced by Patterson and Thompson (Reference Patterson and Thompson1971) (under the name REML in the Gaussian case) and adapted to smoothing by Anderssen and Bloomfield (Reference Anderssen and Bloomfield1974). 
While prediction-based criteria have desirable asymptotic properties (Wahba, Reference Wahba1985; Kauermann, Reference Kauermann2005), their convergence towards optimal smoothing parameters can be slow. In contrast, marginal likelihood criteria tend to perform more robustly in finite samples (Reiss and Todd Ogden, Reference Reiss and Todd Ogden2009; Wood, Reference Wood2011).
To illustrate this, we apply AIC, GCV, and marginal likelihood to 100 replicates of the annuity portfolio with 100,000 policyholders (see Section 4.2). For each replicate, we select the optimal smoothing parameter and compute the corresponding effective degrees of freedom.
As shown in the left side of Figure 3, marginal likelihood produces stable and coherent degrees of freedom across replicates, whereas AIC and especially GCV often yield overly complex models. On the right, we plot the GCV and marginal likelihood profiles for a single replicate: marginal likelihood exhibits a well-defined maximum, while GCV presents two local minima. One aligns with the marginal likelihood optimum, but the global minimum corresponds to a model with 35 degrees of freedom – an implausibly complex mortality curve.

Figure 3. Comparison of criteria for selecting the smoothing parameter in one-dimensional WH smoothing. Left: distribution of effective degrees of freedom under AIC, GCV, and marginal likelihood across 100 replicates. Right: GCV and marginal likelihood values for one replicate as functions of the smoothing parameter.
These observations support the use of marginal likelihood over prediction-based criteria, especially in actuarial applications where robustness is key. Moreover, this choice aligns naturally with the Bayesian framework introduced in Sections 2–4.
We now detail its implementation – first for the original WH smoothing, then for the generalized setting – introducing three optimization strategies and three numerical algorithms and comparing their respective performances.
5.3 Selection in the original smoothing
 We consider again the normal framework from Section 2, where 
 $\mathbf{y} \mid \boldsymbol{\theta} \sim \mathcal{N}(\boldsymbol{\theta}, \sigma^2W^{-})$
 and
$\mathbf{y} \mid \boldsymbol{\theta} \sim \mathcal{N}(\boldsymbol{\theta}, \sigma^2W^{-})$
 and 
 $\boldsymbol{\theta} \mid \lambda \sim \mathcal{N}(0, \sigma^2P_{\lambda}^{-})$
. In the empirical Bayes approach, the smoothing parameter
$\boldsymbol{\theta} \mid \lambda \sim \mathcal{N}(0, \sigma^2P_{\lambda}^{-})$
. In the empirical Bayes approach, the smoothing parameter 
 $\lambda$
 is estimated by maximizing the marginal likelihood:
$\lambda$
 is estimated by maximizing the marginal likelihood:
 \begin{align*}\mathcal{L}^m_{\text{norm}}(\lambda) = f(\mathbf{y} \mid \lambda) = \int f(\mathbf{y}, \boldsymbol{\theta} \mid \lambda)\text{d}\boldsymbol{\theta} = \int f(\mathbf{y} \mid \boldsymbol{\theta}) f(\boldsymbol{\theta} \mid \lambda)\text{d}\boldsymbol{\theta}.\end{align*}
\begin{align*}\mathcal{L}^m_{\text{norm}}(\lambda) = f(\mathbf{y} \mid \lambda) = \int f(\mathbf{y}, \boldsymbol{\theta} \mid \lambda)\text{d}\boldsymbol{\theta} = \int f(\mathbf{y} \mid \boldsymbol{\theta}) f(\boldsymbol{\theta} \mid \lambda)\text{d}\boldsymbol{\theta}.\end{align*} 
This is simply the maximum likelihood method applied to the smoothing parameter, treated as deterministic but unknown. A closed-form expression for this integral can be derived using standard Gaussian identities (see Section C of the Online Supplementary Materials), yielding the marginal log-likelihood:
 \begin{align*}\ell^m_{\text{norm}}(\lambda) = - \frac{1}{2}\left[(\mathbf{y} -\hat{\boldsymbol{\theta}}_{\lambda})^{T}W(\mathbf{y} -\hat{\boldsymbol{\theta}}_{\lambda}) / \sigma^2 + \hat{\boldsymbol{\theta}}_{\lambda}^{T}P_{\lambda} \hat{\boldsymbol{\theta}}_{\lambda} / \sigma^2 + \ln|W + P_{\lambda}| - \ln |P_{\lambda}|_{+} + C\right].\end{align*}
\begin{align*}\ell^m_{\text{norm}}(\lambda) = - \frac{1}{2}\left[(\mathbf{y} -\hat{\boldsymbol{\theta}}_{\lambda})^{T}W(\mathbf{y} -\hat{\boldsymbol{\theta}}_{\lambda}) / \sigma^2 + \hat{\boldsymbol{\theta}}_{\lambda}^{T}P_{\lambda} \hat{\boldsymbol{\theta}}_{\lambda} / \sigma^2 + \ln|W + P_{\lambda}| - \ln |P_{\lambda}|_{+} + C\right].\end{align*} 
where 
 $\hat{\boldsymbol{\theta}}_\lambda = (W + P_{\lambda})^{-1}W\mathbf{y}$
, and
$\hat{\boldsymbol{\theta}}_\lambda = (W + P_{\lambda})^{-1}W\mathbf{y}$
, and 
 $C = - \ln|W|_{+} + (n_* - q)\ln(2\pi\sigma^2)$
 is a constant independent of
$C = - \ln|W|_{+} + (n_* - q)\ln(2\pi\sigma^2)$
 is a constant independent of 
 $\lambda$
. This function is maximized numerically to obtain
$\lambda$
. This function is maximized numerically to obtain 
 $\hat{\lambda}_{\text{norm}}$
.
$\hat{\lambda}_{\text{norm}}$
.
5.4. Selection in the generalized smoothing
 The empirical Bayes approach introduced in the normal framework can be extended to the generalized smoothing framework developed in Section 4. While no closed-form expression exists for the marginal likelihood in this context, it can be approximated using a second-order Taylor expansion of the log-posterior density around its maximum 
 $\hat{\boldsymbol{\theta}}_\lambda$
 – similarly to what was done in the normal case. This yields the so-called Laplace approximation of the marginal likelihood (LAML), defined as
$\hat{\boldsymbol{\theta}}_\lambda$
 – similarly to what was done in the normal case. This yields the so-called Laplace approximation of the marginal likelihood (LAML), defined as 
 \begin{align*}\ell^m_{\text{LAML}}(\lambda) = \ell(\hat{\boldsymbol{\theta}}_{\lambda}) - \frac{1}{2}\left[\hat{\boldsymbol{\theta}}_{\lambda}^T P_{\lambda} \hat{\boldsymbol{\theta}}_{\lambda} + \ln|W_{\lambda} + P_{\lambda}| - \ln|P_{\lambda}|_{+} - q\ln(2\pi)\right]\end{align*}
\begin{align*}\ell^m_{\text{LAML}}(\lambda) = \ell(\hat{\boldsymbol{\theta}}_{\lambda}) - \frac{1}{2}\left[\hat{\boldsymbol{\theta}}_{\lambda}^T P_{\lambda} \hat{\boldsymbol{\theta}}_{\lambda} + \ln|W_{\lambda} + P_{\lambda}| - \ln|P_{\lambda}|_{+} - q\ln(2\pi)\right]\end{align*} 
where 
 $W_\lambda = \text{Diag}(\!\exp(\hat{\boldsymbol{\theta}}_{\lambda}) \odot \mathbf{e_c})$
 and
$W_\lambda = \text{Diag}(\!\exp(\hat{\boldsymbol{\theta}}_{\lambda}) \odot \mathbf{e_c})$
 and 
 $\ell(\hat{\boldsymbol{\theta}}_{\lambda})$
 is the log-likelihood evaluated at the penalized MLE. The detailed derivation of the Laplace approximation in this setting is provided in Section C of the Supplementary Materials. This approximation plays a central role in the automatic selection of the smoothing parameter
$\ell(\hat{\boldsymbol{\theta}}_{\lambda})$
 is the log-likelihood evaluated at the penalized MLE. The detailed derivation of the Laplace approximation in this setting is provided in Section C of the Supplementary Materials. This approximation plays a central role in the automatic selection of the smoothing parameter 
 $\lambda$
 in the generalized WH smoothing framework. As in the normal case, the marginal likelihood
$\lambda$
 in the generalized WH smoothing framework. As in the normal case, the marginal likelihood 
 $\ell^m_{\text{LAML}}(\lambda)$
 must be maximized numerically. However, a key distinction is that the penalized likelihood maximizer
$\ell^m_{\text{LAML}}(\lambda)$
 must be maximized numerically. However, a key distinction is that the penalized likelihood maximizer 
 $\hat{\boldsymbol{\theta}}_\lambda$
 now depends on
$\hat{\boldsymbol{\theta}}_\lambda$
 now depends on 
 $\lambda$
 and must be recomputed at each iteration via the PIRLS algorithm. This leads to a two-level optimization procedure:
$\lambda$
 and must be recomputed at each iteration via the PIRLS algorithm. This leads to a two-level optimization procedure:
- 
• an inner loop estimating  $\hat{\boldsymbol{\theta}}_\lambda$
 for fixed $\hat{\boldsymbol{\theta}}_\lambda$
 for fixed $\lambda$
 using PIRLS; $\lambda$
 using PIRLS;
- 
• and an outer loop optimizing  $\ell^m_{\text{LAML}}(\lambda)$
 with respect to $\ell^m_{\text{LAML}}(\lambda)$
 with respect to $\lambda$
. $\lambda$
.
This outer iteration approach is the most principled method for smoothing parameter selection in this setting.
 Alternative strategies have been proposed to reduce computational burden. The first one, known as performance-oriented iteration, was introduced by Gu (Reference Gu1992) and relies on the observation that, at each PIRLS step, the working response vector 
 $\mathbf{z}_k$
 can be treated as approximately normal:
$\mathbf{z}_k$
 can be treated as approximately normal: 
 $\mathbf{z}_k \mid \boldsymbol{\theta} \sim \mathcal{N}(\boldsymbol{\theta}, W_k^{-1})$
. Assuming
$\mathbf{z}_k \mid \boldsymbol{\theta} \sim \mathcal{N}(\boldsymbol{\theta}, W_k^{-1})$
. Assuming 
 $W_k$
 independent of
$W_k$
 independent of 
 $\lambda$
, the marginal likelihood can be maximized within each PIRLS step using the normal approximation methodology of Section 5.3, with
$\lambda$
, the marginal likelihood can be maximized within each PIRLS step using the normal approximation methodology of Section 5.3, with 
 $\mathbf{y}$
 replaced by
$\mathbf{y}$
 replaced by 
 $\mathbf{z}_k$
 and W by
$\mathbf{z}_k$
 and W by 
 $W_k$
. This effectively reverses the nesting structure, potentially saving computational time when updating
$W_k$
. This effectively reverses the nesting structure, potentially saving computational time when updating 
 $\lambda$
 is less costly than recomputing a PIRLS step. A formal justification of the method is provided by Wood (Reference Wood2017, 149), which emphasizes that it does not actually require
$\lambda$
 is less costly than recomputing a PIRLS step. A formal justification of the method is provided by Wood (Reference Wood2017, 149), which emphasizes that it does not actually require 
 $\mathbf{z}_k$
 to have a normal distribution to be well founded.
$\mathbf{z}_k$
 to have a normal distribution to be well founded.
 A third and even simpler strategy is the alternate iteration approach, used for instance by Wood et al. (Reference Wood, Li, Shaddick and Augustin2017). It consists in alternating updates of 
 $\boldsymbol{\theta}$
 (via PIRLS) and
$\boldsymbol{\theta}$
 (via PIRLS) and 
 $\lambda$
 (via approximate marginal likelihood), without fully optimizing either at each step. This relies on the empirical observation that a coarse update of
$\lambda$
 (via approximate marginal likelihood), without fully optimizing either at each step. This relies on the empirical observation that a coarse update of 
 $\lambda$
 may suffice, as the marginal likelihood surface changes between iterations.
$\lambda$
 may suffice, as the marginal likelihood surface changes between iterations.
 Despite their efficiency, both performance-oriented and alternate iteration approaches lack formal convergence guarantees. Unlike outer iteration, they operate on different smoothing parameters at each step, rendering penalized likelihood values non-comparable across iterations. Moreover, they do not track the value of 
 $\ell^m_{\text{LAML}}(\lambda)$
 during the optimization, making it harder to assess convergence or apply step-length controls.
$\ell^m_{\text{LAML}}(\lambda)$
 during the optimization, making it harder to assess convergence or apply step-length controls.
Detailed algorithmic formulations of all three strategies in the generalized WH smoothing framework are provided in Section D of the Supplementary Materials.
5.5. Algorithms for the maximization of the marginal likelihood
Several algorithms can be used to maximize the marginal likelihood or its Laplace approximation (LAML). It is generally preferable to apply these algorithms to the logarithm of the smoothing parameters, for three main reasons:
- 
1. It ensures positivity of the smoothing parameters; 
- 
2. It simplifies the expressions of derivatives, when required; 
- 
3. It allows more uniform coverage of the range of interest (e.g. from  $\lambda = 10^1$
 to $\lambda = 10^1$
 to $10^7$
, as in Figure 1, differences of comparable magnitude occur on a logarithmic scale). $10^7$
, as in Figure 1, differences of comparable magnitude occur on a logarithmic scale).
Derivative-free heuristics
A first, operationally simple option is to use general-purpose derivative-free optimization methods:
- 
• Brent’s method (Brent, Reference Brent1973) in the one-dimensional case; 
- 
• The Nelder–Mead simplex algorithm (Nelder and Mead, Reference Nelder and Mead1965) in higher dimensions. 
 These are readily available in base 
 $\texttt{R}$
 via the optimize and optim functions. They only require evaluating the marginal likelihood or LAML at each step, which is computationally inexpensive. However, they typically require more iterations to converge and cannot be combined with the alternate iteration approach, as they do not guarantee systematic improvement of the criterion at each step.
$\texttt{R}$
 via the optimize and optim functions. They only require evaluating the marginal likelihood or LAML at each step, which is computationally inexpensive. However, they typically require more iterations to converge and cannot be combined with the alternate iteration approach, as they do not guarantee systematic improvement of the criterion at each step.
Generalized Fellner–Schall method
A more specialized algorithm is the generalized Fellner–Schall method, based on ideas from Fellner (Reference Fellner1986) and Schall (Reference Schall1991), and adapted for smoothing parameter selection in multidimensional generalized linear models by Rodríguez-Álvarez et al. (Reference Rodríguez-Álvarez, Lee, Kneib, Durban and Eilers2015). It may be summarized by the update formula:
 \begin{equation}{\lambda_{j}^{\text{next}} = \frac{\text{tr}(P_\lambda^{-}P_j) - \text{tr}[(X^TWX + P_\lambda)^{- 1}P_j]}{\hat{\boldsymbol{\beta}}_\lambda^TP_j\hat{\boldsymbol{\beta}}_\lambda}\lambda_j^{\text{current}} \quad \text{for} \quad j\in\{x, z\} .}\end{equation}
\begin{equation}{\lambda_{j}^{\text{next}} = \frac{\text{tr}(P_\lambda^{-}P_j) - \text{tr}[(X^TWX + P_\lambda)^{- 1}P_j]}{\hat{\boldsymbol{\beta}}_\lambda^TP_j\hat{\boldsymbol{\beta}}_\lambda}\lambda_j^{\text{current}} \quad \text{for} \quad j\in\{x, z\} .}\end{equation}
where for WH smoothing 
 $P_j = D_{n_{+},q}^{T}D_{n_{+},q}$
 in the one-dimensional case and
$P_j = D_{n_{+},q}^{T}D_{n_{+},q}$
 in the one-dimensional case and 
 $P_x$
 (resp.
$P_x$
 (resp. 
 $P_z$
) is the marginal penalization matrix
$P_z$
) is the marginal penalization matrix 
 $I_{n_z} \otimes D_{n_x,q_x}^{T}D_{n_x,q_x}$
 (resp.
$I_{n_z} \otimes D_{n_x,q_x}^{T}D_{n_x,q_x}$
 (resp. 
 $D_{n_z,q_z}^{T}D_{n_z,q_z} \otimes I_{n_x}$
) in the two-dimensional case.This update can be interpreted more intuitively as
$D_{n_z,q_z}^{T}D_{n_z,q_z} \otimes I_{n_x}$
) in the two-dimensional case.This update can be interpreted more intuitively as
 \begin{align*}\hat{\boldsymbol{\beta}}_\lambda^T(\lambda_{j}^{\text{next}}P_j)\hat{\boldsymbol{\beta}}_\lambda = \text{tr}[P_\lambda^{-}\lambda_j^{\text{current}}P_j - (X^TWX + P_\lambda)^{- 1}\lambda_j^{\text{current}}P_j]\end{align*}
\begin{align*}\hat{\boldsymbol{\beta}}_\lambda^T(\lambda_{j}^{\text{next}}P_j)\hat{\boldsymbol{\beta}}_\lambda = \text{tr}[P_\lambda^{-}\lambda_j^{\text{current}}P_j - (X^TWX + P_\lambda)^{- 1}\lambda_j^{\text{current}}P_j]\end{align*} 
where the right-hand side corresponds to an effective degrees of freedom associated with 
 $\lambda_j^{\text{current}}P_j$
, and the left-hand side to a squared error, normalized by the updated penalty precision. This makes
$\lambda_j^{\text{current}}P_j$
, and the left-hand side to a squared error, normalized by the updated penalty precision. This makes 
 $\lambda_j^{\text{next}}$
 resemble a REML-based estimator for the inverse variance. More details may be found in Rodríguez-Álvarez et al. (Reference Rodríguez-Álvarez, Durban, Lee and Eilers2019). This method:
$\lambda_j^{\text{next}}$
 resemble a REML-based estimator for the inverse variance. More details may be found in Rodríguez-Álvarez et al. (Reference Rodríguez-Álvarez, Durban, Lee and Eilers2019). This method:
- 
• May be combined with any of the three iteration nesting schemes (outer, performance, and alternate); 
- 
• Does not require explicit derivative computations; 
- 
• Converges towards an approximate maximum of LAML in the generalized case, since it ignores the dependence of W on  $\lambda$
; $\lambda$
;
- 
• Tends to take longer steps than EM-like algorithms (Dempster et al., Reference Dempster, Laird and Rubin1977), but shorter than Newton updates (see Wood and Fasiolo Reference Wood and Fasiolo2017, which also provides a thorough justification for the method). 
Newton algorithm
 A third option is the Newton method, which involves computing both the first and second derivatives of the marginal likelihood (or LAML) with respect to 
 $\ln\lambda$
. Full derivations are provided in Wood (Reference Wood2011), which covers a more general case. The method applies in both the normal and generalized cases, but in the latter, derivative expressions are more complex due to the dependence of W on
$\ln\lambda$
. Full derivations are provided in Wood (Reference Wood2011), which covers a more general case. The method applies in both the normal and generalized cases, but in the latter, derivative expressions are more complex due to the dependence of W on 
 $\lambda$
. The Newton algorithm is fast and precise and applicable to all three nesting strategies. The downside is the operational complexity associated with this method, especially in the generalized case.
$\lambda$
. The Newton algorithm is fast and precise and applicable to all three nesting strategies. The downside is the operational complexity associated with this method, especially in the generalized case.
5.6. Performance comparison
Sections 5.4 and 5.5 introduced eight combinations of nesting strategies and optimization algorithms applicable to the generalized WH smoothing. We now assess the potential convergence issues and approximation errors associated with each of them.
 This analysis is based on 100 replicates of the simulated annuity and LTC portfolios with 100,000 policyholders, as described in Section 4.2. For each replicate and each method combination, we compute the LAML at the selected smoothing parameters and compare this value to the (approximate) optimal value obtained across all combinations, denoted 
 $\hat{\lambda}_{\text{opt}}$
.
$\hat{\lambda}_{\text{opt}}$
.
To quantify the discrepancy, we define the relative error:
 \begin{equation}{\Delta(\lambda) = \frac{\ell^m_{\text{LAML}}(\hat{\lambda}_{\text{opt}}) - \ell^m_{\text{LAML}}(\lambda)}{\ell^m_{\text{LAML}}(\hat{\lambda}_{\text{opt}}) - \ell^m_{\text{LAML}}(\infty)}}\end{equation}
\begin{equation}{\Delta(\lambda) = \frac{\ell^m_{\text{LAML}}(\hat{\lambda}_{\text{opt}}) - \ell^m_{\text{LAML}}(\lambda)}{\ell^m_{\text{LAML}}(\hat{\lambda}_{\text{opt}}) - \ell^m_{\text{LAML}}(\infty)}}\end{equation}
where 
 $\ell^m_{\text{LAML}}(\infty)$
 corresponds to the LAML value when using an infinite smoothing penalty, that is, the overly smooth baseline. By construction,
$\ell^m_{\text{LAML}}(\infty)$
 corresponds to the LAML value when using an infinite smoothing penalty, that is, the overly smooth baseline. By construction, 
 $\Delta(\lambda) \ge 0$
 for all tested methods, with
$\Delta(\lambda) \ge 0$
 for all tested methods, with 
 $\Delta(\hat{\lambda}_{\text{opt}}) = 0$
 and
$\Delta(\hat{\lambda}_{\text{opt}}) = 0$
 and 
 $\Delta(\infty) = 1$
. In the two-dimensional setting, we also compare average computation time for each method across replicates.
$\Delta(\infty) = 1$
. In the two-dimensional setting, we also compare average computation time for each method across replicates.
 Results are summarised in Figure 4. The top panel displays the relative error 
 $\Delta(\lambda)$
 (capped below
$\Delta(\lambda)$
 (capped below 
 $10^{-10}$
 for readability). In the outer iteration framework:
$10^{-10}$
 for readability). In the outer iteration framework:
- 
• The Newton method consistently achieves relative errors below  $10^{-10}$
; $10^{-10}$
;
- 
• Brent and Nelder–Mead heuristics yield slightly higher errors but remain below  $10^{-7}$
; $10^{-7}$
;
- 
• The generalized Fellner–Schall method produces higher errors, but still below  $10^{-5}$
 and negligible in practice. $10^{-5}$
 and negligible in practice.

Figure 4. Comparison of the 8 nesting strategy and algorithm combinations in the 1D and 2D simulated cases. Top: relative error on the LAML (log scale). Bottom: improvement in average computation time compared to the Nelder-Mead + outer iteration reference.
 In the performance and alternate iteration frameworks, all methods yield similar errors, consistently below 
 $10^{-5}$
, with no convergence issues observed in any replicate. These findings suggest that method selection can be guided by practical considerations such as speed and implementation ease.
$10^{-5}$
, with no convergence issues observed in any replicate. These findings suggest that method selection can be guided by practical considerations such as speed and implementation ease.
The bottom panel of Figure 4 compares computation times (relative to the Nelder–Mead + outer iteration baseline):
- 
• In the outer iteration framework, the Newton method is the fastest, followed by the Fellner-Schall approach; 
- 
• All outer iteration variants are faster than their performance or alternate counterparts. 
This is unsurprising, as PIRLS steps are particularly lightweight in WH smoothing (where the model matrix is the identity). However, alternate strategies may remain useful for more general cases like those described in Section 6.4.
For reference, the average time required for a single iteration using Nelder–Mead in the 2D outer iteration case is approximately 1.68 s (versus 5 ms in the 1D case).
6. How to improve smoothing computational efficiency?
6.1. Motivation
WH smoothing is a full-rank method, meaning that it includes as many parameters as there are observation points. This feature ensures a high degree of flexibility, allowing the estimator to closely track the input signal when sufficient data are available. Formally, WH smoothing is asymptotically unbiased since:
 \begin{align*}\mathbb{E}(\hat{\mathbf{y}}) = (W + P_{\lambda})^{- 1}W\mathbb{E}(\mathbf{y}) \overset {m \rightarrow \infty}{\rightarrow} \mathbb{E}(\mathbf{y}),\end{align*}
\begin{align*}\mathbb{E}(\hat{\mathbf{y}}) = (W + P_{\lambda})^{- 1}W\mathbb{E}(\mathbf{y}) \overset {m \rightarrow \infty}{\rightarrow} \mathbb{E}(\mathbf{y}),\end{align*} 
where m denotes the number of observed individuals, which influences the matrix W.
However, this flexibility comes at a computational cost. Some key operations, such as (implicit) matrix inversions, scale cubically with the number of parameters. As a result, WH smoothing may become impractical with large number of combinations or when applied repeatedly (e.g. in simulations or bootstraps).
In one-dimensional settings, such as age-only models with annual discretization, the number of points rarely exceeds 100, and computation time is negligible. In contrast, two-dimensional use cases – common in insurance – can lead to substantially larger datasets:
- 
• Disability tables in France must cover entry ages from 18 to 61 and exit ages up to 62, resulting in  $(62 - 18) \times (62 - 18 + 1) / 2 = 990$
 combinations. $(62 - 18) \times (62 - 18 + 1) / 2 = 990$
 combinations.
- 
• Transition tables from short-term incapacity to disability involve entry ages from 18 to 67 and monthly durations from 0 to 36 months, yielding  $(67 - 18) \times 36 = 1,764$
 combinations. $(67 - 18) \times 36 = 1,764$
 combinations.
- 
• Long-term care (LTC) models require coverage over ages 50–100 and durations from 0 to 20 years, totalling  $(100 - 50) \times (20 - 0) = 1000$
 combinations (in practice, this number may be lower due to data sparsity). $(100 - 50) \times (20 - 0) = 1000$
 combinations (in practice, this number may be lower due to data sparsity).
In such settings, computing WH smoothing – especially when paired with smoothing parameter selection – can take several minutes per application, limiting usability in iterative contexts.
To address this limitation, we now analyse the computational complexity of the main steps in WH smoothing and smoothing parameter selection then introduce two complementary strategies to reduce computation time:
- 
• A structural optimization that exploits the specific form of WH penalization matrices; 
- 
• A reduced-rank approximation that lowers the number of parameters while minimizing bias compared to the full-rank estimator. 
Finally, we benchmark these strategies in terms of runtime and accuracy using 100 replicates of the mid-size annuity and LTC portfolios from Section 4.2. The structural optimization is compared to the original WH method, while the reduced-rank approximation is evaluated against the original method, the structural optimization, and a reference P-spline smoothing approach.
6.2. Practical computation for penalized smoothers
WH smoothing belongs to a broader family of penalized smoothing methods that produce estimates of the form:
 \begin{align*}\hat{\mathbf{y}}_\lambda = X\hat{\boldsymbol{\beta}}_{\lambda} \quad\text{where}\;\boldsymbol{\beta}_{\lambda}\;\text{solves}\: (X^TWX + P_{\lambda})\hat{\boldsymbol{\beta}}_\lambda = X^TW\mathbf{y}.\end{align*}
\begin{align*}\hat{\mathbf{y}}_\lambda = X\hat{\boldsymbol{\beta}}_{\lambda} \quad\text{where}\;\boldsymbol{\beta}_{\lambda}\;\text{solves}\: (X^TWX + P_{\lambda})\hat{\boldsymbol{\beta}}_\lambda = X^TW\mathbf{y}.\end{align*} 
 Here, X and 
 $P_\lambda$
 denote the model and penalization matrices of size
$P_\lambda$
 denote the model and penalization matrices of size 
 $n \times p$
 and
$n \times p$
 and 
 $p \times p$
, respectively, and W is a diagonal matrix of positive weights of size
$p \times p$
, respectively, and W is a diagonal matrix of positive weights of size 
 $n \times n$
.
$n \times n$
.
Computational steps
 The computation of 
 $\hat{\mathbf{y}}_\lambda$
 for a given
$\hat{\mathbf{y}}_\lambda$
 for a given 
 $\lambda$
 typically involves the following steps:
$\lambda$
 typically involves the following steps:
- 
1. Absorb the weights in the model matrix and observation vector, forming  $W^{1/2}X$
 and $W^{1/2}X$
 and $W^{1/2}\mathbf{y}$
, which requires $W^{1/2}\mathbf{y}$
, which requires $O(n^2)$
 and O(n) operations, respectively (multiplying each row of X and each element of $O(n^2)$
 and O(n) operations, respectively (multiplying each row of X and each element of $\mathbf{y}$
 by the corresponding element of $\mathbf{y}$
 by the corresponding element of $\mathbf{w}$
). $\mathbf{w}$
).
- 
2. Form the matrix  $P_\lambda$
. The cost of this operation is typically $P_\lambda$
. The cost of this operation is typically $O(p^2)$
 in the general case. $O(p^2)$
 in the general case.
- 
3. Form the matrix  $X^TWX$
 and the vector $X^TWX$
 and the vector $X^TW\mathbf{y}$
, which requires up to $X^TW\mathbf{y}$
, which requires up to $O(np^2)$
 and O(np) operations respectively. $O(np^2)$
 and O(np) operations respectively.
- 
4. Add together  $X^TWX$
 and $X^TWX$
 and $P_{\lambda}$
 which requires $P_{\lambda}$
 which requires $O(p^2)$
 operations in the general case. $O(p^2)$
 operations in the general case.
- 
5. Compute the Cholesky decomposition  $X^TWX + P_{\lambda} = R^TR$
 at a cost of $X^TWX + P_{\lambda} = R^TR$
 at a cost of $O(p^3)$
. $O(p^3)$
.
- 
6. Obtain  $\hat{\boldsymbol{\beta}}_\lambda$
 by forward-backward substitution, first solving $\hat{\boldsymbol{\beta}}_\lambda$
 by forward-backward substitution, first solving $R^T\mathbf{u} = X^TW\mathbf{y}$
 then $R^T\mathbf{u} = X^TW\mathbf{y}$
 then $R\hat{\boldsymbol{\beta}}_\lambda = \mathbf{u}$
 with an associated cost of $R\hat{\boldsymbol{\beta}}_\lambda = \mathbf{u}$
 with an associated cost of $O(p^2)$
 for each system. $O(p^2)$
 for each system.
- 
7. Compute  $\hat{\mathbf{y}}_\lambda = X\hat{\boldsymbol{\beta}}_\lambda$
 at a cost of O(np). $\hat{\mathbf{y}}_\lambda = X\hat{\boldsymbol{\beta}}_\lambda$
 at a cost of O(np).
 As an alternative to Cholesky, QR decomposition may be used for greater numerical stability (see Golub and Van Loan, Reference Golub and Van Loan2013). It applies to the weighted design matrix stacked with a matrix B such that 
 $B^T B = P_\lambda$
.
$B^T B = P_\lambda$
.
Simplifications for WH smoothing
 In WH smoothing, 
 $X = I_n$
, which simplifies computations:
$X = I_n$
, which simplifies computations:
- 
• Step 7 is unnecessary, as well as the first part of step 3. 
- 
•  $X^T W \mathbf{y} = W \mathbf{y}$
 (step 3) is computed in O(n) by multiplying $X^T W \mathbf{y} = W \mathbf{y}$
 (step 3) is computed in O(n) by multiplying $\mathbf{w}$
 and $\mathbf{w}$
 and $\mathbf{y}$
. $\mathbf{y}$
.
- 
•  $X^T W X + P_\lambda = W + P_\lambda$
 (step 4) is also computed in O(n) by adding the vector $X^T W X + P_\lambda = W + P_\lambda$
 (step 4) is also computed in O(n) by adding the vector $\mathbf{w}$
 to the leading diagonal of $\mathbf{w}$
 to the leading diagonal of $P_{\lambda}$
. $P_{\lambda}$
.
Generalized WH smoothing with outer iteration
 When using the outer iteration approach (see Section 5), each candidate 
 $\lambda$
 requires a full PIRLS cycle to estimate
$\lambda$
 requires a full PIRLS cycle to estimate 
 $\hat{\boldsymbol{\theta}}_\lambda$
, with new working vector
$\hat{\boldsymbol{\theta}}_\lambda$
, with new working vector 
 $\hat{\mathbf{z}}^k_\lambda$
 and weight matrix
$\hat{\mathbf{z}}^k_\lambda$
 and weight matrix 
 $W^k_\lambda$
. Steps 1–6 above are repeated until convergence of the PIRLS algorithm, which may be assessed by monitoring the changes in penalized deviance. The deviance may be computed at a O(n) cost. For penalization based on differences matrices, computation of
$W^k_\lambda$
. Steps 1–6 above are repeated until convergence of the PIRLS algorithm, which may be assessed by monitoring the changes in penalized deviance. The deviance may be computed at a O(n) cost. For penalization based on differences matrices, computation of 
 $\hat{\boldsymbol{\beta}}_\lambda{\,}^TP_\lambda\hat{\boldsymbol{\beta}}_\lambda$
 should be based on the expression of
$\hat{\boldsymbol{\beta}}_\lambda{\,}^TP_\lambda\hat{\boldsymbol{\beta}}_\lambda$
 should be based on the expression of 
 $R_{\lambda,q}$
 provided in Section 1.1 for an associated cost of O(qp). In addition, PIRLS iterations for each new
$R_{\lambda,q}$
 provided in Section 1.1 for an associated cost of O(qp). In addition, PIRLS iterations for each new 
 $\lambda$
 can be initialized using the previous estimate of
$\lambda$
 can be initialized using the previous estimate of 
 $\hat{\mathbf{y}}_\lambda$
 for faster convergence.
$\hat{\mathbf{y}}_\lambda$
 for faster convergence.
LAML computation
Once the deviance is known, computing the marginal likelihood/LAML also requires:
- 
•  $\ln|X^TWX + P_{\lambda}|$
, which may be computed at a cost of O(p) from the leading diagonal of the Cholesky/QR factor R computed at step 5 in the derivation of $\ln|X^TWX + P_{\lambda}|$
, which may be computed at a cost of O(p) from the leading diagonal of the Cholesky/QR factor R computed at step 5 in the derivation of $\hat{\mathbf{y}}_\lambda$
. $\hat{\mathbf{y}}_\lambda$
.
- 
•  $\ln|P_{\lambda}|_{+}$
, which may be obtained from the eigenvalues of the penalization matrix: Section 6.4
 shows that in the two-dimensional case, it can be computed via eigendecomposition of $\ln|P_{\lambda}|_{+}$
, which may be obtained from the eigenvalues of the penalization matrix: Section 6.4
 shows that in the two-dimensional case, it can be computed via eigendecomposition of $D_{p_x,q_x}^T D_{p_x,q_x}$
 and $D_{p_x,q_x}^T D_{p_x,q_x}$
 and $D_{p_z,q_z}^T D_{p_z,q_z}$
, performed only once, at a $D_{p_z,q_z}^T D_{p_z,q_z}$
, performed only once, at a $O(p_x^3 + p_z^3)$
 cost. Computation of $O(p_x^3 + p_z^3)$
 cost. Computation of $\ln|P_{\lambda}|_{+}$
 then only requires scaling the eigenvalues for a cost of O(p). $\ln|P_{\lambda}|_{+}$
 then only requires scaling the eigenvalues for a cost of O(p).
Algorithm-specific computations
Brent and Nelder–Mead require only marginal likelihood/LAML evaluations.
The generalized Fellner–Schall algorithm relies on the update formula of Equation (5.1):
 \begin{align*}\lambda_{j}^{\text{next}} = \frac{\text{tr}(P_\lambda^{-}P_j) - \text{tr}[(X^TWX + P_\lambda)^{- 1}P_j]}{\hat{\boldsymbol{\beta}}_\lambda^TP_j\hat{\boldsymbol{\beta}}_\lambda}\lambda_j^{\text{current}} \quad \text{for} \quad j\in\{x, z\} .\end{align*}
\begin{align*}\lambda_{j}^{\text{next}} = \frac{\text{tr}(P_\lambda^{-}P_j) - \text{tr}[(X^TWX + P_\lambda)^{- 1}P_j]}{\hat{\boldsymbol{\beta}}_\lambda^TP_j\hat{\boldsymbol{\beta}}_\lambda}\lambda_j^{\text{current}} \quad \text{for} \quad j\in\{x, z\} .\end{align*} 
 Evaluation of 
 $\text{tr}(P_\lambda^{-}P_j)$
 does not require any matrix product. In the one-dimensional case it is simply
$\text{tr}(P_\lambda^{-}P_j)$
 does not require any matrix product. In the one-dimensional case it is simply 
 $(p - q)/\lambda$
 while in the two-dimensional case it may be obtained directly at a O(p) cost using the eigenvalues of the aforementioned penalization matrices
$(p - q)/\lambda$
 while in the two-dimensional case it may be obtained directly at a O(p) cost using the eigenvalues of the aforementioned penalization matrices 
 $P_j$
. Evaluation of
$P_j$
. Evaluation of 
 $\text{tr}[(X^TWX + P_\lambda)^{- 1}P_j]$
 may use the identity
$\text{tr}[(X^TWX + P_\lambda)^{- 1}P_j]$
 may use the identity 
 $\text{tr}(AB) = \sum_{i,j} A_{ij}B_{ji}$
 and therefore be computed at an
$\text{tr}(AB) = \sum_{i,j} A_{ij}B_{ji}$
 and therefore be computed at an 
 $O(p^2)$
 cost if the matrix
$O(p^2)$
 cost if the matrix 
 $(X^TWX + P_\lambda)^{- 1}$
 and
$(X^TWX + P_\lambda)^{- 1}$
 and 
 $P_j$
 are available. Computation of
$P_j$
 are available. Computation of 
 $V = (X^TWX + P_\lambda)^{- 1}$
 is done by first solving for the inverse
$V = (X^TWX + P_\lambda)^{- 1}$
 is done by first solving for the inverse 
 $K = R^{- 1}$
 of the Cholesky/QR factor and then forming V as
$K = R^{- 1}$
 of the Cholesky/QR factor and then forming V as 
 $KK^T$
. Both operations have a
$KK^T$
. Both operations have a 
 $O(p^3)$
 cost.
$O(p^3)$
 cost.
 Newton method also requires computation of V, as well as several matrix products involving the penalization matrix 
 $P_j$
. For example, the second derivatives of marginal likelihood require
$P_j$
. For example, the second derivatives of marginal likelihood require 
 $\text{tr}[VP_jVP_k]$
 terms, and the second derivatives of marginal likelihood require
$\text{tr}[VP_jVP_k]$
 terms, and the second derivatives of marginal likelihood require 
 $\text{tr}[V(X(\partial W/\partial \rho_j)X + P_j)V(X(\partial W/\partial \rho_k)X + P_k)]$
 terms where
$\text{tr}[V(X(\partial W/\partial \rho_j)X + P_j)V(X(\partial W/\partial \rho_k)X + P_k)]$
 terms where 
 $\rho_j = \ln(\lambda_j)$
,
$\rho_j = \ln(\lambda_j)$
, 
 $j = k = x$
 in the one-dimensional case and
$j = k = x$
 in the one-dimensional case and 
 $\{j,k\} \in \{x, z\}$
 in the two-dimensional case. The identity
$\{j,k\} \in \{x, z\}$
 in the two-dimensional case. The identity 
 $\text{tr}(AB) = \sum_{i,j} A_{ij}B_{ji}$
 can also be used in this case but matrix products
$\text{tr}(AB) = \sum_{i,j} A_{ij}B_{ji}$
 can also be used in this case but matrix products 
 $VP_j$
 or
$VP_j$
 or 
 $V[X(\partial W/\partial \rho_j)X + P_j]$
 still need to be explicitly computed, for a respective cost of
$V[X(\partial W/\partial \rho_j)X + P_j]$
 still need to be explicitly computed, for a respective cost of 
 $O(p^3)$
 and
$O(p^3)$
 and 
 $O(n p^2)$
 each.
$O(n p^2)$
 each.
These additional computations make Newton updates more expensive than generalized Fellner–Schall updates, but they generally yield faster convergence and higher precision (see Section 5.6).
6.3. Banded optimization for WH smoothing
 We now consider how to exploit the banded structure of the penalization matrix in WH smoothing. This structure enables significant computational gains, especially when dealing with large number of observations. Throughout this section, we assume 
 $X = I_n$
 and
$X = I_n$
 and 
 $p = n$
, which holds for both the original and generalized WH smoothing.
$p = n$
, which holds for both the original and generalized WH smoothing.
One-dimensional case
 In one dimension, the penalization matrix takes the form: 
 $P_\lambda = \lambda D_{n,q}^TD_{n,q}$
. This matrix is symmetric and banded with bandwidth q. As a consequence:
$P_\lambda = \lambda D_{n,q}^TD_{n,q}$
. This matrix is symmetric and banded with bandwidth q. As a consequence:
- 
1. Compact storage:  $P_\lambda$
 can be stored in a compact form with dimensions $P_\lambda$
 can be stored in a compact form with dimensions $n \times (q + 1)$
 and updated for new $n \times (q + 1)$
 and updated for new $\lambda$
 at a cost of O(qn). The matrix $\lambda$
 at a cost of O(qn). The matrix $W + P_\lambda$
 shares this structure. $W + P_\lambda$
 shares this structure.
- 
2. Efficient Cholesky decomposition: the Cholesky factor R of  $W + P_\lambda$
 can be computed in $W + P_\lambda$
 can be computed in $O(q^2 n)$
 instead of $O(q^2 n)$
 instead of $O(n^3)$
, and R is also banded with the same bandwidth. $O(n^3)$
, and R is also banded with the same bandwidth.
- 
3. Efficient back-substitution: computing  $\hat{\mathbf{y}}_\lambda = \hat{\boldsymbol{\beta}}_\lambda$
 using R is now O(qn) instead of $\hat{\mathbf{y}}_\lambda = \hat{\boldsymbol{\beta}}_\lambda$
 using R is now O(qn) instead of $O(n^2)$
. $O(n^2)$
.
- 
4. Efficient inversion of R: the inverse  $K = R^{-1}$
 costs $K = R^{-1}$
 costs $O(qn^2)$
, an improvement over the $O(qn^2)$
, an improvement over the $O(n^3)$
 cost for dense matrices. $O(n^3)$
 cost for dense matrices.
 However, K is a dense triangular matrix, meaning the computation of 
 $V = K K^T$
 remains a
$V = K K^T$
 remains a 
 $O(n^3)$
 operation. Fortunately, the generalized Fellner–Schall algorithm only requires the diagonal of V, which can be obtained from K in
$O(n^3)$
 operation. Fortunately, the generalized Fellner–Schall algorithm only requires the diagonal of V, which can be obtained from K in 
 $O(n^2)$
, since:
$O(n^2)$
, since:
 \begin{align*}\lambda_j[\text{tr}(P_\lambda^{-}P_j) - \text{tr}(VP_j)] = (n - q) - (n - \text{tr}[VW]) = \textbf{diag}(V)^T\mathbf{w} - q.\end{align*}
\begin{align*}\lambda_j[\text{tr}(P_\lambda^{-}P_j) - \text{tr}(VP_j)] = (n - q) - (n - \text{tr}[VW]) = \textbf{diag}(V)^T\mathbf{w} - q.\end{align*} 
 Furthermore, the Newton algorithm benefits as well: the trace terms involving 
 $V P_\lambda$
 or
$V P_\lambda$
 or 
 $V (\partial W / \partial \ln\lambda + P_\lambda)$
 are based on banded matrices, making those products computable in
$V (\partial W / \partial \ln\lambda + P_\lambda)$
 are based on banded matrices, making those products computable in 
 $O(qn^2)$
 instead of
$O(qn^2)$
 instead of 
 $O(n^3)$
.
$O(n^3)$
.
Two-dimensional case
 In two dimensions, the penalization matrix is 
 $P_\lambda = \lambda_xP_x + \lambda_zP_z$
 where:
$P_\lambda = \lambda_xP_x + \lambda_zP_z$
 where:
- 
•  $P_x = I_{n_z} \otimes D_{n_x,q_x}^{T}D_{n_x,q_x}$ $P_x = I_{n_z} \otimes D_{n_x,q_x}^{T}D_{n_x,q_x}$
- 
•  $P_z = D_{n_z,q_z}^{T}D_{n_z,q_z} \otimes I_{n_x}$
. $P_z = D_{n_z,q_z}^{T}D_{n_z,q_z} \otimes I_{n_x}$
.
This structure has the following key properties:
- 
• Both matrices are made of  $n_z \times n_z$
 square blocks of dimensions $n_z \times n_z$
 square blocks of dimensions $n_x \times n_x$
 each. $n_x \times n_x$
 each.
- 
•  $P_x$
 is block diagonal with $P_x$
 is block diagonal with $n_z$
 identical $n_z$
 identical $n_x \times n_x$
 banded blocks (bandwidth $n_x \times n_x$
 banded blocks (bandwidth $q_x$
). $q_x$
).
- 
•  $P_z$
 is block banded with bandwidth $P_z$
 is block banded with bandwidth $q_z$
. Each block is a scaled identity matrix. $q_z$
. Each block is a scaled identity matrix.
- 
• As a whole,  $P_z$
 and $P_z$
 and $P_\lambda$
 may be viewed as banded matrices with bandwidth $P_\lambda$
 may be viewed as banded matrices with bandwidth $q = q_z \times n_x$
. $q = q_z \times n_x$
.
 This implies that all statements made in the one-dimensional case carry over to the two-dimensional case with this value of q. It also suggests that, if 
 $(q_x + 1) / n_x \lt (q_z + 1) / n_z$
, dimensions x and z should be permuted before applying WH smoothing for maximal efficiency.
$(q_x + 1) / n_x \lt (q_z + 1) / n_z$
, dimensions x and z should be permuted before applying WH smoothing for maximal efficiency.
 As in the one-dimensional case, the generalized Fellner–Schall update formula does not require the full computation of V. Indeed, to compute 
 $\text{tr}[VP_x]$
 and
$\text{tr}[VP_x]$
 and 
 $\text{tr}[VP_z]$
, we only need access to elements of V for which either
$\text{tr}[VP_z]$
, we only need access to elements of V for which either 
 $P_x$
 or
$P_x$
 or 
 $P_z$
 is non-zero. From what precedes,
$P_z$
 is non-zero. From what precedes, 
 $P_x$
 has bandwidth
$P_x$
 has bandwidth 
 $q_x$
 while
$q_x$
 while 
 $P_z$
 only contains
$P_z$
 only contains 
 $q_z$
 non-zero diagonals on each side of the leading diagonal. As V is symmetric, we only need to compute
$q_z$
 non-zero diagonals on each side of the leading diagonal. As V is symmetric, we only need to compute 
 $q_x + q_z + 1$
 diagonals of V for an associated cost of
$q_x + q_z + 1$
 diagonals of V for an associated cost of 
 $O([q_x + q_z]n^2)$
 instead of
$O([q_x + q_z]n^2)$
 instead of 
 $O(n^3)$
.
$O(n^3)$
.
 With the Newton method, while computing 
 $V = K K^T$
 still incurs a
$V = K K^T$
 still incurs a 
 $O(n^3)$
 cost, matrix multiplications like
$O(n^3)$
 cost, matrix multiplications like 
 $V P_j$
 or
$V P_j$
 or 
 $V (\partial W / \partial \rho_j + P_j)$
 can be performed block-wise. It may easily be checked, for example, that the products
$V (\partial W / \partial \rho_j + P_j)$
 can be performed block-wise. It may easily be checked, for example, that the products 
 $VP_x$
 and
$VP_x$
 and 
 $V(\partial W/\partial \rho_x + P_x)$
 have a cost of
$V(\partial W/\partial \rho_x + P_x)$
 have a cost of 
 $O(q_xn^2)$
, while the products
$O(q_xn^2)$
, while the products 
 $VP_z$
 and
$VP_z$
 and 
 $V(\partial W/\partial \rho_z + P_z)$
 have a cost of
$V(\partial W/\partial \rho_z + P_z)$
 have a cost of 
 $O(q_zn^2)$
.
$O(q_zn^2)$
.
Summary of complexity gains
 Thanks to the banded structure, most computations involved in WH smoothing can be accelerated by a factor of 
 $n / (q + 1)$
 in the 1D case and
$n / (q + 1)$
 in the 1D case and 
 $\max(n_x / (q_z + 1), n_z / (q_x + 1))$
 in the 2D case. There are 3 notable exceptions:
$\max(n_x / (q_z + 1), n_z / (q_x + 1))$
 in the 2D case. There are 3 notable exceptions:
- 
• Cholesky decomposition is improved from  $O(n^3)$
 to $O(n^3)$
 to $O(q^2 n)$
 – a quadratic speed-up. $O(q^2 n)$
 – a quadratic speed-up.
- 
• Computation of  $V = K K^T$
 remains $V = K K^T$
 remains $O(n^3)$
. $O(n^3)$
.
- 
• Some matrix products required by Newton method get a full  $n / (q_z + 1)$
 or $n / (q_z + 1)$
 or $n / (q_z + 1)$
 speed-up in the 2D case. $n / (q_z + 1)$
 speed-up in the 2D case.
Table 3 summarizes theoretical complexities across different frameworks, including a typical generalized additive model framework for which the penalization matrix is diagonal. This last framework is used by the rank-reduced WH smoothing approach introduced next, as well as the P-spline alternative used for comparison.
Table 3. Compared theoretical leading-order costs associated with the key steps in smoothing computations for several frameworks. All cells should be read as O(…).

Empirical gains
Figure 5 compares actual computation times of WH smoothing (two-dimensional, outer iteration), showing that adapting the implementation to exploit banded structures results in large speed gains:
- 
• The Nelder–Mead method benefits the most, with a 25  $\times$
 speedup compared to dense computation. $\times$
 speedup compared to dense computation.
- 
• Newton and Fellner–Schall methods see 6.6  $\times$
 and 10 $\times$
 and 10 $\times$
 improvements, respectively, making them fall behind the Nelder–Mead method. $\times$
 improvements, respectively, making them fall behind the Nelder–Mead method.

Figure 5. Computation time comparison for 2D generalized WH smoothing with outer iteration. The speed-up factor is computed relative to the original dense method using the Nelder–Mead algorithm.
 As a final advantage, Brent and Nelder–Mead heuristic methods rely solely on banded matrices that can be stored as compact matrices of dimensions 
 $(q + 1)\times n$
, adding further efficiency.
$(q + 1)\times n$
, adding further efficiency.
6.4. Natural parameterization and rank reduction of WH smoothing
Demmler and Reinsch (Reference Demmler and Reinsch1975) proposed a natural parameterization for penalized smoothers using the eigendecomposition of the penalization matrix. This provides both an intuitive interpretation of the smoothing mechanism and a foundation for dimension reduction via rank-restricted estimation.
One-dimensional case
 In one dimension, let 
 $D_{n,q}^T D_{n,q} = U \Sigma U^T$
 be the eigendecomposition of the penalty matrix, where U is orthogonal and
$D_{n,q}^T D_{n,q} = U \Sigma U^T$
 be the eigendecomposition of the penalty matrix, where U is orthogonal and 
 $\Sigma$
 diagonal with non-negative eigenvalues. A change of variable
$\Sigma$
 diagonal with non-negative eigenvalues. A change of variable 
 $\boldsymbol{\theta} = U\boldsymbol{\beta}$
 transforms the WH optimization into:
$\boldsymbol{\theta} = U\boldsymbol{\beta}$
 transforms the WH optimization into:
 \begin{align*}\hat{\mathbf{y}} = U\hat{\boldsymbol{\beta}} \quad\text{where}\quad \hat{\boldsymbol{\beta}} = \underset{\boldsymbol{\beta}}{\text{argmin}}\left\lbrace (\mathbf{y} - U\boldsymbol{\beta})^{T}W(\mathbf{y} - U\boldsymbol{\beta}) + \lambda\boldsymbol{\beta}^{T}\Sigma\boldsymbol{\beta}\right\rbrace\end{align*}
\begin{align*}\hat{\mathbf{y}} = U\hat{\boldsymbol{\beta}} \quad\text{where}\quad \hat{\boldsymbol{\beta}} = \underset{\boldsymbol{\beta}}{\text{argmin}}\left\lbrace (\mathbf{y} - U\boldsymbol{\beta})^{T}W(\mathbf{y} - U\boldsymbol{\beta}) + \lambda\boldsymbol{\beta}^{T}\Sigma\boldsymbol{\beta}\right\rbrace\end{align*} 
yielding the solution:
 \begin{align*}\hat{\mathbf{y}} = U(U^TWU + S_{\lambda})^{-1}U^TW\mathbf{y} \quad \text{where} \quad S_{\lambda} = \lambda\Sigma.\end{align*}
\begin{align*}\hat{\mathbf{y}} = U(U^TWU + S_{\lambda})^{-1}U^TW\mathbf{y} \quad \text{where} \quad S_{\lambda} = \lambda\Sigma.\end{align*} 
This formulation shows that WH smoothing decomposes the signal into eigenvector components and attenuates each according to the associated eigenvalue – the higher the eigenvalue, the stronger the shrinkage.
We refer to Section E of the Online Supplementary Materials for graphical illustrations of:
- 
• the basis eigenvectors of  $D_{n,q}^T D_{n,q}$
; $D_{n,q}^T D_{n,q}$
;
- 
• the evolution of their effective degrees of freedom under smoothing. 
These figures show that only the first few components retain substantial degrees of freedom under moderate smoothing, motivating dimensionality reduction.
Two-dimensional case
In two dimensions, the penalization matrix takes the form:
 \begin{align*}P_{\lambda} = \lambda_x I_{n_z} \otimes D_{n_x,q_x}^{T}D_{n_x,q_x} + \lambda_z D_{n_z,q_z}^{T}D_{n_z,q_z} \otimes I_{n_x},\end{align*}
\begin{align*}P_{\lambda} = \lambda_x I_{n_z} \otimes D_{n_x,q_x}^{T}D_{n_x,q_x} + \lambda_z D_{n_z,q_z}^{T}D_{n_z,q_z} \otimes I_{n_x},\end{align*} 
with eigendecompositions 
 $D_{n_x,q_x}^T D_{n_x,q_x} = U_x \Sigma_x U_x^T$
 and
$D_{n_x,q_x}^T D_{n_x,q_x} = U_x \Sigma_x U_x^T$
 and 
 $D_{n_z,q_z}^T D_{n_z,q_z} = U_z \Sigma_z U_z^T$
.
$D_{n_z,q_z}^T D_{n_z,q_z} = U_z \Sigma_z U_z^T$
.
 Define 
 $U = U_z \otimes U_x$
, and
$U = U_z \otimes U_x$
, and 
 $\boldsymbol{\theta} = U \boldsymbol{\beta}$
. Then the WH estimate becomes:
$\boldsymbol{\theta} = U \boldsymbol{\beta}$
. Then the WH estimate becomes:
 \begin{align*}\hat{\mathbf{y}} = U(U^TWU + S_{\lambda})^{- 1}U^TW\mathbf{y} \quad \text{where} \quad S_{\lambda} = \lambda_x I_{n_z} \otimes \Sigma_x + \lambda_z \Sigma_z \otimes I_{n_x}.\end{align*}
\begin{align*}\hat{\mathbf{y}} = U(U^TWU + S_{\lambda})^{- 1}U^TW\mathbf{y} \quad \text{where} \quad S_{\lambda} = \lambda_x I_{n_z} \otimes \Sigma_x + \lambda_z \Sigma_z \otimes I_{n_x}.\end{align*} 
As in the one-dimensional case, this representation reveals how smoothing operates via coordinate-wise shrinkage in the eigenbasis. Section E of the Online Supplementary Materials displays the corresponding per-parameter effective degrees of freedom.
Rank reduction strategy
 Inspection of the effective degrees of freedom reveals that many components are heavily shrunk, especially those associated with high eigenvalues. This suggests reducing the dimension by keeping only the 
 $p \lt n$
 components with the lowest eigenvalues.
$p \lt n$
 components with the lowest eigenvalues.
In the one-dimensional case, the reduced-rank approximation is
 \begin{align*}\hat{\mathbf{y}}_p = U_p (U_p^TWU_p + \lambda\Sigma_p)^{- 1}U_p^TW\mathbf{y}\end{align*}
\begin{align*}\hat{\mathbf{y}}_p = U_p (U_p^TWU_p + \lambda\Sigma_p)^{- 1}U_p^TW\mathbf{y}\end{align*} 
where 
 $U_p$
 and
$U_p$
 and 
 $\Sigma_p$
 consist of the first p eigenvectors and their corresponding eigenvalues, respectively.
$\Sigma_p$
 consist of the first p eigenvectors and their corresponding eigenvalues, respectively.
 In the two-dimensional case, we retain 
 $p_x$
 and
$p_x$
 and 
 $p_z$
 eigenvectors in each dimension and use:
$p_z$
 eigenvectors in each dimension and use:
 \begin{align*}\hat{\mathbf{y}}_{p_x,p_z} = U_{p_x,p_z} (U_{p_x,p_z}^TWU_{p_x,p_z} + \lambda_x I_{p_z} \otimes \Sigma_{x,p_x} + \lambda_z \Sigma_{z,p_z} \otimes I_{p_x})^{- 1}U_{p_x,p_z}^TW\mathbf{y}\end{align*}
\begin{align*}\hat{\mathbf{y}}_{p_x,p_z} = U_{p_x,p_z} (U_{p_x,p_z}^TWU_{p_x,p_z} + \lambda_x I_{p_z} \otimes \Sigma_{x,p_x} + \lambda_z \Sigma_{z,p_z} \otimes I_{p_x})^{- 1}U_{p_x,p_z}^TW\mathbf{y}\end{align*} 
with 
 $U_{p_x,p_z} = U_{z,p_z} \otimes U_{x,p_x}$
. In that case, given a target number of parameters
$U_{p_x,p_z} = U_{z,p_z} \otimes U_{x,p_x}$
. In that case, given a target number of parameters 
 $p_{\text{max}}$
, we propose selecting
$p_{\text{max}}$
, we propose selecting 
 $(p_x, p_z)$
 such that
$(p_x, p_z)$
 such that 
 $p_x p_z \le p_{\text{max}}$
 and
$p_x p_z \le p_{\text{max}}$
 and 
 $p_x / n_x \approx p_z / n_z$
 using the rule:
$p_x / n_x \approx p_z / n_z$
 using the rule:
 \begin{align*}\kappa = \sqrt{p_{\text{max}} /n_x n_z},\quad p_x = \lfloor \text{min}(\kappa,1)n_x\rfloor,\quad p_z = \lfloor \text{min}(\kappa,1)n_z\rfloor.\end{align*}
\begin{align*}\kappa = \sqrt{p_{\text{max}} /n_x n_z},\quad p_x = \lfloor \text{min}(\kappa,1)n_x\rfloor,\quad p_z = \lfloor \text{min}(\kappa,1)n_z\rfloor.\end{align*} 
Adaptations for generalized WH smoothing follow by replacing 
 $(\mathbf{y}, W)$
 with
$(\mathbf{y}, W)$
 with 
 $(\mathbf{z}_k, W_k)$
 in the above expressions.
$(\mathbf{z}_k, W_k)$
 in the above expressions.
Efficient computation via GLAM
 Currie et al. (Reference Currie, Durban and Eilers2006) propose a general framework, GLAM, that exploits Kronecker structure for efficient computations. In our context, the model matrix 
 $U_{p_x,p_z}$
 inherits a Kronecker product form, allowing operations that rely on this matrix to be executed dimension-wise without explicit construction of the full matrix. This significantly reduces memory use and computation time in the two-dimensional rank-reduced WH framework.
$U_{p_x,p_z}$
 inherits a Kronecker product form, allowing operations that rely on this matrix to be executed dimension-wise without explicit construction of the full matrix. This significantly reduces memory use and computation time in the two-dimensional rank-reduced WH framework.
Impact of using the rank-reduced basis
We now evaluate the impact of the rank-reduced WH basis introduced in Section 6.4 in terms of both smoothing accuracy and computational speed. For context, results are compared against those obtained using P-spline smoothing with the same number of basis functions.
 To ensure a fair comparison, both approaches were implemented in the same computational framework, including the use of GLAM in the two-dimensional case – only the structure of the basis (and hence the model matrix) differs. The penalty structure and the unpenalized fixed effects (polynomials of degree 
 $q - 1$
) are identical.
$q - 1$
) are identical.
 In addition to the full basis of size 450 (
 $30 \times 15$
), three reduced basis of respective size 288 (
$30 \times 15$
), three reduced basis of respective size 288 (
 $24 \times 12$
), 128 (
$24 \times 12$
), 128 (
 $16 \times 8$
) and 32 (
$16 \times 8$
) and 32 (
 $8 \times 4$
) were considered.
$8 \times 4$
) were considered.
As in Section 5.6, accuracy is assessed using the relative LAML error defined in Equation (5.2). Note, however, that since both reduced-rank and P-spline smoothers rely on different bases and penalization matrices, their LAML expressions are different from the one used for full-rank WH smoothing. Hence, a reduced model can exhibit a higher LAML than the full-rank version at its selected smoothing parameter.
Figure 6 summarises the average speed-up achieved by both the reduced-rank and P-spline smoothers compared to the full-rank WH smoothing. As the number of retained parameters decreases, computation time drops substantially. Compared to the full-rank WH smoothing (unoptimized):
- 
• the 128-parameter basis achieves an 88  $\times$
 speed-up; $\times$
 speed-up;
- 
• the 32-parameter basis achieves up to 256  $\times$
 faster computation. $\times$
 faster computation.

Figure 6. Computation speed improvement from WH smoothing with a reduced-rank basis (solid lines) or P-spline basis (dotted lines), relative to unoptimized full-rank WH smoothing, as a function of basis size.
The alternate iteration and performance iteration strategies outperform the outer iteration in the reduced setting, primarily because model matrix construction becomes the new computational bottleneck – even with the use of the GLAM framework. In this context, the Newton algorithm combined with alternate iteration proves to be the most efficient, with the generalized Fellner–Schall update being nearly as competitive for smaller bases.
The gains in computational speed come with a moderate tradeoff in estimation accuracy. As shown in Figure 7, the relative LAML errors remain small:
- 
• For the 128-parameter basis, the average error is just 0.82%. 
- 
• For the 32-parameter basis, it rises to 2.26%. 

Figure 7. Relative LAML error of WH smoothing with a reduced-rank basis (solid lines) or a P-spline basis (dotted lines), with respect to unoptimized full-rank WH smoothing, as a function of basis size.
Across all sizes, the reduced-rank WH smoother slightly outperforms the P-spline smoother in terms of LAML error, confirming its effectiveness as a principled dimension reduction strategy.
7 How to extrapolate the smoothing?
Semi-parametric methods such as P-splines and WH smoothing naturally allow for extrapolation – that is, predicting values outside the range of the original data. Extrapolation is handled by solving an extended smoothing problem where extrapolated positions are associated with zero-weight observations.
However, in the two-dimensional case, extrapolation must be performed carefully: constraints are needed to ensure that the extrapolated solution remains consistent with the original smoothing result over the observed data. Following the approach introduced by Carballo et al. (Reference Carballo, Durban and Lee2021) for P-splines, we now extend WH smoothing to support extrapolation while also enabling the construction of credibility intervals that capture uncertainty both inside and outside the original observation domain.
7.1. Defining the extrapolation of the smoothing
 Let 
 $\hat{\mathbf{y}}$
 be the WH smoothing result obtained from an observation vector
$\hat{\mathbf{y}}$
 be the WH smoothing result obtained from an observation vector 
 $\mathbf{y}$
 defined over positions
$\mathbf{y}$
 defined over positions 
 $\mathbf{x}$
 (in 1D) or
$\mathbf{x}$
 (in 1D) or 
 $(\mathbf{x}, \mathbf{z})$
 (in 2D). We wish to extend predictions to a larger domain
$(\mathbf{x}, \mathbf{z})$
 (in 2D). We wish to extend predictions to a larger domain 
 $\mathbf{x}+$
 (or
$\mathbf{x}+$
 (or 
 $(\mathbf{x}+, \mathbf{z}+)$
), with
$(\mathbf{x}+, \mathbf{z}+)$
), with 
 $\mathbf{x} \subset \mathbf{x}+$
 and similarly for
$\mathbf{x} \subset \mathbf{x}+$
 and similarly for 
 $\mathbf{z}$
.
$\mathbf{z}$
.
 To preserve WH smoothing’s requirement for evenly spaced points, we assume that 
 $\mathbf{x}+$
 and
$\mathbf{x}+$
 and 
 $\mathbf{z}+$
 are sequences of consecutive integers. Let
$\mathbf{z}+$
 are sequences of consecutive integers. Let 
 $n_+$
 be the length of
$n_+$
 be the length of 
 $\mathbf{x}_+$
 in the on-dimensional case. In the two-dimensional case, let
$\mathbf{x}_+$
 in the on-dimensional case. In the two-dimensional case, let 
 $n_{x+}$
 and
$n_{x+}$
 and 
 $n_{z+}$
 be the lengths of
$n_{z+}$
 be the lengths of 
 $\mathbf{x}+$
 and
$\mathbf{x}+$
 and 
 $\mathbf{z}+$
 and note
$\mathbf{z}+$
 and note 
 $n_+ = n_{x+} \times n_{z+}$
.
$n_+ = n_{x+} \times n_{z+}$
.
 We define matrices 
 $C_x$
 and
$C_x$
 and 
 $C_z$
 such that each extracts the indices of the original data from the larger domain. Specifically:
$C_z$
 such that each extracts the indices of the original data from the larger domain. Specifically: 
 $C_j = (O \mid I_{n_j} \mid O)$
, where
$C_j = (O \mid I_{n_j} \mid O)$
, where 
 $j \in \{x, z\}$
 and
$j \in \{x, z\}$
 and 
 $I_{n_j}$
 is an identity matrix aligned with the observed positions. Define the matrix C as
$I_{n_j}$
 is an identity matrix aligned with the observed positions. Define the matrix C as
 \begin{align*}C = \begin{cases}C_x & \text{in the one-dimensional case}, \\ C_z \otimes C_x & \text{in the two-dimensional case.}\end{cases}\end{align*}
\begin{align*}C = \begin{cases}C_x & \text{in the one-dimensional case}, \\ C_z \otimes C_x & \text{in the two-dimensional case.}\end{cases}\end{align*} 
Then C has the following useful properties:
- 
• For any full-domain vector  $\mathbf{y}+$
, $\mathbf{y}+$
, $C\mathbf{y}+$
 returns the observed values only. $C\mathbf{y}+$
 returns the observed values only.
- 
•  $C^T \mathbf{y}$
 embeds the observed values into a larger zero-padded vector. $C^T \mathbf{y}$
 embeds the observed values into a larger zero-padded vector.
- 
•  $CC^T = I_n$
 and $CC^T = I_n$
 and $C^TC$
 is a $C^TC$
 is a $2 \times 2$
 block matrix with an identity matrix block and zeros everywhere else. $2 \times 2$
 block matrix with an identity matrix block and zeros everywhere else.
The extrapolated WH smoothing is defined as the solution to the following extended problem:
 \begin{equation}{\hat{\mathbf{y}}_+ = \underset{\boldsymbol{\theta}_+}{\text{argmin}}\left\lbrace (\mathbf{y}_+ - \boldsymbol{\theta}_+)^TW_+(\mathbf{y}_+ - \boldsymbol{\theta}_+) + \boldsymbol{\theta}_+^TP_+\boldsymbol{\theta}_+\right\rbrace}\end{equation}
\begin{equation}{\hat{\mathbf{y}}_+ = \underset{\boldsymbol{\theta}_+}{\text{argmin}}\left\lbrace (\mathbf{y}_+ - \boldsymbol{\theta}_+)^TW_+(\mathbf{y}_+ - \boldsymbol{\theta}_+) + \boldsymbol{\theta}_+^TP_+\boldsymbol{\theta}_+\right\rbrace}\end{equation}
where:
- 
•  $\mathbf{y}_+ = C^{T}\mathbf{y}$
 is the extended data vector (zeros for unobserved points), $\mathbf{y}_+ = C^{T}\mathbf{y}$
 is the extended data vector (zeros for unobserved points),
- 
•  $W_+ = C^{T}WC$
 is the extended weight matrix (zeros for unobserved points), $W_+ = C^{T}WC$
 is the extended weight matrix (zeros for unobserved points),
- 
•  $P_+$
 is the penalization matrix over the extended grid, defined as $P_+$
 is the penalization matrix over the extended grid, defined as \begin{align*}P_+ = \begin{cases}\lambda D_{n_{+},q}^{T}D_{n_{+},q} & \text{in the one-dimensional case,} \\ \lambda_x I_{z+} \otimes D_{n_{x+},q_x}^{T}D_{n_{x+},q_x} + \lambda_z D_{n_{z+},q_z}^{T}D_{n_{z+},q_z} \otimes I_{x+} & \text{in the two-dimensional case.}\end{cases}\end{align*} \begin{align*}P_+ = \begin{cases}\lambda D_{n_{+},q}^{T}D_{n_{+},q} & \text{in the one-dimensional case,} \\ \lambda_x I_{z+} \otimes D_{n_{x+},q_x}^{T}D_{n_{x+},q_x} + \lambda_z D_{n_{z+},q_z}^{T}D_{n_{z+},q_z} \otimes I_{x+} & \text{in the two-dimensional case.}\end{cases}\end{align*}
 Importantly, the smoothing parameters 
 $\lambda$
,
$\lambda$
, 
 $\lambda_x$
, and
$\lambda_x$
, and 
 $\lambda_z$
 must remain fixed during extrapolation – they are inherited from the original fit and no new information is introduced.
$\lambda_z$
 must remain fixed during extrapolation – they are inherited from the original fit and no new information is introduced.
The fidelity term in Equation (7.1) simplifies to:
 \begin{align*}(\mathbf{y}_+ - \boldsymbol{\theta}_+)^TW_+(\mathbf{y}_+ - \boldsymbol{\theta}_+) = (C^{T}\mathbf{y} - \boldsymbol{\theta}_+)^TC^{T}WC(C^{T}\mathbf{y} - \boldsymbol{\theta}_+) = (\mathbf{y} - \boldsymbol{\theta})^TW(\mathbf{y} - \boldsymbol{\theta})\end{align*}
\begin{align*}(\mathbf{y}_+ - \boldsymbol{\theta}_+)^TW_+(\mathbf{y}_+ - \boldsymbol{\theta}_+) = (C^{T}\mathbf{y} - \boldsymbol{\theta}_+)^TC^{T}WC(C^{T}\mathbf{y} - \boldsymbol{\theta}_+) = (\mathbf{y} - \boldsymbol{\theta})^TW(\mathbf{y} - \boldsymbol{\theta})\end{align*} 
where 
 $\boldsymbol{\theta} = C \boldsymbol{\theta}_+$
. This is the fidelity term from the original fit.
$\boldsymbol{\theta} = C \boldsymbol{\theta}_+$
. This is the fidelity term from the original fit.
 The smoothness criterion, on the other hand, now applies to the entire extended domain, constraining the extrapolated parts of 
 $\hat{\mathbf{y}}_+$
 to remain smooth and consistent with the trend learned from the data.
$\hat{\mathbf{y}}_+$
 to remain smooth and consistent with the trend learned from the data.
 The same extrapolation approach applies directly to generalized WH smoothing, simply by replacing 
 $\mathbf{y}$
 by
$\mathbf{y}$
 by 
 $\mathbf{z}_k$
 and W by
$\mathbf{z}_k$
 and W by 
 $W_k$
, obtained at convergence of the PIRLS algorithm and setting
$W_k$
, obtained at convergence of the PIRLS algorithm and setting 
 $\sigma^2 = 1$
 in the derived credible intervals.
$\sigma^2 = 1$
 in the derived credible intervals.
7.2. Unconstrained solution for the 1D case
 The solution to the extrapolation problem in Equation 7.1 can be obtained directly, as in Section 1.1, by taking derivatives with respect to 
 $\boldsymbol{\theta}_+$
 and setting them to zero. This yields the closed-form solution:
$\boldsymbol{\theta}_+$
 and setting them to zero. This yields the closed-form solution:
 \begin{align*}\hat{\mathbf{y}}_+ = (W_+ + P_+)^{- 1}W_+\mathbf{y}_+ \quad\text{where}\quad \mathbf{y}_+ = C^{T}\mathbf{y} \quad\text{and}\quad W_+ = C^{T}WC.\end{align*}
\begin{align*}\hat{\mathbf{y}}_+ = (W_+ + P_+)^{- 1}W_+\mathbf{y}_+ \quad\text{where}\quad \mathbf{y}_+ = C^{T}\mathbf{y} \quad\text{and}\quad W_+ = C^{T}WC.\end{align*} 
 Assuming a Bayesian model where 
 $\mathbf{y}+ \mid \boldsymbol{\theta}+ \sim \mathcal{N}(\boldsymbol{\theta}+, \sigma^2 W+^{-1})$
 and
$\mathbf{y}+ \mid \boldsymbol{\theta}+ \sim \mathcal{N}(\boldsymbol{\theta}+, \sigma^2 W+^{-1})$
 and 
 $\boldsymbol{\theta}+ \sim \mathcal{N}(0, \sigma^2 P+^{-1})$
, we obtain, as in Section 2, the following credible interval:
$\boldsymbol{\theta}+ \sim \mathcal{N}(0, \sigma^2 P+^{-1})$
, we obtain, as in Section 2, the following credible interval:
 \begin{align*}\mathbb{E}(\mathbf{y}_+) \mid \mathbf{y}_+ \in \left[(W_+ + P_+)^{- 1}W_+\mathbf{y}_+ \pm \Phi^{- 1}\left(1 -\alpha / 2\right)\sqrt{\sigma^2\textbf{diag}\left\lbrace(W_+ + P_+)^{-1}\right\rbrace}\right].\end{align*}
\begin{align*}\mathbb{E}(\mathbf{y}_+) \mid \mathbf{y}_+ \in \left[(W_+ + P_+)^{- 1}W_+\mathbf{y}_+ \pm \Phi^{- 1}\left(1 -\alpha / 2\right)\sqrt{\sigma^2\textbf{diag}\left\lbrace(W_+ + P_+)^{-1}\right\rbrace}\right].\end{align*} 
 To get a better understanding about how the variance-covariance matrix 
 $V_+ = (W_+ + P_+)^{- 1}$
 for the unconstrained extrapolation problem of Equation (7.1) is related to the variance-covariance matrix
$V_+ = (W_+ + P_+)^{- 1}$
 for the unconstrained extrapolation problem of Equation (7.1) is related to the variance-covariance matrix 
 $V = (W + P_\lambda)^{- 1}$
 of the original smoothing problem, introduce matrices
$V = (W + P_\lambda)^{- 1}$
 of the original smoothing problem, introduce matrices 
 $\overline{C}_j$
 (for
$\overline{C}_j$
 (for 
 $j \in {x, z}$
), which selects the rows in the extrapolated domain that are not part of the original data and define:
$j \in {x, z}$
), which selects the rows in the extrapolated domain that are not part of the original data and define:
 \begin{align*}\quad \overline{C} = \begin{cases}\overline{C}_x & \text{in the one-dimensional case}, \\ \overline{C}_z \otimes \overline{C}_x & \text{in the two-dimensional case},\end{cases} \quad\text{and}\quad Q = \begin{bmatrix} C \\ \overline{C} \end{bmatrix}.\end{align*}
\begin{align*}\quad \overline{C} = \begin{cases}\overline{C}_x & \text{in the one-dimensional case}, \\ \overline{C}_z \otimes \overline{C}_x & \text{in the two-dimensional case},\end{cases} \quad\text{and}\quad Q = \begin{bmatrix} C \\ \overline{C} \end{bmatrix}.\end{align*} 
With this definition, Q is a permutation matrix moving observed positions to the top.
 In the unidimensional case, the extended difference matrix 
 $D_{n_+,q}$
 takes the block-wise form:
$D_{n_+,q}$
 takes the block-wise form:
 \begin{align*}D_{n_+,q} = \begin{bmatrix}D_{2-} &\quad D_{1-} &\quad 0 \\ 0 &\quad D_{n,q} &\quad 0 \\ 0 &\quad D_{1+} &\quad D_{2+}\\ \end{bmatrix} = Q^T\begin{bmatrix}D_{n,q} &\quad 0 \\ D_1 &\quad D_2 \\ \end{bmatrix}Q \quad\text{where} \; D_1 = \begin{bmatrix}D_{1-} \\ D_{1+} \\ \end{bmatrix} \;\text{and}\; D_2 = \begin{bmatrix}D_{2-} &\quad 0 \\ 0 &\quad D_{2+} \\ \end{bmatrix}.\end{align*}
\begin{align*}D_{n_+,q} = \begin{bmatrix}D_{2-} &\quad D_{1-} &\quad 0 \\ 0 &\quad D_{n,q} &\quad 0 \\ 0 &\quad D_{1+} &\quad D_{2+}\\ \end{bmatrix} = Q^T\begin{bmatrix}D_{n,q} &\quad 0 \\ D_1 &\quad D_2 \\ \end{bmatrix}Q \quad\text{where} \; D_1 = \begin{bmatrix}D_{1-} \\ D_{1+} \\ \end{bmatrix} \;\text{and}\; D_2 = \begin{bmatrix}D_{2-} &\quad 0 \\ 0 &\quad D_{2+} \\ \end{bmatrix}.\end{align*} 
The extended weight and penalization matrices may be rewritten:
 \begin{align*}W_+ = Q^T\begin{bmatrix}W &\quad 0 \\ 0 &\quad 0 \\ \end{bmatrix}Q \quad\text{and}\quad P_+ = D_{n_+,q}^TD_{n_+,q} = \lambda Q^T\begin{bmatrix}P_\lambda + P_+^{11} &\quad P_+^{12} \\ P_+^{21} &\quad P_+^{22} \\ \end{bmatrix}Q\end{align*}
\begin{align*}W_+ = Q^T\begin{bmatrix}W &\quad 0 \\ 0 &\quad 0 \\ \end{bmatrix}Q \quad\text{and}\quad P_+ = D_{n_+,q}^TD_{n_+,q} = \lambda Q^T\begin{bmatrix}P_\lambda + P_+^{11} &\quad P_+^{12} \\ P_+^{21} &\quad P_+^{22} \\ \end{bmatrix}Q\end{align*} 
where 
 $P_+^{ij} = \lambda D_i^TD_j$
, for
$P_+^{ij} = \lambda D_i^TD_j$
, for 
 $i,j\in\{1,2\}$
.
$i,j\in\{1,2\}$
.
This block structure allows us to apply standard results for partitioned matrix inverses to derive:
 \begin{align*}V_+ = Q^T\begin{bmatrix}V_+^{11} &\quad V_+^{12}\\ V_+^{21} &\quad V_+^{22} \\ \end{bmatrix}Q = Q^T\begin{bmatrix}V_+^{11} &\quad - V_+^{11}P_+^{12}(P_+^{22})^{- 1} \\ - (P_+^{22})^{- 1}P_+^{21}V_+^{11} &\quad (P_+^{22})^{- 1}P_+^{21}V_+^{11}P_+^{12}(P_+^{22})^{- 1} + (P_+^{22})^{- 1}\end{bmatrix}Q\end{align*}
\begin{align*}V_+ = Q^T\begin{bmatrix}V_+^{11} &\quad V_+^{12}\\ V_+^{21} &\quad V_+^{22} \\ \end{bmatrix}Q = Q^T\begin{bmatrix}V_+^{11} &\quad - V_+^{11}P_+^{12}(P_+^{22})^{- 1} \\ - (P_+^{22})^{- 1}P_+^{21}V_+^{11} &\quad (P_+^{22})^{- 1}P_+^{21}V_+^{11}P_+^{12}(P_+^{22})^{- 1} + (P_+^{22})^{- 1}\end{bmatrix}Q\end{align*} 
with 
 $V_+^{11} = [W + P_\lambda + P_+^{11} - P_+^{12}(P_+^{22})^{- 1}P_+^{21}]^{- 1}$
.
$V_+^{11} = [W + P_\lambda + P_+^{11} - P_+^{12}(P_+^{22})^{- 1}P_+^{21}]^{- 1}$
.
From the above, we retrieve:
 \begin{align*}C\hat{\mathbf{y}}_+ = CV_+W_+y_+ = CQ^TV_+QC^TWy = V_+^{11}Wy.\end{align*}
\begin{align*}C\hat{\mathbf{y}}_+ = CV_+W_+y_+ = CQ^TV_+QC^TWy = V_+^{11}Wy.\end{align*} 
 This coincides with the original fit 
 $\hat{\mathbf{y}}$
 only if
$\hat{\mathbf{y}}$
 only if 
 $V_+^{11} = V$
. In general, this equality does not hold, since the extrapolation solution minimizes the total smoothness of the extended vector, not just of the observed part.
$V_+^{11} = V$
. In general, this equality does not hold, since the extrapolation solution minimizes the total smoothness of the extended vector, not just of the observed part.
 In 
 $V_+^{22}$
, we identify:
$V_+^{22}$
, we identify:
- 
• a propagation term:  $(P_+^{22})^{-1} P_+^{21} V_+^{11} P_+^{12} (P_+^{22})^{-1}$
, capturing the uncertainty transferred from the known part to the extrapolated part; $(P_+^{22})^{-1} P_+^{21} V_+^{11} P_+^{12} (P_+^{22})^{-1}$
, capturing the uncertainty transferred from the known part to the extrapolated part;
- 
• an innovation error term:  $(P_+^{22})^{- 1}$
 associated with the prior on the extrapolated coefficients $(P_+^{22})^{- 1}$
 associated with the prior on the extrapolated coefficients $\overline{C} \hat{\mathbf{y}}_+$
. $\overline{C} \hat{\mathbf{y}}_+$
.
 In the one-dimensional case, 
 $D_2$
 is block-diagonal with invertible triangular blocks, so:
$D_2$
 is block-diagonal with invertible triangular blocks, so:
 \begin{align*}P_+^{11} - P_+^{12}(P_+^{22})^{- 1}P_+^{21} = D_1^TD_1 - D_1^TD_2(D_2^TD_2)^{- 1}D2^TD_1 = 0\end{align*}
\begin{align*}P_+^{11} - P_+^{12}(P_+^{22})^{- 1}P_+^{21} = D_1^TD_1 - D_1^TD_2(D_2^TD_2)^{- 1}D2^TD_1 = 0\end{align*} 
which means that 
 $V_+^{11} = (W + P_\lambda)^{- 1} = V$
. This confirms the result from Carballo et al. (Reference Carballo, Durban and Lee2021), namely that with a difference-based penalty, a perfectly smooth extrapolation that leaves the original fit unchanged can always be constructed in the one-dimensional case.
$V_+^{11} = (W + P_\lambda)^{- 1} = V$
. This confirms the result from Carballo et al. (Reference Carballo, Durban and Lee2021), namely that with a difference-based penalty, a perfectly smooth extrapolation that leaves the original fit unchanged can always be constructed in the one-dimensional case.
 This behaviour is illustrated in Figure 8, which shows the extrapolated fit (with 
 $q = 2$
) obtained from generalized WH smoothing applied to the annuity portfolio used previously. The extrapolation follows a straight line – the polynomial of degree
$q = 2$
) obtained from generalized WH smoothing applied to the annuity portfolio used previously. The extrapolation follows a straight line – the polynomial of degree 
 $q-1 = 1$
 – and joins smoothly with the original curve.
$q-1 = 1$
 – and joins smoothly with the original curve.

Figure 8. Extrapolation of one-dimensional WH smoothing. The smoother is extrapolated on both sides of the initial observation range following a polynomial of degree 
 $q-1$
 (in this case a straight line as
$q-1$
 (in this case a straight line as 
 $q=2$
).
$q=2$
).
7.3. Constrained solution for the 2D case
 In the two-dimensional case, while the extended penalization matrix 
 $P_+$
 still takes the same structure as previously described, the expressions of its block components
$P_+$
 still takes the same structure as previously described, the expressions of its block components 
 $P_+^{11}$
,
$P_+^{11}$
, 
 $P_+^{12}$
,
$P_+^{12}$
, 
 $P_+^{21}$
, and
$P_+^{21}$
, and 
 $P_+^{22}$
 are more complex. In particular, we no longer have the simplification
$P_+^{22}$
 are more complex. In particular, we no longer have the simplification 
 $P_+^{11} - P_+^{12}(P_+^{22})^{-1}P_+^{21} = 0$
, therefore
$P_+^{11} - P_+^{12}(P_+^{22})^{-1}P_+^{21} = 0$
, therefore 
 $V_+^{11} \ne V$
 and
$V_+^{11} \ne V$
 and 
 $C\hat{\mathbf{y}}_+ \ne \hat{\mathbf{y}}$
. Solving the unconstrained extrapolation problem thus leads to a modification of the estimated coefficients for the observed data positions, as demonstrated by Carballo et al. (Reference Carballo, Durban and Lee2021).
$C\hat{\mathbf{y}}_+ \ne \hat{\mathbf{y}}$
. Solving the unconstrained extrapolation problem thus leads to a modification of the estimated coefficients for the observed data positions, as demonstrated by Carballo et al. (Reference Carballo, Durban and Lee2021).
This difference arises because, unlike the one-dimensional case, the smoothness criterion in two dimensions penalizes both rows and columns simultaneously, making it impossible to extrapolate without increasing the penalization. Since no new data are introduced in the extrapolated region, the smoothness criterion weighs more heavily in the optimization, prompting adjustments to the originally fitted values in order to produce a globally smoother estimate.
 To address this, we follow the approach proposed by Carballo et al. (Reference Carballo, Durban and Lee2021) and formulate a constrained optimization problem that enforces preservation of the original fitted values in the smoothing region. This is done by introducing a Lagrange multiplier 
 $\boldsymbol{\omega}$
 and solving the following constrained problem:
$\boldsymbol{\omega}$
 and solving the following constrained problem:
 \begin{align*}(\hat{\mathbf{y}}_+^\ast, \hat{\boldsymbol{\omega}}) = \underset{\boldsymbol{\theta}_+^\ast, \boldsymbol{\omega}}{\text{argmin}}\left\lbrace (\mathbf{y}_+ - \boldsymbol{\theta}_+^\ast)^TW_+(\mathbf{y}_+ - \boldsymbol{\theta}_+^\ast) + \boldsymbol{\theta}_+^{\ast T}P_{+}\boldsymbol{\theta}_+^\ast + 2 \boldsymbol{\omega}^T(C\boldsymbol{\theta}_+^\ast - \hat{\mathbf{y}})\right\rbrace\!.\end{align*}
\begin{align*}(\hat{\mathbf{y}}_+^\ast, \hat{\boldsymbol{\omega}}) = \underset{\boldsymbol{\theta}_+^\ast, \boldsymbol{\omega}}{\text{argmin}}\left\lbrace (\mathbf{y}_+ - \boldsymbol{\theta}_+^\ast)^TW_+(\mathbf{y}_+ - \boldsymbol{\theta}_+^\ast) + \boldsymbol{\theta}_+^{\ast T}P_{+}\boldsymbol{\theta}_+^\ast + 2 \boldsymbol{\omega}^T(C\boldsymbol{\theta}_+^\ast - \hat{\mathbf{y}})\right\rbrace\!.\end{align*} 
 This optimization admits a closed-form solution for the constrained extrapolated estimator 
 $\hat{\mathbf{y}}_+^\ast$
 as a linear transformation of
$\hat{\mathbf{y}}_+^\ast$
 as a linear transformation of 
 $\hat{\mathbf{y}}$
. The derivation details are provided in Section F of the Online Supplementary Materials. The final form is
$\hat{\mathbf{y}}$
. The derivation details are provided in Section F of the Online Supplementary Materials. The final form is
 \begin{align*}\hat{\mathbf{y}}^\ast_+ = Q^T\begin{bmatrix}I \\ - (P_+^{22})^{- 1}P_+^{21}\end{bmatrix}\hat{\mathbf{y}}\end{align*}
\begin{align*}\hat{\mathbf{y}}^\ast_+ = Q^T\begin{bmatrix}I \\ - (P_+^{22})^{- 1}P_+^{21}\end{bmatrix}\hat{\mathbf{y}}\end{align*} 
and the associated variance-covariance matrix is
 \begin{align*}V_+^{\ast} = Q^T\begin{bmatrix}V &\quad - V P_+^{12}(P_+^{22})^{- 1} \\ - (P_+^{22})^{- 1}P_+^{21}V &\quad (P_+^{22})^{- 1}P_+^{21}V P_+^{12}(P_+^{22})^{- 1} + (P_+^{22})^{- 1} \end{bmatrix}Q.\end{align*}
\begin{align*}V_+^{\ast} = Q^T\begin{bmatrix}V &\quad - V P_+^{12}(P_+^{22})^{- 1} \\ - (P_+^{22})^{- 1}P_+^{21}V &\quad (P_+^{22})^{- 1}P_+^{21}V P_+^{12}(P_+^{22})^{- 1} + (P_+^{22})^{- 1} \end{bmatrix}Q.\end{align*} 
 This formulation differs from the variance matrix of the unconstrained solution. Indeed, it enforces the constraint that the initial coefficients remain unchanged, as reflected by the presence of V (the original variance matrix) instead of 
 $V_+^{11}$
. The corresponding credible intervals are
$V_+^{11}$
. The corresponding credible intervals are
 \begin{align*}\mathbb{E}(\mathbf{y}_+) \mid \mathbf{y}_+ \in \left[\hat{\mathbf{y}}^\ast_+ \pm \Phi^{- 1}\left(1 -\alpha / 2\right)\sqrt{\sigma^2\textbf{diag}(V_+^{\ast})}\right].\end{align*}
\begin{align*}\mathbb{E}(\mathbf{y}_+) \mid \mathbf{y}_+ \in \left[\hat{\mathbf{y}}^\ast_+ \pm \Phi^{- 1}\left(1 -\alpha / 2\right)\sqrt{\sigma^2\textbf{diag}(V_+^{\ast})}\right].\end{align*} 
The following figures illustrate the impact of the constrained extrapolation procedure discussed above, using the LTC portfolio of 100,000 policyholders as a case study.
- 
• Figure 9, left (mortality rates): this panel shows the estimated mortality rates obtained after applying the constrained extrapolation procedure to the two-dimensional WH smoothing model. The dotted lines indicate the boundaries of the original smoothing region. Visually, the transition from the smoothing region to the extrapolated area is seamless – the extrapolated surface naturally extends the smoothed mortality rates while respecting the original fitted values within the data range.  Figure 9. Constrained extrapolation of 2D WH smoothing. The contour lines of mortality rates and the associated standard deviation are depicted. The dotted lines delimit the boundaries of the initial smoothing region. 
- 
• Figure 9, right (standard deviation): this panel displays the posterior standard deviation (or credible interval width) associated with the extrapolated estimates. It reflects both the uncertainty from the original smoothing and the innovation error introduced in the extrapolated region. As expected, the standard deviation increases as we move away from the observed region, illustrating growing uncertainty about farther values. 
- 
• Figure 10 (ratio of mortality rates): this heatmap shows the pointwise ratio between the unconstrained and constrained extrapolation of the mortality rates. A value above 1 indicates that the unconstrained version overshoots the constrained one at that location, while values below 1 indicate underestimation. We observe that discrepancies exist not only in the extrapolated region but also within the original data region – confirming that the unconstrained approach distorts the original estimates in order to achieve overall smoothness.  Figure 10. Ratio of mortality rates resulting from the extrapolation of 2D WH smoothing. The numerator corresponds to the unconstrained extrapolation and the denominator to the constrained extrapolation presented in Figure 9. 
- 
• Figure 11 (ratio of standard deviations): this final figure includes two panels comparing uncertainty estimates. - 
• Left panel: ratio of standard deviation from the unconstrained extrapolation over that from the constrained extrapolation (including innovation error). The unconstrained version underestimate the actual uncertainty not only in the extrapolated region but also within the original data region, again reflecting the adjustments made to the original estimates in order to achieve overall smoothness. 
- 
• Right panel: ratio of standard deviation from the constrained extrapolation without innovation error over the fully constrained version with innovation error. This illustrates the contribution of the innovation error to the total uncertainty – it is substantial and should not be neglected. 
 
- 

Figure 11. Ratio of standard deviation of log-mortality rates from the three extrapolation methods. Left: unconstrained versus constrained with innovation error. Right: constrained without versus with innovation error. In both, the denominator is the fully constrained method of Figure 9.
8. Discussion
Choosing the order of the penalization
Throughout this work, we have assumed second-order difference matrices for penalization. This choice is both standard and meaningful: from a Bayesian perspective, it corresponds to a prior belief that the log-transformed quantity of interest evolves linearly, which implies exponential behaviour on the original scale – consistent with actuarial models such as Gompertz.
The difference order directly shapes both the estimated trend and its extrapolation: higher-order penalties allow for more flexibility, but may induce unstable or erratic behaviour outside the data range. While Whittaker originally used third-order differences and higher orders can marginally improve model fit according to information criteria such as AIC, second-order penalties typically offer a robust compromise between smoothness, interpretability, and extrapolation stability. A detailed evaluation is provided in Section G of the Online Supplementary Materials.
Summary of contributions
This paper revisits the classical WH smoothing approach through the lens of modern statistical modelling. Each section brought forward a key practical insight:
- 
• Section 2 established that WH smoothing is more than an empirical method. It has a firm Bayesian foundation. Under Gaussian assumptions, credibility intervals may be derived and used as practical substitutes to confidence intervals. 
- 
• Section 3 clarified how to construct observation and weight vectors in survival analysis models: using log-crude rates as observations and event counts as weights yields a sound statistical formulation. 
- 
• Section 4 introduced generalized WH smoothing, in which the penalization is applied directly to the likelihood rather than a normal approximation. This refined method yields more accurate results, especially in situations where the available data volume is limited, but the number of combinations is high, such as in the two-dimensional case. 
- 
• Section 5 advocated for smoothing parameter selection via marginal likelihood (or its Laplace approximation, LAML), offering a principled and robust alternative to heuristic criteria like AIC or GCV. 
- 
• Section 6 presented two computational improvements: one exploits the banded structure of WH matrices to reduce runtime by up to a factor of 25; the other relies on reduced-rank smoothing basis, leading to even faster estimation (up to 250  $\times$
 speed-up) with limited loss in accuracy, slightly outperforming P-splines. $\times$
 speed-up) with limited loss in accuracy, slightly outperforming P-splines.
- 
• Section 7 addressed extrapolation: while WH smoothing naturally extends beyond the data range, constraints are needed in two dimensions to preserve the original fit. We proposed a method to extrapolate while accounting for both structural uncertainty and innovation error and provided credible intervals accordingly. 
 All these techniques are available in the WH package for the statistical software 
 $\texttt{R}$
 (R Core Team, Reference Core Team2025), including automated smoothing parameter selection and constrained extrapolation with uncertainty quantification.
$\texttt{R}$
 (R Core Team, Reference Core Team2025), including automated smoothing parameter selection and constrained extrapolation with uncertainty quantification.
Limitations and outlook
Despite its strong practical appeal, WH smoothing has limitations that suggest several avenues for future work:
- 
• Regular spacing requirement: WH smoothing assumes evenly spaced observations, which aligns well with standard life insurance grids (age and/or duration). However, this is less suitable when events are concentrated in a short period, such as in disability or long-term care claims. One solution is to combine finer discretization in early durations with methods like P-splines that accommodate irregular grids. Alternatively, and adaptive WH smoothing procedure (based on the ideas in Ruppert and Carroll, Reference Ruppert and Carroll2000; Krivobokova et al., Reference Krivobokova, Crainiceanu and Kauermann2008) could offer a way to retain regular spacing while varying the smoothness locally. 
- 
• Limited covariate handling: The basic WH framework does not accommodate additional explanatory variables (e.g., gender or policy features). However, WH smoothing can be extended using ideas from smoothing spline ANOVA and hierarchical models (Lee and Durban, Reference Lee and Durban2011; Gu, Reference Gu2013), allowing for structured random effects and flexible interactions. This opens the door to richer, more personalized experience modelling while preserving interpretability. 
In sum, revisiting WH smoothing through a modern lens reinforces its theoretical foundations and offers practitioners fast, transparent, and adaptable tools for experience modelling. It remains a compelling alternative to more recent – yet often more opaque – techniques when working with evenly spaced discrete data.
Supplementary material
The supplementary material for this article can be found at https://doi.org/10.1017/asb.2025.10061.
Acknowledgments
The author thanks the two anonymous referees for their valuable comments, which helped improve the quality and clarity of the paper.
Data availability statement
The synthetic data used to conduct the performance comparisons in this study are available from the author upon request.
Competing interests
The author declares no competing interests.
 
  
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 


 
















