
What KAN mortality say: smooth and interpretable mortality modeling using Kolmogorov−Arnold networks

Published online by Cambridge University Press:  28 November 2025

Lianzeng Zhang
Affiliation:
Nankai-Taikang College of Insurance and Actuarial Science Nankai University Tianjin 300350, P. R. China
Yuan Zhuang*
Affiliation:
School of Risk and Actuarial Studies UNSW Sydney Sydney, Australia Department of Actuarial Science, School of Finance Nankai University Tianjin 300350, P. R. China
Corresponding author: Yuan Zhuang; Email: yuan.zhuang4@unsw.edu.au

Abstract

In machine learning-based mortality models, interpretation methods are well established, and they can reveal structures resembling the age or time effects in traditional mortality models. However, in the reverse direction, using such traditional components to guide the initialization of a neural network remains highly challenging due to information loss during model interpretation. This study addresses this gap by exploring how components from pre-fitted traditional mortality models can be used to initialize neural networks, enabling structural information to be incorporated into a deep learning framework. We introduce Kolmogorov–Arnold Networks (KAN) and first construct two shallow models, KAN[2,1] and ARIMAKAN, to examine their applicability to mortality modeling. We then extend the Combined Actuarial Neural Network (CANN) into a KAN-based Actuarial Neural Network (KANN), in which classical model components calibrated via generalized nonlinear models or generalized additive models are naturally used for initialization. Three KANN variants, namely KANN[2,1], KANNLC, and KANNAPC, are proposed. In these models, neural networks assist in improving the accuracy of traditional models and help refine the original parameter estimates. All KANN-based models can also produce smooth mortality curves as well as smooth age, period, and cohort effects through simple regularization. Experiments on 34 populations demonstrate that KAN-based approaches achieve stable performance while balancing interpretability, smoothness, and predictive accuracy.

Information

Type
Research Article
Creative Commons
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted re-use, distribution and reproduction, provided the original article is properly cited.
Copyright
© The Author(s), 2025. Published by Cambridge University Press on behalf of The International Actuarial Association

1. Introduction

Mortality has significantly declined since the mid-20th century. In 1950, the global average life expectancy was slightly above 45 years, rising to around 75 years by 2020. According to the United Nations, by 2100, the global average life expectancy (under medium scenario) is projected to reach approximately 85 years, based on probabilistic projections using a double-logistic mortality improvement model with Bayesian estimation (World Population Prospects 2022: Methodology Report, 2022; World Population Prospects 2022: Summary of Results, 2022). Dynamic changes in mortality rates necessitate continuous evolution in life insurance products. From an actuarial perspective, it is essential to incorporate future mortality improvement into pricing and reserving strategies of life insurance companies to effectively manage longevity risk (Pitacco et al., Reference Pitacco, Denuit, Haberman and Olivieri2009). Therefore, accurate forecasting of future mortality rates is of paramount importance.

In the early 1990s, Lee and Carter (Reference Lee and Carter1992) introduced the Lee–Carter (LC) model, a dynamic approach to mortality forecasting that has proven highly successful and remains a widely used baseline today. The model estimates age and time factors using singular value decomposition and projects the time factor with time series analysis. Brouhns et al. (Reference Brouhns, Denuit and Vermunt2002) later assumed a Poisson distribution for death counts and estimated the LC parameters by likelihood maximization. Over the past twenty years, researchers have proposed numerous improved and extended mortality models. For example, Renshaw and Haberman (Reference Renshaw and Haberman2006) introduced cohort effects, while Cairns et al. (Reference Cairns, Blake and Dowd2006) considered the impact of mortality at higher ages, developing the CBD model. Cairns et al. (Reference Cairns, Blake, Dowd, Coughlan, Epstein, Ong and Balevich2009) then extended the CBD model into several variants with higher-order terms, and Plat (Reference Plat2009) combined the factors of the CBD and LC models to form the Plat model, which captures cohort effects and distinct mortality improvement trends for younger and older age groups. These extensions enhance predictive accuracy and applicability by introducing more variables. For a thorough and detailed description of these models, we refer to Basellini et al. (Reference Basellini, Camarda and Booth2023).

Over the past decade, the actuarial field has increasingly adopted machine learning methods to address a multitude of emerging challenges, including claims reserving, non-life pricing, telematics, and mortality modeling (Richman, Reference Richman2021a,b; Embrechts and Wüthrich, Reference Embrechts and Wüthrich2022). The application of machine learning in mortality models dates back to Deprez et al. (Reference Deprez, Shevchenko and Wüthrich2017), who pioneered the use of machine learning to supplement traditional models. In the early stages of applying machine learning methods to mortality, most studies focused on forecasting time-dependent components in LC and CBD models (Nigri et al., Reference Nigri, Levantesi, Marino, Scognamiglio and Perla2019; Odhiambo et al., Reference Odhiambo, Weke and Ngare2021; Lindholm and Palmborg, Reference Lindholm and Palmborg2022) or on enhancing the predictive performance of classic stochastic mortality models (Deprez et al., Reference Deprez, Shevchenko and Wüthrich2017; Levantesi and Pizzorusso, Reference Levantesi and Pizzorusso2019). Nowadays, scholars tend to model mortality directly, rather than merely using machine learning methods as auxiliaries to classic models. Specialized architectures that once excelled in natural language processing and computer vision tasks, such as Recurrent Neural Networks (RNNs) and Convolutional Neural Networks (CNNs), have been widely adopted for neighborhood feature extraction (Wang et al., Reference Wang, Zhang and Zhu2021; Qiao et al., Reference Qiao, Wang and Zhu2024) and time series modeling of mortality (Richman and Wüthrich, Reference Richman and Wüthrich2019; Perla et al., Reference Perla, Richman, Scognamiglio and Wüthrich2021). These methods have achieved high accuracy in forecasting future mortality. For a brief overview of the latest developments in this area, interested readers are referred to Zheng et al. (Reference Zheng, Wang, Zhu and Xue2025).

With regulators and internal control sectors emphasizing the need for model transparency, improving model interpretability becomes a necessary step (Owens et al., Reference Owens, Sheehan, Mullins, Cunneen, Ressel and Castignani2022). Although machine learning models are often seen as black boxes, many studies aim to analyze results from specialized mortality models to better understand the underlying decision-making processes. Some previous attempts include

  • Partial dependence plots (PDP). PDPs visualize the marginal effect of selected features on model predictions. Bjerre (Reference Bjerre2022) used ensemble tree-based models to model differenced log-mortality rates and explained the model behavior with PDPs. Notably, interpretability tools like PDP and SHAP values (Lipovetsky and Conklin, Reference Lipovetsky and Conklin2001) can be applied to a wide range of models, including neural networks.

  • Autoencoders, a special type of neural network that, once trained, can approximately replicate its inputs at its outputs. The outputs of the hidden layers of an autoencoder can be extracted and compared with the traditional LC model (Hainaut, Reference Hainaut2018; Miyata and Matsuyama, Reference Miyata and Matsuyama2022).

  • Utilizing neural networks for interval estimation. Some scholars have attempted to combine neural networks with statistical theory to gain deeper insights, as seen in Schnürch and Korn (Reference Schnürch and Korn2022), Marino et al. (Reference Marino, Levantesi and Nigri2023).

  • Entity embeddings. This method transforms input variables in mortality models (such as age, calendar year, gender, and region) into new representations. Researchers have also attempted to reduce the dimensionality of the new representations of variables to achieve interpretability, as seen in Vincelli (Reference Vincelli2019), Richman and Wüthrich (Reference Richman and Wüthrich2021).

  • Actuarial networks, including Combined Actuarial Neural Network (Wüthrich and Merz, Reference Wüthrich and Merz2019, CANN), Combined Actuarial Explainable Neural Network (Richman, Reference Richman2022, CAXNN), and LocalGLMnet (Richman and Wüthrich, Reference Richman and Wüthrich2023). Linear components of CANN and CAXNN could be directly expressed in familiar terms and could serve as a basis for life table construction (Richman, Reference Richman2022), while attention coefficients in LocalGLMnet can be analyzed to observe lag effects (Perla et al., Reference Perla, Richman, Scognamiglio and Wüthrich2024).

Although neural-network approaches to mortality modeling have progressed rapidly, their architectures have become increasingly complex. Most deep models must be trained from scratch, with no access to prior structural knowledge of mortality, and offer limited flexibility to adjust the smoothness of the fitted mortality curves. To close these gaps, we ask two questions:

  1. (1) Given fitted parameters from classical mortality models – such as $a_x$ , $b_x$ , and $k_t$ in the Lee–Carter model – can we initialize a deep network with a similar decision process using these components, thereby capturing mortality patterns more efficiently and potentially improving predictive performance? In essence, we aim to develop a deep model that is not only interpretable but also capable of learning from traditional models through pretraining. Such initialization can substantially reduce both training time and model complexity.

    Learning from the structure of classical models is often more challenging than merely providing an interpretation consistent with them. This is because many interpretations are essentially condensed representations of model outputs or intermediate results. Once information is lost in the interpretation process, it becomes difficult to recover the original parameters and structure from the final explanation. Consequently, few models have succeeded in this direction, with CANN (Wüthrich and Merz, Reference Wüthrich and Merz2019) being a notable exception. However, although CANN has demonstrated effective knowledge transfer in the context of generalized linear models (GLMs), many classical mortality models are calibrated by generalized nonlinear models (GNMs) or fitted using generalized additive models (GAMs) for smoothness. These settings fall beyond the natural scope of CANN, underscoring the need for new methods that can facilitate knowledge transfer from classical mortality models to deep learning frameworks in a more general and flexible manner.

  2. (2) Is it possible to directly control the neural networks to ensure smoother mortality forecasts? Currently, some neural network-based mortality studies produce forecasts that appear smoother than those of traditional models (Scognamiglio, Reference Scognamiglio2022; Wang et al., Reference Wang, Wen, Xiao and Wang2023). However, these models lack explicit mechanisms for controlling smoothness, leaving the origin of such smoothness opaque and beyond the modeller’s control. While some approaches, such as ICEnet, emphasize monotonicity and smoothness (Richman and Wüthrich, Reference Richman and Wüthrich2024), these models still rely heavily on pseudo samples and post hoc interpretation, suggesting the need to explore alternative efficient methods for achieving smoothness without synthetic data.

To address the above challenges, we need a special type of neural network. First, it should connect naturally with the components of classical mortality models without relying on dimensionality reduction. If a neural network can only match the components of classical mortality models after compressing its parameters into a lower dimension, then it becomes almost impossible to infer the original parameters from these compressed components. As a result, using the components of traditional models to initialize a neural network also becomes problematic. Second, to obtain smooth age, period (and cohort) effects without synthetic samples, the model should interpret itself (intrinsic interpretation): smoothness is enforced during training by regularization with no post hoc tools (e.g., PDP, ICE) involved. Finally, if parts of the model can be built using splines, then established actuarial experience with spline-based smoothing can be transferred directly. Kolmogorov–Arnold Networks (KAN) meet all these requirements.

This paper investigates the use of KAN as single-population mortality models to fit and predict male and female mortality across 17 countries in the Human Mortality Database (HMD). KAN is an architecture that has been studied for years and features trainable activation functions, which in this paper are implemented as a mixture of the Sigmoid Linear Unit (SiLU) and a set of B-splines. To first examine the applicability of KAN, we design two simple structures KAN[2,1] and ARIMAKAN, both of which are shallow models. For deeper architectures, we extend the vanilla Combined Actuarial Neural Network (Wüthrich and Merz, Reference Wüthrich and Merz2019, CANN) into a KAN-based Actuarial Neural Network (KANN). In this framework, classical model components calibrated via generalized nonlinear models or generalized additive models can be directly used for initialization, thereby overcoming the earlier restriction to GLMs. Three concrete implementations, namely KANN[2,1], KANNLC, and KANNAPC, are proposed. We also provide sample interpretation figures and model comparison analysis against GAM, 2D P-Spline, and LSTM.

The remainder of this article is organized as follows. Section 2 provides notations of KAN, introduces KAN[2,1] and ARIMAKAN, and presents the KANN variants and their pretraining strategies. Section 3 describes the training details, including smoothness regularization and performance metrics. Section 4 reports the empirical results based on 34 populations from the Human Mortality Database, with a focus on interpretation, model behaviors, smoothness, and predictive performance. Section 5 concludes this paper. Tutorial codes and an interactive dashboard are also provided; interested readers are referred to the supplementary material and the data availability statement in this paper for access.

2. Models

2.1. Lee–Carter and Age-Period-Cohort (APC) model

Regarding stochastic mortality models, the earliest and most classic is the LC model proposed by Lee and Carter (Reference Lee and Carter1992). Let $\mathcal{X}=\left\{x_0, x_1, \ldots, \omega\right\}$ be the set of possible ages and $\mathcal{T} = \left\{t_0, t_1, \ldots, t_n\right\}$ be the set of years, where $\omega$ is the maximum attainable age. In the LC model, the structure of mortality can be described as

(2.1) \begin{equation} \ln (m_{x, t})=\alpha_x+\beta_x k_t\end{equation}

Here, $m_{x, t}\in[0,1]$ stands for the central mortality rate for individuals aged x at time t, $\alpha_x \in \mathbb{R}$ and $\beta_x \in \mathbb{R}$ represent age factors. Specifically, $\alpha_x$ is the base level of log-mortality rate, representing the average mortality rate over time for age x. $\beta_x$ is the trend of mortality rate changes across different age groups, standing for the sensitivity of the mortality rate change to age x. $k_t$ is the stochastic temporal factor, which is the variation in mortality rates at time t. Since there are multiple sets of $\alpha_x, \beta_x$ , and $k_t$ that satisfy the requirements of the LC model, to obtain a unique set of parameter estimates, it is commonly assumed $\sum_{x} \beta_x=1, \sum_{t} k_t=0$ .
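The SVD-based calibration and the identifiability constraints above can be sketched in a few lines of Python. This is a minimal illustration (the helper name `fit_lee_carter` is ours), not the Poisson likelihood approach of Brouhns et al. (2002):

```python
import numpy as np

def fit_lee_carter(log_m):
    """Calibrate ln(m_{x,t}) = a_x + b_x * k_t by SVD (Lee and Carter, 1992).

    log_m : array of shape (n_ages, n_years) of log central death rates.
    Returns (a_x, b_x, k_t) under the identifiability constraints
    sum_x b_x = 1 and sum_t k_t = 0.
    """
    a_x = log_m.mean(axis=1)              # average log-mortality per age
    centered = log_m - a_x[:, None]       # each row now sums to zero over time
    U, S, Vt = np.linalg.svd(centered, full_matrices=False)
    b_x = U[:, 0]
    k_t = S[0] * Vt[0, :]                 # sums to zero because the rows of
                                          # `centered` do
    scale = b_x.sum()                     # enforce sum_x b_x = 1; the product
    return a_x, b_x / scale, k_t * scale  # b_x * k_t is unchanged by rescaling
```

Rescaling by `scale` fixes the multiplicative indeterminacy between $b_x$ and $k_t$ without altering the fitted surface.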

Renshaw and Haberman (Reference Renshaw and Haberman2006) incorporated cohort effect into the mortality model, with the model taking the following form:

(2.2) \begin{equation} \ln (m_{x, t})=\alpha_x+\beta_x^{(1)} k_t+\beta_x^{(2)} \gamma_{t-x}\end{equation}

It can be observed that compared to the classic LC model, this model (RH) additionally includes a stochastic cohort effect $\gamma_{t-x}$ , which is a function of the birth year $(t-x)$ . Furthermore, Renshaw and Haberman’s data analysis of England and Wales indicated that while the mortality model can be improved by adding a cohort effect, the RH model is not as robust as the LC model. Moreover, the parameter estimates of the model are influenced by the age range of the mortality data.

Currie (Reference Currie2006) simplified the RH model, resulting in the Age-Period-Cohort (APC) model. The core idea is to set $\beta_x^{(1)}$ and $\beta_x^{(2)}$ in (2.2) equal to 1, with the model as follows:

(2.3) \begin{equation} \ln (m_{x, t})=\alpha_x+k_t+\gamma_{t-x}\end{equation}

In order to overcome the identifiability problem and let the cohort effect fluctuate around zero, several constraints should be added:

(2.4) \begin{equation} \sum_t k_t=0, \quad \sum_{c=t_0-\omega}^{t_n-x_0} \gamma_c=0, \quad \sum_{c=t_0-\omega}^{t_n-x_0} c \gamma_c=0\end{equation}

In this research, the LC and APC models serve as our baselines; they are implemented in R using the package StMoMo (Villegas et al., Reference Villegas, Kaishev and Millossovich2018). Once a model is fitted, the next step is essentially forecasting, that is, forecasting the time-dependent parameters ( $k_t$ and $\gamma_{t-x}$ ). These parameters are extrapolated using ARIMA processes to obtain the final mortality forecasts.
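As an illustration of how the constraints (2.4) can be imposed, the sketch below (our own helper, not StMoMo's internal routine) transforms raw estimates so that the constraints hold while the fitted log-mortality surface $\alpha_x + k_t + \gamma_{t-x}$ is left unchanged:

```python
import numpy as np

def constrain_apc(alpha_x, k_t, gamma_c, ages, years, cohorts):
    """Apply the APC identifiability constraints (2.4) to raw estimates.

    The fitted surface alpha_x + k_t + gamma_{t-x} is invariant under the
    transformations below, which only move level and trend between components.
    """
    # Remove the best-fitting line phi0 + phi1*c from gamma_c. OLS residuals
    # are orthogonal to the regressors, so sum_c gamma_c = 0 and
    # sum_c c*gamma_c = 0 hold after this step.
    phi1, phi0 = np.polyfit(cohorts, gamma_c, 1)
    gamma_c = gamma_c - (phi0 + phi1 * cohorts)
    # Redistribute the removed line phi0 + phi1*(t - x) so the surface
    # is unchanged: the -phi1*x part goes to alpha, the +phi1*t part to k.
    alpha_x = alpha_x + phi0 - phi1 * ages
    k_t = k_t + phi1 * years
    # Finally enforce sum_t k_t = 0 by shifting the mean of k into alpha.
    k_mean = k_t.mean()
    return alpha_x + k_mean, k_t - k_mean, gamma_c
```

The same invariance argument underlies the constraint handling in standard implementations; only the bookkeeping differs.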

2.2. Kolmogorov–Arnold Networks (KAN)

In mathematical theory, the universal approximation theorem (Cybenko, Reference Cybenko1989; Hornik et al., Reference Hornik, Stinchcombe and White1989) establishes the ability of feed-forward neural networks (FNNs) to approximate any continuous function. As a result, multilayer perceptrons (MLPs) are the most commonly used architecture in mortality research.

In contrast to MLPs, KANs rely on the Kolmogorov–Arnold representation theorem, also known as the Kolmogorov–Arnold superposition theorem (Kolmogorov, Reference Kolmogorov1957; Arnold, Reference Arnold1958). The theorem states that a continuous function $f\,:\,[0,1]^n \rightarrow \mathbb{R}$ can be expressed as

(2.5) \begin{equation} f\left(x_1, \ldots, x_n\right)=\sum_{q=1}^{2 n+1} \varphi_q\left(\sum_{p=1}^n \phi_{q, p}(x_p)\right),\end{equation}

where each $\phi_{q, p}$ is a mapping from [0,1] to $\mathbb{R}$ , and each $\varphi_q$ is a real-valued function. This formulation demonstrates that multivariate functions can fundamentally be reduced to a suitably defined composition of univariate functions, where the composition only involves simple addition.

Equation (2.5) suggests a neural network approach for fitting multivariate functions, requiring two layers of neurons that perform nonlinear transformations on the inputs, one layer comprising n neurons and the other $2n+1$ neurons. However, the majority of research (Lin and Unbehauen, Reference Lin and Unbehauen1993; Köppen, Reference Köppen2002; Sprecher and Draghici, Reference Sprecher and Draghici2002) adhered to this traditional depth-2, width- $(2n + 1)$ framework, forgoing the opportunity to incorporate modern training methodologies such as backpropagation. In recent studies, this issue has been effectively addressed: researchers abandoned the strict “ $2n+1$ ” assumption inherent in the Kolmogorov–Arnold theorem, generalizing the original Kolmogorov–Arnold representation to arbitrary widths and depths (Liu et al., Reference Liu, Wang, Vaidya, Ruehle, Halverson, Soljacic, Hou and Tegmark2024b). Moreover, through a series of comprehensive empirical studies, researchers demonstrated its capacity for time-series analysis (Vaca-Rubio et al., Reference Vaca-Rubio, Blanco, Pereira and Caus2024), computer vision (Bodner et al., Reference Bodner, Tepsich, Spolski and Pourteau2024; Kiamari et al., Reference Kiamari, Kiamari and Krishnamachari2024), and AI for science (Liu et al., Reference Liu, Ma, Wang, Matusik and Tegmark2024a), emphasizing KAN’s precision and clarity.

A KAN comprises several KAN layers. Figure 1 shows a KAN with three KAN layers, $\Phi_1, \Phi_2$ and $\Phi_3$ , which highlights the mapping from input to output. Equivalently, one can say there are four layers of nodes (neurons) in this KAN. The nodes are denoted as black points in Figure 1.

Figure 1. Sample KAN with three KAN layers (four node layers).

The shape of a KAN can be expressed as an integer array $\left[n_1, n_2, \ldots, n_L\right]$ , where $n_l$ is the number of nodes (neurons) in the l-th node layer (the network in Figure 1 is KAN[2,3,4,1]). The i-th neuron in the l-th node layer is denoted by (l, i), while the value of the (l, i)-neuron is represented by $a_{l, i}$ . There are $n_l n_{l+1}$ activation functions between the consecutive node layers l and $l+1$ . The activation function connecting the neurons (l, i) and $(l+1, j)$ is denoted by:

(2.6) \begin{equation} \phi_{l, i, j}, \quad l=1, \ldots, L, \quad i=1, \ldots, n_l, \quad j=1, \ldots, n_{l+1} .\end{equation}

As shown in Figure 1, the activation functions $\phi_{l, i, j}$ appear on the edges rather than the nodes. Let the pre-activation of $\phi_{l, i, j}$ be $a_{l, i}$ . Then, KAN calculates values of the nodes in the next layer as follows:

(2.7) \begin{equation} a_{l+1, j}=\sum_{i=1}^{n_l} \phi_{l, i, j}(a_{l, i}).\end{equation}

Equation (2.7) can be rewritten in matrix form as

(2.8) \begin{equation} \mathbf{a}_{l+1} = \boldsymbol{\Phi}_l (\mathbf{a}_{l}) \mathbf{e}\end{equation}

where $\boldsymbol{\Phi}_l (\mathbf{a}_{l})=(\phi_{l, i, j}(a_{l, i}))_{i=1, \ldots, n_l, j=1, \ldots, n_{l+1}} \in \mathbb{R}^{n_{l+1} \times n_l}$ is the function matrix corresponding to the l-th KAN layer, and $\mathbf{e}$ is a column vector of length $n_l$ with all elements equal to one. For an input $\mathbf{x} \in \mathbb{R}^{n_1}$ , a KAN model can be expressed as a composition of L layers:

(2.9) \begin{equation} \operatorname{KAN}(\mathbf{x}) = \boldsymbol{\Phi}_{L-1} \left( \boldsymbol{\Phi}_{L-2} \left( \cdots \boldsymbol{\Phi}_2 \left( \boldsymbol{\Phi}_1(\mathbf{x}) \mathbf{e}_{1} \right) \mathbf{e}_{2} \cdots \right) \mathbf{e}_{L-2} \right) \mathbf{e}_{L-1}\end{equation}
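Equations (2.7)–(2.9) can be sketched generically in Python, with the edge functions passed in as plain callables; the trainable parameterization of these functions is introduced below in (2.13). The function names here are ours:

```python
import numpy as np

def kan_layer_forward(a_l, phi):
    """One KAN layer, Equation (2.7): a_{l+1,j} = sum_i phi_{l,i,j}(a_{l,i}).

    a_l : node values of layer l, shape (n_l,)
    phi : nested list where phi[i][j] is the univariate edge function
          from node i of layer l to node j of layer l+1.
    """
    n_out = len(phi[0])
    return np.array([sum(phi[i][j](a_l[i]) for i in range(len(a_l)))
                     for j in range(n_out)])

def kan_forward(x, layers):
    """Compose KAN layers as in Equation (2.9)."""
    a = np.asarray(x, dtype=float)
    for phi in layers:
        a = kan_layer_forward(a, phi)
    return a
```

Note that each node value is simply a sum over incoming edges; all the expressive power sits in the univariate functions on those edges.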

Comparing KAN with conventional FNNs, one can observe essential structural and computational differences. A traditional FNN applies a linear transformation using weight matrices and bias terms, followed by a fixed activation function (e.g., ReLU or sigmoid). The forward computation at the l-th layer is expressed as

(2.10) \begin{equation} \mathbf{a}_{l+1} = \sigma_l(\mathbf{W}_l \mathbf{a}_l + \mathbf{b}_l),\end{equation}

where $\mathbf{W}_l$ and $\mathbf{b}_l$ are learnable parameters, and $\sigma_l$ is a fixed nonlinear function. This structure represents a linear-nonlinear computation sequence.

In contrast, KAN removes the weight matrices entirely. Instead of applying a fixed activation to a linear combination of inputs, in Equation (2.7), KAN applies learnable nonlinear functions $\phi_{l,i,j}$ directly to each input $a_{l,i}$ and sums the results across edges. This reversal of the linear-nonlinear order places most of the model’s expressive capacity in the nonlinear edge functions, making the architecture more compact.
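For side-by-side comparison, the conventional layer (2.10) takes a single line; the naming is ours:

```python
import numpy as np

def fnn_layer_forward(a_l, W, b, sigma=np.tanh):
    """Conventional FNN layer, Equation (2.10): a learned linear map followed
    by a fixed nonlinearity -- the reverse order of the KAN layer in (2.7),
    where learnable nonlinearities act first and aggregation is a plain sum."""
    return sigma(W @ a_l + b)
```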

The trainable activation functions can be customized according to individual needs. Here, we adopt a widely used definition from Liu et al. (Reference Liu, Wang, Vaidya, Ruehle, Halverson, Soljacic, Hou and Tegmark2024b):

(2.11) \begin{equation} \phi_{l,i,j}(a_{l,i})=w^b_{l,i,j} \frac{a_{l,i}}{1 + e^{-a_{l,i}}}+w^s_{l,i,j} s_{l,i,j}(a_{l,i})\end{equation}

where $s_{l,i,j}(a_{l,i})$ is a linear combination of B-splines such that

(2.12) \begin{equation} s_{l,i,j}(a_{l,i})=\sum_k c_{l,i,j,k} B_k(a_{l,i})\end{equation}

Here, $c_{l,i,j,k}$ , $w^b_{l,i,j}$ and $w^s_{l,i,j}$ are trainable parameters. The B-spline basis functions $B_k(x)$ are polynomials of order n (typically, $n = 3$ , the cubic spline) defined on a grid with G points. For predetermined n and G, there are $G+n$ B-spline basis functions on a given interval. For more detailed information on B-splines, see Cohen et al. (Reference Cohen, Riesenfeld, Elber and Riesenfeld2001).

Thus, (2.11) could be expressed as

(2.13) \begin{equation} \phi_{l,i,j}(a_{l,i})=w^b_{l,i,j} \frac{a_{l,i}}{1 + e^{-a_{l,i}}} + w^s_{l,i,j} \sum_{k=1}^{G+n} c_{l,i,j,k} B_k(a_{l,i})\end{equation}

Equation (2.13) represents a trade-off between SiLU and smoothing methods (splines), allowing for adaptation to more complex data. Spline methods offer high flexibility, making them a common approach for smoothing mortality rates, as seen in techniques like P-splines (Currie et al., Reference Currie, Durbán and Eilers2004; Richards et al., Reference Richards, Kirkby and Currie2006) and their implementation in R through package MortalitySmooth (Camarda, Reference Camarda2012). However, pure spline methods may lack support at the boundaries and, due to their strong focus on fitting the training data, may yield extrapolated trends that deviate significantly from historical patterns and may even distort age structure of mortality. Therefore, we use the activation functions defined in Equation (2.13), which balance SiLU and B-splines, aiming to incorporate the advantages of both approaches while maintaining the trainability of the activation functions.
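A single edge activation of the form (2.13) can be sketched with SciPy's `BSpline`. The uniform extended knot vector below is our own construction and may differ in detail from the pykan implementation:

```python
import numpy as np
from scipy.interpolate import BSpline

def make_activation(w_b, w_s, coefs, grid=(-1.0, 1.0), G=5, n=3):
    """One KAN edge activation, Equation (2.13):
    phi(a) = w_b * SiLU(a) + w_s * sum_k c_k B_k(a).

    With G grid intervals and degree-n splines (n=3 is cubic) there are
    G + n basis functions, so `coefs` must have length G + n.
    """
    lo, hi = grid
    h = (hi - lo) / G
    # Uniform knot vector with n extra knots extended on each side, so that
    # the G+n basis functions cover [lo, hi] and sum to one there.
    knots = np.concatenate([lo + h * np.arange(-n, 0),
                            np.linspace(lo, hi, G + 1),
                            hi + h * np.arange(1, n + 1)])
    assert len(coefs) == G + n
    spline = BSpline(knots, np.asarray(coefs, dtype=float), n,
                     extrapolate=True)

    def phi(a):
        silu = a / (1.0 + np.exp(-a))          # SiLU term of (2.13)
        return w_b * silu + w_s * spline(a)    # weighted SiLU + spline mix
    return phi
```

In a full KAN, `w_b`, `w_s`, and `coefs` are the trainable parameters of one edge; stacking one such function per edge recovers (2.7).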

The aforementioned KAN can be easily optimized using backpropagation and gradient descent. We refer to implementations such as pykan and EfficientKAN.

2.3. Shallow KAN for mortality: KAN[2,1] and ARIMAKAN

In a minimal setting, KAN already exhibits effective modeling capacity for mortality data. To discuss its potential in structural knowledge transfer and smoothness control, we introduce two shallow and intuitive models. These models not only represent the most basic applications of KAN to mortality modeling but also lay the conceptual and technical groundwork for more advanced methods developed later.

Model 1: KAN[2,1] Since mortality is generally regarded as a function of age x and year t, it is natural to model the log-mortality $\ln(m_{x,t})$ using these two covariates. We adopt a simple KAN architecture with two input variables and one output node, denoted as KAN[2,1]. The model can be written as

(2.14) \begin{equation} \ln(m_{x,t}) = \phi_{\text{age}}(x) + \phi_{\text{year}}(t),\end{equation}

where $\phi_{\text{age}}({\cdot})$ and $\phi_{\text{year}}({\cdot})$ are learnable activation functions associated with age and year, respectively. This additive structure is transparent and intuitive: the model separately captures the age effect and the time effect on mortality.
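As a rough illustration of this additive structure, the sketch below fits (2.14) by least squares, with each $\phi$ represented by a low-degree polynomial rather than KAN's SiLU–B-spline mixture. The level split between the two effects is not unique; here the intercept is attached to the age effect by convention:

```python
import numpy as np

def fit_additive(ages, years, log_m, deg=3):
    """Least-squares sketch of the additive structure (2.14),
    ln(m_{x,t}) = phi_age(x) + phi_year(t), with polynomial effects."""
    X, T = np.meshgrid(ages, years, indexing="ij")
    x, t, y = X.ravel(), T.ravel(), log_m.ravel()
    # Design matrix: [1, x, ..., x^deg, t, ..., t^deg], one shared intercept.
    cols = [np.ones_like(x)]
    cols += [x**d for d in range(1, deg + 1)]
    cols += [t**d for d in range(1, deg + 1)]
    A = np.column_stack(cols)
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    # Intercept is folded into the age effect (identifiability convention).
    phi_age = lambda xs: sum(coef[d] * xs**d for d in range(deg + 1))
    phi_year = lambda ts: sum(coef[deg + d] * ts**d for d in range(1, deg + 1))
    return phi_age, phi_year
```

Unlike this fixed polynomial basis, KAN[2,1] learns the shape of each effect through its trainable activations, but the visual interpretation of $\phi_{\text{age}}$ and $\phi_{\text{year}}$ is analogous.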

In this context, the interpretability offered by KAN[2,1] is both global and intrinsic. The functions $\phi_{\text{age}}({\cdot})$ and $\phi_{\text{year}}({\cdot})$ explain, across the entire input space, how the model depends on each feature to make predictions. These components are “what you see is what you get” – they can be directly visualized without the need for any post hoc interpretability techniques.

Compared with other commonly used interpretability tools (feature importance, PDP, SHAP), shallow KAN exhibits distinctive advantages. We provide a comparison below:

  • $\phi_{\text{age}}({\cdot})$ and $\phi_{\text{year}}({\cdot})$ provide complete functional effects, showing not only the magnitude of influence but also the direction (positive or negative), the shape (linear or nonlinear), and how these relationships vary across the domain. These explanations are generated immediately after model fitting, without requiring additional computation.

  • Feature importance methods only quantify the relative magnitude of importance, offering no insight into the sign or functional form of the effect.

  • PDPs approximate the average relationship between a feature and the prediction by marginalizing over other variables. This averaging can obscure heterogeneous effects, and the resulting plots may not fully uncover the model’s behavior, as discussed in Xin et al. (Reference Xin, Hooker and Huang2025).

  • SHAP, by design, provides explanations for individual predictions (local interpretability), and as a post hoc method, its relatively complex computation and subsequent analysis can be time-consuming.

For the KAN[2,1] model, direct fitting is the easiest way to train it. An alternative training strategy involves freezing the temporal component $\phi_{\text{year}}(t)$ initially and pretraining the age effect $\phi_{\text{age}}(x)$ by fitting the log-mortality curve of a selected year. After this initialization, the year effect is unfrozen and optimized jointly. While such a phased procedure may be redundant for shallow KANs, a similar idea proves useful for specially designed deeper KAN architectures as knowledge extraction tools. We will revisit this approach in the context of deep KAN models in Subsection 2.4.

One-KAN-layer models can be formally regarded as a type of generalized additive models (GAMs) (Wood, Reference Wood2017), since both express the target variable as a sum of univariate functions. However, in practice, there is a key difference between KAN and traditional GAMs. GAMs typically represent each effect using splines, such as the widely used mgcv package in R. In contrast, KAN does not rely purely on splines. Each function $\phi$ is modeled as a learnable combination of spline basis functions and the SiLU (Sigmoid Linear Unit) activation, as shown in (2.13). This spline–SiLU mixture introduces a notable departure from the conventional GAM setup, affecting both the interpretation and prediction performance. We will discuss empirical differences in greater detail in Section 4. Nevertheless, shallow KAN models can still accommodate smoothness penalties similar to those used in GAMs, which provide a natural starting point for introducing regularization in deeper KAN structures.

The models above treat age and time simultaneously. However, an alternative approach of first modeling the mortality time series for each age separately and then smoothing along the age dimension remains highly valuable. Modeling multiple time series helps to capture mortality improvements across different ages, while the post-smoothing step preserves the correct age structure of mortality and ensures strong interpretability of the model.

Model 2: ARIMAKAN We propose a straightforward two-step approach called ARIMAKAN, which is based on the following procedure:

  1. (1) ARIMA Fitting: For each age x, an ARIMA(p, d, q) model is fitted to the time series of log-mortality on $\mathcal{T} = \left\{t_0, t_1, \ldots, t_n\right\}$ :

    (2.15) \begin{equation} \alpha_p(B) \nabla^d \ln (m_{x, t}) = \theta_q(B) \epsilon_t, \end{equation}

    where B is the backward shift operator. $\alpha_p(B)$ is the autoregressive polynomial of order p, $\nabla^d$ represents differencing of order d. $\theta_q(B)$ stands for the moving average polynomial of order q while $\epsilon_t$ is a white noise error term with zero mean and constant variance. The model selection is based on minimizing the Akaike Information Criterion (AIC).

  2. (2) ARIMA Forecasting: Once the ARIMA models have been fitted for each x, they are used to forecast future values of $\ln (m_{x, t})$ . Assuming that forecasting starts from year $t_{n}$ and defining the forecasting horizon as $\mathcal{T}^{\prime}=\left\{t_{n+1}, \ldots, t_{n+s}\right\}, s \geq 1$ , the forecasted value h steps ahead is denoted by $\widetilde{\ln\left(m_{x, t_{n+h}}\right)}$ .

  3. (3) KAN for Age Structure: After ARIMA forecasting, we have a predicted log-mortality curve for each year in $\mathcal{T}^{\prime}$ with $x \in \mathcal{X}$ . Each predicted log-mortality curve is then fitted separately using a shallow KAN model with one input and one output, i.e., KAN[1,1]:

    (2.16) \begin{equation} \widetilde{\ln\left(m_{x, t_{n+h}}\right)} = \phi_{\text{age}}(x), x \in \mathcal{X}, \end{equation}
    where $\phi_{\text{age}}$ is a learnable nonlinear function representing the age effect. Repeating this process across all h yields the full set of estimated values $\widehat{\ln\left(m_{x, t_{n+h}}\right)}$ .
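The two-step procedure above can be sketched numerically. The snippet below is a minimal illustration, not the paper's implementation: per-age ARIMA fitting is replaced by its simplest special case, a random walk with drift (ARIMA(0,1,0) with a constant), the KAN[1,1] age smoother is replaced by a centred moving average, and the log-mortality surface is synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic log-mortality surface: ages 0-99, 70 calendar years.
ages = np.arange(100)
years = np.arange(70)
base = -8.0 + 0.07 * ages                      # crude age profile
trend = -0.015 * years                         # mortality improvement
log_m = base[:, None] + trend[None, :] + rng.normal(0, 0.02, (100, 70))

# Steps 1-2: per-age time-series model. For illustration we use the simplest
# member of the ARIMA family, a random walk with drift on log-mortality.
def rwd_forecast(series, horizon):
    drift = np.mean(np.diff(series))
    return series[-1] + drift * np.arange(1, horizon + 1)

s = 15                                          # forecasting horizon
forecasts = np.stack([rwd_forecast(log_m[x], s) for x in range(100)])  # (100, s)

# Step 3: smooth each forecasted curve along age. A centred moving average
# stands in for the KAN[1,1] age smoother used in the paper.
def smooth_age(curve, w=5):
    pad = np.pad(curve, w // 2, mode="edge")
    return np.convolve(pad, np.ones(w) / w, mode="valid")

smoothed = np.stack([smooth_age(forecasts[:, h]) for h in range(s)], axis=1)
print(smoothed.shape)  # (100, 15)
```

The per-age forecasts inherit noise from the independently estimated drifts, so the age-wise smoothing pass visibly reduces roughness along the age dimension, which is the role KAN[1,1] plays in ARIMAKAN.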

2.4. KAN with deeper structure: From CANN to KANN

We now explore deeper KAN models with the ability to learn structural knowledge from traditional mortality models. To introduce our models, we first discuss the CANN architecture. CANN is widely used in actuarial modeling for achieving interpretability and facilitating knowledge transfer from GLMs. With modifications, CANN can be pre-trained using either traditional mortality models or existing GAM-based structures. This straightforward yet powerful modification replaces all MLP blocks in CANN with KANs, giving rise to a new class of models that we refer to as KANN (KAN-based Actuarial Neural Network).

Let the features be $ \mathbf{x} = (x_1, x_2, \dots, x_p)^\top $ , let y denote the true value of the response variable, and let $\hat{y}$ denote its estimate. For an n-layer CANN, the forward propagation can be written as

(2.17) \begin{equation} \mathbf{a}_1 = \sigma_0(\mathbf{W}_0 \mathbf{x} + \mathbf{b}_0), \quad \mathbf{a}_2 = \sigma_1(\mathbf{W}_1 \mathbf{a}_1 + \mathbf{b}_1), \quad \dots, \quad \mathbf{a}_{n-1} = \sigma_{n-2}(\mathbf{W}_{n-2} \mathbf{a}_{n-2} + \mathbf{b}_{n-2}),\end{equation}
(2.18) \begin{equation} \hat{y} = \sigma_{n-1}(\mathbf{W}_{n-1} \mathbf{a}_{n-1} + b_{n-1} + \mathbf{B}\mathbf{x}).\end{equation}

Here, $\mathbf{W}_i, \mathbf{b}_i$ and $\mathbf{B}$ are trainable parameters, and $\sigma_i$ denotes activation functions. The vector $\mathbf{B}$ is a skip connection that introduces linear knowledge. It is either calibrated directly from the data or inherited from a pretrained GLM.

To pretrain the CANN using GLM coefficients, one can set $\mathbf{W}_{n-1}$ and $b_{n-1}$ to zeroFootnote 3 and directly initialize $\mathbf{B}$ using the maximum likelihood estimates from a GLM. During training, $\mathbf{B}$ is updated along with other parameters, and the final prediction $\hat{y}$ is composed of both linear and nonlinear components, thereby preserving interpretability while enhancing predictive performance.
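The zero-initialization trick can be illustrated with a toy forward pass. The sketch below is a minimal numpy illustration under assumed shapes (two features, one hidden layer, identity output activation): with $\mathbf{W}_{n-1} = \mathbf{0}$ and $b_{n-1} = 0$, the network's initial prediction coincides exactly with the GLM prediction carried by the skip connection $\mathbf{B}$.

```python
import numpy as np

rng = np.random.default_rng(1)

p, h = 2, 8                      # features (age, year), hidden width
X = rng.normal(size=(5, p))      # a few sample inputs

# Coefficients from a hypothetical pre-fitted GLM (invented for the example).
B = np.array([0.05, -0.02])

# Hidden layer: randomly initialized as usual.
W0 = rng.normal(0, 0.1, (h, p)); b0 = np.zeros(h)
# Final layer: zero-initialized so the network starts exactly at the GLM.
W1 = np.zeros((1, h)); b1 = 0.0

def cann_forward(x):
    a1 = np.tanh(W0 @ x + b0)                 # nonlinear block
    return W1 @ a1 + b1 + B @ x               # skip connection carries the GLM

glm_pred = X @ B
cann_pred = np.array([cann_forward(x)[0] for x in X])
print(np.allclose(cann_pred, glm_pred))  # True: training starts from the GLM fit
```

During training $\mathbf{W}_1$ moves away from zero, so the prediction becomes the GLM term plus a learned nonlinear correction, which is exactly the decomposition in (2.18).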

When all MLP blocks in CANN are replaced by KANs, the architecture undergoes a fundamental transformation. The forward propagation in KANN becomes:

(2.19) \begin{equation} \mathbf{a}_1 = \boldsymbol{\Phi}_0(\mathbf{x}) \mathbf{e}_0, \quad \mathbf{a}_2 = \boldsymbol{\Phi}_1(\mathbf{a}_1) \mathbf{e}_1, \quad \dots, \quad \mathbf{a}_{n-1} = \boldsymbol{\Phi}_{n-2}(\mathbf{a}_{n-2}) \mathbf{e}_{n-2},\end{equation}
(2.20) \begin{equation} \hat{y} = \boldsymbol{\Phi}_{n-1}(\mathbf{a}_{n-1}) \mathbf{e}_{n-1} + \sum_{j=1}^p \phi_j(x_j),\end{equation}

where each $\boldsymbol{\Phi}_i({\cdot})$ is a learnable function matrix, and each $\phi_j(x_j)$ represents the univariate effect of the original feature $x_j$ . Each $\mathbf{e}_{l}$ is a vector of ones with dimension $n_{l}$ (the number of nodes in the l-th node layer). Briefly speaking, a skip connection passes the information of the original features directly to the final node layer. In CANN, this is done by introducing a linear combination of the original features into the last node; the linear combination is then added to the nonlinear effect $\mathbf{W}_{n-1} \mathbf{a}_{n-1}$ and passed through the activation function $\sigma_{n-1}({\cdot})$ to produce the output. In the context of KAN, however, the node only performs summation, as the nonlinear transformation has already been completed on the edges. KANN therefore takes a purely additive form, as shown in (2.20).
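The edge-function-then-sum mechanism of (2.19) can be made concrete with a toy layer. In this sketch (the edge functions are illustrative, not learned), $\boldsymbol{\Phi}$ is a matrix of univariate callables, and multiplying by a vector of ones implements the pure summation performed at each node.

```python
import numpy as np

# One KAN layer written as Phi(x) @ e: the (i, j) entry applies an edge
# function to input j, and multiplying by a vector of ones makes each node
# a pure summation, matching Eqs. (2.19)-(2.20).
phi = [[np.sin, np.cos],         # edges into node 1
       [np.tanh, np.square]]     # edges into node 2

def kan_layer(x):
    edge_out = np.array([[phi[i][j](x[j]) for j in range(len(x))]
                         for i in range(len(phi))])
    return edge_out @ np.ones(len(x))   # nodes only sum

x = np.array([0.3, 1.2])
a = kan_layer(x)
# Node 1 is exactly sin(x1) + cos(x2); no activation is applied at the node.
print(np.isclose(a[0], np.sin(0.3) + np.cos(1.2)))  # True
```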

KANN also possesses high interpretability. By simply plotting the function $\phi_j(x_j)$ against $x_j$ , one can see what kind of nonlinear relationship the model has learned, without resorting to complex post hoc tools to explore the model behavior. Of course, this requires that $\boldsymbol{\Phi}_{n-1}(\mathbf{a}_{n-1})$ is not excessively large; otherwise, the decision process given by $\phi_j(x_j)$ can no longer be regarded as dominant. Later, we will refer to $\boldsymbol{\Phi}_{n-1}(\mathbf{a}_{n-1})$ as the deep output and to $\phi_j(x_j)$ as the shallow output.

If the original CANN can be viewed as a hybrid architecture combining neural networks and the Generalized Linear Model (GLM), then the KANN in (2.20) can be regarded as a hybrid of neural networks and the Generalized Additive Model (GAM). The reason is that, in CANN, the interpretable component $\mathbf{B}$ corresponds to the coefficient vector in a GLM, whereas in KANN, the shallow functions $\phi_j(x_j)$ represent the decomposition of nonlinear effects for each variable, analogous to the additive components in a GAM.

It is worth noting that, compared with CANN, KANN relies more heavily on the specification of the functional form of $\phi_j(x_j)$ . This is because, in CANN, each variable is associated with a single interpretable coefficient, whereas in KANN, the corresponding effect must be expressed as a full nonlinear curve. Consequently, obtaining meaningful interpretability in KANN is inherently more demanding. There are at least three ways to train KANNs, and the interpretability obtained from each method may not be the same:

  1. (1) Jointly learn both the deep and shallow parts of KANN. This method is straightforward, but the effect of the shallow KAN is likely to be taken over by the deep KAN, making the final interpretation confusing. An improved approach is phased training: first set $\boldsymbol{\Phi}_{n-1}$ to $\mathbf{0}$ and freeze its parameters, then use a large learning rate to train $\phi_j(x_j)$ until convergence. Subsequently, the parameters of $\boldsymbol{\Phi}_{n-1}$ are unlocked for fine-tuning with a small learning rate. In this way, the deep output will not become too large, which improves model accuracy while retaining interpretability.

  2. (2) Residual fitting. First fit a KAN[n,1] (or a GAM) to the original data, representing the shallow part of KANN. Then calculate the residuals and use them to train the deep part (a deep KAN). This method is closest in spirit to boosting. However, when this method is used for mortality modeling, the resulting mortality curves may not be smooth, because the deep model, when fitting residuals, captures not only the mortality pattern but also historical noise. As a result, future mortality predictions may contain substantial noise.

  3. (3) Pretraining and formal training. This is our most recommended method. We first treat each $\phi_j(x_j),\ j \in \{1, 2,\dots, p\}$ as an independent unit (KAN[1,1]), and let them separately fit known approximate univariate relationships $f_j(x_j),\ j \in \{1, 2,\dots, p\}$ to achieve the effect of pretraining. Generally, the parameters of the shallow part can be loaded from many sources. Even if only a rough relationship between $x_j$ and y is known, it can be used as an initialization. For example, if we know that $y \approx kx_j$ when some detailed effects are ignored, we can sample points from $y = kx_j$ at uniform intervals and let $\phi_j(x_j)$ learn from them. If the approximate relationship and the actual relationship differ significantly, the phased training mechanism from (1) can be added.
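The pretraining step in (3) can be sketched as follows. In this illustration, a cubic polynomial basis stands in for the B-spline basis of a KAN[1,1] edge, the prior relationship is the rough linear guess $y \approx kx_j$ mentioned above (with an invented slope), and the fit is a plain least-squares problem rather than gradient descent.

```python
import numpy as np

# Pretraining a shallow unit: sample a rough prior relationship y = k*x at
# uniform grid points and let a small basis expansion (a stand-in for the
# KAN[1,1] spline edge) fit those samples before formal training starts.
k = -0.02                                     # hypothetical prior slope
xs = np.arange(0, 101, 5, dtype=float)        # uniform sampling of the prior
ys = k * xs                                   # the rough prior y = k*x

# Cubic polynomial basis as a simple substitute for the B-spline basis.
basis = np.vander(xs / 100.0, 4)
coef, *_ = np.linalg.lstsq(basis, ys, rcond=None)

fitted = basis @ coef
print(np.max(np.abs(fitted - ys)) < 1e-8)     # a line is fit exactly: True
```

After this step the unit reproduces the prior; formal training then refines the coefficients, ideally with the small learning rate recommended above.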

Under the KANN framework, we first extend the original KAN[2,1] to KANN[2,1], with architecture in Figure 2.

Figure 2. KANN[2,1] with a deep part of KAN[2,8,8,1]. The dashed arrows indicate the components of KANN. $\phi_{\text{age}}(x)$ and $\phi_{\text{year}}(t)$ are connected to the nodes in the final layer, forming an additive relationship with the deep part.

Model 3: KANN[2,1] The input is $ \mathbf{x} = (x_1, x_2)^\top $ , representing age and year. The KANN[2,1] with deep part structure KAN[2,8,8,1] and shallow part KAN[2,1] is expressed as

(2.21) \begin{equation} \ln(m_{x,t}) = \boldsymbol{\Phi}_3\left( \boldsymbol{\Phi}_2\left( \boldsymbol{\Phi}_1(\mathbf{x}) \cdot \mathbf{e}_1 \right) \cdot \mathbf{e}_2 \right) \cdot \mathbf{e}_3 + \phi_{\text{age}}(x) + \phi_{\text{year}}(t),\end{equation}

where:

  • $\Phi_1 \in \mathbb{R}^{8 \times 2}$ , with 16 learnable activation functions $\phi_{1,i,j}$ , $i=1,2;\ j=1,\dots,8,$

  • $\Phi_2 \in \mathbb{R}^{8 \times 8}$ , with 64 learnable activation functions $\phi_{2,i,j},\ i,j=1,\dots,8,$

  • $\Phi_3 \in \mathbb{R}^{1 \times 8}$ , with 8 learnable activation functions $\phi_{3,i,1},\ i=1,\dots,8.$

Although a three- or four-layer neural network is far from deep in the traditional FNN framework, it is a different story under the KAN architecture. Due to KAN’s trainable activation functions, there is no need to stack dozens or even hundreds of layers.

To assess whether the KANN[2,1] model is capable of learning from and extending existing mortality models, we adopt a pretraining-based strategy. Specifically, the prior knowledge is derived from a commonly used generalized additive model (GAM). Fitting this GAM provides two components: $f_{\text{age}}(x)$ and $f_{\text{year}}(t)$ . By sampling the components at integer values of the input domains, and letting $\phi_{\text{age}}(x)$ and $\phi_{\text{year}}(t)$ fit the corresponding GAM estimates, we can effectively load this prior knowledge into the model. The entire network is then formally trained using a small learning rate. During this phase, the deep component of KANN begins to detect mortality structures not captured by the original GAM, interacting with $\phi_{\text{age}}(x)$ and $\phi_{\text{year}}(t)$ to refine and adjust their initial estimates.

Importantly, the knowledge input to $\phi_{\text{age}}(x)$ and $\phi_{\text{year}}(t)$ need not be restricted to GAM smooths. Although not employed in this study, several alternative sources of prior knowledge are listed here, as they may serve as “study material” to accelerate convergence and shape the model’s interpretation. Examples include:

  • Alternative knowledge inputs for $\phi_{\text{age}}(x)$ :

    1. The log-mortality curve of a selected year;

    2. A graduated version of a single-year log-mortality curve. If smoothing is performed using a parametric model and $\phi_{\text{age}}(x)$ is frozen during formal training, this results in a partially parametric specification of dynamic mortality;

    3. The $a_x$ component from the LC model;

    4. A positively sloped line that lies below 0 on the y-axis for ages 0–100. Such heuristic guesses may require phased training for stable convergence.

  • Alternative knowledge inputs for $\phi_{\text{year}}(t)$ :

    • The mortality time series of a selected age group;

    • The $k_t$ component from the LC model;

    • A negatively sloped line.

Historically, mortality improvements have not been uniform across ages. This leaves scope for KANN to pre-learn more intricate mortality patterns from certain traditional models. To this end, we introduce two additional model classes, equipping KANN with the capability to be pre-trained by traditional models featuring factor interaction terms or cohort effects.

Model 4: KANNLC The KANNLC with deep part structure KAN[2,8,8,1] and shallow LC part is defined as

(2.22) \begin{equation} \ln(m_{x,t}) = \boldsymbol{\Phi}_3\left( \boldsymbol{\Phi}_2\left( \boldsymbol{\Phi}_1(\mathbf{x}) \cdot \mathbf{e}_1 \right) \cdot \mathbf{e}_2 \right) \cdot \mathbf{e}_3 + \phi_{\text{age}_1}(x) + \phi_{\text{age}_2}(x) \times \phi_{\text{year}}(t),\end{equation}

where the LC components correspond to $\phi_{\text{age}_1}(x)$ and the interaction term $\phi_{\text{age}_2}(x) \times \phi_{\text{year}}(t)$ . In the pretraining phase, $\phi_{\text{age}_1}(x)$ , $\phi_{\text{age}_2}(x)$ , and $\phi_{\text{year}}(t)$ are, respectively, aligned with $a_x$ , $b_x$ , and $k_t$ from the LC model. This approach enables KANN to assimilate the structures of mortality models traditionally estimated via GNM. The architecture of KANNLC is depicted in Figure 3.

Figure 3. KANNLC architecture. $\phi_{\text{age}_2}(x)$ and $\phi_{\text{year}}(t)$ are multiplied together and connected to the final layer.
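As a sketch of where the pretraining targets come from, the snippet below calibrates $a_x$, $b_x$, and $k_t$ via the classical SVD route of Lee and Carter on a synthetic rank-one log-mortality surface (invented for the example); in KANNLC these would initialize $\phi_{\text{age}_1}$, $\phi_{\text{age}_2}$, and $\phi_{\text{year}}$, respectively.

```python
import numpy as np

# Build a noise-free LC-structured surface: ln m_{x,t} = a_x + b_x * k_t.
ages, years = 100, 70
a_true = -8.0 + 0.07 * np.arange(ages)
b_true = np.full(ages, 1.0 / ages)
k_true = -1.5 * np.arange(years)
log_m = a_true[:, None] + np.outer(b_true, k_true)

# Standard SVD calibration of the LC components.
a_x = log_m.mean(axis=1)                       # level term a_x
U, S, Vt = np.linalg.svd(log_m - a_x[:, None], full_matrices=False)
b_x = U[:, 0] / U[:, 0].sum()                  # normalize so sum(b_x) = 1
k_t = S[0] * Vt[0] * U[:, 0].sum()

recon = a_x[:, None] + np.outer(b_x, k_t)
print(np.allclose(recon, log_m))  # rank-one surface is recovered exactly: True
```

In practice the surface contains noise, so the reconstruction is approximate; the components are nonetheless a natural identifiable initialization for the shallow part, as discussed above.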

Model 5: KANNAPC The KANNAPC consisting of a deep component KAN[2,8,8,1] and a shallow cohort component, is specified as

(2.23) \begin{equation} \ln(m_{x,t}) = \boldsymbol{\Phi}_3\left( \boldsymbol{\Phi}_2\left( \boldsymbol{\Phi}_1(\mathbf{x}) \cdot \mathbf{e}_1 \right) \cdot \mathbf{e}_2 \right) \cdot \mathbf{e}_3 + \phi_{\text{age}_1}(x) + \phi_{\text{year}_1}(t) + \phi_{\text{cohort}}(\phi_{\text{age}_2}(x) + \phi_{\text{year}_2}(t)).\end{equation}

Here, $\phi_{\text{age}_2}(x) = -x$ and $\phi_{\text{year}_2}(t) = t$ , and the inclusion of these two structures enables KANN to automatically compute $t-x$ , as shown in Figure 4. Another approach is to include the cohort as an additional variable, which is easier to implement. During pretraining, $\phi_{\text{age}_1}(x)$ , $\phi_{\text{year}_1}(t)$ , and $\phi_{\text{cohort}}(\gamma)$ are calibrated using $a_x$ , $k_t$ , and $\gamma_{t-x}$ from the APC model. The formal training procedure is similar to that described for KANN[2,1].

Figure 4. KANNAPC architecture. Age x is transformed into $-x$ after passing through $\phi_{\text{age}_2}(x)$ , while year t remains t after passing through $\phi_{\text{year}_2}(t)$ . The two are then added to obtain the birth year $\gamma$ . The birth year (cohort) is transformed by $\phi_{\text{cohort}}(\gamma)$ and connected to the nodes in the final layer.

Apart from the above structures, we also trained a pure KAN[2,8,8,1] (without any skip connection) as Model 6. While such deep architectures inevitably sacrifice much of the interpretability of shallow KANs, comparing their outputs offers valuable insights into how model behavior evolves with increased depth.

We conclude this subsection with a remark on the key features of KANN. Overall, KANN exhibits the following notable characteristics:

  • Interpretability and knowledge transfer: KANN inherits the clear interpretability of shallow KANs while extending the applicability of the classical CANN framework to a broader range of mortality modeling tasks. Simply observing the shallow component of KANN allows its decision-making process to be interpreted directly. This interpretation provides an exact description of the model’s behavior, without the need for any additional interpretation techniques. Due to its structural simplicity, KANN can directly absorb knowledge embedded in traditional mortality models, and in turn use this interpretive capability to guide and enhance its own training.

  • Highly modular design: The architecture of KANN is inherently modular, similar to building with blocks. Adding a new univariate effect only requires incorporating an additional KAN[1,1] with a skip connection, without altering the overall structure, thus making experimentation and model extension easier.

  • Balance between interpretability and prior knowledge: The degree to which KANN relies on prior mortality knowledge depends on how strictly interpretability is desired, leading to a trade-off.

    1. Without any external information, KANN[2,1], KANNLC, and KANNAPC are all capable of fitting mortality patterns, though their interpretations may differ substantially from those of traditional models.

    2. If one wishes to obtain interpretations from KANN[2,1] that are comparable to traditional models under zero prior knowledge, a phased training strategy can be adopted: initialize the function matrix in the final layer of the deep part to zero and keep it frozen, focusing first on training the two elements in the shallow part. Once the error has sufficiently decreased, the deep network is unlocked for joint training, providing the desired interpretative form.

    3. For certain models corresponding to traditional mortality frameworks with identifiability constraints, such as KANNLC and KANNAPC, obtaining interpretations that closely match those of the original models is unlikely without any prior knowledge. In such cases, the prior information provided should be sufficient to make the model exactly identifiable (e.g., provide $b_x$ in LC model); otherwise, the desired interpretation can only be obtained by chance.

2.5. Other models for comparison

A comparison limited to the LC and APC models may not fully reveal the additional insights offered by KAN and KANN. Since KAN is a neural-network-based approach whose components incorporate spline functions, we present the results of several additional models to facilitate a more comprehensive comparison.

  • LSTM: Following the multivariate time series framework of Perla et al. (Reference Perla, Richman, Scognamiglio and Wüthrich2021), we trained and tuned LSTM models separately for each population. LSTM is a widely used neural network architecture in mortality modeling, valued for its easy training procedure and reasonable predictive accuracy. Specifically, we implemented the model using PyTorch, simplifying the gender and country embedding components of the LCLSTM1 model from Perla et al. (Reference Perla, Richman, Scognamiglio and Wüthrich2021) to adapt it for single-population mortality forecasting. Hyperparameter tuning was performed using a small validation set (see Subsection 4.1).Footnote 4 A 10-year rolling window for historical observations was adopted, consistent with the original study.

  • GAM: Generalized additive models are also employed for the initialization of KANN[2,1]. We highlight this model to compare the interpretations produced by shallow KAN and by GAM. Utilizing the mgcv package in R, we optimized three key parameters on the validation set: the number of splines $k_{\text{year}}$ for the time effect $f_{\text{year}}(t)$ , the number of splines $k_{\text{age}}$ for the age effect $f_{\text{age}}(x)$ , and the smoothing penalty term sp Footnote 5 . Thin plate splines are selected, following Hall and Friel (Reference Hall and Friel2011).

  • 2D P-spline: We also use the penalized spline approach of Currie et al. (Reference Currie, Durbán and Eilers2004), Camarda (Reference Camarda2012) to model mortality rates. This method is generally regarded as striking a favourable balance between smoothness and predictive accuracy, and we adopt it as the benchmark for evaluating whether the smoothness achieved by our models is satisfactory. Modeling was performed using the MortalitySmooth package in R, with parameters following the default settings of Mort2Dsmooth except for the over-dispersion parameter, which was explicitly set to TRUE, since over-dispersion is a common issue in mortality modeling (Macdonald et al., Reference Macdonald, Richards and Currie2018).

3. Applying KAN and KANN

In the previous section, we discussed how to design models that are interpretable while still capable of incorporating knowledge from traditional models, and presented several KAN-based models along with their variants. In this section, we address the problem of implementing smoothness control without resorting to synthetic samples, and define several metrics that will be used in the subsequent empirical experiments.

3.1. Smoothing control

As shown in (2.13), spline functions are embedded in the architecture of KAN. This means that existing spline-based smoothing techniques can naturally be applied to KAN. In what follows, we implement these techniques for KAN[2,1], KANN[2,1], KANNLC, and KANNAPC. Since the shallow components of KAN[2,1] and all three variants of KANN can directly explain model behavior and provide global interpretability, smoothing these components is essentially equivalent to smoothing the model output itself, without the need to generate synthetic data for smoothing post hoc explanations.

For effective smoothness control and a balance between smoothness and accuracy, the most important tuning parameter is the number of grid points, G. Recommended choices of G vary depending on the model component:

  • High G : For any component capturing the direct relationship between mortality and age, such as $\phi_{\text{age}}$ in KAN[2,1] and KANN[2,1], or $\phi_{\text{age}_1}$ in KANNLC and KANNAPC. If G is too small, the “hook”-shaped structure of log-mortality over age cannot be adequately captured, resulting in poor predictions at young ages. In short, one should not sacrifice the number of splines in these components purely for excessive smoothness; the optimal G is typically in the range of 15–50.

  • Low G : For most time-related terms, such as $\phi_{\text{year}}$ in KAN[2,1], KANN[2,1], and KANNLC, or $\phi_{\text{year}_1}$ in KANNAPC. Increasing G for these terms may improve in-sample fit, but can also lead to unrealistic temporal trends in forecasts. We recommend choosing G from 5 to 20 for these terms.

  • Validation-dependent: For components such as $\phi_{\text{age}_2}$ in KANNLC and $\phi_{\text{cohort}}$ in KANNAPC, where the appropriate degree of smoothness is less clear and may vary by country and gender. In these cases, G should be selected using a validation set.

After choosing a suitable G, smoothness regularization must also be applied to the components of KAN and KANN to achieve the smoothing effect. In shallow KAN models, the output from the SiLU portion of the activation function is very smooth, meaning that the smoothness of the overall model output is almost entirely determined by the spline component of the activation function. Therefore, the coefficients of the spline component can be constrained to make the output smoother, thereby improving the model’s generalization ability. When training KAN[2,1] and ARIMAKAN, we therefore use a modified loss function:

(3.1) \begin{equation} L(m_{x, t}, \boldsymbol{\theta})=\frac{\sum_x\sum_t\left(\widehat{\ln m_{x, t}} - \ln m_{x, t}\right)^2}{N} + \lambda \sum_{l, i, j} \sum_{k=1}^{n+G-1}\left|c_{l, i, j, k+1}-c_{l, i, j, k}\right|\end{equation}

where $\boldsymbol{\theta}$ stands for all training parameters in KAN, N represents the number of training samples, and $\lambda$ is the regularization parameter. The first term on the right-hand side of Equation (3.1) is basically the training MSE, focusing on the discrepancy between fitted and actual mortality rates. The second term penalizes the sum of the absolute values of the differences in B-spline coefficients to smooth model outputs.

Intuitively, if the difference between adjacent spline coefficients is large, the next spline can significantly alter the curve characteristics as the previous spline approaches zero. When there are many splines, this results in frequent local fluctuations and oscillations, making the fitted mortality rate quite rough. Therefore, the loss function can directly control the smoothness of the fit and predictions through the choice of $\lambda$ , in light of Currie et al. (Reference Currie, Durbán and Eilers2004).
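The behavior of the penalty term can be checked directly: coefficient sequences with oscillating adjacent differences incur a much larger penalty than steadily varying ones. A minimal numpy check, with illustrative coefficient vectors:

```python
import numpy as np

# The second term of (3.1): an L1 penalty on first differences of adjacent
# B-spline coefficients. Wiggly coefficient sequences are penalized far more
# heavily than smooth ones, which is what flattens local oscillations.
def roughness_penalty(coefs, lam):
    return lam * np.sum(np.abs(np.diff(coefs)))

smooth_c = np.linspace(-8.0, -2.0, 20)                  # steadily rising
wiggly_c = smooth_c + 0.5 * (-1.0) ** np.arange(20)     # alternating bumps

print(roughness_penalty(smooth_c, 0.1) < roughness_penalty(wiggly_c, 0.1))  # True
```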

Figure 5 depicts the changes in goodness-of-fit as the level of regularization increases. When the smoothing parameter rises from 0 to 0.001, the estimated mortality rates around ages 5 and 20 become smoother. Further penalization significantly adjusts the fit at age 0, causing the mortality curve to gradually degrade into a sloped line to satisfy the constraint. We consider $\lambda = 0.1$ to be a critical threshold, as the model at this point has already made substantial compromises for smoothness, and at high ages (90–100), the fitted curve begins to lose spline support at the boundaries. As the penalty continues to increase to 1, the model retains a rough fit for middle-aged mortality, but the structure of mortality at younger and older ages becomes completely distorted.

Figure 5. $\lambda$ ’s impact on fitted mortality from KAN[2,1] on 1975 US female population.

It is important to note that when fitting KANN, we compute the second term in (3.1) using only the parameters from its shallow component. This design ensures that the smoothness constraint directly targets the interpretable part of the model without interfering with the flexibility of the deep component, thereby preserving both smoothness and predictive accuracy.

3.2. Performance metrics

This paper evaluates the model’s predictive results from two perspectives: accuracy and smoothness. The goodness of fit is calculated using two measures: RMSE and MAE.

(3.2) \begin{equation} \text {RMSE}=\sqrt{\frac{\sum_{x}\sum_{t}\left[\widehat{\ln \left(m_{x,t}\right)}-\ln \left(m_{x,t}\right)\right]^2}{N}}, \text {MAE}=\frac{\sum_{x}\sum_{t}\left|\widehat{\ln \left(m_{x,t}\right)}-\ln \left(m_{x,t}\right)\right|}{N}\end{equation}

This paper also utilizes a difference operator to measure smoothness:

(3.3) \begin{equation} \text{Smoothness across Ages} = \sum_t \sum_x \left(\Delta^2_x \widehat{\ln \left(m_{x,t}\right)}\right)^2\end{equation}

In simple terms, the measure of smoothness across ages is computed by taking the second-order differences of the predicted log-mortality rates along the age dimension, squaring these differences elementwise, and summing them to obtain a single number. If the curve is locally close to a straight line, the smoothness metric along the age dimension is low. We do not compare smoothness across years, as capturing the correct trend in the time dimension is more important than the smoothness of the curve.
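The metric in (3.3) is straightforward to compute. The sketch below, on an invented surface, confirms that a locally linear age profile scores lower than a noisy one:

```python
import numpy as np

# Smoothness across ages from (3.3): squared second-order differences of
# predicted log-mortality along the age dimension, summed over all cells.
def age_smoothness(log_m_hat):
    return np.sum(np.diff(log_m_hat, n=2, axis=0) ** 2)  # axis 0 = age

rng = np.random.default_rng(3)
ages = np.arange(100, dtype=float)
smooth_surface = (-8.0 + 0.07 * ages)[:, None] + np.zeros((1, 15))  # linear in age
noisy_surface = smooth_surface + rng.normal(0, 0.05, smooth_surface.shape)

print(age_smoothness(smooth_surface) < age_smoothness(noisy_surface))  # True
```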

4. Empirical investigation and results

4.1. Data

We validate our models using mortality data from the Human Mortality Database (HMD), which compiles high-quality demographic data from numerous countries and regions. Our data include countries where both male and female populations exceed two million, and for which complete mortality records are available from 1950 to 2019. This results in a total of 17 countries (34 populations) being selected. For each, we use age-specific mortality rates with age range $\mathcal{X} = \{x \in \mathbb{N}_0\,:\,0 \leq x \leq \omega = 99\}$ and calendar years $\mathcal{T} = \{t \in \mathbb{N}\,:\,1950 \leq t \leq 2019\}$ .

In certain countries, mortality rates for specific ages are recorded as zero in the HMD. These values are imputed using the average mortality rates from other countries or regions for the same age and year. The list of included countries is provided in Appendix A.

The data preprocessing required for fitting the KAN is relatively simple: all t values are shifted by subtracting the earliest recorded year (1950 in this study), transforming t into incremental years. This step is crucial, as differences in the scales of time and age can easily lead to training difficulties, resulting in a flat age effect and rendering the model ineffective. For the LSTM in the baseline, we applied min–max normalization to the variables to ensure its performance.
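These two preprocessing steps can be written in a few lines (a minimal sketch with illustrative data):

```python
import numpy as np

# Year shifting for the KAN inputs: subtract the earliest recorded year so
# that t becomes incremental, keeping age and time on comparable scales.
years = np.arange(1950, 2020)
t_incremental = years - years.min()           # 0, 1, ..., 69

# Min-max normalization, as applied to the LSTM baseline's variables.
def min_max(v):
    return (v - v.min()) / (v.max() - v.min())

print(t_incremental[0], t_incremental[-1], min_max(years)[-1])  # 0 69 1.0
```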

Since hyperparameter tuning is required, the data are divided into three parts: training set $\mathcal{T}_1 = \{t \in \mathbb{N}\,:\, 1950 \leqslant t \leqslant 1999\}$ , validation set $\mathcal{T}_2 = \{t \in \mathbb{N}\,:\, 2000 \leqslant t \leqslant 2004\}$ and test set $\mathcal{T}_3 = \{t \in \mathbb{N}\,:\,$ $2005 \leqslant t \leqslant 2019\}$ , where $\mathcal{T}_1 \cup \mathcal{T}_2 \cup \mathcal{T}_3 = \mathcal{T}$ . The usage of these subsets depends on whether a model requires hyperparameter tuning:

  • For models that do not require hyperparameter tuning (LC, APC, ARIMAKAN, 2D P-spline, DeepKAN), we train them on $\mathcal{T}_1 \cup \mathcal{T}_2$ and test them on $\mathcal{T}_3$ .

  • For those that do need fine-tuning (KAN[2,1], KANN[2,1], KANNLC, KANNAPC, LSTM, GAM), we train them on $\mathcal{T}_1$ with different values of the hyperparameters, monitoring the MSE on the validation set $\mathcal{T}_2$ . After selection, we refit the final model on $\mathcal{T}_1\cup\mathcal{T}_2$ using the chosen hyperparameters and evaluate once on $\mathcal{T}_3$ . For models that require pretraining, the input ranges of the year and cohort effects during pretraining are kept identical to those used in formal training.

The tuning details for KAN-based models are summarized in Table 1 (for the tuning of other baselines, see Subsection 2.5). Here, $\Lambda = \{0,10^{-4}, 5\times10^{-4}, 10^{-3}, 5\times10^{-3}, 10^{-2}, 5\times10^{-2}, 10^{-1}\}$ and $\Gamma = \{5, 10, 25, 50, 100\}$ . The abbreviation lr stands for the learning rate. During pretraining (i.e., when learning the components of traditional models), a larger $\text{lr}_{\text{pre}}$ is adopted to ensure that the learned components closely resemble those from the traditional models. In the subsequent formal training, a smaller learning rate is used to avoid large distortion of the learned components.

Table 1. Hyperparameter configurations for KAN-based models. The “Candidate Hyperparameters” column lists the hyperparameters to be tuned; if an entry is NA, no tuning is performed. The “Fixed Hyperparameters” column lists those kept constant throughout training.

4.2. What can mortality say: Model interpretation

The interpretability of KAN and KANN can be directly obtained from the trainable activation functions. In this subsection, we examine how different models can be interpreted, what outputs they produce, and how these outputs compare with those of traditional models.

4.2.1. KAN[2,1] and KANN[2,1]

KAN[2,1] and KANN[2,1] are two simple models whose interpretations consist of the base mortality $\phi_{\text{age}}(x)$ and the overall shift of the mortality curve $\phi_{\text{year}}(t)$ . Figure 6 presents a set of comparisons of model interpretations and outputs,Footnote 6 taking the Netherlands female population as an example. The phenomenon illustrated in this figure is highly representative across major countries and regions. Interested readers are referred to Figshare, where the results for all countries and both sexes are presented in full. We provide the links in the Data Availability Statement.

Figure 6. Interpretation and outputs of KAN[2,1] and KANN[2,1], fitted on female population of Netherlands. Panel (a) shows the estimated age effects $\phi_{\text{age}}(x)$ from KAN[2,1] and KANN[2,1], compared with the GAM-based $f_{\text{age}}(x)$ . Panel (b) presents the estimated time effects $\phi_{\text{year}}(t)$ from KAN[2,1] and KANN[2,1], along with the GAM-based $f_{\text{year}}(t)$ . Panel (c) illustrates the deep-part outputs of KANN[2,1] across ages and years. Panel (d) compares the predicted mortality curves for 2019 (the last year of the test set) from five models. For KAN[2,1], KANN[2,1], and LSTM, the results are based on a single run.

As shown in Figure 6(a), the estimated age effects from the three models are nearly identical, with KANN[2,1] appearing slightly smoother. This indicates that, during formal training, KANN[2,1] made only minor adjustments to the GAM-based estimates of the age effects. In contrast, Figure 6(b) reveals substantial differences in the time effects: the GAM suggests highly conservative mortality improvements, whereas both KAN[2,1] and KANN[2,1] capture a persistent decline (resembling the speed of mortality improvement at ages 20–50). Similar patterns are observed across all countries and regions, where GAM-based improvements are consistently lower.

The deep part of KANN provides refinements to the shallow part. As shown in Figure 6(c), KANN[2,1] typically reduces estimated mortality at younger ages and slightly increases mortality at older ages, thereby capturing the rotation of mortality curves – a phenomenon widely observed across countries. These adjustments substantially enhance KANN[2,1]’s predictive accuracy for young-age mortality compared to KAN[2,1] and GAM, as seen in Figure 6(d). However, while KANN[2,1] raises old-age mortality estimates, the adjustments are often insufficient, leading to suboptimal predictions at high ages.

Figure 6(d) further shows that both KAN[2,1] and KANN[2,1] produce remarkably smooth mortality curves, in sharp contrast to the LSTM results. These curves provide initial validation of the grid-size selection and smoothness regularization discussed in Subsection 3.1. Moreover, although the 2D P-spline method also outputs relatively smooth and accurate forecasts, it often fails to capture the underlying structure of mortality curves, resulting in implausible bends. By comparison, KANN[2,1] better accommodates temporal mortality changes, yet it inherits GAM’s drawback at older ages, where the age-specific slope lacks the “acceleration–deceleration” pattern. Thus, KANN[2,1] delivers highly smooth curves, with strengths and limitations both evident.
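Although Subsection 3.1 specifies the regularization actually used, the general mechanism behind such smoothness control can be illustrated with one common choice, a squared-second-difference penalty on the spline coefficients. The fragment below is a toy sketch; the coefficient vectors and the penalty form are illustrative, not the paper's exact regularizer:

```python
import numpy as np

def smoothness_penalty(coefs, lam=1e-4):
    """Penalize squared second differences of spline coefficients.
    One common way to encourage smooth activations; the regularizer
    actually used in Subsection 3.1 may differ in detail."""
    d2 = np.diff(coefs, n=2)          # discrete second differences
    return lam * float(np.sum(d2 ** 2))

rough  = np.array([0.0, 1.0, -1.0, 1.0, -1.0, 0.0])  # oscillating coefficients
smooth = np.linspace(0.0, 1.0, 6)                     # linear coefficients

# An oscillating coefficient vector is penalized heavily; a linear one is not
print(smoothness_penalty(rough), smoothness_penalty(smooth))
```

Minimizing the fitting loss plus such a penalty (weighted by $\lambda$) pushes neighboring spline coefficients toward a locally linear pattern, which is what produces the smooth curves seen in Figure 6(d).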

Beyond the above analysis, a deeper question naturally arises: as seen in (2.13) and (2.14), KAN[2,1] also possesses an additive structure similar to that of GAM. Why, then, can their estimated mortality improvements differ so substantially? To address this issue, it is helpful to decompose the trainable activation functions of the shallow KAN into two components. Recall Equation (2.11), where the age and time effects of KAN[2,1] can be expressed as

(4.1) \begin{align} \phi_{\text{age}}(x) &= w^b_{\text{age}} \frac{x}{1 + e^{-x}} + w^s_{\text{age}}\, s_{\text{age}}(x), \end{align}

(4.2) \begin{align} \phi_{\text{year}}(t) &= w^b_{\text{year}} \frac{t}{1 + e^{-t}} + w^s_{\text{year}}\, s_{\text{year}}(t).\end{align}

In Figure 7, we illustrate the above decomposition using KAN[2,1] as an example. Along the age dimension, to ensure the age structure of mortality is preserved, the splines primarily fit the mortality rates at younger ages, with very smooth fits at higher ages. SiLU (i.e., $x / (1 + e^{-x})$ ) is mainly used to compensate for the discrepancies between splines and the actual mortality rates at older ages. Along the time dimension, SiLU provides a straightforward linear estimate of mortality improvement, while the splines add further details to capture the acceleration or deceleration of mortality improvement.
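This decomposition can be reproduced numerically. The fragment below is a minimal sketch of Equations (4.1)–(4.2): the grid size, knot layout, and all coefficient values are illustrative placeholders rather than fitted parameters:

```python
import numpy as np
from scipy.interpolate import BSpline

def silu(z):
    """SiLU basis function used in KAN: z / (1 + exp(-z))."""
    return z / (1.0 + np.exp(-z))

def make_phi(w_b, w_s, knots, coefs, degree=3):
    """Trainable activation phi(z) = w_b * SiLU(z) + w_s * s(z),
    where s is a B-spline with the given knots and coefficients."""
    s = BSpline(knots, coefs, degree)
    return lambda z: w_b * silu(z) + w_s * s(z)

# Illustrative setup: 15 interior intervals on [-1, 1], cubic splines
G, k = 15, 3
knots = np.concatenate([np.full(k, -1.0), np.linspace(-1, 1, G + 1), np.full(k, 1.0)])
coefs = 0.1 * np.sin(np.linspace(0, 3, G + k))   # placeholder spline coefficients

# SiLU carries the dominant (roughly linear) trend; the spline adds local detail
phi_year = make_phi(w_b=-0.5, w_s=1.0, knots=knots, coefs=coefs)
print(phi_year(np.linspace(-1, 1, 5)))
```

Plotting the two summands of `phi_year` separately reproduces the kind of decomposition shown in Figure 7.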

Figure 7. Decomposition of KAN[2,1]’s trainable activation functions, fitted on Netherlands female mortality; $\lambda = 10^{-4}$ .

The above phenomenon provides a clear explanation for the differences between KAN[2,1] and GAM in estimating the year effect. In GAM, the spline component accounts for 100% of the fitted year effect, whereas in KAN[2,1], the spline only plays an auxiliary role, with the SiLU function capturing the dominant temporal trend. These characteristics allow KAN[2,1] to produce a consistently declining year effect, while the forecasts of GAM are largely influenced by the trends in the final few years of the training sample.

An additional advantage of KAN[2,1] lies in the auxiliary role of the spline component: a relatively larger number of splines can be employed without raising concerns about poor extrapolation. In GAM, the optimal number of splines $k_{\text{year}}$ selected on the validation set typically lies between 5 and 8, whereas KAN[2,1] can accommodate a substantially larger number, such as 15 in Table 1. This larger number of splines even enables KAN[2,1] to capture major mortality-related events. For example, the year effect fitted by KAN[2,1] remains nearly flat during the 1960s and 1970s in many countries. This stagnation in mortality improvement reflects the combined impact of industrial pollution (Woolf and Schoomaker, 2019), chronic diseases among older populations (Giles and Wilkie, 1971), economic inequality (Thakrar et al., 2018), and rising mortality from traffic accidents (Evans, 2014).

In summary, both KAN[2,1] and KANN[2,1] demonstrate clear interpretability and strong smoothing capabilities, offering a balance between structure and flexibility. KAN[2,1] provides declining year effects and accommodates larger spline bases without risking implausible extrapolations, while KANN[2,1] further refines predictions by capturing rotations in mortality curves.

4.2.2. KANNLC and KANNAPC

The structures of KANNLC and KANNAPC are specifically designed to align with traditional mortality models and “learn” from their components. After pretraining and formal training, the model outputs are obtained by adding the shallow and deep parts. Compared with KANN[2,1], the interaction between the shallow and deep parts in KANNLC and KANNAPC is more intricate. We summarize the potential changes arising from this interaction into four categories:

  • Effect smoothing: Due to the regularization parameter $\lambda$ , the shallow part tends to produce smoother effects. Local fluctuations are suppressed and are no longer picked up by the deep part;

  • Effect modification: A structural change that occurs directly in the shallow component, while the deep part neither absorbs nor compensates for it;

  • Effect transfer: Certain functional structures originally carried by the shallow part (such as local trends or slope changes) may be taken over by the deep part. The deep part may reproduce them better, worse, or in some cases leave the overall performance unchanged (see Footnote 7);

  • Effect extension: Patterns or structures that the shallow part cannot represent may be complemented and realized in the deep part, thereby enhancing the model’s capacity.

It should be emphasized that these four types of effects often intertwine during training, and it is difficult to separate them in a strict quantitative manner. Nevertheless, the resulting figures still make it possible to judge qualitatively whether such effects are present. For instance, when the deep outputs are generally large and certain shallow components undergo substantial parallel shifts, one may interpret this as clear evidence of effect transfer; whereas if the deep part accounts for most of the changes in particular age ranges, this provides strong reason to regard it as effect extension.
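Before turning to the figures, it may help to recall the compositional form of these models. The fragment below is a schematic sketch, not the trained networks: the effect functions are placeholders standing in for the pretrained spline activations, and the deep part is set to zero to expose the pure shallow structure:

```python
import numpy as np

def shallow_lc(phi_age1, phi_age2, phi_year, x, t):
    """Lee-Carter-structured shallow part: an a_x + b_x * k_t analogue."""
    return phi_age1(x) + phi_age2(x) * phi_year(t)

def shallow_apc(phi_age, phi_year, phi_cohort, x, t):
    """APC-structured shallow part: age + period + cohort effects."""
    return phi_age(x) + phi_year(t) + phi_cohort(t - x)

def kann_forward(shallow, deep, x, t):
    """KANN prediction on the log-mortality scale:
    shallow structural part plus the deep part's refinement."""
    return shallow(x, t) + deep(x, t)

# Toy placeholder effects (in KANN these are trainable spline activations)
phi_age1  = lambda x: -4.0 + 0.05 * x
phi_age2  = lambda x: 0.02 * np.ones_like(x, dtype=float)
phi_year  = lambda t: -0.8 * (t - 1950) / 70.0
deep_zero = lambda x, t: np.zeros_like(x, dtype=float)  # deep part before refinement

x = np.arange(0, 100)
t = np.full_like(x, 2019)
log_m = kann_forward(
    lambda x, t: shallow_lc(phi_age1, phi_age2, phi_year, x, t), deep_zero, x, t
)
```

The four interaction categories above describe how, during training, structure migrates between the `shallow` and `deep` summands of `kann_forward` while their sum continues to fit the data.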

Figure 8 illustrates the components of KANNLC and KANNAPC, together with the resulting mortality forecasts in 2019, using the French female population as an example. For KANNLC, several common phenomena can be observed:

  • The age effect $\phi_{\text{age}_1}(x)$ of KANNLC exhibits a shape similar to the $a_x$ of LC. As shown in Figure 8(a), once information about the shape of the mortality curve is captured during pretraining, KANNLC rarely makes drastic changes to $\phi_{\text{age}_1}(x)$ , with the most typical modification being a slight rotation. Across all countries and regions, the “hook”-shaped structure of this shallow effect is consistently preserved.

  • The year effect $\phi_{\text{year}}(t)$ of KANNLC is smoother than the $k_t$ of LC, and differences arise in extrapolation, as displayed in Figure 8(a). Due to the spline-based formulation, $\phi_{\text{year}}(t)$ exhibits nonlinearity when extrapolated, in contrast to the random walk or simple ARIMA extrapolation used in LC. It is unlikely that the model intentionally produces such discrepancies, since within the entire period $\mathcal{T}_1 \cup \mathcal{T}_2$ , $\phi_{\text{year}}(t)$ typically fits $k_t$ well. Moreover, because $\phi_{\text{year}}(t)$ is multiplied by a very small $\phi_{\text{age}_2}(x)$ term, the extrapolation differences either have negligible impact on the forecasts or are offset by the outputs of the deep part.

  • The sensitivity term $\phi_{\text{age}_2}(x)$ is the shallow component most substantially modified during training. In our example, KANNLC smooths and rotates the original $b_x$ of LC, providing higher estimates below age 50, lower estimates above age 50, and eliminating local fluctuations. The impact of these changes is immediately visible in Figure 8(d): compared with the original LC model, KANNLC produces lower forecasts for young-age mortality and higher forecasts for middle and older ages, thereby achieving a rotation of the mortality curve (see Footnote 8).

  • The deep part of KANNLC typically captures complex effects involving both age and year. In Figure 8(c), two prominent oblique bands can be observed. The first one is a diagonal structure running from (1950, 0) to (2019, 99). It corresponds to a clear cohort structure, where the deep part applies virtually no adjustment. In this case, the shallow output alone provides KANNLC’s prediction of mortality for the 1950 cohort. The second structure also originates from (1950, 0) but extends obliquely while concentrating around middle ages (approximately 40–55) in $\mathcal{T}_3$ , with its magnitude increasing over time and reaching about 0.5 by 2019. This structure is not a conventional cohort effect, yet it plays a crucial role in adjusting mortality estimates. In the shallow part of KANNLC, $\phi_{\text{age}_1}(x)$ and $\phi_{\text{age}_2}(x)$ undergo little modification between ages 40 and 55, so the task of adjusting mortality in this age range is delegated to the deep part. As shown in Figure 8(d), KANNLC forecasts noticeably higher mortality than LC for ages 40–60 in 2019, aligning more closely with observed rates, and this improvement is driven by the above slanted band. Other adjustments, such as changes to infant mortality, the shape of the curve between ages 0–20, or the “acceleration–deceleration” pattern at old ages, are also commonly seen, though they are not present in this example. Beyond this example, the contribution of the deep part of KANN differs across populations, reflecting population-specific adjustments in mortality patterns.
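The extrapolation contrast noted in the second bullet above (the spline-based $\phi_{\text{year}}(t)$ versus the random walk or simple ARIMA used by LC) can be made concrete with a minimal sketch of the classical random-walk-with-drift forecast; the fitted period index $k_t$ here is synthetic:

```python
import numpy as np

def rw_drift_forecast(k, horizon):
    """Random-walk-with-drift extrapolation of a fitted period index k_t,
    the standard forecasting device of the classical LC model: the drift is
    the average historical step, so the forecast is a straight line."""
    drift = (k[-1] - k[0]) / (len(k) - 1)
    return k[-1] + drift * np.arange(1, horizon + 1)

# Synthetic k_t: a declining trend plus noise, standing in for a fitted index
k_t = -0.5 * np.arange(50) + np.random.default_rng(2).normal(0, 0.3, 50)
print(rw_drift_forecast(k_t, 5))
```

A cubic B-spline extrapolated beyond its knot range, by contrast, continues as a cubic polynomial, which is why $\phi_{\text{year}}(t)$ bends rather than continuing linearly outside $\mathcal{T}_1 \cup \mathcal{T}_2$.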

Figure 8. Interpretation and outputs of KANNLC and KANNAPC. Panel (a) and panel (b) show the shallow-part outputs. Panel (c) presents the deep-part heatmap of KANNLC (left) and KANNAPC (right). Panels (d) and (e) provide predictions in 2019 with zoomed-in views on age interval 10–25.

Compared with the complex behaviors of KANNLC, the behaviors of KANNAPC are much more consistent. All adjustments made by KANNAPC aim to improve APC’s performance for ages 10–50. As shown in Figure 8(e), APC tends to produce low mortality estimates for this age range, which deviate substantially from the observed data. KANNAPC typically employs three types of adjustments:

  • A smoother and more conservative year effect. As shown in Figure 8(b), $\phi_{\text{year}_1}(t)$ is far less steep than the $k_t$ of APC. This adjustment helps raise the long-term mortality forecasts.

  • Compared with APC, KANNAPC smooths the cohort effect in the earlier part of the curve, while lifting the later part. As shown in Figure 8(b), the cohort effect of APC declines sharply after 1950, whereas KANNAPC maintains a higher level. This upward adjustment in the later cohorts prevents the unrealistic decline implied by APC and yields more plausible mortality forecasts, particularly for the 10–50 age range.

  • Direct upward adjustments from the deep part. As shown in Figure 8(c), a clear triangular region appears in the lower-left corner of the figure for KANNAPC. This region corresponds to the elevation of mortality estimates for cohorts after 1950, especially for populations aged 10 and above.

Overall, while KANNLC produces more diverse adjustment behaviors and KANNAPC employs more unified corrections, both architectures effectively generate smooth mortality curves, ensuring interpretability comparable to their traditional counterparts. In the next subsection, we further assess these models by comparing their forecasting accuracy and smoothness using quantitative metrics.

4.3. Forecasting future mortality

In this subsection, we comprehensively compare the predictive performance of the KAN-based models against the other benchmarks. The performance of all neural-based models is evaluated across ten independent runs, with the mean RMSE, MAE, and smoothness metrics considered.

Table 2 summarizes the relative performance of different models across the three evaluation metrics. The ranking procedure is constructed as follows: for each of the 34 populations, we record the RMSE, MAE, and smoothness values of all models and assign ranks accordingly. We then calculate the mean and standard deviation of these ranks across all populations. This approach provides an integrated view of the comparative performance of each model. Detailed metric values for all models are provided in Online Supplementary Material A.
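The ranking procedure can be sketched in a few lines. In this illustration, the metric values are randomly generated placeholders; only the rank-aggregation logic mirrors the description above:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
models = ["KANN_LC", "KANN_APC", "LSTM", "LC", "APC"]
populations = [f"pop{i}" for i in range(34)]

# Illustrative RMSE table: rows = populations, columns = models
rmse = pd.DataFrame(rng.uniform(0.05, 0.30, size=(34, 5)),
                    index=populations, columns=models)

# Rank models within each population (rank 1 = best, i.e., lowest RMSE)
ranks = rmse.rank(axis=1, method="min")

# Aggregate: mean rank and its standard deviation across the 34 populations
summary = pd.DataFrame({"mean_rank": ranks.mean(axis=0),
                        "sd_rank": ranks.std(axis=0)})
print(summary.round(2))
```

The same computation is repeated for MAE and the smoothness metric to populate Table 2.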

Table 2. Model rankings and parameter complexity. Mean ranks are reported with standard deviations in parentheses; mean ranks close to 1 indicate stronger overall performance across populations. Parameter counts reflect the architectures ultimately adopted for mortality forecasting on the test set, rather than the smaller or larger values that may have arisen during validation but were not retained.

Table 2 also reports the parameter counts of each model. For ARIMAKAN, KANNLC, KANNAPC, LSTM, and GAM, the optimal architectures vary across populations; therefore, both the minimum and maximum numbers of parameters are reported. In the case of LSTM, the gap between the minimum and maximum parameter counts is particularly large, which stems from differences in the number of recurrent layers and the number of neurons in the hidden layer.

From the overall forecasting results, model performance varies across countries and genders, while the KANN family consistently exhibits a more balanced profile. KANNAPC and KANNLC achieve top rankings in both RMSE and MAE, outperforming traditional approaches and delivering slightly higher overall accuracy than LSTM. Moreover, their forecasts are smoother than those of APC and LC, and markedly smoother than LSTM. KANN[2,1] and ARIMAKAN stand out as extremely smooth models: their smoothness consistently ranks among the top three, while their RMSE and MAE fall in the mid-range, making them suitable choices when the primary objective is to obtain exceptionally smooth mortality curves. The performance of KAN[2,1] surpasses that of GAM; however, owing to its very simple structure, it cannot outperform LC or APC. By contrast, KAN[2,8,8,1] delivers the weakest results in both accuracy and smoothness, and it frequently ranks at or near the bottom across populations. In practice, we recommend KANNLC and KANNAPC as the first choices, given their robust performance across metrics and their ability to generate smooth forecasts that are well suited for practical mortality modeling.

Furthermore, we record the RMSE and smoothness metrics for the 34 populations over ten independent runs and compute the standard deviation of each metric for every population. The resulting standard deviations are summarized as boxplots in Figure 9. The MAE metric is not displayed here, as its variation pattern is similar to that of RMSE. Among all models, ARIMAKAN exhibits highly stable RMSE performance across different initializations, while the KANN family also demonstrates good overall stability. The KAN[2,1] model shows relatively higher RMSE variability for certain populations, mainly because its overly simple temporal component fits poorly in populations with small exposures. Regarding smoothness, almost all models except KAN[2,8,8,1] show very stable behavior. It is worth noting that although many models appear stable in Figure 9, their interpretation visualizations may still differ substantially due to the effect transfer phenomenon discussed in Subsection 4.2.2.
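The stability computation behind Figure 9 can be sketched analogously; again, the metric values here are synthetic placeholders for the recorded RMSEs:

```python
import numpy as np

rng = np.random.default_rng(1)
n_pop, n_runs = 34, 10

# Illustrative RMSE values: one row per population, one column per run
rmse_runs = 0.15 + 0.01 * rng.standard_normal((n_pop, n_runs))

# Stability of each population's RMSE across the ten random initializations
sd_per_pop = rmse_runs.std(axis=1, ddof=1)

# These 34 standard deviations are what one boxplot in Figure 9 summarizes
print(sd_per_pop.min(), np.median(sd_per_pop), sd_per_pop.max())
```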

Figure 9. Model stability across random initializations.

In summary, our investigation of KAN-based mortality modeling reveals several interesting findings:

  • Deeper does not necessarily mean better. In the conventional practice of using MLPs, a network without hidden layers can hardly achieve anything, and one typically increases depth to enhance its capacity. By contrast, within the KAN framework, shallow models should not be underestimated. This is because the trainable activation functions $\phi$ already possess representational capacity. For instance, as shown in Figure 6(d), KAN[2,1] with only about 40 parameters can already capture the rough shape of the mortality curve. If one deepens KAN without careful structural design (analogous to stacking layers in a feedforward network), the outcome may even deteriorate.

Figure 10 illustrates this phenomenon using Italian female mortality as an example. Among all KAN-based models, KAN[2,8,8,1] is the only one that fails to reproduce the basic shape of the mortality curve, performing even worse than the simple ARIMAKAN. The underlying reason is that KAN[2,8,8,1] is severely overparameterized relative to KAN[2,1], even though it does not appear particularly deep. In theory, there may exist parameter configurations that allow KAN[2,8,8,1] to achieve excellent forecasts, but the optimization process is unlikely to find them consistently. Put simply, there are many ways to fit the observed mortality, and a deep KAN may not know which solution is correct. This phenomenon resembles the difficulties faced by 2D P-splines in extrapolation: the model appears to have learned the underlying patterns, yet the learned structure proves unsuitable when extrapolated beyond the observed range. Training this type of KAN successfully may therefore require very elaborate regularization.

Figure 10. Predicted mortality curves from different KAN-based models for Italian female population in 2019.

One alternative method is KANN, where traditional mortality models “say something to” KAN. In this setup, the model not only inherits interpretability but also achieves solid predictive accuracy and smoothness. For any feedforward KAN-based modeling (see Footnote 9), we recommend that users carefully consider whether the relationship between dependent and independent variables involves multilayer nonlinear compositions. If not, the structure of KAN or the shallow part of KANN should avoid unnecessary stacking of layers.

  • Limitations learned during pretraining are not always fully corrected in subsequent formal training. As discussed in Subsection 4.2.2, the APC model tends to substantially underestimate mortality between ages 10 and 50, whereas KANNAPC can largely adjust this bias during formal training. However, for the underestimation of old-age mortality in APC, KANNAPC introduces some adjustments but they remain relatively limited, as illustrated in Figures 8(e) and 10. In comparison, LC already performs better than APC in forecasting old-age mortality, and this strength is naturally inherited by KANNLC. These findings highlight the importance of carefully assessing which features of a traditional mortality model are beneficial and which are problematic, since not all deficiencies can be automatically corrected by KANN. Users are therefore encouraged to select initial structures with greater caution rather than assuming that later training will resolve all shortcomings.

  • Parameter counts are not directly proportional to training time. As seen from Table 2, although KAN and KANN models contain far fewer parameters than LSTM (about two thousand versus hundreds of thousands or even millions), the training of KAN and KANN still takes around 10–20 s on an NVIDIA GeForce RTX 4060 Laptop GPU with 8GB of memory (pretraining requires an additional 2–3 s), which is almost the same as LSTM. This is because each parameter involves spline evaluations and nonlinear computations that are less hardware-optimized than matrix multiplications. Nevertheless, this level of training time remains entirely acceptable. At the same time, the much smaller parameter scale highlights the parameter efficiency of KANN and makes the models considerably easier to deploy in actuarial practice.
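The parameter scales quoted above can be checked with a back-of-envelope count. The sketch below is a simplification (it counts, per edge, the spline coefficients plus the two mixing weights $w^b$ and $w^s$, ignores node biases and implementation-specific extras, and assumes grid size 15 with cubic splines), so it approximates rather than reproduces the exact counts in Table 2:

```python
def kan_params(widths, grid=15, degree=3):
    """Approximate trainable-parameter count of a feedforward KAN.
    Each edge carries one activation with (grid + degree) spline
    coefficients plus the two mixing weights w_b and w_s."""
    per_edge = grid + degree + 2
    edges = sum(n_in * n_out for n_in, n_out in zip(widths[:-1], widths[1:]))
    return edges * per_edge

print(kan_params([2, 1]))        # → 40, in line with "about 40 parameters"
print(kan_params([2, 8, 8, 1]))  # → 1760, two orders of magnitude more
```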

5. Conclusion

This study is motivated by two challenges in mortality forecasting: (i) how to effectively transfer the structural information of traditional mortality models into a deep learning framework and (ii) how to directly control the neural networks to generate smooth mortality forecasts. To address these challenges, we introduce KAN and develop a family of actuarial extensions (KANN), in which classical components are embedded in the shallow part while the deep part provides nonlinear refinements. This design allows prior actuarial knowledge to guide neural network training and ensures that smoothness can be enforced intrinsically through regularization, which echoes our paper’s title: “What KAN Mortality Say”.

Empirical investigations across 34 populations demonstrate that the proposed KANNLC and KANNAPC models strike a favorable balance between interpretability, smoothness, and predictive accuracy. KANN[2,1] and ARIMAKAN are extremely smooth models and could be useful when exceptionally smooth mortality curves are the priority. By contrast, a naively deepened KAN (e.g., KAN[2,8,8,1]) performs poorly in both accuracy and smoothness. These findings demonstrate the effectiveness of initializing neural networks with traditional mortality models and highlight the pivotal role of KAN in enabling smoothness control within neural architectures.

While this paper focuses on single-population mortality, the modular design of KANN naturally accommodates multi-population extensions. Region indicators or population-specific shallow effects can be incorporated to learn both shared structures and cross-country heterogeneity. A full empirical study of such multi-population structures and their actuarial implications is left for future research. Moving beyond, the potential applications of KAN in other actuarial and statistical modeling tasks remain largely unexplored (Wüthrich et al., 2025). We look forward to KAN serving as a versatile tool that bridges traditional actuarial methods and modern machine learning, thereby opening new directions for research and practice.

Supplementary material

The supplementary material for this article can be found at https://doi.org/10.1017/asb.2025.10079.

Acknowledgments

We express our sincere gratitude to the editor and the four referees for their insightful comments and suggestions, which have substantially enhanced the quality of this manuscript. We are also indebted to Fei Huang, Binghui Guo, Jiamei Sun, Haonan Li and Laijuan Luo for their valuable feedback on this work.

Data availability statement

The original data for this study comes from the Human Mortality Database: https://www.mortality.org. Figures for all countries and regions are compiled into an interactive dashboard: https://yuanzhuang.shinyapps.io/What_KAN_Mortality_Say/, which also includes figure download links for interested readers.

Funding statement

This research was supported by grants from the National Natural Science Foundation of China (No. 72371246).

Competing interests

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Declaration of generative AI and AI-assisted technologies in the writing process

During the preparation of this work the authors used ChatGPT in order to improve the readability and language of the manuscript in the writing process. After using this tool, the authors reviewed and edited the content as needed and take full responsibility for the content of the published article.

A. Appendix: Data Details

Table A1. Countries and regions selected from Human Mortality Database.

Footnotes

1 To prevent ambiguity, every time the term “layer” appears, we indicate whether a layer is a KAN layer or a node layer.

2 For these definitions, we refer to Owens et al. (2022).

3 In this case, if we set $\sigma_{n-1} = \exp({\cdot})$ , the model becomes equivalent to a GLM with log link.

4 We search for the optimal hyperparameters over the following combinations: the number of features in the hidden state $\in \{128, 256, 512\}$ , learning rate $\in \{10^{-5}, 10^{-4}, 10^{-3}\}$ , number of recurrent layers $\in \{1, 2, 3\}$ , and dropout rate $\in \{0.1, 0.2, 0.3\}$ .

5 The candidate hyperparameters are $k_{\text{year}} \in \{1,3,5,8,10\},\ k_{\text{age}} \in \{5,10,25,50,100\},\ sp \in \{0,10^{-3},10^{-2},10^{-1}\}$ .

6 Subfigures (a), (b), and (c) are translated for alignment. Specifically, the starting point of the KAN[2,1] and GAM year effect is fixed at (1950, 0), and the translation is applied inversely to the KAN[2,1] and GAM age effect. For KANN[2,1], the year effect is fixed at (1950, 0), while the age effect is shifted so that its value at age 40 matches that of KAN[2,1]; the above shifts are applied inversely to subfigure (c). These transformations do not alter the model output, but facilitate a more direct comparison of the components across models.

7 For example, when the shallow part only undergoes a vertical shift that is fully absorbed by the deep part, the model performance remains the same.

8 It should be noted that after such modifications, $\phi_{\text{age}_2}(x)$ may not strictly satisfy the identification constraints. If needed, the shallow components can be adjusted following the method documented in Lee and Carter (1992). In general, as long as the corresponding traditional model can incorporate identification constraints, so can the shallow components of KANN. Here, to provide an intuitive demonstration of how KANNLC modifies the $b_x$ of LC during training, we present the shallow components without applying such constraints.

9 Here we do not discuss time-series extensions of KAN.

References

Arnold, V.I. (1958) On the representation of functions of several variables as a superposition of functions of a smaller number of variables. Matematicheskoe Prosveshchenie, (3), 241–250.
Basellini, U., Camarda, C.G. and Booth, H. (2023) Thirty years on: A review of the Lee–Carter method for forecasting mortality. International Journal of Forecasting, 39(3), 1033–1049.
Bjerre, D.S. (2022) Tree-based machine learning methods for modeling and forecasting mortality. ASTIN Bulletin, 52(3), 765–787.
Bodner, A.D., Tepsich, A.S., Spolski, J.N. and Pourteau, S. (2024) Convolutional Kolmogorov–Arnold Networks. arXiv: 2406.13155.
Brouhns, N., Denuit, M. and Vermunt, J.K. (2002) A Poisson log-bilinear regression approach to the construction of projected lifetables. Insurance: Mathematics and Economics, 31(3), 373–393.
Cairns, A.J.G., Blake, D. and Dowd, K. (2006) A two-factor model for stochastic mortality with parameter uncertainty: Theory and calibration. Journal of Risk and Insurance, 73(4), 687–718.
Cairns, A.J.G., Blake, D., Dowd, K., Coughlan, G.D., Epstein, D., Ong, A. and Balevich, I. (2009) A quantitative comparison of stochastic mortality models using data from England and Wales and the United States. North American Actuarial Journal, 13(1), 1–35.
Camarda, C.G. (2012) MortalitySmooth: An R package for smoothing Poisson counts with P-splines. Journal of Statistical Software, 50(1), 1–24.
Cohen, E., Riesenfeld, R.F. and Elber, G. (2001) Geometric Modeling with Splines: An Introduction. AK Peters.
Currie, I.D. (2006) Smoothing and Forecasting Mortality Rates with P-splines. London: Institute of Actuaries.
Currie, I.D., Durbán, M. and Eilers, P.H.C. (2004) Smoothing and forecasting mortality rates. Statistical Modelling, 4, 279–298.
Cybenko, G. (1989) Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals and Systems, 2(4), 303–314.
Deprez, P., Shevchenko, P.V. and Wüthrich, M.V. (2017) Machine learning techniques for mortality modeling. European Actuarial Journal, 7(2), 337–352.
Embrechts, P. and Wüthrich, M.V. (2022) Recent challenges in actuarial science. Annual Review of Statistics and Its Application, 9(1), 119–140.
Evans, L. (2014) Traffic fatality reductions: United States compared with 25 other countries. American Journal of Public Health, 104(8), 1501–1507.
Giles, P. and Wilkie, A.D. (1971) Recent mortality trends: Some international comparisons. Transactions of the Faculty of Actuaries, 33, 375–514.
Hainaut, D. (2018) A neural-network analyzer for mortality forecast. ASTIN Bulletin, 48(2), 481–508.
Hall, M. and Friel, N. (2011) Mortality projections using generalized additive models with applications to annuity values for the Irish population. Annals of Actuarial Science, 5(1), 19–32.
Hornik, K., Stinchcombe, M. and White, H. (1989) Multilayer feedforward networks are universal approximators. Neural Networks, 2(5), 359–366.
Kiamari, M., Kiamari, M. and Krishnamachari, B. (2024) GKAN: Graph Kolmogorov–Arnold Networks. arXiv: 2406.06470.
Kolmogorov, A.N. (1957) On the representation of continuous functions of many variables by superposition of continuous functions of one variable and addition. Doklady Akademii Nauk, 114, 953–956.
Köppen, M. (2002) On the training of a Kolmogorov network. Artificial Neural Networks – ICANN 2002, pp. 474–479. Springer Berlin Heidelberg.
Lee, R.D. and Carter, L.R. (1992) Modeling and forecasting U.S. mortality. Journal of the American Statistical Association, 87(419), 659–671.
Levantesi, S. and Pizzorusso, V. (2019) Application of machine learning to mortality modeling and forecasting. Risks, 7(1), 26.
Lin, J.-N. and Unbehauen, R. (1993) On the realization of a Kolmogorov network. Neural Computation, 5(1), 18–20.
Lindholm, M. and Palmborg, L. (2022) Efficient use of data for LSTM mortality forecasting. European Actuarial Journal, 12(2), 749–778.
Lipovetsky, S. and Conklin, M. (2001) Analysis of regression in game theory approach. Applied Stochastic Models in Business and Industry, 17(4), 319–330.
Liu, Z., Ma, P., Wang, Y., Matusik, W. and Tegmark, M. (2024a) KAN 2.0: Kolmogorov–Arnold Networks Meet Science. arXiv: 2408.10205.
Liu, Z., Wang, Y., Vaidya, S., Ruehle, F., Halverson, J., Soljacic, M., Hou, T.Y. and Tegmark, M. (2024b) KAN: Kolmogorov–Arnold Networks. arXiv: 2404.19756.
Macdonald, A.S., Richards, S.J. and Currie, I.D. (2018) Modelling Mortality with Actuarial Applications, 1st ed. Cambridge University Press.
Marino, M., Levantesi, S. and Nigri, A. (2023) A neural approach to improve the Lee–Carter mortality density forecasts. North American Actuarial Journal, 27(1), 148–165.
Miyata, A. and Matsuyama, N. (2022) Extending the Lee–Carter model with variational autoencoder: A fusion of neural network and Bayesian approach. ASTIN Bulletin, 52(3), 789–812.
Nigri, A., Levantesi, S., Marino, M., Scognamiglio, S. and Perla, F. (2019) A deep learning integrated Lee–Carter model. Risks, 7(1), 33.
Odhiambo, J., Weke, P. and Ngare, P. (2021) A deep learning integrated Cairns–Blake–Dowd (CBD) systematic mortality risk model. Journal of Risk and Financial Management, 14(6), 259.
Owens, E., Sheehan, B., Mullins, M., Cunneen, M., Ressel, J. and Castignani, G. (2022) Explainable Artificial Intelligence (XAI) in insurance. Risks, 10(12), 230.
Perla, F., Richman, R., Scognamiglio, S. and Wüthrich, M.V. (2021) Time-series forecasting of mortality rates using deep learning. Scandinavian Actuarial Journal, 2021(7), 572–598.
Perla, F., Richman, R., Scognamiglio, S. and Wüthrich, M.V. (2024) Accurate and explainable mortality forecasting with the LocalGLMnet. Scandinavian Actuarial Journal, 1–23.
Pitacco, E., Denuit, M., Haberman, S. and Olivieri, A. (2009) Modelling Longevity Dynamics for Pensions and Annuity Business. Oxford University Press.
Plat, R. (2009) On stochastic mortality modeling. Insurance: Mathematics and Economics, 45(3), 393–404.
Qiao, Y., Wang, C.-W. and Zhu, W. (2024) Machine learning in long-term mortality forecasting. The Geneva Papers on Risk and Insurance – Issues and Practice, 49(2), 340–362.
Renshaw, A. and Haberman, S. (2006) A cohort-based extension to the Lee–Carter model for mortality reduction factors. Insurance: Mathematics and Economics, 38(3), 556–570.
Richards, S.J., Kirkby, J.G. and Currie, I.D. (2006) The importance of year of birth in two-dimensional mortality data. British Actuarial Journal, 12(1), 5–38.
Richman, R. (2021a) AI in actuarial science – a review of recent advances – Part 1. Annals of Actuarial Science, 15(2), 207229.10.1017/S1748499520000238CrossRefGoogle Scholar
Richman, R. (2021b) AI in actuarial science – a review of recent advances – Part 2. Annals of Actuarial Science, 15(2), 230258.10.1017/S174849952000024XCrossRefGoogle Scholar
Richman, R. (2022) Mind the gap – safely incorporating deep learning models into the actuarial toolkit. British Actuarial Journal, 27, e21.10.1017/S1357321722000162CrossRefGoogle Scholar
Richman, R. and Wüthrich, M.V. (2019) Lee and Carter go machine learning: Recurrent neural networks. https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3441030 10.2139/ssrn.3441030CrossRefGoogle Scholar
Richman, R. and Wüthrich, M.V. (2021) A neural network extension of the Lee-Carter model to multiple populations. Annals of Actuarial Science, 15(2), 346366.10.1017/S1748499519000071CrossRefGoogle Scholar
Richman, R. and Wüthrich, M.V. (2023) LocalGLMnet: Interpretable deep learning for tabular data. Scandinavian Actuarial Journal, 2023(1), 7195.10.1080/03461238.2022.2081816CrossRefGoogle Scholar
Richman, R. and Wüthrich, M.V. (2024) Smoothness and monotonicity constraints for neural networks using ICEnet. Annals of Actuarial Science, 18(3), 712739.10.1017/S174849952400006XCrossRefGoogle Scholar
Schnürch, S. and Korn, R. (2022) Point and interval forecasts of death rates using neural networks. ASTIN Bulletin, 52(1), 333360.10.1017/asb.2021.34CrossRefGoogle Scholar
Scognamiglio, S. (2022) Calibrating the Lee-Carter and the Poisson Lee-Carter models via neural networks. ASTIN Bulletin, 52(2), 519561.10.1017/asb.2022.5CrossRefGoogle Scholar
Sprecher, D.A. and Draghici, S. (2002) Space-filling curves and Kolmogorov superposition-based neural networks. Neural Networks, 15(1), 5767.10.1016/S0893-6080(01)00107-1CrossRefGoogle ScholarPubMed
Thakrar, A.P., Forrest, A.D., Maltenfort, M.G. and Forrest, C.B. (2018) Child mortality in the US and 19 OECD comparator nations: A 50-year time-trend analysis. Health Affairs, 37(1), 140149.10.1377/hlthaff.2017.0767CrossRefGoogle Scholar
Vaca-Rubio, C.J., Blanco, L., Pereira, R. and Caus, M. (2024) Kolmogorov-Arnold Networks (KANs) for Time Series Analysis. arXiv: 2405.08790.10.1109/GCWkshp64532.2024.11100692CrossRefGoogle Scholar
Villegas, A.M., Kaishev, V.K. and Millossovich, P. (2018) StMoMo : An R package for stochastic mortality modeling. Journal of Statistical Software, 84(3), 138.10.18637/jss.v084.i03CrossRefGoogle Scholar
Vincelli, M. (2019) A Machine Learning Approach to Incorporating Industry Mortality Table Features into a Company’s Insured Mortality Analysis. https://www.soa.org/resources/research-reports/2019/2019-machine-learning-approach/ Google Scholar
Wang, C.-W., Zhang, J. and Zhu, W. (2021) Neighbouring prediction for mortality. ASTIN Bulletin, 51(3), 689718.10.1017/asb.2021.13CrossRefGoogle Scholar
Wang, J., Wen, L., Xiao, L. and Wang, C. (2023) Time-series forecasting of mortality rates using transformer. Scandinavian Actuarial Journal, 2024(2), 109123.10.1080/03461238.2023.2218859CrossRefGoogle Scholar
Wood, S.N. (2017) Generalized Additive Models: An Introduction with R, 2nd ed. CRC Press, Taylor & Francis Group.10.1201/9781315370279CrossRefGoogle Scholar
Woolf, S.H. and Schoomaker, H. (2019) Life expectancy and mortality rates in the United States, 1959–2017. JAMA, 322(20), 19962016.10.1001/jama.2019.16932CrossRefGoogle ScholarPubMed
World Population Prospects 2022: Methodology Report (UN DESA/POP/2022/DC/NO. 6). (2022) United Nations Department of Economic and Social Affairs, Population Division.Google Scholar
World Population Prospects 2022: Summary of Results (UN DESA/POP/2022/TR/NO. 3). (2022) United Nations Department of Economic and Social Affairs, Population Division.Google Scholar
Wüthrich, M.V. and Merz, M. (2019) Editorial: Yes, We CANN! ASTIN Bulletin, 49(1), 13.10.1017/asb.2018.42CrossRefGoogle Scholar
Wüthrich, M.V., Richman, R., Avanzi, B., Lindholm, M., Maggi, M., Mayer, M., Schelldorfer, J. and Scognamiglio, S. (2025) AI Tools for Actuaries. https://ssrn.com/abstract=5162304 10.2139/ssrn.5162304CrossRefGoogle Scholar
Xin, X., Hooker, G. and Huang, F. (2025) Pitfalls in machine learning interpretability: Manipulating partial dependence plots to hide discrimination. Insurance Mathematics and Economics, 125, 103135.10.1016/j.insmatheco.2025.103135CrossRefGoogle Scholar
Zheng, H., Wang, H., Zhu, R. and Xue, J.-H. (2025) A brief review of deep learning methods in mortality forecasting. Annals of Actuarial Science, 116.10.1017/S1748499525100110CrossRefGoogle Scholar
Figure 1. Sample KAN with three KAN layers (four node layers).
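The layered structure shown in this figure can be sketched in a few lines: each edge of a KAN layer carries its own learnable univariate function, and each node simply sums its incoming edge outputs. The sketch below is illustrative only; it uses a generic function basis rather than the B-spline parameterization of Liu et al. (2024b), and all names and shapes are ours.

```python
import numpy as np

def kan_layer(x, coeffs, basis):
    """One KAN layer: edge (i, j) applies its own univariate function
    phi_ij to input x[i], and output node j sums over its incoming edges.
    Here each phi_ij is a linear combination of shared basis functions.

    x:      (n_in,) input vector
    coeffs: (n_in, n_out, n_basis) per-edge basis weights
    basis:  list of n_basis univariate functions
    """
    B = np.stack([b(x) for b in basis], axis=-1)  # (n_in, n_basis)
    return np.einsum('ib,iob->o', B, coeffs)      # (n_out,)

# Stacking layers yields a deeper KAN, e.g. a KAN[2, 8, 1] sketch:
rng = np.random.default_rng(0)
basis = [np.sin, np.cos, np.tanh]
W1 = rng.normal(size=(2, 8, 3))
W2 = rng.normal(size=(8, 1, 3))
out = kan_layer(kan_layer(np.array([0.3, 0.7]), W1, basis), W2, basis)
```

In the figure, each arrow between node layers corresponds to one learnable `phi_ij`, which is what makes the trained functions directly plottable and interpretable.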

Figure 2. KANN[2,1] with a deep part of KAN[2,8,8,1]. The dashed arrows indicate the components of KANN. $\phi_{\text{age}}(x)$ and $\phi_{\text{year}}(t)$ are connected to the nodes in the final layer, forming an additive relationship with the deep part.

Figure 3. KANNLC architecture. $\phi_{\text{age}_2}(x)$ and $\phi_{\text{year}}(t)$ are multiplied together and connected to the final layer.

Figure 4. KANNAPC architecture. Age x is transformed into $-x$ by $\phi_{\text{age}_2}(x)$, while year t is left unchanged by $\phi_{\text{year}_2}(t)$. The two outputs are then added to obtain the birth year $\gamma$. The birth year (cohort) is transformed by $\phi_{\text{cohort}}(\gamma)$ and connected to the nodes in the final layer.
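The cohort construction described in this caption amounts to a sign flip and an addition; a minimal numerical illustration follows. The function name and scalar form are ours — in the model, the two $\phi$ branches are learned activations that approximate these maps.

```python
def cohort_index(x, t):
    """Birth-year (cohort) index as described for KANNAPC:
    the age branch outputs -x, the year branch passes t through,
    and their sum gives the birth year gamma = t - x."""
    phi_age2 = -x    # age branch: x -> -x
    phi_year2 = t    # year branch: t -> t
    return phi_age2 + phi_year2
```

For example, age 65 observed in 2019 maps to the 1954 birth cohort.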

Figure 5. Impact of $\lambda$ on the mortality fitted by KAN[2,1] for the 1975 US female population.

Table 1. Hyperparameter configurations for KAN-based models. The “Candidate Hyperparameters” column lists the hyperparameters to be tuned; if an entry is NA, no tuning is performed. The “Fixed Hyperparameters” column lists those kept constant throughout training.

Figure 6. Interpretation and outputs of KAN[2,1] and KANN[2,1], fitted on the female population of the Netherlands. Panel (a) shows the estimated age effects $\phi_{\text{age}}(x)$ from KAN[2,1] and KANN[2,1], compared with the GAM-based $f_{\text{age}}(x)$. Panel (b) presents the estimated time effects $\phi_{\text{year}}(t)$ from KAN[2,1] and KANN[2,1], along with the GAM-based $f_{\text{year}}(t)$. Panel (c) illustrates the deep-part outputs of KANN[2,1] across ages and years. Panel (d) compares the predicted mortality curves for 2019 (the last year of the test set) from five models. For KAN[2,1], KANN[2,1], and LSTM, the results are based on a single run.

Figure 7. Decomposition of KAN[2,1]’s trainable activation functions, fitted on Netherlands female mortality; $\lambda = 10^{-4}$.

Figure 8. Interpretation and outputs of KANNLC and KANNAPC. Panels (a) and (b) show the shallow-part outputs. Panel (c) presents the deep-part heatmaps of KANNLC (left) and KANNAPC (right). Panels (d) and (e) provide predictions for 2019, with zoomed-in views of the age interval 10–25.

Table 2. Model rankings and parameter complexity. Mean ranks are reported with standard deviations in parentheses; mean ranks close to 1 indicate stronger overall performance across populations. Parameter counts reflect the architectures ultimately adopted for mortality forecasting on the test set, rather than smaller or larger configurations that may have arisen during validation but were not retained.

Figure 9. Model stability across random initializations.

Figure 10. Predicted mortality curves from different KAN-based models for the Italian female population in 2019.

Table A1. Countries and regions selected from Human Mortality Database.

Supplementary material

Zhang and Zhuang supplementary material 1 (File, 2.1 MB)
Zhang and Zhuang supplementary material 2 (File, 679 KB)