1 Introduction
Over the past decade, advances in technology have accelerated innovation in educational assessment, leading to the development of a growing number of computer-based problem-solving assessments that evaluate test takers’ ability to solve complex problems in realistic environments. Examples of computer-based problem-solving assessments include the Program for the International Assessment of Adult Competencies (PIAAC), the Programme for International Student Assessment (PISA), and the National Assessment of Educational Progress (NAEP). A key feature of these assessments, compared with traditional paper-based assessments, is that user interactions with the testing system, such as clicking buttons, dragging, dropping, and entering text during assessments, are recorded in log files. This sequence of recorded user interactions, so-called log data or process data, is a valuable resource for various purposes, e.g., to explore and validate test takers’ item-solving processes and strategies, identify key behaviors that determine performance, and formulate real-time feedback to test takers (Han et al., Reference Han, He and von Davier2019; He et al., Reference He, Borgonovi and Paccagnella2021; Jiao et al., Reference Jiao, He and Veldkamp2021; Liu et al., Reference Liu, Liu and Li2018; Xiao et al., Reference Xiao, He, Veldkamp and Liu2021).
Analyzing log data using traditional statistical models, such as generalized linear models and item response theory models, is challenging due to the non-standard format, varying sequence lengths between participants, and high computational requirements, among others (Tang et al., Reference Tang, Wang, He, Liu and Ying2020, Reference Tang, Zhang, Wang, Liu and Ying2021; Xiao & Liu, Reference Xiao and Liu2023; Zhan & Qiao, Reference Zhan and Qiao2022). Researchers have proposed various methodologies to address these challenges in log data analysis, which could be classified into two types: 1) extraction of behavioral characteristics and 2) psychometric modeling of log data (Fu et al., Reference Fu, Zhan, Chen and Jiao2023; Han & Wilson, Reference Han and Wilson2022; Xiao & Liu, Reference Xiao and Liu2023). We briefly review these two types of methods below.
First, methods for extracting behavioral characteristics from log data fall into theory-based or data-driven approaches (Fu et al., Reference Fu, Zhan, Chen and Jiao2023; Han & Wilson, Reference Han and Wilson2022; Yuan et al., Reference Yuan, Xiao and Liu2019). Theory-based methods typically use expert-defined behavioral indicators, and thus different feature extraction rules are used for different problem-solving tests. For example, Greiff et al. (Reference Greiff, Wüstenberg and Avvisati2015, Reference Greiff, Niepel, Scherer and Martin2016) defined the optimal exploration strategy (e.g., vary-one-thing-at-a-time; VOTAT), time on task, and intervention frequency to examine their relationships with problem-solving test performance. Data-driven approaches, on the other hand, employ data mining, machine learning, and other statistical methods to extract features from log data. For example, Tang et al. (Reference Tang, Wang, Liu and Ying2020) used multidimensional scaling (MDS) to standardize varying lengths of log sequences. Tang et al. (Reference Tang, Wang, He, Liu and Ying2020) employed a sequence-to-sequence autoencoder that encodes log sequences as numeric vectors. Zhu et al. (Reference Zhu, Shu and von Davier2016) and Vista et al. (Reference Vista, Care and Awwal2017) used network analysis to visualize log sequences and extract meaningful information from log data. Qiao & Jiao (Reference Qiao and Jiao2018) applied supervised learning methods, such as classification and regression trees (CART), gradient boosting, random forests, and support vector machines (SVMs), and unsupervised learning methods, such as self-organizing maps (SOMs) and K-means, to log data, evaluating the consistency of the results across methods. He et al. (Reference He, Liao and Jiao2019) utilized the longest common subsequence (LCS) method to define the optimal sequence from the log data. Xu et al. (Reference Xu, Fang and Ying2020) applied a latent topic model with a Markov structure, which extends the hierarchical Bayesian topic model with a hidden Markov structure, to obtain latent features of log data.
Second, psychometric modeling of log data has typically focused on estimating test-takers’ latent traits from log data. For example, Shu et al. (Reference Shu, Bergner, Zhu, Hao and von Davier2017) developed a Markov item response theory model that combines Markov models with item response theory to identify latent characteristics of test takers and the tendency of each transition to occur. Han & Wilson (Reference Han and Wilson2022) applied mixture Rasch models to log data, specifically mixture partial credit models, to estimate latent features of students. Han et al. (Reference Han, Liu and Ji2021) proposed a sequential response model (SRM) that combines a dynamic Bayesian network with a psychometric model to infer test takers’ continuous latent abilities from log data.
Time information has been recognized as a critical element in the analysis of test-taking behavior, which offers useful information about the engagement and performance of the respondents (Engelhardt & Goldhammer, Reference Engelhardt and Goldhammer2019; Goldhammer et al., Reference Goldhammer, Naumann, Stelter, Tóth, Rölke and Klieme2014; He et al., Reference He, Borgonovi and Paccagnella2019; Scherer et al., Reference Scherer, Greiff and Hautamäki2015; Voros & Rouet, Reference Voros and Rouet2016). Researchers have employed various methodologies to examine the relationship between time-related factors and test outcomes. For example, generalized linear mixed models have been used to study the effect of time on task in computer-based assessments of reading and problem-solving (Goldhammer et al., Reference Goldhammer, Naumann, Stelter, Tóth, Rölke and Klieme2014), while two-level response time item response theory models have explored the connection between problem-solving time and ability by modeling dichotomous item responses and log-transformed reaction times jointly (Scherer et al., Reference Scherer, Greiff and Hautamäki2015). Studies have consistently found that an increase in time investment often correlates with higher test scores (He et al., Reference He, Borgonovi and Paccagnella2019). Other studies used time data to validate the interpretation of test scores for skills such as reading and reasoning (Engelhardt & Goldhammer, Reference Engelhardt and Goldhammer2019), and investigated the relationship between response time and action frequency, along with the combined effect of these factors on task performance (Voros & Rouet, Reference Voros and Rouet2016).
Importantly, timestamps from log data can provide crucial insights into the efficiency and fluency of cognitive processing of respondents (Wang et al., Reference Wang, Tang, Liu and Ying2022; Xu et al., Reference Xu, Fang and Ying2020). These temporal markers are particularly valuable in multi-step or problem-solving tasks, where transition speeds between actions can indicate respondents’ proficiency. Recent studies have leveraged timestamped log data using a variety of methodologies, including continuous-time dynamic choice models (Chen, Reference Chen2020), latent topic models with Markov transitions (Xu et al., Reference Xu, Fang and Ying2020), sequence mining techniques (Ulitzsch et al., Reference Ulitzsch, He and Pohl2021), and joint models of action sequences and times (Fu et al., Reference Fu, Zhan, Chen and Jiao2023). These approaches have allowed researchers to estimate latent abilities and behavioral speeds (Chen, Reference Chen2020), cluster learning trajectories (Xu et al., Reference Xu, Fang and Ying2020), investigate the behavioral patterns of correct and incorrect groups (Ulitzsch et al., Reference Ulitzsch, He and Pohl2021), segment long processes into interpretable subprocesses (Wang et al., Reference Wang, Tang, Liu and Ying2022), and identify different problem-solving strategies (Zhang & Andersson, Reference Zhang and Andersson2023).
Analysis of transition times between actions can offer a comprehensive view of respondent behavior in educational assessments. This approach can go beyond evaluating respondents’ speed and further illuminate how respondents navigate tasks and employ cognitive strategies. By examining the temporal patterns of transitions in problem-solving tasks, we can also discern crucial differences in approach between correct and incorrect answer groups.
This article introduces a novel approach to analyzing the impact of various factors on action transition speeds in log data, focusing on differences in transition patterns between correct and incorrect answer groups. We propose a multi-state model (MSM; Andersen et al., Reference Andersen, Abildstrom and Rosthøj2002; Commenges, Reference Commenges1999; Crowther & Lambert, Reference Crowther and Lambert2017; Hougaard, Reference Hougaard1999; Meira-Machado et al., Reference Meira-Machado, de Uña Álvarez, Cadarso-Suárez and Andersen2008; Putter et al., Reference Putter, Fiocco and Geskus2006) to overcome the limitations of traditional survival methods when dealing with non-standard log data formats. Our model examines progression through multiple states over time, particularly evaluating how key actions influence transition speeds. Rather than analyzing all possible actions, which would be conceptually and computationally inefficient given the large number and variety of actions in typical item solution processes, we concentrate on the effects of “key actions” at the start and end of transitions. These key actions are identified using a $\chi ^2$ statistical approach (He & von Davier, Reference He and von Davier2015) as significantly differentiating between correct and incorrect responses.
The rest of the article is organized as follows. In Section 2, we describe the motivating data example and explain how we extract key actions for the selected test items. In Section 3, we provide an overview of MSMs and their traditional applications, highlighting the theoretical and practical justification of our novel MSM application to log data. In Section 4, we present the proposed MSM approach for time sequence data across actions and explain the Bayesian estimation approach for the proposed model. In Section 5, we describe the application of the proposed model to the motivating data example. Section 6 presents simulation studies and prior sensitivity analyses to evaluate model performance and robustness. Finally, we conclude the article with a summary and discussion in Section 7.
2 Motivating examples
2.1 PIAAC problem solving test
The Organization for Economic Cooperation and Development (OECD) has implemented the PIAAC for adults from over 40 countries since 2011 (OECD, 2017). PIAAC measures adult literacy, numeracy, and problem-solving skills in technology-rich environments (PSTRE) and examines how adults apply these skills in a variety of areas, including home, work, and community.
We utilize the PSTRE assessment that focuses on “using digital technology, communication tools, and networks to acquire and evaluate information, communicate with others, and perform practical tasks” (OECD, 2011, 2012, 2016). PSTRE evaluates individuals’ problem-solving skills across various domains, including personal and professional domains, using computers. During the PSTRE evaluation, user interactions, such as clicking buttons and links, dragging, dropping, copying, and pasting, are automatically logged into a separate log file with timestamps.
In addition, we use the public use file (PUF), which includes PIAAC participants’ background information, including employment, income, health, education, and technology used in work and life. We selected four variables from the PUF and defined the “Electronic Skill” (ESkill) variable by aggregating seven ICT-related variables from the PUF, which measure respondents’ frequency of using various computer and Internet technologies for work. These seven ICT-related variables cover the frequency of using email, work-related information, conducting transactions, spreadsheets, word processing, programming languages, and real-time discussions, all rated on a scale of 1–5. We normalize the ESkill variable, which initially ranges from 7 to 35, to prevent it from disproportionately influencing the analysis due to its larger range. Table 1 provides detailed information on all selected variables, including descriptions, measurement scales, and descriptive statistics. Our final analysis includes 9,117 participants for the CD Tally test item and 12,557 participants for the Lamp Return test item, all of whom provided responses to all selected variables and the respective test items.
Table 1 The description and mean (standard deviation) of the five selected covariates from the Public Use Files

Note: The total numbers of participants for the CD Tally and Lamp Return test items are 9,117 and 12,557, respectively.
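The ESkill aggregation described above can be sketched as follows. The column names and the min–max normalization to [0, 1] are illustrative assumptions, since the actual PUF variable codes and the exact normalization used are not specified here.

```python
# Sketch of the "ESkill" aggregation: sum seven ICT-use variables rated
# 1-5 each (raw range 7-35), then normalize. Column names and min-max
# normalization are illustrative assumptions, not the actual PUF codes.
import pandas as pd

def build_eskill(df, ict_cols):
    raw = df[ict_cols].sum(axis=1)   # raw ESkill in [7, 35]
    return (raw - 7) / (35 - 7)      # normalized to [0, 1]

# three hypothetical respondents answering all seven items 1, 3, or 5
df = pd.DataFrame({f"ict_{k}": [1, 3, 5] for k in range(7)})
eskill = build_eskill(df, [f"ict_{k}" for k in range(7)])
```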
2.2 CD Tally and Lamp Return
The PSTRE assessment from 2012 PIAAC consists of 14 problem-solving items. These items are typically designed based on four specific environments: email, web browsing, word processing, and spreadsheet. Figure 1 displays a publicly available example PSTRE test item in which participants engage in simulated job searches. In this item, participants are instructed to find a website that does not require registration or payment and then bookmark it. In order to solve this item successfully, participants need to navigate multiple website pages and bookmark the websites that do not require any registration or payment.

Figure 1 A publicly available example of the PSTRE assessment of 2012 PIAAC about simulated job searches.
Note: Figure (a) and (b) are the list of job search sites and the page for the first link, respectively.
In this article, we consider two problem-solving items, CD Tally (Item ID: U03A) and Lamp Return (Item ID: U23X), among the 14 PSTRE items from the 2012 PIAAC. Among PIAAC’s four environmental designs (email, web browsing, word processing, and spreadsheet), the CD Tally item is based on the web and spreadsheet environments, and the Lamp Return item is based on the email and web environments. The following is a detailed description of each item.
2.2.1 CD Tally
In the CD Tally item, test takers are asked to update the store’s online inventory as requested by the manager. The CD Tally test item contains two pages: a website and a spreadsheet. The spreadsheet contains various details about the CDs, such as title, artist, genre, and release date. The goal is to count the number of CDs in the blues genre in the spreadsheet and enter the count into the website. A total of 52 actions are identified for the CD Tally test item, which are detailed in Section 1 of the Supplementary Material. An example of log data for CD Tally is “wb (1.33) - ss (1.37) - ss_file (1.86) - so (1.95) - so_1_3 (2.02) - so_2_asc (2.52) - so_ok (2.57) - wb (2.67) - combobox (2.76)”, where the values in parentheses indicate the time (in minutes) at which each action occurred. This log shows the participant selecting sort options, sorting the spreadsheet, and entering the answer into a combobox.
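A log string of this form can be parsed into (action, timestamp) pairs, from which transition times between consecutive actions follow directly. This is a minimal sketch under the assumption that every entry follows the “action (time)” pattern shown above; the exact log format may vary across items.

```python
import re

def parse_log(log):
    """Split a PIAAC-style log string like "wb (1.33) - ss (1.37)"
    into (action, time-in-minutes) pairs."""
    return [(a, float(t)) for a, t in re.findall(r"(\S+) \(([\d.]+)\)", log)]

seq = parse_log("wb (1.33) - ss (1.37) - ss_file (1.86) - so (1.95)")
# transition times (gaps) between consecutive actions, in minutes
gaps = [round(t2 - t1, 2) for (_, t1), (_, t2) in zip(seq, seq[1:])]
```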
2.2.2 Lamp Return
The Lamp Return item assumes that the test taker receives a desk lamp in a different color from the one they ordered. The test taker is asked to request an exchange for a desk lamp in the correct color via the company’s website. To accomplish this goal, the respondent clicks on the customer service page of the company’s website and fills out a return form. The form requires an authorization number, which the participant receives via email. A total of 126 actions are identified for Lamp Return, more than twice as many as for CD Tally. Details of the actions for Lamp Return are described in Section 1 of the Supplementary Material.
Table 2 lists the numbers and percentages of correct and incorrect answers to the two items, summarized for each of the 14 countries. Note that in the Lamp Return test, responses are scored on a scale from 0 to 3, where higher scores indicate higher accuracy. We dichotomized the responses, treating a score of 3 as correct and all other scores as incorrect.
Table 2 The numbers and percentages of participants who answered CD tally and Lamp Return correctly and incorrectly in the 2012 PIAAC data

2.3 Extracting key actions
As described above, the two selected items, CD Tally and Lamp Return, involve a large number of actions: 52 and 126 actions, respectively, for the item solution. Recognizing that not all actions contribute equally to the item solution process, we identify “key actions” to evaluate their influence on transition speed between actions.
Researchers have applied subjective or objective methods to extract key actions from action sequence log data. Subjective methods select key actions based on personal knowledge (Greiff et al., Reference Greiff, Wüstenberg and Avvisati2015, Reference Greiff, Niepel, Scherer and Martin2016). Objective methods apply feature selection approaches for key action extraction. For example, Han et al. (Reference Han, He and von Davier2019) applied a random forest algorithm to extract the most predictive characteristics that distinguished the correct group from the incorrect group. He & von Davier (Reference He and von Davier2015) utilized the $\chi ^2$ statistics method and the weighted log-likelihood ratio (WLLR) approach based on natural language processing.
Here, we apply He & von Davier (Reference He and von Davier2015)’s $\chi ^2$ statistic approach to select key actions that differentiate correct answers from incorrect answers for the two selected items. The $\chi ^2$ statistic approach is based on the following four steps:
1. Calculate the inverse sequence frequency (ISF) for action i, $\text {ISF}_i = \log (E/\text {sf}_i)$, where $\text {sf}_i$ is the occurrence frequency of action i.

2. Calculate the term frequency (TF), $\text {tf}_{ij}$, which indicates the frequency of action i for individual j.

3. Combine ISF and TF to calculate weights as follows:
$$ \begin{align*} \text{weight}(i,j) = \begin{cases} [1 + \log (\text{tf}_{ij})] \cdot \text{ISF}_i, & \text{if } \text{tf}_{ij} \ge 1, \\ 0, & \text{if } \text{tf}_{ij} = 0. \end{cases} \end{align*} $$

4. Calculate the chi-square score for each action with weighted frequencies.
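Steps 1–3 can be sketched as follows. We read $E$ as the total number of action sequences, by analogy with inverse document frequency; this reading is an assumption, as $E$ is defined in He & von Davier (2015) rather than here.

```python
# Minimal sketch of steps 1-3. The interpretation of E as the total
# number of action sequences is an assumption (E is defined in
# He & von Davier, 2015).
import math
from collections import Counter

def isf(action_seq_counts, E):
    """Step 1: ISF_i = log(E / sf_i), with sf_i the occurrence
    frequency of action i."""
    return {a: math.log(E / sf) for a, sf in action_seq_counts.items()}

def weight(tf_ij, isf_i):
    """Step 3: [1 + log(tf_ij)] * ISF_i when action i occurs for
    individual j (tf_ij >= 1), and 0 otherwise."""
    return (1 + math.log(tf_ij)) * isf_i if tf_ij >= 1 else 0.0
```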
In the fourth step, a chi-square score is calculated using a $2 \times 2$ contingency table (given in Table 3), which cross-tabulates the presence of each action with the correctness of the response. In Table 3, $n_i$ and $m_i$ indicate the weighted frequencies of occurrence of action i in the correct and incorrect groups, respectively, and len($C_1$) and len($C_2$) denote the sums of the weighted occurrence frequencies in the correct and incorrect groups, respectively.
Table 3 The $2 \times 2$ contingency table for the chi-square test of action i

The $\chi ^2$ statistic approach aims to assess the independence of action occurrence and response correctness. Under the null hypothesis of independence, the chi-square score is given as

$$ \begin{align*} \chi^2 = \sum_{i=1}^{2} \sum_{j=1}^{2} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}, \end{align*} $$

where $O_{ij}$ represents the observed cell in the ith row and jth column of Table 3 and $E_{ij}$ is the corresponding expected frequency under independence. Chi-square scores indicate the discriminatory power of actions in classification. Consequently, we organize the $\chi ^2$ scores for each action in descending order. Additionally, if the ratio $n_i / m_i$ exceeds $\text {len}(C_1) / \text {len}(C_2)$, action i is deemed more representative of the correct answer group ($C_1$). As a result, key actions are selected as actions with high $\chi ^2$ scores among those satisfying $n_i/m_i> \text {len}(C_1) / \text {len}(C_2)$.
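The fourth step and the selection rule can be illustrated with a minimal sketch using the cell layout of Table 3 (weighted frequencies for presence/absence of an action in the correct and incorrect groups):

```python
# Chi-square score for action i from the 2x2 table of Table 3:
# observed cells are (n_i, m_i) for presence and
# (len_c1 - n_i, len_c2 - m_i) for absence in the correct/incorrect
# groups; expected cells follow from independence of rows and columns.
def chi_square_score(n_i, m_i, len_c1, len_c2):
    obs = [n_i, m_i, len_c1 - n_i, len_c2 - m_i]
    total = len_c1 + len_c2
    present, absent = n_i + m_i, total - (n_i + m_i)
    exp = [present * len_c1 / total, present * len_c2 / total,
           absent * len_c1 / total, absent * len_c2 / total]
    return sum((o - e) ** 2 / e for o, e in zip(obs, exp))

def favors_correct(n_i, m_i, len_c1, len_c2):
    # action i is deemed representative of the correct group C1
    return n_i / m_i > len_c1 / len_c2

score = chi_square_score(20.0, 10.0, 100.0, 100.0)
```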
To determine the cut-off point for selecting key actions, we use a line plot that visualizes the chi-square scores in descending order on the y-axis, with the elbow point serving as the threshold value: actions whose chi-square scores lie above the elbow point are selected as key actions. Figure 2 shows the line plots for selecting the key actions of the two test items.

Figure 2 Line plot of chi-square scores in descending order for selecting key actions of (a) CD Tally and (b) Lamp Return test items.
Note: The red circle indicates the elbow point of the line plot. Actions with chi-square scores above this point are designated as key actions.
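One way to automate the elbow choice that the article reads off the line plot is the maximum-distance-to-chord heuristic: pick the point farthest from the straight line joining the first and last scores. This particular heuristic is our illustrative assumption, not necessarily the authors’ procedure.

```python
import numpy as np

def elbow_index(scores):
    """Index of the elbow of a descending score curve, taken as the
    point with the largest perpendicular distance from the chord
    joining the first and last points."""
    y = np.asarray(scores, dtype=float)
    x = np.arange(len(y), dtype=float)
    dx, dy = x[-1] - x[0], y[-1] - y[0]
    # perpendicular distance of each (x, y) to the chord
    dist = np.abs(dy * (x - x[0]) - dx * (y - y[0])) / np.hypot(dx, dy)
    return int(np.argmax(dist))

# hypothetical descending chi-square scores with a clear elbow
cut = elbow_index([100, 60, 35, 20, 5, 4, 3, 2, 1])
```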
2.3.1 CD Tally: Key actions
Table 4 lists the top 12 actions based on $\chi ^2$ scores for the CD Tally log data. Figure 2 is a line plot of $\chi ^2$ scores in descending order. In Figure 2, “ss_so” represents the elbow point of the line plot. Actions with a higher $\chi ^2$ score than “ss_so”—namely “so_1_3”, “so_2_asc”, “so_ok”, “so”, “ss_data”, “so_2_desc”, and “so_so”—are selected as key actions. For the CD Tally test item, the actions related to sorting spreadsheets are chosen as the key actions.
Table 4 The top 12 actions based on the chi-square scores, along with the average occurrence frequency and average occurrence time (min) per person for the CD Tally test item

Note: The 7 actions, marked in bold, are selected as key actions for this test item.
2.3.2 Lamp Return: Key actions
Table 5 and Figure 2 show the top 15 actions based on chi-square scores and a line plot of the descending chi-square scores to identify key actions for the Lamp Return test item, respectively. Actions with higher chi-square scores than action “wb_pg_8_3”, which is the elbow point in Figure 2, are selected as key actions. The selected key actions for Lamp Return are marked in bold in Table 5. The top ranked actions are related to the customer service page, including actions such as selecting a reason for the return, requesting a return, and submitting a return form. The key actions in the lower ranking are related to obtaining an authorization number by email.
Table 5 The top 15 actions based on the chi-square scores, along with the average occurrence frequency and average occurrence time (min) per person for the Lamp Return test item

Note: The top 13 actions, marked in bold, are selected as key actions for this test item.
3 MSMs
In this article, we adapt a MSM (Andersen et al., Reference Andersen, Abildstrom and Rosthøj2002; Commenges, Reference Commenges1999; Crowther & Lambert, Reference Crowther and Lambert2017; Hougaard, Reference Hougaard1999; Meira-Machado et al., Reference Meira-Machado, de Uña Álvarez, Cadarso-Suárez and Andersen2008; Putter et al., Reference Putter, Fiocco and Geskus2006) to examine the influence of covariates and key actions on transition speeds between actions in log data, while identifying distinct transition patterns between correct and incorrect response groups. This approach effectively addresses the challenge of varying transition times across individuals, allowing for a comprehensive analysis of individuals’ item solution processes through multiple states over time. In this section, we begin by providing a detailed overview of the MSM, highlighting the novelty of our application of the MSM to log data analysis, which seeks to shed light on the complex dynamics of test takers’ problem-solving behaviors. In Section 4, we detail the formulation and estimation of the proposed MSM for log data.
3.1 MSM: Background
The MSM approach (Andersen et al., Reference Andersen, Abildstrom and Rosthøj2002; Commenges, Reference Commenges1999; Crowther & Lambert, Reference Crowther and Lambert2017; Hougaard, Reference Hougaard1999; Meira-Machado et al., Reference Meira-Machado, de Uña Álvarez, Cadarso-Suárez and Andersen2008; Putter et al., Reference Putter, Fiocco and Geskus2006) is a sophisticated analytical tool designed to analyze longitudinal failure time data by modeling the progression of individuals through various states or phases over time, such as the progression of a disease. The MSM approach offers a framework for investigating individual differences in trajectories across different states and the effects of covariates on the transition between two states (Crowther & Lambert, Reference Crowther and Lambert2017).
MSMs can be grouped into different types based on assumptions about the dependency of transition hazard rates on time and on the process history. For example, Markov models assume future states depend only on the current state (memoryless property), whereas semi-Markov models allow transition intensities to depend on the duration in the current state, relaxing the memoryless property. Non-Markov models are the most general type, allowing transition intensities to depend on the entire process history. Additionally, MSMs can be further classified based on time homogeneity: time-homogeneous models assume constant transition rates over time, while time-inhomogeneous models allow transition rates to vary over time. Time-homogeneous models, often assuming Markov properties, offer simplicity and are suitable for systems with constant transition probabilities. In contrast, time-inhomogeneous models offer more flexibility with time-varying transition probabilities. We will utilize time-homogeneous Markov models, as elaborated in the following section.
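The time-homogeneous Markov assumption can be made concrete with a small simulation sketch: sojourn times in state $s$ are exponential with rate $-Q_{ss}$, and the next state is drawn proportionally to the off-diagonal intensities. The state labels and rates below are illustrative, not taken from the data.

```python
import numpy as np

def simulate_ctmc(Q, start, t_max, seed=0):
    """Simulate one path of a time-homogeneous continuous-time Markov
    chain with intensity matrix Q (rows summing to zero)."""
    rng = np.random.default_rng(seed)
    Q = np.asarray(Q, dtype=float)
    t, s, path = 0.0, start, [(0.0, start)]
    while True:
        rate = -Q[s, s]
        if rate <= 0:                 # absorbing state: stop
            break
        t += rng.exponential(1.0 / rate)
        if t >= t_max:                # censor the path at t_max
            break
        p = np.clip(Q[s], 0.0, None)  # off-diagonal jump intensities
        p[s] = 0.0
        s = int(rng.choice(len(Q), p=p / p.sum()))
        path.append((t, s))
    return path

# toy 3-state chain: 0 -> 1 -> 2, with state 2 absorbing
Q = [[-1.0, 1.0, 0.0], [0.0, -1.0, 1.0], [0.0, 0.0, 0.0]]
path = simulate_ctmc(Q, start=0, t_max=1e9)
```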
In MSM, nonparametric approaches offer additional flexibility by making no distributional assumptions (de Wreede et al., Reference de Wreede, Fiocco and Putter2010; Manevski et al., Reference Manevski, Putter, Pohar Perme, Bonneville, Schetelig and de Wreede2022), while parametric approaches bring simplicity to the analysis through specific distributional constraints (Krol & Saint-Pierre, Reference Krol and Saint-Pierre2015). Semi-parametric approaches, such as the Cox proportional hazards model, make fewer assumptions than parametric methods by not specifying the form of the baseline hazard function (Kneib & Hennerfeind, Reference Kneib and Hennerfeind2008). These models are commonly employed to examine the relationship between covariates and the hazard rate in time-to-event data, and we adopt these models in the current study.
The Cox proportional hazards model defines the transition hazard from state i to state j at time t as

$$ \begin{align*} \lambda_{ij}(t \mid \mathbf{X}) = \lambda_{ij0}(t) \exp(\mathbf{X} \boldsymbol{\beta}_k), \end{align*} $$

where $\lambda _{ij0}(t)$ is the baseline hazard function, $\mathbf {X}$ represents the $n \times p$ matrix of covariates, and $\boldsymbol {\beta }_k$ is the $p \times 1$ vector of regression coefficients for event k (the transition from state i to state j), with n referring to the number of observations and p indicating the number of covariates. This formulation allows for modeling transition hazards without specifying a parametric form for the baseline hazard function.
Both frequentist and Bayesian approaches can be employed to estimate the hazard function in multi-state transitions. Maximum likelihood estimation (MLE), a frequentist method, finds parameter values that maximize the likelihood function based on observed data. It offers unbiased and efficient estimates under large samples and correct model specifications, and is easily implemented using standard software such as the msm package in R. However, MLE can be unreliable with sparse data, may fail to converge in complex models, often rests on time-homogeneity and independence assumptions that are not always valid, and struggles to handle unexplained individual heterogeneity (Matsena Zingoni et al., Reference Matsena Zingoni, Chirwa, Todd and Musenge2021; Shen et al., Reference Shen, Han, Petousis, Weiss, Meng, Bui and Hsu2017). On the other hand, Bayesian estimation yields consistent estimates even with small samples by incorporating prior knowledge and effectively handles individual heterogeneity and nonlinear covariate effects. It offers full distributions of parameter estimates and robustness in complex models, though it is computationally intensive and sensitive to prior choices (Zingoni et al., Reference Zingoni, Chirwa, Todd and Musenge2020). We adopt Bayesian estimation, as elaborated in Section 4.2.
3.2 MSM: Traditional applications
MSMs have traditionally been employed in medical and epidemiological research to analyze transitions between distinct health states over time (Andersen et al., Reference Andersen, Abildstrom and Rosthøj2002; Crowther & Lambert, Reference Crowther and Lambert2017; Hougaard, Reference Hougaard1999). Specifically, MSMs are applied in contexts such as cancer progression, stroke recovery, and chronic disease management, where state transitions occur infrequently, may be only partially observable, and occur over extended timeframes (Grevel et al., Reference Grevel, Veasy and Chacko2024).
A well-known MSM application is the illness–death model, which captures an individual’s progression through discrete health states. In its standard form, this model comprises three states: State 0 (healthy), State 1 (diseased), and State 2 (dead). Figure 3 depicts the typical structure of the illness-death model. Individuals may transition from state 0 to state 1 (onset of disease), from state 1 to state 2 (mortality after illness), or directly from state 0 to state 2 (death without prior disease). These transitions are generally considered irreversible, with the time intervals between states characterizing the temporal dynamics of disease progression.

Figure 3 An illustration of the illness–death model.
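For a time-homogeneous version of the illness–death model, the transition probability matrix is $P(t) = \exp (Qt)$, the matrix exponential of the intensity matrix $Q$. The rates below are made up for illustration, and the truncated Taylor series is only adequate here because $\lVert Qt \rVert$ is small.

```python
import numpy as np

# Illustrative intensity matrix for the illness-death model:
# state 0 = healthy, 1 = diseased, 2 = dead; the rates are made up.
Q = np.array([[-0.15, 0.10, 0.05],
              [0.00, -0.20, 0.20],
              [0.00, 0.00, 0.00]])

def matrix_exp(A, terms=60):
    """Truncated Taylor series for expm(A); fine for small ||A||."""
    P, term = np.eye(len(A)), np.eye(len(A))
    for k in range(1, terms):
        term = term @ A / k
        P = P + term
    return P

# P(5)[i, j]: probability of being in state j at t = 5 given state i at 0
P5 = matrix_exp(Q * 5.0)
```

Because each row of $Q$ sums to zero, each row of $P(t)$ sums to one, and the absorbing death state maps to itself with probability 1.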
Despite the widespread adoption of MSMs in medical and epidemiological research, their application has been limited to situations with a few clearly defined states and infrequent, unidirectional transitions. While these assumptions adequately capture long-term clinical processes, they constrain the utility of MSMs in domains characterized by rapid, high-resolution state changes. Consequently, despite the inherent flexibility of the model, researchers have yet to fully explore its use in settings characterized by complex behavioral dynamics and high-frequency event sequences.
3.3 MSM: Novel application to log data from online educational assessment systems
The fundamental components of MSMs—discrete states, transition intensities, and covariate effects—can be effectively repurposed to analyze digital systems in which entities transition between states over time. This is particularly relevant for digital log data from educational assessments, where users perform actions (e.g., clicking, dragging, copying) in rapid succession. Unlike traditional applications in clinical contexts, these datasets pose unique challenges: transitions occur at extremely high frequencies (often seconds apart), all state changes are fully observable (rather than partially observed), and the temporal patterns themselves carry meaningful information about cognitive processes. Although the semantic interpretation of states in digital environments differs from that in medical research, the underlying statistical framework of MSMs remains robust and adaptable. The contribution of the proposed model lies in extending and tailoring MSMs to capture these distinctive features and the temporal dynamics of digital assessments.
This section explores the application of MSMs to the log data generated within online educational assessment systems. We first establish the theoretical foundation for employing MSMs in this context, with particular emphasis on their alignment with cognitive and educational frameworks. We then examine the unique insights these models can provide regarding respondents’ behavioral patterns and instructional system design.
3.3.1 Theoretical justifications
Log data from educational assessments contain time-stamped records of actions captured during test takers’ problem-solving processes. A key task is to infer the strategies and reasoning patterns that guide test takers’ actions during the test. These actions are inherently sequential, as each decision depends on the outcomes of previous actions. Therefore, log data analysis is well-suited to modeling approaches that can capture temporal structures and transitions between behavioral states.
As noted in Section 1, various analytical approaches are available for log data, including latent topic modeling, sequence mining, and continuous-time dynamic modeling (e.g., Chen, Reference Chen2020; Fu et al., Reference Fu, Zhan, Chen and Jiao2023; Ulitzsch et al., Reference Ulitzsch, He and Pohl2021; Wang et al., Reference Wang, Tang, Liu and Ying2022; Xu et al., Reference Xu, Fang and Ying2020; Zhang & Andersson, Reference Zhang and Andersson2023). These methods have successfully identified frequent behavioral patterns, clustered learning trajectories, predicted task performance, and jointly modeled response times and behavioral transitions.
MSMs can provide a complementary analytical framework by explicitly modeling transition intensities between actions. Unlike methods primarily designed for clustering or prediction, MSMs uniquely capture both the timing and intensity of transitions while accounting for covariates that may influence transition probabilities. Furthermore, MSMs offer a powerful framework for modeling the dynamic progression of respondents through sequences of cognitive or behavioral states. This perspective is also consistent with cognitive theories such as information processing theory and cognitive load theory, which conceptualize cognition as an iterative process involving goal recognition, information structuring, action selection, and execution (Dostál, Reference Dostál2015). Importantly, MSMs effectively capture how individuals continuously monitor action outcomes and adapt their strategies through modeling of state transitions and their temporal characteristics.
According to cognitive load theory, excessive cognitive demands—particularly in complex tasks—can affect performance by overwhelming working memory capacity (Paas et al., Reference Paas, Renkl and Sweller2003). However, directly measuring cognitive load from log data presents significant challenges due to the absence of explicit indicators (Sweller, Reference Sweller1988). MSMs represent a potentially promising alternative in this context, as they can capture both the temporal dynamics and structural patterns of transitions throughout problem-solving sequences—providing indirect yet meaningful indicators of cognitive effort at the individual level. For instance, rapid task completion with minimal planning may reflect disengagement or misunderstanding, whereas structured, stage-based progression may indicate deliberate strategy use. Transition parameters can be theoretically mapped to cognitive constructs: transition probabilities may index processing efficiency; the timing and sequence of transitions may align with executive functioning; and repetitive loops may signal confusion or strategy revision under cognitive load. This alignment enables interpretation of behavioral data within a psychologically grounded framework, beyond surface-level statistical patterns.
Therefore, modeling the transition times between specific actions allows for more fine-grained inferences about respondents’ cognitive processing and problem-solving strategies. This framework also supports comparisons between correct and incorrect respondents, which can reveal differences in strategy patterns, processing efficiency, and decision pathways that cannot be captured by outcome measures alone. Altogether, MSMs offer a theoretically grounded and practically valuable approach for analyzing educational assessment log data. Hence, the proposed approach enables researchers to uncover subtle variations in respondents’ cognitive processing and strategies that might otherwise remain undetected in conventional performance metrics.
3.3.2 Research and practical contributions
MSMs offer interpretive advantages that contribute meaningfully to both educational research and practice. From a research perspective, MSMs facilitate analysis of how students navigate complex tasks—beyond merely determining success, MSMs illuminate progression through each stage of the problem-solving process. For instance, in digital problem-solving items, MSMs can reveal whether high-performing students allocate more time to information review before taking action, or whether low-performing students systematically omit critical steps such as hypothesis generation or planning. This process-oriented analytical approach enables researchers to test and refine cognitive models for task performance and identify distinct behavioral patterns that differentiate effective from ineffective problem-solving strategies.
By modeling both the sequential structure and temporal dynamics of learner actions, MSMs transform behavioral log data into a theoretically grounded and diagnostically rich resource. This approach facilitates fine-grained comparisons between meaningful subgroups (e.g., correct versus incorrect responders), uncovering cognitive strategies that remain obscured in traditional outcome-based analyses. For instance, our findings highlight critical transition points at which test takers commonly experience difficulty, such as extended delays between task comprehension and strategy initiation, thereby identifying precise junctures where instructional scaffolding may be most impactful. Similarly, recurrent transitions between particular states (e.g., repeatedly returning to information sources without progressing) may signal conceptual confusion or indecisiveness.
These process-level insights can offer a foundation for the development of responsive educational systems that adapt in real time to learners’ behavioral indicators. Such applications are directly relevant to large-scale assessments and instructional platforms administered by organizations such as the OECD, which increasingly seek to integrate process data into evidence-based design. Specifically, MSM-informed diagnostics support:
• the design of targeted interventions addressing specific cognitive bottlenecks rather than generalized performance deficits,
• the implementation of adaptive feedback mechanisms grounded in behavioral process markers rather than final outcomes,
• the creation of interface designs that facilitate cognitively optimal task sequences, and
• the deployment of analytics that clarify not only what learners struggle with, but why those struggles occur.
Ultimately, the proposed MSM approach moves beyond descriptive modeling to offer a principled, empirically grounded framework for advancing personalized learning, intelligent tutoring, and diagnostic assessment systems. MSMs thus represent a meaningful contribution to both theory-driven research and the practical optimization of educational assessment systems.
4 Proposal: MSM for log data
4.1 Model formulation
Log data consist of the sequential progression of actions for individuals. Our idea is to view individual actions as different states and apply an MSM to the sequence of actions and their execution times. We can then model transition times between actions that respondents take while they are working on the problem-solving test items.
To formulate the model, suppose N is the number of respondents, E is the total number of states (actions), and
$E_i$
is the number of states that respondent i has gone through. Let
$\boldsymbol {X_i} = \{x_{i,1}, x_{i,2}, \ldots , x_{i,P}\}$
be a vector of respondent i’s background characteristics and
$\boldsymbol {A} = \{A_1, A_2, \ldots , A_K \}$
be a collection of key actions identified through a data-driven method, as detailed in Section 2.3. We adopt the time-homogeneous Markov assumption, which implies that transition rates remain constant over time and that future states depend solely on the current state. We then define the hazard function
$\lambda _{m,l,i}$
for the transition from action m to action l for respondent i as follows:
$$\lambda_{m,l,i} = \kappa_{c_i,m,l}\, \tau_i \exp\left\{ \sum_{p=1}^{P} \alpha_{p} x_{i,p} + \beta_{c_i,1} I(m \in \boldsymbol{A}) + \beta_{c_i,2} I(l \in \boldsymbol{A}) \right\}, \quad (1)$$
where
$c_i$
is a binary indicator for correctness, with
$c_i=0$
for incorrectness and
$c_i=1$
for correctness, and
$I(m \in \boldsymbol {A})$
and
$I(l \in \boldsymbol {A})$
are indicator functions that equal 1 when start action m and end action l are key actions, respectively.
The model parameters,
$\Theta = \{\boldsymbol {\kappa _{0}, \kappa _{1}, \tau , \alpha , \beta }\}$
, are explained as follows:
• $\kappa _{c_i, m, l}> 0$ represents the baseline hazard for transitions between actions m and l for correct ($c_i=1$) and incorrect ($c_i=0$) groups. A larger $\kappa _{c_i,m,l}$ indicates a higher likelihood and faster rate of transitioning from action m to action l, before accounting for the effects of covariates.
• $\tau _i> 0$ represents the overall speed of respondent i, with larger values indicating a tendency for respondents to transition between actions quickly.
• $\boldsymbol {\alpha } = \{\alpha _1, \dots , \alpha _P \}$ is a collection of the regression coefficients of respondents’ background characteristics on $\lambda _{m, l, i}$. A greater $\alpha _{p}$ implies that individuals with a higher value for background ${x_{i,p}}$ have faster transition speeds compared to others.
• $\beta _{c_i, 1}$ and $\beta _{c_i, 2}$ represent the effects when the start and end actions are key actions for group $c_i$, respectively. For group $c_i$, a larger $\beta _{c_i, 1}$ indicates a faster transition when the start action is a key action, while a larger $\beta _{c_i, 2}$ means a faster transition when the end action is a key action.
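As a concrete illustration, the hazard in Equation 1 can be sketched in code, assuming the multiplicative proportional-hazards form implied by the parameter descriptions above (the function name and arguments are hypothetical, not from the paper):

```python
import numpy as np

# Sketch of the transition hazard, assuming the multiplicative form implied
# by the parameter descriptions: baseline kappa, person speed tau, covariate
# effects alpha, and key-action effects beta1/beta2. Names are illustrative.
def hazard(kappa, tau_i, alpha, x_i, beta1, beta2, m, l, key_actions):
    linear = float(np.dot(alpha, x_i))          # covariate contribution
    linear += beta1 * (m in key_actions)        # start action is a key action
    linear += beta2 * (l in key_actions)        # end action is a key action
    return kappa * tau_i * np.exp(linear)
```

With all covariate and key-action effects at zero, the hazard reduces to the baseline rate scaled by the respondent's speed.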
In our MSM for log data analysis, we estimate transition probabilities between actions under a time-homogeneous Markov assumption (the transition hazard rates between actions remain constant over time). This approach allows us to calculate the likelihood of a respondent moving from one action to another within an event sequence. Specifically, the transition probability
$P_{m,l,i}$
for respondent i moving from action m to action l is computed as
$$P_{m,l,i} = \Pr(a_{i,j+1} = l \mid a_{i,j} = m) = \frac{\lambda_{m,l,i}}{\sum_{l' \neq m} \lambda_{m,l',i}}, \quad (2)$$
where
$a_{i, j}$
denotes the j-th action in respondent i’s action sequence and
$\lambda _{m, l, i}$
represents the transition hazard rate between behaviors m and l. This transition probability effectively quantifies the relative likelihood of the next action being l compared to all other possible actions.
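This normalization of competing hazards is straightforward to compute; a minimal sketch (function name hypothetical):

```python
import numpy as np

# Embedded-chain transition probabilities out of a state: each competing
# hazard divided by the total hazard of leaving that state (sketch).
def transition_probs(hazards_out):
    h = np.asarray(hazards_out, dtype=float)
    return h / h.sum()
```

For example, with hazards (1, 1, 2) out of a state, the third destination receives probability 0.5 and the other two 0.25 each.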
4.2 Estimation
We estimate the model parameters of our proposed model using a fully Bayesian method via Markov chain Monte Carlo (MCMC). To define the likelihood function, we let
$a_{i,j}$
denote the j-th action executed by respondent i, with
$t_{i,j}$
indicating the action occurrence time. Suppose
$T_i$
represents the total time it took for individual i to solve the item. We define
$D_{m,l,i,j}$
as
$$D_{m,l,i,j} = I\{a_{i,j-1} = m,\ a_{i,j} = l\}.$$
That is,
$D_{m, l, i, j}$
indicates whether the j-th transition in respondent i’s action sequence is from action m to action l. In addition, we define the risk set,
$R_{m,i}(t) = \sum _{j=1}^{E_i}I\{a_{i,j-1}=m, t_{i,j-1} < t \leq t_{i,j}\}$
, which equals 1 when respondent i is in state m at time t.
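To make these quantities concrete, the sufficient statistics for one respondent—the transition counts (the D indicators summed over j) and the time spent in each state (the integral of the risk set over the observation window)—can be tallied from the action sequence as follows (a sketch with hypothetical names):

```python
# Tally observed transitions and the time spent in each state for one
# respondent, given actions a_{i,j} at times t_{i,j} and total time T_i.
def counts_and_sojourns(actions, times, total_time):
    counts, sojourn = {}, {}
    for j in range(len(actions) - 1):
        pair = (actions[j], actions[j + 1])
        counts[pair] = counts.get(pair, 0) + 1
        # respondent occupies actions[j] from times[j] until the next action
        sojourn[actions[j]] = sojourn.get(actions[j], 0.0) + times[j + 1] - times[j]
    # time in the final state until the item is finished at total_time
    sojourn[actions[-1]] = sojourn.get(actions[-1], 0.0) + total_time - times[-1]
    return counts, sojourn
```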
In Bayesian estimation, the posterior distribution is derived from the product of the likelihood function and the prior distribution, enabling the integration of prior knowledge with observed data. The likelihood is similar to that of the traditional MSM (Hougaard, Reference Hougaard1999, Reference Hougaard2012), with the primary distinction lying in the definition of the hazard function presented in Equation 1. The likelihood function of the proposed MSM for action transitions from log data can be derived as follows:
$$L(\Theta \mid \mathcal{Y}) = \prod_{i=1}^{N} \prod_{m} \prod_{l \neq m} \left[ \prod_{j=1}^{E_i} \lambda_{m,l,i}^{\,D_{m,l,i,j}} \right] \exp\left( -\lambda_{m,l,i} \int_{0}^{T_i} R_{m,i}(t)\, dt \right),$$
where
${\mathcal{Y}}$
represents the sequence of actions and their occurrence time,
$\lambda _{m,l.i}$
is the hazard function as defined in Equation 1,
$T_i$
represents the time taken by respondent i to solve the problem, and
$\Theta = \{\boldsymbol {\kappa , \tau , \alpha , \beta }\}$
represents parameters of interest.
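Under the time-homogeneous assumption, each respondent's log-likelihood contribution reduces to transition counts times log-hazards, minus each hazard times the time at risk in its origin state. A sketch (names hypothetical):

```python
import math

# Time-homogeneous multistate log-likelihood contribution of one respondent:
# each observed m -> l transition adds log(lambda_{m,l}); each hazard out of
# a state m is penalized by lambda_{m,l} times the time spent in m.
def log_lik(counts, sojourn, rates):
    ll = sum(n * math.log(rates[pair]) for pair, n in counts.items())
    for (m, l), lam in rates.items():
        ll -= lam * sojourn.get(m, 0.0)
    return ll
```

Here `counts` maps observed (m, l) pairs to their frequencies, `sojourn` maps states to total time at risk, and `rates` maps (m, l) pairs to hazards.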
The posterior distribution of
$\Theta $
can then be written as follows:
$$\pi(\Theta \mid \mathcal{Y}) \propto L(\Theta \mid \mathcal{Y})\, \pi(\Theta),$$
where the prior distributions for
$\Theta $
are given as
$$\kappa_{c,m,l} \sim \mathrm{Gamma}(a_{\kappa}, b_{\kappa}), \quad \tau_i \sim \mathrm{Gamma}(a_{\tau}, b_{\tau}), \quad \alpha_p \sim N(0, \sigma_{\alpha}^{2}), \quad \beta_{c,1}, \beta_{c,2} \sim N(0, \sigma_{\beta}^{2}).$$
We use
$a_{\kappa } = b_{\kappa } = a_{\tau } = b_{\tau }= 1.0$
and
$\sigma _{\alpha } = \sigma _{\beta } = 2$
as the prior hyperparameter values. Additional details of the MCMC sampling are provided in Section 2 of the Supplementary Material; the proposal distribution variances are adjusted to ensure moderate acceptance rates (approximately 0.2–0.5).
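A random-walk Metropolis update of the kind used in such samplers can be sketched as follows (the function and tuning shown here are illustrative, not the paper's exact sampler, whose details are in the Supplementary Material):

```python
import math
import random

# One random-walk Metropolis update (sketch). The proposal standard
# deviation prop_sd is the quantity tuned until the acceptance rate lands
# roughly in the 0.2-0.5 range.
def metropolis_step(theta, log_post, prop_sd, rng=random):
    proposal = theta + rng.gauss(0.0, prop_sd)
    accept = math.log(rng.random()) < log_post(proposal) - log_post(theta)
    return (proposal, True) if accept else (theta, False)
```

Running many such steps and monitoring the fraction of accepted proposals gives the acceptance rate used to adjust `prop_sd`.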
5 Real data analysis
We apply the proposed approach to log data from two PIAAC problem-solving test items, CD Tally and Lamp Return, from the 14 countries described in Section 2. The MCMC algorithm was run for 300,000 iterations for each country and item, with the initial 100,000 iterations discarded as burn-in. From the remaining 200,000 iterations, 20,000 samples were collected at 10-iteration intervals. Details of the jumping rules for the proposal distribution can be found in Section 2 of the Supplementary Material.
To assess the convergence and reliability of parameter estimates, we executed five parallel MCMC chains and computed the Gelman-Rubin
$\hat {R}$
statistics for all estimated parameters. Taking the CD Tally test item from the USA as an example, all
$\hat {R}$
values remained below the commonly accepted threshold of 1.1, indicating satisfactory between-chain convergence. A histogram of the
$\hat {R}$
values for all parameters in this case is presented in Figure 27 in Section 6 of the Supplementary Material. Convergence across parallel MCMC chains was further evaluated using visual diagnostics: trace plots of MCMC samples (included in Section 6 of the Supplementary Material) demonstrated stable and consistent estimation across chains, confirming the reliability of the posterior estimates.
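The Gelman-Rubin diagnostic can be computed directly from the parallel chains; a minimal sketch of the classic (non-split) version:

```python
import numpy as np

# Classic Gelman-Rubin R-hat from m parallel chains of length n (sketch).
def gelman_rubin(chains):
    x = np.asarray(chains, dtype=float)
    m, n = x.shape
    W = x.var(axis=1, ddof=1).mean()        # mean within-chain variance
    B = n * x.mean(axis=1).var(ddof=1)      # between-chain variance
    var_hat = (n - 1) / n * W + B / n       # pooled variance estimate
    return float(np.sqrt(var_hat / W))
```

Values near 1 (conventionally below 1.1) indicate that the chains are mixing over the same distribution.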
Figure 4 illustrates the process of interpreting results from the proposed method, including parameter estimation and comparison of transition probabilities between the correct and incorrect answer groups. The parameter estimation section summarizes estimated parameters including
$\boldsymbol {\kappa }$
,
$\boldsymbol {\tau }$
,
$\boldsymbol {\alpha }$
, and
$\boldsymbol {\beta }$
. For instance,
$\boldsymbol {\kappa } \in \{\kappa _{c_i,m,l} \}$
denotes the baseline likelihood of transitioning from action m to action l for group
$c_i$
, excluding covariate effects. We present the top and bottom 5
$\kappa _{c_i,m,l}$
values for each group. Furthermore,
$\boldsymbol {\alpha }$
represents covariate effects, while
$\boldsymbol {\beta }$
captures the impact of start and end actions being key actions. These estimates are tabulated with color-coded significance levels: blue for positive effects, red for negative effects, and black for non-significant effects. For statistical inference, we use posterior means and 95% Highest Posterior Density (HPD) intervals. Estimates whose HPD interval includes 0 are considered statistically insignificant.
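The HPD interval used for inference is the shortest interval containing the stated posterior mass; for a unimodal posterior it can be computed from sorted draws (a sketch with a hypothetical function name):

```python
import numpy as np

# Shortest interval containing `prob` of the posterior draws (valid for
# unimodal posteriors); used to flag estimates whose interval covers 0.
def hpd_interval(samples, prob=0.95):
    s = np.sort(np.asarray(samples, dtype=float))
    k = int(np.floor(prob * len(s)))
    widths = s[k:] - s[: len(s) - k]       # widths of all candidate intervals
    j = int(np.argmin(widths))             # index of the shortest one
    return float(s[j]), float(s[j + k])
```

An effect is flagged as significant when the returned interval lies entirely above or below zero.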

Figure 4 The process of interpreting the results of the proposed model, which consists of parameter estimation and comparison of transition probabilities between the correct and incorrect groups.
Following parameter estimation, we compare correct and incorrect groups by calculating the transition probabilities between actions for each group. The estimated differences in transition probabilities are derived from all MCMC samples. Statistically significant differences are identified by checking whether the 95% HPD interval includes zero.
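The group comparison can be sketched by converting each MCMC draw's hazards into transition probabilities and differencing the two groups (array names are illustrative; the paper applies 95% HPD intervals to these posterior difference draws to assess significance):

```python
import numpy as np

# For each posterior draw, normalize the hazards out of a given state into
# transition probabilities for the correct and incorrect groups, then take
# the correct-minus-incorrect difference. Rows are MCMC draws, columns are
# destination actions (sketch).
def prob_difference_draws(haz_correct, haz_incorrect):
    p1 = haz_correct / haz_correct.sum(axis=1, keepdims=True)
    p0 = haz_incorrect / haz_incorrect.sum(axis=1, keepdims=True)
    return p1 - p0
```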
We used both heatmap and network visualizations to highlight differences in transition probabilities. In the heatmap, significant positive differences are represented in blue, negative differences in red, and non-significant differences in white. For instance, a blue color indicates that the transition probability for a specific transition is significantly higher in the correct answer group than in the incorrect answer group. Finally, we employ a network approach, representing actions as nodes and transitions as directed edges. This network visualization makes it easy to identify notable differences in transition probabilities between actions. While the heatmap provides a complete and compact overview of all transitions between actions, its interpretability decreases when the number of actions is large and the matrix becomes dense. The network plot, on the other hand, selectively displays only meaningful or strong transitions, improving readability in complex settings. Thus, the two visualizations serve complementary purposes, with the network becoming particularly valuable as the number of actions increases.
Note that we fit the model to each country’s data separately; thus, direct comparisons of the parameter estimates across countries are not advisable. We present the results from all countries with the goal of identifying common patterns across countries.
5.1 CD Tally
5.1.1 Parameter estimation
Parameter
$\kappa _{c_i,m,l}$
The parameters
$\kappa _{c_i,m,l}$
denote the baseline hazard for transition from state m to l for correct (
$c_i=1$
) and incorrect (
$c_i=0$
) groups. The baseline hazard reflects the inherent probability of moving from action m to l when all covariates are zero. A higher value of
$\kappa _{c_i,m,l}$
indicates a more frequent and rapid transition between these states, independent of covariate effects.
Table 6 presents the five highest and lowest values of
$\kappa _{0,m,l}$
and
$\kappa _{1,m,l}$
for incorrect and correct groups, respectively, using USA as an example (results for other countries are available in Section 4 of the Supplementary Material). For the correct group (
$\kappa _{1,m,l}$
), the fastest transition is from “ss_sch” (clicking the search engine on a spreadsheet) to “keypress” (pressing the keyboard) with a value of 8.09, indicating rapid transitions from search-related tasks to keyboard input. Other top five transitions include “sch” (clicking the sort engine) to “keypress” (pressing the keyboard), “combobox” (interacting with a combobox on the webpage) to “ss” (switching to the spreadsheet), “wb” (switching to the webpage) to “ss” (switching to the spreadsheet), and “ss_data” (viewing spreadsheet data) to “so” (clicking the sort engine). Conversely, the bottom five transitions in the correct group predominantly originate from the spreadsheet state and involve auxiliary actions, including “ss_help” (clicking the help button for the spreadsheet), “help_a” (opening the general help page on how to answer), “split_h” (splitting the spreadsheet view horizontally), and “ss_save” (clicking the save button).
Table 6 The five highest and lowest
$\kappa _{1, m, l}$
and
$\kappa _{0, m, l}$
for correct and incorrect groups in the USA CD Tally test item

For the incorrect group (
$\kappa _{0,m,l}$
), the fastest transition is from “wb” (switching to the webpage) to “ss” (switching to the spreadsheet), with a value of 5.86. Other rapid transitions include “wb” to “combobox” and “ss_edit” (clicking edit on the menu) to “sch”. Similar to the correct group, the five slowest transitions for
$\kappa _{0,m,l}$
predominantly involve spreadsheet actions. This pattern suggests that transitions originating from “ss” actions occur at a slower pace in both the correct and incorrect groups.
Parameter
$\tau $
Parameter
$\tau _i$
represents the overall action transition speed of respondent i, with higher values indicating faster transitions. Figure 5 shows boxplots of the estimated
$\tau _i$
across countries (full country names are provided in Section 3 of the Supplementary Material). In Figure 5, the distributions are roughly symmetric and centered around 1 for all countries.

Figure 5 The boxplots of posterior means of individual transition speed parameters (
$\tau _i$
) across the 14 countries for CD Tally test item.
Note: Outliers omitted.
Parameters
$\boldsymbol {\alpha }$
The parameter
$\boldsymbol {\alpha }$
quantifies the impact of various covariates on the action transition speed, with larger values indicating faster transitions for respondents with higher covariate values. The estimated
$\boldsymbol {\alpha }$
is summarized in Table 7 with color-coded significance: blue for positive, red for negative, and black for non-significant. For the CD Tally test item, age consistently demonstrates a negative effect on transition speed across most countries, suggesting that older respondents solve problems more slowly. In contrast, “Eskill” positively influences speed, with more skilled individuals transitioning between actions more rapidly. Gender, education, and income percentile rank, however, generally show non-significant effects across countries.
Table 7 Posterior means of
$\boldsymbol {\alpha }$
for the CD Tally and Lamp Return test items

Note: Blue and red text colors represent significant positive and negative values, respectively.
Parameters
$\beta _1$
and
$\beta _2$
Parameters
$\beta _{c_i, 1}$
and
$\beta _{c_i, 2}$
represent the impact of key actions on transition speed within the hazard function for group
$c_i$
. For group
$c_i$
, a larger
$\beta _{c_i, 1}$
suggests a faster transition from the key action, while a larger
$\beta _{c_i, 2}$
indicates a quicker transition to the key action. Statistical significance is determined using the 95% HPD interval, with parameters whose intervals include zero considered statistically insignificant.
Table 8 presents the estimated
$\beta _{c_i, 1}$
and
$\beta _{c_i, 2}$
values for 14 countries, comparing results for the CD Tally and Lamp Return test items. Significant positive and negative effects are marked in blue and red, respectively. For the CD Tally test item, both the correct and incorrect groups tend to exhibit statistically insignificant
$\beta _{0,1}$
and
$\beta _{1,1}$
values across most countries, suggesting that transitions from key actions generally have negligible effects.
Table 8 Posterior means of key action effect (
$\beta _{c_i, 1}$
and
$\beta _{c_i, 2}$
) for CD Tally and Lamp Return test items

Note: Significant positive effects in blue, negative in red.
The analysis of
$\beta _{c_i, 2}$
, which represents the speed of transition to key actions, reveals an intriguing pattern. While the correct group demonstrates significantly positive
$\beta _{1, 2}$
values in most countries, the incorrect group shows significant
$\beta _{0, 2}$
in fewer countries. Notably, the magnitude of
$\beta _{1, 2}$
generally exceeds that of
$\beta _{0, 2}$
, suggesting that the correct group transitions to key actions more rapidly than the incorrect group.
5.1.2 Compare transition speed between two groups
To compare transition patterns between correct and incorrect answer groups, we analyzed differences in transition probabilities using parameters estimated from our proposed model. Transition probabilities were calculated using Equation 2, with covariates set to specific values: Gender = 1, Age = 5, Education = 2, IncPctRank = 4, and Eskill = 1. Statistical significance of differences was determined using 95% HPD intervals. Figure 6 illustrates these differences for the USA using two visualization methods: a heatmap (Figure 6a) and a network diagram (Figure 6b). These visualizations clearly show notable differences in action transition speeds between the two groups.

Figure 6 Transition probability differences between correct and incorrect groups for USA CD Tally test item: (a) heatmap and (b) network visualization.
Note: Blue indicates higher probability for correct group, red for incorrect. Color intensity in (a) and the arrow thickness in (b) represents difference magnitude between two groups. Key actions are highlighted in red text.
Figure 6a uses a color-coded heatmap to illustrate transition probability differences between correct and incorrect groups. The x and y axes represent the “to” and “from” action states, respectively, with key actions colored in red. Blue cells represent transitions where the correct answer group has significantly higher probabilities, while red cells indicate transitions where the incorrect group has significantly higher probabilities. White cells represent non-significant differences between the two groups. The color intensity corresponds to the magnitude of these differences. Notable patterns emerge, such as the blue cell for the “combobox” (selecting a combobox value) to “ss” (switching to the spreadsheet page) transition, indicating faster movement from “combobox” to “ss” by correct respondents. Conversely, red cells, such as the “sch_n” (clicking the next button in the search engine) to “keypress” (pressing a keyboard key) transition, suggest quicker progression by the incorrect group in certain actions.
Figure 6b presents a network diagram of actions (nodes) and transitions (directed edges), where key actions are colored in red text. To improve the clarity of the presentation, we displayed only statistically significant differences in the graph. Arrow thickness represents the magnitude of the probability differences, with blue arrows indicating faster transitions by the correct group and red arrows indicating faster transitions by the incorrect group. This visualization reveals that the correct answer group typically executes transitions that involve key actions, which are crucial to reaching the correct answer, more rapidly. Key examples include faster transitions from “so_2_asc” (sorting the spreadsheet in ascending order) to “so_ok” (clicking “Ok” after setting sorting options), “ss_so” (clicking the sort engine on the spreadsheet page) to “so_1_3” (sorting by the third column, Genre), and “so” (clicking the sort engine through the data menu on the spreadsheet page) to “so_1_3” (sorting by the third column, Genre). These transitions, which occur when sorting the spreadsheet, are critical steps toward the correct answer. However, the incorrect group showed faster execution of actions that might be less directly related to solving the problem. For instance, “wb_t_h” (clicking the help button on the website page toolbar) to “combobox” (selecting a combobox value) and “wb_f” (clicking the file menu on the website page) to “combobox” were performed more rapidly by the incorrect group.
5.2 Lamp Return
5.2.1 Parameter estimation
Parameter
$\kappa _{c_i,m,l}$
Table 9 shows the five highest and lowest baseline hazard parameters (
$\kappa _{c_i,m,l}$
) for the USA in the Lamp Return test item, indicating the transition probabilities from action m to l for correct (
$\kappa _{1,m,l}$
) and incorrect (
$\kappa _{0,m,l}$
) groups. The results for other countries are provided in Section 4 of the Supplementary Material. Higher
$\kappa _{c_i,m,l}$
values indicate a greater transition likelihood from action m to l for group
$c_i$
.
Table 9 The five highest and lowest
$\kappa _{1, m, l}$
and
$\kappa _{0, m, l}$
for correct and incorrect groups in the USA Lamp Return test item

For both correct and incorrect groups, the top five transitions are similar, primarily involving web page navigation. A notable example is the transition from “wb_pg_8_3” (Link to obtain authorization number on the Customer Service page) to “wb_pg_8_3_1” (Request authorization number on the Customer Service page), which exhibits high
$\kappa $
values of 20.54 and 20.15 for incorrect and correct groups, respectively. In contrast, the transitions from “wb_pg_8_3_1” and “wb_pg_pop2” (Click the close button on pop-up system message 2) have low
$\kappa $
values for both groups. Overall, for the Lamp Return test item, the baseline hazards were similar between the correct and incorrect answer groups.
Parameter
$\tau $
The parameter
$\tau _i$
quantifies the overall action transition speed for each respondent i, with higher values indicating faster transitions. Figure 7 represents boxplots of the estimated
$\tau _i$
across countries for the Lamp Return test item. Similar to the CD Tally test item, boxplots are also approximately symmetric and centered around 1 across all countries under baseline characteristics (i.e., when all covariates are set to zero).

Figure 7 The boxplots of posterior means of individual transition speed parameters (
$\tau _i$
) across the 14 countries for Lamp Return test item.
Note: Outliers omitted.
Parameters
$\boldsymbol {\alpha }$
The parameters
$\alpha $
represent the effect of individual covariates on the transition speed. Table 7 presents the estimated
$\boldsymbol {\alpha }$
values for the Lamp Return test item, where a color-coding scheme is applied to highlight statistical significance. The blue entries denote significantly positive effects, while the red entries indicate significantly negative effects. The analysis of covariate effects reveals consistent patterns across most countries for the Lamp Return test item. Gender, education level, and income level demonstrate positive significance, while age shows negative significance. These findings suggest that female respondents, younger test-takers, and individuals with higher education and income levels tend to exhibit faster action transition speeds.
Parameters
$\beta _1$
and
$\beta _2$
Table 8 presents the estimated values of
$\beta _{c_i, 1}$
and
$\beta _{c_i, 2}$
for the Lamp Return test item across 14 countries, with these parameters quantifying the impact of key actions on transition speeds. A larger
$\beta _{c_i, 1}$
indicates faster transitions from key actions, while a larger
$\beta _{c_i, 2}$
signifies quicker transitions to key actions for group
$c_i$
. In Table 8, color-coding is used to highlight statistical significance, with blue denoting significant positive effects and red indicating significant negative effects.
For the Lamp Return test item, Table 8 reveals consistent patterns across countries in the impact of key actions on transition speeds. Both
$\beta _{0, 1}$
and
$\beta _{1, 1}$
show negative significance universally, indicating slower transitions from key actions, while
$\beta _{0, 2}$
and
$\beta _{1, 2}$
display positive significance in most countries, suggesting faster transitions to key actions for both groups. Notably, in many countries, the absolute values of
$\beta _{1,1}$
and
$\beta _{1,2}$
slightly exceed those of
$\beta _{0,1}$
and
$\beta _{0,2}$
, respectively. This pattern suggests that the correct group generally moves away from key actions more slowly but approaches them more quickly than the incorrect group.
5.2.2 Compare transition speed between two groups
We compared transition patterns between correct and incorrect groups for the Lamp Return test item using the estimated parameters and applied a 95% HPD interval to assess the statistical significance of transition probability differences. Figure 8 illustrates these differences using a network diagram based on specific covariate values (Gender = 1, Age = 5, Education = 2, IncPctRank = 4, Eskill = 1). Only significant differences in transition probabilities between the two groups are shown in Figure 8. Due to the large number of actions in the Lamp Return test item, we focus on the network visualization results here.

Figure 8 Network visualization of transition probability differences for the Lamp Return test item in the USA.
Note: (a) shows significantly higher transition probabilities for the correct group (blue arrows), while (b) shows higher probabilities for the incorrect group (red arrows). The arrow thickness indicates the magnitude of the difference in transition probability between the two groups. Key actions are highlighted with red text.
Figures 8a and 8b show the transitions with significantly higher probabilities for the correct and incorrect groups, respectively, in the Lamp Return task. Figure 8a reveals that the correct group exhibited faster transitions crucial to solving the problem, such as “wb_pg_8” (Link to Customer Service page) to “wb_pg_8_4” (Link to view return form on the Customer Service page), “wb_pg_8_4” to “wb_pg_8_4_reason_4” (Select the second reason for return (Wrong item shipped)), “wb_pg_8_4_request_1” (Select a first request for returned items (Exchange for the correct item)) to “em” (Switch to Email page), and “em” to “em_m_view_305” (View email 305 (confirm authorization number) on the email page). These transitions represent key steps in the return process, including finding the authorization number via email, selecting a reason for returning an item, and submitting a return request. Conversely, Figure 8b shows that the incorrect group had faster transitions primarily related to general website navigation, such as transitions involving “wb_pg_2_1” (Link to a first sub-page on the Desk Lamps page), “wb_pg_5” (Link to the New Arrivals page), “wb_pg_7” (Link to the Customer Comments page), and “wb_pg_8” (Link to the Customer Service page). This contrast suggests that while the correct group more efficiently executed task-critical actions, the incorrect group engaged in rapid but less targeted website exploration.
The analysis of transition patterns reveals a clear distinction between successful and unsuccessful problem-solving strategies in the Lamp Return task. Respondents who provided correct answers demonstrated faster transitions through task-critical actions directly related to the solution, while those who answered incorrectly exhibited quicker transitions in general website navigation but slower progression through solution-specific steps.
6 Model evaluation and robustness
6.1 Sensitivity analysis
In Bayesian analysis, prior distributions can have a significant impact on posterior inferences, especially when data is limited or the model structure is complex. Consequently, it is crucial to assess how sensitive the results are to different prior specifications (Depaoli et al., Reference Depaoli, Winter and Visser2020). Conducting a prior sensitivity analysis enables us to test the robustness of our findings by comparing posterior estimates across various prior distributions. This process ensures that the results are not unduly influenced by particular prior choices, thus improving the reliability of the conclusions.
We performed a prior sensitivity analysis on a set of key model parameters $\{\boldsymbol{\kappa}, \boldsymbol{\tau}, \boldsymbol{\alpha}, \boldsymbol{\beta}\}$ to evaluate the robustness of the posterior estimates under different levels of prior informativeness. To investigate the effect of prior informativeness, we created a range of specifications, from highly informative to weak or non-informative priors, by adjusting the hyperparameters. For $\boldsymbol{\beta}$ and $\boldsymbol{\alpha}$, we set normal priors with standard deviations $\sigma_\beta$ and $\sigma_\alpha$ set to 0.5, 1, 2, 5, and 10. For $\boldsymbol{\kappa}$ and $\boldsymbol{\tau}$, we considered Gamma$(a, b)$ priors with the following shape and scale combinations: (1, 1), (0.5, 2), (0.1, 10), (0.01, 100), and (0.001, 1000).
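Under the shape–scale parameterization stated above, these five Gamma combinations all keep the prior mean at $ab = 1$ while the variance $ab^2$ grows from 1 to 1000, i.e., the prior moves from informative toward nearly flat around the same center. A quick check of this design (a small sketch, not part of the analysis code):

```python
from scipy.stats import gamma

# Shape-scale combinations from the sensitivity analysis; each keeps the
# prior mean at a*b = 1 while the variance a*b**2 = b grows, moving from
# an informative prior toward a nearly flat one.
combos = [(1, 1), (0.5, 2), (0.1, 10), (0.01, 100), (0.001, 1000)]

for a, b in combos:
    dist = gamma(a, scale=b)
    print(f"Gamma({a}, {b}): mean = {dist.mean():.1f}, variance = {dist.var():.1f}")
```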
Table 10 shows the posterior estimates for $\boldsymbol{\beta}$ across a range of prior standard deviations, $\sigma_\beta \in \{0.5, 1, 2, 5, 10\}$. The results suggest that increasing $\sigma_\beta$ from 0.5 to 2 leads to changes in posterior means, especially for $\beta_{1,2}$ and $\beta_{0,2}$. However, with $\sigma_\beta \geq 2$, the posterior estimates remain relatively stable for all $\boldsymbol{\beta}$ parameters, with minimal changes in both central tendency and dispersion. These findings suggest that the inferences on $\boldsymbol{\beta}$ can be robust to prior specification with $\sigma_\beta = 2$. Accordingly, we selected $\sigma_\beta = 2$ for the main analyses to balance prior flexibility with inferential stability. Trace plots for each prior setting, available in the project’s GitHub repository, further confirm consistent convergence and reliable posterior recovery.
Table 10 Results of prior sensitivity analysis of $\beta_{1, 1}$, $\beta_{1, 2}$, $\beta_{0, 1}$, and $\beta_{0, 2}$ for USA CD Tally test item

Note: The table summarizes the posterior means, standard deviations, and 95% HPD intervals under different prior standard deviations, $\sigma_{\beta}$.
The results for the remaining parameters, detailed in Section 7 of the Supplementary Material, demonstrate that posterior means, standard deviations, and 95% HPD intervals remained largely stable across different prior hyperparameters, indicating robustness to reasonable variations in prior settings.
6.2 Simulation study
Next, we conducted a simulation study to evaluate the estimation accuracy of our proposed MSM approach. We designed four simulation scenarios, each representing a different level of group heterogeneity and key action effects. The scenarios are summarized in Table 11, highlighting differences in covariate effects, key action effects, and the presence or absence of group-level distinctions.
Table 11 Summary of simulation scenarios

For all scenarios, we generated data for 400 individuals, each performing a sequence of actions selected from a set of 50 possible actions, which included 10 predefined key actions. We assumed that all transitions between actions were possible, as physical or interface constraints limiting transitions between actions were not available in the simulation. While this simplification departs somewhat from real-world settings, it still allows us to assess the model’s statistical properties in a controlled simulation environment. We included five covariates, each independently drawn from a normal distribution with a mean of 0 and a variance of 4. The number of actions per individual was sampled from a negative binomial distribution with parameters $r = 5$ and $p = 1/3$, selected to approximate the over-dispersion in action counts observed in real log data. Further details of the simulation scenarios are provided in Section 8 of the Supplementary Material.
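The data-generating setup described above can be sketched as follows. This is a simplified illustration under stated assumptions (all variable names are ours, and the uniform action draws stand in for sampling from the model-implied transition matrix, which the actual simulation uses):

```python
import numpy as np

rng = np.random.default_rng(42)

N_PERSONS, N_ACTIONS, N_KEY, N_COV = 400, 50, 10, 5

# Five covariates, each N(0, 4), i.e., standard deviation 2.
X = rng.normal(loc=0.0, scale=2.0, size=(N_PERSONS, N_COV))

# Treat the first 10 action codes as the predefined key actions.
key_actions = np.arange(N_KEY)

# Sequence lengths from a negative binomial with r = 5, p = 1/3
# (mean r(1-p)/p = 10), mimicking over-dispersed action counts.
seq_lengths = rng.negative_binomial(5, 1 / 3, size=N_PERSONS)

# With no interface constraints, every transition is possible: draw each
# person's action sequence uniformly over the 50 actions as a placeholder
# for sampling from the true transition probabilities.
sequences = [rng.integers(0, N_ACTIONS, size=max(n, 1)) for n in seq_lengths]

print(X.shape, round(seq_lengths.mean(), 1), len(sequences))
```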
To evaluate estimation performance across the simulation scenarios, we compared the estimated transition probabilities with the true probabilities used in data generation. For each simulation run:

1. we computed the MSE for each individual by comparing their estimated and true $50 \times 50$ transition probability matrices, averaging the squared differences over all 2,500 entries;
2. we then averaged the individual-level MSEs to obtain a single summary measure for the simulation run;
3. this procedure was repeated 200 times per scenario.
The resulting distributions of simulation-level MSEs were visualized using boxplots to assess estimation accuracy. Lower MSE values indicate more precise recovery of the true transition structure.
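The two averaging steps above can be sketched as follows. This is a minimal illustration with toy matrices; the matrix-generation helpers are placeholders for the model's true and estimated transition matrices, not the actual estimator.

```python
import numpy as np

def individual_mse(est, true):
    """Average element-wise squared difference between two transition matrices."""
    return np.mean((est - true) ** 2)

def run_mse(est_mats, true_mats):
    """Simulation-run summary: mean of the individual-level MSEs."""
    return np.mean([individual_mse(e, t) for e, t in zip(est_mats, true_mats)])

rng = np.random.default_rng(1)

def random_transition_matrix(k=50):
    """Toy row-stochastic matrix standing in for a true transition matrix."""
    m = rng.random((k, k))
    return m / m.sum(axis=1, keepdims=True)

true_mats = [random_transition_matrix() for _ in range(5)]
# 'Estimates' are the truth plus small noise, mimicking posterior estimates.
est_mats = [t + rng.normal(0, 1e-3, t.shape) for t in true_mats]

print(f"run-level MSE: {run_mse(est_mats, true_mats):.2e}")
```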
As shown in Figure 9, the model demonstrates strong recovery performance across all four simulation scenarios. The mean estimation errors for Scenarios 1 through 4 were $3.5458 \times 10^{-4}$, $3.5458 \times 10^{-4}$, $4.9466 \times 10^{-4}$, and $3.7993 \times 10^{-4}$, respectively. Scenarios 3 and 4, which include varying levels of group heterogeneity, exhibited slightly higher errors than Scenarios 1 and 2. However, estimation errors were minimal and tightly concentrated around zero across all conditions, indicating excellent parameter recovery. These results suggest that the proposed MSM approach is robust to increasing model complexity and effectively recovers transition structures under a range of realistic conditions.

Figure 9 Boxplots summarizing the estimation errors across 200 simulation replications for each scenario.
Note: Estimation error is defined as the mean squared error (MSE), calculated by averaging the element-wise squared differences between the estimated and true transition probability matrices for each individual, and then averaging across individuals per run. Each boxplot reflects the distribution of these simulation-level MSEs under varying levels of group heterogeneity and covariate effects.
7 Conclusion
7.1 Summary
The PIAAC, conducted by the OECD, evaluates adults’ literacy, numeracy, and PSTRE. During the PSTRE assessment, user interactions with the computer, such as button clicks, link visits, dragging, dropping, copying, and pasting, are recorded in sequence with timestamps. Such recorded sequences of events and actions constitute log data, also called process data. Various methodologies have been developed to analyze log data. Studies have utilized timing data, such as overall evaluation times or event timestamps, for clustering or group comparisons.
Research on test-taking behavior has increasingly recognized the crucial role of time information, extending beyond total task duration to include transition times between actions in log data (Engelhardt & Goldhammer, Reference Engelhardt and Goldhammer2019; Goldhammer et al., Reference Goldhammer, Naumann, Stelter, Tóth, Rölke and Klieme2014; He et al., Reference He, Borgonovi and Paccagnella2019; Scherer et al., Reference Scherer, Greiff and Hautamäki2015; Voros & Rouet, Reference Voros and Rouet2016). The analysis of transition times between actions can provide unique insights into respondents’ cognitive processing, engagement levels, and problem-solving strategies. By examining the dynamics of action transitions, we can also learn how individuals navigate complex tasks, identify potential areas of difficulty, and illuminate the subtle differences in problem-solving approaches between correct and incorrect response groups (Chen, Reference Chen2020; Fu et al., Reference Fu, Zhan, Chen and Jiao2023; Ulitzsch et al., Reference Ulitzsch, He and Pohl2021; Wang et al., Reference Wang, Tang, Liu and Ying2022; Xu et al., Reference Xu, Fang and Ying2020).
In this study, we applied an MSM to analyze log data, treating user actions as distinct states to capture the dynamic nature of problem-solving processes. To the best of our knowledge, this is a unique application of the model, which can therefore be seen as an important contribution of our work.
With the MSM, we focus on examining how key actions and covariates influence transition speeds and patterns, particularly between correct and incorrect response groups. By employing the $\chi^2$ statistical approach to identify key actions (He & von Davier, Reference He and von Davier2015), we differentiate between actions associated with correct and incorrect responses. This novel application of the MSM to log data analysis identifies specific actions that significantly impact transition speeds and overall performance.
We applied the proposed model to two problem-solving test items: CD Tally and Lamp Return. The baseline hazard parameters ($\boldsymbol{\kappa}$ and $\boldsymbol{\tau}$) for correct and incorrect groups were similar across both items, indicating minimal differences in underlying transition likelihoods. Among 17,441 respondents, 4,233 (24.27%) completed both items, with a weak but statistically significant correlation (0.1851, p-value $< 2.2 \times 10^{-16}$) between their $\boldsymbol{\tau}$ values.
Covariate effects, $\boldsymbol{\alpha}$, differed between the two test items; for the CD Tally test item, younger age and higher computer skills (“Eskill”) correlated with faster transitions, while for the Lamp Return test item, all covariates except Eskill were significant. The impact of key actions, $\boldsymbol{\beta}$, also varied between the two test items; in the CD Tally test item, transitions were faster for the correct group when both the origin and destination actions were key actions, while in the Lamp Return test item, transitions were quicker from non-key actions to key actions, independent of correctness. Notably, comparing the transition probabilities between correct and incorrect groups revealed that correct respondents were quicker in transitions closely related to the correct answer for both the CD Tally and Lamp Return test items. The observed differences between the two items can be attributed to the fundamentally distinct nature of their task environments and cognitive demands. The CD Tally item presents a structured problem-solving task within a spreadsheet environment, where transitions between actions follow more systematic and efficiency-driven patterns. In this constrained environment, Eskill emerges as a significant predictor of transition speed, suggesting that technical proficiency directly influences the efficiency of navigating through spreadsheet-based problem-solving steps. In contrast, the Lamp Return item involves navigating through a simulated e-commerce website with multiple pathways and potential solutions, creating a more exploratory task environment. Here, significant predictors shift to demographic variables such as gender and income, which likely reflect broader differences in online shopping experience, information-seeking preferences, and decision-making approaches in less structured environments. The diminished influence of Eskill in this context suggests that general digital literacy may be less predictive of performance than specific domain familiarity and decision-making tendencies. Regarding the differences in key action effects ($\beta_{c_i, \cdot}$), our analysis reveals item-specific patterns that warrant further investigation. These differences likely stem from a combination of interface design elements, cognitive processing requirements, and the nature of successful strategies unique to each task.
7.2 Outlook
While this article focuses on PIAAC log data, the proposed MSM has broad applicability across various domains. It can be used to analyze students’ learning processes on online teaching and learning platforms, patient interactions with digital health platforms, consumer behavior in e-commerce, and employee task navigation in workplace training scenarios. This versatility allows for the identification of critical decision-making points, engagement patterns, and skill development opportunities in each respective field.
We would like to focus on several areas to further improve the proposed MSM for log data analysis in the future. First, further investigation is needed into network analysis techniques, which were used in this article to visualize transition probabilities, as they could provide richer insights into behavioral differences between correct and incorrect answer groups. Second, developing a robust model fit evaluation method is crucial for ensuring model reliability and applicability across diverse datasets. Third, improving and standardizing log data pre-processing methods is essential, as variations in pre-processing can significantly affect results and their interpretation. Fourth, incorporating a hierarchical structure into the model would enable meaningful cross-country comparisons, offering insights into how problem-solving behaviors vary across different cultural contexts and providing more comprehensive guidance for educational and policy interventions on a global scale. Finally, extending the sensitivity analysis beyond a single-country dataset and designing more realistic simulation scenarios would strengthen the generalizability of the findings and offer additional validation of the model’s robustness under diverse empirical conditions.
In conclusion, MSMs offer unique analytical capabilities that extend beyond traditional methods. By modeling both the sequence and timing of transitions between cognitive states, MSMs capture the dynamic structure of problem-solving processes, revealing that correct and incorrect respondents diverge not only in their outcomes but also in their underlying behavioral trajectories. Unlike standard sequence analysis, MSMs further quantify the duration spent in each state, enabling the identification of cognitive bottlenecks, hesitation patterns, and differences in strategic execution. In particular, our findings highlight specific transition points, where incorrect respondents exhibit prolonged delays—patterns that would remain obscured in aggregated response times or accuracy metrics. Moreover, MSMs permit the inclusion of covariates to examine how learner characteristics influence not just performance outcomes but also the problem-solving process itself. Our results demonstrate that certain demographic variables predict distinct strategic behaviors, underscoring the potential of MSMs to inform more targeted and individualized educational interventions.
Supplementary material
The supplementary material for this article can be found at https://doi.org/10.1017/psy.2025.10043.
Data availability statement
The datasets generated during and/or analyzed during the current study are available in the GESIS, https://doi.org/10.4232/1.12955.
Acknowledgement
We thank the editor, associate editor, and reviewers for their constructive comments.
Funding statement
This work was supported by the National Research Foundation of Korea [grant number NRF-2020R1A2C1A01009881, NRF-2021S1A3A2A03088949, RS-2023-00217705, and RS-2024-00333701; Basic Science Research Program awarded to I.H.J.], [grant number NRF-2025S1A5C3A0200632411; awarded to I.H.J. and M.J.] and Yonsei University Research Grant of 2024 [awarded to I.H.J.].
Competing interests
The authors declare none.