Hostname: page-component-745bb68f8f-s22k5 Total loading time: 0 Render date: 2025-01-11T06:02:17.582Z Has data issue: false hasContentIssue false

Random Effects Multinomial Processing Tree Models: A Maximum Likelihood Approach

Published online by Cambridge University Press:  01 January 2025

Steffen Nestler*
Affiliation:
Universität Münster
Edgar Erdfelder*
Affiliation:
Universität Mannheim
*
Correspondence should be made to Steffen Nestler, Institut für Psychologie, Universität Münster, Fliednerstr. 21, 48149Münster, Germany. Email: steffen.nestler@uni-muenster.de
Correspondence should be made to Edgar Erdfelder, Universität Mannheim, Fakultät für Sozialwissenschaften A5, 68159Mannheim, Germany. Email: erdfelder@uni-mannheim.de
Rights & Permissions [Opens in a new window]

Abstract

The present article proposes and evaluates marginal maximum likelihood (ML) estimation methods for hierarchical multinomial processing tree (MPT) models with random and fixed effects. We assume that an identifiable MPT model with S parameters holds for each participant. Of these S parameters, R parameters are assumed to vary randomly between participants, and the remaining S-R\documentclass[12pt]{minimal}\usepackage{amsmath}\usepackage{wasysym}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{amsbsy}\usepackage{mathrsfs}\usepackage{upgreek}\setlength{\oddsidemargin}{-69pt}\begin{document}$$S-R$$\end{document} parameters are assumed to be fixed. We also propose an extended version of the model that includes effects of covariates on MPT model parameters. Because the likelihood functions of both versions of the model are too complex to be tractable, we propose three numerical methods to approximate the integrals that occur in the likelihood function, namely, the Laplace approximation (LA), adaptive Gauss–Hermite quadrature (AGHQ), and Quasi Monte Carlo (QMC) integration. We compare these three methods in a simulation study and show that AGHQ performs well in terms of both bias and coverage rate. QMC also performs well but the number of responses per participant must be sufficiently large. In contrast, LA fails quite often due to undefined standard errors. We also suggest ML-based methods to test the goodness of fit and to compare models taking model complexity into account. The article closes with an illustrative empirical application and an outlook on possible extensions and future applications of the proposed ML approach.

Type
Theory & Methods
Creative Commons
Creative Common License - CCCreative Common License - BY
This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
Copyright
Copyright © 2023 The Author(s)

Multinomial processing tree (MPT) models are stochastic models for categorical data frequently used in different branches of behavioral science, primarily in cognitive psychology and social cognition research (for overviews and a recent tutorial see Batchelder & Riefer, Reference Batchelder and Riefer1999; Erdfelder et al., Reference Erdfelder, Auer, Hilbig, Assfalg, Moshagen and Nadarevic2009, Reference Erdfelder, Hu, Rouder and Wagenmakers2020; Hütter & Klauer, Reference Hütter and Klauer2016; Schmidt et al., Reference Schmidt, Erdfelder and Heck2023). They have been applied to a wide range of phenomena, including, for example, recognition memory (e.g., Riefer & Batchelder, Reference Riefer and Batchelder1995; Xu & Bellezza, Reference Xu and Bellezza2001), source monitoring (e.g., Meiser & Broder, Reference Meiser and Broder2002), recall memory (Batchelder & Riefer, Reference Batchelder and Riefer1986), and judgmental illusions, such as the hindsight bias (Erdfelder and Buchner, Reference Erdfelder and Buchner1998; Nestler and Egloff, Reference Nestler and Egloff2009; Nestler et al., Reference Nestler, Egloff, Küfner and Back2012). Specifically, MPT models are cognitive process models that refer to a particular experimental task or paradigm in which participants’ judgments are categorized into a well-defined set of responses. It is assumed that the observed frequencies of responses who fall into these categories follow a multinomial distribution and that the probabilities underlying these frequencies are determined by latent cognitive processes that drive observed response behavior.

The primary goal of fitting MPT models to observed response frequencies is to estimate the cognitive process parameters, that is, the latent probabilities that certain cognitive processes were successful or not (e.g., memory processes, such as encoding, storage, or retrieval). Furthermore, by estimating models that impose psychologically motivated restrictions on these parameters (e.g., equality constraints or parameter fixations), model comparisons can be used to statistically test psychological assumptions and cognitive hypotheses that are linked to the model.

In most past applications, MPT models have been estimated by aggregating observed category frequencies across participants. This approach presumes that individual differences in cognitive process parameters can be neglected. However, when there are substantial individual differences, parameter estimates may be biased, and the results of inferential statistical procedures might not be optimal (e.g., the standard errors of the parameter estimates may be misleading, Batchelder & Riefer, Reference Batchelder and Riefer1999; Erdfelder et al., Reference Erdfelder, Auer, Hilbig, Assfalg, Moshagen and Nadarevic2009; Klauer, Reference Klauer2006; Reference Klauer2010; Smith & Batchelder, Reference Smith and Batchelder2010). In addition to these statistical problems, the assumption that there are no between-person differences in relevant cognitive processes seems rather implausible (Lee & Webb, Reference Lee and Webb2005; Smith & Batchelder, Reference Smith and Batchelder2010). Such an assumption also precludes the exploration of interesting research questions about the origin of individual differences in process parameters and about their relationships with other model parameters as well as covariates (e.g., Coolin et al., Reference Coolin, Erdfelder, Bernstein, Thornton and Thornton2015, Reference Coolin, Erdfelder, Bernstein, Thornton and Thornton2016). Hence, it is desirable to quantify potential interindividual differences and to investigate which person variables can explain them.

A number of extensions have therefore been proposed to model and predict the heterogeneity of parameters between individuals. First, Klauer (Reference Klauer2006) suggested a latent class MPT framework in which a single person is assumed to be a member of one latent class, and model parameters may differ between latent classes. Second, Smith and Batchelder (Reference Smith and Batchelder2010) proposed a hierarchical MPT model in which participants’ model parameters are assumed to be sampled from independent beta distributions (i.e., a beta distribution for each MPT model parameter; hence the name “beta-MPT model”). More recently, Klauer (Reference Klauer2010) introduced another hierarchical extension of the standard MPT model that allows researchers to capture not only the variability in the (probit-transformed) parameters across individuals but also the covariances or correlations between these parameters. This latent-trait MPT model is based on the assumption that the person parameters stem from a multivariate normal distribution, and their expectations and covariance matrix are estimated from the observed frequency data.

Klauer (Reference Klauer2010) suggested a Bayesian approach for parameter estimation and inferences in which the posterior distribution is approximated using Monte Carlo-Markov chain methods (see Heck et al., Reference Heck, Arnold and Arnold2018a, for an implementation in R). In the present article, we show how the parameters of the latent-trait MPT model can be obtained through marginal maximum likelihood (ML) estimation. Our implementation of the ML method introduces a frequentist approach for hierarchical MPT models that is perhaps more familiar to researchers because standard errors, confidence intervals, and goodness-of-fit tests can be computed on the basis of well-known asymptotic properties of ML estimates. Moreover, in addition to the asymptotic optimality of ML estimates (Reed and Cressie, Reference Reed and Cressie1988), the subtleties of specifying appropriate prior distributions for Bayesian estimation can be avoided when using the ML method. In particular, one does not have to worry about whether and how the obtained estimates and model comparisons are affected by the prior (for a recent discussion, see Sarafoglou et al., Reference Sarafoglou, Kuhlmann, Aust and Haaf2022) or whether de-facto equivalent models may become nonequivalent as a consequence of the choice of the prior (Kellen & Klauer, Reference Kellen and Klauer2020). Importantly, the most efficient numerical ML algorithm is also typically faster than a Bayesian estimator, and the convergence of the ML algorithm is simpler to determine.

1. The Pair-Clustering Model

Before we introduce MPT models and the ML estimation methods in more detail, we briefly describe the pair-clustering model (e.g., Batchelder & Riefer, Reference Batchelder and Riefer1980, Reference Batchelder and Riefer1986) that we use throughout the article to illustrate the proposed methods. The pair-clustering model can be used to analyze data from a free-recall experiment in which participants are presented with a list of word pairs plus a number of singletons. All words are preselected to be equally difficult in terms of memorizability. Each word pair consists of semantically related words (e.g., rose–tulip), whereas singletons are unrelated to other words in the list. These pairs and the singletons are presented one word at a time, and participants are later asked to recall the items from the list in any order. On the basis of participants’ free recall performance, the studied word pairs and singletons are then assigned to one of six response categories. It is assumed that the probability of each response category can be modeled by two processing trees that include a total of four parameters (see Fig. 1 for a graphical illustration of the model).

Figure 1. The pair-clustering MPT model (adapted from Riefer & Batchelder, Reference Riefer and Batchelder1988, p. 330, Figure 2). Rectangles indicate stimulus classes (left) and observable responses (right). Rectangles with rounded corners represent latent cognitive states. Parameters attached to the branches indicate transition probabilities from left to right, specifically, storing a word pair as a cluster (c), retrieving a stored cluster in free recall (r), storing and retrieving a word from a non-clustered pair in free recall (u), and storing and retrieving a singleton in free recall (a).

Specifically, the model defines four response categories for the word pairs: Category C 11 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$C_{11}$$\end{document} includes all cases where both words in a pair are recalled adjacently, C 12 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$C_{12}$$\end{document} represents cases in which both words are recalled nonadjacently, C 13 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$C_{13}$$\end{document} corresponds to cases where exactly one word is recalled, and finally C 14 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$C_{14}$$\end{document} represents cases in which neither word from the pair is recalled. In addition, the singletons are scored in two categories: singletons recalled successfully ( C 21 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$C_{21}$$\end{document} ) versus not successfully ( C 22 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$C_{22}$$\end{document} ).

The pair-clustering model proposes four cognitive process parameters that jointly determine the response probabilities: (1) the probability c that a word pair is stored as a cluster in memory, (2) the probability r that a stored cluster is successfully retrieved in free recall, (3) the probability u that one word from a pair that is not stored as part of a cluster is successfully retrieved, and finally, (4) the probability a that a singleton is successfully stored and retrieved. These parameters can be used to derive response probabilities for each of the six categories by calculating the product of all parameters along a branch and then summing up all branch probabilities that refer to the same category. For example, because there is only one branch that terminates in category C 11 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$C_{11}$$\end{document} (see Fig. 1), the probability of this category (i.e., both words recalled adjacently) is just the product c · r \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$c \cdot r$$\end{document} (i.e., successful storage as a cluster followed by successful retrieval of this cluster). Applying the same logic to all branches in Fig. 1 results in the following six model equations:

(1) P ( C 11 ) = c · r P ( C 12 ) = ( 1 - c ) · u 2 P ( C 13 ) = ( 1 - c ) · u · ( 1 - u ) + ( 1 - c ) · ( 1 - u ) · u P ( C 14 ) = c · ( 1 - r ) + ( 1 - c ) · ( 1 - u ) 2 P ( C 21 ) = a P ( C 22 ) = 1 - a \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\begin{aligned} P(C_{11})&= c \cdot r \nonumber \\ P(C_{12})&= (1 - c) \cdot u^2 \nonumber \\ P(C_{13})&= (1 - c) \cdot u \cdot (1 - u) + (1 - c) \cdot (1 - u) \cdot u \nonumber \\ P(C_{14})&= c \cdot (1 - r) + (1 - c) \cdot (1 - u)^2 \nonumber \\ P(C_{21})&= a \nonumber \\ P(C_{22})&= 1 - a \end{aligned}$$\end{document}

The goal of MPT modeling, applied to the pair-clustering model, is to estimate the four cognitive process parameters and to test reasonable hypotheses about these parameters. One example of such a hypothesis would be that unclustered words from a pair behave like singletons in memory, that is, they are stored and retrieved with the same probability (i.e., u = a \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$u = a$$\end{document} ).

In the next section, we begin with a formal introduction of a typical MPT model without random effects. This is followed by an outline of the latent-trait MPT model with some extensions. In the third section, we present three methods of marginal ML estimation for the latent-trait model. Next, we describe the results of a small simulation study in which we compare the different methods with respect to accuracy and speed. In Sect. 3.2, we present an application of our method to an illustrative example using real data. In the final section, we discuss possible extensions and open questions for future research.

2. MPT Models Without Person-Level Random Effects

The pair-clustering model we just described is a specific instance of an MPT model. On a more general level, MPT models are statistical models of response frequencies of mutually exclusive, independent categories. To define the model, we assume that there are k = 1 , , K \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$k = 1, \ldots , K$$\end{document} category systems and that each category system consists of j = 1 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$j = 1$$\end{document} , \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\ldots $$\end{document} , J k \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$J_k$$\end{document} categories C kj \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$C_{kj}$$\end{document} . In the pair-clustering model, there are K = 2 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$K = 2$$\end{document} category systems, one for the word pairs and one for the singletons. The first system consists of four categories: C 11 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$C_{11}$$\end{document} , C 12 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$C_{12}$$\end{document} , C 13 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$C_{13}$$\end{document} , and C 14 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$C_{14}$$\end{document} ; and the singleton system contains two categories: C 21 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$C_{21}$$\end{document} and C 22 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$C_{22}$$\end{document} . We further assume that data from t = 1 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$t = 1$$\end{document} , \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\ldots $$\end{document} , T individuals is available. For each single individual t, the data for category system k are given as a vector of frequencies, n kt = ( n k 1 t , , n k J k t ) \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\varvec{n}_{kt} = (n_{k1t}, \ldots , n_{kJ_{k}t})$$\end{document} . For instance, n 1 t \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\varvec{n}_{1t}$$\end{document} contains the four frequencies of person t for the four word pair categories (i.e., k = 1). These frequencies are assumed to stem from a multinomial distribution

(2) f ( n kt | θ ) = N kt n k 1 t . . . n k J k t p k 1 t n k 1 t · p k 2 t n k 2 t p k J k t n k J k t = N kt n k 1 t . . . n k J k t j = 1 J k p kjt n kjt \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\begin{aligned} f(\varvec{n}_{kt}|\varvec{\theta }) = \begin{pmatrix} N_{kt} \\ n_{k1t} ... n_{kJ_{k}t} \end{pmatrix} p_{k1t}^{n_{k1t}} \cdot p_{k2t}^{n_{k2t}} \cdots p_{kJ_{k}t}^{n_{kJ_{k}t}} = \begin{pmatrix} N_{kt} \\ n_{k1t} ... n_{kJ_{k}t} \end{pmatrix} \prod _{j = 1}^{J_{k}} p_{kjt}^{n_{kjt}} \end{aligned}$$\end{document}

where N kt \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$N_{kt}$$\end{document} = n k 1 t \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$n_{k1t}$$\end{document} +... + n k J k t \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$n_{kJ_{k}t}$$\end{document} and p k 1 t + + p k J k t = 1 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$p_{k1t} + \cdots + p_{kJ_{k}t} = 1$$\end{document} . This definition requires the vector of all frequencies across all category systems K of person t, that is, n t = ( n 1 t , . . . , n Kt ) \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\varvec{n}_{t} = (\varvec{n}_{1t},..., \varvec{n}_{Kt})$$\end{document} , to follow a product multinomial distribution:

(3) f ( n t | θ ) = k = 1 K N kt n k 1 t . . . n k J k t j = 1 J k p kjt n kjt . \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\begin{aligned} f(\varvec{n}_{t}|\varvec{\theta }) = \prod _{k = 1}^{K} \begin{pmatrix} N_{kt} \\ n_{k1t} ... n_{kJ_{k}t} \end{pmatrix} \prod _{j = 1}^{J_{k}} p_{kjt}^{n_{kjt}} . \end{aligned}$$\end{document}

The idea behind MPT models is that the probability of the occurrence of an event category can be modeled as a function of the cognitive process parameters. In the standard model (i.e., without person-level random effects), these process parameters θ \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\varvec{\theta }$$\end{document} are unknown constants that do not vary between individuals. Thus, p kjt \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$ p_{kjt}$$\end{document} is written as

(4) p kjt = P ( C kj | θ ) \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\begin{aligned} p_{kjt} = P(C_{kj} | \varvec{\theta } ) \end{aligned}$$\end{document}

where θ \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\varvec{\theta }$$\end{document} contains the S cognitive process parameters and is thus an element of [ 0 , 1 ] S \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$[0, 1]^S$$\end{document} , s = 1 , . . . , S \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$s = 1,..., S$$\end{document} . In the example, θ = ( c , r , u , a ) \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\varvec{\theta } = (c, r, u, a)$$\end{document} , and the probability of C 14 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$C_{14}$$\end{document} (see above) is p 14 t = P ( C 14 | θ ) = c ( 1 - r ) + ( 1 - c ) ( 1 - u ) 2 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$p_{14t} = P( C_{14} | \varvec{\theta } ) = c(1-r) + (1-c)(1-u)^2$$\end{document} . To write P ( C kj | θ ) \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$P( C_{kj} | \varvec{\theta } )$$\end{document} more generally, we define I kj \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$ I_{kj}$$\end{document} to be the number of branches of the tree that terminate in C kj \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$C_{kj}$$\end{document} , with i = 1 , . . . , I kj \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$i = 1,..., I_{kj}$$\end{document} indexing a specific branch B kji \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$B_{kji}$$\end{document} that terminates in category j of category system k. As described by Hu and Batchelder (Reference Hu and Batchelder1994), we define:

(5) P ( B kji | θ ) = s = 1 S θ s a skji ( 1 - θ s ) b skji \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\begin{aligned} P( B_{kji} | \varvec{\theta }) = \prod _{s = 1}^{S} \theta _{s}^{a_{skji}}( 1 -\theta _{s})^{b_{skji}} \end{aligned}$$\end{document}

where a skji \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$a_{skji}$$\end{document} and b skji \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$b_{skji}$$\end{document} indicate how often a process parameter θ s \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\theta _{s}$$\end{document} and its complement 1 - θ s \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$1-\theta _{s}$$\end{document} , respectively, appear on the branch B kji \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$B_{kji}$$\end{document} . For instance, there are two paths that terminate in C 14 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$C_{14}$$\end{document} . To illustrate, we just consider the first of these branches, B 141 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$B_{141}$$\end{document} :

(6) P ( B 141 | θ ) = s = 1 4 θ s a s 141 ( 1 - θ s ) b s 141 = θ 1 a 1141 ( 1 - θ 1 ) b 1141 · θ 2 a 2141 ( 1 - θ 2 ) b 2141 · θ 3 a 3141 ( 1 - θ 3 ) b 3141 · θ 4 a 4141 ( 1 - θ 4 ) b 1141 = c 1 ( 1 - c ) 0 · r 0 ( 1 - r ) 1 · u 0 ( 1 - u ) 0 · a 0 ( 1 - a ) 0 = c ( 1 - r ) \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\begin{aligned} P( B_{141} | \varvec{\theta })&= \prod _{s = 1}^{4} \theta _{s}^{a_{s141}}( 1 -\theta _{s})^{b_{s141}} \nonumber \\&= \theta _{1}^{a_{1141}}( 1 -\theta _{1})^{b_{1141}} \cdot \theta _{2}^{a_{2141}}( 1 -\theta _{2})^{b_{2141}} \cdot \theta _{3}^{a_{3141}}( 1 -\theta _{3})^{b_{3141}} \cdot \theta _{4}^{a_{4141}}( 1 -\theta _{4})^{b_{1141}}\nonumber \\&= c^{1}(1-c)^{0} \cdot r^{0}(1-r)^{1} \cdot u^{0}(1-u)^{0} \cdot a^{0}(1-a)^{0} = c(1-r) \end{aligned}$$\end{document}

According to these definitions, the probability of C kj \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$C_{kj}$$\end{document} is

(7) P ( C kj | θ ) = i = 1 I k P ( B kji | θ ) = i = 1 I k s = 1 S θ s a skji ( 1 - θ s ) b skji \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\begin{aligned} P(C_{kj} | \varvec{\theta } ) = \sum _{i = 1}^{I_k} P( B_{ kji} | \varvec{\theta }) = \sum _{i = 1}^{I_k} \prod _{s = 1}^{S} \theta _{s}^{a_{skji}}( 1 -\theta _{s})^{b_{skji}} \end{aligned}$$\end{document}

and the multinomial probability of the data for a single individual t is

(8) f ( n t | θ ) = k = 1 K N kt n k 1 t . . . n k J k t j = 1 J k i = 1 I kj s = 1 S θ s a skji ( 1 - θ s ) b skji n kjt . \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\begin{aligned} f(\varvec{n}_{t}|\varvec{\theta }) = \prod _{k = 1}^{K} \begin{pmatrix} N_{kt} \\ n_{k1t} ... n_{kJ_{k}t} \end{pmatrix} \prod _{j = 1}^{J_{k}} \left[ \sum _{i = 1}^{I_{kj}} \prod _{s = 1}^{S} \theta _{s}^{a_{skji}}( 1 -\theta _{s})^{b_{skji}} \right] ^{n_{kjt}}. \end{aligned}$$\end{document}

Equation 8 can be used to obtain ML estimates of the process parameters θ \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\varvec{\theta }$$\end{document} . For instance, Hu and Batchelder (Reference Hu and Batchelder1994) showed how the parameters can be estimated with an EM algorithm. Alternatively, methods based on the analytical gradient or finite difference methods can be used for ML parameter estimation. Both approaches are implemented in the R package MPTinR (Singmann and Kellen, Reference Singmann and Kellen2013; Singmann et al., Reference Singmann, Kellen, Gronau, Mueller and Bhel2020) or the package mpt (Wickelmaier and Zeileis, Reference Wickelmaier and Zeileis2011, Reference Wickelmaier and Zeileis2020). Also, a Bayesian approach can be used to estimate the parameters (e.g., Lee & Wagenmakers, Reference Lee and Wagenmakers2014).

3. MPT Models with Person-Level Random Effects

In the previous section, the process parameters contained in θ \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\varvec{\theta }$$\end{document} were assumed not to differ between individuals. In the latent-trait MPT model, this assumption is relaxed because R S \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$R {\le } S$$\end{document} elements of θ \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\varvec{\theta }$$\end{document} are written as a function of person-specific random effects that stem from a multivariate normal distribution

(9) b t M N V ( μ , Σ ) \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\begin{aligned} \varvec{b}_t \sim MNV( \varvec{\mu }, \varvec{\Sigma }) \end{aligned}$$\end{document}

where μ \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\varvec{\mu }$$\end{document} is an R × 1 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$R \times 1$$\end{document} vector of expectations and Σ \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\varvec{\Sigma }$$\end{document} is an R × R \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$R \times R$$\end{document} covariance matrix. Because each single θ st \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\theta _{st}$$\end{document} is a probability parameter, we need to make sure that the resulting coefficient remains in the unit interval [0, 1]. On the basis of the literature on the Generalized Linear Model (GLM), we use the logit-link function (see Coolin et al., Reference Coolin, Erdfelder, Bernstein, Thornton and Thornton2015)

(10) θ st = 1 1 + exp - ( β s + b st ) \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\begin{aligned} \theta _{st} = \frac{1}{1 + {exp}\left[ -(\beta _s + b_{st})\right] } \end{aligned}$$\end{document}

or the probit-link function (see Klauer, Reference Klauer2010)

(11) θ st = Φ ( β s + b st ) \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\begin{aligned} \theta _{st} = \Phi (\beta _s + b_{st}) \end{aligned}$$\end{document}

to map a real-valued person parameter b st \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$b_{st}$$\end{document} on a probability parameter θ st \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\theta _{st}$$\end{document} , with β s \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\beta _s$$\end{document} serving as a parameter-specific intercept constant (in Eq. 11, Φ \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\Phi $$\end{document} denotes the cumulative normal distribution). The two approaches typically result in very similar parameter estimates (at least in the GLM framework). To allow for comparisons in the framework of random effects MPT models, we have implemented both link functions in the R package that implements the statistical methods proposed in this article.

We can now define the conditional distribution of the category frequencies of person t given the random effects b t \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\varvec{b}_t$$\end{document} :

(12) f ( n t | β , b t ) = k = 1 K N kt n k 1 t . . . n k J k t j = 1 J k i = 1 I kj s = 1 S θ st a skji ( 1 - θ st ) b skji n kjt \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\begin{aligned} f(\varvec{n}_{t}|\varvec{\beta }, \varvec{b}_t) = \prod _{k = 1}^{K} \begin{pmatrix} N_{kt} \\ n_{k1t} ... n_{kJ_{k}t} \end{pmatrix} \prod _{j = 1}^{J_{k}} \left[ \sum _{i = 1}^{I_{kj}} \prod _{s = 1}^{S} \theta _{st}^{a_{skji}}( 1 -\theta _{st})^{b_{skji}} \right] ^{n_{kjt}} \end{aligned}$$\end{document}

where θ st \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\theta _{st}$$\end{document} is defined as in Eqs. 10 or 11. Along with the multivariate normal density of the random effects b t \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\varvec{b}_t$$\end{document} , Eq. 12 can then be used to obtain ML estimates of the model parameters. We refer to this model as the random effects MPT model.

Before we proceed, we would like to point out that the mean structure in our model is not identified because the expectation of the random effects is μ \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\varvec{\mu }$$\end{document} and the conditional distribution contains the vector of intercept terms β \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\varvec{\beta }$$\end{document} . To identify the mean structure, we define all elements of β \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\varvec{\beta }$$\end{document} to be zero when all process parameters are assumed to differ between individuals (i.e., R = S \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$R=S$$\end{document} , where R denotes the number of random effects parameters). In this case, our random effects model corresponds to the model described by Klauer (Reference Klauer2010). However, when only some (but not all) of the process parameters are assumed to be random, we estimate the respective entries in β \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\varvec{\beta }$$\end{document} for the ( S - R ) \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$(S-R)$$\end{document} fixed parameters while setting the remaining R entries (corresponding to random parameters) to zero and reducing the dimensions of μ \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\varvec{\mu }$$\end{document} and Σ \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\varvec{\Sigma }$$\end{document} accordingly. Thus, by writing the model as proposed here, we achieve a high degree of flexibility to estimate MPT models with both random and fixed effects. This can be very useful, for example, when MPT models include guessing parameters that need to be estimated and are assumed to be constant across individuals.

Furthermore, we can also extend the model to include person-level covariates contained in the person-specific column vectors X t \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\varvec{X}_{t}$$\end{document} . This can be achieved in two ways: First, when the person-level covariates are used to predict R process parameters that are assumed to be random, μ \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\varvec{\mu }$$\end{document} is person-specific, and we write it as function of the predictor values:

(13) μ t = μ + Γ X t , \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\begin{aligned} \varvec{\mu }_t = \varvec{\mu } + \varvec{\Gamma }\varvec{X}_{t}, \end{aligned}$$\end{document}

where Γ \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\varvec{\Gamma }$$\end{document} is an R × p \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$R \times p$$\end{document} matrix of weights, and p is the number of person-level predictors used to predict the random process parameters. Note that when there is an assumption that a process parameter is not affected by a predictor in X t \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\varvec{X}_{t}$$\end{document} , the respective entry in Γ \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\varvec{\Gamma }$$\end{document} is simply set to zero. Furthermore, μ \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\varvec{\mu }$$\end{document} represents the values of the person parameters when the covariates are zero. Second, when a fixed process parameter θ s \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\theta _s$$\end{document} is assumed to vary as a function of person-level covariates so that the intercept β st \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\beta _{st}$$\end{document} may differ between individuals t, we follow the procedure outlined by Coolin et al. (Reference Coolin, Erdfelder, Bernstein, Thornton and Thornton2015) and write

(14) β st = β s + γ s X t \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\begin{aligned} \beta _{st} = \beta _{s} + \varvec{\gamma _{s} } \varvec{X}_{t} \end{aligned}$$\end{document}

where β s \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\beta _{s}$$\end{document} is the value of the parameter-specific intercept when all person-level covariates are zero and γ s \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\varvec{\gamma _{s}}$$\end{document} is a row vector containing the weights of the p covariates used to predict θ s \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\theta _s$$\end{document} .

3.1. Marginal Maximum Likelihood Estimation

The goal is to estimate the free parameters contained in β \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\varvec{\beta }$$\end{document} , γ \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\varvec{\gamma }$$\end{document} , μ \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\varvec{\mu }$$\end{document} , Γ \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\varvec{\Gamma }$$\end{document} , and Σ \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\varvec{\Sigma }$$\end{document} , where β \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\varvec{\beta }$$\end{document} and γ \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\varvec{\gamma }$$\end{document} denote the ( S - R ) \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$(S-R)$$\end{document} -dimensional vector of intercepts and the ( S - R ) × p \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$(S-R) \times p$$\end{document} matrix of predictor weights, respectively, in the fixed-effects part of our model (corresponding to the model of Coolin et al., Reference Coolin, Erdfelder, Bernstein, Thornton and Thornton2015) while μ \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\varvec{\mu }$$\end{document} , Γ \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\varvec{\Gamma }$$\end{document} , and Σ \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\varvec{\Sigma }$$\end{document} denote to the R-dimensional vector of parameter means, the R × p \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$R \times p$$\end{document} matrix of predictor weights, and the R × R \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$R \times R$$\end{document} parameter covariance matrix, respectively, in the random-effects part of our model (corresponding to the model of Klauer, Reference Klauer2010). We employ a marginal ML approach that is based on the marginal density of the response frequencies. For a single individual t, this density is

(15) f ( n t ) = b t f ( n t | β , γ , b t ) f ( b t | μ , Γ , Σ ) d b t . \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\begin{aligned} f(\varvec{n}_{t}) = \int _{\varvec{b}_t} f(\varvec{n}_{t}|\varvec{\beta },\varvec{\gamma }, \varvec{b}_t) f(\varvec{b}_t|\varvec{\mu },\varvec{\Gamma },\varvec{\Sigma }) {d}\varvec{b}_t . \end{aligned}$$\end{document}

and for the entire sample, it is

(16) f ( n ) = t = 1 T f ( n t ) \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\begin{aligned} f(\varvec{n}) = \prod _{t = 1}^{T} f(\varvec{n}_t) \end{aligned}$$\end{document}

Eq. 16 can be used to define the likelihood

(17) L ( β , γ , μ , Γ , Σ ) = t = 1 T b t f ( n t | β , γ , b t ) f ( b t | μ , Γ , Σ ) d b t = t = 1 T b t L t d b t \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\begin{aligned} L(\varvec{\beta },\varvec{\gamma },\varvec{\mu },\varvec{\Gamma },\varvec{\Sigma }) = \prod _{t = 1}^{T} \int _{\varvec{b}_t} f(\varvec{n}_{t}|\varvec{\beta },\varvec{\gamma }, \varvec{b}_t) f(\varvec{b}_t|\varvec{\mu },\varvec{\Gamma },\varvec{\Sigma }) {d}\varvec{b}_t = \prod _{t = 1}^{T} \int _{\varvec{b}_t} L_t {d}\varvec{b}_t \end{aligned}$$\end{document}

and the log-likelihood of the data

(18) l l ( β , γ , μ , Γ , Σ ) = t = 1 T log b t L t d b t . \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\begin{aligned} ll(\varvec{\beta },\varvec{\gamma }, \varvec{\mu },\varvec{\Gamma },\varvec{\Sigma }) = \sum _{t = 1}^{T} \text {log}\left( \int _{\varvec{b}_t} L_t {d}\varvec{b}_t\right) . \end{aligned}$$\end{document}

One problem with the marginal ML approach is that there is no analytical solution for the values of β \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\varvec{\beta }$$\end{document} , γ \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\varvec{\gamma }$$\end{document} , μ \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\varvec{\mu }$$\end{document} , Γ \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\varvec{\Gamma }$$\end{document} , and Σ \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\varvec{\Sigma }$$\end{document} that maximize Eq. 18. Rather, numerical approximations have to be used to estimate the model parameters. For these approximations, we use the analytical gradient of the log-likelihood function given by

(19) l l τ k = t = 1 T 1 f ( n t ) b t L t log ( L t ) τ k d b t \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\begin{aligned} \frac{\partial ll}{\partial \tau _k} = \sum _{t = 1}^{T} \frac{1}{ f (\varvec{n}_{t}) } \int _{\varvec{b}_t} L_t \frac{\partial \text {log}(L_t)}{\partial \tau _k} {d}\varvec{b}_t \end{aligned}$$\end{document}

where τ \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\tau $$\end{document} denotes an element of β \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\varvec{\beta }$$\end{document} , γ \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\varvec{\gamma }$$\end{document} , μ \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\varvec{\mu }$$\end{document} , Γ \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\varvec{\Gamma }$$\end{document} , or Σ \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\varvec{\Sigma }$$\end{document} . To implement the gradient, one needs the first derivative of the log-likelihood function for a single person t with regard to a parameter. Note that log ( L t ) \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\text {log}(L_t)$$\end{document} is

(20) log f n t | β , γ , b t + log f b t | μ , Γ , Σ \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\begin{aligned} \text {log}f\left( \varvec{n}_{t}|\varvec{\beta },\varvec{\gamma }, \varvec{b}_t\right) + \text {log}f \left( \varvec{b}_t|\varvec{\mu }, \varvec{\Gamma }, \varvec{\Sigma }\right) \end{aligned}$$\end{document}

Thus, for a single element σ l \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\sigma _{l}$$\end{document} contained in Σ \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\varvec{\Sigma }$$\end{document} , the derivative is

(21) log L t σ l = - 1 2 t r Σ - 1 Σ σ l + 1 2 b t - μ t Σ - 1 Σ σ l Σ - 1 ( b t - μ t ) \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\begin{aligned} \frac{\partial \text {log}L_t}{\partial \sigma _l} = -\frac{1}{2} tr\left( \varvec{\Sigma }^{-1} \frac{\partial \varvec{\Sigma }}{ \partial \sigma _l }\right) + \frac{1}{2} \left( \varvec{b}_{t} - \varvec{\mu }_{t}\right) ^{'} \varvec{\Sigma }^{-1} \frac{\partial \varvec{\Sigma }}{ \partial \sigma _l } \varvec{\Sigma }^{-1} (\varvec{b}_{t} - \varvec{\mu }_{t}) \end{aligned}$$\end{document}

with

(22) Σ σ l = 1 l + 1 l - 1 l · 1 l \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\begin{aligned} \frac{\partial \varvec{\Sigma }}{ \partial \sigma _l } = \varvec{1}_{l} + \varvec{1}_{l}^{'} - \varvec{1}_{l} \cdot \varvec{1}_{l}^{'} \end{aligned}$$\end{document}

and 1 l \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\varvec{1}_{l}$$\end{document} is an R x R matrix that contains a 1 in the position of σ l \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\sigma _l$$\end{document} and zeroes in all other positions. For elements in μ \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\varvec{\mu }$$\end{document} or Γ \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\varvec{\Gamma }$$\end{document} , the derivative is

(23) log L t μ tl = ( b t - μ t ) Σ - 1 μ t μ tl \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\begin{aligned} \frac{\partial \text {log}L_t}{\partial \mu _{tl}} = (\varvec{b}_t - \varvec{\mu }_t)^{'} \varvec{\Sigma }^{-1} \frac{\partial \varvec{\mu }_{t}}{ \partial \mu _{tl} } \end{aligned}$$\end{document}

where

(24) μ t μ l = 1 l and μ t γ l = 1 l X t \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\begin{aligned} \frac{\varvec{\mu }_{t}}{ \partial \mu _l} = \varvec{1}_l \text { and } \frac{\varvec{\mu }_{t}}{ \partial \gamma _l} = \varvec{1}_l \varvec{X}_t \end{aligned}$$\end{document}

and 1 l \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\varvec{1}_{l}$$\end{document} is an R x 1 vector or an R x p matrix that contains a 1 in the position of the parameter that is being estimated and zeroes in all other positions. Finally, the derivative of log ( L t ) \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\text {log}(L_t)$$\end{document} with regard to the elements in β \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\varvec{\beta }$$\end{document} is

(25) log L t β l t = 1 T k = 1 K j = 1 J k n kjt P ( C kj | θ t ) i = 1 I kj a skji · θ st β l - 1 - b skji · 1 - θ st β l - 1 · P ( B kji | θ t ) \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\begin{aligned} \frac{\partial \text {log}L_t}{\partial \beta _{l}} \sum _{t=1}^{T} \sum _{k=1}^{K} \sum _{j=1}^{J_{k}} \frac{ n_{kjt} }{ P(C_{kj} | \varvec{\theta }_t ) } \sum _{i = 1}^{I_{kj}} \left[ a_{skji} \cdot \left( \frac{\partial \theta _{st}}{\partial \beta _{l}}\right) ^{-1} - b_{skji} \cdot {\left( 1 - \frac{\partial \theta _{st}}{\partial \beta _{l}}\right) }^{-1} \right] \cdot P(B_{kji}|\varvec{\theta }_t) \end{aligned}$$\end{document}

where θ st β l \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\frac{\partial \theta _{st}}{\partial \beta _{l}}$$\end{document} is the derivative of the chosen link function with respect to the intercept parameter β l \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\beta _{l}$$\end{document} . The same formula can be used for the parameters in γ \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\varvec{\gamma }$$\end{document} , but the derivative of the link function has to be computed for the element γ sl \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\gamma _{sl}$$\end{document} rather than β l \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\beta _{l}$$\end{document} (i.e., θ st γ sl \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\frac{\partial \theta _{st}}{\partial \gamma _{sl}}$$\end{document} ).

A second problem with using the marginal ML approach is that the likelihood or log-likelihood function, respectively, and the gradient involve integrals that are not tractable. However, one can approximate the integrals using a number of numerical techniques (see Tuerlinckx et al., Reference Tuerlinckx, Rijmen, Verbeke and De Boeck2006; Nestler, Reference Nestler2020, Reference Nestler2021). In the R package that implements marginal ML estimation methods, we implemented three approaches: the Laplace approximation, the Adaptive Gauss–Hermite Quadrature (AGHQ), and Quasi Monte Carlo (QMC) Sampling.

Laplace Approximation The basic idea behind the Laplace approximation is that it can be used to replace the function within the integral with another function that has a closed-form expression. Imagine that the modes of the random effects b t \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\varvec{b}_t$$\end{document} of

(26) l ( b t ) = f ( n t | β , γ , b t ) f ( b t | μ , Γ , Σ ) \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\begin{aligned} l( \varvec{b}_t) = f(\varvec{n}_{t}|\varvec{\beta }, \varvec{\gamma }, \varvec{b}_t) f(\varvec{b}_t|\varvec{\mu },\varvec{\Gamma },\varvec{\Sigma }) \end{aligned}$$\end{document}

are available for each of the T individuals. One can then show (see, e.g., Pinheiro & Bates, Reference Pinheiro and Bates1995) that the likelihood function given in Eq. 17 can be approximated by

(27) L ( β , γ , μ , Γ , Σ ) t = 1 T ( 2 π ) S / 2 | Ω ^ | 1 / 2 f ( n t | β , γ , b ^ t ) f ( b ^ t | Σ , μ , Γ ) \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\begin{aligned} L(\varvec{\beta },\varvec{\gamma },\varvec{\mu },\varvec{\Gamma },\varvec{\Sigma }) \approx \prod _{t = 1}^{T} (2\pi )^{S/2}|\hat{\varvec{\Omega }}|^{1/2} f(\varvec{n}_{t}|\varvec{\beta }, \varvec{\gamma }, \hat{\varvec{b}}_t) f( \hat{\varvec{b}}_t|\varvec{\Sigma },\varvec{\mu },\varvec{\Gamma }) \end{aligned}$$\end{document}

where b t ^ \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\hat{\varvec{b}_t}$$\end{document} denotes the modes, and Ω ^ \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\hat{\varvec{\Omega }}$$\end{document} is the asymptotic covariance matrix of these modes.

A problem with using Eq. 27 is that one has to estimate the modes b t ^ \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\hat{\varvec{b}_t}$$\end{document} for all T individuals given the (unknown) parameters contained in β \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\varvec{\beta }$$\end{document} , γ \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\varvec{\gamma }$$\end{document} , μ \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\varvec{\mu }$$\end{document} , Γ \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\varvec{\Gamma }$$\end{document} , and Σ \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\varvec{\Sigma }$$\end{document} . In practical implementations, this problem is circumvented by first estimating the modes given the current parameter estimates. Then, the model parameters are estimated by maximizing Eq. 27 given the current modes b ^ t \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\hat{\varvec{b}}_t$$\end{document} . This two-step procedure is repeated until the algorithm converges. The Laplace approximation is very fast (compared with the other methods), and, in the well-known R package lme4 (Bates et al., Reference Bates, Mächler, Bolker and Walker2015), it is the default method for estimating the parameters of a Generalized Linear Mixed Model (GLMM).

Gauss–Hermite Quadrature The basic idea behind quadrature approaches is that they can be used to approximate the numerical value of the integral. When one assumes that the random effects are normally distributed (as we did), one first generates M vectors of size R x 1 of Gauss–Hermite (GH) nodes x \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\varvec{x}$$\end{document} and weights w \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\varvec{w}$$\end{document} . For each of the node vectors (e.g., x m \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\varvec{x}_m$$\end{document} ), one then computes f ( n t | β , γ , b t = x m ) \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$f(\varvec{n}_{t}|\varvec{\beta }, \varvec{\gamma }, \varvec{b}_t = \varvec{x}_m )$$\end{document} . The integral for person t can then be approximated by a weighted sum

(28) L t ( β , γ , μ , Γ , Σ ) n 1 = 1 M n S = 1 M f ( n t | β , γ , b n 1 n R ) w n 1 w n R . \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\begin{aligned} L_{t}(\varvec{\beta },\varvec{\gamma },\varvec{\mu },\varvec{\Gamma },\varvec{\Sigma }) \approx \sum _{n_{1} = 1}^{M} \cdots \sum _{n_{S} = 1}^{M} f(\varvec{n}_{t}|\varvec{\beta }, \varvec{\gamma }, \varvec{b}_{n_{1} \cdots n_{R}} ) \varvec{w}_{n_{1}} \cdots \varvec{w}_{n_{R}} . \end{aligned}$$\end{document}

The main problem with this GH quadrature is that the number of points M increases exponentially with the size of R. For example, when M = 10 and R = 4, there are 10 4 = 1000 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$10^4 = 1000$$\end{document} single node vectors. This is problematic because f ( n t | β , γ , b t = x m ) \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$f(\varvec{n}_{t}|\varvec{\beta }, \varvec{\gamma }, \varvec{b}_t = \varvec{x}_m )$$\end{document} needs to be computed for each vector x m \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\varvec{x}_m$$\end{document} and each person t. Thus, when M is large, the computational burden associated with approximating the integral is too large to be feasible. For instance, when M = 10 and R = 10, one would need to compute 10 10 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$10^{10}$$\end{document} vector values for each person t. For this reason, we use Adaptive GH quadrature in our implementation (AGHQ, Rabe-Hesketh et al., Reference Rabe-Hesketh, Skrondal and Pickles2002; Tuerlinckx et al., Reference Tuerlinckx, Rijmen, Verbeke and De Boeck2006). In AGHQ, the GH quadrature points x \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\varvec{x}$$\end{document} for each individual are first centered and scaled with the individual’s modes b ^ t \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\hat{\varvec{b}}_t$$\end{document} . These scaled nodes and weights are then used to compute a weighted sum that is similar to the one provided in Eq. 28. By scaling the node vectors with the modes, fewer nodes are required to achieve a precise approximation of the integrals.

Quasi Monte Carlo Integration Even when using AGHQ, the computational burden is too high for high-dimensional random effect distributions. An alternative one could use is Monte Carlo (MC) integration. MC integration rests on the observation that the integral in Eq. 15 can be seen as an expectation of the function f ( n t | β , γ , b t ) \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$f(\varvec{n}_{t}|\varvec{\beta },\varvec{\gamma }, \varvec{b}_t)$$\end{document} with respect to the random effect distribution f ( b t | μ , Γ , Σ ) \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$f(\varvec{b}_t|\varvec{\mu },\varvec{\Gamma },\varvec{\Sigma })$$\end{document} :

(29) b t f ( n t | β , γ , b t ) f ( b t | μ , Γ , Σ ) d b t = E f ( n t | β , γ , b t ) . \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\begin{aligned} \int _{\varvec{b}_t} f(\varvec{n}_{t}|\varvec{\beta }, \varvec{\gamma }, \varvec{b}_t) f(\varvec{b}_t|\varvec{\mu },\varvec{\Gamma },\varvec{\Sigma }) {d}\varvec{b}_t = \text {E} \left[ f(\varvec{n}_{t}|\varvec{\beta }, \varvec{\gamma }, \varvec{b}_t) \right] . \end{aligned}$$\end{document}

One can thus draw M random samples from f ( b t | μ , Γ , Σ ) \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$f(\varvec{b}_t|\varvec{\mu },\varvec{\Gamma },\varvec{\Sigma })$$\end{document} to approximate the integral (Robert and Casella, Reference Robert and Casella2010) with

(30) L t ( β , γ , μ , Γ , Σ ) 1 M m = 1 M f ( n t | β , γ , b tm ) . \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\begin{aligned} L_{t}(\varvec{\beta },\varvec{\gamma }, \varvec{\mu },\varvec{\Gamma },\varvec{\Sigma }) \approx \frac{1}{M} \sum _{m = 1}^{M} f(\varvec{n}_{t}|\varvec{\beta },\varvec{\gamma }, \varvec{b}_{tm}) . \end{aligned}$$\end{document}

where one inserts the m-th random draw for b t \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\varvec{b}_{t}$$\end{document} . The advantage of MC integration is that the number of draws M does not have to increase when another random effect is added. However, one disadvantage of the technique is that M must be sufficiently large to be precise. Furthermore, because samples are randomly drawn from the random effect distribution, a (Monte Carlo) sampling error is introduced into the estimation. For these two reasons, here, we use Quasi Monte Carlo (QMC) integration, which builds on deterministic sequences of points instead of randomly drawn points in MC integration (hence the name “Quasi”). There are different ways to generate such sequences (see González et al., Reference González, Tuerlinckx, De Boeck and Cools2006, for an introduction). In accordance with the GLMM literature (e.g., Crowther, Reference Crowther2017; González et al., Reference González, Tuerlinckx, De Boeck and Cools2006), we use Halton sequences in our implementation because smaller numbers of draws M are required to achieve a precise approximation. To further decrease the computational burden, we additionally scale the Halton numbers with b t ^ \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\hat{\varvec{b}_t}$$\end{document} and Ω ^ \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\hat{\Omega }$$\end{document} .

3.2. Standard Errors, Goodness-of-Fit Tests, and Random Effects

Once the ML estimates have been determined, the estimates and their corresponding standard errors can be used to compute z statistics and confidence intervals. From standard ML theory, it follows that the parameters are asymptotically normally distributed with a covariance matrix that is obtained by calculating the inverse of the information matrix. This matrix is the negative of the matrix of second derivatives that is given by

(31) l l τ k τ j = t = 1 T 1 f ( n t ) 2 f ( n t ) d t τ j - d t d t T \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\begin{aligned} \frac{\partial ll}{\partial \tau _k\tau _j} = \sum _{t = 1}^{T} \frac{1}{ f(\varvec{n}_{t})^2 } \left[ f(\varvec{n}_{t}) \frac{\partial d_t}{\partial \tau _j} - d_{t} d^{T}_{t} \right] \end{aligned}$$\end{document}

where

(32) d t = b t L t log ( L t ) τ k d b t \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\begin{aligned} d_{t} = \int _{\varvec{b}_t} L_t \frac{\partial \text {log}(L_t)}{\partial \tau _k} {d}\varvec{b}_t \end{aligned}$$\end{document}

and

(33) d t τ j = b t 2 log ( L t ) τ k τ j + log ( L t ) τ k log ( L t ) τ j L t d b t \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\begin{aligned} \frac{\partial d_t}{\partial \tau _j} = \int _{\varvec{b}_t} \left[ \frac{\partial ^2 \text {log}(L_t)}{\partial \tau _k \tau _j} + \frac{\partial \text {log}(L_t)}{\partial \tau _k} \frac{\partial \text {log}(L_t)}{\partial \tau _j} \right] L_t {d}\varvec{b}_t \end{aligned}$$\end{document}

Again, due to the integrals involved in Eqs. 32 and 33, the matrix of second derivatives is hard to compute. In our implementation of the ML approach users can therefore choose whether standard errors are based on a numerical approximation of the Hessian with finite-difference methods using the analytical gradient (see Eq. 19) or on the exact hessian computed with Eq. 31.

Furthermore, before researchers interpret MPT parameter estimates, they need to show that their model actually fits the observed data. In addition, researchers very often want to compare the fit of a current model with the fit of a more restricted model that imposes psychologically motivated constraints on the parameters (e.g., equality constraints or parameter fixations). Goodness-of-fit tests and model comparisons can both be conducted by means of a likelihood ratio test

(34) L R = - 2 · ( l l r - l l u ) \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\begin{aligned} LR = -2\cdot ( ll_r - ll_u ) \end{aligned}$$\end{document}

where l l r \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$ll_r$$\end{document} ( l l u \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$ll_u$$\end{document} ) is the log-likelihood value of the restricted (unrestricted) model. Under certain regularity conditions (see Reed & Cressie, Reference Reed and Cressie1988), the LR test statistic asymptotically follows a central χ 2 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\chi ^2$$\end{document} distribution with degrees of freedom equal to the difference between the parameters of the unrestricted and restricted models if the more restricted model actually holds. The likelihood ratio test can be used to compare the fit of two models that differ in the mean structure (e.g., u = a in the pair-clustering model) or in the covariance structure (e.g., σ ua \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\sigma _{ua}$$\end{document} = 0).

The likelihood ratio test is based on the assumption that the models being compared are nested. For non-nested model comparisons, we suggest using the Akaike or the Bayesian Information Criterion (AIC or BIC, respectively):

(35) AIC = - 2 · l l + 2 · d f B I C = - 2 · l l + log ( n ) · d f , \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\begin{aligned} AIC&= -2 \cdot ll + 2 \cdot df \nonumber \\ BIC&= -2 \cdot ll + \text {log}(n)\cdot df, \end{aligned}$$\end{document}

where n = t = 1 T n t \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$n = \sum _{t=1}^{T} n_t$$\end{document} and d f = R + ( S - R ) + p Γ + p γ + dim ( Σ ) \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$df = R + (S - R) + p_{\varvec{\Gamma }} + p_{\varvec{\gamma }} + \text {dim}(\varvec{\Sigma })$$\end{document} . Here, R is the number of estimated cognitive process parameters set to be random (i.e., parameters in μ \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\varvec{\mu }$$\end{document} ), S - R \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$S - R$$\end{document} is the number of fixed cognitive process parameters (i.e., parameters in β \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\varvec{\beta }$$\end{document} ), p Γ + p γ \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$p_{\varvec{\Gamma }} + p_{\varvec{\gamma }}$$\end{document} gives the total number of to-be-estimated weights of the person-level covariates in the random and the fixed effects part of the model (i.e., parameters in Γ \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\varvec{\Gamma }$$\end{document} and γ \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\varvec{\gamma }$$\end{document} ), respectively, and dim ( Σ ) \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\text {dim}(\varvec{\Sigma })$$\end{document} represents the number of estimated covariance parameters (see Bates et al., Reference Bates, Mächler, Bolker and Walker2015).

Finally, it can also be of interest to obtain estimates of the individual random effects for each participant t. As a random effects estimator, b ^ t \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\hat{\varvec{b}}_t$$\end{document} , it is possible to use a participant’s mode, which can be obtained by maximizing Eq. 26, while treating the final model parameter estimates as fixed. Fortunately, these values are already required when estimating the model so that they do not have to be estimated again. Another choice would be the empirical Bayes estimator (Bock and Aitken, Reference Bock and Aitken1981):

(36) b ^ t = t = 1 T 1 f ( n t ) b t b t log ( f ( n t | β , γ , b t ) ) f ( b t | μ , Γ , Σ ) d b t . \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\begin{aligned} \hat{b}_{t} = \prod _{t = 1}^{T} \frac{1}{ f(\varvec{n}_{t}) } \int _{\varvec{b}_t} \varvec{b}_t \text {log}(f(\varvec{n}_{t}|\varvec{\beta },\varvec{\gamma }, \varvec{b}_t)) f(\varvec{b}_t|\varvec{\mu },\varvec{\Gamma },\varvec{\Sigma }) {d}\varvec{b}_t . \end{aligned}$$\end{document}

In this approach, we approximate the integrals using one of the integral approximation methods described above.

4. Simulation Study

We performed a simulation study to assess the frequentist properties of the suggested ML estimation approaches and also compare it with the performance of a Bayesian approach. Specifically, we examined the effect of the number of participants and the number of responses per participant on the bias of the parameter estimates and the coverage rate of the corresponding confidence or credibility intervals, respectively. In addition, we examined for the ML estimator how the number of quadrature points in the AGHQ approach and the size of the Halton sequence in the QMC method affect the adequacy of the model parameter estimates.

Population Model and Simulation Conditions. We used the pair-clustering model for this simulation. This MPT model comprises two category systems, including four and two categories, respectively, with response probabilities described by four process parameters c, r, u, and a. In our simulation study, the population covariance matrix of these four parameters (describing how they vary across participants) was set to

(37) Σ = 0.50 0.08 0.04 0.00 0.08 0.35 0.03 0.00 0.04 0.03 0.20 0.07 0.00 0.00 0.07 0.20 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\begin{aligned} \varvec{\Sigma } = \begin{pmatrix} 0.50 &{}\quad 0.08 &{}\quad 0.04 &{}\quad 0.00 \\ 0.08 &{}\quad 0.35 &{}\quad 0.03 &{}\quad 0.00 \\ 0.04 &{}\quad 0.03 &{}\quad 0.20 &{}\quad 0.07 \\ 0.00 &{}\quad 0.00 &{}\quad 0.07 &{}\quad 0.20 \\ \end{pmatrix} \end{aligned}$$\end{document}

where the non-zero covariance terms reflect correlations of 0.30, 0.20, and 0.10. The mean values of the parameters were set to μ c = 0.50 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\mu _{c} = 0.50$$\end{document} , μ r = 0.40 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\mu _{r} = 0.40$$\end{document} , μ u = 0.25 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\mu _{u} = 0.25$$\end{document} , and μ a = 0.15 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\mu _{a} = 0.15$$\end{document} (see Batchelder & Riefer, Reference Batchelder and Riefer1986).

We manipulated the number of simulated participants (75 vs. 125) and the number of simulated responses per participant (25 vs. 75 vs. 125). In accordance with Batchelder and Riefer (Reference Batchelder and Riefer1986), about 80% of the responses were assigned to the first category system (i.e., 20 vs. 60 vs. 100), leaving 20% for the second system (i.e., 5 vs. 15 vs. 25). The R package mvtnorm and the function rmultinorm were used to generate the samples. We drew 500 samples from the population for each of the six simulation conditions.

Estimators All the functions that were required to estimate the parameters with ML were implemented in R (R Core Team, 2020; a working version of the package can be downloaded from https://osf.io/w97m5/). To examine the properties of the ML estimation procedures, the number of quadrature points in the AGHQ method was set to 3, 4, or 5, resulting in 4 3 = 64 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$4^3 = 64$$\end{document} , 4 4 = 256 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$4^4 = 256$$\end{document} or 4 5 = 1 , 024 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$4^5 = 1,024$$\end{document} node vectors per participant. For the QMC method, the size of the Halton sequence was set to 500, 1000, or 2000. To obtain Bayesian estimates, we employed the TreeBUGS package (Heck et al., Reference Heck, Arnold and Arnold2018a). TreeBUGS uses the JAGS-MCMC sampler (Plummer, Reference Plummer2003) to approximate the posterior distribution of the model’s parameters. For each replication, we fitted the model with the traitMPT function using the default settings. That is, weakly informative priors were specified for all parameters (i.e., normal distributions for the means and a scaled Wishart distributions for the covariance matrix). Furthermore, three chains of 20,000 samples were generated from the posterior distributions, whereby the first 2000 samples were discarded for parameter estimation (i.e., burn-in period). Since TreeBUGS by default provides the means of the posterior distributions as parameter estimates, we decided to used them as the Bayes estimates.

Dependent Measures We used the relative bias (RB) of the parameter estimates and the coverage rate (CR) to investigate the statistical performances of the ML approaches and the Bayes estimator. For the relative bias, we first computed the average parameter estimate in a simulation condition. We then computed the difference between this average and the true parameter and thereafter divided the difference by the true parameter. We consider relative biases below 10% to be acceptable, biases of 10–20% to be substantial, and biases above 20% to be unacceptable (e.g., Forero et al., Reference Forero, Maydeu-Olivares and Gallardo-Pujol2009; Morris et al., Reference Morris, White and Crowther2019). For ML, the confidence interval with the standard errorFootnote 1 of an estimate was computed in each replication to determine the observed coverage of the 95% confidence intervals. The coverage was then coded 1 if the true parameter value was included in the interval and 0 if the true parameter was not. We used the same approach to determine the observed coverage of the 95% credibility intervals, but employed the 95% credibility intervals as provided by TreeBUGS as the basis for the coding.

Table 1 Relative frequencies of converged replications (CR) and average computation time (in s), depending on the estimator, the type of ML approximation method, the number of individuals T, and the number of responses N per individual.

AGHQ = Adaptive Gauss–Hermit quadrature with 3, 4, or 5 nodes; QMC = Quasi Monte Carlo integration with 500, 1000, or 2000 points. In case of Bayes, convergence rates were calculated using R ^ \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\hat{R}$$\end{document} values. Computation times were determined on an Intel Core i7-6700 with four cores and 16GB RAM.

Results We first examined the percentage of samples in which the estimation algorithm converged for the two estimators, the different approximation methods, the different numbers of simulated participants, and the different numbers of simulated responses per participant. In case of ML, a sample was counted as converged when there were neither inadmissible estimates nor undefined standard errors in the final solution. For Bayes, judging the convergence is a more difficult issue (Hoff, Reference Hoff2009). Here, we decided to use R ^ \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\hat{R}$$\end{document} > 1.05 as the criterion, because it can be used well in simulation studies and it is suggested in the literature (e.g., Lynch, Reference Lynch2007). However, we acknowledge that a more or less stringent cutoff may lead to different results and that this should be taken into account in the interpretation of our findings.

As can be seen in Table 1, convergence rates for both estimators increased with the number of participants and the number of responses. In case of ML, the larger the number of points used by a method to approximate the integrals, the better the convergence of the ML estimator. When the number of responses was 75 or 125, convergence rates were acceptable for the AGHQ and QMC methods. In conditions with 25 responses, convergence rates were highest for conditions with the largest size of points. In this case, convergence rates were also similar to the convergence rate of the Bayes estimator. With regard to the Laplace approximation, we found that convergence rates were always below 70% (and even near 0% when number of responses was 25). Further analyses showed that all convergence failures occurred because some (or all) of the standard errors were undefined for the final estimates. These estimates were also very biased. We think that this bias can be explained by noting that the precision of the Laplace approximation depends on whether the shape of the function that is being integrated resembles a multivariate normal distribution. Hence, the method does not perform well for highly non-normal cases (e.g., when the responses are Bernoulli distributed; see Engel, Reference Engel1998), and we suspect that similar performance problems occur for the MPT model. Finally, Table 1 also contains the average run times of the different methods. We note that comparing the computation times across ML and Bayes is difficult, because they depend on how the approaches are implemented in R (i.e., using C++ in the background or parallelization), how many chains are generated etc. In case of Bayes, all methods were slower for larger numbers of participants. Similarly, ML methods became slower the larger numbers of points used to approximate the integrals and the larger the number of participants.

Table 2 Relative bias of parameter estimates in percent, depending on the approximation method, the number of individuals T, and the number of responses N per individual.

AGHQ = Adaptive Gauss–Hermit quadrature with 3, 4, or 5 nodes; QMC = Quasi Monte Carlo integration with 500, 1000, or 2000 points.

In the following, we drop the Laplace approximation from further consideration when we discuss the precision of the ML estimates (i.e., RB) and the confidence intervals (i.e., CR). Furthermore, to facilitate the interpretation of the results, we decided to average the results for the indices per parameter group (i.e., for the cognitive process parameter mean values c, r, u, and a, called μ \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\mu $$\end{document} ; the variance parameters in Σ \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\varvec{\Sigma }$$\end{document} , termed σ b 2 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\sigma _{b}^{2}$$\end{document} ; and the covariance parameters in Σ \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\varvec{\Sigma }$$\end{document} , termed σ bb \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\sigma _{bb}$$\end{document} ). The resulting values for the relative bias are displayed in Table 2. When we consider relative biases below 10% to be acceptable and biases of 10–20% to be substantial (Forero et al., Reference Forero, Maydeu-Olivares and Gallardo-Pujol2009; Morris et al., Reference Morris, White and Crowther2019), we found for both ML methods that relative biases decreased as the numbers of node vectors, sample sizes, and numbers of responses per participant increased. Importantly, relative biases were generally low and acceptable in all simulation conditions for the cognitive process parameters and the variance parameters. For the latter, however, biases were somewhat larger but still acceptable in case of AGHQ when the number of responses was 25 and the number of persons was 75. For the covariance parameters, the relative biases were larger and more substantial the smaller the number of points used and the smaller the number of responses. For the Bayes estimator, the relative bias was negligible for the cognitive process parameters and the variance parameters in all conditions. However, replicating the results of Klauer (Reference Klauer2010), the covariance parameters were unacceptably and substantially biased when the number of responses was 25 or 75. When the number of responses was 125, the relative biases were still large but acceptable.

Table 3 Coverage rate of parameter estimates, depending on the approximation method, the number of individuals T, and the number of responses N per individual.

AGHQ = Adaptive Gauss–Hermit quadrature with 3, 4, or 5 nodes, QMC = Quasi Monte Carlo integration with 500, 1000, or 2000 points.

With regard to the CR (see Table 3), we found that for all three types of parameters, the CR moved closer to the nominal value as the sample size and the number of responses increased. For AGHQ, the CR was near its nominal value when the number of points was 4 4 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$4^4$$\end{document} and the number of responses at least 75. When the number of responses is 25, the CR was near the nominal value when 4 5 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$4^5$$\end{document} points were used. A similar pattern of results was observed for the QMC approximation, although the amount of undercoverage was greater than for the AGHQ approximation. Furthermore, when the number of responses was 25, undercoverage occurred irrespective of how many points were investigated in our simulation. In case of Bayes, we found that the coverage was nominal for the cognitive process parameters and the variance parameters. However, overcoverage occurred for the covariance parameters although this tendency disappeared the larger the number of responses.

To summarize, the simulation study reveals that the AGHQ method works better than the QMC method when the number of nodes is at least four. Relative biases were low even for small sample sizes and few responses per participant and confidence interval estimates were close to nominal. When the number of responses was 75, the QMC also provided acceptable results, at least when the number of points was 1000. Finally, when the number of responses was small (i.e., R = 25), AGHQ with five nodes yield estimates that are at least as good as the estimates of a Bayesian approach.

5. Illustrative Example

To illustrate the proposed ML approach, we analyzed neuropsychological data from Schilken (Reference Schilken1998), who employed the pair-clustering model to analyze and compare the memory performance of 22 epileptic patients with a right-temporal focus of epileptic seizures, 21 epileptic patients with a left-temporal focus, and 20 healthy controls matched with respect to age and intelligence. Each participant learned word lists consisting of 10 semantically strongly related word pairs (e.g., armchair-sofa) and 5 singletons that were not related to other words in the list (e.g., sunflower). All words were comparable in terms of difficulty and memorizability when considered in isolation. In addition, three words were added to the beginning and another three words to the end of the word list to absorb primacy and recency effects in free recall. Because these primacy and recency buffer words were excluded from further analysis, N = 15 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$N = 15$$\end{document} responses per participant and study-test cycle remained for analysis-10 responses for the pairs and 5 for the singletons. The studying of the word list and the subsequent free recall test were repeated 6 times to examine learning effects on the c, r, and u parameters of the pair-clustering model, resulting in a total of six study-test trials per participant.

The same data set was also analyzed by Klauer (Reference Klauer2006, Reference Klauer2010), thus providing us with the opportunity to compare our results with Klauer’s, which were based on the Bayesian Latent-Trait model (cf. Klauer, 2010, pp. 86/87). To maximize heterogeneity between participants, we followed the procedure outlined by Klauer (Reference Klauer2006, Reference Klauer2010) and analyzed all T = 22 + 21 + 20 = 63 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$T = 22 + 21 + 20 = 63$$\end{document} participants conjointly in our first model. Also following Klauer’s guidelines, we restricted our attention to the first two study-test cycles for each participant. This left us with T = 63 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$T = 63$$\end{document} participants, K = 4 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$K = 4$$\end{document} category systems (i.e., those for word pairs and singletons in Trials 1 and 2) with J 1 = J 3 = 4 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$J_1 = J_3 = 4$$\end{document} and J 2 = J 4 = 2 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$J_2 = J_4 = 2$$\end{document} categories (for word pairs and singletons, respectively), and N 1 = N 3 = 10 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$N_1 = N_3 = 10$$\end{document} as well as N 2 = N 4 = 5 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$N_2 = N_4 = 5$$\end{document} responses per participant within the four category systems. Finally, again in line with Klauer (Reference Klauer2010)’s suggestions, we imposed the restriction that unclustered words in a pair must match the singletons in terms of trial-specific storage and retrieval probabilities (i.e., u = a \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$u = a$$\end{document} ). Hence, there were three parameters to estimate per trial, c ( 1 ) \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$c^{(1)}$$\end{document} , r ( 1 ) \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$r^{(1)}$$\end{document} , u ( 1 ) \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$u^{(1)}$$\end{document} for Trial 1 and c ( 2 ) \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$c^{(2)}$$\end{document} , r ( 2 ) \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$r^{(2)}$$\end{document} , u ( 2 ) \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$u^{(2)}$$\end{document} for Trial 2.

Table 4 Parameter estimates of the mean structure for the illustrative data example.

Parameters were obtained with AGHQ with 4 nodes. Est = Parameter estimates; CI = confidence interval; TE = probability-transformed estimates; Bayes = Bayesian Estimates as reported in Table 2 in Klauer (Reference Klauer2010). D1 = First dummy variable, that is, 1 for right-temporal epileptic patients (all other participants are coded zero). D2 = Second dummy variable, that is, 1 for left-temporal epileptic patients (all other participants are coded zero). Δ D . \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\Delta _{D.}$$\end{document} is the estimate of the parameter for the group coded in the respective dummy variable (i.e., δ . , D . + μ . \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\delta _{.,D_{.}} + \mu _{.}$$\end{document} ). For model 2, probability transformed estimates (column TE) indicate parameter means in the three groups (control group, right-temporal epileptics, left-temporal epileptics, respectively).

To test the goodness of fit, we estimated the model that we just specified and a more general model with four parameters c ( d ) \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$c^{(d)}$$\end{document} , r ( d ) \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$r^{(d)}$$\end{document} , u ( d ) \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$u^{(d)}$$\end{document} , and a ( d ) \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$a^{(d)}$$\end{document} per trial. For both models, we used the AGHQ approximation method with 4 nodes to fit the two models. The LR test statistic comparing the original and the more general model was LR = 9.02, which is not significantly different from zero, χ crit 2 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\chi ^{{2}}_{{crit}}$$\end{document} = 30.1, p =.98, df = 19. Table 4 shows the parameters of the mean structure (i.e., μ \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\mu $$\end{document} ) in the restricted model. As can be seen, parameter values increased from Trial 1 to Trial 2, and the probability-transformed parameters were very similar to the probability-transformed parameters reported in Klauer (Reference Klauer2010). Variance and correlation parameters were also quite similar across the two approaches (see Table 5). A notable exception was the variance of r ( 1 ) \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$r^{(1)}$$\end{document} , where the ML estimate was considerably larger than the Bayesian estimates. Also, the correlations involving this parameter were smaller for ML compared with Bayes.

Table 5 Parameter estimates of the covariance structure for the illustrative example with real data.

The first two columns present variance estimates based on the AGHQ method with 4 nodes (Est) and the corresponding Bayesian estimates (Bayes) as reported by Klauer (Reference Klauer2010, Table 2), respectively. The correlation matrix displays the Bayesian estimates of Klauer (Reference Klauer2010) above the diagonal and the corresponding AGHQ estimates using 4 nodes below the diagonal.

We decided to go one step further than Klauer (Reference Klauer2010) and also analyze effects of the clinical group on the parameter estimates. For this purpose, we added two dummy variables as covariates to our model, the first one coded “1” for the right-temporal epileptic patients (patients in the two other groups were coded 0) and the second dummy variable coded “1” for the left-temporal epileptic patients (again, all other patients were coded “0”). This model provided us with the opportunity to estimate the average (negative) effect of each clinical group relative to the control group on each of the model parameters, that is, c ( 1 ) \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$c^{(1)}$$\end{document} , r ( 1 ) \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$r^{(1)}$$\end{document} , u ( 1 ) \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$u^{(1)}$$\end{document} for Trial 1 and c ( 2 ) \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$c^{(2)}$$\end{document} , r ( 2 ) \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$r^{(2)}$$\end{document} , u ( 2 ) \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$u^{(2)}$$\end{document} for Trial 2. The results are also shown in Table 4. The results indicate that estimates for the c ( . ) \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$c^{(.)}$$\end{document} parameters were lower for epileptic patients compared with control patients whereas this pattern is less clear for the remaining parameters.

6. Discussion

The aim of this article was to describe how the parameters in the latent-trait MPT model can be estimated with a marginal ML approach. Specifically, we introduced three methods to approximate the integrals that are involved when the goal is to maximize the marginal log-likelihood function, and we investigated the statistical properties of these methods in a simulation study. Finally, we presented an empirical example that illustrated the suggested approaches.

The results of the simulation study showed that AGHQ and QMC performed well with regard to the relative bias and the coverage rate. However, we also found that AGHQ performed somewhat better than QMC in most simulation conditions. We would therefore recommend that researchers use AGHQ as the default method for parameter estimation but switch to QMC when the number of random effects becomes too large to approximate the integral with AGHQ in a reasonable amount of time. Strictly speaking, however, our results refer to the range of simulation conditions examined here only. Hence, further simulation research is needed to examine the performance of our ML approaches for other MPT models or, for example, models with low variance components. In these simulation studies, one could then also examine some additional approximation methods we ignored in the current study, such as a Monte Carlo EM algorithm (Booth and Hobert, Reference Booth and Hobert1999), variational approximation (Ormerod and Wand, Reference Ormerod and Wand2010), or Laplace importance sampling (Kuk, Reference Kuk1999). The latter method is a modification to the Laplace approximation which we dropped from further consideration because of unsatisfactory performance in our simulation study. Laplace importance sampling is an interesting alternative that is definitely worth to investigate.

So far, the parameters of the latent-trait MPT model can be estimated with a Bayesian approach only (Heck et al., Reference Heck, Arnold and Arnold2018a; Klauer, Reference Klauer2010). Our simulation study suggests that the maximum likelihood approach introduced here-specifically, the AGHQ approximation method-provides estimates that are at least as good as Bayesian estimates (provided the number of nodes is sufficiently high). For covariance parameters, in particular, both relative estimation bias and coverage rates of confidence intervals appear to be clearly superior for AGHQ-based ML estimates compared to their Bayesian counterparts. These results are important for empirical applications, especially those focusing on parameter correlations. In addition, the maximum likelihood approach has some pragmatic advantages compared to the Bayesian approach, for example, because prior distributions are not required, the convergence of the estimation algorithm is easier to determine, and the asymptotic optimality properties of ML-estimated parameters have been well-known in the statistical literature for decades. This does not mean that we are rejecting a Bayesian approach; rather, we believe that the two approaches complement each other and that there likely are situations in which one approach is preferable to the other (Wasserman, Reference Wasserman2004).

We believe that further simulation research is needed to determine the best-performing method for a variety of situations. For example, on the one hand, estimation of covariances between MPT parameters may turn out to be a specific strength of ML methods. On the other hand, the Bayesian approach may outperform ML methods when the number of participants and/or the number of responses is small. In fact, the results of our simulation study suggest these tentative interpretations. However, our results require replication and definitely need to be extended to other MPT models before they can provide a basis for general recommendations. Furthermore, another interesting question for future research is whether there are circumstances under which the two methods produce discrepant results concerning model comparisons. Finally, we also think that it is interesting to investigate whether a combination of the two estimators (e.g., using the ML estimates as starting values for Bayesian MCMC estimation) has better asymptotic properties compared to each single approach alone.

Additionally, there are a number of further research questions that we think would be worthwhile to study. First, it would be interesting to extend the ML estimator proposed here to handle both random participant and random item effects (Matzke et al., Reference Matzke, Dolan, Batchelder and Wagenmakers2015). Implementing such a crossed random effects MPT model would be a challenging task for future research. Another challenging issue concerns recent extensions of MPT models to include continuous variables such as response times (Heck et al., Reference Heck, Erdfelder and Kieslich2018b; Klauer and Kellen, Reference Klauer and Kellen2018). These generalized MPT models could also be embedded in a hierarchical random effects framework and analyzed using the marginal ML methods proposed here.

A major problem involves convergence problems of marginal ML methods, for example, when the number of responses per participant is very small or when the true variance components are small. It would be interesting to investigate whether a penalized maximum likelihood estimator (Chung et al., Reference Chung, Rabe-Hesketh, Dorie, Gelman and Liu2013) can solve the convergence issues of the ML estimator. Finally, both the Bayesian approach and the ML approach proposed here assume that the random effects are multivariate normally distributed. From a statistical point of view, however, this assumption does not need to be true, and it would be interesting to examine how robust the two approaches are with respect to such a misspecification when the true underlying distribution is actually, for example, a finite mixture distribution (a reasonable assumption for the illustrative example). We note that one can specify arbitrary distributions for the random effects in ML with QMC sampling and that the implementation of these “robust” ML estimators would also be interesting for future research.

In summary, the present article shows how marginal maximum likelihood estimation can be used to obtain the parameters of a random effects MPT model with or without covariates. Using the pair-clustering model as a running example, we found for both simulated and real data that the ML approach is a reasonable alternative to Bayesian hierarchical MPT analyses that are based on the Latent-Trait Model (Heck et al., Reference Heck, Arnold and Arnold2018a; Klauer, Reference Klauer2010). Future research should extend these results to other, more complex MPT models and perhaps also explore alternative numerical methods of marginal ML estimation.

Funding

Open Access funding enabled and organized by Projekt DEAL.

Footnotes

1 We used the numerically approximated Hessian matrix to compute the standard errors of the parameter estimates in a replication. Additional analyses with 100 replications from the simulation condition with 75 participants and 25 responses showed that differences between standard errors based on the numerically approximated Hessian and the analytically computed Hessian were very small. Therefore, we conclude that all unsatisfactory results regarding the coverage rate are not due to the numerical approximation of the Hessian matrix.

The research reported in the present paper was supported in part by a grant from the Deutsche Forschungsgemeinschaft to the Research Training Group Statistical Modeling in Psychology (SMiP, GRK 2277).

Publisher's Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

References

Batchelder, W. H., &Riefer, D. M.(1980). Separation of storage and retrieval factors in free recall of clusterable pairs. Psychological Review, 87, 375397CrossRefGoogle Scholar
Batchelder, W. H.,& Riefer, D. M.(1986). The statistical analysis of a model for storage and retrieval processes in human memory. British Journal of Mathematical and Statistical Psychology, 39, 129149CrossRefGoogle Scholar
Batchelder, W. H., &Riefer, D. M.(1999). Theoretical and empirical review of multinomial processing tree modeling. Psychonomic Bulletin & Review, 6, 5786CrossRefGoogle Scholar
Bates, D., Mächler, M., Bolker, B., &Walker, S.(2015). Fitting linear mixed-effects models using lme4. Journal of Statistical Software,CrossRefGoogle Scholar
Bock, R. D., &Aitken, M.(1981). Marginal maximum likelihood estimation of item parameters: An application of the EM algorithm. Psychometrika,CrossRefGoogle Scholar
Booth, J. G., &Hobert, J. P.(1999). Maximizing generalized linear mixed model likelihoods with an automated Monte Carlo EM algorithm. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 61, 265285CrossRefGoogle Scholar
Chung, Y., Rabe-Hesketh, S., Dorie, V., Gelman, A., &Liu, J.(2013). A nondegenerate penalized likelihood estimator for variance parameters in multilevel models. Psychometrika, 78, 685709CrossRefGoogle ScholarPubMed
Coolin, A., Erdfelder, E., Bernstein, D. M., Thornton, A. E., &Thornton, W. L.(2015). Explaining individual differences in cognitive processes underlying hindsight bias. Psychonomic Bulletin & Review, 22, 328348CrossRefGoogle ScholarPubMed
Coolin, A., Erdfelder, E., Bernstein, D. M., Thornton, A. E., &Thornton, W. L.(2016). Inhibitory control underlies individual differences in older adults’ hindsight bias. Psychology and Aging, 31, 224238CrossRefGoogle ScholarPubMed
Crowther, M. J. (2017). Extended multivariate generalised linear and non-linear mixed effects models. arXiv:1710.02223Google Scholar
Engel, B.(1998). A simple illustration of the failure of PQL, IRREML and APHL as approximate ML methods for mixed models for binary data. Biometrical Journal, 40, 1411543.0.CO;2-A>CrossRefGoogle Scholar
Erdfelder, E., Auer, T., Hilbig, B., Assfalg, A., Moshagen, M., &Nadarevic, L.(2009). Multinomial processing tree models: A review of the literature. Zeitschrift für Psychologie-Journal of Psychology, 217, 108124CrossRefGoogle Scholar
Erdfelder, E., &Buchner, A.(1998). A multinomial processing tree model for separating recollection and reconstruction in hindsight. Journal of Experimental Psychology: Learning, Memory, and Cognition, 24, 387414Google Scholar
Erdfelder, E., Hu, X., Rouder, J., &Wagenmakers, E.(2020). Cognitive psychometrics: Recent contributions in honor of William H. Batchelder (1940–2018). Journal of Mathematical Psychology, 99, 17CrossRefGoogle Scholar
Forero, C. G., Maydeu-Olivares, A., &Gallardo-Pujol, D.(2009). Factor analysis with ordinal indicators: A Monte Carlo study comparing DWLS and ULS estimation. Structural Equation Modeling: A Multidisciplinary Journal, 16, 625641CrossRefGoogle Scholar
González, J., Tuerlinckx, F., De Boeck, P., &Cools, R.(2006). Numerical integration in logistic-normal models. Computational Statistics & Data Analysis, 51, 15351548CrossRefGoogle Scholar
Heck, D., Arnold, N., & Arnold, D. (2018). TreeBUGS: An R package for hierarchical multinomial-processing-tree modeling. Behavior Research Methods, 50, 264–284. https://doi.org/10.3758/s13428-017-0869-7CrossRefGoogle Scholar
Heck, D., Erdfelder, E., & Kieslich, P. (2018). Generalized processing tree models: Jointly modeling discrete and continuous variables. Psychometrika, 83, 893–918. https://doi.org/10.1007/s11336-018-9622-0CrossRefGoogle Scholar
Hoff, P. D.(2009). A first course in Bayesian statistical methods, New York: Springer.CrossRefGoogle Scholar
Hu, X., &Batchelder, W. H.(1994). The statistical analysis of general processing tree models with the EM algorithm. Psychometrika, 59, 2147CrossRefGoogle Scholar
Hütter, M., &Klauer, K. C.(2016). Applying processing trees in social psychology. European Review of Social Psychology, 27, 116159CrossRefGoogle Scholar
Kellen, D., &Klauer, K. C.(2020). Selecting amongst multinomial models: An apologia for normalized maximum likelihood. Journal of Mathematical Psychology, 97CrossRefGoogle Scholar
Klauer, K. C.(2006). Hierarchical multinomial processing tree models: A latent-class approach. Psychometrika, 71, 131CrossRefGoogle Scholar
Klauer, K. C.(2010). Hierarchical multinomial processing tree models: A latent-trait approach. Psychometrika, 75, 7098CrossRefGoogle Scholar
Klauer, K. C., &Kellen, D.(2018). RT-MPTs: Process models for response-time distributions based on multinomial processing trees with applications to recognition memory. Journal of Mathematical Psychology, 82, 111130CrossRefGoogle Scholar
Kuk, AYC(1999). Laplace importance sampling for generalized linear mixed models. Journal of Statistical Computation and Simulation, 63, 143158CrossRefGoogle Scholar
Lee, M. D., &Wagenmakers, E. -J.(2014). Bayesian cognitive modeling: A practical course, Cambridge: Cambridge University Press.CrossRefGoogle Scholar
Lee, M. D., &Webb, M. R.(2005). Modeling individual differences in cognition. Psychonomic Bulletin & Review, 12, 605621CrossRefGoogle ScholarPubMed
Lynch, S. M.(2007). Introduction to applied Bayesian statistics and estimation for social scientists, New York: Springer.CrossRefGoogle Scholar
Matzke, D., Dolan, C., Batchelder, W., &Wagenmakers, E.(2015). Bayesian estimation of multinomial processing tree models with heterogeneity in participants and items. Psychometrika, 80, 205235CrossRefGoogle ScholarPubMed
Meiser, T., &Broder, A.(2002). Memory for multidimensional source information. Journal of Experimental Psychology: Learning, Memory, and Cognition, 28(116–137),1Google ScholarPubMed
Morris, T. P., White, I. R.,& Crowther, M. J.(2019). Using simulation studies to evaluate statistical methods. Statistics in Medicine, 38(11),20742102CrossRefGoogle ScholarPubMed
Nestler, S.(2020). Modeling interindividual differences in latent within-person variation: The confirmatory factor level variability model. British Journal of Mathematical and Statistical Psychology, 73, 452473CrossRefGoogle ScholarPubMed
Nestler, S.(2021). Modeling intraindividual variability in growth with measurement burst designs. Structural Equation Modeling: A Multidisciplinary Journal, 28, 2839CrossRefGoogle Scholar
Nestler, S., &Egloff, B.(2009). Increased or reveresed: The effect of surprise on the hindsight bias depends on the hindsight component. Journal of Experimental Psychology: Learning, Memory, and Cognition, 35, 15391544Google ScholarPubMed
Nestler, S., Egloff, B., Küfner, A., & Back, M. D.(2012). An integrative lens model approach to bias and accuracy in human inferences: The case of hindsight effects and knowledge updating in personality judgments. Journal of Personality and Social Psychology, 103, 698717CrossRefGoogle Scholar
Ormerod, J. T., &Wand, M. P.(2010). Explaining variational approximations. The American Statistician, 64, 140153CrossRefGoogle Scholar
Pinheiro, J. C., &Bates, D. M.(1995). Approximations to the log-likelihood function in the nonlinear mixed-effects model. Journal of Computational Graphics and Statistics, 4, 1235CrossRefGoogle Scholar
Plummer, M. (2003). JAGS: A program for analysis of Bayesian graphical models using Gibbs sampling. Proceedings of the 3rd International Workshop on Distributed Statistical Computing (DSC 2003), Vienna, 20–22 March 2003, p. 1–10.Google Scholar
R Core Team. (2020). R: A language and environment for statistical computing [Computer software manual]. Vienna, Austria. http://www.R-project.org/Google Scholar
Rabe-Hesketh, S., Skrondal, A., &Pickles, A.(2002). Reliable estimation of generalized linear mixed models using adaptive quadrature. Stata Journal, 2, 121CrossRefGoogle Scholar
Reed, TRC, &Cressie, NAC(1988). Goodness of fit statistics for discrete multivariate data, New York: SpringerCrossRefGoogle Scholar
Riefer, D., &Batchelder, W.(1995). A multinomial modeling analysis of the recognition-failure paradigm. Memory & Cognition, 23, 611630CrossRefGoogle ScholarPubMed
Riefer, D., &Batchelder, W. H.(1988). Multinomial modeling and the measurement of cognitive processes. Psychological Review, 95, 318339CrossRefGoogle Scholar
Robert, C., &Casella, G.(2010). Introducing Monte Carlo methods with R, New York: SpringerCrossRefGoogle Scholar
Sarafoglou, A., Kuhlmann, B. G., Aust, F., & Haaf, J. M. (2022). Theory-informed refinement of Bayesian hierarchical MPT modeling. https://doi.org/10.31234/osf.io/kvyt5CrossRefGoogle Scholar
Schilken, E. (1998). Speicherung und Abruf von verbalem Material bei Patienten mit Temporallappenepilepsie. Universität Bonn: Unpublished Diploma Thesis.Google Scholar
Schmidt, O., Erdfelder, E.,& Heck, D. W.(2023). Tutorial on multinomial processing tree modeling: How to develop, test, and extend MPT models. Psychological Methods,CrossRefGoogle Scholar
Singmann, H.,& Kellen, D.(2013). MPTinR: Analysis of multinomial processing tree models in R. Behavior Research Methods, 45(2),560575CrossRefGoogle ScholarPubMed
Singmann, H., Kellen, D., Gronau, Q., Mueller, C., & Bhel, A. S. (2020). MPTinR: Analyze multinomial processing tree models [Computer software manual]. https://CRAN.R-project.org/package=MPTinR (R package version 1.13-0).Google Scholar
Smith, J. B., &Batchelder, W. H.(2010). Beta-MPT: Multinomial processing tree models for addressing individual differences. Journal of Mathematical Psychology, 54, 167183CrossRefGoogle Scholar
Tuerlinckx, F., Rijmen, F., Verbeke, G., &De Boeck, P.(2006). Statistical inference in generalized linear mixed models: A review. British Journal of Mathematical and Statistical Psychology, 59, 225255CrossRefGoogle ScholarPubMed
Wasserman, L.(2004). All of statistics: A concise course in statistical inference, New York: SpringerCrossRefGoogle Scholar
Wickelmaier, F., & Zeileis, A. (2011). Multinomial processing tree models in R. https://www.R-project.org/conferences/useR-2011/TalkSlides/Contributed/17Aug_1705_FocusV_3-Psychometrics_1-Wickelmaier.pdf (Presented at the R User Conference 2011, August 16–18, Coventry, UK).Google Scholar
Wickelmaier, F., & Zeileis, A. (2020). MPT: Multinomial processing tree models [Computer software manual]. https://CRAN.R-project.org/package=mpt (R package version 0.6-2).Google Scholar
Xu, M., &Bellezza, F.(2001). A comparison of the multimemory and detection theories of know and remember recognition judgments. Journal of Experimental Psychology: Learning, Memory, and Cognition, 27, 11971210Google ScholarPubMed
Figure 0

Figure 1. The pair-clustering MPT model (adapted from Riefer & Batchelder, 1988, p. 330, Figure 2). Rectangles indicate stimulus classes (left) and observable responses (right). Rectangles with rounded corners represent latent cognitive states. Parameters attached to the branches indicate transition probabilities from left to right, specifically, storing a word pair as a cluster (c), retrieving a stored cluster in free recall (r), storing and retrieving a word from a non-clustered pair in free recall (u), and storing and retrieving a singleton in free recall (a).

Figure 1

Table 1 Relative frequencies of converged replications (CR) and average computation time (in s), depending on the estimator, the type of ML approximation method, the number of individuals T, and the number of responses N per individual.

Figure 2

Table 2 Relative bias of parameter estimates in percent, depending on the approximation method, the number of individuals T, and the number of responses N per individual.

Figure 3

Table 3 Coverage rate of parameter estimates, depending on the approximation method, the number of individuals T, and the number of responses N per individual.

Figure 4

Table 4 Parameter estimates of the mean structure for the illustrative data example.

Figure 5

Table 5 Parameter estimates of the covariance structure for the illustrative example with real data.