
Deep Generative Modeling for Cognitive Diagnosis via Exploratory DeepCDMs

Published online by Cambridge University Press:  17 December 2025

Jia Liu
Affiliation:
Department of Statistics, Columbia University, USA
Yuqi Gu*
Affiliation:
Department of Statistics, Columbia University, USA

Abstract

Deep generative modeling is a powerful framework in modern machine learning, renowned for its ability to use latent representations to predict and generate complex high-dimensional data. Its advantages have also been recognized in psychometrics. In this article, we substantially extend the deep cognitive diagnostic models (DeepCDMs) in Gu (Psychometrika, 89:118–150, 2024) to challenging exploratory scenarios with deeper structures and all $\mathbf {Q}$-matrices unknown. The exploratory DeepCDMs can be viewed as an adaptation of deep generative models (DGMs) toward diagnostic purposes. Compared to classic DGMs, exploratory DeepCDMs enjoy critical advantages, including identifiability, interpretability, parsimony, and sparsity, which are all necessary for diagnostic modeling. We propose a novel layer-wise expectation–maximization (EM) algorithm for parameter estimation, incorporating layer-wise nonlinear spectral initialization and $L_1$ penalty terms to promote sparsity. From a parameter estimation standpoint, this algorithm reduces sensitivity to initial values and mitigates estimation bias that commonly affects classical approaches for deep latent variable models. Meanwhile, from an algorithmic perspective, our method presents an original layer-wise EM framework, inspired by modular training in DGMs but uniquely designed for the structural and interpretability demands of diagnostic modeling. Extensive simulation studies and real data applications illustrate the effectiveness and efficiency of the proposed method.

Information

Type
Theory and Methods
Creative Commons
Creative Commons License - CC BY
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (https://creativecommons.org/licenses/by/4.0), which permits unrestricted re-use, distribution and reproduction, provided the original article is properly cited.
Copyright
© The Author(s), 2025. Published by Cambridge University Press on behalf of Psychometric Society

1 Introduction

Over the past two decades, cognitive diagnosis models (CDMs) have become increasingly prominent in educational and psychological measurement (e.g., Chen et al., 2015; de la Torre, 2011; Henson et al., 2009; Junker & Sijtsma, 2001; Rupp et al., 2010; von Davier, 2008; von Davier & Lee, 2019). CDMs are a class of psychometric models that use item response data to infer examinees’ mastery status on multiple discrete latent attributes, such as skills, subskills, or diagnostic criteria. In most applications, each attribute is assumed to be binary, representing the presence or absence of a specific cognitive ability or psychological trait. By estimating an individual’s profile across these attributes, CDMs facilitate detailed diagnostic reporting. This information enables practitioners and educators to identify students’ strengths and weaknesses at a granular level, supporting the design of targeted interventions and more individualized feedback.

Recently, interest in adopting higher-order structures for CDMs has grown, aiming to capture interdependencies between the latent attributes (de la Torre & Douglas, 2004; de la Torre & Song, 2009; Templin et al., 2008). Most existing models adopt a single layer of higher-order continuous latent traits to explain correlations among lower-level latent attributes (e.g., Bradshaw & Templin, 2014; de la Torre & Douglas, 2004; Liu et al., 2025; Ma, 2022; Templin et al., 2008). Although these single-layer higher-order models offer an interpretable and simplified representation of attribute dependencies, they may be limited in modeling deeper latent hierarchies or providing more granular cognitive diagnoses. To model deeper-level cognitive processes, the recent deep cognitive diagnostic models (DeepCDMs) proposed by Gu (2024) employ a deep architecture to capture probabilistic relationships across multiple discrete latent layers. DeepCDMs flexibly let each of these layers deliver diagnostic information at a distinct level of granularity. Despite this added depth, DeepCDMs remain parsimonious through compact parameterization and are mathematically identifiable under intuitive conditions.

In this article, we show that the advantages of DeepCDMs can be further realized by generalizing them to a challenging exploratory setting, where the attribute relationships between adjacent layers (i.e., all the $\mathbf{Q}$-matrices) are unknown. The exploratory DeepCDMs can be viewed as an adaptation of deep generative models (DGMs) for psychometrics and educational measurement, with identifiability constraints imposed to serve diagnostic purposes. DeepCDMs share structural similarities with several existing DGMs, such as deep belief networks (DBNs; Hinton et al., 2006) (see Section 2.3 for further discussion). This connection highlights the expressive power of DeepCDMs from the perspective of DGMs. In particular, DeepCDM’s layered architecture defines a hierarchical generative process, suitable for modeling students’ hierarchical and heterogeneous cognitive processes behind data. This structure enables DeepCDMs to approximate highly complex response distributions while maintaining a tractable form for layer-wise learning.

Although conventional DGMs (Hinton et al., 2006; Salakhutdinov & Hinton, 2009) excel at predictive and generative performance, their architectures and estimation algorithms are often heuristically designed and lack rigorous statistical foundations. Importantly, whether the parameters underlying the latent representations are uniquely identifiable is largely unknown for DGMs. This gap motivates us to introduce exploratory DeepCDMs, which are built for diagnostic purposes and are fully identifiable. Exploratory DeepCDMs are identifiable under transparent conditions on the between-layer $\mathbf{Q}$-matrices (Gu, 2024). Identifiability ensures that no two distinct parameter sets yield the same marginal distribution of the observed responses, thereby guaranteeing consistent parameter estimation. As a consequence, DeepCDMs can provide statistically reliable personalized diagnoses of hierarchical latent abilities. The identifiability conditions naturally imply an interpretable shrinking-ladder-shaped sparse deep architecture, enabling the model to capture latent skills ranging from fine-grained (shallower and closer to the response data layer) to coarse-grained (deeper and higher-order). Statistically, such architectures also induce parsimonious parameterizations, crucial for reflecting test design constraints in real-world educational assessments.

Parameter estimation is a challenging issue for exploratory DeepCDMs, as the parameters and $\mathbf{Q}$-matrices across all layers need to be estimated. The most commonly used estimation methods for related hierarchical models are the Markov chain Monte Carlo (MCMC; Robert & Casella, 2004) method and the expectation–maximization (EM) algorithm (Dempster et al., 1977). Gu (2024) employed MCMC to estimate confirmatory DeepCDMs with known $\mathbf{Q}$-matrices. For exploratory DeepCDMs, MCMC can, in principle, be developed by incorporating additional sampling steps for the $\mathbf{Q}$-matrix entries. However, when the $\mathbf{Q}$-matrices are unknown and the latent structure involves more than two layers—as in the settings considered in this work—significant practical challenges, such as initialization sensitivity, slower convergence, MCMC mixing difficulties, and increased computational cost, may limit its scalability and efficiency. The classical EM algorithm, as explained later in Section 3.4, though faster, suffers from (a) extreme sensitivity to initialization, since all parameters must be initialized simultaneously in a highly nonconvex, multi-layer parameter space, and (b) cyclic bias accumulation, where errors in one layer’s estimation propagate through both the E- and M-steps into other layers over successive iterations. On the other hand, although many algorithms have been proposed for general DGMs in machine learning (e.g., Hinton et al., 2006; Le Roux & Bengio, 2008; Ranganath et al., 2015; Salakhutdinov & Hinton, 2009), they are not directly applicable to DeepCDMs, as their typically overparameterized architectures do not satisfy the parsimony and identifiability requirements of diagnostic modeling and are not designed to promote sparsity or interpretability.

In this work, we propose a novel layer-wise EM algorithm for regularized maximum likelihood estimation with a layer-wise $L_1$ penalty in exploratory DeepCDMs. The algorithm estimates parameters and $\mathbf{Q}$-matrices sequentially, starting from the bottom layer, where a one-layer EM algorithm is used to estimate both the between-layer coefficients and the proportion parameters of the latent attributes. These proportion parameters are then used to generate pseudo-samples of latent attributes, which serve as input for the next layer. This process continues one layer at a time until all layers are estimated. The strategy is not only intuitive but also grounded in the model’s generative structure: marginalizing out deeper layers naturally yields a standard one-layer CDM at the bottom, justifying the use of a one-layer EM for its estimation. In deeper layers, each step builds on the most informative signals from the previous one—either as estimated distributions or generated pseudo-observations—thus respecting the model’s hierarchical nature. Interestingly, the identifiability proof shows that identifiability can be examined and established in a layer-by-layer manner, thanks to the formulation as a directed graphical model and the discrete nature of the latent attributes. This theoretical insight also supports the design of our proposed algorithm and provides a solid foundation for treating imputed attributes as if observed in each step. Additionally, our layer-wise estimation strategy conceptually aligns with the modular training principles widely used in deep generative modeling, where complex models are progressively trained through simpler, localized components. A more detailed discussion of this connection is provided in Section 3.5.2.

Initialization plays a crucial role in EM-based estimation, particularly in exploratory settings where the $\mathbf{Q}$-matrices are unknown and must be estimated. In such cases, the parameter space becomes more complex, and a well-informed initialization can greatly enhance convergence stability and estimation quality. To this end, we adopt a fast, non-iterative procedure based on universal singular value thresholding (USVT), which yields reliable starting values with theoretical guarantees under certain conditions (Chatterjee, 2015; Zhang et al., 2020). The initialization is conducted in a sequential, layer-by-layer manner. For each layer, the input matrix is first denoised via truncated SVD and linearized by applying the inverse link function; a second SVD followed by a Varimax rotation is then applied to recover a sparse coefficient matrix, promoting sparsity and identifiability. We adopt a penalized estimation framework in which all $\mathbf{Q}$-matrices are treated as unknown and estimated from data. At each layer, $\mathbf{Q}$-matrix estimation is framed as a latent variable selection problem, with an $L_1$ penalty imposed on the coefficient parameters to encourage sparsity. The M-step of each layer’s EM update is solved via cyclical coordinate descent (Friedman et al., 2010; Tay et al., 2023), efficiently maximizing the penalized log-likelihood. Additionally, as discussed in Section 3.6, although the algorithm is developed under an exploratory framework, it can be readily adapted for confirmatory applications. Our extensive simulation studies demonstrate the excellent performance of the proposed layer-wise EM in challenging scenarios involving three latent layers. Finally, we illustrate the practical utility of exploratory DeepCDMs using data from the 2019 Trends in International Mathematics and Science Study (TIMSS) assessment.

The remainder of this article is organized as follows. Section 2 introduces the exploratory DeepCDMs framework, discusses its formulation as a DGM, and addresses the identifiability issues. Section 3 presents an efficient layer-wise algorithm for parameter estimation. Section 4 presents simulation studies to evaluate the performance of the proposed layer-wise EM algorithm for exploratory DeepCDMs under various measurement models. Section 5 applies the proposed method to real data from the TIMSS 2019 assessment. Finally, Section 6 gives concluding remarks.

2 Exploratory DeepCDMs framework

In this section, we present the exploratory DeepCDM framework. We build on the concepts of confirmatory DeepCDMs in Gu (2024) and provide additional details for the exploratory setting. We then discuss how exploratory DeepCDMs adapt DGMs for psychometrics, highlighting their architectural similarities and the additional structural constraints that facilitate diagnostic feedback. Finally, we discuss the theoretical identifiability of the model, with formal results provided in the Supplementary Material.

2.1 Model setup

The DeepCDM framework is developed to address the need for diagnostic modeling at multiple granularities. It is defined using the terminology of probabilistic graphical models (Koller & Friedman, 2009; Wainwright & Jordan, 2008), particularly directed graphical models. These models employ graphs to compactly represent the joint distribution of high-dimensional random variables, where nodes correspond to variables and edges encode their direct probabilistic relationships.

We first review the definition of a directed acyclic graph (DAG), also referred to as a Bayesian network (Pearl, 1988). In a DAG, every edge has a direction, and no directed cycles are allowed. Consider M random variables, $X_1, \ldots, X_M$, which correspond to M nodes in the graph. If a directed edge goes from $X_\ell$ to $X_m$, we say that $X_\ell$ is a parent of $X_m$, and $X_m$ is a child of $X_\ell$. Let ${\mathrm{pa}}(m) \subseteq \{1, \dots, M\}$ denote the index set of all parents of $X_m$. Define $\mathbb{P}(X_m \mid X_{{\mathrm{pa}}(m)})$ as the conditional distribution of $X_m$ given its parents $X_{{\mathrm{pa}}(m)}$. Based on this DAG structure, the joint distribution of $X_1, \ldots, X_M$ factorizes as follows:

(1) $$ \begin{align} \mathbb{P}(X_1, \ldots, X_M) = \prod_{m=1}^{M} \mathbb{P}(X_m \mid X_{{\mathrm{pa}}(m)}). \end{align} $$

We now present the general DeepCDM formulation. For a DeepCDM with D latent layers, we denote the d-th latent layer as $\mathbf A^{(d)} = (A^{(d)}_1, \dots, A^{(d)}_{K_d})$ for each $d = 1, 2, \dots, D$, where a larger d corresponds to a deeper layer. In DeepCDMs, all edges are directed top-down and occur only between adjacent layers, defining a generative process from high-level latent variables to observed responses. Specifically, the bottom layer consists of the observed response variables for the J items, denoted as $\mathbf R = (R_1, \dots, R_J)$. The first latent layer, right above the bottom layer, captures the most fine-grained latent attributes, represented as $\mathbf A^{(1)}$. These are generated from the second latent layer $\mathbf A^{(2)}$, and the process continues recursively up to the deepest layer $\mathbf A^{(D)}$. Figure 1 gives an example of a DeepCDM with three latent layers ($D=3$). Given the variables in the layer right above, the variables within each layer of the DeepCDM are conditionally independent. This structure intuitively models how more specific latent skills are successively derived from more general, higher-level latent “meta-skills.” A natural assumption, supported by the model’s identifiability conditions, is that deeper layers should consist of fewer latent variables, i.e., $K_1 > K_2 > \dots > K_D$ (see Theorems 1, 2, and 3 in the Supplementary Material for detailed identifiability conditions).

Figure 1 A ladder-shaped three-latent-layer DeepCDM.

Note: Gray nodes are observed variables, and white nodes are latent ones. Multiple layers of binary latent variables $\mathbf A^{(1)}$ , $\mathbf A^{(2)}$ , and $\mathbf A^{(3)}$ successively generate the observed binary responses $\mathbf R$ . Binary matrices $\mathbf Q^{(1)}$ , $\mathbf Q^{(2)}$ , and $\mathbf Q^{(3)}$ encode the sparse connection patterns between adjacent layers in the graph.

In traditional CDMs with a single layer of K latent attributes, the $\mathbf Q$-matrix (Tatsuoka, 1983) is a fundamental component that specifies the relationship between items and the latent attributes. Specifically, $\mathbf Q = (q_{j,k})_{J \times K}$, where $q_{j,k} = 1$ if item j measures latent attribute k, and $q_{j,k} = 0$ otherwise. Since the edges in a graphical model reflect direct statistical dependencies, $q_{j,k} = 1$ or $0$ also conveys whether the k-th latent node is a parent of the j-th observed node. Consequently, the $\mathbf Q$-matrix encodes the bipartite graph structure between the observed and latent layers. Extending this idea to DeepCDMs with D latent layers requires D matrices, denoted as $\mathbf Q^{(1)}, \mathbf Q^{(2)}, \dots, \mathbf Q^{(D)}$, to capture the dependence relationships between any two adjacent layers. Specifically, $\mathbf Q^{(1)} = \left(q^{(1)}_{j,k}\right)$ has size $J \times K_1$, similar to the single $\mathbf Q$-matrix in a traditional CDM, and describes the graph between the observed data layer and the shallowest latent layer. For $d = 2, \dots, D$, the matrix $\mathbf Q^{(d)} = \left(q^{(d)}_{k,\ell}\right)$ has size $K_{d-1} \times K_d$ and represents the dependencies between latent variables in the $(d-1)$th and dth latent layers. The entry $q^{(d)}_{k,\ell} = 1$ or $0$ indicates whether the latent variable $A^{(d)}_{\ell}$ is a parent of $A^{(d-1)}_k$. In this article, we consider the challenging setting of exploratory DeepCDMs, where all $\mathbf Q$-matrices are unknown and need to be estimated.
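As a concrete illustration of these dimensions, the short sketch below sets up the between-layer $\mathbf Q$-matrices for a three-latent-layer DeepCDM. The values of J and $K_d$ and the random binary entries are placeholders chosen only to show the shrinking-ladder shapes, not a valid test design.

```python
import numpy as np

# Hypothetical shrinking-ladder dimensions: J = 20 items, K_1 = 5, K_2 = 3, K_3 = 2.
J, K = 20, [5, 3, 2]
dims = [J] + K  # layer sizes from the observed layer (d = 0) down to the deepest layer

rng = np.random.default_rng(0)
# Q^(d) has size K_{d-1} x K_d and holds binary parent indicators between adjacent layers.
# Random entries here are placeholders; a real design must satisfy the identifiability conditions.
Q = [rng.integers(0, 2, size=(dims[d - 1], dims[d])) for d in range(1, len(dims))]

for d, Qd in enumerate(Q, start=1):
    print(f"Q^({d}) shape: {Qd.shape}")  # (20, 5), (5, 3), (3, 2)
```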

Based on the general definition of DAGs in (1) and the DeepCDM setup, the joint distribution of all variables, including the latent ones, is given by

(2) $$ \begin{align} \mathbb{P}(\mathbf R, \mathbf A^{(1)}, \dots, \mathbf A^{(D)}) &= \mathbb{P}(\mathbf R \mid \mathbf A^{(1)}, \mathbf Q^{(1)}) \cdot \prod_{d=2}^{D} \mathbb{P}(\mathbf A^{(d-1)} \mid \mathbf A^{(d)}, \mathbf Q^{(d)}) \cdot \mathbb{P}(\mathbf A^{(D)}), \end{align} $$
(3) $$ \begin{align} \text{where} \quad \mathbb{P}(\mathbf R = \boldsymbol r \mid \mathbf A^{(1)}, \mathbf Q^{(1)}) &= \prod_{j=1}^{J} \mathbb P^{\mathrm{CDM}}(R_j = r_j \mid \mathbf A^{(1)}, \mathbf Q^{(1)}), \quad \text{and} \end{align} $$
(4) $$ \begin{align} \mathbb{P}(\mathbf A^{(d-1)} = \boldsymbol \alpha^{(d-1)} \mid \mathbf A^{(d)}, \mathbf Q^{(d)}) &= \prod_{k=1}^{K_{d-1}} \mathbb P^{\mathrm{CDM}}(A^{(d-1)}_k = \alpha^{(d-1)}_k \mid \mathbf A^{(d)}, \mathbf Q^{(d)}), \end{align} $$

where $\boldsymbol r$ represents an observed response pattern and $\boldsymbol \alpha ^{(d-1)}$ represents a latent pattern for the $(d-1)$ th latent layer. The superscript “CDM” in the conditional distributions of (3) and (4) indicates that the conditional distribution within each layer of the generative process adheres to a CDM. By marginalizing out all latent layers $\mathbf A^{(1)}, \dots , \mathbf A^{(D)}$ in (2), we obtain the marginal distribution of the observed response vector $\mathbf R$ :

(5) $$ \begin{align} \mathbb{P}(\mathbf R = \boldsymbol r) &= \sum_{\boldsymbol \alpha^{(1)}} \cdots \sum_{\boldsymbol \alpha^{(D)}} \mathbb{P}(\mathbf R = \boldsymbol r, \mathbf A^{(1)} = \boldsymbol \alpha^{(1)}, \dots, \mathbf A^{(D)} = \boldsymbol \alpha^{(D)}). \end{align} $$

This work focuses on binary observed and latent variables, where $\boldsymbol r \in \{0,1\}^J$ and $\boldsymbol \alpha ^{(d)} \in \{0,1\}^{K_d}$ . Each observed variable reflects whether a response is correct or incorrect, while each latent variable indicates the presence or absence of a specific skill or higher-level attribute. Similar to traditional CDMs, the latent variables $\mathbf A^{(D)}$ in the deepest layer of a DeepCDM are modeled using a categorical distribution:

(6) $$ \begin{align} \mathbb P(\mathbf A^{(D)} = \boldsymbol \alpha_{\ell}) &= \pi^{(D)}_{\boldsymbol \alpha_{\ell}}, \quad \forall \boldsymbol \alpha_{\ell} \in \{0,1\}^{K_D}. \end{align} $$

Here, $K_D > 1$. The proportion parameters $\pi^{(D)}_{\boldsymbol \alpha_{\ell}}$ are subject to the constraint $\sum_{\boldsymbol \alpha_{\ell} \in \{0,1\}^{K_D}} \pi^{(D)}_{\boldsymbol \alpha_{\ell}} = 1$. With this, we complete the specification of a general DeepCDM.
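To make the top-down generative process in (2)–(6) concrete, the following minimal sketch draws samples from a DeepCDM by ancestral sampling, proceeding from the deepest layer to the observed responses. The per-layer conditional probabilities are passed in as generic functions, so any of the layerwise CDMs introduced in Section 2.2 could be plugged in; all names and arguments here are illustrative assumptions, not part of the paper's implementation.

```python
import numpy as np

def sample_deepcdm(n, pi_deep, layer_prob_fns, rng=None):
    """Ancestral sampling from a DeepCDM.

    pi_deep        : dict mapping each deepest-layer pattern (tuple of 0/1) to its probability,
                     i.e., the categorical distribution in Equation (6).
    layer_prob_fns : list [f_D, ..., f_1], where f_d(parent_pattern) returns the vector of
                     Bernoulli probabilities P(A^{(d-1)}_k = 1 | A^{(d)} = parent_pattern);
                     the last entry generates the observed responses R (layer d = 0).
    """
    rng = np.random.default_rng() if rng is None else rng
    patterns = list(pi_deep.keys())
    probs = np.array([pi_deep[p] for p in patterns])
    samples = []
    for _ in range(n):
        # Draw the deepest latent layer A^{(D)} from its categorical distribution.
        layer = np.array(patterns[rng.choice(len(patterns), p=probs)])
        # Generate each shallower layer given its parent layer (Equations (3)-(4)).
        for f in layer_prob_fns:
            layer = rng.binomial(1, f(layer))
        samples.append(layer)  # the final layer is the observed response vector R
    return np.array(samples)
```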

We next discuss the distinction between the proposed DeepCDM and the attribute hierarchy method (AHM; Gierl et al., 2007; Templin & Bradshaw, 2014). Both approaches aim to capture dependencies among latent attributes. However, they differ fundamentally in how such dependencies are represented and inferred. In the AHM framework, hierarchies are defined as deterministic prerequisite relations among attributes within a single latent layer—for instance, mastering attribute B is a prerequisite for mastering attribute A. These relations constrain admissible attribute profiles to those consistent with the specified hierarchy, embedding learning sequences informed by substantive or curricular theory. In contrast, DeepCDM introduces a probabilistic hierarchy across multiple latent layers. Rather than specifying logical prerequisites within one layer, it models how attributes at one layer are statistically explained by latent traits at a deeper and higher-order layer. Higher-order traits capture shared variance among lower-level attributes, enabling a data-driven hierarchical representation learned directly from data. Overall, these distinctions indicate that AHM and DeepCDM reflect different modeling perspectives and are therefore suited to different types of applications.

2.2 Specific examples of DeepCDMs

This section presents concrete examples of DeepCDMs that fall under the general framework outlined in Section 2.1. For notational convenience, we also denote the observed response layer $\mathbf R$ as $\mathbf A^{(0)}$ , enabling a unified expression for the layerwise conditional distributions: $\mathbb {P}(\mathbf A^{(d-1)} \mid \mathbf A^{(d)}, \mathbf Q^{(d)})$ for $d = 1, \dots , D$ . We then define specific DeepCDM variants according to the diagnostic model adopted for each layerwise conditional.

Example 1 (Main-effect DeepCDMs).

We use the term “Main-effect DeepCDMs” to refer broadly to DeepCDMs in which each layerwise conditional distribution follows a main-effect diagnostic model. In this setup, the probability that $A^{(d-1)}_j = 1$ is governed by the main effects of its parent attributes, modeled via a link function $f(\cdot )$ :

(7) $$ \begin{align} \mathbb{P}(A^{(d-1)}_j = 1 \mid \mathbf A^{(d)} = \boldsymbol \alpha, \mathbf Q^{(d)}, \boldsymbol \beta^{(d)}) = f\Big( \beta^{(d)}_{j,0} + {\sum}_{k=1}^{K_d} \beta^{(d)}_{j,k} \left\{q^{(d)}_{j,k} \alpha_k \right\} \Big). \end{align} $$

Here, $\beta_{j,k}^{(d)}$ is nonzero only when $q_{j,k}^{(d)} = 1$. When f is the identity function, Equation (7) reduces to the additive CDM (ACDM; de la Torre, 2011). If f is the inverse logit function, Equation (7) gives a logistic linear model (LLM; Maris, 1999).
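The following is a minimal sketch of the main-effect layerwise conditional in (7) under either a logit (LLM-type) or identity (ACDM-type) link; the parameter values are placeholders for illustration only.

```python
import numpy as np

def main_effect_prob(alpha, q_j, beta_j, beta_j0, link="logit"):
    """P(A^{(d-1)}_j = 1 | A^{(d)} = alpha) under the main-effect model in Equation (7).

    alpha   : binary parent attribute pattern, shape (K_d,)
    q_j     : j-th row of Q^{(d)}, shape (K_d,)
    beta_j  : main-effect coefficients for node j, shape (K_d,)
    beta_j0 : intercept
    """
    eta = beta_j0 + np.sum(beta_j * q_j * alpha)
    if link == "logit":          # LLM-type layer
        return 1.0 / (1.0 + np.exp(-eta))
    return eta                   # identity link: ACDM-type layer (eta must lie in [0, 1])

# Illustrative values: node j requires attributes 1 and 3 out of K_d = 3.
print(main_effect_prob(np.array([1, 0, 1]), np.array([1, 0, 1]),
                       np.array([1.5, 0.0, 1.0]), beta_j0=-2.0))
```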

Example 2 (All-effect DeepCDMs).

We refer to DeepCDMs in which the layerwise conditionals follow an all-effect diagnostic model as “All-effect DeepCDMs.” In an all-effect diagnostic model, the probability of $A^{(d-1)}_j = 1$ depends on both the main effects and all possible interaction effects of the parent attributes:

(8) $$ \begin{align} & \mathbb{P}(A^{(d-1)}_j = 1 \mid \mathbf A^{(d)} = \boldsymbol \alpha, \mathbf Q^{(d)}, \boldsymbol \beta^{(d)}) = f\Big( \beta^{(d)}_{j,0} + {\sum}_{k=1}^{K_d} \beta^{(d)}_{j,k} \left\{q^{(d)}_{j,k} \alpha_k \right\} \\ \notag &\qquad\qquad\qquad + {\sum}_{1 \leq k_1 < k_2 \leq K_d} \beta^{(d)}_{j,k_1 k_2} \left\{q^{(d)}_{j,k_1} \alpha_{k_1}\right\} \left\{q^{(d)}_{j,k_2} \alpha_{k_2}\right\} + \cdots + \beta^{(d)}_{j,12 \dots K_d} {\prod}_{k=1}^{K_d} \left\{q^{(d)}_{j,k} \alpha_k\right\} \Big). \end{align} $$

Similar to the main-effect model, not all $\beta$-coefficients above are needed. If $\boldsymbol q^{(d)}_j$, the j-th row of $\mathbf{Q}^{(d)}$, contains $K_j$ entries of “1,” then $2^{K_j}$ parameters are required in (8). With the identity link function, (8) defines the generalized DINA model (GDINA; de la Torre, 2011), while the inverse logit function yields the log-linear CDM (LCDM; Henson et al., 2009).
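For comparison with the main-effect sketch above, here is a minimal sketch of the all-effect conditional in (8), in which all interaction terms among the required attributes are included; the coefficient dictionary and its values are illustrative assumptions.

```python
from itertools import combinations
import numpy as np

def all_effect_prob(alpha, q_j, beta_j, link="logit"):
    """P(A^{(d-1)}_j = 1 | A^{(d)} = alpha) under the all-effect model in Equation (8).

    beta_j maps each subset of required attribute indices (a tuple, () for the intercept)
    to its coefficient; only 2^{K_j} coefficients are needed, where K_j = #required attributes.
    """
    required = [k for k, q in enumerate(q_j) if q == 1]
    eta = beta_j.get((), 0.0)  # intercept beta_{j,0}
    for order in range(1, len(required) + 1):
        for subset in combinations(required, order):
            # The interaction term is active only if all attributes in the subset are mastered.
            eta += beta_j.get(subset, 0.0) * np.prod([alpha[k] for k in subset])
    return 1.0 / (1.0 + np.exp(-eta)) if link == "logit" else eta

# Illustrative: node j requires attributes 0 and 2; main effects plus one two-way interaction.
beta_j = {(): -2.0, (0,): 1.0, (2,): 1.0, (0, 2): 1.5}
print(all_effect_prob(np.array([1, 0, 1]), np.array([1, 0, 1]), beta_j))
```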

Example 3 (DeepDINA).

The DINA model can be regarded as a special case of the all-effect CDM, where only the highest-order interaction term among the required attributes is retained, and all lower-order effects are constrained to zero:

(9) $$ \begin{align} \mathbb{P}^{\mathrm{DINA}}(A^{(d-1)}_j = 1 \mid \mathbf A^{(d)} = \boldsymbol \alpha, \mathbf Q^{(d)}, \boldsymbol \beta^{(d)}) = f\Big(\beta_{j,0}^{(d)} + \beta_{j,\mathcal{K}^{(d)}_j} \prod_{k \in {\mathcal{K}^{(d)}_j}} q^{(d)}_{j,k}\alpha_k \Big), \end{align} $$

where ${\mathcal{K}^{(d)}_j} = \left\{k \in [K_d]:\, q_{jk}^{(d)} = 1\right\}$ denotes the set of attributes measured by item j. The model assumes that students are capable of correctly answering an item only if they master all required attributes for that item. So, $\beta_{j,\mathcal{K}^{(d)}_j}$ is the only non-zero coefficient for item j in layer d.

One can also specify a DeepDINO model, a specific type of DeepCDM where the DINO model is used to model each latent layer (Gu, 2024). Due to the duality between DINA and DINO, the identifiability and algorithm applicable to DeepDINA are also applicable to DeepDINO. Therefore, we do not introduce it here and refer readers to Gu (2024) for details.
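Returning to the DINA layer in (9), a minimal sketch of its conditional probability follows; the coefficient values are placeholders.

```python
import numpy as np

def dina_prob(alpha, q_j, beta_j0, beta_j_top, link="logit"):
    """P(A^{(d-1)}_j = 1 | A^{(d)} = alpha) under the DINA layer in Equation (9).

    Only the intercept and the highest-order interaction over the required attributes
    (indexed by K^{(d)}_j = {k : q_{jk} = 1}) carry non-zero coefficients.
    """
    ideal = int(np.all(alpha[q_j == 1] == 1))  # 1 iff all required attributes are mastered
    eta = beta_j0 + beta_j_top * ideal
    return 1.0 / (1.0 + np.exp(-eta)) if link == "logit" else eta

# Illustrative: with a logit link, beta_j0 controls the baseline (guessing) level and
# beta_j0 + beta_j_top the success probability for masters of all required attributes.
print(dina_prob(np.array([1, 1, 0]), np.array([1, 1, 0]), beta_j0=-2.0, beta_j_top=4.0))
```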

Example 4 (Hybrid DeepCDMs).

A key strength of the DeepCDM framework is its flexibility in allowing different diagnostic models (e.g., DINA, main-effect, and all-effect) to be applied across various layers. These are referred to as Hybrid DeepCDMs, which strike a balance between model complexity and parsimony, offering flexibility in designing diagnostic models based on specific needs. For instance, in practical scenarios, the most general all-effect diagnostic model may be used at the bottom layer to model how fine-grained attributes affect the observed responses, while simpler models like main-effect or DINA could be applied in deeper layers to enhance interpretability and reduce complexity.

As demonstrated earlier, only particular coefficients in the generating DeepCDM, determined by the $\mathbf Q$-matrices and the specified measurement models, should be non-zero. However, since all $\mathbf Q$-matrices, $\mathbf Q^{(d)}, d = 1, \dots, D$, are unknown, the sparsity pattern of the coefficient vectors is also unknown. Therefore, we treat all coefficients in the model as unknown and estimate them by maximizing a regularized log-likelihood. The $\mathbf Q$-matrices can then be inferred by identifying the non-zero coefficients in $\boldsymbol{\beta}^{(d)}$, $d=1,\ldots,D$. We defer the details of the mechanism for identifying the entries $q_{jk}^{(d)}$ to Section 3.2.

2.3 DeepCDMs as deep generative models

As previously mentioned, the DeepCDM framework can be viewed as an adaptation of DGMs for psychometrics and educational measurement, where additional structural constraints are introduced to enable diagnostic feedback. Exploratory DeepCDMs share architectural similarities with several existing DGMs. For example, when the activation function $f$ is defined as the inverse logit, DeepCDMs resemble DBNs (Hinton et al., 2006) with binary-valued hidden units. However, a key structural difference lies at the top of the network: DBNs assume an undirected graph between the top two layers—forming a restricted Boltzmann machine (RBM)—while DeepCDMs adopt a fully directed, top-down architecture across all layers. This design enables DBNs to use a heuristic greedy layer-wise pretraining procedure based on contrastive divergence, a technique specific to training undirected models, such as RBMs (Hinton & Salakhutdinov, 2006; Hinton et al., 2006). Consequently, such training strategies are not directly applicable to DeepCDMs due to their directed nature, which is more interpretable for modeling hierarchical skill generation.

DeepCDMs also share a top-down generative structure with deep exponential families (DEFs; Ranganath et al., 2015), an unsupervised framework using exponential family distributions to model each layer’s conditional distribution. DEFs aim to capture compositional semantics through hierarchical latent representations. However, DEFs rely on black-box variational inference with neural network-based posterior approximations, which prevents recovery of interpretable parameters, such as $\mathbf{Q}$-matrices, and thus cannot provide individualized diagnostic feedback, a central aspect of cognitive diagnosis.

Another related framework is deep discrete encoders (DDEs; Lee & Gu, 2025), a DGM framework designed for rich data types with discrete latent layers. While DDEs and DeepCDMs share architectural and identifiability properties, their goals differ. DDEs aim to address machine learning concerns like overparameterization and lack of interpretability, constructing general-purpose identifiable DGMs. In contrast, DeepCDMs are specifically designed for psychometrics, with each adjacent pair of latent layers constituting a CDM. This structure allows for diagnostic-specific measurement assumptions, addressing varied diagnostic goals and enhancing usability in real-world assessment settings.

Taken together, these distinctions highlight the unique features of exploratory DeepCDMs over existing DGMs. While most DGMs prioritize data generation or predictive performance and typically lack identifiability and sparsity constraints, DeepCDMs refocus the modeling effort on offering reliable individualized diagnostic feedback and discovering hierarchical latent skill structures. Furthermore, by enforcing sparsity in the coefficient matrices to reflect item–attribute relationships, DeepCDMs provide valuable insights into test design—an essential feature for practical use in educational and psychological assessment, often overlooked in classical DGM frameworks.

2.4 Identifiability

As noted earlier, a key strength of DeepCDMs lies in their formal identifiability guarantees, which apply to both confirmatory and exploratory settings (Gu, 2024). These results are detailed in the Supplementary Material. In brief, the identification conditions impose explicit structural constraints on the between-layer $\mathbf Q$-matrices, offering practical guidance for model design and implementation. Although the specific conditions vary across diagnostic models, they consistently require an increasingly shrinking latent structure for deeper layers. That is, the number of latent variables typically decreases with depth, often subject to constraints such as $K_d > c \cdot K_{d+1}$ for some constant $c > 1$ depending on the model. This hierarchy reflects the principle of statistical parsimony in DeepCDMs. For instance, in a two-layer DeepDINA model with $K_1=7$ and $K_2=2$, the number of nonzero parameters is $2K_1 + 2^{K_2} - 1 = 17$, compared to $2^{K_1} - 1 = 127$ in a saturated attribute model without higher-order latent structure. Such substantial reductions in complexity make DeepCDMs especially attractive for applications with fine-grained latent attributes and limited sample sizes. In exploratory settings, while all parameters must be estimated, the identifiability conditions naturally promote sparsity in the true generating model, facilitating parameter recovery and interpretation of the latent attributes.
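The parsimony comparison above can be checked directly. The helper below counts the free parameters governing the attribute distribution in a two-layer DeepDINA versus a saturated attribute model, using the counting formulas stated in the text; the function names are illustrative.

```python
def deep_dina_attribute_params(K1, K2):
    # Each of the K1 first-layer attributes gets a DINA intercept and one interaction
    # coefficient (2 * K1), plus the categorical distribution over the 2^{K2} deepest-layer
    # patterns (2^{K2} - 1 free proportions).
    return 2 * K1 + 2 ** K2 - 1

def saturated_attribute_params(K1):
    # A saturated model places a free proportion on every pattern in {0,1}^{K1}.
    return 2 ** K1 - 1

print(deep_dina_attribute_params(7, 2))   # 17
print(saturated_attribute_params(7))      # 127
```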

A central insight underlying these proofs is that the identifiability of DeepCDMs can be established in a layer-by-layer fashion, proceeding from the bottom (shallow) layer to the top (deepest) layer. This approach is justified by the directed nature of the graphical model and the discreteness of latent variables. Two core ideas facilitate this stepwise identifiability. First, in a multi-layer directed graphical model with only top-down connections between adjacent layers, marginalizing out deeper latent layers yields a restricted latent class model (RLCM; Gu & Xu, 2020; Xu, 2017). Once the distribution of the shallowest latent layer is identified through this RLCM, it can be treated theoretically as if observed for identifying the next deeper layer. Second, identifiability in RLCMs holds for any marginal distribution of latent attributes, provided the $\mathbf Q$-matrix meets specific conditions. This property allows identifiability to propagate upward through layers, even when deeper latent variables introduce complex dependencies.

3 Proposed estimation algorithms

In this section, we propose a novel layer-wise EM algorithm for estimating exploratory DeepCDMs. We begin by introducing some notation. Let $\mathbf R_{1:N}$ denote the response data matrix of size $N \times J$, representing the observed responses of N students to J items. Let $\boldsymbol{\beta} = (\boldsymbol{\beta}^{(1)}, \ldots, \boldsymbol{\beta}^{(D)})$ and $\mathbf Q = (\mathbf Q^{(1)}, \ldots, \mathbf Q^{(D)})$ denote the sets of continuous parameters and $\mathbf{Q}$-matrices across all layers, respectively. Let $\boldsymbol{\pi}^{(d)}$ denote the vector composed of the probability masses $\pi^{(d)}_{\boldsymbol \alpha^{(d)}} = \mathbb{P}(\mathbf A^{(d)} = \boldsymbol \alpha^{(d)})$ for all $\boldsymbol \alpha^{(d)} \in \{0,1\}^{K_d}$, $d = 1, 2, \ldots, D$. The parameters to be estimated include all continuous parameters in $\boldsymbol{\beta}$, all $\mathbf{Q}$-matrices in $\mathbf Q$, and the proportion parameter $\boldsymbol{\pi}^{(D)}$ for the deepest latent layer. Directly maximizing the marginal log-likelihood to estimate the $\mathbf{Q}$-matrices is computationally prohibitive, even when the number of layers D and the dimensionalities $K_d$ ($d = 0, 1, \ldots, D$; $K_0 = J$) are of moderate size. This challenge arises from the need to search over an enormous space of possible $\mathbf{Q}$-matrix configurations—specifically, $\prod_{d=1}^D 2^{K_{d-1} \cdot K_d}$ combinations—each requiring evaluation of a profile likelihood. To avoid this combinatorial burden, we instead frame $\mathbf{Q}$-matrix estimation as a structured model selection task, addressed via a regularized likelihood approach that encourages sparsity in the parameter space (Chen et al., 2015).

The regularized marginal log-likelihood is given by

(10) $$ \begin{align} \ell(\boldsymbol{\beta}, \boldsymbol{\pi}^{(D)}, \mathbf Q \mid \mathbf R_{1:N}) &= \sum_{i=1}^N \log \Bigg\{ \sum_{\boldsymbol \alpha^{(1)}} \cdots \sum_{\boldsymbol \alpha^{(D)}} \Big[ \mathbb{P}(\mathbf R_i \mid \mathbf A^{(1)}, \mathbf Q^{(1)}, \boldsymbol{\beta}^{(1)}) \nonumber \\ &\quad \cdot \prod_{d=2}^{D} \mathbb{P}(\mathbf A^{(d-1)} \mid \mathbf A^{(d)}, \mathbf Q^{(d)}, \boldsymbol{\beta}^{(d)}) \cdot \mathbb{P}(\mathbf A^{(D)}; \boldsymbol{\pi}^{(D)}) \Big] \Bigg\} - N \cdot P_{\boldsymbol{s}}(\boldsymbol{\beta}), \end{align} $$

where the $L_1$ penalty function $ P_{\boldsymbol {s}}(\boldsymbol {\beta }) $ enforces sparsity across layers and is defined as

(11) $$ \begin{align} P_{\boldsymbol{s}}(\boldsymbol{\beta}) = \sum_{d=1}^{D} P_{s}(\boldsymbol{\beta}^{(d)}) = \sum_{d=1}^{D} s_d \sum_{\beta_{j,k}^{(d)} \in \boldsymbol{\beta}^{(d)}} \left| \beta_{j,k}^{(d)} \right|. \end{align} $$

Here, $ s_d $ denotes the regularization parameter for layer $ d $ , and each $ \boldsymbol {\beta }^{(d)} $ represents the set of model-specific coefficient parameters at that layer, with its structure determined by the chosen measurement model (e.g., main-effect, all-effect, or DINA), as detailed in Section 2.2.

The nested summation over multiple layers of latent attributes in (10) renders direct optimization infeasible. While the classical EM algorithm offers a principled framework for estimating exploratory DeepCDMs, it is not without limitations. In practice, its effectiveness can be hindered by the model’s structural complexity and the high dimensionality of the parameter space. These challenges, stemming from the layered latent architecture and the combinatorial nature of $\mathbf {Q}$ -matrix estimation, can increase sensitivity to initialization and compromise scalability and stability. A detailed discussion of these shortcomings is provided in Section 3.4. These considerations motivate our development of the layer-wise EM algorithm, introduced in the following section.

In the remainder of this section, we first introduce the classical EM algorithm and briefly discuss its limitations in Section 3.1. The proposed layer-wise EM algorithm is presented in Section 3.2, followed by the initialization strategy in Section 3.3. Section 3.4 highlights the advantages of the layer-wise EM over the classical EM algorithm. Section 3.5 discusses how the layer-wise concept connects to broader principles and algorithms, including identifiability derivation and related methods for DGMs. Finally, Section 3.6 discusses the extension of the layer-wise EM to the confirmatory DeepCDM setting.

3.1 The EM algorithm

Let $\mathbf A_{1:N} = (\mathbf A^{(1)}_{1:N}, \ldots , \mathbf A^{(D)}_{1:N})$ denote the set of latent variables, i.e., the attribute profiles of the N students across D latent layers. The complete data log-likelihood is:

(12) $$ \begin{align} &l_c^{DeepCDM} (\boldsymbol{\beta}, \boldsymbol{\pi}^{(D)},\mathbf Q|\mathbf R_{1:N}, \mathbf A_{1:N})\nonumber\\ =& \log\left( \mathbb{P}(\mathbf R_{1:N}\mid \mathbf A^{(1)}_{1:N},\mathbf Q^{(1)},\boldsymbol{\beta}^{(1)}) \cdot \prod_{d=2}^{D} \mathbb{P}(\mathbf A^{(d-1)}_{1:N}\mid \mathbf A^{(d)}_{1:N},\mathbf Q^{(d)},\boldsymbol{\beta}^{(d)}) \cdot \mathbb{P}(\mathbf A^{(D)}_{1:N};\boldsymbol{\pi}^{(D)})\right). \end{align} $$

Let $\boldsymbol {\beta }^{(t-1)}=(\boldsymbol {\beta }^{(1,t-1)}, \ldots , \boldsymbol {\beta }^{(D,t-1)})$ , $\boldsymbol {\pi }^{(D,t-1)}$ , and $\mathbf Q^{(t-1)} = (\mathbf Q^{(1,t-1)}, \ldots , \mathbf Q^{(D,t-1)})$ denote the estimates of $\boldsymbol {\beta }$ , $\boldsymbol {\pi }^{(D)}$ , and $\mathbf Q$ obtained at iteration $t-1$ . In each iteration t of the EM algorithm, the following two steps are performed:

E-Step: Compute

(13) $$ \begin{align} \widetilde Q^{(t)}=\mathbb{E}_{(\mathbf A^{(1)}, \ldots, \mathbf A^{(D)})}[l_c^{DeepCDM} (\boldsymbol{\beta},\boldsymbol{\pi}^{(D)},\mathbf Q|\mathbf R_{1:N}, \mathbf A_{1:N})|\mathbf R_{1:N};\boldsymbol{\beta}^{(t-1)},\boldsymbol{\pi}^{(D,t-1)},\mathbf Q^{(t-1)}], \end{align} $$

where the conditional expectation is with respect to $\mathbb {P}(\mathbf A^{(1)}, \ldots , \mathbf A^{(D)}|\mathbf R_{1:N};\boldsymbol {\beta }^{(t-1)},\boldsymbol {\pi }^{(D,t-1)},\mathbf Q^{(t-1)})$ .

M-Step: Update

(14) $$ \begin{align} \left(\boldsymbol{\beta}^{(t)},\boldsymbol{\pi}^{(D,t)},\mathbf Q^{(t)}\right) &= \underset{\boldsymbol{\beta},\mathbf Q,{\boldsymbol{\pi}^{(D)}}} {\text{arg max}} \,\widetilde Q^{(t)}-N\cdot P_{\boldsymbol{s}}(\boldsymbol{\beta}), \end{align} $$

where $P_{\boldsymbol {s}}(\boldsymbol {\beta })$ is defined in Equation (11).

For each $d=1,2,\ldots,D$, define $\widetilde{\mathbf A}^{1:d}=(\mathbf A^{(1)},\ldots,\mathbf A^{(d)})$, which denotes the latent variables shallower than the $(d+1)$th latent layer. Define $\mathbf A^{(0)}=\mathbf{R}_{1:N}$ and $\mathbf A^{(D+1)}=\emptyset$. By conditional independence, the expectation computation in the E-step can be re-expressed as $\widetilde Q^{(t)}= \sum_{d=1}^{D+1} \widetilde{Q}^{(d,t)},$ with

(15) $$ \begin{align} \widetilde{Q}^{(d,t)}=&\mathbb E_{\widetilde{\mathbf A}^{1:d}} [\log{P(\mathbf A^{(d-1)}_{1:N}|\mathbf A^{(d)}_{1:N})}|\mathbf R_{1:N},\boldsymbol{\beta}^{(t-1)},\boldsymbol{\pi}_D^{(t-1)},\mathbf Q^{(t-1)}]\nonumber\\ &=\sum_{i=1}^N \mathbb E_{\widetilde{\mathbf A}^{1:d}} [\log{P(\mathbf A^{(d-1)}|\mathbf A^{(d)})}|\mathbf R_{i},\boldsymbol{\beta}^{(t-1)},\boldsymbol{\pi}_D^{(t-1)},\mathbf Q^{(t-1)}] \end{align} $$

That is, $\widetilde Q^{(t)}$ is decomposed as a summation over layers d and individuals i, where each term is the conditional expectation of $\log P(\mathbf A^{(d-1)} \mid \mathbf A^{(d)})$ with respect to the partial posterior distribution $P(\mathbf A^{(1)}, \ldots , \mathbf A^{(d)} \mid \mathbf R_i, \boldsymbol {\beta }^{(t-1)}, \boldsymbol {\pi }_D^{(t-1)}, \mathbf Q^{(t-1)})$ , which we denote by $\widetilde P^{1:d}_{i}$ for brevity.

Accordingly, the optimization in the M-step can be broken into the following parts:

(16) $$ \begin{align} (\boldsymbol{\beta}^{(d,t)},\mathbf Q^{(d,t)})&= \underset{\boldsymbol{\beta}^{(d)},\mathbf Q^{(d)}}{\text{arg max}} \,\widetilde{Q}^{(d,t)}-N\cdot P_{\boldsymbol{s}}(\boldsymbol{\beta}^{(d)}),\quad d=1,\ldots,D. \end{align} $$
(17) $$ \begin{align} \boldsymbol{\pi}^{(D,t)}&= \underset{\boldsymbol{\pi}^{(D)}}{\text{arg max}} \sum_{\boldsymbol \alpha^{(D)}}\sum_{i=1}^N \log(\mathbb{P}(\mathbf A^{(D)}=\boldsymbol \alpha^{(D)})) \times\nonumber\\ \sum_{(\boldsymbol \alpha^{(1)},\ldots,\boldsymbol \alpha^{(D-1)})} & P\left(\mathbf A_i^{(1)}=\boldsymbol \alpha^{(1)}, \mathbf A_i^{(2)}=\boldsymbol \alpha^{(2)},\ldots, \mathbf A_i^{(D)}=\boldsymbol \alpha^{(D)}|\mathbf R_{i};\boldsymbol{\beta}^{(t-1)},\boldsymbol{\pi}_D^{(t-1)},\mathbf Q^{(t-1)}\right). \end{align} $$

This decomposition enables the parameters at each layer to be updated via their corresponding optimization problems in (16) and (17), thereby improving the tractability of the M-step. Next, we further look into the $\widetilde {Q}^{(d,t)}$ functions. Recall that $\widetilde P^{1:d}_{i}$ is defined as

(18) $$ \begin{align} \widetilde P^{1:d}_{i} := P(\widetilde{\mathbf A}^{1:d} \mid \mathbf R_{i}, \boldsymbol{\beta}^{(t-1)}, \boldsymbol{\pi}_D^{(t-1)}, \mathbf Q^{(t-1)}) = \frac{P(\mathbf R_{i} \mid \mathbf A^{(1)}) P(\widetilde{\mathbf A}^{1:d})}{\sum_{\widetilde{\mathbf A}^{1:d}} P(\mathbf R_{i} \mid \mathbf A^{(1)}) P(\widetilde{\mathbf A}^{1:d})}, \end{align} $$

with

(19) $$ \begin{align} P(\widetilde{\mathbf A}^{1:d}) = \sum_{\boldsymbol \alpha^{(d+1)},\ldots,\boldsymbol \alpha^{(D)}} &P\left(\mathbf A^{(1)}, \ldots, \mathbf A^{(d)} \mid \mathbf A^{(d+1)} = \boldsymbol \alpha^{(d+1)}, \boldsymbol{\beta}^{(t-1)} \right) \times \nonumber \\ &P\left(\mathbf A^{(d+1)} = \boldsymbol \alpha^{(d+1)}, \ldots, \mathbf A^{(D)} = \boldsymbol \alpha^{(D)}; \boldsymbol{\beta}^{(t-1)} \right). \end{align} $$

As shown, given the parameters from the previous iteration $(t-1)$ , the distribution $P(\widetilde {\mathbf A}^{1:d})$ is computed by marginalizing out the deeper latent variables $\mathbf A^{(d+1)}, \dots , \mathbf A^{(D)}$ from the joint distribution over all latent variables. This marginal, together with the observed response $\mathbf R_i$ , allows us to evaluate the partial posterior distribution $\widetilde {P}^{1:d}_{i}$ via Equation (18). With $\widetilde {P}^{1:d}_{i}$ providing the weights for each possible configuration of $\widetilde {\mathbf A}^{1:d}$ , the conditional expectation in $\widetilde {Q}^{(d,t)}$ can be computed as a weighted sum over $\log P(\mathbf A^{(d-1)} \mid \mathbf A^{(d)})$ . Specifically, Equation (15) can be written out as

(20) $$ \begin{align} \widetilde{Q}^{(1,t)} = \sum_{i=1}^N \sum_{\mathbf{A}^{(1)}} \log P(\mathbf R_i|\mathbf{A}^{(1)}) \widetilde{P}^{1:1}_{i}; \end{align} $$

and

(21) $$ \begin{align} \widetilde{Q}^{(d,t)} = \sum_{i=1}^N \sum_{\widetilde{\mathbf A}^{1:d}} \log P(\mathbf{A}^{(d-1)}|\mathbf{A}^{(d)})\, \widetilde{P}^{1:d}_{i} \end{align} $$

for $d=2,\ldots ,D$ . It turns out that, for each d, Equation (16) defines a regularized optimization problem whose objective includes a weighted log-likelihood component over $(\mathbf {A}^{(d-1)}, \mathbf {A}^{(d)})$ pairs. In this formulation, $\mathbf {A}^{(d-1)}$ serves as the outcome, $\mathbf {A}^{(d)}$ as the predictor, and the data point weights are given by $\sum _{i=1}^N \sum _{\widetilde {\mathbf A}^{1:d}\backslash \left \{\mathbf {A}^{(d-1)},\mathbf {A}^{(d)}\right \}} \widetilde {P}^{1:d}_{i}$ .
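To make this weighted-regression formulation concrete, below is a minimal sketch of one M1-step update for a single node j in a main-effect layer, anticipating the logit link discussed next. It casts (16) as an $L_1$-penalized logistic regression over $(\mathbf A^{(d-1)}, \mathbf A^{(d)})$ configuration pairs weighted by the E-step posterior mass. The paper's implementation uses the cyclical coordinate descent of Friedman et al. (2010); the scikit-learn solver below is a stand-in, the mapping of the penalty $N s_d$ onto the C parameter is an assumption, and all variable names are illustrative.

```python
import numpy as np
from itertools import product
from sklearn.linear_model import LogisticRegression

def m1_step_node(j, K_prev, K_d, weights, s_d, N):
    """One penalized M1-step update for node j of layer d under a logit main-effect layer.

    weights[(a_prev, a_d)] is the expected count of the configuration pair
    (A^{(d-1)} = a_prev, A^{(d)} = a_d) accumulated from the E-step posteriors.
    """
    X, y, w = [], [], []
    for a_prev, a_d in product(product((0, 1), repeat=K_prev), product((0, 1), repeat=K_d)):
        weight = weights.get((a_prev, a_d), 0.0)
        if weight > 0:
            X.append(a_d)           # parent attributes act as predictors
            y.append(a_prev[j])     # the j-th child attribute is the binary outcome
            w.append(weight)
    # L1-penalized logistic regression; C is taken as the inverse of the total penalty N * s_d.
    clf = LogisticRegression(penalty="l1", solver="liblinear", C=1.0 / (N * s_d))
    clf.fit(np.array(X), np.array(y), sample_weight=np.array(w))
    beta_j = clf.coef_.ravel()      # sparse main-effect coefficients beta^{(d)}_{j,1..K_d}
    beta_j0 = clf.intercept_[0]     # intercept beta^{(d)}_{j,0}
    return beta_j0, beta_j
```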

We focus on the case where the link function $f(\cdot)$ is the inverse logit, as it is the most commonly used choice for CDMs with binary responses. In this setting, the estimation problem corresponds to a generalized linear optimization problem with a logit link. For other choices of $f(\cdot)$, the problem may fall into the broader categories of linear or generalized linear optimization, depending on the specific functional form. The solution of Equation (17) is, for all $\boldsymbol \alpha^{(D)}\in \{0,1\}^{K_D}$:

(22) $$ \begin{align} \boldsymbol{\pi}_{\boldsymbol \alpha^{(D)}}^{(D,t)}=\sum_{i=1}^N \sum_{\mathbf A^{(1)},\ldots, \mathbf A^{(D-1)}}\frac{P(\mathbf R_{i}|\mathbf A^{(1)})P(\mathbf A^{(1)},\ldots,\mathbf A^{(D-1)},\mathbf A^{(D)}=\boldsymbol \alpha^{(D)})}{N\cdot\sum_{\mathbf A^{(1)},\ldots, \mathbf A^{(D)}}P(\mathbf R_{i}|\mathbf A^{(1)})P(\mathbf A^{(1)},\ldots,\mathbf A^{(D-1)},\mathbf A^{(D)})}. \end{align} $$

These derivations demonstrate that, due to conditional independence, the EM algorithm for exploratory DeepCDMs is both succinct and transparent. However, its practical feasibility is challenged by several issues—particularly sensitivity to initialization and the accumulation of estimation bias across layers and iterations. To address these challenges, we propose a layer-wise EM algorithm in the next subsection.

3.2 A novel layer-wise EM algorithm

To elucidate the underlying rationale of the proposed algorithm, we first provide a detailed mathematical derivation of the layer-wise EM procedure. This step-by-step formulation highlights how the algorithm naturally arises from the hierarchical structure of DeepCDMs.

Suppose we have a set of parameters $ \boldsymbol {\beta }, \boldsymbol {\pi }^{(D)}, \mathbf Q $ that maximize the regularized marginal log-likelihood in Equation (10). Based on the generative formulation of DeepCDMs, we can marginalize out the deeper latent variables $ \mathbf A^{(2)}, \ldots , \mathbf A^{(D)} $ to derive the implied distribution of the bottom-layer attributes:

$$\begin{align*}\boldsymbol{\pi}^{(1)} = \mathbb{P}(\mathbf A^{(1)}; \boldsymbol{\pi}^{(1)}) = \sum_{\boldsymbol \alpha^{(2)}} \cdots \sum_{\boldsymbol \alpha^{(D)}} \Big[ \prod_{d=2}^{D} \mathbb{P}(\mathbf A^{(d-1)} \mid \mathbf A^{(d)}, \mathbf Q^{(d)}, \boldsymbol{\beta}^{(d)}) \cdot \mathbb{P}(\mathbf A^{(D)}; \boldsymbol{\pi}^{(D)}) \Big]. \end{align*}$$

This recursive marginalization induces a set of shallow-layer parameters $ (\boldsymbol {\beta }^{(1)}, \boldsymbol {\pi }^{(1)}, \mathbf Q^{(1)}) $ , which, when substituted into the original model, must also maximize a re-expressed form of the target function:

$$ \begin{align*} \ell(\boldsymbol{\beta}^{(1)}, \boldsymbol{\pi}^{(1)}, \mathbf Q \mid \mathbf R_{1:N}) = \sum_{i=1}^N \log \Bigg\{ \sum_{\boldsymbol \alpha^{(1)}} \Big[ \mathbb{P}(\mathbf R_i \mid \mathbf A^{(1)}, \mathbf Q^{(1)}, \boldsymbol{\beta}^{(1)}) \cdot \mathbb{P}(\mathbf A^{(1)}; \boldsymbol{\pi}^{(1)}) \Big] \Bigg\} - N \cdot P_{\boldsymbol{s}}(\boldsymbol{\beta}^{(1)}). \end{align*} $$

This observation forms the conceptual basis of our proposed layer-wise EM algorithm. Rather than estimating all parameters jointly over the entire deep latent architecture, we decompose the problem into a sequence of simpler subproblems, each involving a one-layer structure, and solve them in a bottom-up manner using EM. Although the resulting algorithm appears intuitive, it is grounded in a rigorous use of the model’s generative structure. Specifically, each layer-wise step leverages the most reliable information from its immediate lower layer—either in the form of estimated distributions or pseudo-observations—making the estimation process both computationally efficient and statistically reliable.

Focusing on the first layer ( $ d = 1 $ ), this decomposition implies that the estimates $ \boldsymbol {\beta }^{(1)} $ and $ \mathbf Q^{(1)} $ obtained by maximizing Equation (10) are identical to those obtained under a standard one-layer CDM. Once these first-layer parameters are estimated, the task reduces to estimating the parameters for the remaining $ D-1 $ latent layers. Unlike the first layer, however, there are no observed realizations of the latent variable $ \mathbf A^{(1)} $ . Let $ \mathbf A^{(1)}_i $ denote the $ i $ -th row of $ \mathbf A^{(1)}_{1:N} $ . A straightforward way to impute $ \mathbf A^{(1)}_i $ is via the maximum a posteriori (MAP) estimate: $ \widehat {\mathbf A}^{(1)}_i = \arg \max _{\boldsymbol \alpha ^{(1)} \in \{0,1\}^{K_1}} \mathbb {P}(\mathbf A^{(1)}_i = \boldsymbol \alpha ^{(1)} \mid \mathbf R_i, \boldsymbol {\beta }^{(1)}, \mathbf Q^{(1)}), $ which depends on both the likelihood $ \mathbb {P}(\mathbf R_i \mid \mathbf A^{(1)}_i, \boldsymbol {\beta }^{(1)}, \mathbf Q^{(1)}) $ and the prior $ \mathbb {P}(\mathbf A^{(1)}_i) $ . Alternatively, one can sample from the estimated marginal distribution $ \mathbb {P}(\mathbf A^{(1)}) $ to generate pseudo-observations for the next layer. Compared to MAP, this sampling-based approach introduces less bias and leads to more reliable parameter estimation in deeper layers, particularly during initialization. With these pseudo-observations in hand, the remaining $ D-1 $ layers form a structurally similar DeepCDM, and the same one-layer EM algorithm can be recursively applied to estimate $ \boldsymbol {\beta }^{(2)} $ and $ \mathbf Q^{(2)} $ , and so on, until the top layer is reached.
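The sketch below illustrates the two imputation options just described for the first latent layer: MAP assignment from the per-respondent posterior versus drawing pseudo-observations from the estimated marginal $\mathbb{P}(\mathbf A^{(1)})$. All inputs are assumed to come from an already-fitted bottom-layer model, and the names are illustrative.

```python
import numpy as np
from itertools import product

def impute_first_layer(post, pi1_hat, K1, rng=None, method="sample"):
    """Impute pseudo-observations of A^{(1)} for all N respondents.

    post    : array of shape (N, 2^{K1}); post[i, l] = P(A^{(1)}_i = pattern_l | R_i) under
              the fitted one-layer CDM.
    pi1_hat : estimated marginal probabilities of the 2^{K1} patterns (length 2^{K1}).
    method  : "map" picks the posterior mode; "sample" draws from the estimated marginal,
              which introduces less bias when estimating the deeper layers.
    """
    rng = np.random.default_rng() if rng is None else rng
    patterns = np.array(list(product((0, 1), repeat=K1)))
    if method == "map":
        return patterns[np.argmax(post, axis=1)]
    idx = rng.choice(len(patterns), size=post.shape[0], p=pi1_hat)
    return patterns[idx]
```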

The above derivation implies that the estimator obtained from the layer-wise procedure is statistically equivalent to the estimator obtained by jointly maximizing the full DeepCDM likelihood, and thus achieves the same asymptotic efficiency. This procedure offers practical computational advantages, as discussed in Section 3.4. Although the estimation at layer d uses only the latent variables from the immediately deeper layer $(d{+}1)$ , this does not discard the dependencies encoded in all deeper layers. Indeed, that information is summarized in the aggregated prior $\boldsymbol {\pi }^{(d+1)}$ , which captures the deeper dependency structure and allows the procedure at layer d to focus on recovering the relationship between $\mathbf A^{(d)}$ and $\mathbf A^{(d+1)}$ . By iterating this process upward through the hierarchy, the estimated parameters $(\boldsymbol {\beta }^{(d)}, \mathbf {Q}^{(d)}, \boldsymbol {\pi }^{(d)})$ across all layers collectively recover the full cross-layer dependence structure.

Algorithm 1 summarizes the proposed layer-wise EM algorithm. Proceeding from the bottom up, the algorithm estimates model parameters layer by layer. For each layer $d > 1$, Step 1 imputes the missing data $\mathbf A^{(d-1)}$ by drawing samples from the estimated marginal distribution $P(\mathbf A^{(d-1)})$, which is computed in Step 3 of the previous layer using Equation (23). In contrast, for the first latent layer ($d = 1$), Step 1 is not required because the observed response data $\mathbf R_{1:N}$ (i.e., $\mathbf A^{(0)}$) serve directly as the input. Using the initial values obtained in Step 2, Step 3 then applies a one-layer EM algorithm with $K_d$ attributes to estimate the parameters $\boldsymbol{\beta}^{(d)}$, $\mathbf Q^{(d)}$, and $\boldsymbol{\pi}^{(d)}$. The initialization procedure is introduced separately in Section 3.3. In Step 3, each M1-step is solved using the coordinate descent algorithm of Friedman et al. (2010), known for its flexibility and effectiveness in handling regularized optimization problems. The algorithm continues this layer-wise procedure until reaching the deepest layer $D$.

In the M2-step, the mechanism for identifying the entries $q_{jk}^{(d)}$ in $\mathbf Q^{(d)}$ depends on the measurement model utilized in layer d. For the main-effect model, $\mathbf Q^{(d)}$ can be recovered using the rule $q_{jk}^{(d)} = \boldsymbol{1}(\beta_{j,k}^{(d)} \neq 0)$, where $\boldsymbol{1}$ is the indicator function. For the all-effect model, theoretically, each row of $\mathbf Q^{(d)}$ should be identified by the highest-order non-zero coefficient. Specifically, if $\exists S \subseteq [K]$ such that $\beta_{j,S}^{(d)} \neq 0$ and $\beta_{j,S'}^{(d)} = 0$ for all $S' \subseteq [K], S \subset S'$, then $q_{jk}^{(d)} = 1$ for $k \in S$; otherwise, $q_{jk}^{(d)} = 0$. However, this strict rule may not always be applicable because some estimated $\beta$-coefficients may be close to zero but not exactly zero due to residual regularization noise. In practice, a more effective approach is either to choose the largest non-zero interaction coefficient or to truncate the coefficients before identifying $\mathbf Q^{(d)}$. For the latter approach, we recommend that practitioners set the truncation thresholds based on the general magnitude of their estimated coefficients. For the DINA model, since there should be only one non-zero coefficient for each item $j$, the largest non-zero interaction coefficient can be selected, and the corresponding entry $q_{jk}^{(d)}$ can be identified as equal to one. We note that seemingly poor $\mathbf Q$-matrix recovery may appear if such residual regularization noise is not properly addressed, as it can obscure the algorithm’s underlying recovery performance.
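A minimal sketch of these recovery rules follows, assuming the estimated coefficients are available per node; the truncation threshold is a user-chosen value, in line with the recommendation above, and the data structures are illustrative.

```python
import numpy as np

def recover_q_main_effect(beta, tol=1e-6):
    """Main-effect rule: q_{jk} = 1 iff the main-effect coefficient beta_{j,k} exceeds tol.

    beta : array of shape (K_{d-1}, K_d) of estimated main-effect coefficients.
    """
    return (np.abs(beta) > tol).astype(int)

def recover_q_row_dina(beta_j, K_d):
    """DINA rule for one node j: keep only the largest (in magnitude) interaction coefficient.

    beta_j : dict mapping attribute-index subsets (tuples) to estimated coefficients,
             excluding the intercept.
    """
    best = max(beta_j, key=lambda s: abs(beta_j[s]))
    q_row = np.zeros(K_d, dtype=int)
    q_row[list(best)] = 1
    return q_row

# Illustrative main-effect example with a truncation threshold of 0.1:
beta_hat = np.array([[1.2, 0.02, 0.0], [0.0, 0.9, 1.1]])
print(recover_q_main_effect(beta_hat, tol=0.1))  # [[1 0 0] [0 1 1]]
```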

3.3 Initialization algorithm

As shown in Algorithm 1, the initialization of our DeepCDM method is performed sequentially in D steps, where parameters from the d-th latent layer are initialized after estimating the $(d-1)$-th layer. This approach leverages the information from the estimated distribution $P(\mathbf{A}^{(d-1)})$, obtained during the fitting of the $(d-1)$-th layer, to provide better initial values for the d-th layer. Furthermore, by imputing realizations of $\mathbf{A}^{(d-1)}$ sampled from the estimated $P(\mathbf{A}^{(d-1)})$, the initialization of every layer reduces to the problem of initializing a one-layer CDM. Specifically, the response data $\mathbf{R}_{1:N}$ are used for the first layer ($d=1$), while realizations of $\mathbf{A}^{(d-1)}$ (i.e., the generated pseudo-samples) are used for subsequent layers ($d>1$). For exploratory DeepCDMs, good initial values should not only be close to the true values but also exhibit a sparse structure similar to the true one. To achieve this, we apply a USVT-based method (Chatterjee, 2015; Zhang et al., 2020) to estimate the design matrix, followed by a Varimax rotation to promote sparsity. USVT captures the dominant low-rank structure, while Varimax produces a sparse loading pattern that informs the initial $\mathbf{Q}$-matrix. This combination provides informative starting values tailored to the exploratory framework of DeepCDMs. The initialization procedure for each layer d is detailed in Algorithm 2.

Algorithm 2 applies SVD twice. The first application, combined with the inverse transformation (Steps 2–5), is used to denoise and linearize the data. The second application of SVD (Steps 6 and 7) is performed on the linearized data. Since SVD solutions are rotation-invariant while the true coefficient matrix $\boldsymbol {\beta }_1^{(d)}$ is sparse, we address this invariance by seeking a sparse representation of the denoised data that is consistent with the model's identifiability conditions and expect it to share the same sparsity pattern as $\boldsymbol {\beta }_1^{(d)}$ . We apply the Varimax rotation for this purpose in Step 8, as it induces sparse loadings and possesses well-established statistical inference properties (Rohe & Zeng, 2023). To further refine sparsity, a thresholding step is applied to eliminate near-zero loadings resulting from sampling variability (Step 9). Rather than using the conventional threshold $1/\sqrt {N}$ , we adopt a more relaxed threshold $1/(2\sqrt {K_{d-1}})$ , which is more adaptive to the multilayer structure of DeepCDMs: since pseudo-data in deeper layers tend to be noisier, a more relaxed threshold helps mitigate such effects. If the rotated Varimax loadings were used directly to estimate $\mathbf A^{(d)}$ and $\boldsymbol {\beta }_1^{(d)}$ , the estimates could differ from the true ones in scale, as Varimax does not constrain the scaling of rows or columns. For $\mathbf A^{(d)}$ , although this scaling issue remains, the binary structure can still be identified by noting that $A(i,k)=0$ if and only if $A_0(i,k)<0$ in Step 10. We then exploit the discreteness of the latent variables in $\mathbf A^{(d)}$ to rescale $\boldsymbol {\beta }_1^{(d)}$ in Step 11, yielding correctly scaled estimates. This initialization procedure is non-iterative, making it computationally efficient. Moreover, it avoids convergence issues and possesses favorable statistical consistency properties (Zhang et al., 2020).
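As a rough illustration of the USVT-plus-Varimax idea (not a faithful reproduction of Algorithm 2), the sketch below denoises the input matrix by singular value thresholding, linearizes it through the logistic link, extracts the leading components, applies Varimax, and thresholds the loadings at $1/(2\sqrt{K_{d-1}})$. The clipping constants, the specific link function, and the omission of the score-binarization and rescaling steps are simplifying assumptions.

```r
# Rough sketch of the USVT + Varimax initialization idea for one layer.
# A_prev: N x K_{d-1} matrix (responses R for d = 1, pseudo-samples for d > 1).
# K_d: number of attributes in the current layer; sv_threshold: USVT cutoff.
init_layer_sketch <- function(A_prev, K_d, sv_threshold) {
  N <- nrow(A_prev); K_prev <- ncol(A_prev)
  # (i) USVT denoising: keep singular values above the threshold.
  s <- svd(A_prev)
  keep <- which(s$d >= sv_threshold)
  P_hat <- s$u[, keep, drop = FALSE] %*%
    diag(s$d[keep], nrow = length(keep)) %*% t(s$v[, keep, drop = FALSE])
  # (ii) Linearize via the inverse logistic link (clipping range is an assumption).
  P_hat <- pmin(pmax(P_hat, 0.01), 0.99)
  M_hat <- qlogis(P_hat)
  # (iii) Second SVD on the linearized matrix; keep the leading K_d components,
  #       scaled by 1/sqrt(N), as candidate loadings.
  s2 <- svd(M_hat)
  L <- s2$v[, 1:K_d, drop = FALSE] %*% diag(s2$d[1:K_d], nrow = K_d) / sqrt(N)
  # (iv) Varimax rotation to induce sparsity, then the relaxed threshold 1/(2*sqrt(K_{d-1})).
  L_rot <- unclass(varimax(L)$loadings)
  L_rot[abs(L_rot) < 1 / (2 * sqrt(K_prev))] <- 0
  list(loadings = L_rot, Q_init = 1 * (L_rot != 0))
}
```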

Additionally, as pointed out by one reviewer, one can also directly apply the Varimax rotation to the matrix $ \frac {1}{\sqrt {N}} (\widehat {\tau }_1 \widehat {v}_1, \ldots , \widehat {\tau }_{K_{d-1}} \widehat {v}_{K_{d-1}}). $ A simulation study comparing the performance of these two approaches is reported in the Supplementary Material.

3.4 Advantages of layer-wise EM compared to EM

As discussed in Hinton et al. (2006), an effective way to learn complex models is by sequentially combining simpler models. The layer-wise EM algorithm applies this principle by decomposing the optimization into manageable subproblems, each targeting one layer at a time. This strategy helps overcome key challenges faced by the classical EM algorithm when applied to DeepCDMs.

One major challenge in applying the classical EM algorithm to DeepCDMs is the difficulty of initialization. The presence of multiple nonlinear latent layers creates a highly nonconvex optimization landscape, often with an exponential number of local optima. As a result, EM is sensitive to starting values—poor initialization can easily lead to convergence at suboptimal local maxima of the penalized log-likelihood function. In the classical EM algorithm, initial values for all parameters across all layers must be specified simultaneously. As the model depth increases, this becomes increasingly challenging, even for moderately deep architectures, due to compounded uncertainty and parameter interactions across layers. In contrast, our layer-wise EM algorithm addresses initialization sequentially by solving one-layer CDMs one at a time. At each stage, it initializes the parameters of the current layer using either observed responses or sampled latent attributes from the estimated marginal distribution informed by the shallower layers. Since these samples are based on estimated proportion parameters which are proved to be identifiable, the initialization for deeper layers is more stable and reliable.

Another major issue of the classical EM algorithm is the accumulation of estimation bias as the algorithm progresses through multiple layers. As the number of latent layers $ D $ increases, the computation of quantities like $ \sum _{i=1}^N \widetilde {P}^{1:d}_{i} $ for $ d = 1, \ldots , D $ becomes more error-prone. These quantities play a crucial role in the M-step, and inaccuracies in their estimation directly affect parameter updates. Furthermore, the iterative nature of the classical EM algorithm introduces a cyclic dependency across all layers, where errors at one layer can propagate forward and backward, reinforcing one another over iterations. This compounding effect often leads the algorithm to converge to suboptimal local maxima, even when reasonably good initial values are provided. In contrast, our proposed layer-wise EM algorithm mitigates this issue by breaking the dependency cycle. Parameters are estimated sequentially from the bottom layer up, so each layer $ d $ is only influenced by the estimates from shallower layers $ d' < d $ . Although some bias may still accumulate, it originates solely from estimation errors in previous layers—not from compounded initialization errors across all layers. This directional, non-cyclic structure significantly reduces the accumulation of error and enhances the robustness of the estimation process.

3.5 Connections to broader principles and algorithms

3.5.1 Connections between the layer-wise EM and identifiability

Interestingly, the derivation of the layer-wise EM algorithm aligns with and supports the technical insights in the identifiability proofs of DeepCDMs. The identifiability results in the Supplementary Material and in Gu (2024) demonstrate that the identifiability of a DeepCDM can be examined and established in a layer-by-layer manner, proceeding from the bottom up. This follows from the probabilistic formulation of the directed graphical model and the discrete nature of the latent layers. For each layer $ d \in \left \{0,1,\ldots , D-1\right \} $ , as long as $ \mathbf Q^{(d+1)} $ satisfies the corresponding identifiability conditions for the one-layer CDM implied by

(24) $$ \begin{align} \mathbb{P}(\mathbf A^{(d)}) = \sum_{\boldsymbol \alpha^{(d+1)} \in \{0,1\}^{K_{d+1}}} \mathbb{P}(\mathbf A^{(d)} \mid \mathbf A^{(d+1)} = \boldsymbol \alpha^{(d+1)}, \mathbf Q^{(d+1)}, \boldsymbol{\theta}^{(d+1)}) \cdot \mathbb{P}(\mathbf A^{(d+1)} = \boldsymbol \alpha^{(d+1)}), \end{align} $$

the parameter set $ (\boldsymbol {\beta }^{(d+1)}, \mathbf Q^{(d+1)}, \boldsymbol {\pi }^{(d+1)}) $ is identifiable. The identifiability of the marginal distribution of $ \mathbf A^{(d+1)} $ (i.e., $ \boldsymbol {\pi }^{(d+1)} $ ) allows it to be treated theoretically as if it were observed when examining the identifiability for $ \mathbf Q^{(d+2)} $ , $ \boldsymbol {\beta }^{(d+2)} $ , and the marginal distribution of $ \mathbf A^{(d+2)} $ . Starting from the observed data layer ( $ d = 0 $ ) and proceeding one layer at a time, the identifiability of DeepCDMs can thus be inductively established.

This layer-wise identifiability rationale supports the procedure of generating samples of $ \mathbf A^{(d+1)} $ when estimating parameters of the $ d $ -th layer, for each d. Specifically, the identifiability of the model at layer $ d $ , as shown in Equation (24), implies that the distribution $ \mathbb {P}(\mathbf A^{(d+1)}) $ , from which samples of $ \mathbf A^{(d+1)} $ are drawn, can be uniquely identified. This theoretical guarantee justifies treating the sampled $ \mathbf A^{(d+1)} $ as observed data in the EM algorithm. The same procedure is applied recursively to the remaining $ D-1 $ layers.

3.5.2 Connections to related algorithms for DGMs

The idea of training deep models in a layer-wise fashion has been widely explored across the deep generative modeling literature, and our layer-wise EM algorithm draws on this foundational principle. For instance, in DBNs, greedy layer-wise pretraining trains each layer locally as an RBM using contrastive divergence (Hinton et al., 2006). This approach improves optimization stability and scalability, particularly in deep models where global joint training is challenging. While DeepCDMs differ from DBNs in having a fully directed architecture and an emphasis on interpretability and identifiability, our method similarly decomposes model training into a sequence of tractable subproblems. DEFs (Ranganath et al., 2015) also adopt a hierarchical top-down structure and rely on recursive variational inference across layers. Although our inference uses EM rather than variational methods, both approaches share the advantage of progressing layer by layer, with each layer conditioned on information derived from the previous one. Our idea also resonates with the framework of modular Bayesian learning (Joshi et al., 2009; Segal et al., 2005), in which large models are decomposed into smaller, interpretable modules that can be estimated sequentially. By aligning with these broader principles, the proposed layer-wise EM adapts a widely used idea (local, modular, progressive learning) for the structured and interpretable setting of cognitive diagnosis, where both identifiability and individualized feedback are essential.

3.6 Extending to confirmatory DeepCDMs

The layer-wise EM algorithm can be easily modified to fit confirmatory DeepCDMs, which assume all $\mathbf {Q}$ -matrices are known. In this case, each M-step solves the following optimization:

(25) $$ \begin{align} \boldsymbol{\beta}^{(d,t)} &= \underset{\boldsymbol{\beta}^{(d)}}{\text{arg max}} \,\widetilde{Q}^{(d,t)}, \quad d=1,\ldots,D. \end{align} $$

Compared to Equation (16), the terms $N \cdot P_{\boldsymbol {s}}(\boldsymbol {\beta }^{(d)})$ are dropped, as all the $\mathbf {Q}$ -matrices are known and do not need to be estimated. This makes the confirmatory case simpler than the exploratory case we have focused on. The layer-wise algorithm in Algorithm 1 can be readily used for parameter estimation, incorporating the known $\mathbf {Q}$ -matrices during the implementation of coordinate descent in the M-step.
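As an illustration, the confirmatory M-step for a main-effect layer reduces, item by item, to an unpenalized weighted logistic regression restricted to the attributes with $q_{jk}^{(d)}=1$. The sketch below uses R's glm for this weighted regression in place of the coordinate descent routine of the main algorithm; the function and argument names (alpha_grid, post_wt, resp_expected) are hypothetical placeholders for the E-step quantities.

```r
# Simplified sketch of one confirmatory M-step update for item j in a main-effect layer.
# Q: known Q^(d); alpha_grid: 2^{K_d} x K_d matrix of attribute profiles;
# post_wt: expected number of respondents in each profile (from the E-step);
# resp_expected: expected proportion of correct responses per profile and item.
confirmatory_mstep_item <- function(j, Q, alpha_grid, post_wt, resp_expected) {
  active <- which(Q[j, ] == 1)                 # attributes measured by item j
  X <- alpha_grid[, active, drop = FALSE]
  # Weighted logistic regression; warnings about non-integer successes are expected
  # because the responses are expected proportions rather than raw counts.
  fit <- suppressWarnings(
    glm(resp_expected[, j] ~ X, family = binomial(), weights = post_wt)
  )
  coef(fit)                                    # intercept and active main effects
}
```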

The adaptation of the layer-wise EM algorithm for the confirmatory case is also a novel contribution, representing the first EM-type algorithm for confirmatory DeepCDMs. Unlike the Bayesian approach in Gu (2024), which fits models using an MCMC algorithm that can involve slower convergence and the need for prior specification, the layer-wise EM algorithm offers a computationally more efficient alternative by directly seeking the maximum likelihood estimator. We note that the layer-wise EM method could also be extended to incorporate prior distributions for computing maximum a posteriori estimates, should rich prior information be available, following the approach outlined in Gu (2024).

4 Simulation studies

In this section, we conduct simulation studies to evaluate the performance of the proposed layer-wise EM algorithm for exploratory DeepCDMs. Specifically, we consider a three-layer DeepCDM ( $D=3$ ) with the configuration $(J, K_1, K_2, K_3) = (30, 8, 4, 2)$ , which represents a challenging scenario due to the model depth and the need to estimate all three unknown $\mathbf {Q}$ -matrices across different layers. Three different measurement models are considered for the shallowest layer ( $d=1$ ): the main-effect model, the all-effect model, and the DINA model. The more parsimonious main-effect model is used to model the two deeper layers ( $d=2,3$ ). We denote these three simulation cases as the Main-effect case, All-effect case, and DINA case, respectively. Under each case, three sample sizes— $N = 1,000$ , $1,500$ , and $2,000$ —are considered. The true $\mathbf {Q}$ -matrices are specified in Equation (26) and satisfy the minimal identifiability conditions required for DeepCDMs:

(26)

The USVT-based estimator described in Algorithm 2 is used for initialization. The coordinate descent algorithm is implemented using the R package glmnet (Hastie et al., 2021) for optimization. The true coefficient values of $\boldsymbol {\beta }^{(1)}$ , $\boldsymbol {\beta }^{(2)}$ , and $\boldsymbol {\beta }^{(3)}$ are presented in the following sections for each simulation case, with the entries of $\boldsymbol {\beta }^{(1)}$ set to larger magnitudes than those of the two deeper layers, $\boldsymbol {\beta }^{(2)}$ and $\boldsymbol {\beta }^{(3)}$ . In CDMs, larger coefficient values indicate a stronger relationship between latent attributes and responses. Therefore, by specifying smaller coefficients for the deeper layers, we introduce more randomness into this relationship. This design reflects the increased uncertainty and complexity involved in mastering higher-level cognitive attributes in psychological or educational assessments. In this simulation, the convergence thresholds are set as $(\epsilon _1, \epsilon _2, \epsilon _3) = (4^{-2}, 10^{-3}, 10^{-3})$ . A more relaxed threshold is adopted for the first layer to accommodate its relatively larger number of parameters, as this layer involves both a larger $\mathbf {Q}$ -matrix ( $\mathbf {Q}^{(1)}_{30 \times 8}$ ) and larger coefficient magnitudes in $\boldsymbol {\beta }^{(1)}$ compared to the deeper layers. The specification of $\epsilon _1$ follows the same order of magnitude as the threshold of 0.06 used for sparse exploratory MIRT models in Li et al. (2025). The specification of these thresholds is guided by the model scale and the numerical characteristics observed at each layer during estimation, with the aim of balancing estimation precision and computational efficiency. In practice, practitioners may adjust these thresholds according to the scale of $\mathbf {Q}$ and the magnitude of estimated parameters, or employ smaller values for stricter convergence when computational resources permit. For each simulation case, 100 independent replications are conducted. In each replication, the layer-wise EM algorithm is applied to fit the model across a range of regularization parameters, and the value that yields the smallest BIC is selected for the final model fit. The specific regularization sequences are provided in the Supplementary Material. Root mean-squared errors (RMSE) and absolute biases (aBias) are computed to assess estimation accuracy.
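The tuning loop used in these simulations can be summarized by the sketch below, which refits the model over a grid of $L_1$ penalties and retains the fit with the smallest BIC. Here fit_layerwise_em() is a hypothetical placeholder for the full Algorithm 1, and the BIC shown is the usual penalized log-likelihood criterion counting only non-zero parameters.

```r
# Sketch of BIC-based selection of the regularization parameter (placeholder API).
select_by_bic <- function(R, lambda_grid, K = c(8, 4, 2)) {
  fits <- lapply(lambda_grid, function(lam) fit_layerwise_em(R, K = K, lambda = lam))
  bics <- sapply(fits, function(fit) {
    # BIC = -2 * maximized log-likelihood + (# non-zero parameters) * log(N)
    -2 * fit$loglik + fit$n_nonzero * log(nrow(R))
  })
  fits[[which.min(bics)]]
}
```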

Note that the true parameter values differ in magnitude across the three layers, making the RMSE and aBias values not directly comparable. To address this, we report RMSE and aBias for two additional metrics that evaluate the performance of the layer-wise EM algorithm from different perspectives and provide indices that are comparable across layers. The first index is the latent class proportion distribution $\boldsymbol {\pi }^{(d)}$ , whose parameter space is defined as $ \Delta ^{2^{K_d}-1} = \left \{ \pi ^{(d)}_{\boldsymbol \alpha _{\ell }} : \sum _{\ell =1}^{2^{K_d}} \pi ^{(d)}_{\boldsymbol \alpha _{\ell }} = 1,\, \pi ^{(d)}_{\boldsymbol \alpha _{\ell }}> 0 \right \}, $ where $\pi ^{(d)}_{\boldsymbol \alpha _{\ell }} = P(\mathbf A^{(d)} = \boldsymbol \alpha _{\ell })$ and $\boldsymbol \alpha _{\ell } \in \{0,1\}^{K_d}$ for each layer $d = 1, \ldots , D$ . This metric assesses how accurately the model recovers the true distribution of latent attributes at each layer. The second index is the correct response probability for each layer d, denoted as $P_{CR}^{(d)}$ , and defined by Equations (7)–(9) according to the measurement model employed. This metric evaluates how well the model predicts correct responses at the population level for each layer. In the case of a single latent layer, this probability is commonly represented in the literature as $\theta _{j,\boldsymbol {\alpha }} = P(R_j = 1 \mid \mathbf A = \boldsymbol {\alpha })$ .
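For reference, the sketch below shows one common way such summaries can be computed for the latent class proportions $\boldsymbol{\pi}^{(d)}$ across replications; the exact definitions of RMSE and aBias underlying the tables may differ, so this is only an illustrative convention.

```r
# Illustrative accuracy summaries for pi^(d) over replications.
# pi_hat: replications x 2^{K_d} matrix of estimates; pi_true: true proportion vector.
summarize_pi <- function(pi_hat, pi_true) {
  rmse  <- sqrt(mean(sweep(pi_hat, 2, pi_true)^2))    # root mean-squared error
  abias <- mean(abs(colMeans(pi_hat) - pi_true))      # mean absolute bias
  c(RMSE = rmse, aBias = abias)
}
```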

4.1 Simulation studies for main-effect DeepCDMs

Throughout this section, let $k_{d-1}$ and $k_d$ be the indices for the $(d{-}1)$ -th and d-th layers, respectively. For the main-effect models, the coefficients $\beta ^{(d)}_{k_{d-1},k_{d}}$ are specified as

(27) $$ \begin{align}\beta^{(d)}_{k_{d-1},0} = c^{(d)}_0, \quad \beta^{(d)}_{k_{d-1},k_{d}} = \frac{c^{(d)}_1}{\sum_{k_{d}=1}^{K_d} q_{k_{d-1},k_{d}}^{(d)}}, \quad \forall\, k_{d} \in [K_{d}],\ d=1,2,3, \end{align} $$

where $K_0 = J$ , and the constants $(c^{(d)}_0, c^{(d)}_1)$ are set to $(6, -3)$ , $(3, -1.5)$ , and $(3, -1.5)$ for $d=1, 2, 3$ , respectively. The RMSE and aBias are reported in Table 1. All indices decrease as the sample size increases, indicating improved estimation accuracy. For each sample size, the estimation accuracy is lower in the deeper layers than in the shallower layers, as reflected by the values of $\boldsymbol {\pi }^{(d)}$ and $P_{CR}^{(d)}$ . This result is intuitively reasonable, as estimating deeper layers is fundamentally more challenging due to stochastic latent layers. In addition, all RMSE and aBias values remain reasonably small, providing empirical support for the identifiability of the model.
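To make Equation (27) concrete, the following sketch builds the true main-effect coefficients from a given $\mathbf Q$-matrix; the function name is illustrative, and Q1 stands for the first-layer $\mathbf Q$-matrix in Equation (26).

```r
# Construct the true main-effect coefficients of Equation (27) from a Q-matrix:
# intercept c0 for every unit, and slope c1 split evenly over its required attributes.
make_main_effect_beta <- function(Q, c0, c1) {
  slopes <- (c1 / rowSums(Q)) * Q   # beta_{j,k} = c1 / #(required attributes) if q_jk = 1
  cbind(intercept = c0, slopes)
}

# For the first layer, (c0, c1) = (6, -3):
# beta1 <- make_main_effect_beta(Q1, c0 = 6, c1 = -3)
```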

Table 1 RMSE and aBias for the main-effect DeepCDM

To examine the recovery of the $\mathbf {Q}$ -matrices, we report the proportion of correctly estimated rows and entries in each $\mathbf {Q}^{(d)}$ for $d = 1, 2, 3$ , as shown in Table 2. Note that these indices are not directly comparable across layers at each sample size, as the proportions depend on both the size of the $\mathbf {Q}$ -matrix and the difficulty of parameter estimation. Due to the shrinkage ladder structure, which is supported by the identifiability conditions, the shallower layers contain larger $\mathbf {Q}$ -matrices, making their structures more challenging to recover. However, these layers also benefit from more informative signals, as they are closer to the observed data layer. In contrast, the deeper layers involve smaller $\mathbf {Q}$ -matrices that are structurally easier to recover, but they suffer from less informative signals due to the greater amount of uncertainty introduced at deeper levels. Despite these differences, when comparing $\mathbf {Q}$ -matrix recovery across sample sizes, it is evident that for each layer d, the estimation accuracy of $\mathbf {Q}^{(d)}$ improves as the sample size increases.

Table 2 Proportion of correctly recovered rows ( $P_{\text {Row-wise}}$ ) and entries ( $P_{\text {Entry-wise}}$ ) for the main-effect DeepCDM

4.2 Simulation studies for all-effect DeepCDMs

Denote the true coefficients by $\beta ^{(d)}_{k_{d-1},S_d}$ for all nonempty $S_d \subseteq [K_d]$ , with intercepts $\beta ^{(d)}_{k_{d-1},0} = c^{(d)}_0$ . In the all-effect case, let $\mathcal {K}_{k_{d-1}} = \left \{k_{d} \in [K_{d}] :\, q_{k_{d-1},k_{d}}^{(d)} = 1\right \}$ and $K_0 = J$ . The first layer ( $d=1$ ) is modeled using the all-effect model with the parameters given below:

$$\begin{align*}\beta^{(d)}_{k_{d-1},S_d}=\begin{cases} \dfrac{c_1^{(d)}}{2^{|\mathcal{K}_{k_{d-1}}|}}, & \text{if } \prod_{l \in S_d} q_{k_{d-1},l}^{(d)} = 1, \\[4pt] 0, & \text{if } \prod_{l \in S_d} q_{k_{d-1},l}^{(d)} = 0. \end{cases}\end{align*}$$

The two deeper layers ( $d=2,3$ ) are then modeled by the main-effect model, with parameters specified according to Equation (27). The constants $(c^{(d)}_0, c^{(d)}_1)$ are specified as $(c^{(1)}_0, c^{(1)}_1) = (6, -3)$ , $(c^{(2)}_0, c^{(2)}_1) = (3, -1.5)$ , and $(c^{(3)}_0, c^{(3)}_1) = (3, -1.5)$ . The $\mathbf Q^{(1)}$ -matrix is recovered by identifying, for each item, the highest-order nonzero interaction coefficient.
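The all-effect specification above can be generated, for a single unit of the $(d{-}1)$-th layer, as in the sketch below; indexing the interaction coefficients by comma-separated attribute labels is an arbitrary labeling convention.

```r
# All-effect coefficients for one unit: every interaction S contained in the unit's
# required attribute set receives c1 / 2^{|K_{k_{d-1}}|}; all other interactions are zero.
make_all_effect_beta <- function(q_row, c1) {
  K <- length(q_row)
  req <- which(q_row == 1)                                 # required attributes
  subsets <- unlist(lapply(1:K, function(m) combn(K, m, simplify = FALSE)),
                    recursive = FALSE)
  vals <- sapply(subsets, function(S) {
    if (all(S %in% req)) c1 / 2^length(req) else 0         # nonzero only if S lies in req
  })
  names(vals) <- sapply(subsets, paste, collapse = ",")
  vals
}

# Example: a unit requiring attributes 1 and 3 out of K_d = 3, with c1 = -3:
# make_all_effect_beta(c(1, 0, 1), c1 = -3)
```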

The RMSE and aBias values are summarized in Table 3. As expected, all indices decrease with increasing sample sizes, indicating improved estimation accuracy. RMSE and aBias values remain reasonably small across all layers, providing empirical support for the identifiability of the model. To evaluate the recovery of the $\mathbf {Q}$ -matrices, we report the proportions of correctly estimated rows and entries in each $\mathbf {Q}^{(d)}$ for $d = 1, 2, 3$ , as shown in Table 4. As in previous cases, these proportions are not directly comparable across layers, as they are influenced by differences in $\mathbf {Q}$ -matrix size and estimation difficulty. Nevertheless, for each layer d, the estimation accuracy of $\mathbf {Q}^{(d)}$ improves as the sample size increases.

Table 3 RMSE and aBias for the all-effect DeepCDM

Table 4 Proportion of correctly recovered rows ( $P_{\text {Row-wise}}$ ) and entries ( $P_{\text {Entry-wise}}$ ) for the all-effect DeepCDM

4.3 Simulation studies for DINA DeepCDMs

Denote the true coefficients by $\beta ^{(d)}_{k_{d-1},S_d}$ for all nonempty $S_d \subseteq [K_d]$ , with $\beta ^{(d)}_{k_{d-1},0} = c^{(d)}_0$ . In the DINA DeepCDM, the first layer ( $d=1$ ) is modeled using the DINA formulation, where $\beta ^{(d)}_{k_{d-1},S_d} = c_1^{(d)}$ if $S_d = \mathcal {K}_{k_{d-1}}$ and zero otherwise, with $\mathcal {K}_{k_{d-1}}$ defined in Section 4.2. The DINA model can be viewed as a special case of the all-effect model in which, for each unit of the $(d{-}1)$ -th layer, a non-zero coefficient is assigned only to the interaction of all attributes that the unit requires. This formulation allows the same estimation framework used for the all-effect model to be applied to the DINA model. The two deeper layers ( $d = 2, 3$ ) are modeled using the main-effect specification, with parameters defined as in Equation (27). The constants $(c^{(d)}_0, c^{(d)}_1)$ are specified as $(6, -3)$ , $(3, -1.5)$ , and $(3, -1.5)$ for $d = 1, 2, 3$ , respectively.
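Viewing DINA as the special case just described, its true coefficients can be obtained by zeroing all interactions except the full required set. The sketch below reuses the hypothetical make_all_effect_beta helper from the Section 4.2 sketch.

```r
# DINA coefficients for one unit: only the interaction of all required attributes is nonzero.
make_dina_beta <- function(q_row, c1) {
  vals <- make_all_effect_beta(q_row, c1 = 0)          # all-zero interaction pattern
  full <- paste(which(q_row == 1), collapse = ",")     # label of the full required set
  vals[full] <- c1
  vals
}
```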

The RMSE and aBias results are reported in Table 5. Again, these values exhibit a clear decreasing trend with increasing sample size, indicating improved estimation accuracy. To assess the recovery of the $\mathbf {Q}$ -matrices, Table 6 presents the proportions of correctly estimated rows and entries for each $\mathbf {Q}^{(d)}$ , $d = 1, 2, 3$ . Overall, for all layers, the estimation accuracy of $\mathbf {Q}^{(d)}$ improves as the sample size increases.

Table 5 RMSE and aBias for the DINA DeepCDM

Table 6 Proportion of correctly recovered rows ( $P_{\text {Row-wise}}$ ) and entries ( $P_{\text {Entry-wise}}$ ) for the DINA DeepCDM

5 Real data analysis

To demonstrate the applicability of the exploratory DeepCDM, we analyze student response data from the TIMSS 2019 assessment using a two-layer DeepCDM. Specifically, we examine responses from $N = 1,595$ eighth-grade students in the United Arab Emirates who completed Booklet No. 1, which includes both mathematics and science items. The dataset comprises responses to $J = 54$ items. Responses were preprocessed into binary indicators of correctness: multiple-choice responses were coded as 1 if correct and 0 otherwise; constructed responses were coded as 1 only if they received the maximum score, and 0 otherwise. According to the TIMSS 2019 Item Information - Grade 8, items are classified into two primary domains: mathematics (items 1–28) and science (items 29–54). Each domain further includes four subdomains: mathematics encompasses Number, Algebra, Geometry, and Data & Probability; science includes Biology, Chemistry, Physics, and Earth Science. This hierarchical structure aligns naturally with a two-layer DeepCDM, where the first layer consists of $K_1=8$ subdomain attributes and the second layer consists of $K_2=2$ main domain attributes. The item-subdomain-domain assignment structure specified in the TIMSS documentation naturally gives rise to a set of provisional $ \mathbf {Q} $ -matrices, which are presented in the Supplementary Material. They indeed satisfy the strict identifiability conditions for general DeepCDMs.

To better understand the internal structure of this assessment, we applied a two-layer exploratory DeepCDM to fit the data, using convergence thresholds $(\epsilon _1, \epsilon _2) = (4^{-2}, 10^{-3})$ . Given that our primary interest lies in uncovering the hierarchical structure of attributes rather than modeling complex attribute interactions, we selected the main-effect model as our measurement framework. Setting the numbers of attributes to $(K_1,K_2)=(8,2)$ , matching those specified in the provisional $\mathbf {Q}$ -matrices, serves two purposes. First, it allows the exploratory approach to empirically verify the provisional attribute–item mappings, providing evidence for the validity of the test's original design. Second, the exploratory model maintains sufficient flexibility to identify alternative attribute–item structures, potentially revealing subtle item–attribute relationships not fully anticipated during initial test construction. This dual functionality offers valuable insights for both test validation and future item development.

The attributes in the first layer are numerically labeled from 1 to 8, whereas those in the second layer are labeled as A and B. After estimating the coefficient matrices, we reordered the first-layer attributes to better visualize the underlying block structures. Figure 2 presents heatmaps of the estimated parameters for both layers, from which a distinct structure emerges, aligning closely with the intended test design. Specifically, Figure 2 (left) reveals two clearly defined blocks of non-zero coefficients: the first block associates items 1–28 with Attributes 1, 4, 5, and 7, and the second block links items 29–54 to Attributes 2, 3, 6, and 8. This block structure mirrors TIMSS’s explicit distinction between mathematics and science items. Based on this clear division, we infer that Attributes 1, 4, 5, and 7 represent subdomains belonging to a common domain, while Attributes 2, 3, 6, and 8 form subdomains within another domain. This hierarchical interpretation is further supported by the second-layer heatmap shown in Figure 2 (right), which exhibits sparsity: Attributes 1, 4, 5, and 7 exclusively load onto Meta-Attribute B; Attributes 3, 6, and 8 exclusively load onto Meta-Attribute A; and Attribute 2 uniquely cross-loads onto both Meta-Attributes A and B. This inferred hierarchical structure closely matches the provisional second-layer matrix $\mathbf {Q}^{(2)}$ , except for the cross-loading behavior of Attribute 2. Given the item content and structure of $\mathbf {Q}^{(2)}$ , we conclude that Meta-Attribute A corresponds to the science domain, and Meta-Attribute B to the mathematics domain. The cross-loading of Attribute 2 suggests that this subdomain, despite being associated with science, may also involve mathematical competence during the reasoning process.

Figure 2 Heatmaps of estimated coefficients from exploratory DeepCDM: First layer (left) and second layer (right).

Although the exploratory results align closely with the provisional test design, the estimated first-layer attribute structure exhibits some deviations from the provisional $\mathbf {Q}^{(1)}$ -matrix. Before examining these deviations in detail, we first evaluate the plausibility of our exploratory findings by assessing model fit. Specifically, we compare the exploratory DeepCDM against a confirmatory DeepCDM that directly employs the provisional $\mathbf {Q}$ -matrices. These two approaches represent fully data-driven and strictly design-driven modeling, respectively. Consistent with the procedure outlined in our simulation study, model fit is quantified using the BIC. The exploratory DeepCDM achieves a BIC of 89,656, markedly lower than the confirmatory model’s BIC of 94,946. This improvement suggests that exploratory modeling can be beneficial in uncovering item–attribute relationships not fully captured by the original test design.

Next, we investigate the potential item–attribute relationships. Although the secure item content is not publicly available, TIMSS provides detailed metadata for each item, including descriptive labels and topic areas. The item labels offer concise summaries of each item’s content focus, while the topic areas reflect broader curricular domains defined by the TIMSS content framework. These metadata serve as valuable proxies, enabling us to infer the cognitive processes and skills required to answer each item. The complete metadata are provided in the Supplementary Material.

To facilitate interpretation, we focus on the five highest-loading items for each attribute, balancing representativeness and clarity. Each item is assigned to only one attribute group: specifically, the one for which it has the highest loading value among all attributes, ensuring that item groupings are mutually exclusive and reflect their most salient associations. For each attribute group, we carefully review the content of its assigned items, identify shared cognitive processes, and distill the latent ability the attribute is likely to capture. The distilled attribute names, along with their representative item groups and corresponding cognitive processes, are presented in Table 7.
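The grouping rule just described can be sketched as follows; here B denotes the estimated $J \times K_1$ first-layer coefficient matrix (intercepts excluded), and the use of absolute loadings is an assumption.

```r
# Assign each item to its highest-|loading| attribute, then keep the top items per attribute.
top_items_per_attribute <- function(B, n_top = 5) {
  assigned <- apply(abs(B), 1, which.max)        # each item's most salient attribute
  lapply(seq_len(ncol(B)), function(k) {
    items <- which(assigned == k)
    items[order(abs(B[items, k]), decreasing = TRUE)][seq_len(min(n_top, length(items)))]
  })
}
```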

Table 7 Summary of extracted attributes, representative items, and cognitive processes

To clarify and illustrate our interpretive process, we present two representative examples: Attribute 1 from the mathematics domain and Attribute 2 from the science domain. The detailed interpretation for all eight attributes is provided in the Supplementary Material. Attribute 1 is primarily associated with Items 5, 18, 19, 20, and 21. Based on TIMSS metadata, these items appear to involve tasks such as expressing the area of a rectangle algebraically, evaluating expressions by substituting values, identifying equivalent algebraic expressions, deriving a formula for stopping distance, and solving for an unknown variable given the perimeter of a triangle. Although these items vary in surface content, they seem to share a common cognitive emphasis on algebraic manipulation and symbolic reasoning. This pattern suggests procedural fluency in algebra, which includes mastering algebraic structures, applying operations accurately, and recognizing equivalent mathematical forms. Accordingly, we interpret Attribute 1 as Algebraic Fluency, reflecting the ability to manipulate algebraic expressions and apply fundamental algebraic procedures.

Attribute 2 is primarily associated with Items 48, 49, 50, 51, and 53. According to TIMSS metadata, these items are likely to involve tasks such as explaining the behavior of gas molecules in an expanding balloon, evaluating appropriate conditions in a heat conduction experiment, reasoning about the effects of planetary gravity on vehicle weight, predicting the behavior of sound in a vacuum, and interpreting evidence related to global warming. While these items span different scientific topics, they appear to share a cognitive focus on reasoning through empirical or hypothetical scenarios, interpreting observations, and evaluating experimental setups. Based on this pattern, we interpret Attribute 2 as Scientific Reasoning in Physical Contexts, reflecting systematic reasoning about physical phenomena, empirical data, and conditions relevant to scientific inquiry. As scientific reasoning often draws on mathematical competence, this also supports the observed cross-loading of Attribute 2 onto both science and mathematics domains.

It is important to emphasize that the attribute structure identified through our analysis does not represent the only possible or ideal solution, as the interpretation relies on available metadata rather than direct access to detailed item content. Nevertheless, this empirical analysis illustrates how test data can be examined using an exploratory approach and demonstrates how the resulting attribute structure can be interpreted using accessible metadata. This reflects a common practical scenario, where exploratory results may not fully align with the original test design and detailed item content may be unavailable. By analyzing the derived results through metadata or with expert input, practitioners may uncover findings that offer new perspectives or supplementary insights into the test design.

6 Discussion

This article builds a conceptual and methodological bridge between deep generative modeling and cognitive diagnosis. By significantly generalizing the DeepCDMs proposed by Gu (2024) to challenging exploratory settings, we introduce a new class of models, exploratory DeepCDMs, that retain the expressive capacity of DGMs while incorporating the structural constraints, interpretability, and identifiability essential for diagnostic assessment. To enable estimation in this more complex, multi-layer setting with multiple unknown $\mathbf {Q}$ -matrices, we propose a novel layer-wise EM algorithm for regularized maximum likelihood estimation. This algorithm advances the literature by offering a principled, modular framework for learning complex hierarchical latent structures in diagnostic models. Both the algorithm derivation and the identifiability theory of DeepCDMs support a bottom-up, layer-by-layer estimation strategy, making the procedure not only efficient but also theoretically grounded.

A promising direction for future work is to relax the assumption of known latent dimensions across layers. One approach is to incorporate layer-wise dimension selection into the USVT-based initialization, using the largest spectral ratio of singular values, as proposed by Lee & Gu (2025). This procedure can be applied recursively, using estimated or sampled latent attributes from one layer as input to the next, enabling automatic, data-driven dimension selection. Alternative methods, such as the extended BIC (EBIC; Chen & Chen, 2008) and the method of sieves (Shen & Wong, 1994), may also support layer-wise model selection. Another important direction is to determine the number of layers in practice. This may be explored using classical model selection criteria such as BIC. In addition, theoretical guidance may be obtained from the latent structure implied by the model formulation and identifiability conditions. In particular, if the deepest layer is assumed to consist of independent latent variables, one may apply the layer-wise EM procedure upward from the bottom layer and assess whether the estimated latent variables at layer d are approximately independent, provided that layer d is identifiable. Once a layer is reached at which the estimated latent variables exhibit independence, this layer can be regarded as the deepest layer of the model. This provides a principled, data-driven stopping criterion for determining the appropriate number of layers.
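As a simple illustration of the spectral-ratio heuristic mentioned above (the procedure in Lee & Gu, 2025 may differ in its details), one could select the latent dimension at the largest ratio of consecutive singular values:

```r
# Basic spectral-ratio heuristic for choosing the number of attributes at one layer.
# A: input matrix for that layer; K_max: largest candidate dimension (assumed < min(dim(A))).
select_dim_by_ratio <- function(A, K_max = 10) {
  d <- svd(A)$d[seq_len(K_max + 1)]
  which.max(d[1:K_max] / d[2:(K_max + 1)])
}
```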

It would also be valuable to extend the model to accommodate polytomous responses and attributes (Chen & de la Torre, 2013; Gao et al., 2021). A similar layer-wise EM algorithm could be developed by applying a one-layer EM procedure for polytomous data at each layer, with corresponding identifiability conditions established for the between-layer $\mathbf {Q}$ -matrices. Furthermore, continuous latent variables may be introduced at deeper layers to represent broader cognitive abilities. Implementing such an extension would require modifying the current DeepCDM framework to address the computation and identifiability of continuous latent variables, along with the incorporation of continuous inputs for deeper layers. This can be achieved by integrating the methodology of Liu et al. (2025), which proposes a general diagnostic model with higher-order continuous latent structures, along with corresponding identifiability results and an EM-type estimation algorithm. Another useful extension is to develop a stochastic version of the layer-wise EM algorithm. Such variants may offer computational advantages for large-scale, high-dimensional data and serve as a flexible alternative when full E-step computations are costly. More broadly, this work is motivated by the goal of integrating DGMs, and machine learning methods more generally, into cognitive diagnostic modeling. The simulation and empirical results suggest that this integration is a fruitful direction. Moving forward, exploring identifiability in existing DGMs and adapting their algorithms to promote sparsity could benefit not only psychometrics but also other domains where interpretability is crucial.

Supplementary material

The supplementary material for this article can be found at https://doi.org/10.1017/psy.2025.10065.

Funding statement

This research is partially supported by the NSF grant DMS-2210796.

Competing interests

The authors declare none.

References

Bradshaw, L., & Templin, J. (2014). Combining item response theory and diagnostic classification models: A psychometric model for scaling ability and diagnosing misconceptions. Psychometrika, 79(3), 403–425.
Chatterjee, S. (2015). Matrix estimation by universal singular value thresholding. The Annals of Statistics, 43(1), 177–214.
Chen, J., & Chen, Z. (2008). Extended Bayesian information criteria for model selection with large model spaces. Biometrika, 95(3), 759–771.
Chen, J., & de la Torre, J. (2013). A general cognitive diagnosis model for expert-defined polytomous attributes. Applied Psychological Measurement, 37(6), 419–437.
Chen, Y., Liu, J., Xu, G., & Ying, Z. (2015). Statistical analysis of Q-matrix based diagnostic classification models. Journal of the American Statistical Association, 110(510), 850–866.
de la Torre, J. (2011). The generalized DINA model framework. Psychometrika, 76, 179–199.
de la Torre, J., & Douglas, J. A. (2004). Higher-order latent trait models for cognitive diagnosis. Psychometrika, 69(3), 333–353.
de la Torre, J., & Song, H. (2009). Simultaneous estimation of overall and domain abilities: A higher-order IRT model approach. Applied Psychological Measurement, 33(8), 620–639.
Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society: Series B (Methodological), 39(1), 1–38.
Friedman, J., Hastie, T., & Tibshirani, R. (2010). Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33(1), 1–22.
Gao, X., Ma, W., Wang, D., Cai, Y., & Tu, D. (2021). A class of cognitive diagnosis models for polytomous data. Journal of Educational and Behavioral Statistics, 46(3), 297–322.
Gierl, M. J., Leighton, J. P., & Hunka, S. M. (2007). Using the attribute hierarchy method to make diagnostic inferences about respondents' cognitive skills. In J. P. Leighton & M. J. Gierl (Eds.), Cognitive diagnostic assessment for education: Theory and applications (pp. 242–274). Cambridge University Press.
Gu, Y. (2024). Going deep in diagnostic modeling: Deep cognitive diagnostic models (DeepCDMs). Psychometrika, 89(1), 118–150.
Gu, Y., & Xu, G. (2020). Partial identifiability of restricted latent class models. Annals of Statistics, 48(4), 2082–2107.
Hastie, T., Qian, J., & Tay, K. (2021). An introduction to glmnet. CRAN R Repository, 5, 1–35.
Henson, R. A., Templin, J. L., & Willse, J. T. (2009). Defining a family of cognitive diagnosis models using log-linear models with latent variables. Psychometrika, 74, 191–210.
Hinton, G. E., Osindero, S., & Teh, Y.-W. (2006). A fast learning algorithm for deep belief nets. Neural Computation, 18(7), 1527–1554.
Hinton, G. E., & Salakhutdinov, R. R. (2006). Reducing the dimensionality of data with neural networks. Science, 313(5786), 504–507.
Joshi, A., De Smet, R., Marchal, K., Van de Peer, Y., & Michoel, T. (2009). Module networks revisited: Computational assessment and prioritization of model predictions. Bioinformatics, 25(4), 490–496.
Junker, B. W., & Sijtsma, K. (2001). Cognitive assessment models with few assumptions, and connections with nonparametric item response theory. Applied Psychological Measurement, 25, 258–272.
Koller, D., & Friedman, N. (2009). Probabilistic graphical models: Principles and techniques. MIT Press.
Le Roux, N., & Bengio, Y. (2008). Representational power of restricted Boltzmann machines and deep belief networks. Neural Computation, 20(6), 1631–1649.
Lee, S., & Gu, Y. (2025). Deep discrete encoders: Identifiable deep generative models for rich data with discrete latent layers. Journal of the American Statistical Association, 1–25.
Li, J., Gibbons, R., & Ročková, V. (2025). Sparse Bayesian multidimensional item response theory. Journal of the American Statistical Association, 1–14.
Liu, J., Lee, S., & Gu, Y. (2025). Exploratory general-response cognitive diagnostic models with higher-order structures. Psychometrika, 1–28. Published online.
Ma, W. (2022). A higher-order cognitive diagnosis model with ordinal attributes for dichotomous response data. Multivariate Behavioral Research, 57(2–3), 408–421.
Maris, E. (1999). Estimating multiple classification latent class models. Psychometrika, 64(2), 187–212.
Pearl, J. (1988). Probabilistic reasoning in intelligent systems: Networks of plausible inference. Morgan Kaufmann.
Ranganath, R., Tang, L., Charlin, L., & Blei, D. (2015). Deep exponential families. In G. Lebanon & S. V. N. Vishwanathan (Eds.), Artificial intelligence and statistics (pp. 762–771). PMLR.
Robert, C. P., & Casella, G. (2004). Monte Carlo statistical methods. Springer Science & Business Media.
Rohe, K., & Zeng, M. (2023). Vintage factor analysis with varimax performs statistical inference. Journal of the Royal Statistical Society Series B: Statistical Methodology, 85(4), 1037–1060.
Rupp, A. A., Templin, J., & Henson, R. A. (2010). Diagnostic measurement: Theory, methods, and applications. Guilford Press.
Salakhutdinov, R., & Hinton, G. (2009). Deep Boltzmann machines. In D. van Dyk & M. Welling (Eds.), Artificial intelligence and statistics (pp. 448–455). PMLR.
Segal, E., Pe'er, D., Regev, A., Koller, D., Friedman, N., & Jaakkola, T. (2005). Learning module networks. Journal of Machine Learning Research, 6(19), 557–588.
Shen, X., & Wong, W. H. (1994). Convergence rate of sieve estimates. The Annals of Statistics, 22(2), 580–615.
Tatsuoka, K. K. (1983). Rule space: An approach for dealing with misconceptions based on item response theory. Journal of Educational Measurement, 20, 345–354.
Tay, J. K., Narasimhan, B., & Hastie, T. (2023). Elastic net regularization paths for all generalized linear models. Journal of Statistical Software, 106(1), 1–31.
Templin, J., & Bradshaw, L. (2014). Hierarchical diagnostic classification models: A family of models for estimating and testing attribute hierarchies. Psychometrika, 79(2), 317–339.
Templin, J. L., Henson, R. A., Templin, S. E., & Roussos, L. (2008). Robustness of hierarchical modeling of skill association in cognitive diagnosis models. Applied Psychological Measurement, 32(7), 559–574.
von Davier, M. (2008). A general diagnostic model applied to language testing data. British Journal of Mathematical and Statistical Psychology, 61, 287–307.
von Davier, M., & Lee, Y.-S. (2019). Handbook of diagnostic classification models. Springer International Publishing.
Wainwright, M. J., & Jordan, M. I. (2008). Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning, 1(1–2), 1–305.
Xu, G. (2017). Identifiability of restricted latent class models with binary responses. Annals of Statistics, 45, 675–707.
Zhang, H., Chen, Y., & Li, X. (2020). A note on exploratory item factor analysis by singular value decomposition. Psychometrika, 85(2), 358–372.
