Hostname: page-component-745bb68f8f-5r2nc Total loading time: 0 Render date: 2025-01-26T01:03:36.964Z Has data issue: false hasContentIssue false

Biclustering Models for Two-Mode Ordinal Data

Published online by Cambridge University Press:  01 January 2025

Eleni Matechou*
Affiliation:
University of Kent
Ivy Liu
Affiliation:
Victoria University of Wellington
Daniel Fernández
Affiliation:
Victoria University of Wellington
Miguel Farias
Affiliation:
Coventry University
Bergljot Gjelsvik
Affiliation:
University of Oxford University of Oslo
*
Correspondence should be made to Eleni Matechou, School of Mathematics, Statistics and Actuarial Science, University of Kent, Cornwallis Building, Canterbury, CT2 7NF UK. Email: e.matechou@kent.ac.uk
Rights & Permissions [Opens in a new window]

Abstract

The work in this paper introduces finite mixture models that can be used to simultaneously cluster the rows and columns of two-mode ordinal categorical response data, such as those resulting from Likert scale responses. We use the popular proportional odds parameterisation and propose models which provide insights into major patterns in the data. Model-fitting is performed using the EM algorithm, and a fuzzy allocation of rows and columns to corresponding clusters is obtained. The clustering ability of the models is evaluated in a simulation study and demonstrated using two real data sets.

Type
Original Paper
Creative Commons
Creative Common License - CCCreative Common License - BY
This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
Copyright
Copyright © 2016 The Author(s). This article is published with open access at Springerlink.com

1. Introduction

Measurement data with ordinal categories occur frequently and in many fields of application. For example in medicine, a continuous clinical response is often categorised into ordered subtypes based on histological or morphological terms. In a questionnaire, Likert scale responses might be “better”, “unchanged” or “worse”. When analysing such data, it is of interest to link the ordinal responses to a set of explanatory variables.

Despite being introduced more than 3 decades ago, the proportional odds model (PO, McCullagh, Reference McCullagh1980) is still frequently employed in analysing ordinal response data in, for example, agriculture (Lanfranchi, Giannetto, & Zirilli, Reference Lanfranchi, Giannetto and Zirilli2014), medicine (Skolnick et al., Reference Skolnick, Maas, Narayan, van der Hoop, MacAllister, Ward, Nelson and Stocchetti2014; Tefera & Sharma, Reference Tefera and Sharma2015) and socioeconomic studies (Pechey, Monsivais, Ng, & Marteau, Reference Pechey, Monsivais, Ng and Marteau2015).

One motivation for the PO model assumes that the ordinal response has an underlying continuous variable (Anderson & Philips, Reference Anderson and Philips1981), called a latent variable, that follows a logistic distribution. The extensive use of the PO model is due to its parsimony for modelling the effect of covariates on the response, compared to other similar models such as the baseline-category logit model, thanks to the use of the proportional odds property (Agresti, Reference Agresti2010, Sect. 3.3.1). Additionally, the model parameters are invariant to the way the categories for the ordinal response are formed (Agresti, Reference Agresti2010, Sect. 3.3.3).

In the analysis of two-mode data matrices, with the modes being for example subjects and questions and with all of the elements being ordered categorical responses, one might be interested in modelling the effect of both the rows and columns on the response. An example of such data is an n by p matrix that summarises the responses of n individuals to p questions, each with q possible (ordered) responses. In this case, the PO model can be fitted to identify, for example, individuals and questions that tend to be linked with higher values of the ordinal response.

However, the number of parameters in the PO model increases as the number of rows or columns in the data set increases. As a result, interpretation becomes problematic for large data sets. Identifying patterns related to the heterogeneity of the data, for example clusters of rows or columns that have similar effect on the response, is challenging. Therefore, the formulation of model approaches taking into account the row and column cluster structure of the data is needed.

The work in this paper has been motivated by this need to model potential heterogeneity among the, assumed independent, ordinal responses in two-mode data by identifying row and/or column clusters. As well as a single-mode clustering, our proposed model provides a two-mode clustering, or biclustering, for fuzzy allocation of the rows and/or columns to corresponding clusters. This way, the number of parameters can be reduced considerably as rows and/or columns are clustered in corresponding homogeneous groups assumed to have the same effect on the response. The results provide insights into major patterns in the data, and row/column clusters can be compared and ranked according to their effect on the ordinal response.

A number of model-based or distance-minimising biclustering methods exist that allocate, probabilistically or not, the rows and columns of a data set containing continuous, binary or count data to corresponding clusters. Examples include the double k-means method of Vichi (Reference Vichi2001) and Rocci and Vichi (Reference Rocci and Vichi2008) which, as the name suggests, resembles the k-means algorithm (Hartigan & Wong, Reference Hartigan and Wong1979), and the block mixture models of Govaertand and Nadif (Reference Govaert and Nadif2003, Reference Govaert and Nadif2010). Pledger and Arnold (Reference Pledger and Arnold2014) have recently proposed a group of likelihood-based models fitted using the Expectation–Maximisation algorithm (EM) (Dempster, Laird, & Rubin, Reference Dempster, Laird and Rubin1977) for simultaneous fuzzy clustering of the rows and columns of binary or count data.

The cluster analysis given by Pledger and Arnold (Reference Pledger and Arnold2014) can be considered as a multivariate approach using latent modelling. For both ordered and unordered categorical variables, Desantis, Houseman, Coull, Stemmet-Rachamimiv, and Betensky (Reference Desantis, Houseman, Coull, Stemmet-Rachamimiv and Betensky2008) proposed a one-mode clustering method based on latent modelling, which has been widely applied in many fields (e.g. Desantis, Andrés Houseman, Coull, Nutt, & Betensky, Reference Desantis, Andrés Houseman, Coull, Nutt and Betensky2012; Eluru, Bagheri, & Miranda-Moreno, Reference Eluru, Bagheri, Miranda-Moreno and Fu2012; Molitor, Papathomas, Jerrett, & Richardson, Reference Molitor, Papathomas, Jerrett and Richardson2010; Scharoun-Lee et al., Reference Scharoun-Lee, Gordon-Larsen, Adair, Popkin, Kaufman and Suchindran2011).

In this paper, we generalise the Pledger and Arnold (Reference Pledger and Arnold2014) work to the case of ordinal categorical response data, specifically using the PO model parameterisation. The proposed model structure is an extension of the one-mode clustering model given by Desantis et al. (Reference Desantis, Houseman, Coull, Stemmet-Rachamimiv and Betensky2008).

Section 2 describes the model structure. The performance of several model selection criteria in selecting the true number of clusters in the data when our proposed model is used is assessed in Sect. 3.1. The reliability of the clustering resulting from our proposed model is evaluated, using simulation, in Sect. 3.2. Finally, applications to two real data sets are shown in Sects. 4.1 and 4.2 and the resulting clusters are compared to those obtained by double k-means (Vichi, Reference Vichi2001).

2. Materials and Methods

2.1. Background: Proportional Odds Model

Consider the data set as an n × p \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$n\times p$$\end{document} matrix Y \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\mathbf Y$$\end{document} with entry y i j \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$y_{ij}$$\end{document} the realisation of a categorical distribution with q cells and θ i j 1 , , θ i j q \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\theta _{ij1},\ldots ,\theta _{ijq}$$\end{document} probabilities, k = 1 q θ i j k = 1 , i , j \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\sum _{k=1}^{q}\theta _{ijk}=1, \forall i, j$$\end{document} . Let the set of model parameters be denoted by ϕ \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\pmb \phi $$\end{document} .

Under the PO model, and in the case where the additive effect of rows and columns on the response is considered

(1) θ i j k = exp ( μ k - α i - β j ) 1 + exp ( μ k - α i - β j ) , k = 1 exp ( μ k - α i - β j ) 1 + exp ( μ k - α i - β j ) - exp ( μ k - 1 - α i - β j ) 1 + exp ( μ k - 1 - α i - β j ) , 1 < k < q 1 - k = 1 q - 1 θ i j k , k = q \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\begin{aligned} \theta _{ijk}=\left\{ \begin{array}{ll} \frac{\exp (\mu _{k}-\alpha _i-\beta _j)}{1+\exp (\mu _{k}-\alpha _i-\beta _j)}, &{} \quad k=1\\ &{} \\ \frac{\exp (\mu _{k}-\alpha _i-\beta _j)}{1+\exp (\mu _{k}-\alpha _i-\beta _j)}-\frac{\exp (\mu _{k-1}-\alpha _i-\beta _j)}{1+\exp (\mu _{k-1}-\alpha _i-\beta _j)}, &{}\quad 1< k < q\\ &{} \\ 1-\sum _{k=1}^{q-1}\theta _{ijk}, &{}\quad k=q\\ \end{array}\right. \end{aligned}$$\end{document}

or alternatively,

(2) logit P ( Y i j k ) = μ k - α i - β j , 1 k < q + , k = q , \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\begin{aligned} \text{ logit }\left[ P(Y_{ij}\le k)\right] =\left\{ \begin{array}{ll} \mu _{k}-\alpha _i-\beta _j,&{} \quad 1\le k<q\\ +\infty , &{} \quad k=q,\\ \end{array}\right. \end{aligned}$$\end{document}

where μ k \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\mu _{k}$$\end{document} is the kth cut-off point, with μ 1 < μ 2 < < μ q - 1 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\mu _1<\mu _2<\cdots <\mu _{q-1}$$\end{document} , and α i \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\alpha _i$$\end{document} , β j \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\beta _j$$\end{document} are, respectively, the effect of row i, column j on the response, with α 1 = β 1 = 0 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\alpha _1=\beta _1=0$$\end{document} . The total number of model parameters is equal to ν = ( q - 1 ) + ( n - 1 ) + ( p - 1 ) \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\nu =(q - 1) + (n-1) + (p-1)$$\end{document} .

2.2. Biclustering: Simultaneous Clustering of Rows and Columns

Suppose that the rows come from a finite mixture with R components or row clusters while the columns come from a finite mixture with C components or column clusters. Rows that belong to the same row cluster, r, are assumed to have the same effect on the response, modelled using parameter α r \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\alpha _r$$\end{document} . Similarly, columns that belong to the same column cluster c have the same effect on the response modelled by parameter β c \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\beta _{c}$$\end{document} . If cell i, j belongs to row group r and column group c then, under the PO model and assuming an additive effect of the clusters on the response,

(3) logit P ( Y i j k ) = μ k - α r - β c if 1 k < q and + otherwise . \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\begin{aligned} \text{ logit }\left[ P(Y_{ij}\le k)\right] = \mu _{k}-\alpha _r-\beta _{c}\; \text {if}\; 1\le k<q\; \text {and}\;+ \infty \; \text {otherwise}. \end{aligned}$$\end{document}

The proportion of rows in row group r is π r \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\pi _r$$\end{document} and the proportion of columns in column group c is κ c \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\kappa _{c}$$\end{document} , with r = 1 R π r = c = 1 C κ c = 1 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\sum _{r=1}^R{\pi }_r=\sum _{c=1}^C{\kappa }_{c}=1$$\end{document} . As the rows and columns in the same row and column cluster, respectively, share the same parameters, α r \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\alpha _r$$\end{document} and β c \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\beta _{c}$$\end{document} , respectively, there are now ( q - 1 ) + 2 ( R - 1 ) + 2 ( C - 1 ) \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$(q-1)+2(R-1)+2(C-1)$$\end{document} parameters in the model, where R n \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$R\le n$$\end{document} and C p \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$C \le p$$\end{document} . Choosing R n \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$R \ll n$$\end{document} and C p \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$C \ll p$$\end{document} ensures that the number of independent parameters in this model is lower than the number of parameters in the proportional odds model formulated in expression (2).

However, cluster membership is typically unknown and hence the (incomplete data) likelihood sums over all possible partitions of rows into R clusters and over all possible partitions of columns into C clusters

(4) ( ϕ , π , κ | Y ) = log c 1 = 1 C c p = 1 C κ c 1 κ c p r 1 = 1 R r n = 1 R π r 1 π r n i = 1 n j = 1 p k = 1 q θ r i c j k I ( y i j = k ) , \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\begin{aligned} \ell ({{\pmb \phi }}, {\pmb \pi },{\pmb \kappa } |\mathbf {Y})=\log \left[ \sum _{c_{1}=1}^{C} \cdots \sum _{c_{p}=1}^{C} \kappa _{c_{1}} \cdots \kappa _{c_{p}} \sum _{r_{1}=1}^{R} \cdots \sum _{r_{n}=1}^{R} \pi _{r_{1}} \cdots \pi _{r_{n}} \prod _{i=1}^{n}\prod _{j=1}^{p}\prod _{k=1}^{q}{{\theta }_{r_ic_jk}^{I(y_{ij}=k)}} \right] , \end{aligned}$$\end{document}

where π r i \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\pi _{r_i}$$\end{document} and κ c j \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\kappa _{c_j}$$\end{document} is the proportion of rows and columns, respectively, that belong to row group r, column group c for the particular partition i, j, of rows and columns into R and C clusters, respectively.

Here, following Pledger and Arnold (Reference Pledger and Arnold2014, Sect. 2.2.2), we adopt a finite mixture model which, assuming row-based conditional independence, we can describe using the following (incomplete data) log-likelihood

(5) ( ϕ , π , κ | Y ) = log c 1 = 1 C c p = 1 C κ c 1 κ c p i = 1 n r = 1 R π r j = 1 p k = 1 q θ r c j k I ( y i j = k ) , \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\begin{aligned} \ell ({{\pmb \phi }}, {\pmb \pi },{\pmb \kappa } |\mathbf {Y})=\log \left[ \sum _{c_1=1}^C\ldots \sum _{c_{p}=1}^C {\kappa }_{c_1}\ldots {\kappa }_{c_{p}}\prod _{i=1}^n \left\{ \sum _{r=1}^R\pi _r\prod _{j=1}^p\prod _{k=1}^q{\theta }_{rc_jk}^{I(y_{ij}=k)}\right\} \right] , \end{aligned}$$\end{document}

which sums over the possible column cluster partitions only. Equation (5) is obtained from Eq. (4) by taking terms of the i product through the r sums.

The additive model shown in Eq. (3) can be extended to a model which allows for an interaction between the row and column cluster effects, denoted by parameters γ \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\gamma $$\end{document} , by modelling the logits of the cumulative probabilities as

(6) logit P ( Y i j k ) = μ k - α r - β c - γ r c if 1 k < q and + otherwise , \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\begin{aligned} \text{ logit }\left[ P(Y_{ij}\le k)\right] =\mu _{k}-\alpha _r-\beta _{c}-\gamma _{rc}\; \text {if}\; 1\le k<q\; \text {and}\;+\infty \;\text {otherwise}, \end{aligned}$$\end{document}

and, assuming constraints r γ r c = 0 c \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\sum _r\gamma _{rc}=0 \; \forall c$$\end{document} and c γ r c = 0 r \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\sum _{c}\gamma _{rc}=0 \; \forall r$$\end{document} , increasing the number of parameters by ( R - 1 ) ( C - 1 ) \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$(R-1)(C-1)$$\end{document} compared to the additive case.

The model can also be altered to consider one-mode clustering, and the set of different models that can be fitted are shown in Table 1 with details given in Appendix A. The first two columns in Table 1, labelled as “R” and “C”, denote, respectively, the number of row and column clusters assumed in the model when R = 1 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$R=1$$\end{document} and C = 1 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$C=1$$\end{document} all rows/columns are homogeneous forming a single row/column cluster, when R = n \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$R=n$$\end{document} and C = p \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$C=p$$\end{document} all rows/columns are heterogeneous, each forming its own row/column cluster, when R = r \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$R=r$$\end{document} and C = c \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$C=c$$\end{document} there are r and c homogeneous row/column clusters, respectively. Additionally, models incorporating an interaction term are indicated by the associated parameters γ l k \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\gamma _{lk}$$\end{document} with l indexing the row clusters and k the column clusters.

Table 1. Model set with corresponding number of parameters ν . \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\nu .$$\end{document}

The following constraints are placed, where appropriate: α 1 = 0 , β 1 = 0 , k γ k l = 0 , l , l γ k l = 0 , k , r = 1 R π r = 1 , c = 1 C κ c = 1 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\alpha _1=0,\; \beta _1=0,\; \sum _k\gamma _{kl}=0,\; \forall l,\; \sum _l\gamma _{kl}=0,\; \forall k,\; \sum _{r=1}^R\pi _r=1,\; \sum _{c=1}^C\kappa _{c}=1$$\end{document} . R = 1 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$R=1$$\end{document} : a single row cluster, R = r \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$R=r$$\end{document} : r row clusters, R = n \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$R=n$$\end{document} : each row is in its own cluster. Similarly, C = 1 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$C=1$$\end{document} : a single column cluster, C = c \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$C=c$$\end{document} : c column clusters and C = p \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$C=p$$\end{document} : each column is in its own cluster. For example, when R = 1 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$R=1$$\end{document} , C = c \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$C=c$$\end{document} , the rows form one cluster, while the columns form c clusters and the logits of the cumulative probabilities in the PO model for column cluster c and 1 k < q \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$1\le k<q$$\end{document} are logit P ( Y i j k ) = μ k - β c \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\text{ logit }\left[ P(Y_{ij}\le k)\right] =\mu _{k}-\beta _{c}$$\end{document} , for all rows. If on the other hand R = n \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$R=n$$\end{document} , C = c \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$C=c$$\end{document} , the cumulative probabilities for row i, column cluster c are, assuming an interaction between row and column effects and 1 k < q \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$1\le k<q$$\end{document} , logit P ( Y i j k ) = μ k - α i - β c - γ i c . \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\text{ logit }\left[ P(Y_{ij}\le k)\right] =\mu _{k}-\alpha _i-\beta _{c}-\gamma _{ic}.$$\end{document}

We denote by Z i r \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$Z_{ir}$$\end{document} and X j c \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$X_{jc}$$\end{document} the indicator random variables for group membership of row i in row group r and column j in column group c, respectively. We use the EM algorithm (Dempster et al., Reference Dempster, Laird and Rubin1977) by treating cluster membership as the missing data and derive estimates of the posterior probability of allocation of row i to row cluster r and of column j to column cluster c, given respectively by E ( Z i j ) = z ^ i r \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$E(Z_{ij})=\widehat{z}_{ir}$$\end{document} and E ( X j c ) = x ^ j c \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$E(X_{jc})=\widehat{x}_{jc}$$\end{document} , for i = 1 , , n \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$i=1,\ldots ,n$$\end{document} , j = 1 , , p \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$j=1,\ldots ,p$$\end{document} , r = 1 , , R \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$r=1,\ldots ,R$$\end{document} and c = 1 , , C \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$c=1,\ldots ,C$$\end{document} with r = 1 R z ^ i r = c = 1 C x ^ j c = 1 , i , j \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\sum _{r=1}^R\widehat{z}_{ir}=\sum _{c=1}^C\widehat{x}_{jc}=1,\ \forall i,j$$\end{document} .

The lack of a posteriori independence of the Z i r \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$Z_{ir}$$\end{document} and X j c \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$X_{jc}$$\end{document} makes the evaluation of the expected value of their product computationally expensive as it requires a sum either over all possible allocations of rows to row groups, or over all possible allocations of columns to column groups. The variational approximation (Govaert & Nadif, Reference Govaert and Nadif2005) which we employ (see Appendix A.3.1. for details) is a solution to this problem.

We give details of the EM algorithm steps in Appendix A for all models listed in Table 1.

All the computer code is written in R (R Core Team, 2014), and the (complete data) log-likelihood (given in Appendix A) is maximised using the Newton–Raphson algorithm provided as an option in optim to estimate parameters μ 1 , , μ q - 1 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\mu _1,\ldots ,\mu _{q-1}$$\end{document} and the effects of row and column clusters, as well as their interaction, if these exist in the model being fitted. Since the likelihood surface is multimodal, the EM algorithm is started from a number of different points and the iteration with the highest obtained likelihood value is retained (Everitt, Landau, Leese, & Stahl, Reference Everitt, Landau, Leese and Stahl2011). The R code to fit the models is available upon request from the first author.

3. Simulation Studies

We have performed two simulation studies: one to evaluate the performance of 10 model selection criteria in recovering the true number of clusters when our proposed models are used (Sect. 3.1) and one to evaluate the reliability of our proposed models (Sect. 3.2).

3.1. Model Selection

Since these are likelihood-based models, likelihood-based model selection criteria, such as AIC (Akaike, Reference Akaike1973), its small-sample modification ( AIC c \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\text{ AIC }_{c}$$\end{document} , Akaike, Reference Akaike1973; Burnham & Anderson, Reference Burnham and Anderson2002; Hurvich & Tsai, Reference Hurvich and Tsai1989), BIC (Schwarz, Reference Schwarz1978) and its Integrated Classification Likelihood version (ICL-BIC, Biernacki, Celeux, & Govaert, Reference Biernacki, Celeux and Govaert2000), can be used to select amongst them.

Following Fernández, Arnold, and Pledger (Reference Fernández, Arnold and Pledger2014), we set up a simulation study to empirically establish a relationship between our likelihood-based models for ordinal data, specifically using the PO model, and the performance of 10 information criteria (Table 2) in recovering the true number of cluster components.

Table 2. Information criteria summary table.

\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\ell $$\end{document} is the maximised incomplete-data log-likelihood (see Eq. 5); ( 1 ) \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\ell (1)$$\end{document} is the maximised incomplete-data log-likelihood \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\ell $$\end{document} without clustering structure; and c \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\ell _{c}$$\end{document} is the maximised complete-data log-likelihood given in Appendix A. The third column categorises the criteria according to whether they were proposed for model selection in a regression setting or for clustering. The last column indicates whether the penalty depends on the number of parameters, ν \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\nu $$\end{document} , the total sample size which is the number of elements in the response matrix Y, np, and/or the entropy function, EN ( · ) = - c . \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\text {EN}(\cdot )=\ell - \ell _{c}.$$\end{document}

We set n = 150 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$n=150$$\end{document} , p = 15 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$p=15$$\end{document} , q = 4 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$q=4$$\end{document} , R = 3 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$R=3$$\end{document} and C = 2 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$C=2$$\end{document} . We specified five scenarios by varying the row and column mixing proportions: a data set with similar dimensions ( n = 150 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$n=150$$\end{document} and p = 15 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$p=15$$\end{document} ) to the data analysed in the example in Sect. 4.2 (Scenario 1), balanced row and column mixing proportions (Scenario 2), balanced column mixing proportions but unbalanced row proportions (Scenario 3), unbalanced row and column mixing proportions (Scenario 4) and one of the row mixing proportions close to zero (Scenario 5).

For each scenario, we simulated 100 data sets and noted the selected model using each of the 10 criteria out of models with R = 1 , 2 , 3 , 4 , 5 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$R=1,2,3,4,5$$\end{document} and C = 1 , 2 , 3 , 4 , 5 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$C=1,2,3,4,5$$\end{document} . For each simulated data set, the EM algorithm was repeated 10 times with random starting points and the best ML estimates (those that led to highest log–likelihood value) were kept.

Figure 1 displays the percentage of cases in which each information criterion correctly recovered the true number of row and column clusters, i.e. the true model that generated the data, averaged across the five scenarios. AIC3 has the best performance (selecting the correct model in 78 % of cases), followed by BIC (75 %), AIC, AIC c \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\text{ AIC }_{\text{ c }}$$\end{document} , AIC u \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\text{ AIC }_{\text{ u }}$$\end{document} and CAIC (73 %).

Figure 1. Simulation study to assess the performance of model selection criteria in recovering the true number of clusters for our proposed biclustering finite mixture PO (POFM) model. Bars depict the percentage of cases when the true model is correctly identified by each criterion, averaged across the five scenarios.

Our results are in accordance with Fonseca and Cardoso (Reference Fonseca and Cardoso2007) for the categorical case. ICL–BIC is underestimating the number of clusters (selecting a smaller number of clusters in 32 % of cases) and CLC is overestimating the number of clusters in 29 % of cases. A very poor performance is obtained by AWE and NEC (selecting the correct model in 46 and 24 % of cases, respectively).

It is important to highlight that these results are simply evaluating the ability of model selection criteria in selecting the right number of clusters in the mixture, but not necessarily in providing the best clustering structure for the data.

3.2. Model Evaluation

In this section, we evaluate the performance of our proposed method in (i) biclustering, varying the cluster sizes and the sample size and (ii) one-dimensional row clustering, compared to that of double k-means (Vichi, Reference Vichi2001) and standard k-means, respectively.

(i) We set R = 3 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$R=3$$\end{document} , C = 2 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$C=2$$\end{document} and q = 3 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$q=3$$\end{document} or 5. The cutpoint values are obtained such that the response categories have equal probabilities for the baseline row and column cluster. That is, P ( Y i j = 1 ) = P ( Y i j = 2 ) = = P ( Y i j = q ) \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$P(Y_{ij}=1)=P(Y_{ij}=2)=\cdots =P(Y_{ij}=q)$$\end{document} when row i belongs to the first row cluster and column j belongs to the first column cluster. The cutpoint values are { μ 1 = log ( 1 / 2 ) , μ 2 = log ( 2 ) } \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\{\mu _1=\log (1/2),\,\mu _2=\log (2)\}$$\end{document} when q = 3 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$q=3$$\end{document} , and { μ 1 = log ( 1 / 4 ) , μ 2 = log ( 2 / 3 ) , μ 3 = log ( 3 / 2 ) , μ 4 = log ( 4 ) } \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\{\mu _1=\log (1/4),\,\mu _2=\log (2/3),\,\mu _3=\log (3/2),\,\mu _4=\log (4)\}$$\end{document} when q = 5 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$q=5$$\end{document} . We consider ( α 1 , α 2 , α 3 ) = ( 0 , 1 , 2 ) \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$(\alpha _1,\,\alpha _2,\,\alpha _3)=(0,\,1,\,2)$$\end{document} , ( β 1 , β 2 ) = ( 0 , - 1 ) \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$(\beta _1,\,\beta _2)=(0,\,-1)$$\end{document} and π 1 = π 2 = π 3 = 1 / 3 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\pi _1=\pi _2=\pi _3=1/3$$\end{document} . We vary n, p, q and ( κ 1 , κ 2 ) \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$(\kappa _1,\,\kappa _2)$$\end{document} as n = ( 9 , 30 , 99 ) \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$n=(9,\,30,\,99)$$\end{document} , p = ( 10 , 20 , 100 ) \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$p=(10,\,20,\,100)$$\end{document} , q = ( 3 , 5 ) \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$q=(3,\,5)$$\end{document} and ( κ 1 , κ 2 ) = ( 0.5 , 0.5 ) \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$(\kappa _1,\,\kappa _2)=(0.5,\,0.5)$$\end{document} , ( 0.4 , 0.6 ) \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$(0.4,\,0.6)$$\end{document} , ( 0.3 , 0.7 ) \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$(0.3,\,0.7)$$\end{document} , ( 0.2 , 0.8 ) \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$(0.2,\,0.8)$$\end{document} . The case with balanced column clusters assumes ( κ 1 , κ 2 ) = ( 0.5 , 0.5 ) \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$(\kappa _1,\,\kappa _2)=(0.5,\,0.5)$$\end{document} . For an unbalanced case, the scenarios are from ( 0.4 , 0.6 ) \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$(0.4,\,0.6)$$\end{document} to ( 0.2 , 0.8 ) \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$(0.2,\,0.8)$$\end{document} .

The response { Y i j } \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\{Y_{ij}\}$$\end{document} values are generated from a categorical distribution with size 1 and probabilities constrained as in expression (1). We assign the first 1/3 of rows to row cluster 1, the second 1/3 to row cluster 2 and the last 1/3 to row cluster 3. Similarly, the first 1 / κ 1 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$1/\kappa _1$$\end{document} of columns are assigned to column cluster 1, and the rest of the columns to column cluster 2. We simulate 100 data sets for each scenario.

Table 3 shows the mean of parameter estimates obtained for α 2 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\alpha _2$$\end{document} , α 3 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\alpha _3$$\end{document} and β 2 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\beta _2$$\end{document} from 100 simulated data sets. We are aware of the bias in the estimated parameters when n or p are small. This is due to the fact that the clusters are not fixed and hence their effect on the response is not fixed either. For example, a group of subjects who belong to a certain cluster in the true model might be allocated into a different cluster for a simulated data set. Or, they might be separated into different clusters. However, when both n and p are large, the means are close to the true parameters, because it is less likely to allocate a large number of subjects to a wrong cluster and, hence, the clusters themselves are more similar to the true clusters.

Regardless of the bias, the overall result shows that for balanced cases with ( κ 1 , κ 2 ) = ( 0.5 , 0.5 ) \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$(\kappa _1,\,\kappa _2)=(0.5,\,0.5)$$\end{document} , the estimates of the column effects are closer to the truth than for highly unbalanced cases ( κ 1 , κ 2 ) = ( 0.2 , 0.8 ) \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$(\kappa _1,\,\kappa _2)=(0.2,\,0.8)$$\end{document} when n is small. The unbalanced column clusters do not affect the quality of the row cluster effect estimates. In general, when both n and p increase, the quality of row cluster effect estimates improves. The standard errors are between 0.05 to 0.5 for the cases of p = 10 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$p=10$$\end{document} . For the other cases, they range from 0.001 to 0.08.

To evaluate the clustering ability of our proposed method, we calculate the average proportion of times that the pairwise grouping is correct (Rand index, Rand, Reference Rand1971) over 100 simulated data sets. For example, if two rows are in the same cluster for the true model, but the proposed method allocates them to different clusters, then this pair is mis-clustered and vice-versa. We report the average Rand index for all row/column pairs in Table 4 when ( κ 1 , κ 2 ) = ( 0.5 , 0.5 ) \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$(\kappa _1,\,\kappa _2)=(0.5,\,0.5)$$\end{document} and ( 0.2 , 0.8 ) \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$(0.2,\,0.8)$$\end{document} for both our proposed approach and the double k-means algorithm (Vichi, Reference Vichi2001). The two approaches have similar performance which improves as n and p increase and when the column clusters are balanced. For our approach, the largest standard error is 0.03 for the highly unbalanced cases and most standard errors are between 0.001 to 0.01.

Table 3. The average estimate obtained for each parameter over 100 simulations.

Table 4. The average Rand index for 100 simulated data sets based on our proposed (POFM) and double k-means (dkm) methods.

(ii) We set R = 3 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$R=3$$\end{document} and C = 1 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$C=1$$\end{document} , i.e. logit P ( Y i j k ) = μ k - α r if 1 k < q and + otherwise \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\text{ logit }\left[ P(Y_{ij}\le k)\right] =\mu _{k}-\alpha _r\; \text {if}\; 1\le k<q\; \text {and}\; +\infty \; \text {otherwise}$$\end{document} . The cutpoint values are calculated as in simulation setting (i) above. We vary n and p as n = ( 9 , 30 , 99 ) \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$n=(9,\,30,\,99)$$\end{document} , p = ( 10 , 20 , 100 ) \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$p=(10,\,20,\,100)$$\end{document} and π 1 = π 2 = π 3 = 1 / 3 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\pi _1=\pi _2=\pi _3=1/3$$\end{document} with ( α 1 , α 2 , α 3 ) = ( 0 , 1 , 2 ) , ( 0 , 2 , 4 ) , ( 0 , 1 , 4 ) \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$(\alpha _1,\,\alpha _2,\,\alpha _3)=(0,\,1,\,2),\,(0,\,2,\,4),\,(0,\,1,\,4)$$\end{document} and q = ( 3 , 5 , 7 ) \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$q=(3,\,5,\,7)$$\end{document} .

When p is large, there are more data points for each row. When q is large, the ordered categorical response has a finer scale. For the row cluster effects { α r , r = 1 , 2 , 3 } \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\{\alpha _r,\,r=1,2,3\}$$\end{document} , the last setting ( 0 , 1 , 4 ) \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$(0,\,1,\,4)$$\end{document} gives an unbalanced effect where the difference between the first two clusters is small, but the first two clusters are quite different from the third cluster.

Table 5 shows the average Rand index for 1000 simulated data sets for each of the scenarios, comparing the proposed method (POFM) with k-means. All standard errors for the index are less than 0.0026. Most of them are around 0.001. POFM performs better than k-means when the cluster effects are balanced. In general, the greater n, p, q or the cluster effects are, the better the performance. The only case when k-means considerably outperforms POFM is when ( α 1 , α 2 , α 3 ) = ( 0 , 1 , 4 ) \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$(\alpha _1,\,\alpha _2,\,\alpha _3)=(0,\,1,\,4)$$\end{document} and p is large. For this particular case, POFM fails to distinguish between Clusters 1 and 2, and partitions the individuals into only two clusters, leaving one of the clusters empty. However, the quality of the row clustering is still satisfactory, with the average Rand index greater than 70 % \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$70~\%$$\end{document} in all cases.

Table 5. The average Rand index based on our proposed (POFM) and double k-means (dkm) methods for 1000 simulated data sets.

4. Results: Case-Studies

4.1. Religious beliefs

We consider part of the data set from a study first published by Wiech et al. (Reference Wiech, Farias, Kahane, Shackel, Tiede and Tracey2008). Twelve individuals, self-classified as religious, replied to 16 questions, shown in Appendix B, all rated on a 6-point Likert scale, (1) “Strongly disagree”, ..., (6) “Strongly agree”. The questions were designed to assess an individual’s beliefs on the level of control that god (first 8 questions) and powerful other individuals (last eight questions) have on their lives.

The biclustering model proposed in Sect. 2 was fitted to the 12 by 16 matrix by considering R , C = 2 , , 4 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$R, C = 2,\ldots ,4$$\end{document} . The model with the greatest support by AIC3 has R = 3 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$R=3$$\end{document} , C = 2 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$C=2$$\end{document} and an interaction between row group effects and column group effects.

The two column clusters separate the questions into the two categories (god and others) almost perfectly. Cluster 1 includes questions {1, 2, 3, 4, 5, 6, 8, 10}, while Cluster 2 includes questions {7, 9, 11, 12, 13, 14, 15, 16}. The three row clusters are {3, 4, 5, 6, 8, 9, 10, 12}, {1, 2, 11} and {7}. Double k-means (Vichi, Reference Vichi2001) gives the same row clusters and similar column clusters {2, 3, 4, 5, 6, 8}, {1, 7, 9, 10, 11, 12, 13, 14, 15, 16}.

The estimated probabilities of replying 3 or above to each of the two question clusters for the three row clusters are shown in Figure 2. All row groups tend to agree more with god-related questions than with questions related to the effect of other powerful people. The estimated probabilities of agreeing with the god-related questions do not vary considerably between the three row clusters. However, that is not the case for the second column group since Row Cluster 1 and particularly Row Cluster 3, which consists of Individual 7 alone, tend to give lower scores than individuals in Row Cluster 2. Note that in addition, Individual 7 strongly agrees with questions in Cluster 1, demonstrating more extreme views than individuals belonging to the other clusters, who tend to be more moderate in their answers.

Figure 2. Estimated probabilities of replying 3 or above to each of the 2 column clusters for all 3 row clusters, as derived by the biclustering model with R = 3 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$R=3$$\end{document} , C = 2 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$C=2$$\end{document} .

4.2. Attempted Suicides

The data set was collected as part of a study of patients admitted for deliberate self-harm (DSH) at the Acute Medical Departments of three major hospitals in Eastern Norway. We consider the answers of 151 individuals to 13 questions, shown in Appendix C, that were designed to assess the level of depression of the respondent by means of the Beck Depression Inventory-Short Form (BDI-SF) (Furlanetto, Mendlowicz, & Romildo Bueno, Reference Furlanetto, Mendlowicz and Romildo Bueno2005). Response options range from 1 to 4, with higher scores indicating higher levels of depression (Beck, Schuyler, & Herman, Reference Beck, Schuyler and Herman1974).

We fitted biclustering models with R = 2 , , 5 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$R=2,\ldots ,5$$\end{document} and C=2 or 3. The model supported by AIC3 has R = 5 , C = 2 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$R=5, C=2$$\end{document} and an additive effect of row and column groups on the response.

The two column clusters are {1, 2, 3, 4, 5, 7, 8, 10, 13} and {6, 9, 11, 12}, with the first cluster receiving higher scores than the second ( β ^ 2 = - 0.99 ( 0.10 ) \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\widehat{\beta }_2=-0.99 (0.10)$$\end{document} ), suggesting that the nine questions of Cluster 1 are, possibly, markers of more severe forms of depression. The allocation of individuals to the five row groups is in proportion to 0.211, 0.266, 0.208, 0.030, 0.285. Double k-means (Vichi, Reference Vichi2001) gives the following column clusters: {2, 3, 4, 5, 6, 7, 8} and {1, 9, 10, 11, 12, 13}. For row clusters, we present the proportion of individuals from each of our clusters that are allocated to each of the double k-means clusters in Table 6, where it can be seen that with the exception of Cluster 4, the highest proportions appear in the diagonal of the table.

Table 6. Percent of individuals from the five POFM clusters, represented in the rows, that are clustered in the corresponding five double k-means (Vichi, Reference Vichi2001) clusters.

The fourth row cluster, which consists of four individuals, is believed to show the most signs of depression since α ^ 4 = 1.8 ( 0.32 ) \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\widehat{\alpha }_4=1.8(0.32)$$\end{document} . The first cluster follows with α ^ 1 = 0 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\widehat{\alpha }_1=0$$\end{document} since it is the baseline, followed by Clusters 5 ( α ^ 5 = - 1.14 ( 0.12 ) \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\widehat{\alpha }_5=-1.14 (0.12)$$\end{document} ), 2 ( α ^ 2 - 2.37 ( 0.13 ) \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\widehat{\alpha }_2 -2.37(0.13)$$\end{document} ), and 3 ( α ^ 3 = - 3.79 ( 0.16 ) \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\widehat{\alpha }_3=-3.79 (0.16)$$\end{document} ). In fact, no one in Cluster 4 contacted someone for help after their attempt, while the corresponding proportions for the other four clusters are all greater than 25 % \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$25~\%$$\end{document} , which demonstrates the greater determination of individuals in Cluster 4 to succeed in their attempt. Of course, the size of Cluster 4 is possibly too small to make meaningful comparisons of this type. However, the proportion of individuals in Clusters 1, 5, 2 and 3 that had at least one episode of DSH within three months after the study is, respectively, equal to 30, 24, 16 and 3.4 % \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$3.4~\%$$\end{document} . DSH is one of the most robust predictors of subsequent death by suicide (Hawton, Casanas, Comabella, Haw, & Saunders, Reference Hawton, Casanas, Comabella, Haw and Saunders2013). The risk of suicide among DSH patients treated at hospital is 30- to 200-fold in the year following an episode compared to individuals with no history of DSH (Cooper et al., Reference Cooper, Kapur, Webb, Lawlor, Guthrie, Mackway-Jones and Appleby2005; Hawton et al., Reference Hawton, Bergen, Kapur, Cooper, Steeg, Ness and Waters2012; Owens, Horrocks, & House, Reference Owens, Horrocks and House2002). Our model has successfully ordered the groups in terms of their risk of DSH within three months since the data we considered were collected.

5. Discussion

Our biclustering models identify homogeneous groups of both rows and columns in two-mode data sets of ordinal responses, reducing the number of parameters needed to adequately describe the data and therefore easing interpretation. They fully account for the ordinal nature of the responses, while, being likelihood-based, give access to tools for selecting between possible models.

We have performed an extensive simulation study to compare the performance of a number of model selection criteria in identifying the correct number of mixture components for models and data such as the ones we considered in our applications, conditional on using the EM algorithm and the variational approximation of Govaert and Nadif (Reference Govaert and Nadif2005). The variational approximation is known to produce local optima, and hence it is recommended to use different random starting values for several runs of the EM algorithm. Recently, Keribin, Brault, Celeux, and Govaert (Reference Keribin, Brault, Celeux and Govaert2014) developed latent block models for categorical data, considering a Bayesian approach, which do not require the aforementioned approximation. The potential to develop such models for the PO parameterization is a matter of future research.

In the two real data applications considered, both including questionnaire-type data designed to gain knowledge about the participants’ personality, feelings and way of thinking, the clusters identified by the model agree with our knowledge of the system and provide useful insight of the characteristics of the participants. Especially in the example of Sect. 4.2, the way the participants were clustered agrees with information collected three months after the study was conducted.

In the analysis presented in Sect. 4.2 we have considered only individuals with complete records, excluding participants with missing data. Missing data are often present in similar studies; and, hence, future work could extend the models to deal with such issues. Fitting the models using a Bayesian approach could provide a way of dealing with the missing data and also of choosing the right number of clusters, as, for example, in van Dijk, van Rosmalen, and Paap (Reference van Dijk, van Rosmalen and Paap2009) and Wyse and Friel (Reference Wyse and Friel2012), or of appropriately averaging over models, for example using reversible jump MCMC (Green, Reference Green1995).

Substantial developments in specialised methods for ordinal data have recently been made (see Liu & Agresti, Reference Liu and Agresti2005, for an overview). For instance, Fernández et al. (Reference Fernández, Arnold and Pledger2014) have recently developed one- and two-dimensional clustering models for ordinal data having a likelihood-based foundation. They did this by using the assumption of the ordinal stereotype model, which allows the determination of a new spacing of the ordinal categories, as dictated by the data. The models presented in this paper may be extended to other ordinal models such as the adjacent-categories logit models, continuation-ratio logit models, and mean response models (see Agresti, Reference Agresti2012, for details on these models). Similarly, incorporating covariates into the model, when these are available, is straightforward by adjusting the linear predictor accordingly.

We have presented the case when q, i.e. the number of levels, is the same for all variables. However, the models are easily extended to allow for a set of cutpoints to be calculated for each unique value of q observed in the data set.

The area of application of these models is extremely wide and includes market research, where questions of the type “How likely are you to buy this product in the future” have possible responses “Very likely to buy”, “Likely to buy”, “May or may not buy”, etc. Additionally, the models are useful for services, such as websites, that review products, such as books, music albums, hotels. and provide recommendations to the users according to their own past reviews, as they can simultaneously cluster the individuals according to their taste, but also the products according to the reviews they have received from all users.

Future research will develop a graphical method for matrix visualisation, taking the resulting probabilities of allocation for each individual data point into account. The existing graphical methods rely on the use of ad hoc distance metrics and similarity measures which, as we have noted above, do not respect the full ordinal nature of the data.

Acknowledgments

We are grateful to Shirley Pledger and Richard Arnold for the discussions about the Pledger and Arnold (Reference Pledger and Arnold2014) paper and to Maurizio Vichi for sharing his double k-means Matlab code.

Footnotes

Electronic supplementary material The online version of this article (doi: https://doi.org/10.1007/s11336-016-9503-3) contains supplementary material, which is available to authorized users.

References

Agresti, A., (2010). Analysis of Ordinal Categorical Data 2New Jersey: WileyCrossRefGoogle Scholar
Agresti, A., (2012). Categorical data analysis New Jersey: WileyGoogle Scholar
Akaike, H., (1973). Information theory and an extension of the maximum likelihood principle. B. N. Petrov, and F. Caski, (eds.) Proceeding of the Second International Symposium on Information Theory. Akademiai Kiado, Budapest, pp. 267281.Google Scholar
Anderson, J. A., & Philips, P. R., (1981). Regression, discrimination and measurement models for ordered categorical variables. Applied Statistics, 30 2231 CrossRefGoogle Scholar
Banfield, J. D., & Raftery, A. E., (1993). Model-based Gaussian and non-Gaussian clustering. Biometrics, 49 803821 CrossRefGoogle Scholar
Beck, A. T., Schuyler, D., & Herman, I. (1974). Development of suicidal intent scales. In A. T. Beck, H. L. Resnik, & D. J. Lettieri (Eds.), The prediction of suicide, . : Charles Press.Google Scholar
Biernacki, C., Celeux, G., & Govaert, G., (1999). An improvement of the NEC criterion for assessing the number of clusters in mixture model. Pattern Recognition Letters, 20 267272 CrossRefGoogle Scholar
Biernacki, C., Celeux, G., & Govaert, G. (2000). Assessing a mixture model for clustering with the integrated completed likelihood. IEEE Transactions on pattern analysis and machine intelligence 22, No. 7.CrossRefGoogle Scholar
Biernacki, C., & Govaert, G., (1997). Using the classification likelihood to choose the number of clusters. Computing Science and Statistics, 29 451457Google Scholar
Bozdogan, H., (1987). Model selection and Akaike’s information criterion (AIC): The general theory and its analytical extensions. Psycometrika, 52 345370 CrossRefGoogle Scholar
Bozdogan, H., (1994). Mixture-model cluster analysis using model selection criteria and a new informational measure of complexity. Proceedings of the First US/Japan Conference on the Frontiers of Statistical Modeling: An Informational Approach, 1 69113Google Scholar
Burnham, K. P., & Anderson, D. R. (2002). Model selection and multimodel inference, . : Springer.Google Scholar
Cooper, J., Kapur, N., Webb, R., Lawlor, M., Guthrie, E., Mackway-Jones, K., & Appleby, L., (2005). Suicide after deliberate self-harm: a 4-year cohort study. American Journal of Psychiatry, 162 2297303 CrossRefGoogle ScholarPubMed
Dempster, A. P., Laird, N. M., & Rubin, D. B., (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39 138CrossRefGoogle Scholar
Desantis, S. M., Andrés Houseman, E., Coull, B. A., Nutt, C. L., & Betensky, R. A., (2012). Supervised bayesian latent class models for high-dimensional data. Statistics in medicine, 31 13421360 CrossRefGoogle ScholarPubMed
Desantis, S. M., Houseman, E. A., Coull, B. A., Stemmet-Rachamimiv, A. S., & Betensky, R. A., (2008). A penalized latent class model for ordinal data. Biostatistics, 9 249262 CrossRefGoogle ScholarPubMed
Eluru, N., Bagheri, M., Miranda-Moreno, L. F., & Fu, L., (2012). A latent class modeling approach for identifying vehicle driver injury severity factors at highway-railway crossings. Accident Analysis & Prevention, 47 119127 CrossRefGoogle ScholarPubMed
Everitt, B. S., Landau, S., Leese, M., & Stahl, D. (2011). Cluster analysis. : Wiley.CrossRefGoogle Scholar
Fernández, D., Arnold, R., & Pledger, S., (2014) Mixture-based clustering for the ordered stereotype model. Computational Statistics and Data Analysis.Google Scholar
Fonseca, JRS, & Cardoso, M., (2007). Mixture-model cluster analysis using information theoretical criteria. Intelligent Data Analysis, 11 155173CrossRefGoogle Scholar
Furlanetto, L. M., Mendlowicz, M. V., & Romildo Bueno, J., (2005). The validity of the Beck Depression Inventory-Short Form as a screening and diagnostic instrument for moderate and severe depression in medical inpatients. Journal of Affective Disorders, 86 8791 CrossRefGoogle ScholarPubMed
Govaert, G., & Nadif, M., (2003). Clustering with block mixture models. Pattern Recognition, 36 463473 CrossRefGoogle Scholar
Govaert, G., & Nadif, M. (2005). An EM algorithm for the block mixture model. Speech and Signal Processing on Pattern Analysis and Machine Intelligence: IEEE Transactions on Acoustics. 27.Google Scholar
Govaert, G., & Nadif, M., (2010). Latent block model for contingency table. Communications in Statistics - Theory and Methods, 39 416425 CrossRefGoogle Scholar
Green, P. J., (1995). Reversible jump Markov chain Monte Carlo computation and Bayesian model determination. Biometrika, 82 711732 CrossRefGoogle Scholar
Hartigan, J. A., & Wong, M. A., (1979). A k-means clustering algorithm. Applied Statistics, 28 100108Google Scholar
Hawton, K., Bergen, H., Kapur, N., Cooper, J., Steeg, S., Ness, J., & Waters, K., (2012). Repetition of self-harm and suicide following self-harm in children and adolescents: findings from the Multicentre Study of Self-harm in England. Journal of Child Psychology and Psychiatry, 53 1212121219 CrossRefGoogle ScholarPubMed
Hawton, K., Casanas, I., Comabella, C., Haw, C., & Saunders, K., (2013). Risk factors for suicide in individuals with depression: A systematic review. Journal of Affective Disorders, 147 1–31728 CrossRefGoogle ScholarPubMed
Hurvich, C. M., & Tsai, C. L., (1989). Regression and time series model selection in small samples. Biometrika, 76 297307 CrossRefGoogle Scholar
Keribin, C., Brault, V., Celeux, G., & Govaert, G., (2014). Estimation and selection for the latent block model on categorical data. Statistics and Computing, 116.Google Scholar
Lanfranchi, M., Giannetto, C., & Zirilli, A., (2014). Analysis of demand determinants of high quality food products through the application of the cumulative proportional odds model. Applied Mathematical Sciences, 8 32973305CrossRefGoogle Scholar
Liu, I., & Agresti, A., (2005). The analysis of ordered categorical data: an overview and a survey of recent developments. Test, 14 173 CrossRefGoogle Scholar
McCullagh, P., (1980). Regression models for ordinal data. Journal of the Royal Statistical Society. Series B., 42 109142Google Scholar
McQuarrie, A., Shumway, R., & Tsai, C. L., (1997). The model selection criterion AICu. Statistics and Probability Letters, 34 285292 CrossRefGoogle Scholar
Molitor, J., Papathomas, M., Jerrett, M., & Richardson, S., (2010). Bayesian profile regression with an application to the national survey of children’s health. Biostatistics, 11 484498CrossRefGoogle Scholar
Owens, D., Horrocks, J., & House, A., (2002). Fatal and non-fatal repetition of self-harm. Systematic review. Br J Psychiatry, 181 193199 CrossRefGoogle ScholarPubMed
Pechey, R., Monsivais, P., Ng, Y. L., & Marteau, T. M., (2015). Why don’t poor men eat fruit? Socioeconomic differences in motivations for fruit consumption. Appetite, 84 271279 CrossRefGoogle ScholarPubMed
Pledger, S., & Arnold, R., (2014). Multivariate methods using mixtures: Correspondence analysis, scaling and pattern-detection. Computational Statistics & Data Analysis, 71 241261 CrossRefGoogle Scholar
R Core Team, 2014. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing. Vienna, Austria. http://www.R-project.org/.Google Scholar
Rand, W. M., (1971). Objective criteria for the evaluation of clustering methods. Journal of the American Statistical association, 66 846850 CrossRefGoogle Scholar
Rocci, R., & Vichi, M., (2008). Two-mode multi-partitioning. Computational Statistics and Data Analysis, 52 19842003 CrossRefGoogle Scholar
Scharoun-Lee, M., Gordon-Larsen, P., Adair, L. S., Popkin, B. M., Kaufman, J. S., & Suchindran, C. M., (2011). Intergenerational profiles of socioeconomic (dis) advantage and obesity during the transition to adulthood. Demography, 48 625651 CrossRefGoogle ScholarPubMed
Schwarz, G., (1978). Estimating the dimension of a model. Annals of Statistics, 6 461464 CrossRefGoogle Scholar
Skolnick, B. E., Maas, A. I., Narayan, R. K., van der Hoop, R. G., MacAllister, T., Ward, J. D., Nelson, N. R., & Stocchetti, N., (2014). A clinical trial of progesterone for severe traumatic brain injury. New England Journal of Medicine, 371 24672476 CrossRefGoogle ScholarPubMed
Tefera, M., & Sharma, M., (2015). Determinants of immunization among children aged 12–23 months in ethiopia: A proportional odds model approach. International Journal of Statistics in Medical Research, 4 140155 CrossRefGoogle Scholar
van Dijk, B., van Rosmalen, J., & Paap, R. (2009). A Bayesian approach to two-mode clustering. Econometric Institute Research Papers: Technical Report.Google Scholar
Vichi, M., (2001). Double k-means clustering for simultaneous classification of objects and variables, in: Borra, S., Rocci, R., Vichi, M., Schader, M. (Eds.), Advances in Classification and Data Analysis. Springer Berlin Heidelberg. Studies in Classification, Data Analysis, and Knowledge Organization, pp. 4352.Google Scholar
Wiech, K., Farias, M., Kahane, G., Shackel, N., Tiede, W., & Tracey, I., (2008). An fMRI study measuring analgesia enhanced by religion as a belief system. PAIN, 139 2467476 CrossRefGoogle ScholarPubMed
Wyse, J., & Friel, N., (2012). Block clustering with collapsed latent block models. Statistics and Computing, 22 415428 CrossRefGoogle Scholar
Figure 0

Table 1. Model set with corresponding number of parameters ν.\documentclass[12pt]{minimal}\usepackage{amsmath}\usepackage{wasysym}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{amsbsy}\usepackage{mathrsfs}\usepackage{upgreek}\setlength{\oddsidemargin}{-69pt}\begin{document}$$\nu .$$\end{document}

Figure 1

Table 2. Information criteria summary table.

Figure 2

Figure 1. Simulation study to assess the performance of model selection criteria in recovering the true number of clusters for our proposed biclustering finite mixture PO (POFM) model. Bars depict the percentage of cases when the true model is correctly identified by each criterion, averaged across the five scenarios.

Figure 3

Table 3. The average estimate obtained for each parameter over 100 simulations.

Figure 4

Table 4. The average Rand index for 100 simulated data sets based on our proposed (POFM) and double k-means (dkm) methods.

Figure 5

Table 5. The average Rand index based on our proposed (POFM) and double k-means (dkm) methods for 1000 simulated data sets.

Figure 6

Figure 2. Estimated probabilities of replying 3 or above to each of the 2 column clusters for all 3 row clusters, as derived by the biclustering model with R=3\documentclass[12pt]{minimal}\usepackage{amsmath}\usepackage{wasysym}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{amsbsy}\usepackage{mathrsfs}\usepackage{upgreek}\setlength{\oddsidemargin}{-69pt}\begin{document}$$R=3$$\end{document}, C=2\documentclass[12pt]{minimal}\usepackage{amsmath}\usepackage{wasysym}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{amsbsy}\usepackage{mathrsfs}\usepackage{upgreek}\setlength{\oddsidemargin}{-69pt}\begin{document}$$C=2$$\end{document}.

Figure 7

Table 6. Percent of individuals from the five POFM clusters, represented in the rows, that are clustered in the corresponding five double k-means (Vichi, 2001) clusters.

Supplementary material: File

Matechou et al. supplementary material

Matechou et al. supplementary material
Download Matechou et al. supplementary material(File)
File 108.6 KB