This study bridges research on social inclusion with welfare regime theory. By linking social inclusion with welfare regimes, we establish a novel analytical framework for assessing global trends and national divergences in social inclusion based on a multidimensional view of the concept. While scholars have developed typologies for social inclusion and welfare regimes independently of each other, limited insights exist on how social inclusion relates to welfare regime typologies. We develop a social inclusion index for 225 countries using principal component analysis with 10 measures of social inclusion from the United Nations’ Sustainable Development Goals Indicators Database. We then employ clustering algorithms to inductively group countries based on the index. We find six “worlds” of social inclusion based on the index and other social factors – the Low, Mid, and High Social Inclusion regimes and the Low, Mid, and High Social Exclusion regimes.
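The index-construction step described above (principal component analysis over standardized indicators, followed by clustering of the resulting scores) can be sketched with a small, purely synthetic example. The data, the two-group structure, and the choice of k=2 below are invented for illustration; the study itself uses 10 SDG indicators for 225 countries and a richer clustering procedure.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for the SDG indicator matrix: 60 "countries"
# scored on 10 social-inclusion measures.
X = rng.normal(size=(60, 10))
X[:20] += 2.0  # one block of countries with systematically higher scores

# Standardize, then take first-principal-component scores as a composite index.
Z = (X - X.mean(axis=0)) / X.std(axis=0)
_, _, Vt = np.linalg.svd(Z, full_matrices=False)
index = Z @ Vt[0]  # one index value per country

# Tiny k-means (k=2) on the one-dimensional index to group countries.
centers = np.array([index.min(), index.max()])
for _ in range(50):
    labels = np.abs(index[:, None] - centers[None, :]).argmin(axis=1)
    centers = np.array([index[labels == k].mean() if np.any(labels == k)
                        else centers[k] for k in (0, 1)])
```

In this toy setting the first principal component captures the block shift, so the two derived groups recover the synthetic high- and low-scoring countries.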
The partitioning of squared Euclidean distance between two vectors in M-dimensional space into the sum of squared lengths of vectors in mutually orthogonal subspaces is discussed, and applications are given to specific cluster analysis problems. Examples are presented of how the partitioning idea can be used to help describe and interpret derived clusters, to derive similarity measures for use in cluster analysis, and to design Monte Carlo studies with carefully specified types and magnitudes of differences between the underlying population mean vectors. Most of the example applications presented in this paper involve the clustering of longitudinal data, but their use in cluster analysis need not be limited to this arena.
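The core identity can be checked numerically: for any split of an orthonormal basis of the M-dimensional space into mutually orthogonal subspaces, the squared Euclidean distance between two vectors equals the sum of the squared lengths of the difference vector's projections onto each subspace. A minimal numpy sketch (the dimension M = 6 and the 2-versus-4 split are arbitrary choices, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(1)
M = 6
x, y = rng.normal(size=M), rng.normal(size=M)

# Build an orthonormal basis of R^M and split it into two mutually
# orthogonal subspaces (first 2 directions vs the remaining 4).
Q, _ = np.linalg.qr(rng.normal(size=(M, M)))
B1, B2 = Q[:, :2], Q[:, 2:]

d = x - y
total = d @ d                    # squared Euclidean distance
part1 = np.sum((B1.T @ d) ** 2)  # squared projection length, subspace 1
part2 = np.sum((B2.T @ d) ** 2)  # squared projection length, subspace 2
```

The two subspace contributions always sum to the full squared distance, which is what makes the decomposition useful for attributing cluster differences to specific subspaces.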
One of the most vexing problems in cluster analysis is the selection and/or weighting of variables in order to include those that truly define cluster structure, while eliminating those that might mask such structure. This paper presents a variable-selection heuristic for nonhierarchical (K-means) cluster analysis based on the adjusted Rand index for measuring cluster recovery. The heuristic was subjected to Monte Carlo testing across more than 2200 datasets with known cluster structure. The results indicate the heuristic is extremely effective at eliminating masking variables. A cluster analysis of real-world financial services data revealed that using the variable-selection heuristic prior to the K-means algorithm resulted in greater cluster stability.
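The adjusted Rand index used above as the cluster-recovery criterion is a standard chance-corrected pair-counting measure and is easy to implement; below is a minimal version built from the contingency table of the two labelings. The variable-selection heuristic itself, which searches over candidate variable subsets, is not reproduced here.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(a, b):
    """Chance-corrected agreement between two labelings of the same objects."""
    n = len(a)
    pairs = Counter(zip(a, b))          # contingency-table cell counts
    rows, cols = Counter(a), Counter(b) # marginal counts
    sum_nij = sum(comb(c, 2) for c in pairs.values())
    sum_ai = sum(comb(c, 2) for c in rows.values())
    sum_bj = sum(comb(c, 2) for c in cols.values())
    expected = sum_ai * sum_bj / comb(n, 2)
    max_index = (sum_ai + sum_bj) / 2
    return (sum_nij - expected) / (max_index - expected)
```

Note that the index is invariant to relabeling: two partitions that group the same objects together score 1.0 even if the cluster labels are permuted.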
This manuscript introduces a new Bayesian finite mixture methodology for the joint clustering of row and column stimuli/objects associated with two-mode asymmetric proximity, dominance, or profile data. That is, common clusters are derived which partition both the row and column stimuli/objects simultaneously into the same derived set of clusters. In this manner, interrelationships between both sets of entities (rows and columns) are easily ascertained. We describe the technical details of the proposed two-mode clustering methodology including its Bayesian mixture formulation and a Bayes factor heuristic for model selection. We present a modest Monte Carlo analysis to investigate the performance of the proposed Bayesian two-mode clustering procedure with respect to synthetically created data whose structure and parameters are known. Next, a consumer psychology application is provided examining physician pharmaceutical prescription behavior for various brands of prescription drugs in the neuroscience health market. We conclude by discussing several fertile areas for future research.
Perhaps the most common criterion for partitioning a data set is the minimization of the within-cluster sums of squared deviation from cluster centroids. Although optimal solution procedures for within-cluster sums of squares (WCSS) partitioning are computationally feasible for small data sets, heuristic procedures are required for most practical applications in the behavioral sciences. We compared the performances of nine prominent heuristic procedures for WCSS partitioning across 324 simulated data sets representative of a broad spectrum of test conditions. Performance comparisons focused on both percentage deviation from the “best-found” WCSS values, as well as recovery of true cluster structure. A real-coded genetic algorithm and variable neighborhood search heuristic were the most effective methods; however, a straightforward two-stage heuristic algorithm, HK-means, also yielded exceptional performance. A follow-up experiment using 13 empirical data sets from the clustering literature generally supported the results of the experiment using simulated data. Our findings have important implications for behavioral science researchers, whose theoretical conclusions could be adversely affected by poor algorithmic performances.
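The WCSS criterion and the two-stage idea behind HK-means (a cheap global initialization followed by K-means refinement) can be sketched as follows. Farthest-point seeding stands in here for the first stage; that substitution, the synthetic three-cluster data, and all parameter values are assumptions of this sketch, not the paper's algorithms.

```python
import numpy as np

def wcss(X, labels, centers):
    # Within-cluster sum of squared deviations from the cluster centroids.
    return sum(((X[labels == k] - c) ** 2).sum() for k, c in enumerate(centers))

def kmeans(X, centers, iters=100):
    # Lloyd's algorithm: assign to nearest centroid, then recompute centroids.
    for _ in range(iters):
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        new = np.array([X[labels == k].mean(axis=0) if np.any(labels == k)
                        else centers[k] for k in range(len(centers))])
        if np.allclose(new, centers):
            break
        centers = new
    return labels, centers

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(m, 0.3, size=(30, 2)) for m in (0.0, 3.0, 6.0)])

# Stage 1 (stand-in): farthest-point seeding for a good initial partition.
seeds = [X[0]]
for _ in range(2):
    d = np.min([((X - s) ** 2).sum(axis=1) for s in seeds], axis=0)
    seeds.append(X[d.argmax()])

# Stage 2: K-means refinement of the seeded solution.
labels, centers = kmeans(X, np.array(seeds))
best = wcss(X, labels, centers)
```

On well-separated data like this, the seeded refinement recovers the three planted clusters; the point of the paper's comparison is that on harder data the choice of heuristic matters considerably.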
We propose a nonparametric item response theory model for dichotomously scored items in a Bayesian framework. The model is based on a latent class (LC) formulation, and it is multidimensional, with dimensions corresponding to a partition of the items into homogeneous groups that are specified on the basis of inequality constraints among the conditional success probabilities given the latent class. Moreover, an innovative system of prior distributions is proposed following the encompassing approach, in which the largest model is the unconstrained LC model. A reversible-jump type algorithm is described for sampling from the joint posterior distribution of the model parameters of the encompassing model. By suitably post-processing its output, we then make inference on the number of dimensions (i.e., number of groups of items measuring the same latent trait) and we cluster items according to the dimensions when unidimensionality is violated. The approach is illustrated by two examples on simulated data and two applications based on educational and quality-of-life data.
In many regression applications, users are often faced with difficulties due to nonlinear relationships, heterogeneous subjects, or time series which are best represented by splines. In such applications, two or more regression functions are often necessary to best summarize the underlying structure of the data. Unfortunately, in most cases, it is not known a priori which subset of observations should be approximated with which specific regression function. This paper presents a methodology which simultaneously clusters observations into a preset number of groups and estimates the corresponding regression functions' coefficients, all to optimize a common objective function. We describe the problem and discuss related procedures. A new simulated annealing-based methodology is described as well as program options to accommodate overlapping or nonoverlapping clustering, replications per subject, univariate or multivariate dependent variables, and constraints imposed on cluster membership. Extensive Monte Carlo analyses are reported which investigate the overall performance of the methodology. A consumer psychology application is provided concerning a conjoint analysis investigation of consumer satisfaction determinants. Finally, other applications and extensions of the methodology are discussed.
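The alternating structure at the heart of clusterwise regression (assign each observation to the regression function that fits it best, then re-estimate each function on its assigned observations) can be sketched with ordinary least squares. Multistart random initialization stands in here for the paper's simulated annealing search, and the two-line data-generating process is invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 80
x = rng.uniform(-1, 1, n)
g = rng.integers(0, 2, n)                      # latent group per observation
y = np.where(g == 0, 2 + 3 * x, 2 - 3 * x) + rng.normal(0, 0.1, n)
X = np.column_stack([np.ones(n), x])           # design matrix with intercept

def fit_clusterwise(X, y, rng, iters=30):
    # Alternate: assign each point to its best-fitting line, then refit both.
    betas = rng.normal(size=(2, 2))
    for _ in range(iters):
        labels = np.abs(y[:, None] - X @ betas.T).argmin(axis=1)
        for k in (0, 1):
            if (labels == k).sum() >= 2:
                betas[k] = np.linalg.lstsq(X[labels == k], y[labels == k],
                                           rcond=None)[0]
    resid = np.abs(y[:, None] - X @ betas.T)
    return resid.argmin(axis=1), betas, resid.min(axis=1).sum()

# Multistart from random coefficients, keeping the lowest total residual,
# as a crude stand-in for the paper's simulated annealing global search.
labels, betas, loss = min((fit_clusterwise(X, y, rng) for _ in range(10)),
                          key=lambda t: t[2])
```

A single random start can collapse both functions onto the average line, which is precisely why the paper invests in a global optimization strategy; the multistart loop above is the simplest guard against that failure mode.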
A procedure is described for orienting the nodes of a binary tree to maximize the Kendall rank order correlation (tau) between node order and a given external criterion. The procedure is computationally efficient and is based on application of an ordered set of node tests.
Pairwise partitioning is a nonmetric, divisive algorithm, for identifying feature structures based on pairwise similarities. For errorless data, this algorithm is shown to identify only (and sometimes all) valid features for certain hierarchical and multidimensional feature structures. Unfortunately, the algorithm is also extremely sensitive to error in the data. Fortunately, several modifications of the algorithm are shown to compensate significantly for this deficiency. The algorithm is illustrated with simulations and analyses of three sets of similarity data. The results suggest that this algorithm will be most useful (a) as an exploratory tool for generating a relatively large set of potential features that can be reduced using other criteria (either statistical or substantive) and (b) as a source of confirmatory or disconfirmatory evidence.
A review of model-selection criteria is presented, with a view toward showing their similarities. It is suggested that some problems treated by sequences of hypothesis tests may be more expeditiously treated by the application of model-selection criteria. Consideration is given to application of model-selection criteria to some problems of multivariate analysis, especially the clustering of variables, factor analysis and, more generally, describing a complex of variables.
One of the main problems in cluster analysis is that of determining the number of groups in the data. In general, the approach taken depends on the cluster method used. For K-means, some of the most widely employed criteria are formulated in terms of the decomposition of the total point scatter, regarding a two-mode data set of N points in p dimensions, which are optimally arranged into K classes. This paper addresses the formulation of criteria to determine the number of clusters in the general situation in which the available information for clustering is a one-mode N × N dissimilarity matrix describing the objects. In this framework, p and the coordinates of the points are usually unknown, and the application of criteria originally formulated for two-mode data sets is dependent on their possible reformulation in the one-mode situation. The decomposition of the variability of the clustered objects is proposed in terms of the corresponding block-shaped partition of the dissimilarity matrix. Within-block and between-block dispersion values for the partitioned dissimilarity matrix are derived, and variance-based criteria are subsequently formulated in order to determine the number of groups in the data. A Monte Carlo experiment was carried out to study the performance of the proposed criteria. For simulated clustered points in p dimensions, greater efficiency in recovering the number of clusters is obtained when the criteria are calculated from the related Euclidean distances instead of the known two-mode data set, in general, for unequal-sized clusters and for low-dimensionality situations. For simulated dissimilarity data sets, the proposed criteria always outperform the results obtained when these criteria are calculated from their original formulation, using dissimilarities instead of distances.
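The key reformulation, recovering within-cluster dispersion from dissimilarities alone, rests on the identity that a cluster's sum of squared deviations from its centroid equals the sum of its pairwise squared Euclidean distances divided by the cluster size. A quick numerical check (sizes and dimensions below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(12, 3))
labels = np.array([0] * 5 + [1] * 7)
D2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)  # squared distances

# Two-mode form: within-cluster sum of squared deviations from centroids.
wcss_points = sum(((X[labels == k] - X[labels == k].mean(axis=0)) ** 2).sum()
                  for k in (0, 1))

# One-mode form: the same quantity from the dissimilarity matrix alone;
# the full matrix counts each pair twice, hence the factor 2 * n_k.
wcss_dist = sum(D2[np.ix_(labels == k, labels == k)].sum()
                / (2 * (labels == k).sum()) for k in (0, 1))
```

Because the two quantities coincide exactly when the dissimilarities are Euclidean, any variance-based criterion for the number of clusters can be evaluated directly on the N × N matrix, with no coordinates required.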
A method is proposed that combines dimension reduction and cluster analysis for categorical data by simultaneously assigning individuals to clusters and optimal scaling values to categories in such a way that a single between variance maximization objective is achieved. In a unified framework, a brief review of alternative methods is provided and we show that the proposed method is equivalent to GROUPALS applied to categorical data. Performance of the methods is appraised by means of a simulation study. The results of the joint dimension reduction and clustering methods are compared with the so-called tandem approach, a sequential analysis of dimension reduction followed by cluster analysis. The tandem approach is conjectured to perform worse when variables are added that are unrelated to the cluster structure. Our simulation study confirms this conjecture. Moreover, the results of the simulation study indicate that the proposed method also consistently outperforms alternative joint dimension reduction and clustering methods.
Given that a minor condition holds (e.g., the number of variables is greater than the number of clusters), a nontrivial lower bound for the sum-of-squares error criterion in K-means clustering is derived. By calculating the lower bound for several different situations, a method is developed to determine the adequacy of cluster solution based on the observed sum-of-squares error as compared to the minimum sum-of-squares error.
A highly popular method for examining the stability of a data clustering is to split the data into two parts, cluster the observations in Part A, assign the objects in Part B to their nearest centroid in Part A, and then independently cluster the Part B objects. One then examines how close the two partitions are (say, by the Rand measure). Another proposal is to split the data into k parts, and see how their centroids cluster. By means of synthetic data analyses, we demonstrate that these approaches fail to identify the appropriate number of clusters, particularly as sample size becomes large and the variables exhibit higher correlations.
This paper develops a maximum likelihood based method for simultaneously performing multidimensional scaling and cluster analysis on two-way dominance or profile data. This MULTICLUS procedure utilizes mixtures of multivariate conditional normal distributions to estimate a joint space of stimulus coordinates and K vectors, one for each cluster or group, in a T-dimensional space. The conditional mixture, maximum likelihood method is introduced together with an E-M algorithm for parameter estimation. A Monte Carlo analysis is presented to investigate the performance of the algorithm as a number of data, parameter, and error factors are experimentally manipulated. Finally, a consumer psychology application is discussed involving consumer expertise/experience with microcomputers.
We propose a method to reduce many categorical variables to one variable with k categories, or stated otherwise, to classify n objects into k groups. Objects are measured on a set of nominal, ordinal or numerical variables or any mix of these, and they are represented as n points in p-dimensional Euclidean space. Starting from homogeneity analysis, also called multiple correspondence analysis, the essential feature of our approach is that these object points are restricted to lie at only one of k locations. It follows that these k locations must be equal to the centroids of all objects belonging to the same group, which corresponds to a sum of squared distances clustering criterion. The problem is not only to estimate the group allocation, but also to obtain an optimal transformation of the data matrix. An alternating least squares algorithm and an example are given.
Eight different variable selection techniques for model-based and non-model-based clustering are evaluated across a wide range of cluster structures. It is shown that several methods have difficulties when non-informative variables (i.e., random noise) are included in the model. Furthermore, the distribution of the random noise greatly impacts the performance of nearly all of the variable selection procedures. Overall, a variable selection technique based on a variance-to-range weighting procedure coupled with the largest decreases in within-cluster sums of squares error performed the best. On the other hand, variable selection methods used in conjunction with finite mixture models performed the worst.
In many classification problems, one often possesses external and/or internal information concerning the objects or units to be analyzed which makes it appropriate to impose constraints on the set of allowable classifications and their characteristics. CONCLUS, or CONstrained CLUStering, is a new methodology devised to perform constrained classification in either an overlapping or nonoverlapping (hierarchical or nonhierarchical) manner. This paper initially reviews the related classification literature. A discussion of the use of constraints in clustering problems is then presented. The CONCLUS model and algorithm are described in detail, as well as their flexibility for use in various applications. Monte Carlo results are presented for two synthetic data sets with appropriate discussion of the resulting implications. An illustration of CONCLUS is presented with respect to a sales territory design problem where the objects classified are various Forbes-500 companies. Finally, the discussion section highlights the main contribution of the paper and offers some areas for future research.
In the application of clustering methods to real world data sets, two problems frequently arise: (a) how can the various contributory variables in a specific battery be weighted so as to enhance some cluster structure that may be present, and (b) how can various alternative batteries be combined to produce a single clustering that “best” incorporates each contributory set. A new method is proposed (SYNCLUS, SYNthesized CLUStering) for dealing with these two problems.
Latent class models for cognitive diagnosis often begin with specification of a matrix that indicates which attributes or skills are needed for each item. Then by imposing restrictions that take this into account, along with a theory governing how subjects interact with items, parametric formulations of item response functions are derived and fitted. Cluster analysis provides an alternative approach that does not require specifying an item response model, but does require an item-by-attribute matrix. After summarizing the data with a particular vector of sum-scores, K-means cluster analysis or hierarchical agglomerative cluster analysis can be applied with the purpose of clustering subjects who possess the same skills. Asymptotic classification accuracy results are given, along with simulations comparing effects of test length and method of clustering. An application to a language examination is provided to illustrate how the methods can be implemented in practice.
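The sum-score-then-cluster pipeline can be sketched end to end: convert each examinee's 0/1 responses into per-attribute sum-scores via the item-by-attribute Q-matrix, then run K-means on the scores. The Q-matrix, response probabilities, and sample size below are invented for illustration, not taken from the application in the paper.

```python
import numpy as np

rng = np.random.default_rng(5)
Q = np.array([[1, 0], [1, 0], [1, 0],          # items requiring skill 1
              [0, 1], [0, 1], [0, 1]])         # items requiring skill 2

alpha = rng.integers(0, 2, size=(100, 2))      # latent skill profiles
p = np.where(alpha @ Q.T > 0, 0.9, 0.2)        # P(correct) per person, item
resp = (rng.random((100, 6)) < p).astype(int)  # simulated 0/1 responses

scores = resp @ Q                              # per-attribute sum-scores

# K-means on the sum-scores, one initial center per possible skill profile.
centers = np.array([[0, 0], [3, 0], [0, 3], [3, 3]], dtype=float)
for _ in range(20):
    d = ((scores[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    labels = d.argmin(axis=1)
    centers = np.array([scores[labels == k].mean(axis=0) if np.any(labels == k)
                        else centers[k] for k in range(4)])
```

Each resulting cluster collects examinees with similar attribute sum-scores, which under the simulated response model corresponds closely to examinees sharing the same latent skill profile.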