Structural Analysis of Subjective Categorical Data

Published online by Cambridge University Press: 01 January 2025

Karl Christoph Klauer*
Affiliation:
University Bonn
William H. Batchelder
Affiliation:
University of California, Irvine
*Requests for reprints should be sent to Karl Christoph Klauer, Psychologisches Institut, Universität Bonn, Römerstr. 168, 53118 Bonn, FR GERMANY.

Abstract

A general approach to the analysis of subjective categorical data is considered, in which agreement matrices of two or more raters are directly expressed in terms of error and agreement parameters. The method provides focused analyses of ratings from several raters for whom ratings have measurement error distributions that may induce bias in the evaluation of substantive questions of interest. Each rater's judgment process is modeled as a mixture of two components: an error variable that is unique for the rater in question as well as an agreement variable that operationalizes the “true” values of the units of observation. The statistical problems of identification, estimation, and testing of such measurement models are discussed.
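As a rough illustration of how an agreement matrix can be written in terms of agreement and error parameters, the sketch below computes the expected two-rater agreement matrix under one simple mixture. The parameterization is an assumption made for illustration and is not taken from the paper: each rater reports the unit's true category with probability `agree_prob` and otherwise draws a category from a rater-specific `error_dist`, and the two raters are treated as independent given the true category. All function and variable names are hypothetical.

```python
def rater_response_probs(true_cat, agree_prob, error_dist):
    """P(rating = j | true category) under the assumed two-component mixture."""
    return [
        agree_prob * (1.0 if j == true_cat else 0.0)
        + (1.0 - agree_prob) * error_dist[j]
        for j in range(len(error_dist))
    ]


def expected_agreement_matrix(true_dist, agree_probs, error_dists):
    """Expected joint distribution of two raters' categories.

    Assumption (illustrative only): raters respond independently
    given the unit's true category.
    """
    k = len(true_dist)
    matrix = [[0.0] * k for _ in range(k)]
    for t, p_true in enumerate(true_dist):
        r1 = rater_response_probs(t, agree_probs[0], error_dists[0])
        r2 = rater_response_probs(t, agree_probs[1], error_dists[1])
        for i in range(k):
            for j in range(k):
                matrix[i][j] += p_true * r1[i] * r2[j]
    return matrix


# Hypothetical parameter values, chosen only to exercise the code.
true_dist = [0.5, 0.3, 0.2]                    # distribution of true categories
agree_probs = [0.8, 0.7]                       # per-rater agreement probability
error_dists = [[0.6, 0.2, 0.2], [1 / 3] * 3]   # per-rater error distributions

matrix = expected_agreement_matrix(true_dist, agree_probs, error_dists)
for row in matrix:
    print([round(cell, 4) for cell in row])
```

In this sketch a skewed `error_dist`, such as the first rater's, shifts that rater's marginal distribution away from the distribution of true categories, which is one way systematic rater bias could show up in observed ratings.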

The general model is applied in several special cases. The simplest situation is that underlying Cohen's Kappa, where two raters place units into unordered categories. The model provides a generalization and systematization of the Kappa idea of correcting for agreement by chance. In applications with typical research designs, including a between-subjects design and a mixed within-subjects, between-subjects design, the model is shown to disentangle structural and measurement components of the observations, thereby controlling for possible confounding effects of systematic rater bias. Situations considered include the case of more than two raters as well as the case of ordered categories. The different analyses are illustrated by means of real data sets.
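For readers unfamiliar with the chance correction mentioned above, the following minimal sketch computes the standard Cohen's Kappa, kappa = (p_obs - p_exp) / (1 - p_exp), from a square table of counts for two raters. It shows only the classical coefficient, not the model developed in the paper; the function name and the example counts are hypothetical.

```python
def cohens_kappa(table):
    """Cohen's Kappa for a square agreement table of counts.

    table[i][j] = number of units rater 1 placed in category i
    and rater 2 placed in category j.
    """
    k = len(table)
    n = sum(sum(row) for row in table)
    if n == 0:
        raise ValueError("agreement table is empty")

    # Observed proportion of agreement: mass on the diagonal.
    p_obs = sum(table[i][i] for i in range(k)) / n

    # Marginal proportions for each rater.
    row_marg = [sum(table[i]) / n for i in range(k)]
    col_marg = [sum(table[i][j] for i in range(k)) / n for j in range(k)]

    # Agreement expected by chance if the raters were independent.
    p_exp = sum(row_marg[i] * col_marg[i] for i in range(k))
    if p_exp == 1.0:
        raise ValueError("kappa is undefined when chance agreement equals 1")

    return (p_obs - p_exp) / (1.0 - p_exp)


# Hypothetical counts for two raters and two categories.
example_table = [[20, 5], [10, 15]]
kappa = cohens_kappa(example_table)
print(round(kappa, 3))
```

In the example, observed agreement is 0.7 and chance agreement is 0.5, so Kappa rescales the 0.2 excess by the maximum possible excess of 0.5.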

Type
Original Paper
Copyright
Copyright © 1996 The Psychometric Society

Footnotes

The authors wish to thank Lawrence Hubert and Ivo Molenaar for helpful and detailed comments on a previous draft of this paper. Thanks are also due to Jens Möller and Bernd Strauß for the data from the 1992 Olympic Games. We thank the editor and three anonymous reviewers for valuable comments on an earlier draft.

References

Agresti, A. (1990). Categorical data analysis, New York: Wiley.
Bales, R. F. (1950). Interaction process analysis, Cambridge, MA: Addison-Wesley.
Batchelder, W. H., Romney, A. K. (1986). The statistical analysis of a general Condorcet model for dichotomous choice situations. In Grofman, B., Owen, G. (Eds.), Information pooling and group decision making (pp. 103–112). Greenwich, CT: JAI Press.
Batchelder, W. H., Romney, A. K. (1988). Test theory without an answer key. Psychometrika, 53, 71–92.
Batchelder, W. H., Romney, A. K. (1989). New results in test theory without an answer key. In Roskam, E. (Ed.), Advances in mathematical psychology, Vol. II (pp. 229–248). Heidelberg: Springer.
Brennan, R. L., Light, R. J. (1974). Measuring agreement when two observers classify people into categories not defined in advance. British Journal of Mathematical and Statistical Psychology, 2, 154–163.
Brewer, D. D., Romney, A. K., Batchelder, W. H. (1991). Consistency and consensus: A replication. Journal of Quantitative Anthropology, 3, 195–205.
Carey, G., Gottesman, I. I. (1978). Reliability and validity in binary ratings: Areas of common misunderstanding in diagnosis and symptom ratings. Archives of General Psychiatry, 35, 1454–1459.
Cohen, J. (1960). A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20, 37–46.
Cohen, J. (1968). Weighted Kappa: Nominal scale agreement with provision for scaled disagreement or partial credit. Psychological Bulletin, 70, 213–220.
Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.), Hillsdale, NJ: Lawrence Erlbaum.
Cooil, B., Rust, T. (1994). Reliability and expected loss: A unifying principle. Psychometrika, 59, 203–216.
Cooper, W. H. (1981). Ubiquitous halo. Psychological Bulletin, 90, 218–244.
Cressie, N., Holland, P. W. (1983). Characterizing the manifest probabilities of a latent trait model. Psychometrika, 48, 129–141.
Cronbach, L. J., Gleser, G. C., Nanda, H., Rajaratnam, N. (1972). The dependability of behavioral measurements: Theory of generalizability for scores and profiles, New York: Wiley.
Dillon, W. R., Madden, T. J., Kumar, A. (1983). Analyzing sequential categorical data on dyadic interaction: A latent structure approach. Psychological Bulletin, 94, 564–583.
Dillon, W. R., Mulani, N. (1984). A probabilistic latent class model for assessing inter-judge reliability. Multivariate Behavioral Research, 19, 438–458.
Efron, B. (1982). The jackknife, the bootstrap and other resampling plans, Philadelphia, PA: Society for Industrial and Applied Mathematics.
Ekman, P. (1988). Gesichtsausdruck und Gefühl [Facial expression and emotion], Paderborn: Jungfermann.
Erdfelder, E., Bredenkamp, J. (1993). Recognition of script-typical versus script-atypical information: Effects of cognitive elaboration. Manuscript submitted for publication.
Faul, F., Erdfelder, E. (1992). GPOWER: A-priori, post-hoc, and compromise power analyses for MSDOS [Computer program], Bonn, FRG: Bonn University, Department of Psychology.
Feger, H. (1983). Planung und Bewertung von wissenschaftlichen Beobachtungen [Design and evaluation of scientific observations]. In Feger, H., Bredenkamp, J. (Eds.), Datenerhebung (pp. 1–75). Göttingen: Hogrefe.
Gavanski, I., Wells, G. L. (1989). Counterfactual processing of normal and exceptional events. Journal of Experimental Social Psychology, 25, 314–325.
Grove, W. M., Andreason, N. C., McDonald-Scott, P., Keller, B., Shapiro, R. W. (1981). Reliability studies of psychiatric diagnosis. Archives of General Psychiatry, 38, 408–411.
Haberman, S. J. (1977). Log-linear models and frequency tables with small expected cell counts. The Annals of Statistics, 5, 815–841.
Hu, X., Batchelder, W. H. (1994). The statistical analysis of general processing tree models with the EM algorithm. Psychometrika, 59, 21–47.
Hubert, L. (1977). Kappa revisited. Psychological Bulletin, 84, 289–297.
Hubert, L., Arabie, P. (1985). Comparing partitions. Journal of Classification, 2, 193–218.
Janes, C. L. (1979). An extension of the random error coefficient of agreement to N × N tables. British Journal of Psychiatry, 134, 617–619.
Jöreskog, K. G. (1978). Structural analysis of covariance and correlation matrices. Psychometrika, 43, 443–477.
Kelderman, H. (1984). Loglinear Rasch model tests. Psychometrika, 49, 223–245.
Klauer, K. C., Migulla, G. (1995). Spontanes kontrafaktisches Denken [Spontaneous counterfactual processing]. Zeitschrift für Sozialpsychologie, 26, 34–45.
Klauer, K. C., Stern, E. (1992). How attitudes guide memory-based judgements: A two-process model. Journal of Experimental Social Psychology, 28, 186–206.
Koch, G. G., Landis, J. R., Freeman, J. L., Freeman, D. H. Jr., Lehnen, R. G. (1977). A general methodology for the analysis of experiments with repeated measurement of categorical data. Biometrics, 33, 133–158.
Koch, G. G., Reinfurt, D. W. (1975). The analysis of categorical data from mixed models. Biometrics, 27, 157–173.
Landis, R. J., Koch, G. G. (1977). The measurement of observer agreement for categorical data. Biometrics, 33, 159–174.
Langeheine, R., Rost, J. (1988). Latent trait and latent class models, New York: Plenum.
Lehmann, E. L. (1970). Testing statistical hypotheses, New York: Wiley.
Liou, M., Yu, L. (1991). Assessing statistical accuracy in ability estimation: A bootstrap approach. Psychometrika, 56, 55–67.
Lord, F. M., Novick, M. R. (1968). Statistical theories of mental test scores, Reading, MA: Lawrence Erlbaum.
Maher, K. M. (1987). A multiple choice model for aggregating group knowledge and estimating individual competence, Irvine: University of California.
Maxwell, A. E. (1977). Coefficients of agreement between observers and their interpretation. British Journal of Psychiatry, 130, 79–83.
Möller, J. (1993). Zur Ausdifferenzierung des Paradigmas “Spontane Attributionen”: Eine empirische Analyse zeitlich unmittelbarer Ursachenzuschreibungen [Towards a differentiation of the paradigm “spontaneous attributions”: An empirical analysis of immediate causal descriptions]. Zeitschrift für Sozialpsychologie, 24, 129–136.
Möller, J., Strauß, B. (1994). Agreement matrix for ratings of causal location and stability of events of the 1992 Olympic Games, Kiel, FRG: University of Kiel.
Nisbett, R. E., Wilson, T. (1977). The halo effect: Evidence for unconscious alteration of judgements. Journal of Personality and Social Psychology, 35, 250–256.
Perreault, W. D. Jr., Leigh, L. E. (1989). Reliability of nominal data based on qualitative judgements. Journal of Marketing Research, 26, 135–148.
Rao, C. R. (1973). Linear statistical inference and its applications, New York: Wiley.
Schutz, W. C. (1952). Reliability, ambiguity and content analysis. Psychological Review, 59, 119–129.
Shweder, R. A., D'Andrade, R. G. (1980). The systematic distortion hypothesis. In Shweder, R. A., Fiske, D. W. (Eds.), New directions for methodology of behavioral science: Fallible judgements in behavioral research, San Francisco: Jossey-Bass.
Simon, A., Boyer, E. G. (1974). Mirrors of behavior III. An anthology of observation instruments, Wyncote, PA: Communication Materials Center.
Spitznagel, E. L., Helzer, J. E. (1985). A proposed solution to the base rate problem in the kappa statistic. Archives of General Psychiatry, 42, 725–728.
Sprott, D. A., Vogel-Sprott, M. D. (1987). The use of the log-odds ratio to assess the reliability of dichotomous questionnaire data. Applied Psychological Measurement, 11, 307–316.
Uebersax, J. S. (1987). Diversity of decision-making models and the measurement of interrater agreement. Psychological Bulletin, 101, 140–146.
Uebersax, J. S. (1988). Validity inferences from interobserver agreement. Psychological Bulletin, 104, 405–416.
Weller, S. C. (1984). Consistency and consensus among informants: Disease concepts in a rural Mexican town. American Anthropologist, 88, 313–338.
Zwick, R. (1988). Another look at interrater agreement. Psychological Bulletin, 103, 374–378.