
Classification of Wines Using Principal Component Analysis

Published online by Cambridge University Press: 22 March 2021

Jackson Barth
Affiliation:
Department of Statistical Science, Southern Methodist University, 3225 Daniel Ave, Dallas, Texas, 75275; e-mail: jbarth@smu.edu.
Duwani Katumullage
Affiliation:
Department of Statistical Science, Southern Methodist University, 3225 Daniel Ave, Dallas, Texas, 75275; e-mail: dkatumullage@smu.edu.
Chenyu Yang
Affiliation:
Department of Statistical Science, Southern Methodist University, 3225 Daniel Ave, Dallas, Texas, 75275; e-mail: chenyuy@smu.edu.
Jing Cao*
Affiliation:
Department of Statistical Science, Southern Methodist University, 3225 Daniel Ave, Dallas, Texas, 75275; e-mail: jcao@smu.edu (corresponding author).

Abstract

Classification of wines with a large number of correlated covariates may lead to classification results that are difficult to interpret. In this study, we use a publicly available dataset on wines from three known cultivars, where there are 13 highly correlated variables measuring chemical compounds of wines. The goal is to produce an efficient classifier with a straightforward interpretation that sheds light on the important features of wines in the classification. To achieve this goal, we incorporate principal component analysis (PCA) in the k-nearest neighbor (kNN) classification to deal with the serious multicollinearity among the explanatory variables. PCA can identify the underlying dominant features and provide a more succinct and straightforward summary of the correlated covariates. The study shows that kNN combined with PCA yields a much simpler and more interpretable classifier whose performance is comparable to that of kNN based on all 13 variables. The appropriate number of principal components is chosen to strike a balance between predictive accuracy and simplicity of interpretation. Our final classifier is based on only two principal components, which can be interpreted as the strength of taste and the level of alcohol and fermentation in wines, respectively. (JEL Classifications: C10, C14, D83)
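
The comparison described in the abstract can be sketched in Python with scikit-learn. This is an illustrative sketch only, not the authors' code: it assumes that scikit-learn's bundled `load_wine` data (a copy of the UCI Wine data, 178 wines, 13 variables, three cultivars) matches the dataset used in the study, and the standardization step, the choice of k = 5 neighbors, and the 5-fold cross-validation are assumptions that the abstract does not specify. The helper name `knn_pipeline` is hypothetical.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# scikit-learn's copy of the UCI Wine data: 13 chemical variables, 3 cultivars.
X, y = load_wine(return_X_y=True)


def knn_pipeline(n_components=None, k=5):
    """Hypothetical helper: standardize, optionally project onto the
    leading principal components, then classify with kNN.
    k=5 is an assumed value, not taken from the study."""
    steps = [StandardScaler()]
    if n_components is not None:
        steps.append(PCA(n_components=n_components))
    steps.append(KNeighborsClassifier(n_neighbors=k))
    return make_pipeline(*steps)


# Mean 5-fold cross-validated accuracy (assumed validation scheme).
acc_full = cross_val_score(knn_pipeline(), X, y, cv=5).mean()
acc_pca2 = cross_val_score(knn_pipeline(n_components=2), X, y, cv=5).mean()

# Fit the two-component projection on all wines to inspect it.
projector = make_pipeline(StandardScaler(), PCA(n_components=2)).fit(X)
X_reduced = projector.transform(X)
explained = projector.named_steps["pca"].explained_variance_ratio_

print(f"kNN on all 13 variables: {acc_full:.3f}")
print(f"kNN on 2 principal components: {acc_pca2:.3f}")
print(f"Variance explained by PC1 and PC2: {explained}")
```

Placing the scaler and PCA inside the pipeline means both are refit on each training fold, so the cross-validated accuracy is not inflated by information from the held-out wines. The loadings in `projector.named_steps["pca"].components_` are what one would inspect to interpret each component in terms of the original chemical variables.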

Type
Articles
Copyright
Copyright © American Association of Wine Economists, 2021

Footnotes

The authors gratefully acknowledge helpful comments and advice from the editor, Karl Storchmann, and an anonymous reviewer.
