
The Dutch Draw: constructing a universal baseline for binary classification problems

Published online by Cambridge University Press:  19 September 2024

Etienne van de Bijl*
Affiliation:
Centrum Wiskunde & Informatica
Jan Klein*
Affiliation:
Centrum Wiskunde & Informatica
Joris Pries*
Affiliation:
Centrum Wiskunde & Informatica
Sandjai Bhulai*
Affiliation:
Vrije Universiteit Amsterdam
Mark Hoogendoorn*
Affiliation:
Vrije Universiteit Amsterdam
Rob van der Mei*
Affiliation:
Vrije Universiteit Amsterdam and Centrum Wiskunde & Informatica
*Postal address: Centrum Wiskunde & Informatica, Department of Stochastics, Science Park 123, 1098 XG, Amsterdam, Netherlands.
*Postal address: Department of Mathematics, Vrije Universiteit Amsterdam, De Boelelaan 1111, 1081 HV, Amsterdam, Netherlands.
*Postal address: Department of Computer Science, Vrije Universiteit Amsterdam, De Boelelaan 1111, 1081 HV, Amsterdam, Netherlands. Email: m.hoogendoorn@vu.nl

Abstract

Novel prediction methods should always be compared to a baseline to determine their performance. Without this frame of reference, the performance score of a model is essentially meaningless. What does it mean when a model achieves an $F_1$ of 0.8 on a test set? A proper baseline is, therefore, required to evaluate the ‘goodness’ of a performance score. Comparing results with the latest state-of-the-art model is usually insightful. However, being state-of-the-art is dynamic, as newer models are continuously developed. Contrary to an advanced model, it is also possible to use a simple dummy classifier. However, the latter could be beaten too easily, making the comparison less valuable. Furthermore, most existing baselines are stochastic and need to be computed repeatedly to obtain a reliable expected performance, which can be computationally expensive. We present a universal baseline method for all binary classification models, named the Dutch Draw (DD). This approach weighs simple classifiers and determines the best classifier to use as a baseline. Theoretically, we derive the DD baseline for many commonly used evaluation measures and show that in most situations it reduces to (almost) always predicting either zero or one. In summary, the DD baseline is general, as it is applicable to any binary classification problem; simple, as it can be quickly determined without training or parameter tuning; and informative, as insightful conclusions can be drawn from the results. The DD baseline serves two purposes. First, it is a robust and universal baseline that enables comparisons across research papers. Second, it provides a sanity check during the prediction model’s development process. When a model does not outperform the DD baseline, it is a major warning sign.
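The idea in the abstract can be illustrated with a minimal sketch (not the authors' implementation): since the abstract states that the DD baseline usually reduces to (almost) always predicting zero or always predicting one, we can compare just those two degenerate dummy classifiers for the $F_1$ measure. The function name `dd_baseline_f1` is hypothetical; with $P$ positives out of $M$ samples, predicting all ones gives $F_1 = 2P/(P+M)$, while predicting all zeros gives $F_1 = 0$.

```python
import numpy as np

def dd_baseline_f1(y_true):
    """Best expected F1 among the all-zeros and all-ones dummy classifiers.

    Illustrative sketch only: the full Dutch Draw considers classifiers
    that predict a fixed number of ones uniformly at random; here we
    compare only the two extreme cases the paper says usually win.
    """
    y_true = np.asarray(y_true)
    m = len(y_true)            # total number of samples
    p = int(y_true.sum())      # number of positive labels
    # All-ones classifier: TP = p, FP = m - p, FN = 0  ->  F1 = 2p / (p + m)
    f1_all_ones = 2 * p / (p + m) if m > 0 else 0.0
    # All-zeros classifier: no positive predictions  ->  F1 = 0
    f1_all_zeros = 0.0
    return max(f1_all_ones, f1_all_zeros)

y = [1, 0, 0, 1, 0, 0, 0, 0, 1, 0]   # 3 positives out of 10
print(dd_baseline_f1(y))             # 2*3 / (3 + 10) ≈ 0.4615
```

A trained model whose $F_1$ on this data does not exceed this value would fail the sanity check the abstract describes.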

Type
Original Article
Copyright
© The Author(s), 2024. Published by Cambridge University Press on behalf of Applied Probability Trust


Supplementary material
- van de Bijl et al. supplementary material 1 (File, 502.4 KB)
- van de Bijl et al. supplementary material 2 (File, 345.6 KB)
- van de Bijl et al. supplementary material 3 (File, 18.1 KB)