INSTITUTE OF INFORMATION TECHNOLOGIES - BAS

Cybernetics and Information Technologies
Volume 2, No 2. Sofia, 2002, Bulgarian Academy of Sciences


On Supervised and Unsupervised Discretization

Gennady Agre*, Stanimir Peev**

* Institute of Information Technologies, 1113 Sofia, email: agre@iinf.bas.bg
** Faculty of Mathematics and Informatics, Sofia University, 1000 Sofia

Abstract:
The paper discusses supervised and unsupervised discretization of continuous attributes, an important pre-processing step for many machine learning (ML) and data mining (DM) algorithms. Two ML algorithms that rely essentially on attribute discretization, the Simple Bayesian Classifier (SBC) and the Symbolic Nearest Mean Classifier (SNMC), have been selected for an empirical comparison of supervised entropy-based discretization with the unsupervised equal-width and equal-frequency binning methods. The results of this evaluation on 13 benchmark datasets do not confirm the widespread opinion (at least for SBC) that the entropy-based MDL heuristic outperforms the unsupervised methods. Based on an analysis of these results, a modification of the entropy-based method and a new supervised discretization method are proposed. The empirical evaluation shows that both methods significantly improve the classification accuracy of both classifiers.

Keywords: supervised and unsupervised discretization, machine learning, data mining.