Considerations of sample and feature size
- 1 September 1972
- journal article
- Published by Institute of Electrical and Electronics Engineers (IEEE) in IEEE Transactions on Information Theory
- Vol. 18 (5), 618-626
- https://doi.org/10.1109/tit.1972.1054863
Abstract
In many practical pattern-classification problems the underlying probability distributions are not completely known. Consequently, the classification logic must be determined on the basis of vector samples gathered for each class. Although it is common knowledge that the error rate on the design set is a biased estimate of the true error rate of the classifier, the amount of bias as a function of sample size per class and feature size has been an open question. In this paper, the design-set error rate for a two-class problem with multivariate normal distributions is derived as a function of the sample size per class (N) and dimensionality (L). The design-set error rate is compared to both the corresponding Bayes error rate and the test-set error rate. It is demonstrated that the design-set error rate is an extremely biased estimate of either the Bayes or test-set error rate if the ratio of samples per class to dimensions (N/L) is less than three. Also, the variance of the design-set error rate is approximated by a function that is bounded by 1/(8N).
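The bias described in the abstract can be illustrated with a small Monte Carlo experiment. The sketch below is not the paper's analytical derivation. It assumes a plug-in linear discriminant with a pooled sample covariance, equal priors, an identity true covariance, and a Mahalanobis distance of 2 between the class means; none of these choices are stated in the abstract, and the function names `bayes_error` and `error_rates` are hypothetical. It compares the average design-set error with the error on fresh test samples for several values of N/L.

```python
import math
import numpy as np


def bayes_error(delta):
    """Bayes error for two equal-prior Gaussians sharing a covariance,
    with Mahalanobis distance `delta` between the means: Phi(-delta / 2)."""
    return 0.5 * math.erfc(delta / (2.0 * math.sqrt(2.0)))


def error_rates(n_per_class, n_features, delta=2.0,
                n_test=2000, n_trials=200, seed=0):
    """Return (mean design-set error, mean test-set error) over n_trials.

    Assumed setup (not taken from the paper): identity covariance, class
    means separated by `delta` along the first axis, and a plug-in linear
    discriminant estimated from n_per_class samples per class.
    """
    rng = np.random.default_rng(seed)
    shift = np.zeros(n_features)
    shift[0] = delta
    design, test = [], []
    for _ in range(n_trials):
        # Design (training) samples for the two classes.
        x0 = rng.standard_normal((n_per_class, n_features))
        x1 = rng.standard_normal((n_per_class, n_features)) + shift
        m0, m1 = x0.mean(axis=0), x1.mean(axis=0)

        # Pooled sample covariance; pinv keeps the sketch usable even
        # when the estimate is singular (2N - 2 < L).
        centered = np.vstack([x0 - m0, x1 - m1])
        cov = centered.T @ centered / (2 * n_per_class - 2)
        w = np.linalg.pinv(cov) @ (m1 - m0)
        b = -0.5 * w @ (m0 + m1)

        def err(a0, a1):
            # Class 0 is misclassified when the score is positive,
            # class 1 when it is non-positive.
            miss0 = np.mean(a0 @ w + b > 0)
            miss1 = np.mean(a1 @ w + b <= 0)
            return 0.5 * (miss0 + miss1)

        design.append(err(x0, x1))

        # Independent test samples from the same distributions.
        t0 = rng.standard_normal((n_test, n_features))
        t1 = rng.standard_normal((n_test, n_features)) + shift
        test.append(err(t0, t1))
    return float(np.mean(design)), float(np.mean(test))


if __name__ == "__main__":
    L = 10
    print(f"Bayes error: {bayes_error(2.0):.3f}")
    for ratio in (1, 3, 10):
        d, t = error_rates(ratio * L, L)
        print(f"N/L={ratio:2d}  design={d:.3f}  test={t:.3f}")
```

Under these assumptions the design-set error should fall below the Bayes error and the test-set error above it, with the gap narrowing as N/L grows. The exact numbers depend on the assumed class separation and on the random seed.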