Boosting as entropy projection

Abstract
We consider the AdaBoost procedure for boosting weak learners. In AdaBoost, a key step is choosing a new distribution on the training examples based on the old distribution and the mistakes made by the current weak hypothesis. We show how AdaBoost's choice of the new distribution can be seen as an approximate solution to the following problem: Find a new distribution that is closest to the old distribution subject to the constraint that the new distribution is orthogonal to the vector of mistakes of the current weak hypothesis. The distance (or divergence) between distributions is measured by the relative entropy. Alternatively, we could say that AdaBoost approximately projects the distribution vector onto a hyperplane defined by the mistake vector. We show that this new view of AdaBoost as an entropy projection is dual to the usual view of AdaBoost as minimizing the normalization factors of the updated distributions.
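To make the projection concrete, the following is a hedged sketch of the constrained problem the abstract describes, in notation the abstract itself does not fix: $d_t$ denotes the current distribution over the $m$ training examples, $u_{t,i} = y_i h_t(x_i)$ the entry of the mistake vector of the weak hypothesis $h_t$ on example $i$, and $\alpha_t$, $Z_t$ AdaBoost's usual coefficient and normalization factor.
\[
d_{t+1} \;=\; \arg\min_{d} \; \Delta(d, d_t)
\;=\; \arg\min_{d} \; \sum_{i=1}^{m} d_i \ln \frac{d_i}{d_{t,i}}
\quad \text{s.t.} \quad
\sum_{i=1}^{m} d_i\, u_{t,i} = 0, \qquad
\sum_{i=1}^{m} d_i = 1, \qquad d_i \ge 0 .
\]
Minimizing the relative entropy subject to a single linear constraint yields an exponentially tilted solution, $d_{t+1,i} \propto d_{t,i}\, e^{-\alpha\, u_{t,i}}$, with the multiplier $\alpha$ chosen so that the orthogonality constraint holds. AdaBoost's familiar update,
\[
d_{t+1,i} \;=\; \frac{d_{t,i}\, e^{-\alpha_t u_{t,i}}}{Z_t},
\qquad
Z_t \;=\; \sum_{j=1}^{m} d_{t,j}\, e^{-\alpha_t u_{t,j}},
\]
has the same exponential form but uses its own choice of $\alpha_t$, which is why the abstract speaks of an approximate solution to the projection problem.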