Time- and Space-Efficient Algorithms for Least Median of Squares Regression
- 1 September 1987
- journal article
- research article
- Published by JSTOR in Journal of the American Statistical Association
- Vol. 82 (399), 794
- https://doi.org/10.2307/2288788
Abstract
The least median of squared residuals regression line (or LMS line) is the line y = ax + b for which the median of the squared residuals |y_i - ax_i - b|^2 is minimized over all choices of a and b. If we rephrase the traditional ordinary least squares (OLS) problem as finding the a and b that minimize the mean of |y_i - ax_i - b|^2, one sees that in a formal sense LMS simply replaces a "mean" by a "median." This way of describing LMS regression does not do justice to its remarkable properties. In fact, LMS regression behaves in ways that distinguish it greatly from OLS, as well as from many other methods for robustifying OLS (see, e.g., Rousseeuw 1984). As the illustrations given here show, the LMS regression line should provide a valuable tool for studying data sets in which the usual linear model assumptions are violated by the presence of some (not too small) groups of data values that behave distinctly from the bulk of the data. This feature of LMS regression is illustrated by the fit given in Figure 1 and the residual plots of Figures 2a and 2b.

The LMS regression line is an attractive tool for data analysis, but it is not easy to compute. Steele and Steiger (1986) established that the function f(a, b) = median{|y_i - ax_i - b|^2} can have on the order of n^2 local minima, so typical local methods have little hope of finding the global minimum of f. The main objective of this article is to provide algorithms that do minimize f and are efficient in both time and space. Two algorithms are given here that determine the LMS regression line for n points in the plane. Both draw their strength from the systematic use of affine duality, and one objective pursued here is the exposition of the technique of affine duality so that it will become more commonly considered by statisticians. The first algorithm uses the so-called sweep-line technique; it runs in worst-case time O(n^2 log n) and uses only O(n) space.
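To make the duality idea concrete, here is a minimal Python sketch of the standard point-line duality (an illustration of the general technique, not code from the article): each data point (x_i, y_i) maps to the line b = y_i - a*x_i in the (a, b) parameter plane, and the vertical gap between a candidate pair (a, b) and this dual line equals the residual y_i - a*x_i - b in the original plane. The specific numbers below are arbitrary example values.

```python
def dual_line(x, y):
    """Dual of the data point (x, y): the line b = y - a*x in the
    (a, b) parameter plane.  A pair (a, b) lies on this line exactly
    when the regression line y = a*x + b passes through (x, y)."""
    return lambda a: y - a * x

# The vertical gap in the dual plane equals the primal residual:
x, y = 2.0, 5.0          # an example data point
a, b = 1.0, 1.0          # an example candidate line y = a*x + b
residual = y - a * x - b
dual_gap = dual_line(x, y)(a) - b
print(residual == dual_gap)  # -> True
```

Under this map, statements about residuals of a candidate line become statements about vertical distances in the arrangement of n dual lines, which is what both algorithms exploit.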
The second algorithm depends on the recent development of data structures that permit economical searching of the arrangement determined by n lines. It requires only O(n^2) time, but the space needed rises to O(n^2).
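As a concrete (and deliberately naive) illustration of the objective function, the following Python sketch evaluates f(a, b) and searches over lines through pairs of data points. This baseline is not one of the article's algorithms and is not guaranteed to locate the global minimum of f; it only shows how the median-of-squared-residuals criterion discounts an outlying group. The sample points are hypothetical.

```python
from itertools import combinations
from statistics import median


def f(a, b, pts):
    """Median of squared residuals for the line y = a*x + b."""
    return median((y - a * x - b) ** 2 for x, y in pts)


def lms_brute(pts):
    """Crude baseline: score the line through every pair of points and
    keep the one with the smallest median squared residual.
    (Illustrative only -- not the exact algorithms of the article.)"""
    best = None
    for (x1, y1), (x2, y2) in combinations(pts, 2):
        if x1 == x2:
            continue  # skip vertical candidate lines
        a = (y2 - y1) / (x2 - x1)
        b = y1 - a * x1
        val = f(a, b, pts)
        if best is None or val < best[0]:
            best = (val, a, b)
    return best  # (f value, slope, intercept)


# Four points on y = 2x + 1 plus two gross outliers:
pts = [(0, 1), (1, 3), (2, 5), (3, 7), (4, 50), (5, 60)]
print(lms_brute(pts))  # -> (0, 2.0, 1.0): the outliers are ignored
```

With a majority of the points collinear, the median squared residual of the fitted line is 0 regardless of the outliers, whereas an OLS fit would be pulled strongly toward them.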