Recently, Rousseeuw and collaborators have proposed another criterion called "regression depth". A regression line is called a nonfit if it can be rotated to vertical without passing through any data points. (We count the points exactly on the line as "passed through".) A nonfit is a very poor regression line, because it is combinatorially equivalent to a vertical line, which posits no relationship between independent and dependent variables. The regression depth of a line L is the minimum number of points whose removal makes L into a nonfit. Removing the red circles below and rotating about the indicated fulcrum demonstrates that the blue line has regression depth 2.
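
The definition above can be checked mechanically for a non-vertical line in the plane. The following is a minimal sketch, not code from the work described here: the function name `regression_depth`, the `(slope, intercept)` parameterization, and the tolerance `tol` are all assumptions made for illustration. It tries every fulcrum position between consecutive x-coordinates and both directions of rotation, counting points on the line as passed through.

```python
def regression_depth(points, slope, intercept, tol=1e-12):
    """Regression depth of the line y = slope*x + intercept.

    points is a list of (x, y) pairs. A point within tol of the line
    is treated as lying on it, so it counts for either rotation.
    """
    # Residual sign per point: +1 above the line, -1 below, 0 on it.
    signed = []
    for x, y in sorted(points):
        r = y - (slope * x + intercept)
        signed.append((x, 0 if abs(r) <= tol else (1 if r > 0 else -1)))

    n = len(signed)
    total_nonneg = sum(1 for _, s in signed if s >= 0)
    total_nonpos = sum(1 for _, s in signed if s <= 0)

    # Fulcrum to the left of every point: the whole line sweeps one way.
    best = min(total_nonneg, total_nonpos)

    left_nonneg = left_nonpos = 0
    i = 0
    while i < n:
        x = signed[i][0]
        # Points sharing an x-coordinate move to the left side together,
        # because the fulcrum may not sit at a data x-coordinate.
        while i < n and signed[i][0] == x:
            s = signed[i][1]
            left_nonneg += s >= 0
            left_nonpos += s <= 0
            i += 1
        right_nonneg = total_nonneg - left_nonneg
        right_nonpos = total_nonpos - left_nonpos
        # One direction of rotation sweeps the left half down and the right
        # half up; the other direction does the reverse.
        best = min(best,
                   left_nonpos + right_nonneg,
                   left_nonneg + right_nonpos)
    return best


zigzag = [(0, 0), (1, 1), (2, 0), (3, 1), (4, 0)]
print(regression_depth(zigzag, 0.0, 0.5))                       # 2
print(regression_depth([(0, 0), (1, 1), (2, 2)], 0.0, -1.0))    # 0, a nonfit
```

After sorting, the scan is linear, so the whole computation takes O(n log n) time. Faster or more general methods are not attempted here.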

Rousseeuw and Hubert proved that for any set of n points in the plane there is always a line of regression depth at least n/3. The proof is a neat construction called the "catline": divide the points into equal thirds with vertical lines and then find a line (a "ham sandwich" line) that simultaneously cuts the left two-thirds and the right two-thirds into equal halves. It is not hard to see that any rotation of this line to vertical must pass through at least n/3 points.
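
The n/3 bound can be checked empirically on small inputs. The sketch below does not implement the catline construction. Instead it substitutes a brute-force search over all lines through pairs of data points, which is enough to find a deepest line whenever two points have different x-coordinates, because moving a line until it touches data points never lowers its depth. The names `depth` and `deepest_line` are hypothetical, and integer coordinates are assumed so that `fractions.Fraction` keeps the arithmetic exact.

```python
from fractions import Fraction
from itertools import combinations
from math import ceil
import random


def depth(points, slope, intercept):
    """Regression depth of y = slope*x + intercept, in exact arithmetic."""
    # Sort by x and keep each residual; a zero residual means "on the line".
    signed = sorted((x, y - (slope * x + intercept)) for x, y in points)
    nonneg = sum(r >= 0 for _, r in signed)
    nonpos = sum(r <= 0 for _, r in signed)
    best = min(nonneg, nonpos)            # fulcrum left of every point
    left_nonneg = left_nonpos = 0
    i = 0
    while i < len(signed):
        x = signed[i][0]
        while i < len(signed) and signed[i][0] == x:   # ties in x move together
            left_nonneg += signed[i][1] >= 0
            left_nonpos += signed[i][1] <= 0
            i += 1
        best = min(best,
                   left_nonpos + (nonneg - left_nonneg),
                   left_nonneg + (nonpos - left_nonpos))
    return best


def deepest_line(points):
    """Return (depth, slope, intercept) of a deepest line through two points."""
    best = (-1, None, None)
    for (x1, y1), (x2, y2) in combinations(points, 2):
        if x1 == x2:
            continue                      # vertical lines are not candidate fits
        slope = Fraction(y2 - y1, x2 - x1)
        intercept = y1 - slope * x1
        d = depth(points, slope, intercept)
        if d > best[0]:
            best = (d, slope, intercept)
    return best


random.seed(0)
xs = random.sample(range(-50, 50), 15)    # distinct integer x-coordinates
pts = [(x, random.randint(-50, 50)) for x in xs]
d, slope, intercept = deepest_line(pts)
print(d, ">=", ceil(len(pts) / 3))
```

This search takes roughly O(n^3 log n) time, so it is only a sanity check for small n. The catline itself is the constructive route to a line meeting the bound.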

Regression depth generalizes to more than one independent variable ("multiple regression") and also to more than one dependent variable ("multivariate regression"). If there are several independent variables but only one dependent variable, then the task is to find a hyperplane. The regression depth of a hyperplane H is the minimum number of data points that H must pass through in any rotation to vertical (parallel to the dependent variable's axis). Nina Amenta, David Eppstein, Shang-Hua Teng and I proved that there always exists a hyperplane of regression depth at least n/(d+1) for n points in d dimensions, settling a conjecture of Rousseeuw. Our proof exploits a connection between center points (related concepts are "data depth" and the "Tukey median") and regression depth.

If there are several dependent variables, then the task is to find a lower-dimensional flat that explains the dependent variables in terms of the independent variables; for example, if there is one independent variable and two dependent variables, then the task is to find a line fitting a set of data points in 3-D. Now "vertical" means parallel to all the dependent variables' axes, and the regression depth of a flat F is the minimum number of data points in a double wedge bounded by a vertical hyperplane and a hyperplane that contains F. (This definition properly generalizes the previous one; the connection is clear when you view everything in the projective dual!) David Eppstein and I recently proved that for any numbers of independent and dependent variables, deep flats (that is, flats whose regression depth is at least a constant fraction of n, where the constant does not depend on n) always exist. The proof for the case of only one independent variable (that is, the case in which the regression flat is a line) is in fact a generalization of the catline.