MINIMUM COVARIANCE DETERMINANT AND EXTENSIONS

Abstract

The minimum covariance determinant (MCD) method is a highly robust estimator of multivariate location and scatter, for which a fast algorithm is available. Since estimating the covariance matrix is the cornerstone of many multivariate statistical methods, the MCD is an important building block when developing robust multivariate techniques. It also serves as a convenient and efficient tool for outlier detection. The MCD estimator is reviewed, along with its main properties such as affine equivariance, breakdown value, and influence function. We discuss its computation, and list applications and extensions of the MCD in applied and methodological multivariate statistics. Two recent extensions of the MCD are described. The first one is a fast deterministic algorithm which inherits the robustness of the MCD while being almost affine equivariant. The second is tailored to high‐dimensional data, possibly with more dimensions than cases, and incorporates regularization to prevent singular matrices.

This article is categorized under:

  • Statistical and Graphical Methods of Data Analysis > Multivariate Analysis
  • Statistical and Graphical Methods of Data Analysis > Robust Methods
  • Statistical Learning and Exploratory Methods of the Data Sciences > Knowledge Discovery

1 INTRODUCTION

The minimum covariance determinant (MCD) estimator is one of the first affine equivariant and highly robust estimators of multivariate location and scatter (Rousseeuw, 1984, 1985). Being resistant to outlying observations makes the MCD very useful for outlier detection. Although already introduced in 1984, it only came into widespread use after the construction of the computationally efficient FastMCD algorithm of Rousseeuw and Van Driessen (1999). Since then, the MCD has been applied in numerous fields such as medicine, finance, image analysis, and chemistry. Moreover, the MCD has also been used to develop many robust multivariate techniques, including robust principal component analysis, factor analysis, and multiple regression. Recent modifications of the MCD include a deterministic algorithm and a regularized version for high-dimensional data.

2 DESCRIPTION OF THE MCD ESTIMATOR

2.1 Motivation

In the multivariate location and scatter setting the data are stored in an n × p data matrix X = (x_1, …, x_n)′ with x_i = (x_{i1}, …, x_{ip})′ the ith observation, so n stands for the number of objects and p for the number of variables. We assume that the observations are sampled from an elliptically symmetric unimodal distribution with unknown parameters μ and Σ, where μ is a vector with p components and Σ is a positive definite p × p matrix. To be precise, a multivariate distribution is called elliptically symmetric and unimodal if there exists a strictly decreasing real function g such that the density can be written in the form

f(x) = \frac{1}{\sqrt{|\Sigma|}} \, g\left(d^2(x, \mu, \Sigma)\right)   (1)

in which the statistical distance d(x, μ, Σ) is given by

d(x, \mu, \Sigma) = \sqrt{(x - \mu)' \Sigma^{-1} (x - \mu)}.   (2)
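As a small illustration (not part of the original text), the statistical distance in Eq. (2) can be computed directly; the sketch below uses Python with NumPy, and the numerical values of x, μ, and Σ are purely hypothetical.

```python
import numpy as np

def statistical_distance(x, mu, Sigma):
    """Statistical (Mahalanobis) distance d(x, mu, Sigma) as in Eq. (2)."""
    diff = x - mu
    # Solve Sigma^{-1} (x - mu) instead of forming the explicit inverse
    return np.sqrt(diff @ np.linalg.solve(Sigma, diff))

# Toy example with p = 2 (hypothetical values)
mu = np.array([1.0, 2.0])
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
x = np.array([3.0, 0.0])
print(statistical_distance(x, mu, Sigma))
```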

To illustrate the MCD, we first consider the wine data set available in Hettich and Bay (1999) and also analyzed in Maronna, Martin, and Yohai (2006). This data set contains the quantities of 13 constituents found in three types of Italian wines. We consider the first group containing 59 wines, and focus on the constituents “Malic acid” and “Proline.” This yields a bivariate data set, that is, p = 2. A scatter plot of the data is shown in Figure 1, in which we see that the points on the lower right‐hand side of the plot are outlying relative to the majority of the data.
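For readers who wish to reproduce this bivariate example, the following sketch uses scikit-learn, assuming that its bundled wine data corresponds to the data set of Hettich and Bay (1999); the column names, the cutoff value, and the use of MinCovDet (scikit-learn's FastMCD implementation with default settings) are choices made here for illustration, not prescribed by the article.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.covariance import MinCovDet

# Load the wine data; class 0 is the first group of 59 wines
frame = load_wine(as_frame=True).frame
group1 = frame[frame["target"] == 0]

# Bivariate example: "Malic acid" and "Proline" (p = 2)
X = group1[["malic_acid", "proline"]].to_numpy()

# Fit the MCD (reweighted FastMCD as implemented in scikit-learn)
mcd = MinCovDet(random_state=0).fit(X)
print("robust location:", mcd.location_)
print("robust scatter:\n", mcd.covariance_)

# Robust distances of the observations; large values flag outliers
rd = np.sqrt(mcd.mahalanobis(X))
print("flagged outliers:", np.where(rd > np.sqrt(9.21))[0])  # sqrt of chi2(2) 0.99 quantile
```

The flagged observations should correspond to the points on the lower right-hand side of the scatter plot described above.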
