For a given $m \times n$ matrix $X = \{X_{ij}\}$, where each row is a sample and each column is a zero-mean feature, the usual way of computing the covariance matrix is

$$ \Sigma = \frac{1}{m} X^T X $$

This is easy to understand: $\Sigma_{ij}$ is the covariance between the $i$-th and $j$-th features of the dataset, and the computation reflects that: $\Sigma_{ij}$ is computed as $\frac{1}{m} \langle X_i, X_j \rangle$, where $\langle X_i, X_j \rangle$ is the inner product between column $X_i$ and column $X_j$. Since all features (columns) are zero-mean, this is exactly the definition of the covariance between two random variables.
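
For concreteness, here is a minimal Octave sketch of that entry-by-entry view (the variable names are mine, not from the post's code):

# entry-by-entry covariance: each value is a scaled inner product
# of two (zero-mean) feature columns
m = 1000; n = 5;
X = 3 * rand(m, n) - 1.5;        # uniform on [-1.5, 1.5], zero mean
S = zeros(n, n);
for i = 1:n
  for j = 1:n
    S(i, j) = (1/m) * X(:, i)' * X(:, j);   # (1/m) * <X_i, X_j>
  end
end
disp(max(max(abs(S - (1/m) * X' * X))));    # ~0: matches the matrix form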

To my surprise, there is another way of estimating the covariance:

$$ \Sigma = \frac{1}{m} \sum_{i=1}^m { {X^{(i)}}^T X^{(i)} } $$

where $X^{(i)}$ is a $1 \times n$ row vector representing the $i$-th observed sample in the dataset. In other words, instead of estimating the covariance matrix feature-wise, i.e., computing the covariance values one by one, we now form an estimate of the entire covariance matrix from each single observed sample, and average those estimates (the $\frac{1}{m}\sum$). This approach has the benefit that the covariance matrix can be built incrementally!
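
For instance, a minimal sketch of such an incremental build, assuming samples arrive one at a time (the names are mine):

# incremental covariance: maintain a running sum of rank-one
# per-sample estimates; divide by the sample count at any point
n = 20;
S = zeros(n, n);                  # running sum of per-sample estimates
for m = 1:1000
  x = 3 * rand(1, n) - 1.5;       # next observed sample, a 1 x n row vector
  S += x' * x;                    # rank-one update from this sample alone
  Sigma = S / m;                  # current covariance estimate after m samples
end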

The graph below shows the squared estimation error of each of these two methods against the built-in cov() function.

[Figure: comparison of the squared estimation errors of the two methods as the sample size grows]

The two lines overlap perfectly, implying that the two methods are fundamentally equivalent. And as the sample size grows, the estimation error shrinks; the remaining gap against cov() comes from its mean-centering and its $\frac{1}{m-1}$ normalization, both of whose effects vanish as $m$ grows.
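
In fact, the overlap is exact rather than approximate: expanding the matrix product row by row gives

$$ X^T X = \sum_{i=1}^m { {X^{(i)}}^T X^{(i)} } $$

so the two estimators are algebraically identical, and any difference between them is floating-point noise.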

Here's the code for it (gist):

clc;
close all;

mm = [];
E1 = [];
E2 = [];

for m = 10:50:1000

  e1 = 0.0;
  e2 = 0.0;

  mm = [mm m];

  # average the error over 10 random trials
  for a = 1:10
    n = 20;
    X = 3 * rand(m, n) - 1.5;   # uniform on [-1.5, 1.5], zero mean
    C1 = cov(X);

    # computing covariance feature by feature (column-wise inner products)
    C2 = (1/m) * X' * X;

    # estimate covariance by computing it on each sample, then averaging
    C = zeros(n, n);
    for i = 1:m
      C += X(i,:)' * X(i,:);
    end

    C = (1/m) * C;

    e1 += (1/(m*n)) * sum(sum((C1 - C).^2, 1), 2);
    e2 += (1/(m*n)) * sum(sum((C2 - C1).^2, 1), 2);
  end

  E1 = [E1 e1/10];
  E2 = [E2 e2/10];
end

# plot before "hold on" so the axes pick up the log scale,
# then hold to overlay the second line
semilogy(mm, E1, '-k');
hold on;
semilogy(mm, E2, '-xr');
legend("sq err cov = avg cov per sample", "sq err cov = (1/m) * X' * X");
grid on;
hold off;
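
One detail worth checking: cov() subtracts the sample mean and normalizes by $m-1$, while $(1/m) X^T X$ does neither, which is exactly why the plotted errors shrink as $m$ grows. A minimal sketch verifying how the two relate (variable names are mine):

# relate the uncentered estimator to the built-in cov(), which centers
# the data and normalizes by m-1 rather than m
m = 500; n = 20;
X = 3 * rand(m, n) - 1.5;
mu = mean(X, 1);                        # 1 x n sample mean (not exactly zero)
lhs = (1/m) * X' * X;                   # uncentered second-moment estimator
rhs = ((m-1)/m) * cov(X) + mu' * mu;    # rescaled centered covariance plus mean term
disp(max(max(abs(lhs - rhs))));         # ~0 up to floating-point error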