Example for Principal Component Analysis: Protein Consumption in Europe

Contents

Read Data Set

This data set shows nutrition data for different types of protein in European countries around 1970.

We want to understand in which way nutrition is different in different countries

Each of the 25 countries has 9 protein values, so we have 25 data points in R^9. Since we cannot make a plot in R^9 we want to find the 2-dimensional subspace which is closest to the data points. We can then plot the orthogonal projection of the data points onto this 2-dimensional subspace.

T = readtable('proteindata.txt','Delimiter','\t','HeaderLines',6,'ReadRowNames',true) % 'Table' data type in Matlab
data = table2array(T);                       % numbers as Matlab array
countrynames = T.Properties.RowNames;        % cell arrays of country names and protein names
proteinnames = T.Properties.VariableNames;
T = 
                 RedMeat    WhiteMeat    Eggs    Milk    Fish    Cereals    Starch    Nuts    FruitVeg
                 _______    _________    ____    ____    ____    _______    ______    ____    ________
    Albania      10.1        1.4         0.5      8.9     0.2    42.3       0.6       5.5     1.7     
    Austria       8.9         14         4.3     19.9     2.1      28       3.6       1.3     4.3     
    Belgium      13.5        9.3         4.1     17.5     4.5    26.6       5.7       2.1       4     
    Bulgaria      7.8          6         1.6      8.3     1.2    56.7       1.1       3.7     4.2     
    Czechosl.     9.7       11.4         2.8     12.5       2    34.3         5       1.1       4     
    Denmark      10.6       10.8         3.7       25     9.9    21.9       4.8       0.7     2.4     
    EGermany      8.4       11.6         3.7     11.1     5.4    24.6       6.5       0.8     3.6     
    Finland       9.5        4.9         2.7     33.7     5.8    26.3       5.1         1     1.4     
    France         18        9.9         3.3     19.5     5.7    28.1       4.8       2.4     6.5     
    Greece       10.2          3         2.8     17.6     5.9    41.7       2.2       7.8     6.5     
    Hungary       5.3       12.4         2.9      9.7     0.3    40.1         4       5.4     4.2     
    Ireland      13.9         10         4.7     25.8     2.2      24       6.2       1.6     2.9     
    Italy           9        5.1         2.9     13.7     3.4    36.8       2.1       4.3     6.7     
    Netherl.      9.5       13.6         3.6     23.4     2.5    22.4       4.2       1.8     3.7     
    Norway        9.4        4.7         2.7     23.3     9.7      23       4.6       1.6     2.7     
    Poland        6.9       10.2         2.7     19.3       3    36.1       5.9         2     6.6     
    Portugal      6.2        3.7         1.1      4.9    14.2      27       5.9       4.7     7.9     
    Romania       6.2        6.3         1.5     11.1       1    49.6       3.1       5.3     2.8     
    Spain         7.1        3.4         3.1      8.6       7    29.2       5.7       5.9     7.2     
    Sweden        9.9        7.8         3.5     24.7     7.5    19.5       3.7       1.4       2     
    Switzerl.    13.1       10.1         3.1     23.8     2.3    25.6       2.8       2.4     4.9     
    UK           17.4        5.7         4.7     20.6     4.3    24.3       4.7       3.4     3.3     
    USSR          9.3        4.6         2.1     16.6       3    43.6       6.4       3.4     2.9     
    WGermany     11.4       12.5         4.1     18.8     3.4    18.6       5.2       1.5     3.8     
    Yugosl.       4.4          5         1.2      9.5     0.6    55.9         3       5.7     3.2     

Perform Principal Component Analysis

We see that

Z = data';                        % use transpose matrix: columns of Z are protein values for each country
                                  % We obtain 25 data points in R^9
X = Z - mean(Z,2)*ones(1,25);     % subtract the row mean from each row
Variance = norm(X,'fro')^2        % Variance is 25 times the variance of the data
[U,S,V] = svd(X);
s = diag(S)                       % singular values
Variance2 = sum(s.^2)             % same as Variance
cumsum(s.^2)/sum(s.^2)            % amount of variance which is captured by 1,2,3,... principal components
Variance =
   5.2434e+03
s =
   61.0378
   27.1435
   19.3764
   14.1282
    9.3337
    7.6349
    6.1125
    4.1419
    2.4599
Variance2 =
   5.2434e+03
ans =
    0.7105
    0.8510
    0.9226
    0.9607
    0.9773
    0.9884
    0.9956
    0.9988
    1.0000

First two principal components

We choose the columns of U as the new basis of our 9-dimensional data space. This is an orthonormal change of basis: c = U'*x.

For our data points we obtain C = U'*X = S*V.

We then plot the projection of the data points onto the span of the first two columns u1,u2.

Each of the data points is approximated by c1*u1 + c2*u2, and we plot the points (c1,c2).

The horizontal green line corresponds to the best 1-dimensional subspace approximating the data.

The original unit vectors in the 9-dimensional data space (for 'RedMeat','WhiteMeat',etc.) become after the projection the columns of U(:,1:2)', i.e., the first two rows of U. We plot these vectors in red (with a scaling factor so that we can see them).

C = S(1:2,1:2)*V(:,1:2)';         % (c1,c2) coordinates of data points, same as U(:,1:2)'*X
W = 20*U(:,1:2);                  % projected unit vectors, scaled with factor 20

plot(C(1,:),C(2,:),'o'); hold on  % plot (c1,c2)
text(C(1,:),C(2,:),countrynames,'FontSize',7,'VerticalAlignment','bottom');  % label with country name
z = zeros(1,9);

plot([z;W(:,1)'],[z;W(:,2)'],'r');  % plot projected unit vectors for proteins, label with protein names
text(W(:,1),W(:,2),proteinnames,'FontSize',7,'VerticalAlignment','bottom','HorizontalAlignment','right','color','red');
grid on; axis equal
ax = axis; plot(ax(1:2),[0 0],'g',[0 0],ax(3:4),'g')   % draw horizontal and vertical axes in green
xlabel('c_1'); ylabel('c_2')
hold off

Interpretation of the result

First three principal components

We now use the first three principal components and plot the points (c1,c2,c3).

We show two "side views" of this 3d picture: c3 vs c1, and c3 vs c2.

C = S(1:3,1:3)*V(:,1:3)';          % (c1,c2,c3) coordinates of data points, same as U(:,1:3)'*X
W = 20*U(:,1:3);                   % projected unit vectors, scaled with factor 20

figure(1)                          % plot (c1,c3)
plot(C(1,:),C(3,:),'o'); hold on
text(C(1,:),C(3,:),countrynames,'FontSize',7,'VerticalAlignment','bottom');
z = zeros(1,9);
plot([z;W(:,1)'],[z;W(:,3)'],'r'); % plot projected unit vectors for proteins
text(W(:,1),W(:,3),proteinnames,'FontSize',7,'VerticalAlignment','bottom','HorizontalAlignment','right','color','red');
grid on; axis equal
ax = axis; plot(ax(1:2),[0 0],'g',[0 0],ax(3:4),'g') % draw horizontal and vertical axes in green
xlabel('c_1'); ylabel('c_3')
hold off

figure(2)                           % plot(c2,c3)
plot(C(2,:),C(3,:),'o'); hold on
text(C(2,:),C(3,:),countrynames,'FontSize',7,'VerticalAlignment','bottom');
z = zeros(1,9);
plot([z;W(:,2)'],[z;W(:,3)'],'r');  % plot projected unit vectors for proteins
text(W(:,2),W(:,3),proteinnames,'FontSize',7,'VerticalAlignment','bottom','HorizontalAlignment','right','color','red');
grid on; axis equal
ax = axis; plot(ax(1:2),[0 0],'g',[0 0],ax(3:4),'g') % draw horizontal and vertical axes in green
xlabel('c_2'); ylabel('c_3')
hold off

Interpretation of the result

We see that the third principal component c3 increases with fish, and decreases with white meat. Portugal, Norway, Greece, Finland are at the top (lots of fish, little white meat); Hungary and Austria are at the bottom (lots of white meat, little fish).