Example for Principal Component Analysis: Olympic Decathlon 1988

Read Data Set

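This data set contains the results of the 1988 Olympic decathlon competition. Each athlete competes in 10 events: 100 m, long jump, shot put, high jump, 400 m, 110 m hurdles, discus, pole vault, javelin, and 1500 m.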

We want to understand how the athletes differ from each other.

We denote the 33 athletes by R1,...,R33 according to their final result (total of the points in the 10 events): R1, R2, R3 won gold, silver, and bronze, respectively.

Each of the 33 athletes was awarded points in 10 events, so we have 33 data points in R^10. Since we cannot make a plot in R^10, we want to find the 2-dimensional subspace that is closest to the (centered) data points. We can then plot the orthogonal projection of the data points onto this 2-dimensional subspace.
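As a minimal sketch of what "closest subspace" means, the following uses a small made-up data set (5 points in R^3, not the decathlon data) and compares the total squared distance of the points to the 2-dimensional subspace spanned by the first two left singular vectors with their distance to an arbitrarily chosen coordinate plane. The names P, Q, Exy, errSVD and errXY are illustrative assumptions.

```matlab
% Toy data: 5 points in R^3 stored as columns (made up for illustration)
P = [ 2    1   -1   -2    0;
      1    2   -2   -1    0;
      0.1 -0.1  0.2 -0.2  0];
P = P - mean(P,2)*ones(1,size(P,2));     % center the points (row means become 0)

[U,S,V] = svd(P);
Q = U(:,1:2);                            % orthonormal basis of the SVD subspace
errSVD = norm(P - Q*(Q'*P),'fro')^2;     % total squared distance to SVD subspace

E = eye(3);
Exy = E(:,1:2);                          % basis of the x-y coordinate plane
errXY = norm(P - Exy*(Exy'*P),'fro')^2;  % total squared distance to x-y plane

fprintf('errSVD = %g, errXY = %g\n', errSVD, errXY);
```

No 2-dimensional subspace can give a smaller total squared distance than the one obtained from the SVD, so errSVD should never exceed errXY, whichever comparison plane is chosen.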

T = readtable('decathlon1988s.txt','Delimiter','\t','HeaderLines',6,'ReadRowNames',true) % 'Table' data type in Matlab
data = table2array(T);                       % numbers as Matlab array
athletenames = T.Properties.RowNames;        % cell arrays of athlete names and event names
eventnames = T.Properties.VariableNames;
[m,n] = size(data);
T =

  33×10 table

           r100m    longjmp    shotput    highjmp    r400m    r110m    discus    polevlt    javelin    r1500m
           _____    _______    _______    _______    _____    _____    ______    _______    _______    ______

    R1      806       918        819       1061       866      834      855        819        758       752  
    R2      890       922        788        776       923      916      754        941        764       725  
    R3      821       920        741        776       895      873      739        972        801       790  
    R4      947       905        791        831       858      884      763        880        799       648  
    R5      899       965        703        915       893      951      727        880        621       718  
    R6      856       918        662        776       936      924      689        972        700       835  
    R7      821       826        736        859       845      925      699       1132        762       611  
    R8      850       802        811        803       899      929      691        849        783       772  
    R9      827       842        760        831       854      891      713        880        835       747  
    R10     810       881        805        776       880      879      829        972        730       605  
    R11     874       950        811        776       817      942      702        849        838       584  
    R12     821       896        758        749       860      836      715        819        826       834  
    R13     856       883        662        859       898      854      655        910        690       826  
    R14     863       903        704        776       917      886      744        702        837       751  
    R15     854       922        741        776       869      797      699        819        798       761  
    R16     841       833        760        831       820      876      730        880        696       754  
    R17     761       755        856        803       757      726      884        849        932       546  
    R18     738       814        827        749       822      845      801        880        741       643  
    R19     845       823        692        749       911      912      672        849        611       795  
    R20     885       830        841        619       829      857      773        880        745       522  
    R21     748       900        724        749       815      774      641        790        844       768  
    R22     755       818        716        831       787      823      646        819        752       798  
    R23     778       833        747        803       804      853      794        731        673       727  
    R24     795       807        681        944       816      804      639        790        653       694  
    R25     861       869        676        831       827      854      625        760        630       646  
    R26     789       774        584        859       891      803      614        790        669       744  
    R27     838       809        648        670       879      834      625        819        581       799  
    R28     750       816        739        749       762      828      784        790        682       542  
    R29     804       790        631        696       898      777      624        731        628       731  
    R30     753       835        664        644       849      783      712        760        641       596  
    R31     767       635        727        723       758      746      791        790        705       589  
    R32     759       682        626        749       801      850      639        617        697       596  
    R33     738       859        502        723       782      710      551        645        662       745  

Perform Principal Component Analysis

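We see that the first two principal components capture about 59% of the variance, and the first three about 71% (see the cumulative fractions in the output below).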

Z = data';                         % use transpose matrix: columns of Z are event values for each athlete
                                   % We obtain 33 data points in R^10
X = Z - mean(Z,2)*ones(1,m);       % subtract the row mean from each row
Variance = norm(X,'fro')^2         % Variance is 33 times the variance of the data
[U,S,V] = svd(X);
s = diag(S)                        % singular values
Variance2 = sum(s.^2)              % same as Variance
cumsum(s.^2)/sum(s.^2)             % amount of variance which is captured by 1,2,3,... principal components
Variance =

   1.8799e+06


s =

  797.5138
  690.4129
  474.4599
  429.3924
  357.2134
  297.3019
  258.1165
  184.9458
  165.8706
  115.8598


Variance2 =

   1.8799e+06


ans =

    0.3383
    0.5919
    0.7116
    0.8097
    0.8776
    0.9246
    0.9600
    0.9782
    0.9929
    1.0000
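The cumulative fractions above can also be used to choose the number of principal components. The following sketch re-enters the singular values printed above (rounded to four decimals) and finds the smallest number of components that captures a given share of the variance. The threshold of 80% and the names target and k are assumptions made only for this illustration.

```matlab
% Singular values copied from the output above (rounded to 4 decimals)
s = [797.5138; 690.4129; 474.4599; 429.3924; 357.2134; ...
     297.3019; 258.1165; 184.9458; 165.8706; 115.8598];

frac = cumsum(s.^2)/sum(s.^2);   % fraction of variance captured by 1,2,... components
target = 0.80;                   % assumed threshold, not from the original text
k = find(frac >= target, 1);     % smallest number of components reaching the target

fprintf('%d components capture %.1f%% of the variance\n', k, 100*frac(k));
```

With this threshold the sketch selects four components, which matches the fourth entry 0.8097 in the list above.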

First two principal components

We choose the columns of U as the new basis of our 10-dimensional data space. This is an orthonormal change of basis: c = U'*x.

For our data points we obtain C = U'*X = S*V', since X = U*S*V' and U'*U = I.

We then plot the projection of the data points onto the span of the first two columns u1,u2.

Each of the data points is approximated by c1*u1 + c2*u2, and we plot the points (c1,c2).
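The change of basis can be checked numerically. The sketch below uses a small made-up centered matrix (3 variables, 4 data points) rather than the decathlon data. It confirms that U'*X agrees with S*V', and that the squared error of the approximation c1*u1 + c2*u2 equals the sum of the squares of the discarded singular values. The names err1, err2, X2 and res are illustrative assumptions.

```matlab
% Toy data: 4 points in R^3 stored as columns; every row sums to zero
X = [ 4 -2  1 -3;
     -1  3 -2  0;
      2  1 -4  1];

[U,S,V] = svd(X);
s = diag(S);

C = U'*X;                         % coordinates with respect to the basis u1,u2,u3
err1 = norm(C - S*V','fro');      % should be zero up to rounding

C2 = S(1:2,1:2)*V(:,1:2)';        % (c1,c2) coordinates only
err2 = norm(C2 - U(:,1:2)'*X,'fro');

X2 = U(:,1:2)*C2;                 % approximation c1*u1 + c2*u2 of each data point
res = norm(X - X2,'fro')^2;       % squared approximation error

fprintf('err1 = %g, err2 = %g, res - s(3)^2 = %g\n', err1, err2, res - s(3)^2);
```

All three printed quantities should be zero up to rounding error.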

The horizontal green line corresponds to the best 1-dimensional subspace approximating the data.

The original unit vectors in the 10-dimensional data space (one for each event) are mapped by the projection to the columns of U(:,1:2)', i.e., the rows of U(:,1:2). We plot these vectors in red (with a scaling factor so that we can see them).
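To see why the projected unit vectors can be read off directly from U, the following sketch projects a unit vector e_j onto the span of u1,u2 and compares its (c1,c2) coordinates with the j-th row of U(:,1:2). The 3-by-3 matrix A is a made-up example used only to obtain an orthogonal matrix U. The names A, E, j and coords are assumptions.

```matlab
A = [2 0 1;
     1 3 0;
     0 1 4];                      % arbitrary matrix, used only to generate U
[U,S,V] = svd(A);                 % U is orthogonal

E = eye(3);                       % columns are the unit vectors e_1,e_2,e_3
j = 2;
coords = U(:,1:2)'*E(:,j);        % (c1,c2) coordinates of the projection of e_j

% coords is the j-th row of U(:,1:2), written as a column vector
fprintf('difference = %g\n', norm(coords - U(j,1:2)'));
```

The same reasoning applies to every j at once: U(:,1:2)'*eye(n) is just U(:,1:2)', so no separate projection step is needed in the plotting code.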

C = S(1:2,1:2)*V(:,1:2)';          % (c1,c2) coordinates of data points, same as U(:,1:2)'*X
W = 600*U(:,1:2);                  % projected unit vectors, scaled with factor 600

plot(C(1,:),C(2,:),'o'); hold on   % plot (c1,c2)
text(C(1,:),C(2,:),athletenames,'FontSize',7,'VerticalAlignment','bottom');  % label with athlete name
z = zeros(1,n);

plot([z;W(:,1)'],[z;W(:,2)'],'r'); % plot projected unit vectors for events, label with event names
text(W(:,1),W(:,2),eventnames,'FontSize',7,'VerticalAlignment','bottom','HorizontalAlignment','right','color','red');
grid on; axis equal

xtotvec = ones(n,1)/sqrt(n);       % unit vector for finding sum of points
Wtot = 600*U(:,1:2)'*xtotvec;
% plot([0;Wtot(1)],[0;Wtot(2)],'k:'); % plot vector for "total"

ax = axis; plot(ax(1:2),[0 0],'g',[0 0],ax(3:4),'g')   % draw horizontal and vertical axes in green
xlabel('c_1'); ylabel('c_2')
hold off

Interpretation of the result

First three principal components

We now use the first three principal components and plot the points (c1,c2,c3).

We show two "side views" of this 3d picture: c3 vs c1, and c3 vs c2.

C = S(1:3,1:3)*V(:,1:3)';          % (c1,c2,c3) coordinates of data points, same as U(:,1:3)'*X
W = 600*U(:,1:3);                  % projected unit vectors, scaled with factor 600

figure(1)                          % plot (c1,c3)
plot(C(1,:),C(3,:),'o'); hold on
text(C(1,:),C(3,:),athletenames,'FontSize',7,'VerticalAlignment','bottom');
z = zeros(1,n);
plot([z;W(:,1)'],[z;W(:,3)'],'r'); % plot projected unit vectors for events
text(W(:,1),W(:,3),eventnames,'FontSize',7,'VerticalAlignment','bottom','HorizontalAlignment','right','color','red');
grid on; axis equal
ax = axis; plot(ax(1:2),[0 0],'g',[0 0],ax(3:4),'g') % draw horizontal and vertical axes in green
xlabel('c_1'); ylabel('c_3')
hold off

figure(2)                          % plot (c2,c3)
plot(C(2,:),C(3,:),'o'); hold on
text(C(2,:),C(3,:),athletenames,'FontSize',7,'VerticalAlignment','bottom');
z = zeros(1,n);
plot([z;W(:,2)'],[z;W(:,3)'],'r');  % plot projected unit vectors for events
text(W(:,2),W(:,3),eventnames,'FontSize',7,'VerticalAlignment','bottom','HorizontalAlignment','right','color','red');
grid on; axis equal
ax = axis; plot(ax(1:2),[0 0],'g',[0 0],ax(3:4),'g') % draw horizontal and vertical axes in green
xlabel('c_2'); ylabel('c_3')
hold off

Interpretation of the result

We see that the third principal component c3 increases with running events and pole vault, and decreases most strongly with high jump, javelin, and 1500 m.