Principal Component Analysis to perform dimensionality reduction
Introduction
To understand how PCA works, we will first start with a 2D dataset which has one direction of large variation and one of smaller variation. In this part, we will visualize what happens when we use PCA to reduce the data from 2D to 1D. In practice, we might want to reduce data from 256 to 50 dimensions, say; but using lower dimensional data in this example allows us to visualize the algorithms better.
% The following command loads the dataset. You should now have the
% variable X in your environment
load ('example_data.mat');
% Visualize the example dataset
plot(X(:, 1), X(:, 2), 'bo');
title('Example Dataset');
axis([0.5 6.5 2 8]); axis square;
Implementing PCA
Out PCA consists of two computational steps:
- First, compute the covariance matrix of the data.
- Then use MATLAB's SVD function to compute the eigenvectors
These will correspond to the principal components of variation in the data.
Before using PCA, it is important to first normalize the data by subtracting the mean value of each feature from the dataset, and scaling each dimension so that they are in the same range.
% Before running PCA, it is important to first normalize X
[X_norm, mu, sigma] = featureNormalize(X);
%Implementation of featureNormalize is at the end section
After normalizing the data, we can run PCA to compute the principal components. First, we should compute the covariance matrix of the data, which is given by:
where X is the data matrix with examples in rows, and m is the number of examples. Note that Σ is a n × n matrix and not the summation operator. After computing the covariance matrix, we can run SVD on it to compute the principal components.
% Run PCA
[U, S] = pca(X_norm);
% Compute mu, the mean of the each feature
% Draw the eigenvectors centered at mean of data. These lines show the
% directions of maximum variations in the dataset.
hold on;
title('Computed eigenvectors of the dataset');
drawLine(mu, mu + 1.5 * S(1,1) * U(:,1)', '-k', 'LineWidth', 2);
drawLine(mu, mu + 1.5 * S(2,2) * U(:,2)', '-k', 'LineWidth', 2);
hold off;
fprintf('Top eigenvector: \n');
fprintf(' U(:,1) = %f %f \n', U(1,1), U(2,1));
Dimensionality Reduction with PCA
After computing the principal components, we can use them to reduce the feature dimension of your dataset by projecting each example onto a lower dimensional space, . We will use the eigenvectors returned by PCA and project the example dataset into a 1-dimensional space
In practice, if we were using a learning algorithm such as linear regression or perhaps neural networks, we could now use the projected data instead of the original data. By using the projected data, we can train our model faster as there are less dimensions in the input.
Projecting the data onto the principal components
Specifically, we are given a dataset X, the principal components U, and the desired number ofdimensions to reduce to K. We should project each example in X onto the top K components in U. Note that the top K components in U are given by the first K columns of U, that is
fprintf('\nDimension reduction on example dataset.\n\n');
% Plot the normalized dataset (returned from pca)
plot(X_norm(:, 1), X_norm(:, 2), 'bo');
title('The normalized and projected data after PCA');
axis([-4 3 -4 3]); axis square
% Project the data onto K = 1 dimension
K = 1;
Z = projectData(X_norm, U, K);
%Implementation of projectData is at the end section
fprintf('Projection of the first example: %f\n', Z(1));
fprintf('\n(this value should be about 1.481274)\n\n');
Reconstructing an approximation of the data
After projecting the data onto the lower dimensional space, we can approximately recover the data by projecting them back onto the original high dimensional space.
%Implementation of recoverData is at the end section
X_rec = recoverData(Z, U, K);
fprintf('Approximation of the first example: %f %f\n', X_rec(1, 1), X_rec(1, 2));
fprintf('\n(this value should be about -1.047419 -1.047419)\n\n');
% Draw lines connecting the projected points to the original points
hold on;
plot(X_rec(:, 1), X_rec(:, 2), 'ro');
title('The normalized and projected data after PCA');
for i = 1:size(X_norm, 1)
drawLine(X_norm(i,:), X_rec(i,:), '--k', 'LineWidth', 1);
end
hold off
Implementation of functions
Implementation of pca
function [U, S] = pca(X)
%PCA Run principal component analysis on the dataset X
% [U, S, X] = pca(X) computes eigenvectors of the covariance matrix of X
% Returns the eigenvectors U, the eigenvalues (on diagonal) in S
%
% Useful values
[m, n] = size(X);
Sigma = X' * X;
Sigma = Sigma ./ m;
% We need to return the following variables correctly.
U = zeros(n);
S = zeros(n);
[U, S, V] = svd(Sigma);
end
Implementation of featureNormalize
function [X_norm, mu, sigma] = featureNormalize(X)
%FEATURENORMALIZE Normalizes the features in X
% FEATURENORMALIZE(X) returns a normalized version of X where
% the mean value of each feature is 0 and the standard deviation
% is 1. This is often a good preprocessing step to do when
% working with learning algorithms.
mu = mean(X);
X_norm = bsxfun(@minus, X, mu);
sigma = std(X_norm);
X_norm = bsxfun(@rdivide, X_norm, sigma);
% ============================================================
end
Implementation of projectData
function Z = projectData(X, U, K)
%PROJECTDATA Computes the reduced data representation when projecting only
%on to the top k eigenvectors
% Z = projectData(X, U, K) computes the projection of
% the normalized inputs X into the reduced dimensional space spanned by
% the first K columns of U. It returns the projected examples in Z.
%
% You need to return the following variables correctly.
Z = zeros(size(X, 1), K);
U_reduce = U(:, 1:K);
for i = 1:size(X,1)
x = X(i, :)';
z = U_reduce' * x;
Z(i,:) = z;
end
end
Implementation of recoverData
function X_rec = recoverData(Z, U, K)
%RECOVERDATA Recovers an approximation of the original data when using the
%projected data
% X_rec = RECOVERDATA(Z, U, K) recovers an approximation the
% original data that has been reduced to K dimensions. It returns the
% approximate reconstruction in X_rec.
%
% You need to return the following variables correctly.
X_rec = zeros(size(Z, 1), size(U, 1));
U_reduce = U(:, 1:K);
X_rec = Z * U_reduce';
end
Implementation of drawLine
function drawLine(p1, p2, varargin)
%DRAWLINE Draws a line from point p1 to point p2
% DRAWLINE(p1, p2) Draws a line from point p1 to point p2 and holds the
% current figure
plot([p1(1) p2(1)], [p1(2) p2(2)], varargin{:});
end