Regularized logistic regression for quality assurance

Problem statement

During quality assurance (QA), each microchip from a fabrication plant goes through various tests to ensure it is functioning correctly. Suppose you are the product manager of the factory and you have the test results for some microchips on two different tests. From these two tests, you would like to determine whether the microchips should be accepted or rejected. To help you make the decision, you have a dataset of test results on past microchips, from which you can build a logistic regression model.

Load data

The file QA_data.txt contains the dataset for our logistic regression problem. The first two columns contain the test scores and the third column contains a label which indicates the verdict (1 = accepted, 0 = rejected). The dataset is loaded from the data file into the variables X and y:
%% Load Data
% The first two columns contain the X values and the third column
% contains the label (y).
data = load('QA_data.txt');
X = data(:, [1, 2]); y = data(:, 3);

Plot data

Before starting to implement any learning algorithm, it is always good to visualize the data if possible. We start the exercise by first plotting the data to understand the problem we are working with.
plotData(X, y);
% The implementation of plotData is given in the last section of this document
% Put some labels
hold on;
% Labels and Legend
xlabel('Microchip Test 1');
ylabel('Microchip Test 2');
% Specified in plot order
legend('y = 1', 'y = 0');
title('Plot of training data');
hold off;
The axes are the two test scores, and the positive (y = 1, accepted) and negative (y = 0, rejected) examples are shown with different markers. Our dataset cannot be separated into positive and negative examples by a straight line through the plot. Therefore, a straightforward application of logistic regression will not perform well on this dataset, since logistic regression is only able to find a linear decision boundary.

Feature mapping

One way to fit the data better is to create more features from each data point. We will map the features into all polynomial terms of x1 and x2 up to the sixth power.
As a result of this mapping, our vector of two features (the scores on the two QA tests) is transformed into a 28-dimensional vector. A logistic regression classifier trained on this higher-dimensional feature vector will have a more complex decision boundary, which will appear nonlinear when drawn in our 2-dimensional plot.
While the feature mapping allows us to build a more expressive classifier, it is also more susceptible to overfitting.
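Concretely, the mapping sends a pair of test scores (x1, x2) to every monomial in x1 and x2 of total degree at most six; this display is just a restatement of what the mapFeature function listed at the end of the document computes:

$$\mathrm{mapFeature}(x_1, x_2) = \begin{bmatrix} 1 & x_1 & x_2 & x_1^2 & x_1 x_2 & x_2^2 & x_1^3 & \cdots & x_1 x_2^5 & x_2^6 \end{bmatrix}^T$$

Counting these monomials gives $\binom{6+2}{2} = 28$ terms, which is where the 28-dimensional feature vector comes from.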
% Add Polynomial Features
% Note that mapFeature also adds a column of ones for us, so the intercept
% term is handled
X = mapFeature(X(:,1), X(:,2));
% The implementation of mapFeature is given in the last section of this document

Sigmoid function

The logistic regression hypothesis is defined as

$$h_\theta(x) = g(\theta^T x)$$

where the function g is the sigmoid function, defined as

$$g(z) = \frac{1}{1 + e^{-z}}$$

For large positive values of z the sigmoid is close to 1, while for large negative values it is close to 0; at z = 0 it is exactly 0.5. The implementation of the sigmoid function is given at the end of this document.
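As a quick sanity check (assuming the sigmoid function listed at the end of this document is on the path), we can evaluate it at a few representative values:
% Sanity-check the sigmoid at a few points
sigmoid(0)            % expected: exactly 0.5
sigmoid(10)           % expected: close to 1
sigmoid(-10)          % expected: close to 0
sigmoid([-10 0 10])   % the implementation also works element-wise on vectors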

Cost function and gradient

The regularized cost function in logistic regression is

$$J(\theta) = \frac{1}{m}\sum_{i=1}^{m}\left[-y^{(i)}\log\left(h_\theta(x^{(i)})\right) - \left(1 - y^{(i)}\right)\log\left(1 - h_\theta(x^{(i)})\right)\right] + \frac{\lambda}{2m}\sum_{j=1}^{n}\theta_j^2$$

Note that the bias parameter θ_0 is not regularized. The gradient of the cost function is a vector whose jth element is defined as follows:

$$\frac{\partial J(\theta)}{\partial \theta_0} = \frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_0^{(i)} \qquad \text{for } j = 0$$

$$\frac{\partial J(\theta)}{\partial \theta_j} = \frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_j^{(i)} + \frac{\lambda}{m}\theta_j \qquad \text{for } j \ge 1$$
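For reference, the same gradient can be written compactly in vectorized form, where X is the design matrix (including the column of ones), g(Xθ) is the vector of predictions, and θ_reg is θ with the θ_0 entry replaced by zero so the bias stays unregularized (this compact notation is not part of the exercise text, but it matches the element-wise definitions above):

$$\nabla J(\theta) = \frac{1}{m} X^T\left(g(X\theta) - y\right) + \frac{\lambda}{m}\,\theta_{\mathrm{reg}}$$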
Let's compute the initial cost using the initial value of θ (initialized to all zeros).
% Initialize fitting parameters
initial_theta = zeros(size(X, 2), 1);
% Set regularization parameter lambda to 1
lambda = 1;
% Compute and display initial cost and gradient for regularized logistic
% regression
[cost, grad] = costFunctionReg(initial_theta, X, y, lambda);
% The implementation of costFunctionReg is given in the last section of this document
fprintf('Cost at initial theta (zeros): %f\n', cost);
Cost at initial theta (zeros): 0.693147
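This value has a simple interpretation: with θ = 0 the hypothesis outputs g(0) = 0.5 for every example and the regularization term vanishes, so

$$J(\mathbf{0}) = \frac{1}{m}\sum_{i=1}^{m}\left[-\log(0.5)\right] = \log 2 \approx 0.693147$$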

Learning parameters using a built-in function

Octave/MATLAB’s fminunc is an optimization solver that finds the minimum of an unconstrained function. For logistic regression, we want to optimize the cost function J(θ) with parameters θ.
Concretely, we are going to use fminunc to find the best parameters θ for the regularized logistic regression cost function, given a fixed dataset of X and y values. We will pass fminunc the initial parameter values and a function handle that, for a given θ, computes the cost and the gradient on that dataset.
We have already implemented everything needed, so let's use the built-in solver:
% Set Options
options = optimset('GradObj', 'on', 'MaxIter', 400);
% Optimize
[theta, J, exit_flag] = fminunc(@(t)(costFunctionReg(t, X, y, lambda)), initial_theta, options);
Local minimum found.

Optimization completed because the size of the gradient is less than
the default value of the optimality tolerance.

<stopping criteria details>
In this code snippet, we first defined the options to be used with fminunc. Specifically, we set the GradObj option to 'on', which tells fminunc that our function returns both the cost and the gradient; this allows fminunc to use the gradient when minimizing the function. Furthermore, we set the MaxIter option to 400, so that fminunc will run for at most 400 steps before it terminates. To specify the actual function we are minimizing, we use the short-hand @(t) syntax for anonymous functions. This creates a function with argument t that calls costFunctionReg, allowing us to wrap costFunctionReg for use with fminunc.
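If the anonymous-function syntax is unfamiliar, the call above is equivalent to the slightly more verbose form below (the variable name costWrapper is only illustrative):
% Wrap costFunctionReg so that fminunc sees a function of theta only
costWrapper = @(t) costFunctionReg(t, X, y, lambda);
[theta, J, exit_flag] = fminunc(costWrapper, initial_theta, options);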

Plot decision boundary

We plot the nonlinear decision boundary by computing the classifier's predictions on an evenly spaced grid and then drawing a contour plot of where the predictions change from y = 0 to y = 1.
% Plot Boundary
plotDecisionBoundary(theta, X, y);
hold on;
title(sprintf('lambda = %g', lambda))
% Labels and Legend
xlabel('Microchip Test 1')
ylabel('Microchip Test 2')
legend('y = 1', 'y = 0', 'Decision boundary')
hold off;

Performance on training data

Let's check how our model is performing on the training data
% Compute accuracy on our training set
p = predict(theta, X);
%Implementation of predict is given in the last section
fprintf('Train Accuracy: %f\n', mean(double(p == y)) * 100);
Train Accuracy: 83.050847

Effect of regularization parameter

Let's see how changing the value of the regularization parameter affects the decision boundary by plotting the boundary for different values of lambda.
lambda = 0;
[theta, J, exit_flag] = fminunc(@(t)(costFunctionReg(t, X, y, lambda)), initial_theta, options);
Solver stopped prematurely.

fminunc stopped because it exceeded the iteration limit,
options.MaxIterations = 400 (the selected value).
plotDecisionBoundary(theta, X, y);
title(sprintf('lambda = %g', lambda));
Without any regularization, our model overfits the training data, which is bad for generalization.
lambda = 100;
[theta, J, exit_flag] = fminunc(@(t)(costFunctionReg(t, X, y, lambda)), initial_theta, options);
Local minimum found.

Optimization completed because the size of the gradient is less than
the default value of the optimality tolerance.

<stopping criteria details>
plotDecisionBoundary(theta, X, y);
title(sprintf('lambda = %g', lambda));
Too much regularization causes underfitting, which is also bad for generalization.
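To compare several settings side by side, one option is a small sweep over lambda with one plot per value; a minimal sketch (the specific lambda values are only examples):
% Sketch: retrain and plot the decision boundary for several lambda values
for lambda = [0 1 10 100]
    options = optimset('GradObj', 'on', 'MaxIter', 400);
    theta = fminunc(@(t) costFunctionReg(t, X, y, lambda), initial_theta, options);
    plotDecisionBoundary(theta, X, y);
    title(sprintf('lambda = %g', lambda));
end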

Implementation of functions

Implementation of plotData
function plotData(X, y)
%PLOTDATA Plots the data points X and y into a new figure
% PLOTDATA(x,y) plots the data points with + for the positive examples
% and o for the negative examples. X is assumed to be a Mx2 matrix.
% Create New Figure
figure; hold on;
pos = find(y==1); neg = find(y == 0);
% Plot Examples
plot(X(pos, 1), X(pos, 2), 'k+','LineWidth', 2,'MarkerSize', 7);
plot(X(neg, 1), X(neg, 2), 'ko', 'MarkerFaceColor', 'y','MarkerSize', 7);
% =========================================================================
hold off;
end
Implementation of mapFeature
function out = mapFeature(X1, X2)
% MAPFEATURE Feature mapping function to polynomial features
%
% MAPFEATURE(X1, X2) maps the two input features
% to quadratic features used in the regularization exercise.
%
% Returns a new feature array with more features, comprising of
% X1, X2, X1.^2, X2.^2, X1*X2, X1*X2.^2, etc..
%
% Inputs X1, X2 must be the same size
%
degree = 6;
out = ones(size(X1(:,1)));
for i = 1:degree
    for j = 0:i
        out(:, end+1) = (X1.^(i-j)) .* (X2.^j);
    end
end
end
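A quick check of the mapping on a single made-up point shows the expected 28 features:
% Map one example point; the result should be a 1 x 28 row vector
features = mapFeature(0.5, -0.25);
disp(size(features))   % prints: 1 28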
Implementation of sigmoid function
The code works on scalars, vectors, and matrices; for a matrix, the function applies the sigmoid to every element.
function g = sigmoid(z)
%SIGMOID Compute sigmoid function
% g = SIGMOID(z) computes the sigmoid of z.
g = zeros(size(z));
for row_index = 1:size(z,1)
    for col_index = 1:size(z,2)
        val = z(row_index, col_index);
        g(row_index, col_index) = 1 / (1 + exp(-val));
    end
end
end
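The element-wise loop can be replaced by a fully vectorized one-liner with the same behavior; a minimal sketch (sigmoidVectorized is only an illustrative name):
function g = sigmoidVectorized(z)
%SIGMOIDVECTORIZED Vectorized sigmoid, equivalent to the loop version above
    g = 1 ./ (1 + exp(-z));
end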
Implementation of costFunctionReg
function [J, grad] = costFunctionReg(theta, X, y, lambda)
%COSTFUNCTIONREG Compute cost and gradient for logistic regression with regularization
% J = COSTFUNCTIONREG(theta, X, y, lambda) computes the cost of using
% theta as the parameter for regularized logistic regression and the
% gradient of the cost w.r.t. to the parameters.
% Initialize some useful values
m = length(y); % number of training examples
% We need to return the following variables correctly
J = 0;
grad = zeros(size(theta));
% Unregularized part of the cost: average cross-entropy over all examples
for index = 1:m
    y_i = y(index);
    x_i = X(index,:);
    h_theta_i = sigmoid(x_i * theta);
    J = J + ( -y_i * log(h_theta_i) - (1 - y_i) * log(1 - h_theta_i) );
end
J = J/m;
% Regularization term
total = sum(theta.^2);
total = total - theta(1)^2; % We do not regularize theta_zero
J = J + (lambda/(2*m)) * total;
% Gradient for the unregularized intercept term
total = 0;
for index = 1:m
    y_i = y(index);
    x_i = X(index,:);
    h_theta_i = sigmoid(x_i * theta);
    total = total + (h_theta_i - y_i) * X(index,1);
end
grad(1) = total / m;
% Gradient for the remaining, regularized parameters
for theta_index = 2:length(theta)
    total = 0;
    for index = 1:m
        y_i = y(index);
        x_i = X(index,:);
        h_theta_i = sigmoid(x_i * theta);
        total = total + (h_theta_i - y_i) * X(index,theta_index);
    end
    total = total/m;
    grad(theta_index) = total + (lambda/m) * theta(theta_index);
end
% =============================================================
end
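For completeness, here is a sketch of an equivalent vectorized version (costFunctionRegVec is only an illustrative name); it computes the same cost and gradient without explicit loops:
function [J, grad] = costFunctionRegVec(theta, X, y, lambda)
%COSTFUNCTIONREGVEC Vectorized regularized logistic regression cost and gradient
    m = length(y);                      % number of training examples
    h = sigmoid(X * theta);             % m x 1 vector of predictions
    theta_reg = [0; theta(2:end)];      % exclude the bias term from regularization
    J = (1/m) * (-y' * log(h) - (1 - y)' * log(1 - h)) ...
        + (lambda/(2*m)) * (theta_reg' * theta_reg);
    grad = (1/m) * (X' * (h - y)) + (lambda/m) * theta_reg;
end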
Implementation of plotDecisionBoundary
function plotDecisionBoundary(theta, X, y)
%PLOTDECISIONBOUNDARY Plots the data points X and y into a new figure with
%the decision boundary defined by theta
% PLOTDECISIONBOUNDARY(theta, X,y) plots the data points with + for the
% positive examples and o for the negative examples. X is assumed to be
% a either
% 1) Mx3 matrix, where the first column is an all-ones column for the
% intercept.
% 2) MxN, N>3 matrix, where the first column is all-ones
% Plot Data
plotData(X(:,2:3), y);
hold on
if size(X, 2) <= 3
    % Only need 2 points to define a line, so choose two endpoints
    plot_x = [min(X(:,2))-2, max(X(:,2))+2];
    % Calculate the decision boundary line
    plot_y = (-1./theta(3)).*(theta(2).*plot_x + theta(1));
    % Plot, and adjust axes for better viewing
    plot(plot_x, plot_y);
    % Legend, specific for the exercise
    legend('Admitted', 'Not admitted', 'Decision Boundary');
    axis([30, 100, 30, 100]);
else
    % Here is the grid range
    u = linspace(-1, 1.5, 50);
    v = linspace(-1, 1.5, 50);
    z = zeros(length(u), length(v));
    % Evaluate z = theta*x over the grid
    for i = 1:length(u)
        for j = 1:length(v)
            z(i,j) = mapFeature(u(i), v(j))*theta;
        end
    end
    z = z'; % important to transpose z before calling contour
    % Plot z = 0
    % Notice you need to specify the range [0, 0]
    contour(u, v, z, [0, 0], 'LineWidth', 2);
end
hold off
end
Implementation of predict
function p = predict(theta, X)
%PREDICT Predict whether the label is 0 or 1 using learned logistic
%regression parameters theta
% p = PREDICT(theta, X) computes the predictions for X using a
% threshold at 0.5 (i.e., if sigmoid(theta'*x) >= 0.5, predict 1)
m = size(X, 1); % Number of training examples
p = zeros(m, 1);
for index = 1:m
    x_i = X(index,:);
    h_theta_i = sigmoid(x_i * theta);
    if h_theta_i < 0.5
        p(index) = 0;
    else
        p(index) = 1;
    end
end
% =========================================================================
end
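The prediction loop can likewise be collapsed into a single vectorized expression; a minimal sketch (predictVectorized is only an illustrative name):
function p = predictVectorized(theta, X)
%PREDICTVECTORIZED Vectorized 0/1 predictions at a 0.5 threshold
    p = double(sigmoid(X * theta) >= 0.5);
end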