This lab applies a full learning pipeline to a real dataset: images of handwritten digits. You will use a subset of the MNIST dataset for a binary classification task.
1. MNIST Data
- Load the dataset "MNIST_3_5" using load('MNIST_3_5.mat'); to obtain X and Y, the matrices of examples and labels respectively.
- Each row of X is a vectorized 28x28 grayscale image of a handwritten digit from the MNIST dataset. The positive examples are images of '5', the negative ones images of '3'. Visualize some examples from the dataset using the provided visualization function.
- Randomly split the dataset into a training set and a test set of n_train = 100 and n_test = 1000 points respectively, using the provided splitting function.
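The random split can be sketched as follows. This is a Python/NumPy analogue of the MATLAB step (the function name and seed handling are illustrative; X and Y are assumed already loaded, e.g. with scipy.io.loadmat):

```python
import numpy as np

def random_split(X, Y, n_train, n_test, seed=0):
    # Shuffle all indices once, then take disjoint train/test subsets,
    # mirroring the lab's random training/test split.
    rng = np.random.default_rng(seed)
    perm = rng.permutation(len(Y))
    tr = perm[:n_train]
    te = perm[n_train:n_train + n_test]
    return X[tr], Y[tr], X[te], Y[te]
```

Because the permutation is drawn once, the two index sets are guaranteed to be disjoint.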
2. Learning pipeline - Binary classification
- Choose a learning algorithm among those you have already used: kNN, regularized linear least squares, or kernel regularized least squares (Gaussian or polynomial kernel). Compute the classification error on the test set using the function calcErr. To select the best tuning parameters, define a suitable range for each parameter and use hold-out cross-validation (via holdoutCVkNN, holdoutCVRLS, or holdoutCVKernRLS).
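To make the hold-out step concrete, here is a minimal Python sketch of what such a helper might do for regularized linear least squares. This is an illustrative analogue under simple assumptions, not the signature of the course's MATLAB helpers:

```python
import numpy as np

def rls_train(X, Y, lam):
    # Regularized linear least squares: w = (X'X + lam*n*I)^{-1} X'Y.
    n, d = X.shape
    return np.linalg.solve(X.T @ X + lam * n * np.eye(d), X.T @ Y)

def calc_err(Yp, Y):
    # Fraction of sign disagreements (the role played by calcErr in the lab).
    return np.mean(np.sign(Yp) != np.sign(Y))

def holdout_cv_rls(X, Y, lam_range, frac_val=0.3, reps=5, seed=0):
    # Hold-out CV: repeatedly split off a validation fraction of the
    # training set, and keep the lambda with lowest mean validation error.
    rng = np.random.default_rng(seed)
    n = len(Y)
    n_val = int(round(frac_val * n))
    errs = np.zeros(len(lam_range))
    for _ in range(reps):
        perm = rng.permutation(n)
        val, tr = perm[:n_val], perm[n_val:]
        for i, lam in enumerate(lam_range):
            w = rls_train(X[tr], Y[tr], lam)
            errs[i] += calc_err(X[val] @ w, Y[val])
    errs /= reps
    best_lam = lam_range[int(np.argmin(errs))]
    return best_lam, errs
```

The parameters frac_val and reps correspond to the fraction of the training set held out for validation and the number of repetitions, the same hold-out CV settings the lab asks you to reason about below.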
- Predict the labels for the test set and visualize some of the misclassified examples:
ind = find(sign(Yp) ~= sign(Yte)); % indices of misclassified test points
idx = ind(randi(numel(ind))); % pick one of them at random
- (Optional) Compute the mean and standard deviation of the test error over a number of random splits into training and test sets, and for an increasing number of training examples (e.g., from 10 to 400). For each split, compute the test error by choosing suitable parameter range(s) and applying hold-out CV on the training set. What happens to the mean and standard deviation across splits as the number of training examples increases? How do these values depend on the settings of hold-out CV (range of parameters, number of repetitions, and the fraction of the training set used for validation)?
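The structure of this optional study can be sketched as follows. This is a Python analogue in which, for brevity, a trivial nearest-class-mean classifier stands in for your chosen algorithm plus hold-out CV selection; only the repeated-split loop and the error statistics are the point here:

```python
import numpy as np

def nearest_mean_predict(Xtr, Ytr, Xte):
    # Toy stand-in classifier: assign each test point the label of the
    # nearer class mean. Replace with your tuned algorithm in practice.
    mu_pos = Xtr[Ytr > 0].mean(axis=0)
    mu_neg = Xtr[Ytr < 0].mean(axis=0)
    d_pos = np.linalg.norm(Xte - mu_pos, axis=1)
    d_neg = np.linalg.norm(Xte - mu_neg, axis=1)
    return np.where(d_pos < d_neg, 1.0, -1.0)

def error_stats(X, Y, n_train, n_test, n_splits=10, seed=0):
    # Mean and standard deviation of the test error over repeated
    # random training/test splits, for a fixed training-set size.
    rng = np.random.default_rng(seed)
    errs = []
    for _ in range(n_splits):
        perm = rng.permutation(len(Y))
        tr = perm[:n_train]
        te = perm[n_train:n_train + n_test]
        Yp = nearest_mean_predict(X[tr], Y[tr], X[te])
        errs.append(float(np.mean(Yp != Y[te])))
    return float(np.mean(errs)), float(np.std(errs))
```

Calling error_stats for each training-set size in the chosen range (e.g. 10 to 400) and plotting the two returned values against n_train exposes the trend the question asks about.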