This lab applies a full learning pipeline to a real dataset: images of handwritten digits. You will use a subset of the MNIST dataset for a binary classification task.
1. MNIST Data
- Load the dataset "MNIST_3_5" using load('MNIST_3_5.mat'); to obtain X and Y, the matrices of examples and labels respectively.
- Each row of X is a vectorized 28x28 grayscale image of a handwritten digit from the MNIST dataset. The positive examples are images of '5', the negative ones images of '3'. Visualize some examples from the dataset using the provided visualization function.
- Randomly split the dataset into a training set and a test set of n_train = 100 and n_test = 1000 points respectively, using the provided splitting function.
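The random split can be sketched as follows. This is a Python/NumPy analogue of the MATLAB step (the function name and seed handling are illustrative; X and Y are assumed already loaded, e.g. with scipy.io.loadmat):

```python
import numpy as np

def random_split(X, Y, n_train, n_test, seed=0):
    # Shuffle all indices once, then take disjoint train/test subsets,
    # mirroring the lab's random training/test split.
    rng = np.random.default_rng(seed)
    perm = rng.permutation(len(Y))
    tr = perm[:n_train]
    te = perm[n_train:n_train + n_test]
    return X[tr], Y[tr], X[te], Y[te]
```

Because the permutation is drawn once, the two index sets are guaranteed to be disjoint.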
2. Learning pipeline - Binary classification
- Choose a learning algorithm among those you have already used: kNN, regularized linear least squares, or kernel regularized least squares (Gaussian or polynomial kernel). Compute the classification error on the test set using the function calcErr. To select the best tuning parameters, define a suitable range for each parameter and use hold-out cross-validation (via holdoutCVkNN, holdoutCVRLS, or holdoutCVKernRLS).
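To make the hold-out step concrete, here is a minimal Python sketch of what such a helper might do for regularized linear least squares. This is an illustrative analogue under simple assumptions, not the signature of the course's MATLAB helpers:

```python
import numpy as np

def rls_train(X, Y, lam):
    # Regularized linear least squares: w = (X'X + lam*n*I)^{-1} X'Y.
    n, d = X.shape
    return np.linalg.solve(X.T @ X + lam * n * np.eye(d), X.T @ Y)

def calc_err(Yp, Y):
    # Fraction of sign disagreements (the role played by calcErr in the lab).
    return np.mean(np.sign(Yp) != np.sign(Y))

def holdout_cv_rls(X, Y, lam_range, frac_val=0.3, reps=5, seed=0):
    # Hold-out CV: repeatedly split off a validation fraction of the
    # training set, and keep the lambda with lowest mean validation error.
    rng = np.random.default_rng(seed)
    n = len(Y)
    n_val = int(round(frac_val * n))
    errs = np.zeros(len(lam_range))
    for _ in range(reps):
        perm = rng.permutation(n)
        val, tr = perm[:n_val], perm[n_val:]
        for i, lam in enumerate(lam_range):
            w = rls_train(X[tr], Y[tr], lam)
            errs[i] += calc_err(X[val] @ w, Y[val])
    errs /= reps
    best_lam = lam_range[int(np.argmin(errs))]
    return best_lam, errs
```

The parameters frac_val and reps correspond to the fraction of the training set held out for validation and the number of repetitions, the same hold-out CV settings the lab asks you to reason about below.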
- Predict the labels for the test set and visualize some of the misclassified examples:
ind = find(sign(Yp) ~= sign(Yte)); % indices of misclassified test points
idx = ind(randi(numel(ind))); % pick one of them at random
- (Optional) Compute the mean and standard deviation of the test error over a number of random splits into training and test sets, and for an increasing number of training examples (e.g., from 10 to 400). For each split, compute the test error by choosing suitable parameter range(s) and applying hold-out CV on the training set. What happens to the mean and standard deviation across splits as the number of training examples increases? How do these values depend on the settings of hold-out CV (range of parameters, number of repetitions, and the fraction of the training set used for validation)?
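The structure of this optional study can be sketched as follows. This is a Python analogue in which, for brevity, a trivial nearest-class-mean classifier stands in for your chosen algorithm plus hold-out CV selection; only the repeated-split loop and the error statistics are the point here:

```python
import numpy as np

def nearest_mean_predict(Xtr, Ytr, Xte):
    # Toy stand-in classifier: assign each test point the label of the
    # nearer class mean. Replace with your tuned algorithm in practice.
    mu_pos = Xtr[Ytr > 0].mean(axis=0)
    mu_neg = Xtr[Ytr < 0].mean(axis=0)
    d_pos = np.linalg.norm(Xte - mu_pos, axis=1)
    d_neg = np.linalg.norm(Xte - mu_neg, axis=1)
    return np.where(d_pos < d_neg, 1.0, -1.0)

def error_stats(X, Y, n_train, n_test, n_splits=10, seed=0):
    # Mean and standard deviation of the test error over repeated
    # random training/test splits, for a fixed training-set size.
    rng = np.random.default_rng(seed)
    errs = []
    for _ in range(n_splits):
        perm = rng.permutation(len(Y))
        tr = perm[:n_train]
        te = perm[n_train:n_train + n_test]
        Yp = nearest_mean_predict(X[tr], Y[tr], X[te])
        errs.append(float(np.mean(Yp != Y[te])))
    return float(np.mean(errs)), float(np.std(errs))
```

Calling error_stats for each training-set size in the chosen range (e.g. 10 to 400) and plotting the two returned values against n_train exposes the trend the question asks about.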