Follow the instructions below. Think hard before you call the instructors!
Download the zip file and unzip it in a local folder.
Set the Matlab path to include the local folder.
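For example (the folder name below is a placeholder for wherever you unzipped the material):

% Add the unzipped lab folder and its subfolders to the Matlab path.
addpath(genpath('lab_folder'));   % 'lab_folder' is a placeholder path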
1.A Load the dataset MNIST_3_5 by using the lines provided with the lab material; this produces two matrices, X and Y, the first containing the examples and the second the labels.
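If the loading lines are not at hand, here is a minimal sketch, assuming the archive stores the data as a MAT-file named MNIST_3_5.mat that defines X and Y (the file name is an assumption):

% Load the 3-vs-5 MNIST subset; the MAT-file name is assumed.
load('MNIST_3_5.mat');   % defines X (one example per row) and Y (labels)
size(X)                  % n_examples x 784 (28x28 images, vectorized)
size(Y)                  % n_examples x 1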
1.B Each row of X is a vectorized 28x28 grayscale image of a handwritten digit from the MNIST dataset. In particular, the positive examples are '5' while the negative ones are '3'. Visualize some examples from the dataset by using the function visualizeExample (check the function documentation to see how it works).
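For instance (assuming the labels are encoded as +1 for '5' and -1 for '3'; check the lab material if a different encoding is used):

% Show the first positive ('5') and the first negative ('3') example.
figure; visualizeExample(X(find(Y == 1, 1), :));
figure; visualizeExample(X(find(Y == -1, 1), :));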
1.C Use the function
[Xtr, Ytr, Xts, Yts] = randomSplitDataset(X, Y, n_train, n_test);
to split the dataset into a training set of 100 points and a test set of 1000 points (n_train is the number of examples in the training set).
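For example:

% 100 training examples, 1000 test examples.
[Xtr, Ytr, Xts, Yts] = randomSplitDataset(X, Y, 100, 1000);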
1.D Choose a learning algorithm among kNN, regularized linear least squares, regularized kernel least squares, logistic regression, kernel logistic regression. Use the split of point 1.C to compute the classification error on the test set using the function calcErr. To select the best hyperparameters, define a suitable range for the parameters and then use holdoutCV.
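To see what hold-out validation does, here is an explicit sketch of selecting k for kNN on a single validation split (the candidate range, the 30% validation fraction and the +1/-1 label encoding are assumptions; in the lab, use holdoutCV, which typically repeats this over several random validation splits):

% Explicit hold-out selection of k for kNN (illustrative sketch only).
ks = 1:2:15;                                  % candidate values of k (assumed range)
n = size(Xtr, 1);
nval = round(0.3 * n);                        % 30% of the training set for validation
perm = randperm(n);
Xv = Xtr(perm(1:nval), :);      Yv = Ytr(perm(1:nval));
Xt = Xtr(perm(nval+1:end), :);  Yt = Ytr(perm(nval+1:end));

% Squared Euclidean distances between validation and training points.
D2 = sum(Xv.^2, 2) + sum(Xt.^2, 2)' - 2 * (Xv * Xt');
[~, nn] = sort(D2, 2);                        % training points sorted by distance

valErr = zeros(size(ks));
for j = 1:numel(ks)
    Ypred = sign(sum(Yt(nn(:, 1:ks(j))), 2)); % majority vote (labels assumed +/-1)
    valErr(j) = mean(Ypred ~= Yv);            % validation classification error
end
[~, best] = min(valErr);
k_best = ks(best);                            % selected hyperparameter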
1.E Predict the labels for the test set, and then use the following code to visualize some of the misclassified examples.
% Ypred: predicted test labels from point 1.D/1.E; Yts: true test labels.
I = (Ypred ~= Yts);                % logical mask of the misclassified test examples
ind = find(I);                     % their indices
nel = numel(ind);
for i = 1:6
    figure;
    idx = ind(randi(nel));         % pick one misclassified example at random
    visualizeExample(Xts(idx, :)); % and display it
end
1.F Repeat points 1.D and 1.E for the other algorithms.
2.A Choose an algorithm and complete the table below by computing the mean and standard deviation of the test error over a number of random training/test splits (e.g. 20, or as many as your machine can handle in a reasonable time) and for an increasing number of training examples.
For each split, compute the classification error on the test set (n_test = 1000), choosing a suitable parameter range and using the parameters found with holdoutCV on the training set. To compute the mean and standard deviation, use the functions mean and std, respectively; a sketch of the loop over splits is given after the table.
n_train | mean of test error | std of test error
--------|--------------------|-------------------
     10 |                    |
     20 |                    |
     50 |                    |
    100 |                    |
    200 |                    |
    400 |                    |
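A minimal sketch for filling one row of the table (the 1-NN classifier is only a stand-in, and nSplits = 20 is the suggested number of repetitions; in practice plug in the algorithm and the holdoutCV-based parameter selection from point 1.D inside the loop):

% Mean and standard deviation of the test error over random splits.
n_train = 100;  n_test = 1000;  nSplits = 20;
testErr = zeros(nSplits, 1);
for s = 1:nSplits
    [Xtr, Ytr, Xts, Yts] = randomSplitDataset(X, Y, n_train, n_test);
    % Stand-in classifier: 1-nearest neighbour (replace with your algorithm
    % from 1.D, selecting its hyperparameters with holdoutCV on Xtr, Ytr).
    D2 = sum(Xts.^2, 2) + sum(Xtr.^2, 2)' - 2 * (Xts * Xtr');
    [~, nn] = min(D2, [], 2);
    Ypred = Ytr(nn);
    testErr(s) = mean(Ypred ~= Yts);          % classification error (cf. calcErr)
end
fprintf('n_train = %d: mean error = %.3f, std = %.3f\n', ...
        n_train, mean(testErr), std(testErr));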
What happens to the mean and the standard deviation when the number of examples in the training set increases?
How do the mean and the standard deviation depend on the settings of holdoutCV (the parameter range, the number of repetitions, and the fraction of the training set used for validation)?
3.A Repeat exercise 2.A for the other algorithms.