Machine Learning Day

Lab 1: k-Nearest Neighbors and Cross-validation

This lab is about local methods for binary classification and model selection. The goal is to provide some familiarity with a basic local method, namely k-Nearest Neighbors (k-NN), and to offer some practical insight into the bias-variance trade-off. In addition, it explores a basic method for model selection, namely the choice of the parameter k through Cross-validation (CV).

Getting Started

  • Get the code file, add the directory to MATLAB path (or set it as current/working directory).
  • Use the editor to write/save and run/debug longer scripts and functions.
  • Use the command window to try/test commands, view variables and see the use of functions.
  • Use plot (for 1D data), imshow and imagesc (for 2D matrices), and scatter and scatter3 to visualize variables of different types.
  • Work your way through the examples below, by following the instructions.

1. Data generation

  1. Use function MixGauss with appropriate parameters to produce a dataset with four classes and 30 samples per class: the classes must live in the 2D space and be centered on the corners of the unit square (0,0), (0,1), (1,1), (1,0), all with variance 0.3.
  2. Obtain a 2-class training set [X, Y] by making data on opposite corners share the same class, with labels +1 and -1. Example: if you generated the data following the order above, you can use a mapping like Y = 2*(1/2-mod(Y, 2));
  3. Generate a test set [Xte, Yte] from the same distribution, starting with 200 samples per class.
  4. Visualize both sets using scatter.
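The four steps above can be sketched as follows. Note that the exact interface of MixGauss is an assumption here (centers as columns of a matrix, one sigma per class, samples per class); check help MixGauss, and in particular whether it expects a standard deviation or a variance.

```matlab
% Sketch of Section 1, steps 1-4. Assumes MixGauss(means, sigmas, n)
% returns [X, Y] with one integer label per class, in the order the
% centers are listed -- check help MixGauss before relying on this.
means  = [0 0; 0 1; 1 1; 1 0]';          % centers as columns: corners of the unit square
sigmas = sqrt(0.3) * ones(4, 1);         % use 0.3*ones(4,1) instead if MixGauss expects a variance

[X, Y]     = MixGauss(means, sigmas, 30);    % training set, 30 samples per class
[Xte, Yte] = MixGauss(means, sigmas, 200);   % test set, 200 samples per class

% Map opposite corners to the same class, with labels +1 / -1
Y   = 2 * (1/2 - mod(Y, 2));
Yte = 2 * (1/2 - mod(Yte, 2));

% Visualize both sets, coloring points by label
figure; scatter(X(:, 1), X(:, 2), 25, Y, 'filled');     title('Training set');
figure; scatter(Xte(:, 1), Xte(:, 2), 25, Yte, 'filled'); title('Test set');
```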

2. kNN classification

The k-Nearest Neighbors algorithm (kNN) assigns to a test point the most frequent label of its k closest examples in the training set. Study the code of function kNNClassify (for quick reference type help kNNClassify).
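To fix ideas before reading kNNClassify, here is a minimal sketch of the algorithm just described (illustrative only; the provided kNNClassify may differ in interface and in how it handles ties):

```matlab
% Minimal k-NN sketch for binary +/-1 labels.
% Xtr: n x d training points, Ytr: n x 1 labels, Xte: m x d test points.
% Requires MATLAB R2016b+ for the implicit expansion in Xtr - Xte(i,:).
function Yp = myKNN(Xtr, Ytr, Xte, k)
    m = size(Xte, 1);
    Yp = zeros(m, 1);
    for i = 1:m
        d2 = sum((Xtr - Xte(i, :)).^2, 2);   % squared distances to all training points
        [~, idx] = sort(d2);                 % nearest first
        Yp(i) = sign(sum(Ytr(idx(1:k))));    % majority vote; use odd k to avoid ties
    end
end
```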

  1. Use kNNClassify to generate predictions Yp for the 2-class data generated in Section 1. Pick a "reasonable" k.
  2. Evaluate the classification performance (prediction error) by comparing the estimated labels Yp to the true labels Yte: err = sum(Yp~=Yte)/length(Yte);
  3. Visualize the obtained results, e.g. by plotting the wrongly classified points using different colors/markers:
    scatter(Xte(:, 1), Xte(:, 2), markerSize, Yte); % color points by "true" label
    l = (Yp ~= Yte); % specify wrong predictions
    scatter(Xte(l, 1), Xte(l, 2), markerSize, Yp(l), 'x'); % color them
  4. Use the provided function separatingFkNN to visualize the separating function, or the areas of the 2D plane that are associated by the classifier with each class. Overlay the test points using scatter.
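In case you want to see how a separating-function plot can be built by hand (rather than with separatingFkNN), one common approach is to classify a dense grid of points and color the plane accordingly. The argument order of kNNClassify below is an assumption; check help kNNClassify.

```matlab
% Sketch: decision regions by classifying a grid, then overlaying test points.
k = 5;                                        % example value
[xg, yg] = meshgrid(-1:0.05:2, -1:0.05:2);    % dense grid covering the data
Xgrid = [xg(:), yg(:)];
Zg = kNNClassify(X, Y, k, Xgrid);             % argument order assumed; see help kNNClassify
imagesc([-1 2], [-1 2], reshape(Zg, size(xg)));
axis xy; hold on;                             % axis xy: undo imagesc's y-flip
scatter(Xte(:, 1), Xte(:, 2), 25, Yte, 'filled');
```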

3. Parameter selection: what is a good value for k?

So far we considered an arbitrary choice for k. You will now use the provided function holdoutCVkNN for model selection (type help holdoutCVkNN for an example use).

  1. Perform hold-out cross-validation using a percentage of the training set for validation. Note: for the suggested parameters rep=10 and pho=0.3, the hold-out procedure may be quite unstable.
    • Use a large range of candidate values for k (e.g. k = 1, 3, 5, ..., 21).
    • Repeat the process rep times, using at each iteration a random fraction pho of the training set for validation. Try rep=10, pho=0.3.
    • Plot the training and validation errors for the different values of k.
    • How would you now answer the question "what is the best value for k"?
  2. How is the value of k affected by pho (percentage of points held out) and rep (number of repetitions e.g., 1, 5, 30, 50, 100)? What does a large number of repetitions provide?
  3. Apply the model obtained by cross-validation (i.e., best k) to the test set and check if there is an improvement on the classification error over the result of Part 2.
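The hold-out procedure of step 1 can be hand-rolled as below; holdoutCVkNN packages the same idea, and its interface may differ (see help holdoutCVkNN). The kNNClassify argument order is again an assumption.

```matlab
% Sketch of repeated hold-out CV over a range of k values.
krange = 1:2:21;  rep = 10;  pho = 0.3;
n = size(X, 1);  nval = round(pho * n);
trErr = zeros(rep, numel(krange));  vlErr = zeros(rep, numel(krange));

for r = 1:rep
    p  = randperm(n);                         % fresh random split each repetition
    iv = p(1:nval);  it = p(nval+1:end);      % validation / training indices
    for j = 1:numel(krange)
        k = krange(j);
        Ypt = kNNClassify(X(it,:), Y(it), k, X(it,:));  % argument order assumed
        Ypv = kNNClassify(X(it,:), Y(it), k, X(iv,:));
        trErr(r, j) = mean(Ypt ~= Y(it));     % training error
        vlErr(r, j) = mean(Ypv ~= Y(iv));     % validation error
    end
end

% Average over repetitions and plot both curves
plot(krange, mean(trErr, 1), '-o', krange, mean(vlErr, 1), '-s');
xlabel('k'); ylabel('error'); legend('training', 'validation');
[~, jbest] = min(mean(vlErr, 1));  bestK = krange(jbest);
```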

4. (Optional)

  1. Dependence on training size: Evaluate the performance as the size of the training set grows, e.g., n = {50, 100, 300, 500,...}. How would you choose a good range for k as n changes? What can you say about the stability of the solution? Check by repeating the validation multiple times.
  2. Try classifying more difficult datasets, for instance, by increasing the variance or adding noise by randomly flipping the labels on the training set.
  3. Modify the function kNNClassify to handle a) multi-class problems and b) regression problems.
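For exercise 3b, the only change needed to the prediction rule is replacing the majority vote with an average of the neighbors' targets (for 3a, a multi-class vote can use mode on the neighbor labels instead). A sketch, independent of the provided kNNClassify:

```matlab
% Sketch of the regression variant of k-NN: predict the mean of the
% k nearest training targets. Xtr: n x d, Ytr: n x 1, Xte: m x d.
function Yp = kNNRegress(Xtr, Ytr, Xte, k)
    m = size(Xte, 1);
    Yp = zeros(m, 1);
    for i = 1:m
        d2 = sum((Xtr - Xte(i, :)).^2, 2);   % squared distances
        [~, idx] = sort(d2);                 % nearest first
        Yp(i) = mean(Ytr(idx(1:k)));         % average of neighbor targets
        % multi-class variant: Yp(i) = mode(Ytr(idx(1:k)));
    end
end
```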