## Machine Learning Day

### Lab 1: k-Nearest Neighbors and Cross-validation

This lab covers local methods for binary classification and model selection. The goal is to build familiarity with a basic local method, k-Nearest Neighbors (k-NN), and to offer practical insight into the bias-variance trade-off. It also explores a basic approach to model selection: choosing the parameter k through cross-validation (CV).

### Getting Started

• Get the code file and add its directory to the MATLAB path (or set it as the current/working directory).
• Use the editor to write/save and run/debug longer scripts and functions.
• Use the command window to try/test commands, view variables and see the use of functions.
• Use `plot` (for 1D), `imshow`, `imagesc` (for 2D matrices), `scatter`, `scatter3` to visualize variables of different types.
• Work your way through the examples below, by following the instructions.

### 1. Data generation

1. Use the function `MixGauss` with appropriate parameters to produce a dataset with four classes and 30 samples per class. The classes must live in 2D space and be centered on the corners of the unit square, (0,0), (0,1), (1,1), (1,0), all with variance 0.3.
2. Obtain a 2-class training set `[X, Y]` by assigning the data on opposite corners to the same class, with labels +1 and -1. Example: if you generated the data in the order above, with consecutive integer class labels, you can use a mapping like `Y = 2*(1/2-mod(Y, 2));`
3. Generate a test set `[Xte, Yte]` from the same distribution, starting with 200 samples per class.
4. Visualize both sets using `scatter`.
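
The steps above can be sketched without the provided helper. The signature of `MixGauss` is not shown in this handout, so the snippet below draws the samples directly with `randn`. It reads 0.3 as a variance (standard deviation `sqrt(0.3)`), which is an assumption; the names `means`, `sigma` and `C` are hypothetical and only serve the illustration.

```matlab
% Hypothetical stand-in for MixGauss: isotropic Gaussian blobs around given means.
means = [0 0; 0 1; 1 1; 1 0];   % one row per class: corners of the unit square
sigma = sqrt(0.3);              % standard deviation, reading 0.3 as a variance (assumption)
n = 30;                         % samples per class

numClasses = size(means, 1);
X = zeros(numClasses * n, 2);
C = zeros(numClasses * n, 1);   % original class index, 1..4
for c = 1:numClasses
    rows = (c - 1) * n + (1:n);
    X(rows, :) = repmat(means(c, :), n, 1) + sigma * randn(n, 2);
    C(rows) = c;
end

% Opposite corners share a label: classes 1 and 3 -> -1, classes 2 and 4 -> +1.
Y = 2 * (1/2 - mod(C, 2));
```

Running the same code with `n = 200` gives a test set from the same distribution, and `scatter(X(:, 1), X(:, 2), 25, Y)` colors the points by label.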

### 2. kNN classification

The k-Nearest Neighbors algorithm (kNN) assigns to a test point the most frequent label of its k closest examples in the training set. Study the code of function `kNNClassify` (for quick reference type `help kNNClassify`).
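
As a minimal sketch of the voting rule just described (not the code of `kNNClassify`, whose implementation should be read directly), the snippet below classifies two query points on a tiny hand-made dataset. It assumes labels in {-1, +1} and an odd k, so that the sign of the summed neighbor labels gives the majority vote. All variable names are illustrative.

```matlab
% Toy data: two well-separated clusters with labels -1 and +1.
Xtr = [0 0; 0.1 0; 0 0.1; 1 1; 0.9 1; 1 0.9];
Ytr = [-1; -1; -1; 1; 1; 1];
Xq  = [0.05 0.05; 0.95 0.95];   % query points to classify
k   = 3;                        % odd k avoids ties between the two labels

% Squared Euclidean distances: one row per query point, one column per training point.
D = bsxfun(@plus, sum(Xq.^2, 2), sum(Xtr.^2, 2)') - 2 * (Xq * Xtr');

[~, order] = sort(D, 2);                                % nearest training points first
neighborLabels = reshape(Ytr(order(:, 1:k)), [], k);    % labels of the k nearest neighbors
Yp = sign(sum(neighborLabels, 2));                      % majority vote for labels in {-1, +1}
```

Here `Yp` comes out as `[-1; 1]`, since each query point sits inside one of the two clusters.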

1. Use `kNNClassify` to generate predictions `Yp` on the test set for the 2-class data generated in Section 1. Pick a "reasonable" k.
2. Evaluate the classification performance (prediction error) by comparing the estimated labels `Yp` to the true labels `Yte`: `err = sum(Yp~=Yte)/length(Yte);`
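3. Visualize the obtained results, e.g. by plotting the wrongly classified points using different colors/markers:
   ```matlab
   markerSize = 36;                                   % marker area in points^2; adjust to taste
   scatter(Xte(:, 1), Xte(:, 2), markerSize, Yte);    % color points by "true" label
   hold on;                                           % keep the first plot when adding the second
   wrong = (Yp ~= Yte);                               % logical mask of wrong predictions
   scatter(Xte(wrong, 1), Xte(wrong, 2), markerSize, Yp(wrong), 'x');  % mark them, colored by predicted label
   ```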
4. Use the provided function `separatingFkNN` to visualize the separating function, or the areas of the 2D plane that are associated by the classifier with each class. Overlay the test points using `scatter`.
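
How `separatingFkNN` works internally is not described here. One common way to obtain such a picture, sketched below under that caveat, is to classify every node of a regular grid and display the resulting label map. The grid range, the resolution and the toy training set are all assumptions made for the illustration.

```matlab
% Toy training set (hypothetical stand-in for the lab's [X, Y]).
Xtr = [0 0; 1 1; 0 1; 1 0];
Ytr = [-1; -1; 1; 1];
k   = 1;

% Regular grid covering the data with a small margin.
gx = linspace(-0.5, 1.5, 41);
gy = linspace(-0.5, 1.5, 41);
[GX, GY] = meshgrid(gx, gy);
Xgrid = [GX(:), GY(:)];

% kNN prediction at every grid node: majority vote over labels in {-1, +1}.
D = bsxfun(@plus, sum(Xgrid.^2, 2), sum(Xtr.^2, 2)') - 2 * (Xgrid * Xtr');
[~, order] = sort(D, 2);
Z = sign(sum(reshape(Ytr(order(:, 1:k)), [], k), 2));
Z = reshape(Z, size(GX));   % label map: rows follow gy, columns follow gx
```

The map can then be shown with `imagesc(gx, gy, Z); axis xy; hold on;` followed by a `scatter` of the test points on top.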

### 3. Parameter selection: what is a good value for k?

So far we considered an arbitrary choice for k. You will now use the provided function `holdoutCVkNN` for model selection (type `help holdoutCVkNN` for an example use).

1. Perform hold-out cross-validation using a fraction of the training set for validation. Note: with the suggested parameters `rep=10` and `pho=0.3`, the hold-out procedure may be quite unstable.
   • Use a large range of candidate values for k (e.g. k = 1, 3, 5, ..., 21).
   • Repeat the process `rep` times, using at each iteration a random fraction `pho` of the training set for validation. Try `rep=10, pho=0.3`.
   • Plot the training and validation errors for the different values of k.
   • How would you now answer the question "what is the best value for k"?
2. How is the selected value of k affected by `pho` (fraction of points held out) and `rep` (number of repetitions, e.g. 1, 5, 30, 50, 100)? What does a large number of repetitions provide?
3. Apply the model obtained by cross-validation (i.e., the best k) to the test set and check whether the classification error improves over the result of Section 2.
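
The hold-out procedure can also be written out by hand, which makes the roles of `rep` and `pho` explicit. The following is a sketch under stated assumptions, not the implementation of `holdoutCVkNN`: it uses a toy two-cluster dataset, labels in {-1, +1}, odd values of k, and hypothetical variable names such as `kList`, `valErr` and `trErr`.

```matlab
% Toy 2-class data (hypothetical stand-in for the lab's training set).
n = 60;
X = [0.3 * randn(n, 2); 0.3 * randn(n, 2) + 2];
Y = [-ones(n, 1); ones(n, 1)];

kList = 1:2:21;   % candidate values for k (all odd, so votes cannot tie)
rep   = 10;       % number of random train/validation splits
pho   = 0.3;      % fraction of the training set held out for validation

N = size(X, 1);
nVal = round(pho * N);
valErr = zeros(rep, numel(kList));
trErr  = zeros(rep, numel(kList));

for r = 1:rep
    perm = randperm(N);
    iVal = perm(1:nVal);
    iTr  = perm(nVal + 1:end);
    Xtr = X(iTr, :);   Ytr = Y(iTr);
    Xva = X(iVal, :);  Yva = Y(iVal);

    % Sort the training points by distance once per split; reuse for every k.
    sqn  = sum(Xtr.^2, 2)';
    Dval = bsxfun(@plus, sum(Xva.^2, 2), sqn) - 2 * (Xva * Xtr');
    Dtr  = bsxfun(@plus, sqn', sqn) - 2 * (Xtr * Xtr');
    [~, oVal] = sort(Dval, 2);
    [~, oTr]  = sort(Dtr, 2);

    for j = 1:numel(kList)
        k = kList(j);
        YpVal = sign(sum(reshape(Ytr(oVal(:, 1:k)), [], k), 2));
        YpTr  = sign(sum(reshape(Ytr(oTr(:, 1:k)), [], k), 2));
        valErr(r, j) = mean(YpVal ~= Yva);
        trErr(r, j)  = mean(YpTr ~= Ytr);
    end
end

meanVal = mean(valErr, 1);      % average validation error for each k
meanTr  = mean(trErr, 1);       % average training error for each k
[~, jBest] = min(meanVal);
bestK = kList(jBest);           % k with the lowest average validation error
```

Plotting `meanTr` and `meanVal` against `kList` shows the two error curves. The training error for k = 1 is zero by construction, because each training point is its own nearest neighbor, which is why the training error alone cannot be used to choose k.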

### 4. (Optional)

1. Dependence on training size: Evaluate the performance as the size of the training set grows, e.g., n = {50, 100, 300, 500,...}. How would you choose a good range for k as n changes? What can you say about the stability of the solution? Check by repeating the validation multiple times.
2. Try classifying more difficult datasets, for instance, by increasing the variance or adding noise by randomly flipping the labels on the training set.
3. Modify the function `kNNClassify` to handle a) multi-class problems and b) regression problems.
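
For the last item, only the way neighbor labels are combined needs to change. The sketch below shows one possible choice on a toy 1D dataset: `mode` for multi-class voting and `mean` for regression. It is an illustration under those assumptions rather than the expected solution, and all names are hypothetical.

```matlab
% Toy training set: three classes along a line, plus a real-valued target.
Xtr  = [0; 0.1; 0.2; 1; 1.1; 1.2; 2; 2.1; 2.2];
Ycls = [1; 1; 1; 2; 2; 2; 3; 3; 3];   % multi-class labels
Yreg = 2 * Xtr;                        % regression targets
Xq   = [0.05; 1.05; 2.05];             % query points
k    = 3;

% Squared Euclidean distances, query-by-training (works for any input dimension).
D = bsxfun(@plus, sum(Xq.^2, 2), sum(Xtr.^2, 2)') - 2 * (Xq * Xtr');
[~, order] = sort(D, 2);
nn = order(:, 1:k);                    % indices of the k nearest training points

% a) Multi-class: most frequent label among the k neighbors.
YpCls = mode(reshape(Ycls(nn), [], k), 2);

% b) Regression: average target of the k neighbors.
YpReg = mean(reshape(Yreg(nn), [], k), 2);
```

Note that `mode` resolves ties by returning the smallest of the tied values, so with more than two classes an odd k no longer guarantees a unique winner.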