## Machine Learning Day

### Lab 1: k-Nearest Neighbors and Cross-validation

This lab is about local methods for binary classification and model selection. The goal is to provide some familiarity with a basic local method algorithm, namely k-Nearest Neighbors (k-NN) and offer some practical insights on the bias-variance trade-off. In addition, it explores a basic method for model selection, namely the selection of parameter k through Cross-validation (CV).

### Getting Started

- Get the code file, add the directory to MATLAB path (or set it as current/working directory).
- Use the editor to write/save and run/debug longer scripts and functions.
- Use the command window to try/test commands, view variables and see the use of functions.
- Use `plot` (for 1D), `imshow`, `imagesc` (for 2D matrices), `scatter`, `scatter3` to visualize variables of different types.
- Work your way through the examples below, following the instructions.

### 1. Data generation

- Use the function `MixGauss` with appropriate parameters to produce a dataset with four classes and 30 samples per class: the classes must live in 2D space and be centered on the corners of the unit square (0,0), (0,1), (1,1), (1,0), all with variance 0.3.
- Obtain a 2-class training set `[X, Y]` by having data on opposite corners share the same class, with labels +1 and -1. *Example*: if you generated the data following the order above, you can use a mapping like `Y = 2*(1/2 - mod(Y, 2));`
- Generate a test set `[Xte, Yte]` from the same distribution, starting with 200 samples per class.
- Visualize both sets using `scatter`.
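The steps above can be sketched as follows. The signature of `MixGauss` is an assumption here (class centers as columns, one standard deviation and sample count per class); check `help MixGauss` for the actual interface.

```matlab
% Data generation sketch -- MixGauss interface is assumed, not confirmed.
means  = [0 0 1 1; 0 1 1 0];               % class centers: corners of the unit square
sigmas = 0.3*ones(1, 4);                   % spread of each Gaussian class
[X, Y]     = MixGauss(means, sigmas, 30);  % training set, labels in {0,1,2,3}
[Xte, Yte] = MixGauss(means, sigmas, 200); % test set from the same distribution

% Map opposite corners to the same class: {0,2} -> +1, {1,3} -> -1.
Y   = 2*(1/2 - mod(Y, 2));
Yte = 2*(1/2 - mod(Yte, 2));

scatter(X(:, 1), X(:, 2), 25, Y, 'filled');  % visualize the 2-class training set
```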

### 2. kNN classification

The k-Nearest Neighbors algorithm (kNN) assigns to a test point the most frequent label among its k closest examples in the training set. Study the code of the function `kNNClassify` (for quick reference type `help kNNClassify`).
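The idea can be sketched as below; this is only an illustration, not the provided `kNNClassify`, whose interface may differ (see `help kNNClassify`). Implicit expansion (MATLAB R2016b+) is assumed in the distance computation, and an odd k avoids ties in the vote.

```matlab
% Minimal kNN sketch for binary labels in {-1, +1}.
function Yp = kNNSketch(Xtr, Ytr, k, Xte)
    nte = size(Xte, 1);
    Yp  = zeros(nte, 1);
    for i = 1:nte
        d = sum((Xtr - Xte(i, :)).^2, 2);  % squared distances to all training points
        [~, idx] = sort(d);                % neighbor indices, closest first
        Yp(i) = sign(sum(Ytr(idx(1:k))));  % majority vote of the k nearest labels
    end
end
```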

- Use `kNNClassify` to generate predictions `Yp` for the 2-class data generated in Section 1. Pick a "reasonable" k.
- Evaluate the classification performance (prediction error) by comparing the estimated labels `Yp` to the true labels `Yte`: `err = sum(Yp ~= Yte)/length(Yte);`
- Visualize the obtained results, e.g., by plotting the wrongly classified points with different colors/markers:

  ```matlab
  scatter(Xte(:, 1), Xte(:, 2), markerSize, Yte);        % color points by "true" label
  hold on;
  l = (Yp ~= Yte);                                       % select the wrong predictions
  scatter(Xte(l, 1), Xte(l, 2), markerSize, Yp(l), 'x'); % overlay them with crosses
  ```

- Use the provided function `separatingFkNN` to visualize the separating function, i.e., the areas of the 2D plane that the classifier associates with each class. Overlay the test points using `scatter`.
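Putting these steps together, assuming `kNNClassify(Xtr, Ytr, k, Xte)` takes the training set, k, and the test points and returns the predicted labels (check `help kNNClassify` for the actual argument order):

```matlab
k   = 5;                            % an arbitrary "reasonable" choice for now
Yp  = kNNClassify(X, Y, k, Xte);    % assumed signature -- verify with help
err = sum(Yp ~= Yte)/length(Yte);   % fraction of misclassified test points
fprintf('k = %d: test error %.3f\n', k, err);
```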

### 3. Parameter selection: what is a good value for k?

So far we considered an arbitrary choice of k. You will now use the provided function `holdoutCVkNN` for model selection (type `help holdoutCVkNN` for an example of use).

- Perform hold-out cross-validation using a percentage of the training set for validation.
- Use a large range of candidate values for k (e.g., k = 1, 3, 5, ..., 21).
- Repeat the process `rep` times, using at each iteration a random fraction `pho` of the training set for validation. Try `rep = 10`, `pho = 0.3`. *Note: with these suggested parameters the hold-out procedure may be quite unstable.*
- Plot the training and validation errors for the different values of k.
- How would you now answer the question "what is the best value for k"?
- How is the value of k affected by `pho` (percentage of points held out) and `rep` (number of repetitions, e.g., 1, 5, 30, 50, 100)? What does a large number of repetitions provide?
- Apply the model obtained by cross-validation (i.e., the best k) to the test set and check whether there is an improvement in the classification error over the result of Part 2.
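The repetition loop that `holdoutCVkNN` wraps can be sketched by hand as below; the call to `kNNClassify` assumes the signature `kNNClassify(Xtr, Ytr, k, Xte)`, which should be checked against `help kNNClassify`.

```matlab
% Hand-rolled hold-out CV sketch (what holdoutCVkNN automates).
klist = 1:2:21;  rep = 10;  pho = 0.3;
n = size(X, 1);  nval = round(pho*n);
valErr = zeros(rep, numel(klist));
for r = 1:rep
    p  = randperm(n);                  % fresh random split at each repetition
    iv = p(1:nval);                    % validation indices
    it = p(nval+1:end);                % training indices
    for j = 1:numel(klist)
        Yp = kNNClassify(X(it, :), Y(it), klist(j), X(iv, :));
        valErr(r, j) = mean(Yp ~= Y(iv));
    end
end
[~, jbest] = min(mean(valErr, 1));     % k minimizing the average validation error
kbest = klist(jbest);
```

Averaging over `rep` splits is what stabilizes the error curves; with a single split the selected k can vary considerably from run to run.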

### 4. (Optional)

- Dependence on training-set size: evaluate the performance as the size of the training set grows, e.g., n = {50, 100, 300, 500, ...}. How would you choose a good range for k as n changes? What can you say about the stability of the solution? Check by repeating the validation multiple times.
- Try classifying more difficult datasets, for instance by increasing the variance or by adding noise, i.e., randomly flipping some labels in the training set.
- Modify the function `kNNClassify` to handle a) multi-class problems and b) regression problems.
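For the regression variant in b), one minimal change, assuming `kNNClassify` internally sorts training points by distance to each test point with `idx` the resulting order, is to replace the majority vote with the average of the k nearest target values:

```matlab
% Inside the per-test-point loop (idx = neighbor indices, closest first):
% classification (labels in {-1,+1}): Yp(i) = sign(sum(Ytr(idx(1:k))));
% regression (real-valued Ytr):       Yp(i) = mean(Ytr(idx(1:k)));
```

For a), a vote over more than two labels can be taken with `mode(Ytr(idx(1:k)))` instead of the sign of the sum.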