This lab is about local methods for binary classification on synthetic data. Its goal is to get you familiar with the k-Nearest Neighbors (kNN) algorithm and to give you a practical grasp of what we have discussed in class. Follow the instructions below, and think hard before you call the instructors!
Extract the zip file into a folder and set the Matlab path to that folder.
Call MixGauss with appropriate parameters to produce a dataset with four classes: the classes must live in the 2D space and be centered on the corners of the unit square (0,0), (0,1), (1,1), (1,0), all with standard deviation 0.25.
[Xtr,Ytr]=MixGauss(....)
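For instance, assuming MixGauss takes a matrix of class centers (one per column), a vector of standard deviations, and the number of points per class (check "help MixGauss" for the actual signature), the call might look like:
[Xtr,Ytr] = MixGauss([0 0; 0 1; 1 1; 1 0]', 0.25*ones(4,1), 100); % 100 points per class is an arbitrary choice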
Use the Matlab function "scatter" to plot the points.
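For example, coloring each point by its class label:
scatter(Xtr(:,1), Xtr(:,2), 25, Ytr); % one color per class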
Manipulate the data so as to obtain a 2-class problem where data on opposite corners share the same class, with labels +1 and -1 (if you produced the data following the order of the centers given above, you may use Ytr = mod(Ytr,2)*2 - 1).
Similarly, generate a test set [Xte,Yte].
The k-Nearest Neighbors algorithm (kNN) assigns to a test point the most frequent label of its k closest examples in the training set.
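For intuition, here is a minimal sketch of the idea (an illustration only; the provided kNNClassify may differ in interface and implementation):
function Ypred = myKNN(Xtr, Ytr, k, Xte)
% Minimal kNN sketch: majority vote among the k nearest training points.
nTe = size(Xte, 1);
Ypred = zeros(nTe, 1);
for i = 1:nTe
    % squared Euclidean distances from the i-th test point to all training points
    d = sum(bsxfun(@minus, Xtr, Xte(i,:)).^2, 2);
    [~, idx] = sort(d);
    Ypred(i) = mode(Ytr(idx(1:k))); % most frequent label among the k nearest
end
end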
2.A Have a look at the code of the function kNNClassify (for a quick reference, type "help kNNClassify" in the Matlab shell).
2.B Use kNNClassify on the 2-class data generated above. Pick a "reasonable" k.
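Assuming a signature like Ypred = kNNClassify(Xtr, Ytr, k, Xte) (check the help to confirm), a first try could be:
Ypred = kNNClassify(Xtr, Ytr, 5, Xte); % k = 5 is just a starting guess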
2.C Think of how to plot the data to get a glimpse of the obtained results. A possible way is:
figure;
scatter(Xte(:,1),Xte(:,2),25,Yte); % plot test points (empty circles), one color per true label
hold on
sel = (Ypred ~= Yte); % Ypred computed in step 2.B
scatter(Xte(sel,1),Xte(sel,2),200,Yte(sel),'x'); % mark wrongly predicted test points with crosses
2.D To evaluate the classification performance, compare the estimated outputs with those previously generated. In Matlab:
sum(Ypred~=Yte)/Nt % Nt is the number of test points
2.E To visualize the separating function (and thus get a more general view of what areas are associated with each class) you may use the routine separatingFkNN (type "help separatingFkNN" in the Matlab shell; if you still have doubts on how to use it, have a look at the code).
So far, we have considered an arbitrary k.
3.A Perform a hold-out cross-validation procedure on the available training data. You may want to use the function holdoutCVkNN available in the zip file (here again, type "help holdoutCVkNN" in the Matlab shell; you will find there a useful usage example).
Plot the training and validation errors for the different values of k.
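If you want to see the procedure spelled out, the following hand-rolled hold-out loop illustrates the same idea that holdoutCVkNN packages (the kNNClassify signature is assumed as above; p, rep, and the range of k are arbitrary choices):
p = 0.3; rep = 10;                     % held-out fraction and number of repetitions
kList = 1:2:21;                        % candidate values of k
n = size(Xtr, 1); nVal = round(p*n);
trErr = zeros(numel(kList), 1); valErr = zeros(numel(kList), 1);
for r = 1:rep
    perm = randperm(n);
    iVal = perm(1:nVal); iTr = perm(nVal+1:end);
    for j = 1:numel(kList)
        YpTr  = kNNClassify(Xtr(iTr,:), Ytr(iTr), kList(j), Xtr(iTr,:));
        YpVal = kNNClassify(Xtr(iTr,:), Ytr(iTr), kList(j), Xtr(iVal,:));
        trErr(j)  = trErr(j)  + mean(YpTr  ~= Ytr(iTr))/rep;  % average over repetitions
        valErr(j) = valErr(j) + mean(YpVal ~= Ytr(iVal))/rep;
    end
end
figure; plot(kList, trErr, 'b', kList, valErr, 'r');
xlabel('k'); legend('training error', 'validation error');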
3.B Can you now answer the question "what is the best value for k"?
3.C What happens with different values of p (the percentage of points held out) and rep (the number of repetitions of the experiment)?
3.D Test the model selected by cross-validation: apply kNN with the best k to a separate test set (e.g., the Xte generated before) and check whether the classification error improves with respect to what you got in section 2.D.
4.A Evaluate the results as the size of the training set grows: n = 10, 20, 50, 100, 300, ... (of course, k needs to be chosen accordingly).
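A possible skeleton, reusing the assumed MixGauss and kNNClassify signatures from above (the rule of thumb k grows like the square root of the training size is just one common heuristic):
sizes = [10 20 50 100 300];            % points per class
err = zeros(size(sizes));
for i = 1:numel(sizes)
    [Xn, Yn] = MixGauss([0 0; 0 1; 1 1; 1 0]', 0.25*ones(4,1), sizes(i));
    Yn = mod(Yn, 2)*2 - 1;             % same 2-class relabeling as before
    k = round(sqrt(4*sizes(i)));       % heuristic choice of k
    Ypred = kNNClassify(Xn, Yn, k, Xte);
    err(i) = mean(Ypred ~= Yte);       % test error on the fixed test set
end
figure; plot(4*sizes, err); xlabel('training set size'); ylabel('test error');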
4.B Generate more complex datasets with the MixGauss function, for instance by choosing a larger variance in the data generation.
4.C You may also add noise to the data by randomly flipping a percentage of the labels in the training set.
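For instance, to flip a given fraction of the training labels (10% here, as an arbitrary choice; this relies on the labels being +1/-1):
flip = 0.1;                            % fraction of labels to flip
idx = randperm(numel(Ytr), round(flip*numel(Ytr)));
Ytr(idx) = -Ytr(idx);                  % works because labels are +1/-1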
4.D Modify the kNNClassify function to handle multiclass problems.
4.E Modify the kNNClassify function to handle regression problems.
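As a hint (one possible approach, not necessarily how the provided code is organized): in a loop-based implementation such as the myKNN sketch above, only the line that assigns the prediction needs to change.
Ypred(i) = mode(Ytr(idx(1:k)));        % classification: majority vote (already handles multiclass labels)
Ypred(i) = mean(Ytr(idx(1:k)));        % regression: average of the k nearest training outputs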