## Machine Learning Day

### Lab 3: Dimensionality reduction and feature selection

In this lab we will look into the problems of dimensionality reduction through Principal Component Analysis (PCA) and feature selection through Orthogonal Matching Pursuit (OMP).

### 1. Data generation

Generate a 2-class dataset of D-dimensional points with N points for each class. Start with N = 100 and D = 30 and create a train and a test set.

1. The first two variables (dimensions) of each point will be generated by `MixGauss`, i.e., drawn from two Gaussian distributions with centroids (1, 1) and (-1, -1) and standard deviation 0.7 each. Set the output labels of the two classes to {1, -1}, respectively, and plot these two variables in the 2D plane using `scatter(X(:,1), X(:,2), markerSize, Y);`.
2. The remaining (D-2) variables will be generated by Gaussian noise with `sigma_noise = 0.01` and `X_noise = sigma_noise*randn(2*N, D-2);`.
3. The final D-dimensional train and test data matrices will then be given by `X = [X X_noise];`, of size 2N x D, storing the 2N instances (N per class) in a D-dimensional space.
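The lab code is MATLAB; as a rough illustration, the generation steps above can be sketched in Python/NumPy. The `mix_gauss` helper below is a hypothetical stand-in for the lab's `MixGauss` function:

```python
import numpy as np

rng = np.random.default_rng(0)

def mix_gauss(means, sigma, n):
    """Draw n points per class from isotropic Gaussians centered at the
    given means; return labels in {1, -1} (a stand-in for MixGauss)."""
    X = np.vstack([m + sigma * rng.standard_normal((n, len(m))) for m in means])
    Y = np.concatenate([np.full(n, 1), np.full(n, -1)])
    return X, Y

N, D, sigma_noise = 100, 30, 0.01
# Two informative dimensions: blobs at (1, 1) and (-1, -1), sigma 0.7.
X2, Y = mix_gauss([np.array([1.0, 1.0]), np.array([-1.0, -1.0])], 0.7, N)
# The remaining D-2 dimensions are low-variance Gaussian noise.
X_noise = sigma_noise * rng.standard_normal((2 * N, D - 2))
X = np.hstack([X2, X_noise])   # final 2N x D data matrix
print(X.shape)                 # (200, 30)
```

The same recipe, with a fresh draw, produces the test set.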

### 2. Principal Component Analysis

1. Compute the principal components of the training set, using the provided function `PCA`. Select the number of principal components by setting `k<D`.
2. For the data projected on the subspace of the first k components, `X_p`, plot the first dimension (as points on a line, e.g., `plot(X_p(:,1), zeros(size(X_p,1),1), 'o');`), the first two (`scatter(X_p(:,1), X_p(:,2), markerSize, Ytr);`) and the first three (`scatter3(X_p(:,1), X_p(:,2), X_p(:,3), markerSize, Ytr);`).
3. How would you interpret the meaning of the resulting plots? What is the effective dimensionality of this dataset?
4. Visualize the square root of the first k = 10 eigenvalues. Can you infer the effective dimensionality of the dataset from this distribution?
5. Visualize the eigenvectors associated with the largest (e.g., `scatter(1:D, abs(V(:,1)))`), second-largest and third-largest eigenvalues.
6. Repeat the above with a dataset of different `sigma_noise`, e.g., in [0, 2]. To what extent is data visualization by PCA affected by noise?
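The lab provides a `PCA` function in MATLAB; a minimal Python/NumPy sketch of the same computation (PCA via eigendecomposition of the covariance of the centered data, on a dataset generated as in Part 1) is:

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, sigma_noise = 100, 30, 0.01
# Dataset as in Part 1: two informative dimensions, the rest near-constant noise.
X = np.hstack([np.vstack([ 1.0 + 0.7 * rng.standard_normal((N, 2)),
                          -1.0 + 0.7 * rng.standard_normal((N, 2))]),
               sigma_noise * rng.standard_normal((2 * N, D - 2))])

# Center the data and eigendecompose the covariance matrix.
Xc = X - X.mean(axis=0)
eigvals, V = np.linalg.eigh(Xc.T @ Xc / Xc.shape[0])  # ascending order
eigvals, V = eigvals[::-1], V[:, ::-1]                # sort descending

k = 3
X_p = Xc @ V[:, :k]            # data projected on the top-k components
print(np.sqrt(eigvals[:10]))   # scree: only ~2 non-negligible values
```

The scree plot in the last line is the quantity asked for in step 4: the square roots of the leading eigenvalues, which drop sharply after the informative dimensions.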

### 3. Variable selection with OMP

1. Standardize the data matrix, so that each column has mean 0 and standard deviation 1. Use the statistics from the train set X (mean and standard deviation) to standardize the corresponding test set Xte. An example of a vectorized implementation:
``m = mean(X); s = std(X); X = bsxfun(@rdivide, bsxfun(@minus, X, m), s); Xte = bsxfun(@rdivide, bsxfun(@minus, Xte, m), s);``
2. Apply the orthogonal matching pursuit algorithm (function `OMatchingPursuit`) with T iterations to obtain T-1 coefficients for a sparse approximation of the training set Xt. Plot the resulting coefficient vector w using `stem(1:D, w)`. What is the output when setting `T = 3`, and how do you interpret the indices of the first active dimensions (nonzero coefficients)?
3. Check the predicted labels on the training (and test) set when approximating the output using w:
``Ypred = sign(Xt * w); err = calcErr(Yt, Ypred);``
Plot the errors on the same plot for increasing T. How do the train and test errors change with the number of iterations of the method?
4. By applying cross-validation on the training set through the provided `holdoutCVOMP`, find the optimal number of iterations in the range `intIter = 2:D` (indicative values: `pho = 0.2`, `rip = 30`). Plot the training and validation errors on the same plot. How do the errors change with the number of iterations?
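The lab's `OMatchingPursuit` is provided in MATLAB; a minimal Python/NumPy sketch of the same greedy scheme (select the column most correlated with the residual, then re-fit least squares on the selected support) on data generated and standardized as above is:

```python
import numpy as np

def omp(X, y, T):
    """Minimal orthogonal matching pursuit sketch (illustrative stand-in
    for the lab's OMatchingPursuit): T greedy selection iterations."""
    n, d = X.shape
    support, w = [], np.zeros(d)
    residual = y.astype(float).copy()
    for _ in range(T):
        corr = np.abs(X.T @ residual)
        corr[support] = -np.inf                  # never reselect a column
        support.append(int(np.argmax(corr)))
        coef, *_ = np.linalg.lstsq(X[:, support], y, rcond=None)
        w = np.zeros(d)
        w[support] = coef                        # re-fit on the whole support
        residual = y - X @ w
    return w

rng = np.random.default_rng(0)
N, D = 100, 30
X = np.hstack([np.vstack([ 1.0 + 0.7 * rng.standard_normal((N, 2)),
                          -1.0 + 0.7 * rng.standard_normal((N, 2))]),
               0.01 * rng.standard_normal((2 * N, D - 2))])
Y = np.concatenate([np.ones(N), -np.ones(N)])
X = (X - X.mean(axis=0)) / X.std(axis=0)   # standardize with train statistics

w = omp(X, Y, T=2)
print(np.nonzero(w)[0])                    # indices of the active dimensions
print(np.mean(np.sign(X @ w) != Y))        # training error of the sparse fit
```

On this data the first selected indices are the two informative dimensions, which is the interpretation asked for in step 2.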

### 4. (Optional)

Compare the results of Parts 2 and 3 and evaluate the benefits of the two methods, dimensionality reduction and feature selection, in the regimes `N >> D`, `N ≈ D` and `N << D`.