Machine Learning Day

Lab 0: Data Generation

This first (optional) lab is focused on getting started with MATLAB/Octave and working with data for ML. The goal is to provide basic familiarity with MATLAB syntax, along with some preliminary data generation, processing and visualization.

MATLAB/Octave resources

The labs are designed for MATLAB/Octave. Below you can find a number of resources to get you started.

  • MATLAB getting started tutorial for an introduction to the environment, syntax and conventions.
  • MATLAB has very thorough documentation, both online and built in. In the command window, type help functionName (to show a function's usage) or doc functionName (to open its documentation page).
  • Built-in tutorials: enter demo in the command window.
  • Comprehensive MATLAB reference and introduction (pdf).

Getting Started

  • Get the code file and add its directory to the MATLAB path (or set it as the current working directory).
  • Use the editor to write/save and run/debug longer scripts and functions.
  • Use the command window to try/test commands, view variables and see the use of functions.
  • Use plot (for 1D), imshow, imagesc (for 2D matrices), scatter, scatter3 to visualize variables of different types.
  • Work your way through the examples below, by following the instructions.
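
As a minimal sketch of the visualization functions listed above, the snippet below plots some made-up data. The variables x, M and P are illustrative, and the commented addpath line uses a hypothetical folder name that you would replace with the location of the lab code.

```matlab
% addpath('path/to/lab0');   % hypothetical path: replace with your own folder

x = linspace(0, 2*pi, 100);  % 1D data: 100 points between 0 and 2*pi
M = magic(8);                % 2D data: an 8 x 8 matrix
P = randn(200, 3);           % 200 random points in 3D

figure; plot(x, sin(x));                   % line plot of a 1D signal
figure; imagesc(M); colorbar;              % matrix shown as a scaled color image
figure; scatter(P(:,1), P(:,2));           % 2D scatter plot of the first two columns
figure; scatter3(P(:,1), P(:,2), P(:,3));  % 3D scatter plot
```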

1. Optional - MATLAB Warm-up

  1. Create a column vector v = [1; 2; 3] and a row vector u = [1,2,3]
    • What happens with the command v'? What is the corresponding algebraic/matrix operation?
    • Create z = [5;4;3] and try basic numerical operations of addition and subtraction with v.
    • What happens with u + z?
  2. Create the matrices A = [1 2 3; 4 5 6; 7 8 9] and B = A'
    • What kind of matrix is C = A + B?
    • Explore what happens with A(:,1), A(1,:), A(2:3,:) and A(:).
  3. Use the product operator *
    • What happens with 2*u, u*2, 2*v?
    • What happens with u*v and v*u, why? With A*v, u*A and A*u?
    • Use size and/or length functions to find the dimensions of vectors and matrices.
  4. Use the element-wise operators .* and ./, e.g., u.*z and z./u
    • What happens with v.*z and v./z?
    • Why aren't A*A and A.*A the same?
  5. Use the functions zeros, ones, rand, randn
    • Create 3 x 5 matrices of all zeros, of all ones, of random numbers uniformly distributed between 2 and 3, and of random numbers drawn from a Gaussian with variance 2.
  6. Use the functions eye and diag
    • Create a 3 x 3 identity matrix and a matrix whose diagonal is the vector v.
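
A few of the warm-up commands can be collected in one short script, shown here as a sketch rather than a reference solution. The vectors v, u, z and the matrix A are the ones defined above; all other variable names are illustrative. Try the commands yourself before reading the comments.

```matlab
v = [1; 2; 3];  u = [1, 2, 3];  z = [5; 4; 3];
A = [1 2 3; 4 5 6; 7 8 9];

vt = v';            % transpose: turns the 3 x 1 column into a 1 x 3 row
s  = u * v;         % (1x3)*(3x1): inner product, the scalar 14
O  = v * u;         % (3x1)*(1x3): outer product, a 3 x 3 matrix
Av = A * v;         % matrix-vector product: [14; 32; 50]
% A * u raises an error: the inner dimensions (3 and 1) do not agree

C  = A + A';        % a matrix plus its transpose is symmetric
e  = v .* z;        % element-wise product: [5; 8; 9]
P1 = A * A;         % matrix product (rows times columns)
P2 = A .* A;        % element-wise product (squares each entry)

U  = 2 + rand(3, 5);         % uniform on (2, 3): shift rand's (0, 1) range
G  = sqrt(2) * randn(3, 5);  % Gaussian with variance 2: scale by the std sqrt(2)
Id = eye(3);                 % 3 x 3 identity matrix
Dv = diag(v);                % 3 x 3 matrix with v on the diagonal
```

Note that randn draws samples with unit variance, so a target variance of 2 requires scaling by its square root, not by 2.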

2. Core - Data generation

The function MixGauss(means, sigmas, n) generates datasets where the distribution of each class is an isotropic Gaussian with a given mean and variance, according to the values in the matrices/vectors means and sigmas. Study the function code or type help MixGauss in the MATLAB command window. The function scatter can be used to plot points in 2D.

  1. Generate and visualize a simple dataset:
    [X, C] = MixGauss([[0;0], [1;1]], [0.5, 0.25], 1000);
    figure; scatter(X(:,1), X(:,2), 25, C);
  2. Generate more complex datasets:
    • 4-class dataset: the classes must live in the 2D space and be centered on the corners of the unit square (0,0), (0,1), (1,1), (1,0), all with variance 0.2.
    • 2-class dataset: manipulate the data to obtain a 2-class problem where data on opposite corners share the same class. Hint: if you generated the data following the suggested center order, you can use the function mod to quickly obtain two labels, e.g. Y = mod(C, 2).
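
The two datasets above can be sketched without the course code, as follows. This is a stand-in for MixGauss, not its actual implementation: it assumes that labels run from 1 to 4 in the order of the centers and that the given values are variances, both of which should be checked against help MixGauss. With the course function, the corresponding call would follow the signature shown earlier, with four centers and four sigma values.

```matlab
% Stand-in generator (assumption: the real MixGauss may use a different
% label convention, or treat sigmas as standard deviations).
means     = [0 0 1 1; 0 1 1 0];   % columns are the centers (0,0), (0,1), (1,1), (1,0)
variances = 0.2 * ones(1, 4);     % one variance per class
n         = 100;                  % points per class

[d, p] = size(means);             % d = input dimension, p = number of classes
X = zeros(n * p, d);
C = zeros(n * p, 1);
for k = 1:p
    rows = (k - 1) * n + (1:n);   % rows of X reserved for class k
    % isotropic Gaussian: scale unit-variance noise, then shift to the center
    X(rows, :) = sqrt(variances(k)) * randn(n, d) + repmat(means(:, k)', n, 1);
    C(rows) = k;
end

% Opposite corners share a label: classes 1 and 3 map to 1, classes 2 and 4 to 0
Y = mod(C, 2);

figure; scatter(X(:,1), X(:,2), 25, C);   % 4-class view
figure; scatter(X(:,1), X(:,2), 25, Y);   % 2-class view
```

The loop runs over the classes only, so the per-point sampling stays vectorized.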

3. Optional - Extra practice

  1. Generate datasets with larger variances, higher-dimensional input spaces, etc.
  2. Add noise to the data by flipping the labels of random points.
  3. For a dataset, compute the distances between all pairs of input points (vectorize your code and avoid for loops). How does the mean distance change with the number of dimensions?
  4. Generate regression data: consider a regression model defined by a linear function with coefficients w, plus Gaussian noise of level (SNR) delta.
    • Create a MATLAB function that takes as input the number of points n, the number of dimensions D, the D-dimensional vector w and the scalar delta, and returns an (n x D) matrix X and an (n x 1) vector Y.
    • Plot the underlying (linear) function and the noisy output on the same figure.
    • Test/visualize the 1-D and 2-D cases, but make the function generic to account for higher dimensional data.
  5. Generate regression data using a 1-D model with a non-linear function.
  6. Generate a dataset (either for regression or for classification) where most of the input variables are "noise", i.e., they are unrelated to the output.
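
Three of the exercises above (label flipping, vectorized pairwise distances and linear regression data) can be sketched in a few lines each. The snippet is one possible approach under stated assumptions: the labels are taken to be in {0, 1}, flip_fraction is a hypothetical parameter, and delta is used directly as the standard deviation of the noise, which is only one reading of "noise level". The regression part is written as a script; wrapping it in a function with inputs n, D, w, delta and outputs X, Y is left as in the exercise.

```matlab
n = 200;  D = 5;
X = randn(n, D);                  % n input points in D dimensions
Y = double(X(:, 1) > 0);          % toy labels in {0, 1}

% (a) Label noise: flip the labels of a random subset of points
flip_fraction = 0.1;              % hypothetical noise level
idx = randperm(n);
idx = idx(1:round(flip_fraction * n));
Ynoisy = Y;
Ynoisy(idx) = 1 - Ynoisy(idx);

% (b) All pairwise Euclidean distances without a for loop, using
%     ||xi - xj||^2 = ||xi||^2 + ||xj||^2 - 2 * xi' * xj
sq   = sum(X.^2, 2);                                  % squared norms, n x 1
D2   = repmat(sq, 1, n) + repmat(sq', n, 1) - 2 * (X * X');
Dist = sqrt(max(D2, 0));          % clip small negatives caused by round-off
Dist(1:n+1:end) = 0;              % distance of each point to itself is zero
mean_dist = sum(Dist(:)) / (n * (n - 1));             % mean over pairs with i ~= j

% (c) Linear regression data: noiseless linear output plus Gaussian noise
w     = [2; -1; 0; 0; 0];         % D-dimensional coefficient vector
delta = 0.5;                      % assumption: standard deviation of the noise
Yclean = X * w;                   % underlying linear function
Yreg   = Yclean + delta * randn(n, 1);
```

Repeating part (b) for increasing values of D and recording mean_dist is one way to explore the question about dimensionality.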