Question 18, Multinomial Log-Likelihood (15 points)
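A multiclass classifier with $K$ classes outputs the probability of class $k$ given input $\mathbf{x}$ via the softmax:
$$p(y = k \mid \mathbf{x}) = \frac{\exp(z_k)}{\sum_{j=1}^{K} \exp(z_j)},$$
where $z_k$ is the score (logit) the model computes for class $k$ from $\mathbf{x}$.
Write down the log-likelihood expression for a single training example under this multinomial model. Your derivation should follow the same steps used in the course for the binary cross-entropy loss.
Solution
For a single example $(\mathbf{x}, y)$ with true class $c$, the probability assigned by the model is $p(y = c \mid \mathbf{x})$. Using indicator notation, we can write the log-likelihood as:
$$\ell = \sum_{k=1}^{K} \mathbb{1}[y = k] \, \log p(y = k \mid \mathbf{x})$$
Substituting the softmax expression:
$$\ell = \sum_{k=1}^{K} \mathbb{1}[y = k] \left( z_k - \log \sum_{j=1}^{K} \exp(z_j) \right)$$
Since exactly one indicator is 1 (the true class $c$), this simplifies to:
$$\ell = z_c - \log \sum_{j=1}^{K} \exp(z_j)$$
The negative of this expression (averaged over training examples) is the categorical cross-entropy loss minimised during training. Providing the expression for the full dataset (summed over examples) is also acceptable.
Question 3, Confusion Matrix & Business Cost (15 points)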
A Data Scientist is evaluating four binary classification models. A false positive result is 5 times more expensive (from a business perspective) than a false negative result. The model must satisfy:
- Recall rate ≥ 80 %
- False positive rate (FPR) ≤ 10 %
- Minimum business cost
| Model | TN | FP | FN | TP |
|---|---|---|---|---|
| A | 91 | 9 | 22 | 78 |
| B | 99 | 1 | 21 | 79 |
| C | 96 | 4 | 10 | 90 |
| D | 98 | 2 | 18 | 82 |
Solution
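Compute recall and FPR for each model, taking the business cost as 5 × FP + FN (in units of one false negative):
| Model | Recall = TP/(TP+FN) | FPR = FP/(FP+TN) | Recall ≥ 80%? | FPR ≤ 10%? | Business cost |
|---|---|---|---|---|---|
| A | 78/100 = 78% | 9/100 = 9% | ✗ | ✓ | n/a |
| B | 79/100 = 79% | 1/100 = 1% | ✗ | ✓ | n/a |
| C | 90/100 = 90% | 4/100 = 4% | ✓ | ✓ | 5×4+10 = 30 |
| D | 82/100 = 82% | 2/100 = 2% | ✓ | ✓ | 5×2+18 = 28 |
Models A and B fail the recall constraint. Of the two models that satisfy both constraints, D has the lower business cost (28 versus 30), so D is the model to choose.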
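The screening above can be checked mechanically. The following is a minimal sketch, assuming the cost is measured in units of one false negative (so a false positive costs 5); the function names `evaluate` and `best_model` are illustrative and not part of the question.

```python
# Confusion-matrix counts copied from the question table: (TN, FP, FN, TP).
models = {
    "A": (91, 9, 22, 78),
    "B": (99, 1, 21, 79),
    "C": (96, 4, 10, 90),
    "D": (98, 2, 18, 82),
}

FP_COST, FN_COST = 5, 1  # assumption: a false negative is the unit of cost


def evaluate(tn, fp, fn, tp):
    """Return (recall, false positive rate, business cost) for one model."""
    recall = tp / (tp + fn)
    fpr = fp / (fp + tn)
    cost = FP_COST * fp + FN_COST * fn
    return recall, fpr, cost


def best_model(models, min_recall=0.80, max_fpr=0.10):
    """Pick the cheapest model among those that meet both constraints."""
    feasible = {}
    for name, counts in models.items():
        recall, fpr, cost = evaluate(*counts)
        if recall >= min_recall and fpr <= max_fpr:
            feasible[name] = cost
    return min(feasible, key=feasible.get), feasible


winner, feasible = best_model(models)
print(winner, feasible)
```

Filtering on the constraints first and only then minimising cost mirrors the order of the requirements in the question: a cheap model that misses the recall or FPR threshold is never considered.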
Question 7, Naive Bayes Multi-Feature Classifier (20 points)
You face a binary classification problem with target $y \in \{0, 1\}$ and inputs $\mathbf{x} = (x_1, \dots, x_n)$ where $x_i \in \mathbb{R}$.
- (5 points) Write the posterior distribution for each class $C_k$.
- (5 points) Write the joint probability of class $C_k$ and features $\mathbf{x}$ assuming features are conditionally independent given the class label.
- (5 points) Derive an optimal decision rule using the posteriors of the two classes.
- (5 points) If each conditional $p(x_i \mid C_k)$ is univariate Gaussian, suggest a method to estimate the posterior $p(C_k \mid \mathbf{x})$.
Solution
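Let $C_k$ denote the event $y = k$, and let $\mathbf{x} = (x_1, \dots, x_n)$ be the feature vector.
1. Posterior (Bayes rule):
$$p(C_k \mid \mathbf{x}) = \frac{p(\mathbf{x} \mid C_k) \, p(C_k)}{p(\mathbf{x})}$$
2. Joint under conditional independence: Applying the chain rule and the conditional independence assumption $p(x_i \mid x_{i+1}, \dots, x_n, C_k) = p(x_i \mid C_k)$:
$$p(C_k, x_1, \dots, x_n) = p(C_k) \prod_{i=1}^{n} p(x_i \mid C_k)$$
3. Optimal decision rule: Since the evidence $p(\mathbf{x})$ is the same for both classes, it cancels in the argmax:
$$\hat{y} = \arg\max_{k \in \{0, 1\}} \; p(C_k) \prod_{i=1}^{n} p(x_i \mid C_k)$$
4. Estimation with Gaussian conditionals (Gaussian Naive Bayes): From training data compute the prior $p(C_k)$ (fraction of examples in class $k$) and, for each feature $i$ and class $k$, MLE estimates of the mean $\mu_{ik}$ and variance $\sigma_{ik}^2$. Plug the Gaussian PDF $\mathcal{N}(x_i \mid \mu_{ik}, \sigma_{ik}^2)$ into the decision rule above (or equivalently work in log-space) to classify new examples.
Question 4, SGD and Feature Scaling (10 points)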
Part A (5 points)
Scaling / normalization of features has been found to be beneficial for faster convergence of SGD. Explain why.
Part B (5 points)
A colleague suggests that the scaling should also include the target variable $y$. Do you agree or disagree? Explain.
Solution
Part A. If features have very different scales, the loss surface in parameter space becomes highly anisotropic (elongated along the direction of the large-scale feature). The gradient then points mostly along the direction of the dominant feature, so the learning rate must be kept small to avoid overshooting in that direction, and SGD zig-zags in tiny steps along the others. After standardizing the features (zero mean, unit variance) the loss surface becomes more isotropic, enabling larger, more direct steps toward the minimum and hence faster convergence.
Part B. Agree: scaling the target variable is also helpful. The probabilistic model of the regression errors assumes a Gaussian noise process; target variables with heavy tails (e.g. house prices, counts) violate this assumption. Applying a log-transform to $y$ makes the target distribution closer to Gaussian, which improves the validity of the MLE criterion, and standardizing $y$ keeps the loss and gradient magnitudes in a comparable range, which can stabilize training.
Question 5, SGD Jitter, Momentum, and Convergence (15 points)
The figure in your notes shows a contour plot of a loss function in weight space together with the SGD trajectory as it searches for the minimum. The trajectory exhibits substantial jitter.
(A) (5 points) Is this the best trajectory you can think of?
(B) (5 points) Why is there so much jitter?
(C) (5 points) A colleague suggests that adding a fraction of the previous update vector to the current update improves convergence. Do you agree? If so, why? Recall the weight update:
$$\mathbf{w}_{t+1} = \mathbf{w}_t - \eta \, \nabla_{\mathbf{w}} L(\mathbf{w}_t)$$
Solution
(A) No. The jittery trajectory converges very slowly; a smoother, more direct path would be preferable.
(B) The axes of the contour plot represent the two model parameters. The jitter arises because the two parameter dimensions have very different dynamic ranges (the loss surface is elongated). A single learning rate $\eta$ is simultaneously too large for the steep direction and too small for the flat direction, causing oscillation in the steep direction.
(C) Yes. Adding a fraction $\beta$ of the previous update implements a momentum (exponential moving average) update:
$$\mathbf{v}_t = \beta \, \mathbf{v}_{t-1} + \eta \, \nabla_{\mathbf{w}} L(\mathbf{w}_t), \qquad \mathbf{w}_{t+1} = \mathbf{w}_t - \mathbf{v}_t$$
with $0 \le \beta < 1$. The averaging filters out high-frequency noise in the gradient, resulting in a much smoother, more direct trajectory and faster convergence, as illustrated by the right-hand plot in the notes.
Question 9, L1 vs L2 Regularization (15 points)
Part A (5 points)
Explain what early stopping does in terms of controlling model capacity, and what its main risk is.
Part B (10 points)
It has been argued that early stopping has an effect similar to L2 regularization. A colleague claims the same equivalence holds for L1 regularization. Do you agree or disagree? Justify your answer and sketch the constrained loss contours for L1 regularization.
Solution
Part A. Early stopping limits the number of SGD update steps, effectively capping the model’s capacity: the fewer the steps, the less the model can overfit the training data. The main risk is mistuning this hyperparameter: stopping too early leads to underfitting; stopping too late leads to overfitting. Finding the right stopping point requires periodic evaluation on a held-out validation set, which adds cost.
Part B. Disagree. Early stopping produces a regularization effect analogous to L2 (weight decay) because it shrinks all weights toward zero at a similar rate. L1 regularization, however, has a fundamentally different geometric structure: the L1 constraint ball (a “diamond” in 2-D) has corners aligned with the coordinate axes, which encourages the optimal solution to lie at a corner where some weights are exactly zero. This induces sparsity (many weights become identically zero), whereas early stopping does not yield sparse solutions. The L1-constrained loss contour (diamond shape) is therefore not equivalent to early stopping.
Question 16, Class Imbalance in Binary Classification (25 points)
Racial and ethnic bias is unfortunately prevalent in machine learning systems. You are given a dataset of person images for binary classification: (a) people of colour ($y = 1$) and (b) not people of colour ($y = 0$). The number of $y = 0$ examples is much larger than the number of $y = 1$ examples.
- (12.5 points) Describe what will happen when you do not address the class imbalance.
- (12.5 points) Outline two methods you can employ to address the imbalance and reduce classifier bias as much as possible.
Solution
1. Without correction. Because every example contributes equally to the cross-entropy loss, the minority class ($y = 1$, people of colour) contributes very little to the total gradient signal. The classifier will therefore learn almost entirely from the majority class and fail to learn discriminative features for the minority class. At inference it will exhibit low recall on $y = 1$ examples, i.e. it will systematically misclassify people of colour as “not people of colour”.
2. Remediation methods.
- Subsampling the majority class. Randomly downsample the $y = 0$ examples to match the count of $y = 1$ examples before training. This restores balance at the cost of discarding potentially useful data.
- Weighted loss function. Introduce a class weight $\alpha > 1$ for the minority class in the binary CE loss, so that errors on minority examples are penalised more heavily: $$L = -\frac{1}{N} \sum_{i=1}^{N} \left[ \alpha \, y_i \log \hat{y}_i + (1 - y_i) \log (1 - \hat{y}_i) \right]$$
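The weighted-loss remedy can be sketched in a few lines. This is an illustrative sketch, not the course's reference implementation: the function name `weighted_bce`, the toy batch, and the choice of weight (majority count divided by minority count, a common heuristic) are all assumptions.

```python
import math


def weighted_bce(y_true, y_prob, pos_weight=1.0, eps=1e-12):
    """Mean binary cross-entropy with the positive (minority) class up-weighted."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(pos_weight * y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)


# Hypothetical toy batch: 8 majority (y=0) examples and 2 minority (y=1) examples.
y_true = [0] * 8 + [1] * 2
y_prob = [0.1] * 8 + [0.3] * 2  # the model is unsure about the minority class

# Assumed heuristic: weight the minority class by the majority/minority count ratio.
pos_weight = y_true.count(0) / y_true.count(1)

plain = weighted_bce(y_true, y_prob)
weighted = weighted_bce(y_true, y_prob, pos_weight=pos_weight)
print(f"unweighted: {plain:.4f}  weighted: {weighted:.4f}")
```

With the weight applied, the two poorly predicted minority examples dominate the loss instead of being averaged away by the eight easy majority examples, which is the effect the remedy is after.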
Question 13, Cross-Entropy with Soft Probability Labels (5 points)
In binary classification you have worked with cross-entropy (CE) involving a hard ground truth $y \in \{0, 1\}$ and a predicted probability $\hat{y}$. You now face a 3-class problem where the ground truth is also expressed as a probability distribution:
Calculate the cross-entropy loss.
PS: This formulation is the basis of Hinton’s 2015 knowledge-distillation paper.
Solution
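The generalised cross-entropy for soft labels $\mathbf{y}$ and predicted probabilities $\hat{\mathbf{y}}$ is:
$$\mathrm{CE}(\mathbf{y}, \hat{\mathbf{y}}) = -\sum_{k=1}^{3} y_k \log \hat{y}_k$$
(Also accepted: bits in the reversed convention.)
Question 8, CNN Parameter Count and Output Size (15 points)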
Consider a CNN trained to classify 64×64 RGB images (3 channels) into 3 vehicle classes (cars, trucks, motorcycles). The architecture is:
- Conv1: 32 filters, kernel , stride 1, ReLU
- MaxPool1: , stride 2
- Conv2: 64 filters, kernel , stride 1, ReLU
- MaxPool2: , stride 2
- Conv3: 128 filters, kernel , stride 1, ReLU
- MaxPool3: , stride 2
- FC1: 512 neurons, ReLU
- Output: 3 neurons, softmax
PS3.A (5 points)
How many trainable parameters does Conv1 have?
PS3.B (5 points)
What is the output spatial size of Conv1 (assuming no padding)?
PS3.C (5 points)
What is the effect of increasing the number of filters in a convolutional layer on the parameter count and on the model’s performance?
Solution
PS3.A, Parameters in Conv1. Let the Conv1 kernel be $k \times k$. Each of the 32 filters has size $k \times k \times 3$ (spatial × input channels). Including one bias per filter: $32 \times (3k^2 + 1)$ parameters. (Without biases: $32 \times 3k^2 = 96k^2$. Both values are acceptable.)
PS3.B, Output size of Conv1 (no padding, stride 1): each spatial side is $(64 - k)/1 + 1 = 65 - k$, so the output is $(65 - k) \times (65 - k) \times 32$.
PS3.C, Effect of more filters. Increasing the filter count increases that layer's trainable parameters in proportion (more weights to store and update), and also those of the following layer, whose input depth grows. With more filters the network can learn a richer set of feature detectors, which generally improves performance on complex tasks. The trade-offs are higher memory and compute cost, and a greater risk of overfitting on small datasets.
Question 20, CNN vs FC Network on Permuted Images (15 points)
You are given the MNIST handwritten-digit dataset. Each image $\mathbf{x}$ is subjected to a fixed but unknown pixel permutation $\pi$ to produce $\tilde{\mathbf{x}} = \pi(\mathbf{x})$; the labels are left unchanged. The dataset is $\tilde{\mathcal{D}} = \{(\tilde{\mathbf{x}}_i, y_i)\}$.
- (10 points) Describe and justify the classification performance impact on a CNN that was designed and trained to classify the original (unpermuted) images, if it is now evaluated on the permuted dataset.
- (5 points) If the CNN is replaced with a fully connected (FC) network retrained from scratch on $\tilde{\mathcal{D}}$, what is the performance impact of the permutation?
Solution
1. CNN on permuted images. A CNN’s convolutional filters exploit spatial locality: they detect edges, textures, and shapes by combining activations from spatially adjacent pixels. The fixed permutation scrambles the spatial arrangement, so pixels that were neighbours in the original image are now scattered across the grid. The learned filters no longer correspond to any meaningful spatial pattern in $\tilde{\mathbf{x}}$, so the pre-trained CNN’s performance degrades dramatically (approaching random chance) on the permuted dataset. The CNN’s translational equivariance and weight-sharing assumptions are violated by the permutation.
2. FC network retrained on permuted images. A fully connected network assigns an independent weight to every input pixel without any spatial neighbourhood assumptions. Because $\pi$ is a fixed bijection applied consistently across all training and test examples, the FC network can learn the mapping from permuted pixel positions to digit classes just as effectively as it would on the original images: the permutation is effectively just a relabelling of the input coordinates. Retraining an FC network from scratch on $\tilde{\mathcal{D}}$ yields no performance loss compared to training on the original dataset.
Question 1, Softmax Temperature (10 points)
Consider the following enhanced softmax function for a vector $\mathbf{z}$ of length $K$:
$$\mathrm{softmax}(\mathbf{z})_i = \frac{\exp(z_i / T)}{\sum_{j=1}^{K} \exp(z_j / T)}$$
Using your calculator or a local Python interpreter, calculate the softmax of the vector for and . Explain the difference between the two results and the impact of $T$ in the classifier.
Solution
With :
- Increasing $T$ above 1 (i.e. decreasing $1/T$) spreads probability more evenly among all classes.
- Decreasing $T$ toward 0 (i.e. increasing $1/T$) concentrates all probability on the highest-score class.
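The two bullet points above can be observed numerically. The sketch below assumes the temperature divides the scores inside the exponent; the score vector and the three temperatures are illustrative choices, not the values from the question, and `softmax_t` is a hypothetical helper name.

```python
import math


def softmax_t(z, T=1.0):
    """Softmax of scores z at temperature T (T must be positive)."""
    scaled = [v / T for v in z]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in scaled]
    s = sum(exps)
    return [e / s for e in exps]


z = [1.0, 2.0, 3.0]  # hypothetical scores, chosen only for illustration
for T in (0.1, 1.0, 10.0):
    print(T, [round(p, 3) for p in softmax_t(z, T)])
```

Running it shows the distribution sharpening toward the largest score as the temperature drops and flattening toward uniform as it rises, while the ranking of the classes never changes.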
Question 2, Polynomial Basis Functions for Binary Classification (25 points)
In linear regression we mapped each input example $x$ into a function using polynomial basis functions $\phi_j(x)$. You are now asked to apply the same approach to binary classification on the circular 2-D dataset shown in your notes.
- Explain the impact of applying the polynomial basis-function transformation to the dataset.
- Draw the 3-D plot of the decision boundary that a logistic regressor (single neuron) will find after processing the transformed data.
- Draw the block diagram of the trainer, ensuring you quote all tensor sizes.
Solution
Applying the polynomial transformation (e.g. adding the squared features $x_1^2$ and $x_2^2$) increases the dimensionality of the input. In this higher-dimensional feature space a linear classification boundary (a hyperplane / plane) exists that perfectly separates the two classes of the circular dataset, i.e. what was a circle in 2-D becomes a flat decision boundary in the lifted feature space. The 3-D decision boundary is a plane in the lifted feature space that projects back to a circle in the original 2-D input space. The block diagram is a standard single neuron: input features $\boldsymbol{\phi}(\mathbf{x})$ → weight matrix $\mathbf{W}$ → sigmoid → binary cross-entropy loss, plus an SGD update block feeding back to $\mathbf{W}$.
Question 6, Missing Data in Regression (15 points)
You are given a dataset and asked to perform a regression task. Unfortunately some examples have missing features. You know which examples have missing values.
- (5 points) Describe the implications if you simply ignore the missing examples.
- (10 points) Outline two methods you can think of to address finding the regression function in this case. Be descriptive but concise (≤ 3 sentences per method).
Solution
1. Implications of ignoring missing examples. If missing features are replaced with a sentinel value (e.g. −999 999 or 0), those examples become outliers or introduce non-existent categories that are never present in the real data. This severely hurts generalization; any model will be distorted by trying to fit the artificial replacement values.
2. Methods to handle missing data.
- Complete-case deletion. Remove any row that has at least one missing feature before training. This keeps the data clean but can drastically reduce the number of training examples (especially when different examples are missing different features), potentially increasing variance.
- Imputation. Replace the missing value with a statistic computed from the non-missing entries of the same column, e.g. the column mean (fast, but assumes the values are missing completely at random, MCAR) or a nearest-neighbour average (more accurate, uses correlation structure). After imputation the full dataset is used for training, preserving sample size at the cost of introducing a small estimation error.
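Both methods fit in a few lines of plain Python. The sketch below is illustrative only: the feature matrix is hypothetical, `math.nan` is assumed as the missing-value marker, and the helper names `drop_incomplete` and `impute_mean` are not from the notes.

```python
import math

# Hypothetical feature matrix; math.nan marks a missing entry.
X = [
    [1.0, 10.0],
    [2.0, math.nan],
    [math.nan, 30.0],
    [4.0, 40.0],
]


def drop_incomplete(rows):
    """Complete-case deletion: keep only rows without any missing value."""
    return [r for r in rows if not any(math.isnan(v) for v in r)]


def impute_mean(rows):
    """Mean imputation: fill each gap with the mean of its column's observed values."""
    n_cols = len(rows[0])
    means = []
    for j in range(n_cols):
        observed = [r[j] for r in rows if not math.isnan(r[j])]
        means.append(sum(observed) / len(observed))
    return [[means[j] if math.isnan(v) else v for j, v in enumerate(r)] for r in rows]


complete = drop_incomplete(X)
imputed = impute_mean(X)
print(len(complete), "rows kept after deletion")
print(imputed)
```

In this toy case deletion discards half of the rows while imputation keeps all four, which is the sample-size trade-off described above. In a real regression the targets of deleted rows would have to be dropped alongside them.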
Question 11, Poisson MLE and Sensitivity to Outliers (20 points)
You are hired as a data scientist at Walmart and tasked with modelling daily store visitors. The (normalised) dataset of seven observations is:
Note: the numbers are in units of thousands (the normalisation is irrelevant to the calculations). You decide to fit a Poisson distribution:
$$p(x \mid \lambda) = \frac{\lambda^{x} e^{-\lambda}}{x!}$$
where $\lambda$ equals the expected value of $x$.
- (5 points) Assuming the data are i.i.d., write down the log-likelihood function (natural log).
- (10 points) Solve to obtain the MLE estimate $\hat{\lambda}$.
- (5 points) Is the Black Friday observation () important or unimportant to the MLE estimate? Explain.

