
Question 18 — Multinomial Log-Likelihood (15 points)

A multiclass classifier with $K$ classes outputs the probability of class $k$ given input $\mathbf{x}$ via the softmax:

$$P(y = k \mid \mathbf{x}) = \frac{\exp(\mathbf{x}^\top \mathbf{w}_k)}{\sum_{j=1}^{K} \exp(\mathbf{x}^\top \mathbf{w}_j)}$$

Write down the log-likelihood expression for a single training example $(\mathbf{x}, y)$ under this multinomial model. Your derivation should follow the same steps used in the course for the binary cross-entropy loss.

Solution

For a single example with true class $y$, the probability assigned by the model is $P(y \mid \mathbf{x})$. Using indicator notation, we can write the log-likelihood as:

$$\log p(y \mid \mathbf{x}) = \sum_{k=1}^{K} \mathbf{1}[y = k]\, \log P(y = k \mid \mathbf{x})$$

Substituting the softmax expression:

$$= \sum_{k=1}^{K} \mathbf{1}[y = k] \left[ \mathbf{x}^\top \mathbf{w}_k - \ln\sum_{j=1}^{K} \exp(\mathbf{x}^\top \mathbf{w}_j) \right]$$

Since exactly one indicator is 1 (that of the true class $c$), this simplifies to:

$$= \mathbf{x}^\top \mathbf{w}_c - \ln\sum_{j=1}^{K}\exp(\mathbf{x}^\top \mathbf{w}_j)$$

The negative of this expression (averaged over $m$ training examples) is the categorical cross-entropy loss minimised during training. Providing the expression for the full dataset (summed over $m$ examples) is also acceptable.
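The final expression can be checked numerically. The sketch below uses made-up weights and a made-up example, and evaluates the log-likelihood with a numerically stable log-sum-exp:

```python
import math

def multinomial_log_likelihood(x, W, c):
    """Log-likelihood of true class c: x.w_c - log sum_j exp(x.w_j)."""
    scores = [sum(xi * wi for xi, wi in zip(x, w)) for w in W]
    shift = max(scores)  # stabilise the log-sum-exp
    log_sum_exp = shift + math.log(sum(math.exp(s - shift) for s in scores))
    return scores[c] - log_sum_exp

# Cross-check against the naive softmax-then-log route
x, W, c = [1.0, 2.0], [[0.1, 0.2], [0.3, -0.1], [0.0, 0.5]], 1
scores = [sum(xi * wi for xi, wi in zip(x, w)) for w in W]
probs = [math.exp(s) / sum(math.exp(t) for t in scores) for s in scores]
assert abs(multinomial_log_likelihood(x, W, c) - math.log(probs[c])) < 1e-9
```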

Question 3 — Confusion Matrix & Business Cost (15 points)

A Data Scientist is evaluating four binary classification models. A false positive result is 5 times more expensive (from a business perspective) than a false negative result. The model must satisfy:
  1. Recall rate ≥ 80 %
  2. False positive rate (FPR) ≤ 10 %
  3. Minimum business cost
After training, the following confusion matrices were obtained:
Model | TN | FP | FN | TP
A     | 91 |  9 | 22 | 78
B     | 99 |  1 | 21 | 79
C     | 96 |  4 | 10 | 90
D     | 98 |  2 | 18 | 82

Recall $= \text{TP}/(\text{TP}+\text{FN})$; FPR $= \text{FP}/(\text{FP}+\text{TN})$; business cost $= 5 \times \text{FP} + \text{FN}$. Which model satisfies all requirements? Justify your answer numerically.

Solution

Compute recall and FPR for each model:
Model | Recall       | FPR         | Recall ≥ 80%? | FPR ≤ 10%? | Business cost
A     | 78/100 = 78% | 9/100 = 9%  | no            | yes        | n/a
B     | 79/100 = 79% | 1/100 = 1%  | no            | yes        | n/a
C     | 90/100 = 90% | 4/100 = 4%  | yes           | yes        | 5×4 + 10 = 30
D     | 82/100 = 82% | 2/100 = 2%  | yes           | yes        | 5×2 + 18 = 28
Models A and B fail the recall requirement. Among the passing models, Model D has a lower business cost (28 < 30). Answer: Model D.
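The table can be verified in a few lines; the confusion-matrix values are taken straight from the question:

```python
models = {  # (TN, FP, FN, TP) from the confusion matrices above
    "A": (91, 9, 22, 78),
    "B": (99, 1, 21, 79),
    "C": (96, 4, 10, 90),
    "D": (98, 2, 18, 82),
}

def metrics(tn, fp, fn, tp):
    recall = tp / (tp + fn)
    fpr = fp / (fp + tn)
    cost = 5 * fp + fn  # a false positive is 5x more expensive than a false negative
    return recall, fpr, cost

passing = {name: metrics(*cm)[2] for name, cm in models.items()
           if metrics(*cm)[0] >= 0.80 and metrics(*cm)[1] <= 0.10}
best = min(passing, key=passing.get)  # "D": cost 28 beats C's 30
```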

Question 7 — Naive Bayes Multi-Feature Classifier (20 points)

You face a binary classification problem with target $y \in \{0, 1\}$ and inputs $\mathbf{x} = [x_1, x_2, \dots, x_n]$ where $x_i \in \mathbb{R}$.
  1. (5 points) Write the posterior distribution for each class $k \in \{0, 1\}$.
  2. (5 points) Write the joint probability of class and features assuming features are conditionally independent given the class label.
  3. (5 points) Derive an optimal decision rule using the posteriors of the two classes.
  4. (5 points) If each conditional $p(x_i \mid y_k)$ is univariate Gaussian, suggest a method to estimate the posterior $\hat{y}$.

Solution

Let $y_k$ denote the event $y = k$.

1. Posterior (Bayes rule):

$$p(y_k \mid \mathbf{x}) = \frac{p(\mathbf{x} \mid y_k)\, p(y_k)}{p(\mathbf{x})}$$

2. Joint under conditional independence. Applying the chain rule and the conditional independence assumption $x_i \perp x_j \mid y_k$:

$$p(y_k,\, x_1, \ldots, x_n) = p(y_k)\prod_{i=1}^n p(x_i \mid y_k)$$

3. Optimal decision rule. Since the evidence $p(\mathbf{x}) = \sum_k p(y_k)\prod_i p(x_i \mid y_k)$ is the same for both classes, it cancels in the argmax:

$$\hat{y} = \arg\max_{k} \; p(y_k) \prod_{i=1}^n p(x_i \mid y_k)$$

4. Estimation with Gaussian conditionals (Gaussian Naive Bayes). From training data compute: the prior $\hat{p}(y_k) = N_k / N$ (fraction of examples in class $k$), and, for each feature $i$ and class $k$, MLE estimates of the mean $\mu_{ik} = \frac{1}{N_k}\sum_{j: y^{(j)}=k} x_i^{(j)}$ and variance $\sigma_{ik}^2$. Plug the Gaussian PDF into the decision rule above (or equivalently work in log-space) to classify new examples.
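Step 4 can be sketched as a minimal Gaussian Naive Bayes (the function names and toy data are my own; MLE of priors, means, and variances, then the decision rule evaluated in log-space):

```python
import math
from collections import defaultdict

def fit_gnb(X, y):
    """MLE estimates: class priors plus per-feature Gaussian mean/variance."""
    groups = defaultdict(list)
    for features, label in zip(X, y):
        groups[label].append(features)
    params = {}
    for label, rows in groups.items():
        n = len(rows)
        cols = list(zip(*rows))
        means = [sum(c) / n for c in cols]
        variances = [sum((v - mu) ** 2 for v in c) / n for c, mu in zip(cols, means)]
        params[label] = (n / len(X), means, variances)
    return params

def predict_gnb(params, x):
    """argmax_k of log p(y_k) + sum_i log N(x_i | mu_ik, var_ik)."""
    def log_posterior(label):
        prior, means, variances = params[label]
        lp = math.log(prior)
        for xi, mu, var in zip(x, means, variances):
            lp += -0.5 * math.log(2 * math.pi * var) - (xi - mu) ** 2 / (2 * var)
        return lp
    return max(params, key=log_posterior)

X = [[0.0, 1.0], [0.2, 0.9], [-0.2, 1.1], [5.0, -1.0], [5.2, -0.9], [4.8, -1.1]]
y = [0, 0, 0, 1, 1, 1]
model = fit_gnb(X, y)
```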

Question 4 — SGD and Feature Scaling (10 points)

Part A (5 points)

Scaling / normalization of features has been found to be beneficial for faster convergence of SGD. Explain why.

Part B (5 points)

A colleague suggests that the scaling should also include the target variable $y$. Do you agree or disagree? Explain.

Solution

Part A. If features have very different scales, the loss surface in parameter space becomes highly anisotropic (elongated along the direction of the large-scale feature). The gradient $\nabla_{\mathbf{w}} L$ will then point mostly in the direction of the dominant feature, forcing SGD to take tiny zig-zag steps in the other directions to avoid overshooting. After standardizing the features (zero mean, unit variance) the loss surface becomes more isotropic, enabling larger, more direct steps toward the minimum and hence faster convergence.

Part B. Agreement: scaling the target variable is also helpful. The probabilistic model of the regression errors assumes a Gaussian noise process; target variables with heavy tails (e.g. house prices, counts) violate this assumption. Applying a log-transform or standardization to $y$ makes the target distribution more Gaussian, which improves the validity of the MLE criterion and can stabilize training.
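Part A's standardization can be illustrated with a small z-score helper (a sketch; the feature values below are made up):

```python
def standardise(column):
    """Z-score a feature column: subtract the mean, divide by the std deviation."""
    m = sum(column) / len(column)
    sd = (sum((v - m) ** 2 for v in column) / len(column)) ** 0.5
    return [(v - m) / sd for v in column]

areas = [1200.0, 3000.0, 450.0]   # large-scale feature (e.g. square feet)
rooms = [3.0, 5.0, 2.0]           # small-scale feature
scaled = [standardise(areas), standardise(rooms)]
```

After the transform both columns have zero mean and unit variance, so neither dominates the gradient.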

Question 5 — SGD Jitter, Momentum, and Convergence (15 points)

The figure in your notes shows a contour plot of a loss function in weight space together with the SGD trajectory as it searches for the minimum. The trajectory exhibits substantial jitter. (A) (5 points) Is this the best trajectory you can think of? (B) (5 points) Why is there so much jitter? (C) (5 points) A colleague suggests that adding a fraction of the previous update vector to the current update improves convergence. Do you agree? If so, why? Recall the weight update:

$$\mathbf{w}_{k+1} = \mathbf{w}_k - \eta \nabla_{\mathbf{w}} L(\mathbf{w}_k)$$

Solution

(A) No. The jittery trajectory converges very slowly; a smoother, more direct path would be preferable.

(B) The axes of the contour plot represent the model parameters $\mathbf{w}$. The jitter arises because the two parameter dimensions have very different dynamic ranges (the loss surface is elongated). A single learning rate $\eta$ is simultaneously too large for the steep direction and too small for the flat direction, causing oscillation in the steep direction.

(C) Yes. Adding a fraction $\beta$ of the previous update implements a momentum (exponential moving average) update:

$$V_{k+1} = \beta V_k + (1-\beta)\, \nabla_{\mathbf{w}} L(\mathbf{w}_k), \qquad \mathbf{w}_{k+1} = \mathbf{w}_k - \eta\, V_{k+1}$$

The averaging filters out high-frequency noise in the gradient, resulting in a much smoother, more direct trajectory and faster convergence, as illustrated by the right-hand plot in the notes.
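A toy check of claim (C), under assumed values chosen purely for illustration (quadratic loss $\tfrac{1}{2}(100 w_0^2 + w_1^2)$, $\eta = 0.021$, $\beta = 0.9$): at this step size plain gradient descent oscillates and diverges along the steep axis, while the momentum form converges:

```python
def grad(w):
    # Gradient of the elongated quadratic L(w) = 0.5 * (100 * w0^2 + w1^2)
    return [100 * w[0], w[1]]

def run(use_momentum, steps=150, eta=0.021, beta=0.9):
    w, v = [1.0, 1.0], [0.0, 0.0]
    for _ in range(steps):
        g = grad(w)
        if use_momentum:
            v = [beta * vi + (1 - beta) * gi for vi, gi in zip(v, g)]
            w = [wi - eta * vi for wi, vi in zip(w, v)]
        else:
            w = [wi - eta * gi for wi, gi in zip(w, g)]
    return w

assert abs(run(False)[0]) > 1e3              # plain SGD: steep direction diverges
assert all(abs(c) < 0.1 for c in run(True))  # momentum: smooth convergence
```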

Question 9 — L1 vs L2 Regularization (15 points)

Part A (5 points)

Explain what early stopping does in terms of controlling model capacity, and what its main risk is.

Part B (10 points)

It has been argued that early stopping has an effect similar to L2 regularization. A colleague claims the same equivalence holds for L1 regularization. Do you agree or disagree? Justify your answer and sketch the constrained loss contours for L1 regularization.

Solution

Part A. Early stopping limits the number of SGD update steps, effectively capping the model's capacity: the fewer the steps, the less the model can overfit the training data. The main risk is incorrectly tuning this hyperparameter: stopping too early leads to underfitting; stopping too late leads to overfitting. Finding the right stopping point requires periodic evaluation on a held-out validation set, which adds cost.

Part B. Disagreement. Early stopping produces a regularization effect analogous to L2 (weight decay) because it shrinks all weights toward zero at a similar rate. L1 regularization, however, has a fundamentally different geometric structure: the L1 constraint ball (a "diamond" in 2-D) has corners aligned with the coordinate axes, which encourages the optimal solution to lie at a corner where some weights are exactly zero. This induces sparsity (many weights become identically zero), whereas early stopping does not yield sparse solutions. The L1-constrained loss contour (diamond shape) is therefore not equivalent to early stopping.
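One way to make the sparsity contrast concrete is the per-weight update each penalty induces; this sketch is not from the course notes, and the helper names are my own:

```python
def l1_prox(w, lam):
    """Soft-thresholding (the L1 proximal step): any weight within lam of zero
    becomes exactly zero, which is where the sparsity comes from."""
    if w > lam:
        return w - lam
    if w < -lam:
        return w + lam
    return 0.0

def l2_shrink(w, lam):
    """The L2 step only rescales: weights shrink toward zero but never reach it."""
    return w / (1.0 + lam)

weights = [0.03, -0.02, 1.5, -0.8]
sparse = [l1_prox(w, 0.1) for w in weights]    # small weights become exactly 0
dense = [l2_shrink(w, 0.1) for w in weights]   # all weights stay nonzero
```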

Question 16 — Class Imbalance in Binary Classification (25 points)

Racial and ethnic bias is unfortunately prevalent in machine learning systems. You are given a dataset of person images for binary classification: (a) people of colour ($y=1$) and (b) not people of colour ($y=0$). The number of $y=0$ examples is much larger than the number of $y=1$ examples.
  1. (12.5 points) Describe what will happen when you do not address the class imbalance.
  2. (12.5 points) Outline two methods you can employ to address the imbalance and reduce classifier bias as much as possible.

Solution

1. Without correction. Because every example contributes equally to the cross-entropy loss, the minority class ($y=1$) contributes very little to the total gradient signal. The classifier will therefore learn almost entirely from the majority class and fail to learn discriminative features for the minority class. At inference it will exhibit low recall on $y=1$ examples, i.e. it will systematically misclassify people of colour as "not people of colour".

2. Remediation methods.
  • Subsampling the majority class. Randomly downsample the $y=0$ examples to match the count of $y=1$ examples before training. This restores balance at the cost of discarding potentially useful data.
  • Weighted loss function. Introduce a class weight $\alpha > 1$ for the minority class in the binary CE loss:

$$L(\mathbf{w}) = -\frac{1}{m}\sum_{i=1}^m \bigl[\alpha\, y_i \ln \hat{y}_i + (1-y_i)\ln(1-\hat{y}_i)\bigr]$$

Setting $\alpha = m_0/m_1$ (the ratio of majority to minority counts) amplifies the gradient contribution from the minority class without discarding any data; $\alpha$ can also be treated as a tunable hyperparameter.
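The weighted loss can be sketched directly (the labels, predictions, and class ratio below are toy values chosen for illustration):

```python
import math

def weighted_bce(y_true, y_prob, alpha):
    """Binary cross-entropy with the positive (minority) class up-weighted by alpha."""
    m = len(y_true)
    return -sum(alpha * y * math.log(p) + (1 - y) * math.log(1 - p)
                for y, p in zip(y_true, y_prob)) / m

y_true = [1, 0, 0, 0]     # 1 minority example vs 3 majority examples
y_prob = [0.6, 0.2, 0.1, 0.3]
alpha = 3 / 1             # alpha = m0 / m1
loss = weighted_bce(y_true, y_prob, alpha)
```

Up-weighting the minority term makes its gradient contribution comparable to the majority's.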

Question 13 — Cross-Entropy with Soft Probability Labels (5 points)

In binary classification you have worked with cross-entropy (CE) involving a hard ground truth $y \in \{0,1\}$ and a predicted probability $\hat{y}$. You now face a 3-class problem where the ground truth is also expressed as a probability distribution:

$$y = [0.10,\ 0.40,\ 0.50], \qquad \hat{y} = [0.80,\ 0.15,\ 0.05]$$

Calculate the cross-entropy loss. PS: this formulation is the basis of Hinton's 2015 knowledge-distillation paper.

Solution

The generalised cross-entropy for soft labels is:

$$CE(y,\hat{y}) = -\sum_{k} y_k \log_2 \hat{y}_k$$

$$= -\bigl[0.10 \log_2(0.80) + 0.40 \log_2(0.15) + 0.50 \log_2(0.05)\bigr]$$

$$= -\bigl[0.10 \times (-0.322) + 0.40 \times (-2.737) + 0.50 \times (-4.322)\bigr]$$

$$= -[-0.032 - 1.095 - 2.161] = \mathbf{3.288}\ \text{bits}$$

(Also accepted: $CE(\hat{y}, y) \approx 2.906$ bits in the reversed convention.)
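The calculation can be reproduced in a couple of lines:

```python
import math

y = [0.10, 0.40, 0.50]        # soft ground-truth distribution
y_hat = [0.80, 0.15, 0.05]    # predicted distribution

ce = -sum(t * math.log2(p) for t, p in zip(y, y_hat))           # ~3.288 bits
ce_reversed = -sum(p * math.log2(t) for t, p in zip(y, y_hat))  # ~2.906 bits
```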

Question 8 — CNN Parameter Count and Output Size (15 points)

Consider a CNN trained to classify 64×64 RGB images (3 channels) into 3 vehicle classes (cars, trucks, motorcycles). The architecture is:
  • Conv1: 32 filters, kernel $3 \times 3$, stride 1, ReLU
  • MaxPool1: $2 \times 2$, stride 2
  • Conv2: 64 filters, kernel $3 \times 3$, stride 1, ReLU
  • MaxPool2: $2 \times 2$, stride 2
  • Conv3: 128 filters, kernel $3 \times 3$, stride 1, ReLU
  • MaxPool3: $2 \times 2$, stride 2
  • FC1: 512 neurons, ReLU
  • Output: 3 neurons, softmax

PS3.A (5 points)

How many trainable parameters does Conv1 have?

PS3.B (5 points)

What is the output spatial size of Conv1 (assuming no padding)?

PS3.C (5 points)

What is the effect of increasing the number of filters in a convolutional layer on the parameter count and on the model’s performance?

Solution

PS3.A — Parameters in Conv1. Each of the 32 filters has size $3 \times 3 \times 3$ (spatial × input channels). Including one bias per filter:

$$\text{params} = 32 \times (3 \times 3 \times 3 + 1) = 32 \times 28 = \mathbf{896}$$

(Without biases: $32 \times 27 = 864$. Both values are acceptable.)

PS3.B — Output size of Conv1 (no padding, stride 1):

$$\text{output} = \frac{64 - 3}{1} + 1 = 62 \quad \Rightarrow \quad \mathbf{62 \times 62 \times 32}$$

PS3.C — Effect of more filters. Increasing the filter count proportionally increases the number of trainable parameters (more weights to store and update). With more filters the network can learn a richer set of feature detectors, which generally improves performance on complex tasks. The trade-offs are higher memory and compute cost, and a greater risk of overfitting on small datasets.
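Both computations follow from the standard conv-layer formulas; a small sketch (the helper names are my own, and the same functions apply to every layer in the architecture):

```python
def conv_params(filters, k, in_channels, bias=True):
    """Trainable parameters of a conv layer: filters * (k*k*in_channels + bias)."""
    return filters * (k * k * in_channels + (1 if bias else 0))

def out_size(size, k, stride=1, pad=0):
    """Output spatial size of a conv/pool layer: (size + 2*pad - k) // stride + 1."""
    return (size + 2 * pad - k) // stride + 1

conv1 = conv_params(32, 3, 3)       # 896 with biases, 864 without
s = out_size(64, 3)                 # 62: Conv1 output is 62 x 62 x 32
s = out_size(s, 2, stride=2)        # 31 after MaxPool1
```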

Question 20 — CNN vs FC Network on Permuted Images (15 points)

You are given the MNIST handwritten-digit dataset. Each $28 \times 28$ image $\mathbf{x}$ is subjected to a fixed but unknown pixel permutation $\pi$ to produce $\pi(\mathbf{x})$; the labels $y$ are left unchanged. The dataset is $(\pi(\mathbf{x}),\, y)$.
  1. (10 points) Describe and justify the classification performance impact on a CNN that was designed and trained to classify the original (unpermuted) images, if it is now evaluated on the permuted dataset.
  2. (5 points) If the CNN is replaced with a fully connected (FC) network retrained from scratch on $(\pi(\mathbf{x}), y)$, what is the performance impact of the permutation?

Solution

1. CNN on permuted images. A CNN's convolutional filters exploit spatial locality: they detect edges, textures, and shapes by combining activations from spatially adjacent pixels. The fixed permutation $\pi$ scrambles the spatial arrangement, so pixels that were neighbours in the original image are now scattered across the $28 \times 28$ grid. The learned filters no longer correspond to any meaningful spatial pattern in $\pi(\mathbf{x})$, so the pre-trained CNN's performance degrades dramatically (approaching random chance) on the permuted dataset. The CNN's translational equivariance and weight-sharing assumptions are violated by the permutation.

2. FC network retrained on permuted images. A fully connected network assigns an independent weight to every input pixel, with no spatial neighbourhood assumptions. Because $\pi$ is a fixed bijection applied consistently across all training and test examples, the FC network can learn the mapping from permuted pixel positions to digit classes just as effectively as it would on the original images; the permutation is effectively just a relabelling of the input coordinates. Retraining an FC network from scratch on $(\pi(\mathbf{x}), y)$ yields no performance loss compared to training on the original dataset.

Question 1 — Softmax Temperature (10 points)

Consider the following enhanced softmax function for a vector $\mathbf{z}$ of length $K$:

$$\text{softmax}(\mathbf{z})_i = \frac{e^{-\beta z_i}}{\sum_{j=1}^K e^{-\beta z_j}} \quad \text{for } i = 1, \dotsc, K$$

Using your calculator or a local Python interpreter, calculate the softmax of the vector $\mathbf{z} = [1, 8, 3, 10, 5]$ for $\beta = 1$ and $\beta = 10$. Explain the difference between the two results and the impact of $\beta \ne 1$ on the classifier.

Solution

With $\beta = 1/T$:
  • Increasing $T$ above 1 (i.e. decreasing $\beta$) spreads probability evenly among all classes.
  • Decreasing $T$ toward 0 (i.e. increasing $\beta$) concentrates all probability on a single class.
In the limit $T \to \infty$ every class has the same probability (uniform). In the limit $T \to 0$ a single class receives all the probability (a hard decision). Note the negative sign in the exponent: as written, this softmax assigns higher probability to smaller scores, so for $\beta = 10$ the lowest entry ($z_1 = 1$) receives almost all of the probability mass, making the classifier very "confident." For $\beta = 1$ the distribution is softer and multiple classes receive non-negligible probability. (With the conventional positive sign the mass would instead concentrate on the largest entry, $z_4 = 10$.) A temperature $T > 1$ corrects an over-confident network by bringing logits closer together; $T < 1$ drives logits further apart.
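Evaluating the formula numerically, taking the negative exponent at face value (so large $\beta$ piles the mass on the smallest score):

```python
import math

def neg_softmax(z, beta):
    """The question's softmax: exp(-beta * z_i), normalised over the vector."""
    exps = [math.exp(-beta * zi) for zi in z]
    total = sum(exps)
    return [e / total for e in exps]

z = [1, 8, 3, 10, 5]
p_beta1 = neg_softmax(z, 1.0)    # softer: z_1 gets ~0.87, z_3 gets ~0.12
p_beta10 = neg_softmax(z, 10.0)  # nearly one-hot on z_1
```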

Question 2 — Polynomial Basis Functions for Binary Classification (25 points)

In linear regression we mapped each input example $x$ into a function $g(x, \mathbf{w}) = \mathbf{w}^T \phi(x)$ using polynomial basis functions $\phi_i(x)$. You are now asked to apply the same approach to binary classification on the circular 2-D dataset shown in your notes.
  1. Explain the impact of applying the polynomial basis-function transformation to the dataset.
  2. Draw the 3-D plot of the decision boundary that a logistic regressor (single neuron) will find after processing the transformed data.
  3. Draw the block diagram of the trainer, ensuring you quote all tensor sizes.
Hint: you may use the kernel form $k(\mathbf{x}_i, \mathbf{x}_j) = \phi(\mathbf{x}_i)^T \phi(\mathbf{x}_j)$.

Solution

Applying the polynomial transformation (e.g. adding features $x_1^2, x_2^2, x_1 x_2, \ldots$) increases the dimensionality of the input. In this higher-dimensional feature space a linear classification boundary (a hyperplane) exists that perfectly separates the two classes of the circular dataset: what was a circle in 2-D becomes a flat decision boundary in the lifted feature space. The 3-D decision boundary is a plane in the $(x_1, x_2, x_1^2+x_2^2)$ space that projects back to a circle in the original 2-D input space. The block diagram is a standard single neuron: $n$ input features $\to$ weight vector $\mathbf{w}\ [n \times 1]$ $\to$ sigmoid $\sigma(\cdot)$ $\to$ binary cross-entropy loss; plus an SGD update block feeding back to $\mathbf{w}$.
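A minimal sketch of the lift, with hypothetical sample points: appending the radius-squared feature makes the circular classes separable by a flat threshold in the new coordinate:

```python
def lift(x1, x2):
    """Map a 2-D point to 3-D by appending the radius-squared feature."""
    return (x1, x2, x1 ** 2 + x2 ** 2)

inner = [(0.1, 0.2), (-0.3, 0.1), (0.2, -0.2)]  # class inside the unit circle
outer = [(2.0, 0.0), (0.0, -2.5), (1.5, 1.5)]   # class outside the unit circle

# In the lifted space the horizontal plane z = 1 separates the classes exactly.
assert all(lift(*p)[2] < 1.0 for p in inner)
assert all(lift(*p)[2] > 1.0 for p in outer)
```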

Question 6 — Missing Data in Regression (15 points)

You are given a dataset and asked to perform a regression task. Unfortunately some examples $(\mathbf{x}, y)$ have missing features. You know which examples have missing values.
  1. (5 points) Describe the implications if you simply ignore the missing examples.
  2. (10 points) Outline two methods you can think of to address finding the regression function in this case. Be descriptive but concise (≤ 3 sentences per method).

Solution

1. Implications of ignoring the missing values. If missing features are left as (or replaced with) a sentinel value such as −999999 or 0, those examples become outliers or introduce non-existent categories that never occur in the real data. This severely hurts generalization: any model will be distorted by trying to fit the artificial replacement values.

2. Methods to handle missing data.
  • Complete-case deletion. Remove any row that has at least one missing feature before training. This keeps the data clean but can drastically reduce the number of training examples (especially when different examples are missing different features), potentially increasing variance.
  • Imputation. Replace the missing value with a statistic computed from the non-missing entries of the same column — e.g. the column mean (fast, assumes MCAR) or a nearest-neighbour average (more accurate, uses correlation structure). After imputation the full dataset is used for training, preserving sample size at the cost of introducing a small estimation error.
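Column-mean imputation can be sketched in a few lines (`None` marks a missing entry; the helper name and toy data are my own):

```python
def impute_means(rows):
    """Replace each None with the mean of the observed values in its column."""
    means = []
    for col in zip(*rows):
        observed = [v for v in col if v is not None]
        means.append(sum(observed) / len(observed))
    return [[means[j] if v is None else v for j, v in enumerate(row)]
            for row in rows]

data = [[1.0, 2.0], [None, 4.0], [3.0, None]]
filled = impute_means(data)
```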

Question 11 — Poisson MLE and Sensitivity to Outliers (20 points)

You are hired as a data scientist at Walmart and tasked with modelling daily store visitors. The (normalised) dataset of seven observations is:

$$\mathbb{X} = \{1, 2, 5, 10, 5, 8, 100\}$$

Note: the numbers are in units of thousands (the normalisation is irrelevant to the calculations). You decide to fit a Poisson distribution:

$$\Pr(x = k) = \frac{\lambda^k e^{-\lambda}}{k!}, \quad k = 0, 1, 2, \ldots$$

where $\lambda > 0$ equals the expected value of $x$.
  1. (5 points) Assuming the data are i.i.d., write down the log-likelihood function $LL$ (natural log).
  2. (10 points) Solve $\partial LL / \partial \lambda = 0$ to obtain the MLE estimate $\hat{\lambda}$.
  3. (5 points) Is the Black Friday observation ($x = 100$) important or unimportant to the MLE estimate? Explain.

Solution

1. Log-likelihood.

$$LL = \ln \prod_{i=1}^m p(x_i) = \sum_{i=1}^m \left[ -\lambda + x_i \ln\lambda - \ln x_i! \right] = -m\lambda + \ln\lambda \sum_{i=1}^m x_i - \sum_{i=1}^m \ln x_i!$$

2. MLE estimate.

$$\frac{\partial LL}{\partial \lambda} = 0 \;\Rightarrow\; -m + \frac{1}{\lambda}\sum_{i=1}^m x_i = 0 \;\Rightarrow\; \hat{\lambda} = \frac{1}{m}\sum_{i=1}^m x_i$$

The MLE of $\lambda$ is simply the sample mean.

3. Impact of the Black Friday outlier. The Black Friday value is very important. Without it (using $\{1,2,5,10,5,8\}$, $m=6$) the estimate is $\hat{\lambda} = 31/6 \approx 5.2$. With it ($m=7$, sum $= 131$) the estimate becomes $\hat{\lambda} = 131/7 \approx 18.7$. The MLE estimator is the sample mean and is therefore sensitive to outliers: a single anomalous observation shifts the estimate substantially.
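The two estimates from part 3, checked numerically:

```python
data = [1, 2, 5, 10, 5, 8, 100]   # thousands of visitors; last entry is Black Friday

lam_all = sum(data) / len(data)                     # 131 / 7, about 18.7
lam_no_outlier = sum(data[:-1]) / (len(data) - 1)   # 31 / 6, about 5.2
```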

Question 17 — Correlated Dataset and SGD Mini-Batch Size (15 points)

You are given a dataset with $K$ classes for multi-class neural-network classification. Initial exploration reveals that the dataset contains heavily correlated examples per class.

Part A (5 points)

Explain why a strongly correlated dataset may not be a good one for training.

Part B (10 points)

Given that you must work with this dataset, what is the best choice of SGD mini-batch size and why? How are the gradients affected by correlation within a mini-batch?

Solution

Part A. Correlated examples within a class contain redundant information. The classifier cannot learn the full diversity of intra-class variation, so even a small distributional shift at test time (e.g. a slightly different viewpoint or lighting condition) will cause large performance drops. The model memorises the dominant template of each class rather than its true decision boundary.

Part B. Correlated examples yield correlated gradient estimates within each mini-batch. Normally, you want each gradient update to be driven by a representative sample of all classes, which motivates a batch size of order $M \times K$ (a multiple $M$ of the number of classes $K$, limited by GPU memory). With correlated data, however, increasing $M$ beyond 1 gives almost no additional information per batch: the gradient estimates remain highly correlated regardless of batch size. Therefore the optimal strategy is a small mini-batch size (even $M=1$, i.e. stochastic gradient descent), accepting higher gradient variance in exchange for more frequent and more diverse updates.