Question 1 — Gridworld Policies and the Bellman Equation (15 points)
Consider a 3×4 gridworld with 2 terminal states and 1 wall (impassable). The agent has 4 deterministic actions (N, S, E, W). (a) How many deterministic policies exist in this gridworld? Give both the general formula and the numerical answer. (b) In a separate 5×5 gridworld under a uniform random policy with discount γ = 0.9, there are two special states A and B. State A teleports the agent to A′ with reward +10; state B teleports to B′ with reward +5. Using the Bellman policy-evaluation equation, explain qualitatively why v_π(A) < 10 while v_π(B) > 5.
Solution
(a) The number of deterministic policies is |A|^|S|, where |S| is the number of non-terminal states and |A| = 4 the number of actions. The grid has 3 × 4 = 12 cells minus 2 terminal states and 1 wall, giving |S| = 9. Hence 4^9 = 262,144 deterministic policies. (A common error is to write 9^4, which is incorrect: the base counts actions, the exponent counts states.) (b) Because the teleports are deterministic, the Bellman equation gives v_π(A) = 10 + γ·v_π(A′) and v_π(B) = 5 + γ·v_π(B′). In Sutton & Barto's standard 5×5 example one finds v_π(A) ≈ 8.8 and v_π(B) ≈ 5.3 under the uniform random policy. The reason: A′ lies near the grid edge, where the random policy frequently bumps into walls and accrues negative return, so v_π(A′) < 0 and the discounted future stream from A′ cannot compensate for the immediate +10; hence v_π(A) < 10. For B, the destination B′ sits in the interior of the grid, where the random policy accumulates additional positive return, so v_π(B′) > 0 and v_π(B) = 5 + γ·v_π(B′) > 5. The key insight: the Bellman equation accounts for the entire discounted stream from the destination state, not merely the immediate teleport reward, and a discount factor strictly below 1 strictly shrinks future contributions.
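A quick numerical check of both parts (the destination values v_π(A′) ≈ −1.3 and v_π(B′) ≈ +0.4 are taken from Sutton & Barto's published example and assumed here):

```python
# Sanity checks for Question 1. The destination-state values below are
# assumed from Sutton & Barto's 5x5 gridworld example, for illustration.

# (a) Number of deterministic policies: |A|^|S|
n_actions = 4
n_states = 3 * 4 - 2 - 1   # 12 cells minus 2 terminal states and 1 wall
n_policies = n_actions ** n_states
print(n_policies)  # 262144

# (b) One-step Bellman relation through the deterministic teleports
gamma = 0.9
v_A_prime, v_B_prime = -1.3, 0.4   # assumed from the standard example
v_A = 10 + gamma * v_A_prime       # < 10: future stream is negative
v_B = 5 + gamma * v_B_prime        # > 5: future stream is positive
print(round(v_A, 2), round(v_B, 2))
```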
Question 2 — Return Computation and Discounting (15 points)
An MDP has a single non-terminal state with two actions, a_1 and a_2. Action a_1 yields one infinite reward sequence, while action a_2 yields another. Compute q(a_1) and q(a_2) in closed form as functions of the discount γ, and determine the values of γ for which one action is preferred over the other.
Solution
For action a_1, write out the return G = r_1 + γ·r_2 + γ^2·r_3 + …; because the reward stream repeats, the geometric series Σ_{k≥0} γ^k = 1/(1−γ) collapses q(a_1) into closed form. Equivalently, the Bellman recursion q = r + γ·q gives the same answer. The same computation yields q(a_2). Setting q(a_1) = q(a_2) and solving for γ gives a single crossover value γ*. Comparing:
- If γ < γ*: the action whose rewards arrive earlier has the higher value and is preferred.
- If γ = γ*: the two values are equal.
- If γ > γ*: the action whose rewards arrive later but sum larger overtakes it and is preferred.
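As a concrete numerical check (the exam's actual reward sequences are elided above, so the two streams below, a_1 paying +1 every step and a_2 paying 0 then +2 forever, are assumed purely for illustration; their crossover is at γ = 0.5):

```python
# Illustrative only: assumed reward streams with closed forms
# q(a1) = 1/(1-gamma) and q(a2) = 2*gamma/(1-gamma); crossover at gamma = 0.5.
def discounted_return(rewards, gamma):
    """Finite-horizon approximation of sum_k gamma^k * r_{k+1}."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

gamma = 0.6
H = 2000  # horizon long enough for the geometric tail to vanish

q1 = discounted_return([1] * H, gamma)              # 1 / (1 - gamma)
q2 = discounted_return([0] + [2] * (H - 1), gamma)  # 2*gamma / (1 - gamma)

assert abs(q1 - 1 / (1 - gamma)) < 1e-9
assert abs(q2 - 2 * gamma / (1 - gamma)) < 1e-9
# gamma = 0.6 lies above the crossover, so the delayed stream wins.
print(q1 < q2)  # True
```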
Question 3 — Hidden Markov Model Transition / Observation Models (15 points)
To model humidity as a proxy for temperature, let the hidden state be X_t ∈ {hot, cold}, governed by a given transition table, and let the observed humidity be E_t ∈ {S, M, L}, governed by a given emission table. (a) Write out the transition and measurement models as conditional probabilities. (b) Explain qualitatively how you would compute the posterior P(X_t | e_{1:t}) for the observation sequence "SMSL" using the forward algorithm.
Solution
(a) Transition model: P(X_{t+1} = x′ | X_t = x) for x, x′ ∈ {hot, cold}, read off the transition table. Measurement model: P(E_t = e | X_t = x) for e ∈ {S, M, L}, read off the emission table. (b) The forward algorithm initialises α_1(x) = P(e_1 | x)·P(X_1 = x) with a prior over the initial state, then recursively updates
α_{t+1}(x) = P(e_{t+1} | x) · Σ_{x′} P(x | x′) · α_t(x′).
The filtered posterior is P(X_t = x | e_{1:t}) = α_t(x) / Σ_{x′} α_t(x′). Apply this once per observation in "SMSL" to obtain the posterior at each step.
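In code, the forward recursion looks like this (the exam's transition and emission tables are not reproduced above, so the numbers below are assumed purely for illustration):

```python
# Forward algorithm sketch for the hot/cold HMM. All table values are
# assumed stand-ins, not the exam's actual numbers.
states = ["hot", "cold"]
trans = {"hot": {"hot": 0.7, "cold": 0.3},
         "cold": {"hot": 0.4, "cold": 0.6}}        # P(next | current), assumed
emit = {"hot": {"S": 0.6, "M": 0.3, "L": 0.1},
        "cold": {"S": 0.1, "M": 0.3, "L": 0.6}}    # P(obs | state), assumed
prior = {"hot": 0.5, "cold": 0.5}

def forward_filter(observations):
    """Return the filtered posterior P(X_t | e_1:t) after each observation."""
    posteriors = []
    # initialisation: alpha_1(x) = P(e1 | x) * P(x)
    alpha = {x: emit[x][observations[0]] * prior[x] for x in states}
    for t, e in enumerate(observations):
        if t > 0:  # recursion: weight transitioned mass by the likelihood
            alpha = {x: emit[x][e] * sum(trans[xp][x] * alpha[xp] for xp in states)
                     for x in states}
        z = sum(alpha.values())  # normaliser
        posteriors.append({x: alpha[x] / z for x in states})
    return posteriors

post = forward_filter("SMSL")
print(post[-1])
```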
Question 4 — Kalman Filter Behaviour and Noise Trade-offs (10 points)
An agent moves in a 1-D world with constant velocity. A scalar Kalman filter is implemented with equal process and measurement noise standard deviations (σ_process = σ_meas). (a) Explain why, at the start of filtering, the estimate tracks the noisy measurements closely, whereas after several steps the estimate smooths over them. (b) What happens if the process-noise variance is set too low? Too high? (c) If the measurement matrix H = 0, is a meaningful state estimate still possible? Why or why not?
Solution
(a) The Kalman gain K = P Hᵀ(H P Hᵀ + R)⁻¹ is close to 1 when the predicted covariance P is large (initial uncertainty), so the update weights the measurement heavily. As iterations proceed, P decreases and the gain converges to its (nonzero) steady-state value, so the filter blends prediction and measurement in a fixed ratio rather than tracking each measurement directly; visually this looks like smoothing. (b) If the process variance Q is set artificially low, the filter becomes over-confident in its motion model and fails to react to real changes in the system. Set it too high and the filter becomes over-reactive, essentially mirroring the noisy measurements. (c) No. If H = 0, the measurement carries no information about the state (z = v is pure noise, independent of x), so the gain is zero and the correction step is a no-op. The filter degenerates into pure prediction and its covariance grows without bound.
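A minimal scalar simulation illustrates part (a); a random-walk model tracking a constant state with assumed unit noise variances is used here, which is simpler than the exam's constant-velocity setup but shows the same gain behaviour:

```python
# Minimal scalar Kalman filter: the gain starts near 1 under large initial
# uncertainty and settles to a nonzero steady-state value.
import random

random.seed(0)
q = 1.0   # process noise variance (assumed)
r = 1.0   # measurement noise variance (assumed)

x_est, p = 0.0, 1000.0   # large initial uncertainty
gains = []
for _ in range(50):
    # predict (random-walk model, H = 1)
    p = p + q
    # update
    k = p / (p + r)                   # Kalman gain
    z = random.gauss(0.0, r ** 0.5)   # noisy measurement of a constant state
    x_est = x_est + k * (z - x_est)
    p = (1 - k) * p
    gains.append(k)

print(round(gains[0], 3), round(gains[-1], 3))  # near 1, then ~0.618
```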
Question 5 — Bayes Filter for Occupancy Mapping (15 points)
A robotic vacuum's cell-level position is known from ceiling cameras. The robot measures distances via an infinite-range lidar. You want to estimate the binary occupancy map m for an N-cell grid. (a) Write down the posterior you need to estimate and apply Bayes' rule to it. (b) Explain why the full joint is intractable for large N, and give the simplified factored form under the usual cell-independence assumption. (c) Sketch the Bayes-filter update in pseudocode.
Solution
(a) The target posterior is p(m | z_{1:t}, x_{1:t}), the distribution over maps given all measurements and known poses. Bayes' rule gives the recursive form
p(m | z_{1:t}, x_{1:t}) ∝ p(z_t | m, x_t) · p(m | z_{1:t−1}, x_{1:t−1}).
(b) With N binary cells, the joint has 2^N possible configurations, which is computationally infeasible. Assuming cells are independent given the measurements,
p(m | z_{1:t}, x_{1:t}) = Π_i p(m_i | z_{1:t}, x_{1:t}),
which reduces the problem to N independent binary Bayes filters (one per cell). (c) Each cell is updated independently in log-odds form to avoid numerical underflow and turn the Bayes update into a simple addition. Define l_{t,i} = log[ p(m_i | z_{1:t}) / (1 − p(m_i | z_{1:t})) ] and l_0 as the prior log-odds. The recursive update is
l_{t,i} = l_{t−1,i} + L(m_i | z_t, x_t) − l_0,
where the middle term is the inverse-sensor-model log-odds for cell i. Pseudocode:
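A per-cell sketch in Python (the inverse sensor model below is a hypothetical stand-in):

```python
import math

L0 = 0.0  # prior log-odds (p = 0.5)

def logodds(p):
    return math.log(p / (1.0 - p))

def inverse_sensor_model(cell, beam):
    """Hypothetical stand-in: returns p(m_i = occupied | z_t, x_t)."""
    return 0.7 if cell == beam["hit_cell"] else 0.3

def update_map(l, beam):
    """One log-odds Bayes-filter step over all cells touched by a beam."""
    for i in beam["cells"]:
        l[i] = l[i] + logodds(inverse_sensor_model(i, beam)) - L0
    return l

# toy run: a beam passes through cells 0..2 and hits cell 2
l = [L0] * 4
beam = {"cells": [0, 1, 2], "hit_cell": 2}
l = update_map(l, beam)
# recover probabilities: p = 1 - 1/(1 + exp(l))
probs = [1 - 1 / (1 + math.exp(v)) for v in l]
print([round(p, 2) for p in probs])  # [0.3, 0.3, 0.7, 0.5]
```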
Question 6 — Multi-Head Self-Attention and the Role of W_O (15 points)
Consider a transformer layer with three attention heads, each producing an output Z_i. (a) Why are multiple heads used instead of a single larger head? (b) Write the concatenation and output-projection equations and explain the role of the projection matrix W_O. (c) What happens if W_O is fixed to the identity matrix? (d) Why is the position-wise feed-forward network (FFN) necessary after self-attention?
Solution
(a) Different heads can specialise in complementary relational structure, e.g. one head on syntactic (subject–verb) dependencies, another on semantics, a third on long-range coreference. These roles emerge during training and give the layer richer relational capacity than a single head of equal total width. (b) Concatenation and projection:
MultiHead(X) = Concat(Z_1, Z_2, Z_3) · W_O.
W_O linearly mixes features across heads and projects back into the residual-stream dimension. Without it, heads cannot interact within the attention sub-layer. (c) If W_O = I (which requires the concatenated width to equal the model dimension, as is the case here), the heads' outputs are simply concatenated with no learned cross-head mixing inside the attention sub-layer; each head's contribution stays in its own slice of the residual stream. Cross-head interaction can still happen later via the FFN and subsequent layers, so the loss is partial rather than total. (d) Self-attention is a (weighted) linear combination of value vectors: without additional nonlinearity, the layer is effectively linear in token content. The FFN applies a non-linear transformation (e.g. GELU) token-wise, giving the model capacity to form higher-level abstractions. Intuitively: attention decides where to look; the FFN decides what to compute with what it found.
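The concatenation-and-projection step can be sketched as follows (dimensions assumed: model width 12 split across 3 heads):

```python
# Sketch of 3-head self-attention with concatenation and output projection.
import numpy as np

rng = np.random.default_rng(0)
T, d_model, n_heads = 5, 12, 3
d_head = d_model // n_heads  # 4

X = rng.normal(size=(T, d_model))
Wq = rng.normal(size=(n_heads, d_model, d_head))
Wk = rng.normal(size=(n_heads, d_model, d_head))
Wv = rng.normal(size=(n_heads, d_model, d_head))
Wo = rng.normal(size=(d_model, d_model))  # the cross-head mixing matrix W_O

def softmax(a, axis=-1):
    e = np.exp(a - a.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

heads = []
for h in range(n_heads):
    Q, K, V = X @ Wq[h], X @ Wk[h], X @ Wv[h]
    A = softmax(Q @ K.T / np.sqrt(d_head))  # (T, T) attention weights
    heads.append(A @ V)                     # (T, d_head)

out = np.concatenate(heads, axis=-1) @ Wo   # (T, d_model)
print(out.shape)  # (5, 12)
```

With `Wo = np.eye(d_model)` instead, each head's output would stay in its own 4-wide slice of the 12-wide result, which is exactly the situation described in part (c).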
Question 7 — Vision Transformer: Patching, Attention and Classifier Head (15 points)
You wish to classify an H × W image using a ViT encoder with patch size P and a single attention head. (a) How many patches are produced and what is their dimensionality? How are they embedded and position-encoded? (b) Write down the query / key / value computation and the attention equation (assume a shared dimension d for queries and keys). (c) Describe the classifier head on top of the encoder and dimension the final tensor for a C-class problem.
Solution
(a) The image is split into N = HW/P^2 flattened patches, each of dimension P^2·C (patch pixels times channels). Each patch is mapped to a D-dimensional embedding by a trainable linear projection. A learnable positional embedding encoding the 2-D location of each patch is added to the patch embeddings. (b) For the patch-embedding matrix X, with projection matrices W_Q, W_K, W_V (often stored as one joint matrix):
Q = X W_Q, K = X W_K, V = X W_V, Attention(Q, K, V) = softmax(Q Kᵀ / √d) V.
(c) A [CLS] token (or global-average-pool of patch tokens) feeds an MLP classification head ending in a softmax over C classes; the final posterior is a vector in R^C (or shape B × C for a batch of size B). The full forward pass stacks: patch-embed → +positional → LayerNorm → self-attention → residual → LayerNorm → FFN → residual → pool → Linear → softmax.
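A patching-and-embedding sketch under assumed dimensions (32×32×3 image, patch size 8, embedding dim 64; the exam's exact numbers are elided above):

```python
# Patch extraction and linear embedding for a ViT front end.
# Image size, patch size and embedding dim are assumed for illustration.
import numpy as np

rng = np.random.default_rng(0)
H = W = 32; C = 3; P = 8; D = 64

img = rng.normal(size=(H, W, C))
n_patches = (H // P) * (W // P)                 # 16 patches
patches = (img.reshape(H // P, P, W // P, P, C)
              .transpose(0, 2, 1, 3, 4)
              .reshape(n_patches, P * P * C))   # each patch flattened to 192 dims

W_embed = rng.normal(size=(P * P * C, D))       # trainable linear projection
pos = rng.normal(size=(n_patches, D))           # learnable positional embeddings
tokens = patches @ W_embed + pos                # (16, 64) encoder input
print(patches.shape, tokens.shape)  # (16, 192) (16, 64)
```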
Question 8 — Skip-gram Word2Vec Objective (10 points)
Define the skip-gram loss for a center word w_t with a symmetric context window of size m, and explain why the same output projection matrix is reused for every context position.
Solution
Let the context positions be t−m, …, t−1, t+1, …, t+m. The skip-gram assumption is that context words are conditionally independent given the center word. The loss sums the per-position cross-entropies over all neighbours:
L_t = − Σ_{−m ≤ j ≤ m, j ≠ 0} log p(w_{t+j} | w_t).
The output (projection) matrix is shared across all context positions because the skip-gram model assumes that the distribution of context words depends only on the center word and not on its position relative to the center. Using different output matrices per position would encode positional dependence, contradicting the conditional-independence assumption that justifies the factorisation above.
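A tiny numerical sketch (vocabulary size, dimensions, and the context window below are assumed):

```python
# Skip-gram loss with a shared output projection: p(context | center) is
# softmax(W_out @ v_center), with one W_out reused for every position.
import numpy as np

rng = np.random.default_rng(0)
V, D, m = 10, 4, 2                 # vocab size, embedding dim, window (assumed)
W_in = rng.normal(size=(V, D))     # center-word embeddings
W_out = rng.normal(size=(V, D))    # shared output projection

def skipgram_loss(center, context_ids):
    v = W_in[center]
    logits = W_out @ v
    # log-softmax, computed stably
    logp = logits - logits.max() - np.log(np.exp(logits - logits.max()).sum())
    return -sum(logp[c] for c in context_ids)  # sum of cross-entropies

# center word 3 with context positions t-2, t-1, t+1, t+2 → words [1, 2, 4, 5]
loss = skipgram_loss(3, [1, 2, 4, 5])
print(loss > 0)  # True: negative log-likelihoods are positive
```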
Question 9 — CLIP Contrastive Image–Text Learning (10 points)
(a) Sketch the CLIP architecture at the block level (image encoder, text encoder, projection layers, similarity computation). (b) Explain the contrastive loss used to train it and what each component encourages.
Solution
(a) The image encoder is a CNN (e.g., ResNet) or a ViT that maps an image into a feature vector, followed by a linear projection into a shared embedding space. A parallel text encoder (usually a transformer) maps the caption into the same shared space via its own linear projection. Similarity between image and text embeddings is computed via a (scaled) dot product, producing an N × N similarity matrix for a batch of N image–text pairs. (b) CLIP uses a symmetric InfoNCE loss over the similarity matrix S with entries
s_ij = (v_i · u_j) / τ,
where v_i is the (normalised) image embedding, u_j the (normalised) text embedding, and τ is a learned temperature. Define
L_img→txt = −(1/N) Σ_i log[ exp(s_ii) / Σ_j exp(s_ij) ],
the analogous L_txt→img over columns, and the total loss L = (L_img→txt + L_txt→img)/2. Each row's softmax cross-entropy implicitly pulls the matched (diagonal) image–text pair together while pushing the in-batch negatives apart through the denominator. The symmetric form ensures the embedding is learned consistently in both directions (image→text retrieval and text→image retrieval).
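The symmetric loss can be sketched as follows (batch size, dimensions, and temperature below are assumed):

```python
# Symmetric InfoNCE over an assumed batch of N = 4 normalised embeddings.
import numpy as np

rng = np.random.default_rng(0)
N, D, tau = 4, 8, 0.07

def normalise(a):
    return a / np.linalg.norm(a, axis=1, keepdims=True)

img = normalise(rng.normal(size=(N, D)))
txt = normalise(img + 0.1 * rng.normal(size=(N, D)))  # matched pairs are similar

S = img @ txt.T / tau          # (N, N) similarity logits

def xent_diagonal(logits):
    """Mean cross-entropy with the diagonal as the correct class per row."""
    logits = logits - logits.max(axis=1, keepdims=True)
    logp = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(logp))

loss = 0.5 * (xent_diagonal(S) + xent_diagonal(S.T))  # image->text + text->image
print(loss >= 0)  # True
```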
Question 10 — Siamese Networks and Contrastive Loss (10 points)
You are given a Siamese network with two identical CNN branches that share weights, used to decide whether two input images belong to the same class (e.g., both person) or different classes (e.g., person vs. animal). The contrastive loss for a pair with label y (0 = same class, 1 = different class) is
L(y, D) = (1 − y)·D^2 + y·max(0, m − D)^2,
where D is the Euclidean distance between the two CNN embeddings and m is a margin.
(a) Sketch (or describe) the network architecture and explain why both branches must share weights.
(b) Explain the role of each of the two terms in the loss and what the network learns to do for matched vs. unmatched pairs.
(c) Why is the margin necessary in the second term?
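The loss above can be sketched numerically; the two embedding vectors below stand in for the outputs of the shared-weight CNN branches:

```python
# Contrastive-loss sketch with stand-in embeddings (no CNN involved).
import math

def contrastive_loss(emb_a, emb_b, y, margin=1.0):
    """y = 0 for same-class pairs, y = 1 for different-class pairs."""
    d = math.dist(emb_a, emb_b)  # Euclidean distance between embeddings
    return (1 - y) * d ** 2 + y * max(0.0, margin - d) ** 2

anchor = [0.0, 0.0]
close, far = [0.1, 0.2], [0.9, 1.4]

# matched pair: loss is the squared distance (drives d toward 0)
same = contrastive_loss(anchor, close, y=0)
# mismatched pair inside the margin: penalised until d reaches m
diff_near = contrastive_loss(anchor, close, y=1)
# mismatched pair already beyond the margin: zero loss, no further pushing
diff_far = contrastive_loss(anchor, far, y=1)
print(same, diff_near, diff_far)
```

Note how `diff_far` is exactly zero: without the margin, the second term would keep pushing mismatched pairs apart forever, which is the point of part (c).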

