
Question 1 — Gridworld Policies and the Bellman Equation (15 points)

Consider a 3×4 gridworld with 2 terminal states and 1 wall (impassable). The agent has 4 deterministic actions (N, S, E, W). (a) How many deterministic policies exist in this gridworld? Give both the general formula and the numerical answer. (b) In a separate 5×5 gridworld under a uniform random policy with discount $\gamma = 0.9$, there are two special states A and B. State A teleports the agent to $A'$ with reward $+10$; state B teleports to $B'$ with reward $+5$. Using the Bellman policy-evaluation equation
$$v(s) = \sum_a \pi(a\mid s) \sum_{s'} p(s'\mid s,a)\bigl[r(s,a,s') + \gamma v(s')\bigr],$$
explain qualitatively why $v(A) < 10$ while $v(B) > 5$.

Solution

(a) The number of deterministic policies is $|\Pi| = |A|^{|S|}$, where $|S|$ is the number of non-terminal states. The grid has $3\times 4 = 12$ cells minus 2 terminal states and 1 wall, giving $|S| = 9$. Hence $|\Pi| = 4^9 = 262{,}144$. (A common error is to write $|S|^{|A|}$, which is incorrect.) (b) Because the teleports are deterministic, the Bellman equation gives $v(A) = 10 + \gamma\, v(A')$ and $v(B) = 5 + \gamma\, v(B')$. In Sutton & Barto's standard 5×5 example, $A'$ lies on the bottom edge, where the uniform random policy frequently bumps off the grid and collects $-1$ penalties, so $v(A') \approx -1.3 < 0$ and therefore $v(A) \approx 10 + 0.9\cdot(-1.3) \approx 8.8 < 10$. The destination $B'$, by contrast, sits in the interior of the grid, where the random policy still accumulates positive discounted return, so $v(B') \approx 0.4 > 0$ and $v(B) = 5 + 0.9\, v(B') \approx 5.3 > 5$. The key insight: the Bellman equation accounts for the entire discounted stream of rewards from the destination state, not merely the immediate teleport reward, so the sign of the destination's value decides whether the teleport state is worth more or less than its immediate reward.
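The count in (a) can be verified in a few lines (a sketch; the variable names are ours, not the exam's):

```python
# Hypothetical illustration: count deterministic policies for the 3x4 grid
# with 2 terminal states and 1 wall, as described in the question.
n_cells = 3 * 4
n_terminal, n_walls = 2, 1
n_states = n_cells - n_terminal - n_walls   # non-terminal states: |S| = 9
n_actions = 4                               # N, S, E, W
n_policies = n_actions ** n_states          # |A| ** |S|
print(n_policies)  # 262144
```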

Question 2 — Return Computation and Discounting (15 points)

An MDP has a single non-terminal state $S_0$ with two actions, $L$ and $R$. Action $L$ yields rewards $1, 0, 1, 0, \dots$, while action $R$ yields $0, 2, 0, 2, \dots$. Compute $V_L(S_0)$ and $V_R(S_0)$ in closed form as functions of the discount $\gamma$, and determine the values of $\gamma$ for which $L$ is preferred over $R$.

Solution

For action $L$:
$$V_L(S_0) = 1 + \gamma^2 + \gamma^4 + \cdots = \sum_{k=0}^{\infty} \gamma^{2k} = \frac{1}{1-\gamma^2}.$$
Equivalently, the Bellman recursion $V_L(S_0) = 1 + \gamma^2 V_L(S_0)$ gives the same answer. For action $R$:
$$V_R(S_0) = 2\gamma + 2\gamma^3 + \cdots = 2\gamma \sum_{k=0}^{\infty} \gamma^{2k} = \frac{2\gamma}{1-\gamma^2}.$$
Comparing the two (equivalently, comparing $1$ with $2\gamma$):
  • If $\gamma < 0.5$: $V_L > V_R$, prefer $L$.
  • If $\gamma = 0.5$: they are equal.
  • If $\gamma > 0.5$: $V_R > V_L$, prefer $R$.
A far-sighted agent (large $\gamma$) prefers the larger but delayed reward; a myopic agent prefers the immediate reward.
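The closed forms and the crossover at $\gamma = 0.5$ can be checked numerically (a sketch; the truncation horizon of 1000 steps is our choice):

```python
# Numerical check of V_L = 1/(1 - g^2) and V_R = 2g/(1 - g^2)
# via truncated return sums.
def v_l(g, horizon=1000):
    return sum(g ** t for t in range(0, horizon, 2))      # rewards 1,0,1,0,...

def v_r(g, horizon=1000):
    return sum(2 * g ** t for t in range(1, horizon, 2))  # rewards 0,2,0,2,...
```

At $\gamma = 0.5$ both sums converge to $4/3$, matching the indifference point.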

Question 3 — Hidden Markov Model Transition / Observation Models (15 points)

To model humidity as a proxy for temperature, let the hidden state be $x_t \in \{H, C\}$ (hot, cold) with transition table
$$\begin{array}{c|cc} & H_{t+1} & C_{t+1} \\ \hline H_t & 0.7 & 0.3 \\ C_t & 0.4 & 0.6 \end{array}$$
and observable humidity $z_t \in \{S, M, L\}$ with emission table
$$\begin{array}{c|ccc} & S & M & L \\ \hline H & 0.1 & 0.4 & 0.5 \\ C & 0.7 & 0.2 & 0.1 \end{array}$$
(a) Write out the transition and measurement models as conditional probabilities. (b) Explain qualitatively how you would compute the posterior $p(x_t \mid z_{1:t})$ for the observation sequence "SMSL" using the forward algorithm.

Solution

(a) Transition model:
$$p(x_{t+1}=H \mid x_t=H) = 0.7, \quad p(x_{t+1}=C \mid x_t=H) = 0.3,$$
$$p(x_{t+1}=H \mid x_t=C) = 0.4, \quad p(x_{t+1}=C \mid x_t=C) = 0.6.$$
Measurement model:
$$p(z=S \mid H)=0.1,\ p(z=M \mid H)=0.4,\ p(z=L \mid H)=0.5,$$
$$p(z=S \mid C)=0.7,\ p(z=M \mid C)=0.2,\ p(z=L \mid C)=0.1.$$
(b) The forward algorithm initialises $\alpha_1(x_1) = p(x_1)\,p(z_1\mid x_1)$ with a prior over the initial state, then recursively updates
$$\alpha_t(x_t) = p(z_t\mid x_t) \sum_{x_{t-1}} p(x_t\mid x_{t-1})\, \alpha_{t-1}(x_{t-1}).$$
The filtered posterior is $p(x_t\mid z_{1:t}) = \alpha_t(x_t) / \sum_{x_t} \alpha_t(x_t)$. Apply this once per observation in "SMSL" to obtain the posterior at each step.
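The recursion in (b) can be run directly on the given tables. A runnable sketch; the uniform prior $p(x_1) = (0.5, 0.5)$ is our assumption, since the exam leaves the initial distribution unspecified:

```python
# Forward algorithm for the HMM above, observation sequence "SMSL".
states = ("H", "C")
trans = {"H": {"H": 0.7, "C": 0.3}, "C": {"H": 0.4, "C": 0.6}}
emit = {"H": {"S": 0.1, "M": 0.4, "L": 0.5},
        "C": {"S": 0.7, "M": 0.2, "L": 0.1}}

def forward_filter(obs, prior=(0.5, 0.5)):
    # initialise: alpha_1(x) = p(x) * p(z_1 | x)
    alpha = {s: p * emit[s][obs[0]] for s, p in zip(states, prior)}
    for z in obs[1:]:
        # recursion: alpha_t(x) = p(z_t | x) * sum_x' p(x | x') * alpha_{t-1}(x')
        alpha = {s: emit[s][z] * sum(trans[sp][s] * alpha[sp] for sp in states)
                 for s in states}
    total = sum(alpha.values())              # normalise to get p(x_t | z_{1:t})
    return {s: a / total for s, a in alpha.items()}

post = forward_filter("SMSL")
print(post)
```

The final observation L strongly favours hot, so the filtered posterior ends up weighted towards H despite the earlier S observations.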

Question 4 — Kalman Filter Behaviour and Noise Trade-offs (10 points)

An agent moves in a 1-D world with constant velocity. A scalar Kalman filter is implemented with equal process and measurement noise standard deviations $\sigma_v = \sigma_w = 5$. (a) Explain why, at the start of filtering, the estimate tracks the noisy measurements closely, whereas after several steps the estimate smooths over them. (b) What happens if the process-noise variance is set too low? Too high? (c) If the measurement matrix $C_t = 0$, is a meaningful state estimate still possible? Why or why not?

Solution

(a) The Kalman gain
$$K_t = \frac{P_{t|t-1}}{P_{t|t-1} + \sigma_w^2}$$
is close to 1 when the predicted covariance $P_{t|t-1}$ is large (initial uncertainty), so the update weights the measurement heavily. As iterations proceed, $P_{t|t}$ decreases and the gain converges to its (nonzero) steady-state value, so the filter blends prediction and measurement in a fixed ratio rather than tracking each measurement directly; visually this looks like smoothing. (b) If the process variance is set artificially low, the filter becomes over-confident in its model and fails to react to real changes in the system. Set it too high and the filter becomes over-reactive, essentially mirroring the noisy measurements. (c) No. If $C_t = 0$, the measurement carries no information about the state ($z_t$ is independent of $x_t$), so the correction step is a no-op. The filter degenerates into pure prediction and its covariance grows unboundedly.
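The gain behaviour in (a) can be seen by iterating just the covariance/gain recursion. A sketch under an assumed scalar random-walk state model (the exam's constant-velocity model would use a 2-D state, but the gain dynamics are the same in spirit):

```python
# Scalar Kalman covariance/gain recursion with sigma_v = sigma_w = 5.
q = 5.0 ** 2   # process noise variance sigma_v^2
r = 5.0 ** 2   # measurement noise variance sigma_w^2
p = 1e6        # large initial covariance: very uncertain at the start
gains = []
for _ in range(50):
    p_pred = p + q                 # predict: uncertainty grows by q
    k = p_pred / (p_pred + r)      # Kalman gain
    p = (1 - k) * p_pred           # correct: uncertainty shrinks
    gains.append(k)
print(gains[0], gains[-1])
```

The first gain is essentially 1 (track the measurement), and the gain settles to a fixed intermediate value, which is exactly the "fixed blending ratio" described above.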

Question 5 — Bayes Filter for Occupancy Mapping (15 points)

A robotic vacuum's cell-level position $s_{1:t}$ is known from ceiling cameras. The robot measures distances $z_{1:t}$ via an infinite-range lidar. You want to estimate the binary occupancy map $\mathbf m = [m_1,\dots,m_K]$ for a $100\times 100$ grid. (a) Write down the posterior you need to estimate and apply Bayes' rule to it. (b) Explain why the full joint is intractable for $K = 10{,}000$ cells, and give the simplified factored form under the usual cell-independence assumption. (c) Sketch the Bayes-filter update in pseudocode.

Solution

(a) The target posterior is
$$p(\mathbf m \mid z_{1:t}, s_{1:t}) = \frac{p(z_{1:t}\mid \mathbf m, s_{1:t})\, p(\mathbf m \mid s_{1:t})}{p(z_{1:t}\mid s_{1:t})}.$$
(b) With $K = 10^4$ binary cells, the joint has $2^{10{,}000}$ possible configurations, which is computationally infeasible. Assuming cells are independent given the measurements,
$$p(\mathbf m \mid z_{1:t}, s_{1:t}) = \prod_{i=1}^{K} p(m_i \mid z_{1:t}, s_{1:t}),$$
which reduces the problem to $K$ independent binary Bayes filters (one per cell). (c) Each cell is updated independently in log-odds form to avoid numerical underflow and turn the Bayes update into a simple addition. Define
$$\ell_{t,i} = \log\frac{p(m_i=1\mid z_{1:t},s_{1:t})}{p(m_i=0\mid z_{1:t},s_{1:t})}$$
and $\ell_0$ as the prior log-odds. The recursive update is
$$\ell_{t,i} = \ell_{t-1,i} + \log\frac{p(m_i=1\mid z_t, s_t)}{p(m_i=0\mid z_t, s_t)} - \ell_0,$$
where the middle term is the inverse-sensor-model log-odds for cell $i$. Pseudocode:
def occupancy_grid_update(log_odds, s_t, z_t, l0):
    # only cells the current scan says something about are touched
    for i in cells_in_sensor_cone(s_t, z_t):
        # Bayes update in log-odds: add the inverse-sensor-model
        # log-odds for this cell and subtract the prior l0
        log_odds[i] += inverse_sensor_log_odds(i, s_t, z_t) - l0
    return log_odds
The probability of occupancy is recovered as $p(m_i=1) = 1/(1 + e^{-\ell_{t,i}})$.
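As a runnable complement to the pseudocode, here is a single-cell log-odds filter; the inverse-sensor log-odds values 0.85 (cell looks occupied) and −0.4 (cell looks free) are illustrative choices, not part of the exam:

```python
import math

def to_prob(l):
    # recover p(m_i = 1) from log-odds l
    return 1.0 / (1.0 + math.exp(-l))

l0 = 0.0   # prior log-odds: p(occupied) = 0.5
l = l0
for meas_log_odds in (0.85, 0.85, -0.4):   # two "hit" scans, one "miss"
    l += meas_log_odds - l0                # the Bayes update is an addition
print(to_prob(l))
```

After two hits and one miss the cell ends up well above 0.5, i.e. probably occupied, and the whole update never left log-odds space.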

Question 6 — Multi-Head Self-Attention and the Role of $W^O$ (15 points)

Consider a transformer layer with three attention heads, each producing $Z^{(h)} \in \mathbb{R}^{T\times d_h}$. (a) Why are multiple heads used instead of a single larger head? (b) Write the concatenation and output-projection equations and explain the role of the projection matrix $W^O$. (c) What happens if $W^O$ is fixed to the identity matrix? (d) Why is the position-wise feed-forward network (FFN) necessary after self-attention?

Solution

(a) Different heads can specialise in complementary relational structure, e.g. one head on syntactic (subject–verb) dependencies, another on semantics, a third on long-range coreference. These roles emerge during training and give the layer richer relational capacity than a single head of equal total width. (b) Concatenation and projection:
$$Z = \mathrm{Concat}(Z^{(1)}, Z^{(2)}, Z^{(3)}) \in \mathbb{R}^{T\times d}, \qquad Y = Z W^O, \quad W^O \in \mathbb{R}^{d\times d}.$$
$W^O$ linearly mixes features across heads and projects back into the residual-stream dimension; without it, the heads' outputs cannot interact within the attention sub-layer itself. (c) If $W^O = I$ (which requires $d_{\text{model}} = \sum_h d_h$, as is the case here), the heads' outputs are simply concatenated with no learned cross-head mixing inside the attention sub-layer: each head's contribution stays in its own slice of the residual stream. Cross-head interaction can still happen later via the FFN and subsequent layers, so the loss is partial rather than total. (d) Self-attention is a (weighted) linear combination of value vectors: without additional nonlinearity, the layer is effectively linear in token content. The FFN applies a non-linear transformation (e.g. GELU) token-wise, giving the model capacity to form higher-level abstractions. Intuitively: attention decides where to look; the FFN decides what to compute with what it found.
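The difference between (b) and (c) can be made concrete with a toy example (2 tokens, 3 heads of width 1, so $d = 3$; all matrices are illustrative): with $W^O = I$ each head's output passes through unmixed, while a general $W^O$ makes every output coordinate a blend of all heads.

```python
# Toy concat-and-project: rows of Z are tokens, columns are the 3 heads' outputs.
Z = [[1.0, 2.0, 3.0],
     [4.0, 5.0, 6.0]]

def matmul(A, B):
    # plain matrix product, B traversed by columns
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

I3 = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
W_O = [[0.5, 0.5, 0.0],        # a "learned" projection that mixes heads
       [0.5, 0.0, 0.5],
       [0.0, 0.5, 0.5]]

Y_id = matmul(Z, I3)    # identity: heads stay in their own slices
Y = matmul(Z, W_O)      # each output coordinate now blends two heads
print(Y)
```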

Question 7 — Vision Transformer: Patching, Attention and Classifier Head (15 points)

You wish to classify a $224\times 224\times 3$ image using a ViT encoder with patch size $P = 28$ and a single attention head. (a) How many patches are produced and what is their dimensionality? How are they embedded and position-encoded? (b) Write down the query / key / value computation and the attention equation (assume $H=1$). (c) Describe the classifier head on top of the encoder and dimension the final tensor for a $K$-class problem.

Solution

(a) The image is split into $N = HW/P^2 = 224\cdot 224 / 28^2 = 64$ flattened patches of dimension $P^2 \cdot C = 28^2 \cdot 3 = 2352$. Each patch is mapped to a $D$-dimensional embedding by a trainable linear projection. A learnable positional embedding encoding the 2-D location $(h,w)$ of each patch is added to the patch embeddings. (b) For $z \in \mathbb{R}^{N\times D}$, with joint projection $U_{qkv} \in \mathbb{R}^{D\times 3D}$:
$$[q, k, v] = z U_{qkv}, \qquad A = \mathrm{softmax}\!\left(\frac{q k^\top}{\sqrt{D}}\right) \in \mathbb{R}^{N\times N}, \qquad \mathrm{SA}(z) = A v.$$
(c) A [CLS] token (or global-average-pool of patch tokens) feeds an MLP classification head ending in a softmax over $K$ classes; the final posterior is a vector in $\mathbb{R}^{K}$ (or shape $B\times K$ for a batch of size $B$). The full forward pass stacks: patch-embed → +positional → LayerNorm → self-attention → residual → LayerNorm → FFN → residual → pool → Linear($D\rightarrow K$) → softmax.
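The patch arithmetic in (a) and the shape flow in (b) can be sanity-checked in a few lines (the embedding width $D = 128$ is our illustrative choice; the exam does not fix it):

```python
# Patch count and per-patch dimensionality for the 224x224x3 image, P = 28.
H = W = 224
C = 3
P = 28
N = (H // P) * (W // P)        # an 8 x 8 grid of patches
patch_dim = P * P * C          # flattened patch length before embedding
D = 128                        # assumed embedding width

# Shape flow through the single-head encoder (matching the solution):
#   z: (N, D) --U_qkv--> q, k, v: (N, D) each
#   A = softmax(q k^T / sqrt(D)): (N, N)
#   SA(z) = A v: (N, D)
print(N, patch_dim)  # 64 2352
```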

Question 8 — Skip-gram Word2Vec Objective (10 points)

Define the skip-gram loss for a center word $w_t$ with a symmetric context window of size $C$, and explain why the same output projection matrix is reused for every context position.

Solution

Let the context offsets be $\mathcal C = \{-C,\dots,-1,1,\dots,C\}$. The skip-gram assumption is that context words are conditionally independent given the center word. The loss at position $t$ is the sum of cross-entropies over all $2C$ neighbours:
$$\mathcal L_t = -\sum_{j\in\mathcal C} \log p_{\text{model}}(w_{t+j} \mid w_t, \theta).$$
The output (projection) matrix is shared across all context positions because the skip-gram model assumes that the distribution of context words depends only on the center word and not on its position relative to the center. Using different output matrices per position would encode positional dependence, contradicting the conditional-independence assumption that justifies the factorisation above.
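A toy numerical sketch of this loss (the tiny vocabulary, embedding width, and all weights are invented for illustration): note that a single shared output matrix means one softmax distribution $p(\cdot \mid w_t)$ scores every window position.

```python
import math

V, d = 4, 2                                                  # toy vocab / embedding
W_in  = [[0.5, -0.2], [0.1, 0.3], [-0.4, 0.2], [0.0, 0.1]]   # center embeddings
W_out = [[0.2, 0.1], [-0.1, 0.4], [0.3, -0.3], [0.1, 0.2]]   # SHARED output matrix

def log_softmax_score(center, word):
    # log p(word | center) under a full-softmax skip-gram model
    h = W_in[center]
    scores = [sum(a * b for a, b in zip(h, W_out[w])) for w in range(V)]
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return scores[word] - log_z

# a window of 2C = 4 context words around center word 1; every position
# is scored by the SAME distribution, per the shared-matrix assumption
context = [0, 2, 3, 0]
loss = -sum(log_softmax_score(1, w) for w in context)
print(loss)
```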

Question 9 — CLIP Contrastive Image–Text Learning (10 points)

(a) Sketch the CLIP architecture at the block level (image encoder, text encoder, projection layers, similarity computation). (b) Explain the contrastive loss used to train it and what each component encourages.

Solution

(a) The image encoder is a CNN (e.g., ResNet) or a ViT that maps an image into a feature vector, followed by a linear projection into a shared embedding space. A parallel text encoder (usually a transformer) maps the caption into the same shared space via its own linear projection. Similarity between image and text embeddings is computed via a (scaled) dot product, producing an $N\times N$ similarity matrix for a batch of $N$ image–text pairs. (b) CLIP uses a single symmetric InfoNCE loss over the $N\times N$ similarity matrix $S_{ij} = \langle u_i, v_j\rangle / \tau$, where $u_i$ is the (normalised) image embedding, $v_j$ the (normalised) text embedding, and $\tau$ is a learned temperature. Define
$$L_{i\to t} = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{\exp(S_{ii})}{\sum_{j=1}^{N}\exp(S_{ij})}, \qquad L_{t\to i} = -\frac{1}{N}\sum_{j=1}^{N}\log\frac{\exp(S_{jj})}{\sum_{i=1}^{N}\exp(S_{ij})},$$
and the total loss $L = \tfrac{1}{2}(L_{i\to t} + L_{t\to i})$. Each row's softmax cross-entropy pulls the matched (diagonal) image–text pair together while pushing the $N-1$ in-batch negatives apart through the denominator. The symmetric form ensures the embedding is learned consistently in both directions (image→text retrieval and text→image retrieval).
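The symmetric loss can be computed directly from a similarity matrix. A sketch on a hand-made $2\times 2$ matrix (the values are illustrative; the diagonal holds the matched pairs):

```python
import math

S = [[5.0, 1.0],    # S[i][j] = <u_i, v_j> / tau, already temperature-scaled
     [0.5, 4.0]]
N = len(S)

def row_ce(row, target):
    # softmax cross-entropy of one row against its target index
    m = max(row)
    log_z = m + math.log(sum(math.exp(s - m) for s in row))
    return -(row[target] - log_z)

L_i2t = sum(row_ce(S[i], i) for i in range(N)) / N               # image -> text
cols = [[S[i][j] for i in range(N)] for j in range(N)]
L_t2i = sum(row_ce(cols[j], j) for j in range(N)) / N            # text -> image
L = 0.5 * (L_i2t + L_t2i)
print(L)
```

Because the diagonal dominates each row and column here, the loss is already small; shrinking the off-diagonal similarities further would drive it towards zero.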

Question 10 — Siamese Networks and Contrastive Loss (10 points)

You are given a Siamese network with two identical CNN branches that share weights, used to decide whether two input images belong to the same class (e.g., both person) or different classes (e.g., person vs. animal). The contrastive loss for a pair $(x_1, x_2)$ with label $y \in \{0, 1\}$ (0 = same class, 1 = different class) is
$$\mathcal{L}(x_1, x_2, y) = (1-y)\,\tfrac{1}{2}\,D_W^2 + y\,\tfrac{1}{2}\,\big[\max(0,\, m - D_W)\big]^2,$$
where $D_W = \lVert f_W(x_1) - f_W(x_2) \rVert_2$ is the Euclidean distance between the two CNN embeddings and $m > 0$ is a margin. (a) Sketch (or describe) the network architecture and explain why both branches must share weights. (b) Explain the role of each of the two terms in the loss and what the network learns to do for matched vs. unmatched pairs. (c) Why is the margin $m$ necessary in the second term?

Solution

(a) The Siamese network has two identical CNN towers that each map an input image into an embedding vector $f_W(x)$. The towers share weights so that the same input always produces the same embedding; only then does the Euclidean distance $D_W$ between two embeddings become a meaningful, symmetric measure of similarity in a single learned feature space. Two independent CNNs would learn two unrelated spaces, and the distance between them would be meaningless. The goal is not to classify each image but to learn an embedding in which same-class pairs are close and different-class pairs are far apart. (b) For a matched pair ($y=0$), only the first term is active: it minimises the squared Euclidean distance $D_W^2$, pulling the two embeddings together. For an unmatched pair ($y=1$), only the second term is active: it minimises $[\max(0, m - D_W)]^2$, which pushes the embeddings apart until their distance reaches the margin $m$. Together the terms shape an embedding space where intra-class distances shrink and inter-class distances grow. (c) Without a margin, the model could trivially reduce the second term by pushing unmatched pairs infinitely far apart, leading to unbounded gradients and degenerate solutions. The margin $m$ caps the penalty: once $D_W \ge m$ the pair is considered "far enough" and contributes zero loss. This focuses learning on the hard negatives (pairs that are still closer than the margin) and stabilises training.
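The loss itself is a one-liner once the embeddings are given. A runnable sketch operating on pre-computed embedding vectors (the margin $m = 2$ and the example vectors are our illustrative choices):

```python
import math

def contrastive_loss(e1, e2, y, m=2.0):
    # e1, e2: embedding vectors f_W(x1), f_W(x2); y = 0 same class, y = 1 different
    d = math.dist(e1, e2)                  # D_W, Euclidean distance
    if y == 0:                             # matched pair: pull together
        return 0.5 * d ** 2
    return 0.5 * max(0.0, m - d) ** 2      # unmatched: push out to the margin

same      = contrastive_loss([0.0, 0.0], [0.3, 0.4], y=0)  # d = 0.5, small pull cost
diff_near = contrastive_loss([0.0, 0.0], [0.3, 0.4], y=1)  # d = 0.5 < m: penalised
diff_far  = contrastive_loss([0.0, 0.0], [3.0, 4.0], y=1)  # d = 5 >= m: zero loss
print(same, diff_near, diff_far)
```

The third case shows the margin's hinge in action: an unmatched pair already farther apart than $m$ contributes nothing, which is exactly the hard-negative focus described in (c).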