
Question 1 — Gridworld Policies and the Bellman Equation (15 points)

Consider a 3×4 gridworld with 2 terminal states and 1 wall (impassable). The agent has 4 deterministic actions (N, S, E, W). (a) How many deterministic policies exist in this gridworld? Give both the general formula and the numerical answer. (b) In a separate 5×5 gridworld under a uniform random policy with discount $\gamma = 0.9$, there are two special states A and B. State A teleports the agent to $A'$ with reward $+10$; state B teleports to $B'$ with reward $+5$. Using the Bellman policy-evaluation equation
$$v(s) = \sum_a \pi(a\mid s) \sum_{s'} p(s'\mid s,a)\bigl[r(s,a,s') + \gamma v(s')\bigr],$$
explain qualitatively why $v(A) < 10$ while $v(B) > 5$.

Solution

(a) The number of deterministic policies is $|\Pi| = |A|^{|S|}$, where $|S|$ is the number of non-terminal states. The grid has $3\times 4 = 12$ cells minus 2 terminal states and 1 wall, giving $|S| = 9$. Hence $|\Pi| = 4^9 = 262{,}144$. (A common error is to write $|S|^{|A|}$, which is incorrect.) (b) Because the teleports are deterministic, the Bellman equation gives $v(A) = 10 + \gamma\, v(A')$ and $v(B) = 5 + \gamma\, v(B')$. In Sutton & Barto's standard 5×5 example, $A'$ lies on the bottom edge, where the uniform random policy frequently bumps off the grid and collects $-1$ penalties, so $v(A') \approx -1.3 < 0$ and therefore $v(A) \approx 10 + 0.9\cdot(-1.3) \approx 8.8 < 10$. The destination $B'$, by contrast, sits in the interior of the grid, where the random policy still accumulates positive discounted return, so $v(B') \approx 0.4 > 0$ and $v(B) = 5 + 0.9\, v(B') \approx 5.3 > 5$. The key insight: the Bellman equation accounts for the entire discounted stream of rewards from the destination state, not merely the immediate teleport reward, so the sign of the destination's value decides whether the teleport state is worth more or less than its immediate reward.
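The count in (a) can be verified in a few lines (a sketch; the variable names are ours, not the exam's):

```python
# Hypothetical illustration: count deterministic policies for the 3x4 grid
# with 2 terminal states and 1 wall, as described in the question.
n_cells = 3 * 4
n_terminal, n_walls = 2, 1
n_states = n_cells - n_terminal - n_walls   # non-terminal states: |S| = 9
n_actions = 4                               # N, S, E, W
n_policies = n_actions ** n_states          # |A| ** |S|
print(n_policies)  # 262144
```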

Question 2 — Return Computation and Discounting (15 points)

An MDP has a single non-terminal state $S_0$ with two actions, $L$ and $R$. Action $L$ yields rewards $1, 0, 1, 0, \dots$, while action $R$ yields $0, 2, 0, 2, \dots$. Compute $V_L(S_0)$ and $V_R(S_0)$ in closed form as functions of the discount $\gamma$, and determine the values of $\gamma$ for which $L$ is preferred over $R$.

Solution

For action $L$:
$$V_L(S_0) = 1 + \gamma^2 + \gamma^4 + \cdots = \sum_{k=0}^{\infty} \gamma^{2k} = \frac{1}{1-\gamma^2}.$$
Equivalently, the Bellman recursion $V_L(S_0) = 1 + \gamma^2 V_L(S_0)$ gives the same answer. For action $R$:
$$V_R(S_0) = 2\gamma + 2\gamma^3 + \cdots = 2\gamma \sum_{k=0}^{\infty} \gamma^{2k} = \frac{2\gamma}{1-\gamma^2}.$$
Comparing the two (equivalently, comparing $1$ with $2\gamma$):
  • If $\gamma < 0.5$: $V_L > V_R$, prefer $L$.
  • If $\gamma = 0.5$: they are equal.
  • If $\gamma > 0.5$: $V_R > V_L$, prefer $R$.
A far-sighted agent (large $\gamma$) prefers the larger but delayed reward; a myopic agent prefers the immediate reward.
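The closed forms and the crossover at $\gamma = 0.5$ can be checked numerically (a sketch; the truncation horizon of 1000 steps is our choice):

```python
# Numerical check of V_L = 1/(1 - g^2) and V_R = 2g/(1 - g^2)
# via truncated return sums.
def v_l(g, horizon=1000):
    return sum(g ** t for t in range(0, horizon, 2))      # rewards 1,0,1,0,...

def v_r(g, horizon=1000):
    return sum(2 * g ** t for t in range(1, horizon, 2))  # rewards 0,2,0,2,...
```

At $\gamma = 0.5$ both sums converge to $4/3$, matching the indifference point.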

Question 3 — Hidden Markov Model Transition / Observation Models (15 points)

To model humidity as a proxy for temperature, let the hidden state be $x_t \in \{H, C\}$ (hot, cold) with transition table
$$\begin{array}{c|cc} & H_{t+1} & C_{t+1} \\ \hline H_t & 0.7 & 0.3 \\ C_t & 0.4 & 0.6 \end{array}$$
and observable humidity $z_t \in \{S, M, L\}$ with emission table
$$\begin{array}{c|ccc} & S & M & L \\ \hline H & 0.1 & 0.4 & 0.5 \\ C & 0.7 & 0.2 & 0.1 \end{array}$$
(a) Write out the transition and measurement models as conditional probabilities. (b) Explain qualitatively how you would compute the posterior $p(x_t \mid z_{1:t})$ for the observation sequence "SMSL" using the forward algorithm.

Solution

(a) Transition model:
$$p(x_{t+1}=H \mid x_t=H) = 0.7, \quad p(x_{t+1}=C \mid x_t=H) = 0.3,$$
$$p(x_{t+1}=H \mid x_t=C) = 0.4, \quad p(x_{t+1}=C \mid x_t=C) = 0.6.$$
Measurement model:
$$p(z=S \mid H)=0.1,\ p(z=M \mid H)=0.4,\ p(z=L \mid H)=0.5,$$
$$p(z=S \mid C)=0.7,\ p(z=M \mid C)=0.2,\ p(z=L \mid C)=0.1.$$
(b) The forward algorithm initialises $\alpha_1(x_1) = p(x_1)\,p(z_1\mid x_1)$ with a prior over the initial state, then recursively updates
$$\alpha_t(x_t) = p(z_t\mid x_t) \sum_{x_{t-1}} p(x_t\mid x_{t-1})\, \alpha_{t-1}(x_{t-1}).$$
The filtered posterior is $p(x_t\mid z_{1:t}) = \alpha_t(x_t) / \sum_{x_t} \alpha_t(x_t)$. Apply this once per observation in "SMSL" to obtain the posterior at each step.
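The recursion in (b) can be run directly on the given tables. A runnable sketch; the uniform prior $p(x_1) = (0.5, 0.5)$ is our assumption, since the exam leaves the initial distribution unspecified:

```python
# Forward algorithm for the HMM above, observation sequence "SMSL".
states = ("H", "C")
trans = {"H": {"H": 0.7, "C": 0.3}, "C": {"H": 0.4, "C": 0.6}}
emit = {"H": {"S": 0.1, "M": 0.4, "L": 0.5},
        "C": {"S": 0.7, "M": 0.2, "L": 0.1}}

def forward_filter(obs, prior=(0.5, 0.5)):
    # initialise: alpha_1(x) = p(x) * p(z_1 | x)
    alpha = {s: p * emit[s][obs[0]] for s, p in zip(states, prior)}
    for z in obs[1:]:
        # recursion: alpha_t(x) = p(z_t | x) * sum_x' p(x | x') * alpha_{t-1}(x')
        alpha = {s: emit[s][z] * sum(trans[sp][s] * alpha[sp] for sp in states)
                 for s in states}
    total = sum(alpha.values())              # normalise to get p(x_t | z_{1:t})
    return {s: a / total for s, a in alpha.items()}

post = forward_filter("SMSL")
print(post)
```

The final observation L strongly favours hot, so the filtered posterior ends up weighted towards H despite the earlier S observations.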

Question 4 — Kalman Filter Behaviour and Noise Trade-offs (10 points)

An agent moves in a 1-D world with constant velocity. A scalar Kalman filter is implemented with equal process and measurement noise standard deviations $\sigma_v = \sigma_w = 5$. (a) Explain why, at the start of filtering, the estimate tracks the noisy measurements closely, whereas after several steps the estimate smooths over them. (b) What happens if the process-noise variance is set too low? Too high? (c) If the measurement matrix $C_t = 0$, is a meaningful state estimate still possible? Why or why not?

Solution

(a) The Kalman gain
$$K_t = \frac{P_{t|t-1}}{P_{t|t-1} + \sigma_w^2}$$
is close to 1 when the predicted covariance $P_{t|t-1}$ is large (initial uncertainty), so the update weights the measurement heavily. As iterations proceed, $P_{t|t}$ decreases and the gain converges to its (nonzero) steady-state value, so the filter blends prediction and measurement in a fixed ratio rather than tracking each measurement directly; visually this looks like smoothing. (b) If the process variance is set artificially low, the filter becomes over-confident in its model and fails to react to real changes in the system. Set it too high and the filter becomes over-reactive, essentially mirroring the noisy measurements. (c) No. If $C_t = 0$, the measurement carries no information about the state ($z_t$ is independent of $x_t$), so the correction step is a no-op. The filter degenerates into pure prediction and its covariance grows unboundedly.
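The gain behaviour in (a) can be seen by iterating just the covariance/gain recursion. A sketch under an assumed scalar random-walk state model (the exam's constant-velocity model would use a 2-D state, but the gain dynamics are the same in spirit):

```python
# Scalar Kalman covariance/gain recursion with sigma_v = sigma_w = 5.
q = 5.0 ** 2   # process noise variance sigma_v^2
r = 5.0 ** 2   # measurement noise variance sigma_w^2
p = 1e6        # large initial covariance: very uncertain at the start
gains = []
for _ in range(50):
    p_pred = p + q                 # predict: uncertainty grows by q
    k = p_pred / (p_pred + r)      # Kalman gain
    p = (1 - k) * p_pred           # correct: uncertainty shrinks
    gains.append(k)
print(gains[0], gains[-1])
```

The first gain is essentially 1 (track the measurement), and the gain settles to a fixed intermediate value, which is exactly the "fixed blending ratio" described above.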

Question 5 — Bayes Filter for Occupancy Mapping (15 points)

A robotic vacuum's cell-level position $s_{1:t}$ is known from ceiling cameras. The robot measures distances $z_{1:t}$ via an infinite-range lidar. You want to estimate the binary occupancy map $\mathbf m = [m_1,\dots,m_K]$ for a $100\times 100$ grid. (a) Write down the posterior you need to estimate and apply Bayes' rule to it. (b) Explain why the full joint is intractable for $K = 10{,}000$ cells, and give the simplified factored form under the usual cell-independence assumption. (c) Sketch the Bayes-filter update in pseudocode.

Solution

(a) The target posterior is
$$p(\mathbf m \mid z_{1:t}, s_{1:t}) = \frac{p(z_{1:t}\mid \mathbf m, s_{1:t})\, p(\mathbf m \mid s_{1:t})}{p(z_{1:t}\mid s_{1:t})}.$$
(b) With $K = 10^4$ binary cells, the joint has $2^{10{,}000}$ possible configurations, which is computationally infeasible. Assuming cells are independent given the measurements,
$$p(\mathbf m \mid z_{1:t}, s_{1:t}) = \prod_{i=1}^{K} p(m_i \mid z_{1:t}, s_{1:t}),$$
which reduces the problem to $K$ independent binary Bayes filters (one per cell). (c) Each cell is updated independently in log-odds form to avoid numerical underflow and turn the Bayes update into a simple addition. Define
$$\ell_{t,i} = \log\frac{p(m_i=1\mid z_{1:t},s_{1:t})}{p(m_i=0\mid z_{1:t},s_{1:t})}$$
and $\ell_0$ as the prior log-odds. The recursive update is
$$\ell_{t,i} = \ell_{t-1,i} + \log\frac{p(m_i=1\mid z_t, s_t)}{p(m_i=0\mid z_t, s_t)} - \ell_0,$$
where the middle term is the inverse-sensor-model log-odds for cell $i$. Pseudocode:
def occupancy_grid_update(log_odds, s_t, z_t, l0):
    # only cells the current scan says something about are touched
    for i in cells_in_sensor_cone(s_t, z_t):
        # Bayes update in log-odds: add the inverse-sensor-model
        # log-odds for this cell and subtract the prior l0
        log_odds[i] += inverse_sensor_log_odds(i, s_t, z_t) - l0
    return log_odds
The probability of occupancy is recovered as $p(m_i=1) = 1/(1 + e^{-\ell_{t,i}})$.
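As a runnable complement to the pseudocode, here is a single-cell log-odds filter; the inverse-sensor log-odds values 0.85 (cell looks occupied) and −0.4 (cell looks free) are illustrative choices, not part of the exam:

```python
import math

def to_prob(l):
    # recover p(m_i = 1) from log-odds l
    return 1.0 / (1.0 + math.exp(-l))

l0 = 0.0   # prior log-odds: p(occupied) = 0.5
l = l0
for meas_log_odds in (0.85, 0.85, -0.4):   # two "hit" scans, one "miss"
    l += meas_log_odds - l0                # the Bayes update is an addition
print(to_prob(l))
```

After two hits and one miss the cell ends up well above 0.5, i.e. probably occupied, and the whole update never left log-odds space.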

Question 6 — Multi-Head Self-Attention and the Role of $W^O$ (15 points)

Consider a transformer layer with three attention heads, each producing $Z^{(h)} \in \mathbb{R}^{T\times d_h}$. (a) Why are multiple heads used instead of a single larger head? (b) Write the concatenation and output-projection equations and explain the role of the projection matrix $W^O$. (c) What happens if $W^O$ is fixed to the identity matrix? (d) Why is the position-wise feed-forward network (FFN) necessary after self-attention?

Solution

(a) Different heads can specialise in complementary relational structure, e.g. one head on syntactic (subject–verb) dependencies, another on semantics, a third on long-range coreference. These roles emerge during training and give the layer richer relational capacity than a single head of equal total width. (b) Concatenation and projection:
$$Z = \mathrm{Concat}(Z^{(1)}, Z^{(2)}, Z^{(3)}) \in \mathbb{R}^{T\times d}, \qquad Y = Z W^O, \quad W^O \in \mathbb{R}^{d\times d}.$$
$W^O$ linearly mixes features across heads and projects back into the residual-stream dimension; without it, the heads' outputs cannot interact within the attention sub-layer itself. (c) If $W^O = I$ (which requires $d_{\text{model}} = \sum_h d_h$, as is the case here), the heads' outputs are simply concatenated with no learned cross-head mixing inside the attention sub-layer: each head's contribution stays in its own slice of the residual stream. Cross-head interaction can still happen later via the FFN and subsequent layers, so the loss is partial rather than total. (d) Self-attention is a (weighted) linear combination of value vectors: without additional nonlinearity, the layer is effectively linear in token content. The FFN applies a non-linear transformation (e.g. GELU) token-wise, giving the model capacity to form higher-level abstractions. Intuitively: attention decides where to look; the FFN decides what to compute with what it found.
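The difference between (b) and (c) can be made concrete with a toy example (2 tokens, 3 heads of width 1, so $d = 3$; all matrices are illustrative): with $W^O = I$ each head's output passes through unmixed, while a general $W^O$ makes every output coordinate a blend of all heads.

```python
# Toy concat-and-project: rows of Z are tokens, columns are the 3 heads' outputs.
Z = [[1.0, 2.0, 3.0],
     [4.0, 5.0, 6.0]]

def matmul(A, B):
    # plain matrix product, B traversed by columns
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

I3 = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
W_O = [[0.5, 0.5, 0.0],        # a "learned" projection that mixes heads
       [0.5, 0.0, 0.5],
       [0.0, 0.5, 0.5]]

Y_id = matmul(Z, I3)    # identity: heads stay in their own slices
Y = matmul(Z, W_O)      # each output coordinate now blends two heads
print(Y)
```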

Question 7 — Vision Transformer: Patching, Attention and Classifier Head (15 points)

You wish to classify a $224\times 224\times 3$ image using a ViT encoder with patch size $P = 28$ and a single attention head. (a) How many patches are produced and what is their dimensionality? How are they embedded and position-encoded? (b) Write down the query / key / value computation and the attention equation (assume $H=1$). (c) Describe the classifier head on top of the encoder and dimension the final tensor for a $K$-class problem.

Solution

(a) The image is split into $N = HW/P^2 = 224\cdot 224 / 28^2 = 64$ flattened patches of dimension $P^2 \cdot C = 28^2 \cdot 3 = 2352$. Each patch is mapped to a $D$-dimensional embedding by a trainable linear projection. A learnable positional embedding encoding the 2-D location $(h,w)$ of each patch is added to the patch embeddings. (b) For $z \in \mathbb{R}^{N\times D}$, with joint projection $U_{qkv} \in \mathbb{R}^{D\times 3D}$:
$$[q, k, v] = z U_{qkv}, \qquad A = \mathrm{softmax}\!\left(\frac{q k^\top}{\sqrt{D}}\right) \in \mathbb{R}^{N\times N}, \qquad \mathrm{SA}(z) = A v.$$
(c) A [CLS] token (or global-average-pool of patch tokens) feeds an MLP classification head ending in a softmax over $K$ classes; the final posterior is a vector in $\mathbb{R}^{K}$ (or shape $B\times K$ for a batch of size $B$). The full forward pass stacks: patch-embed → +positional → LayerNorm → self-attention → residual → LayerNorm → FFN → residual → pool → Linear($D\rightarrow K$) → softmax.
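The patch arithmetic in (a) and the shape flow in (b) can be sanity-checked in a few lines (the embedding width $D = 128$ is our illustrative choice; the exam does not fix it):

```python
# Patch count and per-patch dimensionality for the 224x224x3 image, P = 28.
H = W = 224
C = 3
P = 28
N = (H // P) * (W // P)        # an 8 x 8 grid of patches
patch_dim = P * P * C          # flattened patch length before embedding
D = 128                        # assumed embedding width

# Shape flow through the single-head encoder (matching the solution):
#   z: (N, D) --U_qkv--> q, k, v: (N, D) each
#   A = softmax(q k^T / sqrt(D)): (N, N)
#   SA(z) = A v: (N, D)
print(N, patch_dim)  # 64 2352
```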

Question 8 — Skip-gram Word2Vec Objective (10 points)

Define the skip-gram loss for a center word $w_t$ with a symmetric context window of size $C$, and explain why the same output projection matrix is reused for every context position.

Solution

Let the context offsets be $\mathcal C = \{-C,\dots,-1,1,\dots,C\}$. The skip-gram assumption is that context words are conditionally independent given the center word. The loss at position $t$ is the sum of cross-entropies over all $2C$ neighbours:
$$\mathcal L_t = -\sum_{j\in\mathcal C} \log p_{\text{model}}(w_{t+j} \mid w_t, \theta).$$
The output (projection) matrix is shared across all context positions because the skip-gram model assumes that the distribution of context words depends only on the center word and not on its position relative to the center. Using different output matrices per position would encode positional dependence, contradicting the conditional-independence assumption that justifies the factorisation above.
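A toy numerical sketch of this loss (the tiny vocabulary, embedding width, and all weights are invented for illustration): note that a single shared output matrix means one softmax distribution $p(\cdot \mid w_t)$ scores every window position.

```python
import math

V, d = 4, 2                                                  # toy vocab / embedding
W_in  = [[0.5, -0.2], [0.1, 0.3], [-0.4, 0.2], [0.0, 0.1]]   # center embeddings
W_out = [[0.2, 0.1], [-0.1, 0.4], [0.3, -0.3], [0.1, 0.2]]   # SHARED output matrix

def log_softmax_score(center, word):
    # log p(word | center) under a full-softmax skip-gram model
    h = W_in[center]
    scores = [sum(a * b for a, b in zip(h, W_out[w])) for w in range(V)]
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return scores[word] - log_z

# a window of 2C = 4 context words around center word 1; every position
# is scored by the SAME distribution, per the shared-matrix assumption
context = [0, 2, 3, 0]
loss = -sum(log_softmax_score(1, w) for w in context)
print(loss)
```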

Question 9 — CLIP Contrastive Image–Text Learning (10 points)

(a) Sketch the CLIP architecture at the block level (image encoder, text encoder, projection layers, similarity computation). (b) Explain the contrastive loss used to train it and what each component encourages.

Solution

(a) The image encoder is a CNN (e.g., ResNet) or a ViT that maps an image into a feature vector, followed by a linear projection into a shared embedding space. A parallel text encoder (usually a transformer) maps the caption into the same shared space via its own linear projection. Similarity between image and text embeddings is computed via a (scaled) dot product, producing an $N\times N$ similarity matrix for a batch of $N$ image–text pairs. (b) CLIP uses a single symmetric InfoNCE loss over the $N\times N$ similarity matrix $S_{ij} = \langle u_i, v_j\rangle / \tau$, where $u_i$ is the (normalised) image embedding, $v_j$ the (normalised) text embedding, and $\tau$ is a learned temperature. Define
$$L_{i\to t} = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{\exp(S_{ii})}{\sum_{j=1}^{N}\exp(S_{ij})}, \qquad L_{t\to i} = -\frac{1}{N}\sum_{j=1}^{N}\log\frac{\exp(S_{jj})}{\sum_{i=1}^{N}\exp(S_{ij})},$$
and the total loss $L = \tfrac{1}{2}(L_{i\to t} + L_{t\to i})$. Each row's softmax cross-entropy pulls the matched (diagonal) image–text pair together while pushing the $N-1$ in-batch negatives apart through the denominator. The symmetric form ensures the embedding is learned consistently in both directions (image→text retrieval and text→image retrieval).
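The symmetric loss can be computed directly from a similarity matrix. A sketch on a hand-made $2\times 2$ matrix (the values are illustrative; the diagonal holds the matched pairs):

```python
import math

S = [[5.0, 1.0],    # S[i][j] = <u_i, v_j> / tau, already temperature-scaled
     [0.5, 4.0]]
N = len(S)

def row_ce(row, target):
    # softmax cross-entropy of one row against its target index
    m = max(row)
    log_z = m + math.log(sum(math.exp(s - m) for s in row))
    return -(row[target] - log_z)

L_i2t = sum(row_ce(S[i], i) for i in range(N)) / N               # image -> text
cols = [[S[i][j] for i in range(N)] for j in range(N)]
L_t2i = sum(row_ce(cols[j], j) for j in range(N)) / N            # text -> image
L = 0.5 * (L_i2t + L_t2i)
print(L)
```

Because the diagonal dominates each row and column here, the loss is already small; shrinking the off-diagonal similarities further would drive it towards zero.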

Question 10 — Siamese Networks and Contrastive Loss (10 points)

You are given a Siamese network with two identical CNN branches that share weights, used to decide whether two input images belong to the same class (e.g., both person) or different classes (e.g., person vs. animal). The contrastive loss for a pair $(x_1, x_2)$ with label $y \in \{0, 1\}$ (0 = same class, 1 = different class) is
$$\mathcal{L}(x_1, x_2, y) = (1-y)\,\tfrac{1}{2}\,D_W^2 + y\,\tfrac{1}{2}\,\big[\max(0,\, m - D_W)\big]^2,$$
where $D_W = \lVert f_W(x_1) - f_W(x_2) \rVert_2$ is the Euclidean distance between the two CNN embeddings and $m > 0$ is a margin. (a) Sketch (or describe) the network architecture and explain why both branches must share weights. (b) Explain the role of each of the two terms in the loss and what the network learns to do for matched vs. unmatched pairs. (c) Why is the margin $m$ necessary in the second term?

Solution

(a) The Siamese network has two identical CNN towers that each map an input image into an embedding vector $f_W(x)$. The towers share weights so that the same input always produces the same embedding; only then does the Euclidean distance $D_W$ between two embeddings become a meaningful, symmetric measure of similarity in a single learned feature space. Two independent CNNs would learn two unrelated spaces, and the distance between them would be meaningless. The goal is not to classify each image but to learn an embedding in which same-class pairs are close and different-class pairs are far apart. (b) For a matched pair ($y=0$), only the first term is active: it minimises the squared Euclidean distance $D_W^2$, pulling the two embeddings together. For an unmatched pair ($y=1$), only the second term is active: it minimises $[\max(0, m - D_W)]^2$, which pushes the embeddings apart until their distance reaches the margin $m$. Together the terms shape an embedding space where intra-class distances shrink and inter-class distances grow. (c) Without a margin, the model could trivially reduce the second term by pushing unmatched pairs infinitely far apart, leading to unbounded gradients and degenerate solutions. The margin $m$ caps the penalty: once $D_W \ge m$ the pair is considered "far enough" and contributes zero loss. This focuses learning on the hard negatives (pairs that are still closer than the margin) and stabilises training.
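The loss itself is a one-liner once the embeddings are given. A runnable sketch operating on pre-computed embedding vectors (the margin $m = 2$ and the example vectors are our illustrative choices):

```python
import math

def contrastive_loss(e1, e2, y, m=2.0):
    # e1, e2: embedding vectors f_W(x1), f_W(x2); y = 0 same class, y = 1 different
    d = math.dist(e1, e2)                  # D_W, Euclidean distance
    if y == 0:                             # matched pair: pull together
        return 0.5 * d ** 2
    return 0.5 * max(0.0, m - d) ** 2      # unmatched: push out to the margin

same      = contrastive_loss([0.0, 0.0], [0.3, 0.4], y=0)  # d = 0.5, small pull cost
diff_near = contrastive_loss([0.0, 0.0], [0.3, 0.4], y=1)  # d = 0.5 < m: penalised
diff_far  = contrastive_loss([0.0, 0.0], [3.0, 4.0], y=1)  # d = 5 >= m: zero loss
print(same, diff_near, diff_far)
```

The third case shows the margin's hinge in action: an unmatched pair already farther apart than $m$ contributes nothing, which is exactly the hard-negative focus described in (c).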