A dedicated repository with a full application (not a notebook) is available at aegean-ai/basket-tube. It exposes a UI that shows how this kind of application can be engineered end-to-end. Please note that the app is still a work in progress and lags behind some of the choices made in the notebook referenced below.

You can use the following Colab notebook to complete this assignment: Basketball State Extraction
Task 1 - Analyzing Player Performance from Commentary
Build a chat-based assistant that answers questions about player performance using the spoken commentary from the game video. The system should respond to user queries in a chat interface that mimics Google’s responses to queries such as:

Q: “Analyze the player that scored the most in this game”

A: While the video does not provide a final leading scorer for the entire game, Stephen Curry is noted as having the most points among the individual player scores mentioned in the first half, reaching 12 points by (37:12).

Here’s an analysis of his performance in the first half:
- Early Struggles and Defensive Coverage: The Lakers made it difficult for Curry early in the game (16:22). He took only three shots initially (15:17) as the Lakers forced him to get rid of the ball, often with two defenders when he came off a pick and roll (15:10-15:16).
- First Field Goal: Curry scored his first field goal later in the first half (11:06).
- Offensive Impact: By the end of the first half, he came “alive from three-point range” (42:22-42:25), hitting back-to-back three-pointers (37:09-37:12, 42:01-42:08), helping the Warriors extend their lead to 13 points by halftime (42:10).
- Overall Role: He was described as the oldest scoring champion in the NBA in 23 years, showcasing his elite scoring ability (19:09-19:12). LeBron James even called him the MVP, stating he had the best season all year (19:23-19:29).
- Underrated Aspects: Commentators highlighted his rebounding and finishing around the rim as underrated aspects of his game (18:54-19:06).

Your assistant may use either:
- a local LLM running on your machine, or
- a hosted LLM/API

Your system should:
- extract or load the video commentary as timestamped text
- organize the commentary so relevant parts can be retrieved for a user query
- answer natural-language questions such as “Who scored the most?”, “How did Curry perform in the first half?”, or “When did LeBron start impacting the game?”
- support each answer with timestamps from the commentary
- clearly separate directly stated facts from higher-level summaries or interpretations

Each answer should include:
- a direct response to the question
- a short summary of the player’s performance
- 2 to 5 commentary-based pieces of evidence with timestamps
- a brief note that the answer is based only on commentary and may not fully reflect what actually happened in the game
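
One way to keep answers in this shape is to have the assistant fill a small record before rendering text. The sketch below is only an illustration: the names `Evidence` and `ChatAnswer`, the `stated` flag used to separate directly stated facts from interpretations, and the caveat wording are all assumptions, not part of the assignment.

```python
from dataclasses import dataclass, field


@dataclass
class Evidence:
    timestamp: str        # copied from the commentary, e.g. "37:09-37:12"
    statement: str        # what the commentary says at that point
    stated: bool = True   # False marks a higher-level interpretation


@dataclass
class ChatAnswer:
    direct_response: str
    summary: str
    evidence: list[Evidence] = field(default_factory=list)
    caveat: str = ("This answer is based only on commentary and may not "
                   "fully reflect what actually happened in the game.")

    def validate(self) -> None:
        # The assignment asks for 2 to 5 pieces of evidence per answer.
        if not 2 <= len(self.evidence) <= 5:
            raise ValueError("expected 2 to 5 pieces of evidence")

    def render(self) -> str:
        self.validate()
        lines = [self.direct_response, "", f"Summary: {self.summary}", "", "Evidence:"]
        for e in self.evidence:
            tag = "stated" if e.stated else "interpretation"
            lines.append(f"- ({e.timestamp}) [{tag}] {e.statement}")
        lines += ["", f"Note: {self.caveat}"]
        return "\n".join(lines)
```

Validating before rendering makes it obvious when the model returned too little or too much evidence, which is easier to catch here than in free-form text.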

A suggested workflow:
- Obtain the commentary transcript from the video.
- Split it into timestamped chunks.
- Retrieve the chunks most relevant to the user’s question.
- Pass those chunks to the LLM.
- Prompt the model to answer only from the retrieved commentary and cite timestamps.
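
The workflow above can be sketched with the standard library alone. This is a minimal illustration under stated assumptions: the transcript is assumed to arrive as `(start_seconds, end_seconds, text)` tuples, retrieval uses a simple TF-IDF-style word-overlap score rather than embeddings, and the function names and prompt wording are hypothetical. The resulting prompt string would then be sent to whichever local or hosted LLM you choose.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass


@dataclass
class Chunk:
    start: float  # seconds from the start of the video
    end: float
    text: str


def fmt(seconds: float) -> str:
    """Format seconds as MM:SS for citations."""
    m, s = divmod(int(seconds), 60)
    return f"{m:02d}:{s:02d}"


def chunk_segments(segments, max_seconds=30.0):
    """Merge consecutive (start, end, text) segments into chunks spanning at most max_seconds."""
    chunks = []
    for start, end, text in segments:
        if chunks and end - chunks[-1].start <= max_seconds:
            last = chunks[-1]
            chunks[-1] = Chunk(last.start, end, f"{last.text} {text}")
        else:
            chunks.append(Chunk(start, end, text))
    return chunks


def tokens(text):
    return re.findall(r"[a-z0-9']+", text.lower())


def retrieve(chunks, query, k=3):
    """Return up to k chunks with the highest TF-IDF overlap with the query, in time order."""
    docs = [Counter(tokens(c.text)) for c in chunks]
    df = Counter(t for d in docs for t in d)  # number of chunks containing each token
    query_terms = set(tokens(query))
    scored = []
    for chunk, d in zip(chunks, docs):
        length = sum(d.values()) or 1
        score = sum(d[t] / length * math.log(1 + len(docs) / df[t])
                    for t in query_terms if df[t])
        if score > 0:
            scored.append((score, chunk))
    top = sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]
    return sorted((c for _, c in top), key=lambda c: c.start)


def build_prompt(chunks, question):
    """Assemble a prompt that restricts the model to the retrieved commentary."""
    context = "\n".join(f"[{fmt(c.start)}-{fmt(c.end)}] {c.text}" for c in chunks)
    return (
        "Answer using ONLY the commentary excerpts below. "
        "Cite the timestamp of every claim, and say so if the excerpts "
        "do not contain the answer.\n\n"
        f"Commentary:\n{context}\n\nQuestion: {question}\nAnswer:"
    )
```

Word overlap is a weak retriever (it will not match "Steph" to "Curry", for example), so swapping `retrieve` for an embedding-based search is a natural next step once the rest of the pipeline works.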
Task 2 - Verifying Player Actions from Video
In Task 1, you analyzed player performance using the spoken commentary. In this task, you must verify player actions by analyzing the video footage itself. The goal is to determine what a player actually does on the court and support your conclusions with visual evidence.

Your system should examine the game video, identify short plays or action segments, and recognize the actions performed by a player of interest. Rather than producing a generic summary, the system should return structured evidence about what happened, when it happened, and why the prediction is reasonable based on the visible content of the video.

You may use any suitable approach, including vision-language models, video-language models, object detection, tracking, pose estimation, or a combination of methods. You do not need to build a perfect end-to-end sports analytics system. A strong solution is one that focuses on a clear method, produces interpretable outputs, and explains its limitations.

Your output should be grounded in the visual content of the video. For each detected action, you should aim to provide:
- the player name
- the action category
- the start timestamp
- optionally the end timestamp
- a short explanation of the evidence visible in the clip
- one or more representative frames or visual summaries
- if possible, a bird’s-eye or court-level representation of the play

Examples of action categories include:
- shooting
- passing
- dribbling
- rebounding
- defending
- assisting
- screening
- cutting
- stealing
- blocking
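
The output fields and action categories listed above map naturally onto a small schema. The sketch below is one possible encoding, not a required format: the class names `Action` and `ActionEvent` and the field names are assumptions chosen for illustration.

```python
import json
from dataclasses import asdict, dataclass, field
from enum import Enum
from typing import Optional


class Action(str, Enum):
    SHOOTING = "shooting"
    PASSING = "passing"
    DRIBBLING = "dribbling"
    REBOUNDING = "rebounding"
    DEFENDING = "defending"
    ASSISTING = "assisting"
    SCREENING = "screening"
    CUTTING = "cutting"
    STEALING = "stealing"
    BLOCKING = "blocking"


@dataclass
class ActionEvent:
    player: str
    action: Action
    start: str                       # timestamp such as "12:34"
    explanation: str                 # evidence visible in the clip
    end: Optional[str] = None        # optional end timestamp
    frames: list[str] = field(default_factory=list)  # paths to representative frames
    court_view: Optional[str] = None  # path to a bird's-eye rendering, if available

    def to_json(self) -> str:
        record = asdict(self)
        record["action"] = self.action.value  # store the plain category string
        return json.dumps(record)
```

Restricting `action` to an enum means a model output outside the agreed categories fails loudly (`Action("dunking")` raises `ValueError`) instead of silently entering the results.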

A suggested workflow:
- Split the game video into smaller clips or candidate plays.
- Select a player of interest.
- Track that player, and if possible the ball, across the clip.
- Extract visual evidence from frames, motion, or trajectories.
- Predict the action performed.
- Return the result with timestamps and supporting evidence.
Task 2.1 - Chunking the Video into a Series of Plays
Before recognizing player actions, you should divide the game video into smaller units that are easier to analyze. A full basketball game is too long and visually complex to process effectively in one pass, so your first step is to segment the video into short plays, possessions, or candidate action windows.

In basketball, a play is a coordinated sequence of actions intended to create a scoring opportunity or prevent one. Plays may be offensive or defensive, and they often involve multiple players moving at the same time. For this assignment, you do not need to produce a perfect tactical classification of every play. Instead, you should aim to identify meaningful video segments that can later be analyzed for player actions. Examples of useful play boundaries include:
- a possession beginning after a rebound or inbound
- a clear offensive setup followed by a shot attempt
- a transition play
- a defensive sequence ending in a steal, rebound, or whistle
- a short interval around a highlighted player action

Possible segmentation strategies include:
- fixed-length windows
- scene or shot change detection
- scoreboard or clock changes
- commentary cues
- model-based identification of action peaks or possession changes
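
Two of the simpler strategies can be sketched in a few lines: overlapping fixed-length windows, and short intervals placed around cue times (for example, timestamps where the commentary mentions the player). The function names and the default window sizes below are assumptions picked for illustration; all times are in seconds.

```python
def fixed_windows(duration, window=8.0, stride=4.0):
    """Return overlapping (start, end) windows covering [0, duration]."""
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    windows = []
    start = 0.0
    while start < duration:
        windows.append((start, min(start + window, duration)))
        if start + window >= duration:
            break  # the last window already reaches the end of the video
        start += stride
    return windows


def windows_around(cues, before=4.0, after=2.0):
    """Return merged (start, end) intervals surrounding each cue time."""
    spans = sorted((max(0.0, t - before), t + after) for t in cues)
    merged = []
    for start, end in spans:
        if merged and start <= merged[-1][1]:
            # Overlaps the previous interval: extend it instead of adding a new one.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged
```

Fixed windows guarantee coverage but cut across plays arbitrarily; cue-based windows are sparser and better aligned with events, at the cost of missing anything the cue source does not mention.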
Task 2.2 - Analyzing Specific Player Actions
Once the video has been divided into short segments, focus on a player of interest and identify the actions that player performs. Your system should attempt to recognize specific basketball actions and provide evidence for each prediction. Examples of player actions include:
- Shooting: attempting to score by releasing the ball toward the basket
- Passing: moving the ball to a teammate
- Dribbling: controlling the ball while moving
- Rebounding: gaining possession after a missed shot
- Defending: pressuring or preventing an opponent’s movement or shot
- Assisting: making the pass that directly leads to a score
- Screening: positioning to block or delay a defender
- Cutting: moving quickly into open space to receive the ball
- Stealing: taking possession from an opponent
- Blocking: deflecting or stopping a shot attempt

For each recognized action, report:
- the timestamp
- the player
- the action category
- a short explanation of why the action was assigned
- a representative frame, clip, or bird’s-eye view if available

| Timestamp | Player | Action | Bird’s-eye view |
|---|---|---|---|
| 12:34 | Stephen Curry | 3-Point Shot | |
| 15:20 | LeBron James | Assist | |

Possible approaches include:
- using a VLM on key frames sampled from each play
- using a video-language model to process short clips directly
- detecting and tracking players and the ball
- using pose estimation to capture motion patterns for shooting or defending
- reasoning over trajectories to infer passes, cuts, and drives
- combining commentary-based retrieval with video verification
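
As an illustration of reasoning over trajectories, the sketch below infers candidate passes from tracked positions. It assumes a detector and tracker have already produced, for each frame, a ball position and a mapping from player id to position in court coordinates (metres here, which is an assumption), plus a team label per player. The function names, the 1.0 m possession radius, and the "pass" versus "turnover" rule are all hypothetical simplifications.

```python
import math


def holder(ball, players, radius=1.0):
    """Return the id of the player nearest the ball, or None if nobody is within radius."""
    best_id, best_dist = None, radius
    for pid, (x, y) in players.items():
        dist = math.hypot(x - ball[0], y - ball[1])
        if dist <= best_dist:
            best_id, best_dist = pid, dist
    return best_id


def possession_changes(frames, teams, fps=30.0, radius=1.0):
    """Detect changes of ball holder.

    frames: list of (ball_xy or None, {player_id: (x, y)}) per video frame.
    teams:  {player_id: team_label}.
    Returns dicts with the two players, a coarse label, and start/end times in seconds.
    """
    events = []
    last_holder, last_frame = None, None
    for i, (ball, players) in enumerate(frames):
        current = holder(ball, players, radius) if ball is not None else None
        if current is None:
            continue  # ball in flight or not detected in this frame
        if last_holder is not None and current != last_holder:
            kind = "pass" if teams[current] == teams[last_holder] else "turnover"
            events.append({
                "from": last_holder,
                "to": current,
                "kind": kind,
                "start": last_frame / fps,  # last frame the previous holder had the ball
                "end": i / fps,             # first frame the new holder has it
            })
        last_holder, last_frame = current, i
    return events
```

A rule like this produces candidates rather than final labels: a missed shot rebounded by a teammate also looks like a "pass", so each candidate is best checked against the frames themselves before it is reported.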

To keep the scope manageable, you may focus on:
- one player at a time
- a limited set of action categories
- short clips rather than the full game
- interpretable evidence rather than perfect accuracy

Expect challenges such as:
- small or fast-moving ball trajectories
- player occlusion
- unclear jersey numbers
- camera cuts and zoom changes
- actions that require longer context, such as assists or defensive pressure

