Books
- TIF - Foundations of Computer Vision by Antonio Torralba, Phillip Isola and William T. Freeman. Free online. Covers the latest deep learning applications including diffusion models.
- BISHOP - Deep Learning: Foundations and Concepts by Christopher M. Bishop and Hugh Bishop. Available to view online from the book’s website.
- SZELINSKI - Computer Vision: Algorithms and Applications, 2nd Edition, by Richard Szeliski. Free to download for personal use. Alternative to TIF for some topics.
Planned Schedule
Part I: Detection and Segmentation
| Lecture | Topic | Description |
|---|---|---|
| 1 | Introduction | Computer vision for agents with egomotion. Prerequisites review: Python, linear algebra, probability theory, camera fundamentals. |
| 2 | Statistical Learning | End-to-end prediction, featurization, fully connected neural architectures, maximum likelihood optimization. Reading: BISHOP Chapters 4-5 |
| 3 | Dense Neural Networks | Cross entropy loss, training and regularization of dense layers. Reading: BISHOP Chapter 6 |
| 4 | CNNs | Spatial feature hierarchies, image classification, ResNets for real-time perception. Reading: BISHOP Chapter 10 |
| 5 | Object Detection | YOLO and Faster R-CNN architectures for identifying and locating objects. Reading: SZELINSKI Chapter 6 |
| 6 | Semantic Segmentation | Pixel-level labeling, panoptic segmentation for full scene understanding. Reading: SZELINSKI Chapter 6 |
| 7 | Vision Transformers | Self-attention for global image dependencies, ViT vs CNN trade-offs. Reading: BISHOP Chapter 12, TIF Chapter 26 |
| 8 | Object Tracking | Video stream processing, handling occlusion, motion blur, appearance changes. |
Part II: Vision Language Models (VLMs)
| Lecture | Topic | Description |
|---|---|---|
| 9 | Contrastive Learning | Vision-language pretraining, CLIP for relating images and text. Reading: CLIP paper, TIF Chapter 51 |
| 10 | From Retrieval to Generation | BLIP-2 and LLaVA for image captioning and visual question answering (VQA). |
| 11 | Prompted Vision Models | Meta’s Segment Anything Model (SAM) as a worker that receives multimodal prompts from VLM planners. |
Part III: Generative Vision Models
| Lecture | Topic | Description |
|---|---|---|
| 12 | Neural Radiance Fields | NeRF for creating 3D scenes from 2D images, volume rendering concepts. |
| 13 | Diffusion Models | Physics-inspired learning, conditional image generation, DALL-E and Stable Diffusion. |

