- Natural Language Transformers
- Vision Transformers
- VLA model architectures (RT-1, RT-2, SayCan, etc.)
- Pretraining and grounding techniques
References
- Paul, R., Barbu, A., Felshin, S., Katz, B., Roy, N. (2018). Temporal Grounding Graphs for Language Understanding with Accrued Visual-Linguistic Context.
- Walsman, A., Bisk, Y., Gabriel, S., Misra, D., Artzi, Y., et al. (2018). Early Fusion for Goal Directed Robotic Vision.

