  • Natural Language Transformers
  • Vision Transformers
  • VLA model architectures (RT-1, RT-2, SayCan, etc.)
  • Pretraining and grounding techniques
Key references: (Paul et al., 2018; Walsman et al., 2018)

References

  • Paul, R., Barbu, A., Felshin, S., Katz, B., & Roy, N. (2018). Temporal Grounding Graphs for Language Understanding with Accrued Visual-Linguistic Context.
  • Walsman, A., Bisk, Y., Gabriel, S., Misra, D., Artzi, Y., et al. (2018). Early Fusion for Goal Directed Robotic Vision.