Program and Papers | Vision-based Assistants in the Real World

Location and Schedule

Date: June 3rd, 2026.
Workshop Location: 102/104
Poster Location: Exhibit Hall A, 229 – 236

Time Slot	Event
08:30-08:45	Welcome and Introduction
08:45-09:15	Keynote: Prof. Dr. Michael S. Ryoo
09:15-09:45	Keynote: Prof. Katerina Fragkiadaki
09:45-10:15	Keynote: Prof. Wenhu Chen
10:15-10:30	Paper Talk: Molmo2 by Zixian Ma
10:30-11:00	Keynote: Prof. Ziwei Liu
11:00-11:30	Keynote: Prof. Yao Qin
11:30-12:00	Keynote: Prof. Vicente Ordóñez-Román
12:00-12:20	Challenge Results
12:20-13:00	Posters and Coffee

All times are in the local time zone, Mountain Daylight Time (MDT).

Accepted Papers: Challenge Extended Abstracts

Timing-Content Separation for Human-Coach-Like Exercise Feedback Generation
Koki Kawamura, Shuhei Kurita, Taiki Miyanishi, Nakamasa Inoue
pdf

Technical Report of Team MR-CAS
Ruochen Cui, Yuhai Li, Shilong Bao, Boyu Han, Qianqian Xu, Qingming Huang
pdf

Accepted Papers: Main Conference Papers

Streaming Video Instruction Tuning
Jiaer Xia, Peixian Chen, Mengdan Zhang, Xing Sun, Kaiyang Zhou
pdf

Molmo2: Open Weights and Data for Vision-Language Models with Video Understanding and Grounding
Christopher Clark, Jieyu Zhang, Zixian Ma, Jae Sung Park, Mohammadreza Salehi, Rohun Tripathi, Sangho Lee, Zhongzheng Ren, Chris Dongjoo Kim, Yinuo Yang, Vincent Shao, Yue Yang, Weikai Huang, Ziqi Gao, Taira Anderson, Jianrui Zhang, Jitesh Jain, George Stoica, Winson Han, Ali Farhadi, Ranjay Krishna
pdf

Mistake Attribution: Fine-Grained Mistake Understanding in Egocentric Videos
Yayuan Li, Aadit Jain, Filippos Bellos, Jason Corso
pdf

Vision-Speech Models: Teaching Speech Models to Converse about Images
Amélie Royer, Moritz Böhle, Gabriel de Marmiesse, Laurent Mazaré, Neil Zeghidour, Alexandre Défossez, Patrick Pérez
pdf

From 3D Pose to Prose: Biomechanics-Grounded Vision-Language Coaching
Yuyang Ji, Yixuan Shen, Feng Liu
pdf

Interactive Episodic Memory with User Feedback
Nikesh Subedi, Loris Bazzani, Ziad Al-Halah
pdf

Accepted Papers: Extended Abstracts

A Simple Baseline for Streaming Video Understanding
Yujiao Shen, Shulin Tian, Jingkang Yang, Ziwei Liu
pdf

Drive-to-Music: Context-Aware Generative Audio for In-Vehicle Experiences
Cosmin Dragoiu, Nooshin Nabizadeh
pdf

Accepted Papers: Archival

Binary Verification for Zero-Shot Vision
Rongbin Hu, Jeffrey Liu
pdf

StreamMind: Adaptive Temporal Memory for Interactive Question Answering on Live Video Streams
Suresh Kumar Palus, Partha Sarathi Samal, Sai Kiran Padmam, Bhavan Kumar B.R
pdf