Program and Papers
- Date: June 11, 2025
- Location: Room 211 (Level 2), Music City Center, Nashville
All times are given in the local time zone, Central Daylight Time (CDT).
Tentative Workshop Schedule
| Time Slot | Event |
|---|---|
| 13:30-13:45 | Welcome and Introduction |
| 13:45-14:15 | Keynote: Mike Shou |
| 14:15-14:45 | Keynote: Kate Saenko |
| 14:45-15:30 | Demos, Posters and Coffee |
| 15:30-16:00 | Keynote: Ani Kembhavi |
| 16:00-16:30 | Keynote: Marc Pollefeys |
| 16:30-17:00 | Challenge Results |
| 17:00-17:50 | Panel Discussion |
| 17:50-18:00 | Closing Remarks |
Accepted Demos
Vision-Language Guided Object Localization in Mixed Reality
Han Xi, Ard Kastrati, Dushan Vasilevski, Roger Wattenhofer
LiveCC: Learning Video LLM with Streaming Speech Transcription at Scale
Joya Chen, Ziyun Zeng, Yiqi Lin, Wei Li, Zejun Ma, Mike Zheng Shou
Accepted Papers: Extended Abstracts
HandsOnVLM: Vision-Language Models for Hand-Object Interaction Prediction
Chen Bao, Jiarui Xu, Xiaolong Wang, Abhinav Gupta, Homanga Bharadhwaj
pdf | code | video
Learning to Perceive and Act: Active Event Understanding via Predictive Free Energy Minimization
Zhou Chen, Sanjoy Kundun, Harsimran Baweja, Sathyanarayanan Aakur
pdf
InteractFormer: Modeling Agent Interactions for Multi-Agent Action Anticipation
Yiqi Jin, Simon Stepputtis, Katia Sycara, Yaqi Xie
pdf
Vision-Language Guided Object Localization in Mixed Reality
Han Xi, Ard Kastrati, Dushan Vasilevski, Roger Wattenhofer
pdf | video
Vid2Coach: Transforming How-To Videos into Task Assistants
Mina Huh, Zihui Xue, Ujjaini Das, Kumar Ashutosh, Kristen Grauman, Amy Pavel
pdf | video
Plan-Action-Reflection: A Three-Role Agentic Framework For Computer Use Agent Task
Xin Su, Man Luo, David Cobbley, Shachar Rosenman, Vasudev Lal, Phillip Howard
pdf
Accepted Papers: CVPR Full Papers
Gazing Into Missteps: Leveraging Eye-Gaze for Unsupervised Mistake Detection in Egocentric Videos of Skilled Human Activities
Michele Mazzamuto
pdf
DIV-FF: Dynamic Image-Video Feature Fields For Environment Understanding in Egocentric Videos
Lorenzo Mur-Labadia, Ruben Martinez-Cantin, Josechu Guerrero
pdf | code
LongVALE: Vision-Audio-Language-Event Benchmark Towards Time-Aware Omni-Modal Perception of Long Videos
TianTian Geng, Jinrui Zhang, Qingni Wang, Teng Wang, Jinming Duan, Feng Zheng
pdf
EgoTextVQA: Towards Egocentric Scene-Text Aware Video Question Answering
Sheng Zhou, Junbin Xiao, Qingyun Li, Yicong Li, Xun Yang, Dan Guo, Meng Wang, Tat-Seng Chua, Angela Yao
pdf
MicroVQA: A Multimodal Reasoning Benchmark for Microscopy-Based Scientific Research
James Burgess, Jeffrey Nirschl, Laura Bravo-Sánchez, Alejandro Lozano, Sanket Gupte, Jesus Galaz-Montoya, Yuhui Zhang, Yuchang Su, Disha Bhowmik, Zachary Coman, Sarina Hasan, Alexandra Johannesson, William Leineweber, Malvika Nair, Ridhi Yarlagadda, Connor Zuraski, Wah Chiu, Sarah Cohen, Jan Hansen, Manuel Leonetti, Chad Liu, Emma Lundberg, Serena Yeung-Levy
pdf
Not Only Text: Exploring Compositionality of Visual Representations in Vision-Language Models
Davide Berasi, Matteo Farina, Massimiliano Mancini, Elisa Ricci, Nicola Strisciuglio
pdf
LiveCC: Learning Video LLM with Streaming Speech Transcription at Scale
Joya Chen, Ziyun Zeng, Yiqi Lin, Wei Li, Zejun Ma, Mike Zheng Shou
pdf | code | video
Vision-Language Models Do Not Understand Negation
Kumail Alhamoud, Shaden Alshammari, Yonglong Tian, Guohao Li, Philip Torr, Yoon Kim, Marzyeh Ghassemi
pdf | code | video
VisionZip: Longer is Better but Not Necessary in Vision Language Models
Senqiao Yang
pdf | code | video
Which Viewpoint Shows it Best? Language for Weakly Supervising View Selection in Multi-view Instructional Videos
Sagnik Majumder, Tushar Nagarajan, Ziad Al-Halah, Reina Pradhan, Kristen Grauman
pdf