WACV 2021 Tutorial

Overview

Sight and hearing are two of the most important senses for human perception. From a cognitive perspective, visual and auditory information are slightly discrepant, yet multisensory integration unifies them into a single percept. Moreover, when multiple senses are available, human responses tend to be more accurate or more efficient than with a single sense. Inspired by this, our community has begun to explore combining computer vision with audition in computational models, aiming to address essential problems of audio-visual learning and to develop them into interesting and worthwhile tasks. Recent years have seen many developments in learning from both visual and auditory data.

This tutorial aims to cover recent advances in audio-visual learning, including audio-visual self-supervised learning, audio-visual sound separation, audio-visual cross-modal generation, and audio-visual video understanding. For each research sub-topic, we will give a concrete introduction to the problems and tasks involved, the current research progress, and the open problems. We hope the audience, both graduate students and researchers new to this area, can benefit from this tutorial and learn the principal problems and cutting-edge approaches of audio-visual learning.

Agenda

08:30 - 08:35	Welcome
08:35 - 09:20	Audio-Visual Self-supervised Learning	Slides
09:20 - 10:05	Audio-Visual Sound Separation	Slides
10:05 - 10:15	Coffee Break
10:15 - 11:00	Audio-Visual Cross-modal Generation	Slides
11:00 - 11:45	Audio-Visual Video Understanding	Slides
11:45 - 11:55	Q&A
11:55 - 12:00	Closing Remarks

January 10
00:30 - 00:35	Welcome
00:35 - 01:20	Audio-Visual Self-supervised Learning	Slides
01:20 - 02:05	Audio-Visual Sound Separation	Slides
02:05 - 02:15	Coffee Break
02:15 - 03:00	Audio-Visual Cross-modal Generation	Slides
03:00 - 03:45	Audio-Visual Video Understanding	Slides
03:45 - 03:55	Q&A
03:55 - 04:00	Closing Remarks

17:30 - 17:35	Welcome
17:35 - 18:20	Audio-Visual Self-supervised Learning	Slides
18:20 - 19:05	Audio-Visual Sound Separation	Slides
19:05 - 19:15	Coffee Break
19:15 - 20:00	Audio-Visual Cross-modal Generation	Slides
20:00 - 20:45	Audio-Visual Video Understanding	Slides
20:45 - 20:55	Q&A
20:55 - 21:00	Closing Remarks

Audio-Visual Scene Understanding