Sight and hearing are two of the most important senses for human perception. From a cognitive perspective, visual and auditory information are often slightly discrepant, yet multisensory integration unifies them into a single percept. Moreover, when multiple senses are available, humans usually respond more accurately or efficiently than with a single sense. Inspired by this, our community has begun to explore computational models that marry computer vision with audition, aiming to address essential problems of audio-visual learning and to develop them into interesting and worthwhile tasks. In recent years, we have been delighted to witness many developments in learning from both visual and auditory data.
This tutorial aims to cover recent advances in audio-visual learning, including audio-visual self-supervised learning, audio-visual sound separation, audio-visual cross-modal generation, and audio-visual video understanding. For each research sub-topic, we will introduce the problems and tasks it contains, the current research progress, and the open problems. We hope that the audience, including both graduate students and researchers new to this area, can benefit from this tutorial and learn the principal problems and cutting-edge approaches of audio-visual learning.