In this talk I will address the problem of spatio-temporal reconstruction of people who are both seen and heard. The main challenge of audio-visual reconstruction is to extract person-to-speech associations from cameras and microphones. This is particularly difficult when several participants are engaged in informal multi-party dialogue: they speak simultaneously, move around, and look at each other rather than facing the cameras and microphones. Under these conditions, standard methods that correlate frontal faces with clean speech are likely to fail, and a novel audio-visual fusion paradigm is necessary. I will present a number of audio-visual methods, based on latent-variable graphical models and Bayesian inference, developed in the Perception team at INRIA Grenoble over the past 4-5 years.