A Deep Multiscale Spatiotemporal Network for Assessing Depression from Facial Dynamics
Recently, deep learning models have been successfully employed in video-based affective computing applications, a key one being automatic depression recognition from facial expressions. State-of-the-art approaches to depression recognition typically explore spatial and temporal information separately, using convolutional neural networks (CNNs) to analyze appearance information and then either mapping feature variations or averaging the depression level over video frames. Such approaches are limited in their ability to represent the dynamic information that helps discriminate between depression levels. In contrast, 3D CNN-based models can directly encode spatio-temporal relationships, although they rely on a fixed temporal range and a single receptive field size, which limits their ability to capture facial expression variations over diverse ranges and to exploit diverse facial regions. In this paper, a novel 3D CNN architecture, the Multiscale Spatiotemporal Network (MSN), is introduced to effectively represent facial information related to depressive behaviours. The basic structure of the model consists of parallel convolutional layers with different temporal depths and receptive field sizes, which allows the MSN to explore a wide range of spatio-temporal variations in facial expressions. Experimental results on two benchmark datasets show that our MSN is effective, outperforming state-of-the-art methods in automatic depression recognition.
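The core idea of the abstract, parallel convolutional branches with different temporal extents whose outputs are combined, can be illustrated with a minimal sketch. This is not the authors' MSN architecture: for brevity it uses 1D temporal convolutions over per-frame feature vectors instead of 3D convolutions over video volumes, and all function names, kernel sizes, and channel counts are illustrative assumptions.

```python
import numpy as np

def temporal_conv(x, weights):
    """'Same'-padded temporal convolution over a (T, C_in) feature sequence.
    weights has shape (k, C_in, C_out), where k is the temporal depth."""
    k, c_in, c_out = weights.shape
    pad = k // 2
    xp = np.pad(x, ((pad, k - 1 - pad), (0, 0)))  # pad only the time axis
    T = x.shape[0]
    out = np.empty((T, c_out))
    for t in range(T):
        window = xp[t:t + k]                      # (k, C_in) temporal window
        out[t] = np.einsum('kc,kco->o', window, weights)
    return out

def multiscale_block(x, kernel_sizes=(3, 5, 7), c_out=4, seed=0):
    """Parallel branches with different temporal depths (kernel sizes);
    branch outputs are concatenated along the channel axis, loosely
    mirroring the multiscale structure described in the abstract."""
    rng = np.random.default_rng(seed)
    branches = []
    for k in kernel_sizes:
        w = rng.standard_normal((k, x.shape[1], c_out)) * 0.1  # random (untrained) weights
        branches.append(temporal_conv(x, w))
    return np.concatenate(branches, axis=1)       # (T, c_out * num_branches)

# 16 frames, 8 features per frame (hypothetical dimensions)
x = np.random.default_rng(1).standard_normal((16, 8))
y = multiscale_block(x)
print(y.shape)  # -> (16, 12): three branches of 4 channels each
```

Each branch sees the same input but aggregates over a different temporal span, so short-range and long-range facial dynamics are represented side by side in the concatenated output; the real MSN applies the same principle with 3D kernels of varying spatial and temporal sizes.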