The Human Brain as a Dynamic Mixture of Expert Models in Video Understanding
Christina Sartzetaki · Anne Zonneveld · Pablo Oyarzo · Alessandro Gifford · Radoslaw Cichy · Pascal Mettes · Iris Groen
Abstract
The human brain is the most efficient and versatile system for processing dynamic visual input. By comparing representations from deep video models to brain activity, we can gain insights into mechanistic solutions for effective video processing, which is important both for better understanding the brain and for building better models. Current work on model-brain alignment primarily focuses on fMRI measurements, leaving open questions about fine-grained dynamic processing. Here, we introduce the first large-scale benchmarking of both static and temporally-integrating deep neural networks on brain alignment to dynamic electroencephalography (EEG) recordings of short natural videos. We analyze 100+ models across the axes of temporal integration, classification task, architecture, and pretraining using our proposed Cross-Temporal Representational Similarity Analysis (CT-RSA), which matches the best time-unfolded model features to dynamically evolving brain responses, distilling $10^7$ alignment scores. Our findings reveal novel insights into how continuous visual input is integrated in the brain, beyond the standard temporal processing hierarchy from low- to high-level representations. After initial alignment to hierarchical static object processing, responses in posterior electrodes best align to mid-level temporally-integrative action features, showing high temporal correspondence to feature timings. In contrast, responses in frontal electrodes best align with high-level static action representations and show no temporal correspondence to the video. Additionally, temporally-integrating state space models show superior alignment to intermediate posterior activity, for which self-supervised pretraining is also beneficial. We draw a metaphor to a dynamic mixture of expert models to describe the brain's changing preference for tasks and temporal integration, reflected in its alignment to different model types across time.
We posit that a single best-aligned model would need task-independent training to combine these capacities as well as an architecture that supports dynamic switching.
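The core idea of cross-temporal RSA can be sketched in a few lines: build a representational dissimilarity matrix (RDM) over videos for each EEG time point and for each time-unfolded model feature set, correlate their upper triangles, and keep the best-matching model time point per brain time point. The sketch below is illustrative only, assuming Pearson correlation between RDM vectors and simple array shapes; the paper's exact dissimilarity measure, statistics, and preprocessing are not specified here.

```python
import numpy as np

def rdm(features):
    # features: (n_videos, n_dims) -> (n_videos, n_videos) dissimilarity,
    # defined here as 1 - Pearson correlation between video feature vectors
    return 1.0 - np.corrcoef(features)

def upper_tri(m):
    # vectorize the RDM's upper triangle (excluding the diagonal)
    i, j = np.triu_indices(m.shape[0], k=1)
    return m[i, j]

def ct_rsa(brain_feats, model_feats):
    """Cross-temporal RSA sketch (illustrative, not the paper's implementation).

    brain_feats: (T_brain, n_videos, n_channels) EEG patterns per time point
    model_feats: (T_model, n_videos, n_dims) time-unfolded model features
    Returns an array of shape (T_brain,): for each brain time point, the
    maximum RDM correlation across all model time points.
    """
    brain_vecs = [upper_tri(rdm(b)) for b in brain_feats]
    model_vecs = [upper_tri(rdm(m)) for m in model_feats]
    scores = np.empty((len(brain_vecs), len(model_vecs)))
    for ti, b in enumerate(brain_vecs):
        for tj, m in enumerate(model_vecs):
            scores[ti, tj] = np.corrcoef(b, m)[0, 1]
    return scores.max(axis=1)  # best-aligned model time per brain time
```

Taking the maximum over model time is what allows dynamically evolving brain responses to match model features at whichever processing stage fits best, rather than enforcing a fixed brain-model time mapping.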