Abstract
We investigated the contribution of low-level saliency to human eye movements in complex dynamic scenes. Eye movements were recorded while naive observers viewed a heterogeneous collection of 50 video clips (46,489 frames; 4-6 subjects per clip), yielding 11,916 saccades of amplitude ≥2°. A model of bottom-up visual attention computed the instantaneous saliency at each saccade's future endpoint location, at the instant the saccade started. Median model-predicted saliency was 45% of the maximum saliency, significantly greater than expected by chance by a factor of 2.03. Motion and temporal change were stronger predictors of human saccades than colour, intensity, or orientation features, with the best predictor being the sum of all features. There was no significant correlation between model-predicted saliency and duration of fixation. A majority of saccades were directed to a minority of locations reliably marked as salient by the model, suggesting that bottom-up saliency may provide a set of candidate saccade target locations, with the final choice of which location to fixate more strongly determined top-down.
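As an illustration of the kind of measure reported above, the following minimal sketch compares normalized saliency at saccade endpoints with a chance baseline. It is not the analysis code used in the study: the function names, the list-of-lists map representation, and in particular the use of uniformly random locations as the chance baseline are assumptions made for this example.

```python
import random
from statistics import median


def normalized_saliency(saliency_map, x, y):
    """Saliency at (x, y) as a fraction of the map's maximum value."""
    peak = max(max(row) for row in saliency_map)
    return saliency_map[y][x] / peak if peak > 0 else 0.0


def endpoint_vs_chance(maps, endpoints, n_random=100, seed=0):
    """Return (median at saccade endpoints, median at random locations).

    maps      -- one 2-D saliency map per saccade, sampled at saccade onset
    endpoints -- one (x, y) endpoint per saccade, in map coordinates
    The chance baseline is ASSUMED here to be uniform random sampling.
    """
    rng = random.Random(seed)
    human, chance = [], []
    for smap, (x, y) in zip(maps, endpoints):
        human.append(normalized_saliency(smap, x, y))
        height, width = len(smap), len(smap[0])
        for _ in range(n_random):
            rx, ry = rng.randrange(width), rng.randrange(height)
            chance.append(normalized_saliency(smap, rx, ry))
    return median(human), median(chance)
```

Dividing the first returned value by the second (when the second is non-zero) gives a ratio analogous to the factor over chance quoted above; a significance test would require an additional step not shown here.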