

Recent computational models of the acquisition of spoken language via grounding in perception exploit associations between the spoken and visual modalities and learn to represent speech and visual data in a joint vector space. A major unresolved issue from the point of view of ecological validity is the training data, typically consisting of images or videos paired with spoken descriptions of what is depicted. Such a setup guarantees an unrealistically strong correlation between speech and the visual data. In the real world the coupling between the linguistic and the visual modality is loose, and often confounded by correlations with non-semantic aspects of the speech signal. Here we address this shortcoming by using a dataset based on the children’s cartoon Peppa Pig. We train a simple bi-modal architecture on the portion of the data consisting of dialog between characters, and evaluate on segments containing descriptive narrations. Despite the weak and confounded signal in this training data, our model succeeds at learning aspects of the visual semantics of spoken language.

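As an illustration of this kind of bi-modal setup, the sketch below shows one plausible way to embed speech and video clips in a joint vector space and train the model contrastively. The specific components, a GRU speech encoder over acoustic frames, a linear projection of precomputed video features, and a margin-based triplet loss implemented in PyTorch, are assumptions made for the sake of the example, not the architecture used in this work.

# A minimal sketch (assumptions noted above): speech and video encoders projecting
# into a shared embedding space, trained with a margin-based contrastive loss.
import torch
import torch.nn as nn
import torch.nn.functional as F

class BiModalEncoder(nn.Module):
    def __init__(self, audio_dim=39, video_dim=512, joint_dim=256):
        super().__init__()
        # Speech branch: frame-level acoustic features -> recurrent encoder.
        self.speech_rnn = nn.GRU(audio_dim, joint_dim, batch_first=True)
        # Video branch: a precomputed clip-level feature vector -> linear projection.
        self.video_proj = nn.Linear(video_dim, joint_dim)

    def encode_speech(self, frames):            # frames: (batch, time, audio_dim)
        _, hidden = self.speech_rnn(frames)
        return F.normalize(hidden[-1], dim=-1)  # unit-length speech embeddings

    def encode_video(self, clip_features):      # clip_features: (batch, video_dim)
        return F.normalize(self.video_proj(clip_features), dim=-1)

def contrastive_loss(speech_emb, video_emb, margin=0.2):
    # Cosine similarities between all speech/video pairs in the batch;
    # matching pairs lie on the diagonal.
    sims = speech_emb @ video_emb.t()
    pos = sims.diag()
    # Matched pairs should score at least `margin` above mismatched ones,
    # anchoring on speech (rows) and on video (columns).
    cost_speech = F.relu(margin + sims - pos.unsqueeze(1))
    cost_video = F.relu(margin + sims - pos.unsqueeze(0))
    off_diag = ~torch.eye(sims.size(0), dtype=torch.bool, device=sims.device)
    return cost_speech[off_diag].mean() + cost_video[off_diag].mean()

In such a setup, each mini-batch pairs speech fragments with the video clips they co-occur with; the loss pulls co-occurring pairs together in the joint space and pushes apart mismatched combinations sampled from the same batch.
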
Attempts to model or simulate the acquisition of spoken language via grounding in the visual modality date to the beginning of this century (Roy and Pentland, 2002), but have gained momentum recently with the revival of neural networks (e.g., Synnaeve et al., 2014; Harwath and Glass, 2015; Harwath et al., 2016; Chrupała et al., 2017; Alishahi et al., 2017; Harwath et al., 2018; Merkx et al., 2019; Havard et al., 2019a; Rouditchenko et al., 2021; Khorrami and Räsänen, 2021; Peng and Harwath, 2022).

Current approaches work well enough from an applied point of view, but most are not generalizable to the real-life situations that humans or adaptive artificial agents experience. Commonly used training data consist of images or videos paired with spoken descriptions of the scene depicted. However, the type of input that a language learner receives from the environment is much more challenging. Firstly, speech is only loosely coupled with the visual modality in naturalistic settings (Matusevych et al., 2013; Beekhuizen et al., 2013): speakers often mention concepts that are not present in the immediate perceptual context, or talk about events that are remote in space and/or time (for example, past experiences or future plans).

There are many studies on young children learning language by watching videos; see Vanderplank (2010) for a survey. The main takeaway of these studies is that language learning is much more effective in a social, conversational setting than by passively watching videos (Kuhl et al., 2003; Anderson and Pempek, 2005; Robb et al., 2009), but learning does happen in such contexts.
