A Perceptual Similarity Space for Speech: Accounting for Inter-talker Variation in L1 and L2 Speech Intelligibility Beyond Traditional Acoustic-Phonetic Cues
Most theories of speech perception assume that speech is perceived in terms of a relatively small set of researcher-defined spectro-temporal properties that map onto discrete linguistic elements, that is, in terms of “speech cues.” For example, to perceive “lap” versus “lab,” an English listener must identify the final segment as /p/ versus /b/ using specific temporal (preceding vowel duration) and spectral (first-formant frequency at vowel offset) cues. Longer utterances are then presumed to be perceived compositionally, through the combined processing of the local acoustic-phonetic cues associated with the sequence of linguistic units that makes up the utterance. Traditional cue-based approaches have allowed for detailed modeling of speech acoustics and have identified some of the acoustic sources of variation in speech intelligibility for both human and machine listeners. However, a large portion of the observed inter-talker variation in both first-language (L1) and second-language (L2) speech intelligibility remains unexplained, especially for sentence-length or longer utterances, in which a multitude of acoustic dimensions and contextual cues interact. In this talk, I will present an alternative approach to modeling variation in speech intelligibility that emphasizes holistic comparison of utterances in a multidimensional perceptual similarity space, estimated by a neural network trained via self-supervised learning. This approach encodes differences between utterances based on prior experience with a large corpus of L1 English speech, without assuming pre-specified temporal windows or discrete linguistic units. We test this approach by comparing representations of sentences produced by L1 (n=25) and L2 (n=102) English talkers in this high-dimensional perceptual similarity space. We show that L2 English speech is less tightly clustered than L1 English speech in the perceptual similarity space, reflecting the greater variability in English proficiency, and therefore in phonetic realization, among the L2 talkers. Critically, we find that variation in intelligibility across the group of L2 talkers is better explained by average distance from L1 talkers in the perceptual similarity space than by traditional phonetic measures of cue distinctiveness and articulatory precision (e.g., vowel space size, pitch variability, and speech rate). These results suggest that machine-learning techniques for the representation and analysis of speech hold substantial promise for breakthroughs in our understanding of the many acoustic-phonetic dimensions that underlie speech variation and its communicative consequences.
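As a concrete illustration of the cue-based account described above, the following Python snippet shows how the two local cues for the “lap”/“lab” contrast might be measured with the Praat-parselmouth library. This is a minimal sketch, not the procedure used in the study: the file name, the vowel boundary times, and the 10 ms offset window are all hypothetical, and in practice segment boundaries would come from a forced aligner or a hand-labeled TextGrid.

```python
# Hedged sketch of local cue measurement, using Praat-parselmouth.
# The recording, vowel boundaries, and offset window are hypothetical.
import parselmouth

sound = parselmouth.Sound("lap_or_lab.wav")   # hypothetical recording

# Hypothetical boundaries (seconds) for the vowel in "lap"/"lab".
vowel_onset, vowel_offset = 0.12, 0.31

# Temporal cue: vowels tend to be longer before voiced /b/ than voiceless /p/.
vowel_duration = vowel_offset - vowel_onset

# Spectral cue: first-formant (F1) frequency near the vowel's offset.
formants = sound.to_formant_burg()
f1_at_offset = formants.get_value_at_time(1, vowel_offset - 0.01)  # Hz

print(f"vowel duration: {vowel_duration * 1000:.0f} ms, "
      f"F1 at offset: {f1_at_offset:.0f} Hz")
```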
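The similarity-space analysis itself can likewise be summarized in a short sketch. The abstract does not specify the self-supervised model or the distance metric, so the code below assumes fixed-length utterance embeddings (e.g., mean-pooled hidden states of a wav2vec 2.0-style model) and cosine distance; the embedding dimensionality and the randomly generated “embeddings” are placeholders standing in for real model outputs, not the study's data.

```python
# Hedged sketch of the two reported analyses: (1) dispersion of L1 vs. L2
# utterances in the similarity space, and (2) each L2 talker's average
# distance to L1 speech as a predictor of intelligibility.
import numpy as np

def _unit(rows: np.ndarray) -> np.ndarray:
    """L2-normalize each row so dot products become cosine similarities."""
    return rows / np.linalg.norm(rows, axis=1, keepdims=True)

def mean_pairwise_distance(embs: np.ndarray) -> float:
    """Dispersion of a set of utterance embeddings: mean pairwise cosine distance."""
    sims = _unit(embs) @ _unit(embs).T
    i, j = np.triu_indices(len(embs), k=1)
    return float(np.mean(1.0 - sims[i, j]))

def mean_distance_to_l1(talker_embs: np.ndarray, l1_embs: np.ndarray) -> float:
    """One scalar per talker: average cosine distance to every L1 utterance."""
    return float(np.mean(1.0 - _unit(talker_embs) @ _unit(l1_embs).T))

# Toy stand-in data, constructed (not measured) so that L2 utterances scatter
# more widely around a shared L1-like region, mimicking the reported pattern.
rng = np.random.default_rng(0)
dim = 768                                 # a typical SSL hidden-state size
center = rng.normal(size=dim)             # stand-in for the L1 region
l1 = center + 0.5 * rng.normal(size=(25 * 10, dim))   # 25 talkers x 10 sentences
l2 = {f"L2_{i:03d}": center + 0.7 * rng.normal(size=dim)   # per-talker offset
      + rng.normal(size=(10, dim))                          # utterance scatter
      for i in range(102)}

# Finding 1: L2 speech is less tightly clustered than L1 speech.
print("L1 dispersion:", round(mean_pairwise_distance(l1), 3))
print("L2 dispersion:", round(mean_pairwise_distance(np.vstack(list(l2.values()))), 3))

# Finding 2: the candidate predictor of each L2 talker's intelligibility.
distance_to_l1 = {tid: mean_distance_to_l1(embs, l1) for tid, embs in l2.items()}
```

In the analysis the abstract describes, these per-talker distances would then be related to transcription-based intelligibility scores and compared, in explained variance, against the traditional phonetic measures (vowel space size, pitch variability, speech rate).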