
*For correspondence: Bruno.Giordano@glasgow.ac.uk (BLG); christoph.kayser@glasgow.ac.uk (CK)

Competing interests: The authors declare that no competing interests exist.

Funding: See page 22. Received: 31 December 2016; Accepted: 07 May 2017; Published: 07 June 2017. Reviewing editor: Charles E Schroeder, Columbia University College of Physicians and Surgeons, United States.

Copyright Giordano et al. This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.

Contributions of local speech encoding and functional connectivity to audio-visual speech perception

Bruno L Giordano1,2*, Robin A A Ince2, Joachim Gross2, Philippe G Schyns2, Stefano Panzeri3, Christoph Kayser2*

1Institut de Neurosciences de la Timone UMR 7289, Aix Marseille Université – Centre National de la Recherche Scientifique, Marseille, France; 2Institute of Neuroscience and Psychology, University of Glasgow, Glasgow, United Kingdom; 3Neural Computation Laboratory, Center for Neuroscience and Cognitive Systems, Istituto Italiano di Tecnologia, Rovereto, Italy

Abstract

Seeing a speaker's face enhances speech intelligibility in adverse environments. We investigated the underlying network mechanisms by quantifying local speech representations and directed connectivity in MEG data obtained while human participants listened to speech of varying acoustic SNR and visual context. During high acoustic SNR, speech encoding by temporally entrained brain activity was strong in temporal and inferior frontal cortex, while during low SNR strong entrainment emerged in premotor and superior frontal cortex. These changes in local encoding were accompanied by changes in directed connectivity along the ventral stream and the auditory-premotor axis. Importantly, the behavioral benefit arising from seeing the speaker's face was not predicted by changes in local encoding but rather by enhanced functional connectivity between temporal and inferior frontal cortex. Our results demonstrate a role of auditory-frontal interactions in visual speech representations and suggest that functional connectivity along the ventral pathway facilitates speech comprehension in multisensory environments.

DOI: 10.7554/eLife.24763.001

Introduction

When communicating in challenging acoustic environments we profit tremendously from visual cues arising from the speaker's face. Movements of the lips, tongue or the eyes convey significant information that can boost speech intelligibility and facilitate the attentive tracking of individual speakers (Ross et al., 2007; Sumby and Pollack, 1954). This multisensory benefit is strongest for continuous speech, where visual signals provide temporal markers to segment words or syllables, or provide linguistic cues (Grant and Seitz, 1998). Previous work has identified the synchronization of brain rhythms between interlocutors as a potential neural mechanism underlying the visual enhancement of intelligibility (Hasson et al., 2012; Park et al., 2016; Peelle and Sommers, 2015; Pickering and Garrod, 2013; Schroeder et al., 2008). Both acoustic and visual speech signals exhibit pseudo-rhythmic temporal structures at prosodic and syllabic rates (Chandrasekaran et al., 2009; Schwartz and Savariaux, 2014). These regular features can entrain rhythmic activity in the observer's brain and facilitate perception by aligning neural excitability with acoustic or visual speech features (Giraud and Poeppel, 2012; Mesgarani and Chang, 2012; Park et al., 2016; Peelle and Davis, 2012; Schroeder and Lakatos, 2009; Schroeder et al., 2008; van Wassenhove, 2013; Zion Golumbic et al., 2013a). While this model predicts the visual enhancement of speech encoding in challenging multisensory environments, the network organization of multisensory speech encoding remains unclear.


Previous work has implicated many brain regions in the visual enhancement of speech, including superior temporal (Beauchamp et al., 2004; Nath and Beauchamp, 2011; Riedel et al., 2015; van Atteveldt et al., 2004), premotor and inferior frontal cortices (Arnal et al., 2009; Evans and Davis, 2015; Hasson et al., 2007b; Lee and Noppeney, 2011; Meister et al., 2007; Skipper et al., 2009; Wright et al., 2003). Furthermore, some studies have shown that the visual facilitation of speech encoding may even commence in early auditory cortices (Besle et al., 2008; Chandrasekaran et al., 2013; Ghazanfar et al., 2005; Kayser et al., 2010; Lakatos et al., 2009; Zion Golumbic et al., 2013a). However, it remains to be understood whether visual context shapes the encoding of speech differentially within distinct regions of the auditory pathways, or whether the visual facilitation observed within auditory regions is simply fed forward to upstream areas, perhaps without further modification. Hence, it is still unclear whether the enhancement of speech-to-brain entrainment is a general mechanism that mediates visual benefits at multiple stages along the auditory pathways.

Many previous studies on this question were limited by conceptual shortcomings: first, many have focused on generic brain activations rather than directly mapping the task-relevant sensory representations (activation mapping vs. information mapping [Kriegeskorte et al., 2006]), and hence have not quantified multisensory influences on those neural representations shaping behavioral performance. Those who did focused largely on auditory cortical activity (Zion Golumbic et al., 2013b) or did not perform source analysis of the underlying brain activity (Crosse et al., 2015). Second, while many studies have correlated speech-induced local brain activity with behavioral performance, few studies have quantified directed connectivity along the auditory pathways to ask whether perceptual benefits are better explained by changes in local encoding or by changes in functional connectivity (but see [Alho et al., 2014]). And third, many studies have neglected the continuous predictive structure of speech by focusing on isolated words or syllables (but see [Crosse et al., 2015]). However, this structure may play a central role in mediating the visual benefits (Bernstein et al., 2004; Giraud and Poeppel, 2012; Schroeder et al., 2008).

eLife digest

When listening to someone in a noisy environment, such as a cocktail party, we can understand the speaker more easily if we can also see his or her face. Movements of the lips and tongue convey additional information that helps the listener’s brain separate out syllables, words and sentences. However, exactly where in the brain this effect occurs and how it works remain unclear.

To find out, Giordano et al. scanned the brains of healthy volunteers as they watched clips of people speaking. The clarity of the speech varied between clips. Furthermore, in some of the clips the lip movements of the speaker corresponded to the speech in question, whereas in others the lip movements were nonsense babble. As expected, the volunteers performed better on a word recognition task when the speech was clear and when the lip movements agreed with the spoken dialogue.

Watching the video clips stimulated rhythmic activity in multiple regions of the volunteers’ brains, including areas that process sound and areas that plan movements. Speech is itself rhythmic, and the volunteers’ brain activity synchronized with the rhythms of the speech they were listening to.

Seeing the speaker's face increased this degree of synchrony. However, it also made it easier for sound-processing regions within the listeners' brains to transfer information to one another. Notably, only the latter effect predicted improved performance on the word recognition task. This suggests that seeing a person's face makes it easier to understand his or her speech by boosting communication between brain regions, rather than through effects on individual areas.

Further work is required to determine where and how the brain encodes lip movements and speech sounds. The next challenge will be to identify where these two sets of information interact, and how the brain merges them together to generate the impression of specific words.

DOI: 10.7554/eLife.24763.002


Importantly, given that the predictive visual context interacts with acoustic signal quality to increase perceptual benefits in adverse environments (Callan et al., 2014; Ross et al., 2007; Schwartz et al., 2004; Sumby and Pollack, 1954), one needs to manipulate both factors to fully address this question. Fourth, most studies focused on either the encoding of acoustic speech signals in a multisensory context, or quantified brain activity induced by visual speech, but little is known about the dependencies of neural representations of the acoustic and visual components of realistic speech (but see [Park et al., 2016]). Overcoming these problems, we here capitalize on the statistical and conceptual power offered by naturalistic continuous speech to study the network mechanisms that underlie the visual facilitation of speech perception.

Using source localized MEG activity we systematically investigated how local representations of acoustic and visual speech signals and task-relevant directed functional connectivity along the auditory pathways change with visual context and acoustic signal quality. Specifically, we extracted neural signatures of acoustically-driven speech representations by quantifying the mutual information (MI) between the MEG signal and the acoustic speech envelope. Similarly, we extracted neural signatures of visually-driven speech representations by quantifying the MI between lip movements and the MEG signal. Furthermore, we quantified directed causal connectivity between nodes in the speech network using time-lagged mutual information between MEG source signals. Using linear modelling we then asked how each of these signatures (acoustic and visual speech encoding; connectivity) are affected by contextual information about the speaker's face, by the acoustic signal to noise ratio, and by their interaction. In addition, we used measures of information theoretic redundancy to test whether the local representations of acoustic speech are directly related to the temporal dynamics of lip movements or rather reflect visual contextual information more indirectly. And finally, we asked how local speech encoding and network connectivity relate to behavioral performance.

Our results describe multiple and functionally distinct representations of acoustic and visual speech in the brain. These are differentially affected by acoustic SNR and visual context, and are not trivially explained by a simple superposition of representations of the acoustic speech and lip movement information. However, none of these local speech representations was predictive of the degree of visual enhancement of speech comprehension. Rather, this behavioral benefit was predicted only by changes in directed functional connectivity.

Results

Participants (n = 19) were presented with continuous speech that varied in acoustic quality (signal to noise ratio, SNR) and the informativeness of the speaker’s face. The visual context could be either informative (VI), showing the face producing the acoustic speech, or uninformative (VN), showing the same face producing nonsense babble (Figure 1A,B). We measured brain-wide activity using MEG while participants listened to eight six-minute texts and performed a delayed word recognition task.

Behavioral performance was better during high SNR and an informative visual context (Figure 2): a repeated-measures ANOVA revealed a significant effect of SNR (F(3,54) = 36.22, p<0.001, Huynh-Feldt corrected, η²p = 0.67) and of visual context (F(1,18) = 18.95, p<0.001, η²p = 0.51), as well as a significant interaction (F(3,54) = 4.34, p=0.008, η²p = 0.19). This interaction arose from a significant visual enhancement (VI vs VN) for SNRs of 4 and 8 dB (paired T(18) ≥ 3.00, Bonferroni corrected p ≤ 0.032; p>0.95 for other SNRs).
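For readers who want to reproduce this kind of behavioral analysis, the sketch below runs a 4 (SNR) x 2 (visual context) repeated-measures ANOVA on simulated word-recognition scores. The long-format layout, the column names and the simulated data are illustrative assumptions, and the sketch omits the Huynh-Feldt correction applied in the reported analysis.

```python
# Minimal sketch of the behavioral analysis: a 4 (SNR) x 2 (context) repeated-
# measures ANOVA on word-recognition accuracy. Column names and data are
# placeholders; AnovaRM does not report the Huynh-Feldt correction used in the paper.
import numpy as np
import pandas as pd
from statsmodels.stats.anova import AnovaRM

rng = np.random.default_rng(0)
rows = [dict(subject=s, snr=snr, context=ctx,
             accuracy=rng.normal(75 + 2 * snr + (5 if ctx == "VI" else 0), 5))
        for s in range(19) for snr in [2, 4, 6, 8] for ctx in ["VI", "VN"]]
df = pd.DataFrame(rows)

res = AnovaRM(df, depvar="accuracy", subject="subject",
              within=["snr", "context"]).fit()
print(res.anova_table)   # F and p values for SNR, context, and their interaction
```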

To study the neural mechanisms underlying this behavioral benefit we analyzed source-projected MEG data using information theoretic tools to quantify the fidelity of local neural representations of the acoustic speech envelope (speech MI), local representations of the visual lip movement (lip MI), as well as the directed causal connectivity between relevant regions (Figure 1C). For both local encoding and connectivity, we (1) modelled the extent to which they were modulated by the experimental conditions, and we (2) asked whether they correlated with behavioral performance across conditions and with the visual benefit across SNRs (Figure 1C).
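All of the analyses described below operate on band-limited versions of the speech envelope and the source-projected MEG signals. The following sketch shows one plausible implementation of this shared preprocessing step with SciPy; the filter type and order, the 150 Hz analysis rate, the envelope method and the placeholder signals are assumptions rather than the authors' exact settings.

```python
# Sketch of the assumed shared preprocessing: a wideband amplitude envelope of the
# speech audio, and band-pass filtering of both envelope and MEG source signal.
import numpy as np
from scipy.signal import butter, sosfiltfilt, hilbert

def speech_envelope(audio, fs_audio, fs_out=150):
    """Wideband amplitude envelope of the audio, resampled to the analysis rate."""
    env = np.abs(hilbert(audio))
    n_out = int(len(env) * fs_out / fs_audio)
    # crude resampling by interpolation; a full pipeline would low-pass filter first
    return np.interp(np.linspace(0, len(env) - 1, n_out),
                     np.arange(len(env)), env)

def bandpass(x, fs, lo, hi, order=3):
    sos = butter(order, [lo / (fs / 2), hi / (fs / 2)], btype="band", output="sos")
    return sosfiltfilt(sos, x)

fs_audio, fs_meg = 22050, 150                 # assumed sampling rates
audio = np.random.randn(fs_audio * 60)        # stand-in for 1 min of speech audio
meg = np.random.randn(fs_meg * 60)            # stand-in for one MEG source time course
env = speech_envelope(audio, fs_audio, fs_meg)
env_band = bandpass(env, fs_meg, 1.0, 4.0)    # e.g. the 1-4 Hz band
meg_band = bandpass(meg, fs_meg, 1.0, 4.0)
```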

Widespread speech-to-brain entrainment at multiple time scales

Speech-to-brain entrainment was quantified by the mutual information (speech MI) between the MEG time course and the acoustic speech envelope (not the speech + noise mixture) in individual frequency bands (Gross et al., 2013; Kayser et al., 2015). At the group level we observed widespread significant speech MI in all considered bands from 0.25 to 48 Hz (FWE = 0.05), except between 18–24 Hz (Figure 3—figure supplement 1A). Consistent with previous results (Gross et al., 2013; Ng et al., 2013; Park et al., 2016), speech MI was higher at low frequencies and strongest below 4 Hz (Figure 3—figure supplement 1C). This time scale is typically associated with syllabic boundaries or prosodic stress (Giraud and Poeppel, 2012; Greenberg et al., 2003). Indeed, the average syllabic rate was 212 syllables per minute in the present material, corresponding to about 3.5 Hz. Across frequencies, significant speech MI was strongest in bilateral auditory cortex and was more extended within the right hemisphere (Figure 3—figure supplement 1A and C). Indeed, peak significant MI values were significantly higher in the right compared to the left hemisphere at frequencies below 12 Hz (paired t-tests; T(18) ≥ 3.1, p ≤ 0.043 Bonferroni corrected), and did not differ at higher frequencies (T(18) ≤ 2.78, p ≥ 0.09).

This lateralization of speech-to-brain entrainment at frequencies below 12 Hz is consistent with previous reports (Gross et al., 2013).
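A minimal illustration of how such a speech-MI value can be computed is given below, using a Gaussian-copula mutual information estimator in the spirit of Ince et al. (2017). The published analysis used the analytic (complex-valued) signals and optimized speech-to-brain lags per region; the single real-valued copy, the fixed 100 ms lag and the placeholder signals here are simplifying assumptions.

```python
# Simplified sketch of speech-to-brain entrainment (speech MI): Gaussian-copula MI
# between the band-limited speech envelope and the MEG source signal at one lag.
import numpy as np
from scipy.stats import rankdata
from scipy.special import ndtri

def copnorm(x):
    """Rank the data and map to standard-normal quantiles (Gaussian copula)."""
    return ndtri(rankdata(x) / (len(x) + 1.0))

def gcmi(x, y):
    """Gaussian-copula MI (bits) between two 1-D signals."""
    gx, gy = copnorm(x), copnorm(y)
    c = np.corrcoef(gx, gy)[0, 1]
    return -0.5 * np.log2(1.0 - c ** 2)

fs = 150
lag = int(0.10 * fs)                 # assume the MEG lags the speech by ~100 ms
env_band = np.random.randn(fs * 60)  # band-limited envelope (placeholder)
meg_band = np.random.randn(fs * 60)  # band-limited MEG source signal (placeholder)
speech_mi = gcmi(env_band[:-lag], meg_band[lag:])
print(f"speech MI: {speech_mi:.4f} bits")
```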


Figure 1. Experimental paradigm and analysis. (A) Stimuli consisted of 8 continuous 6 min long audio-visual speech samples. For each condition we extracted the acoustic speech envelope as well as the temporal trajectory of the lip contour (video frames, top right: magnification of lip opening and contour). (B) The experimental design comprised eight conditions, defined by the factorial combination of 4 levels of speech to background signal to noise ratio (SNR = 2, 4, 6, and 8 dB) and two levels of visual informativeness (VI: Visual context Informative: video showing the narrator in synch with speech; VN: Visual context Not informative: video showing the narrator producing babble speech). Experimental conditions lasted 1 (SNR) or 3 (VIVN) minutes, and were presented in pseudo-randomized order. (C) Analyses were carried out on band-pass filtered speech envelope and MEG signals. The MEG data were source-projected onto a grey-matter grid. One analysis quantified speech entrainment, i.e. the mutual information (MI) between the MEG data and the acoustic speech envelope (speech MI), as well as between the MEG and the lip contour (lip MI), and the extent to which these were modulated by the experimental conditions. A second analysis quantified directed functional connectivity (DI) between seeds and the extent to which this was modulated by the experimental conditions. A final analysis assessed the correlation of either MI or DI with word-recognition performance.

Relevant variables in deposited data (doi:10.5061/dryad.j4567): SE_speech; LE_lip.

DOI: 10.7554/eLife.24763.003


Figure 2. Behavioral performance. Word recognition performance for each of the experimental conditions (mean ± SEM across participants, n = 19). Deposited data: BEHAV_perf.

DOI: 10.7554/eLife.24763.004


Importantly, we observed significant speech-to-brain entrainment not only within temporal cortices but across multiple regions in the occipital, frontal and parietal lobes, consistent with the notion that speech information is represented also within motor and frontal regions (Bornkessel-Schlesewsky et al., 2015; Du et al., 2014; Skipper et al., 2009).

Speech entrainment is modulated by SNR within and beyond auditory cortex

To determine the regions where acoustic signal quality and visual context affect the encoding of acoustic speech we modelled the condition-specific speech MI values based on effects of acoustic signal quality (SNR), visual informativeness (VIVN), and their interaction (SNRxVIVN). Random-effects significance was tested using a permutation procedure and cluster enhancement, correcting for multiple comparisons along all relevant dimensions. Effects of experimental factors emerged in multiple regions at frequencies below 4 Hz (Figure 3). Increasing the acoustic signal quality (SNR; Figure 3A) resulted in stronger speech MI in the right auditory cortex (1–4 Hz; local peak T statistic = 4.46 in posterior superior temporal gyrus; pSTG-R; Table 1), right parietal cortex (local peak T = 3.94 in supramarginal gyrus; SMG-R), and right dorso-ventral frontal cortex (IFGop-R; global peak T = 5.06). We also observed significant positive SNR effects within the right temporo-parietal and occipital cortex at 12–18 Hz (local peak right lingual gyrus, T = 5.12). However, inspection of the participant-specific data suggested that this effect was not reliable (only 58% of participants showed a speech MI increase with SNR, as opposed to a minimum of 84% for the other SNR effects), possibly because of the comparatively lower power of speech envelope fluctuations at higher frequencies (c.f. Figure 1A); hence this effect is not discussed further.
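The condition-level GLM can be sketched as follows: for each participant (and each ROI and band), the eight condition-specific MI values are regressed on z-scored SNR, a VI/VN contrast and their interaction, and the resulting standardized betas are tested across participants. The permutation and cluster-enhancement machinery of the reported analysis is replaced here by a simple one-sample t-test, and the MI values are placeholders.

```python
# Sketch of the condition GLM on speech MI (SNR, VIVN and their interaction).
import numpy as np
from scipy.stats import ttest_1samp, zscore

snr = np.tile([2, 4, 6, 8], 2)                 # the eight conditions
vivn = np.repeat([1, -1], 4)                   # +1 = VI, -1 = VN
X = np.column_stack([np.ones(8), zscore(snr), vivn, zscore(snr) * vivn])

n_subj = 19
mi = np.random.rand(n_subj, 8)                 # condition-specific MI per participant
betas = np.array([np.linalg.lstsq(X, zscore(y), rcond=None)[0] for y in mi])

for name, col in zip(["SNR", "VIVN", "SNRxVIVN"], [1, 2, 3]):
    t, p = ttest_1samp(betas[:, col], 0.0)     # group-level test on standardized betas
    print(f"{name}: T({n_subj - 1}) = {t:.2f}, p = {p:.3f}")
```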

Visual context reveals distinct strategies for handling speech in noise in premotor, superior and inferior frontal cortex

Contrasting informative and not-informative visual contexts revealed stronger speech MI when seeing the speaker's face (VI) at frequencies below 4 Hz in both hemispheres (Figure 3B): the right temporo-parietal cortex (0.25–1 Hz; HG; T = 4.75; Table 1), bilateral occipital cortex (1–4 Hz; global T peak in right visual cortex VC-R; T = 6.01) and left premotor cortex (1–4 Hz; PMC-L; local T peak = 3.81). Interestingly, the condition-specific pattern of MI for VC-R was characterized by an increase in speech MI with decreasing SNR during the VI condition, pointing to a stronger visual enhancement during more adverse listening conditions. The same effect was seen in premotor cortex (PMC-L).

Since visual benefits for perception emerge mostly when acoustic signals are degraded (Figure 2) (Ross et al., 2007; Sumby and Pollack, 1954), the interaction of acoustic and visual factors provides a crucial test for detecting non-trivial audio-visual interactions. We found significant interactions in the 0.25–1 Hz band in the right dorso-ventral frontal lobe, which peaked in the pars triangularis (IFGt-R; T = 3.62; Figure 3C; Table 1). Importantly, investigating the SNR effect in the frontal cortex voxels revealed two distinct strategies for handling speech in noise dependent on visual context (Figure 3D): during VI, speech MI increased with SNR in ventral frontal cortex (peak T for SNR in pars orbitalis; IFGor-R; T = 5.07), while in dorsal frontal cortex speech MI was strongest at low SNRs during VN (peak T in superior frontal gyrus; SFG-R; T = 3.55). This demonstrates distinct functional roles of ventral and dorsal prefrontal regions in speech encoding and reveals a unique role of superior frontal cortex for enhancing speech representations in a poorly informative context, such as the absence of visual information in conjunction with poor acoustic signals. For further analysis we focused on these regions and frequency bands revealed by the GLM effects (Figure 3E).

Condition effects are hemisphere-dominant but not strictly lateralized

Our results reveal significantly stronger entrainment at low frequencies (c.f. Figure 3—figure supplement 1) and a prevalence of condition effects on speech MI in the right hemisphere (c.f. Figure 3). We directly tested whether these condition effects were significantly lateralized by comparing the respective GLM effects between corresponding ROIs across hemispheres (Table 1). This revealed that only the 1–4 Hz SNR effect in IFGop-R was significantly lateralized (T(18) = 6.03; FWE = 0.05 corrected across ROIs), while all other GLM effects did not differ significantly between hemispheres.


Noise invariant dynamic representations of lip movements

To complement the above analysis of speech-to-brain entrainment we also systematically analyzed the entrainment of brain activity to lip movements (lip MI). This allowed us to address whether the enhancement of the encoding of acoustic speech during an informative visual context arises from a co-representation of acoustic and visual speech information in the same regions or not.


Figure 3. Modulation of speech-to-brain entrainment by acoustic SNR and visual informativeness. Changes in speech MI with the experimental factors were quantified using a GLM for the condition-specific speech MI based on the effects of SNR (A), visual informativeness VIVN (B), and their interaction (SNRxVIVN) (C). The figures display the cortical-surface projection onto the Freesurfer template (proximity = 10 mm) of the group-level significant statistics for each GLM effect (FWE = 0.05). Graphs show the average speech MI values for each condition (mean ± SEM), for local and global (red asterisk) peaks of the T maps. Lines indicate the across-participant average regression model and numbers indicate the group-average standardized regression coefficient for SNR in the VI and VN conditions (>/< 0.0 = positive/negative, rounded to 0). (D) T maps illustrating the opposite SNR effects within voxels with significant SNRxVIVN effects. MI graphs for the peaks of these maps are shown in (C) (IFGor-R and SFG-R = global T peaks for SNR effects in VI and VN, respectively). (E) Location of global and local seeds of GLM T maps, used for the analysis of directed connectivity. See also Tables 1 and 2 and Figure 3—figure supplements 1–2. Deposited data: SE_meg; SE_speech; SE_miS.

DOI: 10.7554/eLife.24763.005

The following figure supplements are available for figure 3:

Figure supplement 1. Entrainment of rhythmic MEG activity to the speech envelope and lip movements.

DOI: 10.7554/eLife.24763.006

Figure supplement 2. Information theoretic decomposition of speech entrainment.

DOI: 10.7554/eLife.24763.007

Figure supplement 3. Condition-changes in the amplitude of oscillatory activity.

DOI: 10.7554/eLife.24763.008


As expected based on previous work, the acoustic speech envelope and the trajectory of lip movements for the present material were temporally coherent, in particular in the delta and theta bands (Figure 1A) (Chandrasekaran et al., 2009; Park et al., 2016; Schwartz and Savariaux, 2014).

Lip-to-brain entrainment was quantified for the visual informative condition only, across the same frequency bands as considered for the speech MI (Figure 3—figure supplement 1B). This revealed widespread significant lip MI in frequency bands below 8 Hz, with the strongest lip entrainment occurring in occipital cortex (Figure 3—figure supplement 1B). Peak lip MI values were larger in the right hemisphere, in particular for the 4–8 Hz band (Figure 3—figure supplement 1C), but this effect was not significant after correction for multiple comparisons (T(18) ≤ 2.53, p ≥ 0.06). We then asked whether in any regions with significant lip MI the encoding of lip information changed with SNR. No significant SNR effects were found (FWE = 0.05, corrected across voxels and 0–12 Hz frequency bands), demonstrating that the encoding of lip signals is invariant across acoustic conditions.

We also directly compared speech MI and lip MI within the ROIs highlighted by the condition effects on speech MI (c.f. Figure 3E). In most ROIs speech MI was significantly stronger than lip MI (Table 2; T(18) ≥ 3.58 for HG-R, pSTG-R, IFGop-R and PMC-L; FWE = 0.05 corrected across ROIs), while lip MI was significantly stronger in VC-R (T(18) = 3.35; FWE = 0.05).

Table 1. Condition effects on speech MI. The table lists global and local peaks in the GLM T-maps. Anatomical labels and Brodmann areas are based on the AAL and Talairach atlases. β = standardized regression coefficient; SEM = standard error of the participant average. ROI-contralat. = T test for a significant difference of GLM betas between the respective ROI and its contralateral grid voxel.

Anatomical label | Brodmann area | MNI coordinates | GLM effect | Frequency band | T(18) | β (SEM) | ROI-contralat. T(18)
HG-R | 42 | 63 21 11 | VIVN | 0.25–1 Hz | 4.75* | 0.39 (0.06) | 2.00
pSTG-R | 22 | 48 30 8 | SNR | 1–4 Hz | 4.46* | 0.48 (0.08) | 2.36
SMG-R | 40 | 57 30 38 | SNR | 1–4 Hz | 3.94* | 0.29 (0.09) | 0.22
PMC-L | 6 | 54 0 32 | VIVN | 1–4 Hz | 3.81* | 0.27 (0.06) | 0.65
IFGt-R | 46 | 42 33 2 | SNRxVIVN | 0.25–1 Hz | 3.62* | 0.29 (0.07) | 1.48
IFGop-R | 47 | 51 18 2 | SNR | 1–4 Hz | 5.06* | 0.36 (0.08) | 6.03*
IFGor-R | 47 | 30 26 16 | SNR in VI | 0.25–1 Hz | 5.07* | 0.44 (0.08) | 1.92
SFG-R | 6 | 12 30 58 | SNR in VN | 0.25–1 Hz | 3.55* | 0.41 (0.09) | 2.21
VC-R | 17/18 | 18 102 -4 | VIVN | 1–4 Hz | 6.01* | 0.45 (0.06) | 1.84

*denotes significant effects (FWE = 0.05 corrected for multiple comparisons). Relevant variables in deposited data (doi:10.5061/dryad.j4567): SE_meg; SE_speech; SE_miS.

DOI: 10.7554/eLife.24763.009

Table 2. Analysis of the contribution of audio-visual signals in shaping entrainment. For each region / effect of interest (c.f. Table 1) the table lists the comparison of condition-averaged speech and lip MI (positive = greater speech MI); the condition-averaged information theoretic redundancy between speech and lip MI; and the condition effects (GLM) on the conditional mutual information (CMI) between the MEG signal and the speech envelope, while partialling out effects of lip signals.

Label | Speech vs. lip MI: T(18) | Speech vs. lip MI: Avg (SEM) | Speech-lip redundancy: T(18) | Speech-lip redundancy: Avg (SEM) | Speech CMI: Effect | Speech CMI: T(18) | Speech CMI: β (SEM)
HG-R | 4.27* | 28.16 (6.59) | 0.73 | 0.33 (0.44) | VIVN | 4.37* | 0.35 (0.06)
pSTG-R | 3.90* | 5.42 (1.39) | 0.49 | 0.19 (0.38) | SNR | 4.66* | 0.49 (0.08)
SMG-R | 2.95 | 1.32 (0.45) | 1.10 | 0.51 (0.47) | SNR | 4.10* | 0.29 (0.09)
PMC-L | 3.58* | 1.06 (0.30) | 3.83* | 2.42 (0.63) | VIVN | 3.47* | 0.24 (0.06)
IFGt-R | 1.21 | 0.87 (0.72) | 2.29 | 1.75 (0.77) | SNRxVIVN | 4.07* | 0.31 (0.07)
IFGop-R | 3.68* | 1.50 (0.41) | 4.69* | 1.56 (0.33) | SNR | 4.70* | 0.35 (0.07)
SFG-R | 0.88 | 0.61 (0.70) | 4.13* | 2.37 (0.57) | SNR in VN | 3.62* | 0.43 (0.09)
VC-R | 3.35* | 2.19 (0.65) | 2.37 | 0.68 (0.29) | VIVN | 5.77* | 0.45 (0.06)

*denotes significant effects (FWE = 0.05 corrected for multiple comparisons). Deposited data: ID_meg; ID_speech; ID_lip; ID_infoterms.

DOI: 10.7554/eLife.24763.010


Speech entrainment does not reflect trivial entrainment to lip dynamics

Given that only the speech and not the lip representation was affected by SNR, the above results suggest that both acoustic and visual speech signals are represented independently in rhythmically entrained brain activity. To address the interrelation between the representations of acoustic and visual speech signals more directly, we asked whether the condition effects on speech MI result from genuine changes in the encoding of the acoustic speech envelope, or whether they result from a superposition of local representations of the acoustic and the visual speech signals. Given that visual and acoustic speech are temporally coherent and offer temporally redundant information, it could be that the enhancement of speech MI during the VI condition simply results from a superposition of local representations of the visual and acoustic signals arising within the same brain region. Alternatively, it could be that the speech-to-brain entrainment reflects a representation of the acoustic speech signal that is informed by visual contextual information, but which is not a one-to-one reflection of the dynamics of lip movements. We performed two analyses to address this.

First, we calculated the conditional mutual information between the MEG signal and the acoustic speech envelope while partialling out the temporal dynamics common to lip movements and the speech envelope. If the condition effects on speech MI reflect changes within genuine acoustic representations, they should persist when removing direct influences of lip movements. Indeed, we found that all of the condition effects reported in Figure 3 persisted when computed based on conditional MI (absolute T(18) ≥ 3.47; compare Table 2 for CMI with Table 1 for MI; ROI-specific MI and CMI values are shown in Figure 3—figure supplement 2A,B).
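A hedged sketch of this control is shown below: a Gaussian-copula conditional MI I(speech; MEG | lip), which for copula-normalized variables reduces to a partial correlation. The signals are placeholders and the estimator is a simplified univariate version of the published one.

```python
# Sketch of the control analysis: conditional MI between speech envelope and MEG,
# partialling out the lip trajectory, I(speech; MEG | lip).
import numpy as np
from scipy.stats import rankdata
from scipy.special import ndtri

def copnorm(x):
    return ndtri(rankdata(x) / (len(x) + 1.0))

def gc_cmi(x, y, z):
    """Gaussian-copula conditional MI (bits) of 1-D x and y given 1-D z."""
    gx, gy, gz = copnorm(x), copnorm(y), copnorm(z)
    rxy = np.corrcoef(gx, gy)[0, 1]
    rxz = np.corrcoef(gx, gz)[0, 1]
    ryz = np.corrcoef(gy, gz)[0, 1]
    r_partial = (rxy - rxz * ryz) / np.sqrt((1 - rxz ** 2) * (1 - ryz ** 2))
    return -0.5 * np.log2(1.0 - r_partial ** 2)

n = 9000
env_band, meg_band, lip_band = (np.random.randn(n) for _ in range(3))  # placeholders
print(f"speech CMI given lip: {gc_cmi(env_band, meg_band, lip_band):.4f} bits")
```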

Second, we computed the information-theoretic redundancy between the local speech and lip representations. Independent representations of each speech signal would result in small redundancy values, while a common representation of lip and acoustic speech signals would be reflected in a redundant representation. Across SNRs we found that these representations were significantly redundant in the ventral and dorsal frontal cortex (T(18) ≥ 3.83 for SFG-R, IFGop-R, IFGt-R and PMC-L) but not in the temporal lobe or early auditory and visual cortices (FWE = 0.05 corrected across ROIs; Table 2; Figure 3—figure supplement 2C). However, the actual redundancy values were rather small (condition-averaged values all below 3%). All in all, this suggests that the local representations of the acoustic speech envelope in sensory regions are informed by visual evidence but in large part do not represent the same information that is provided by the dynamics of lip movements. This in particular also holds for the acoustic speech MI in visual cortex. The stronger redundancy in association cortex (IFG, SFG, PMC) suggests that these regions feature co-representations of acoustic speech and lip movements.
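One common way to quantify such redundancy is the co-information I(S;M) + I(L;M) - I(S,L;M) between the speech envelope (S), the lip trajectory (L) and the MEG signal (M), sketched below with the same Gaussian-copula estimator. The exact redundancy measure used in the paper may differ in detail, so treat this as an illustration of the general idea rather than the authors' implementation.

```python
# Sketch of redundancy between speech and lip representations via co-information.
import numpy as np
from scipy.stats import rankdata
from scipy.special import ndtri

def copnorm(x):
    return ndtri(rankdata(x) / (len(x) + 1.0))

def gcmi_nd(x, y):
    """Gaussian-copula MI (bits) between column-stacked x and y, shape (n_samples, n_dim)."""
    gx = np.column_stack([copnorm(c) for c in np.atleast_2d(x.T)])
    gy = np.column_stack([copnorm(c) for c in np.atleast_2d(y.T)])
    cxy = np.cov(np.hstack([gx, gy]).T)
    cx = cxy[:gx.shape[1], :gx.shape[1]]
    cy = cxy[gx.shape[1]:, gx.shape[1]:]
    return 0.5 * np.log2(np.linalg.det(cx) * np.linalg.det(cy) / np.linalg.det(cxy))

n = 9000
speech, lip, meg = (np.random.randn(n, 1) for _ in range(3))   # placeholder signals
red = (gcmi_nd(speech, meg) + gcmi_nd(lip, meg)
       - gcmi_nd(np.hstack([speech, lip]), meg))
print(f"redundancy (co-information): {red:.4f} bits")
```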

Directed causal connectivity within the speech network

The diversity of the patterns of speech entrainment in temporal, premotor and inferior frontal regions across conditions shown in Figure 3 could arise from the individual encoding properties of each region, or from changes in functional connectivity between regions with conditions. To directly test this, we quantified the directed causal connectivity between these regions of interest. To this end we used Directed Information (DI), also known as Transfer Entropy, an information theoretic measure of Wiener-Granger causality (Massey, 1990; Schreiber, 2000). We took advantage of previous work that made this measure statistically robust when applied to neural data (Besserve et al., 2015; Ince et al., 2017).
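Conceptually, DI from a seed to a target can be written as the conditional MI between the seed's past and the target's present, given the target's own past. The sketch below implements this single-lag, transfer-entropy-style version with the Gaussian-copula estimator; the published analysis scanned seed/target and speech lags and used additional robustness measures, and the 50 ms lag and random signals here are illustrative assumptions.

```python
# Sketch of directed connectivity: DI(seed -> target) = I(seed_past; target_now | target_past).
import numpy as np
from scipy.stats import rankdata
from scipy.special import ndtri

def copnorm(x):
    return ndtri(rankdata(x) / (len(x) + 1.0))

def gc_cmi(x, y, z):
    gx, gy, gz = copnorm(x), copnorm(y), copnorm(z)
    rxy, rxz, ryz = (np.corrcoef(a, b)[0, 1]
                     for a, b in [(gx, gy), (gx, gz), (gy, gz)])
    rp = (rxy - rxz * ryz) / np.sqrt((1 - rxz ** 2) * (1 - ryz ** 2))
    return -0.5 * np.log2(1.0 - rp ** 2)

def directed_information(seed, target, lag):
    past_seed = seed[:-lag]          # seed activity earlier in time
    past_target = target[:-lag]      # target's own past (conditioned out)
    now_target = target[lag:]        # target activity later in time
    return gc_cmi(past_seed, now_target, past_target)

fs, lag = 150, int(0.05 * 150)       # ~50 ms past window (assumption)
seed, target = np.random.randn(fs * 60), np.random.randn(fs * 60)  # placeholders
print(f"DI(seed -> target): {directed_information(seed, target, lag):.4f} bits")
```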

We observed significant condition-averaged DI between multiple nodes of the speech network (FWE = 0.05; Figure 4A and Figure 4—figure supplement 1A). This included among others the feed-forward pathways of the ventral and dorsal auditory streams, such as from auditory cortex (HG-R) and superior temporal regions (pSTG-R) to premotor (PMC-L) and to inferior frontal regions (IFGt-R, IFGop-R), from right parietal cortex (SMG-R) to premotor cortex (PMC-L), as well as feed-back connections from premotor and inferior frontal regions to temporal regions. In addition, we also observed significant connectivity between frontal (SFG-R) and visual cortex (VC).

We then asked whether and where connectivity changed with experimental conditions (Figure 4B, Table 3 and Figure 4—figure supplement 1B). Within the right ventral stream, feed-forward connectivity from the temporal lobe (HG-R, pSTG-R) to frontal cortex (IFGt-R, IFGop-R) was enhanced during high acoustic SNR (FWE = 0.05; T(18) ≥ 3.1). More interestingly, this connectivity was further enhanced in the presence of an informative visual context (pSTG-R → IFGt-R, VIVN effect, T = 4.57), demonstrating a direct influence of visual context on the propagation of information along the ventral stream. Interactions of acoustic and visual context on connectivity were also found from auditory (HG-R) to premotor cortex (PMC-L, negative interaction; T = 3.01). Here connectivity increased with increasing SNR in the absence of visual information and increased with decreasing SNR during an informative context, suggesting that visual information changes the qualitative nature of auditory-motor interactions.


Figure 4. Directed causal connectivity within the speech-entrained network. Directed connectivity between seeds of interest (c.f. Figure 3E) was quantified using Directed Information (DI). (A) Maximum significant condition-average DI across lags (FWE = 0.05 across lags; white = no significant DI). (B) Significant condition effects (GLM for SNR, VIVN or their interaction) on DI (FWE = 0.05 across speech/brain lags and seed/target pairs). Bar graphs display condition-specific DI values for each significant GLM effect along with the across-participants average regression model (lines). Numbers indicate the group-average standardized betas for SNR in the VI and VN conditions, averaged across lags associated with a significant GLM effect (>/< 0.0 = positive/negative, rounded to 0). Error bars = ±SEM. See also Table 3 and Figure 4—figure supplement 1. Deposited data: DI_meg; DI_speech; DI_di; DI_brainlag; DI_speechlag.

DOI: 10.7554/eLife.24763.011

The following figure supplement is available for figure 4:

Figure supplement 1. Directed functional connectivity within the speech-entrained network.

DOI: 10.7554/eLife.24763.012

Table 3. Analysis of directed connectivity (DI). The table lists connections with significant condition-averaged DI, and condition effects on DI. SEM = standard error of participant average; β = standardized regression coefficient. T(18) = maximum T statistic within significance mask. All reported effects are significant (FWE = 0.05 corrected for multiple comparisons). Deposited data: DI_meg; DI_speech; DI_di; DI_brainlag; DI_speechlag.

Seed | Target | DI T(18) | GLM effect | T(18) | β (SEM)
HG-R | PMC-L | 3.38 | SNRxVIVN | 3.01 | 0.24 (0.08)
HG-R | IFGt-R | 3.03 | SNR | 3.32 | 0.31 (0.09)
HG-R | IFGop-R | 4.54 | SNR | 3.19 | 0.26 (0.07)
pSTG-R | IFGt-R | 3.39 | SNR | 3.91 | 0.32 (0.09)
pSTG-R | IFGt-R | | VIVN | 4.57 | 0.23 (0.05)
pSTG-R | IFGop-R | 4.12 | SNR | 3.31 | 0.28 (0.08)
IFGt-R | IFGop-R | 3.76 | VIVN | 3.56 | 0.21 (0.06)
IFGop-R | pSTG-R | 4.16 | SNR | 4.65 | 0.31 (0.09)
SFG-R | VC-R | 4.40 | SNRxVIVN | 3.69 | 0.28 (0.08)

DOI: 10.7554/eLife.24763.013


An opposite interaction was observed between the frontal lobe and visual cortex (SFG-R → VC-R, T = 3.69). Finally, feed-back connectivity along the ventral pathway was significantly stronger during high SNRs (IFGt-R → pSTG-R; T = 4.56).

Does speech entrainment or connectivity shape behavioral performance?

We performed two analyses to test whether and where changes in the local representation of speech information or directed connectivity (DI) contribute to explaining the multisensory behavioral benefits (c.f. Figure 2). Given the main focus on the visual enhancement of perception we implemented this analysis only for speech and not for lip MI. First, we asked where speech MI and DI relate to performance changes across all experimental conditions (incl. changes in SNR). This revealed a significant correlation between condition-specific word-recognition performance and the strength of speech MI in pSTG-R and IFGt-R (r ≥ 0.28; FWE = 0.05; Table 4 and Figure 5A), suggesting that stronger entrainment in the ventral stream facilitates comprehension. This hypothesis was further corroborated by a significant correlation of connectivity along the ventral stream with behavioral performance, both in feed-forward (HG-R → IFGt-R/IFGop-R; pSTG-R → IFGt-R/IFGop-R; r ≥ 0.24; Table 4) and feed-back directions (IFGop-R → pSTG-R; r = 0.37). The enhanced quality of speech perception during favorable listening conditions hence results from enhanced speech encoding and the supporting network connections along the temporal-frontal axis.
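The neuro-behavioral correlation analysis can be sketched as follows: per participant, the eight condition-specific MI (or DI) values are correlated with the corresponding word-recognition scores, and the Fisher-z transformed correlations are tested against zero across participants. The family-wise error control across ROIs and connections used in the paper is omitted here, and the data are placeholders.

```python
# Sketch of the correlation between condition-specific MI (or DI) and behavior.
import numpy as np
from scipy.stats import pearsonr, ttest_1samp

n_subj, n_cond = 19, 8
mi = np.random.rand(n_subj, n_cond)            # condition-specific MI or DI (placeholder)
accuracy = 60 + 30 * np.random.rand(n_subj, n_cond)   # word-recognition scores (placeholder)

r = np.array([pearsonr(mi[s], accuracy[s])[0] for s in range(n_subj)])
t, p = ttest_1samp(np.arctanh(r), 0.0)         # Fisher z transform before the group test
print(f"mean r = {r.mean():.2f}, T({n_subj - 1}) = {t:.2f}, p = {p:.3f}")
```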

Table 4. Association of behavioral performance with speech entrainment and connectivity. Performance: T statistic and average of participant-specific correlation (SEM) between behavioral performance and speech MI / DI. Visual enhancement: correlation between SNR-specific behavioral benefit (VI-VN) and the respective difference in speech MI or DI.

Speech MI

ROI | Performance T(18) | Performance r (SEM) | Visual enhancement T(18) | Visual enhancement r (SEM)
HG-R | 1.27 | 0.13 (0.10) | 0.21 | 0.04 (0.15)
pSTG-R | 3.43* | 0.30 (0.09) | 0.53 | 0.07 (0.11)
SMG-R | 2.35 | 0.23 (0.09) | -0.39 | -0.07 (0.14)
PMC-L | 0.47 | 0.04 (0.08) | 0.13 | 0.03 (0.16)
IFGt-R | 3.09* | 0.28 (0.09) | 1.25 | 0.29 (0.18)
IFGop-R | 2.38 | 0.24 (0.09) | -0.25 | -0.05 (0.17)
SFG-R | -0.47 | -0.04 (0.08) | 1.61 | 0.35 (0.17)
VC-R | 1.55 | 0.18 (0.10) | -0.82 | -0.14 (0.14)

Directed connectivity

Seed | Target | Performance T(18) | Performance r (SEM) | Visual enhancement T(18) | Visual enhancement r (SEM)
HG-R | PMC-L | 0.90 | 0.06 (0.06) | -0.07 | -0.01 (0.14)
HG-R | IFGt-R | 4.83* | 0.31 (0.07) | 2.55* | 0.28 (0.11)
HG-R | IFGop-R | 3.19* | 0.24 (0.07) | 1.86 | 0.31 (0.17)
pSTG-R | IFGt-R | 4.28* | 0.27 (0.06) | 1.28 | 0.16 (0.12)
pSTG-R | IFGop-R | 3.59* | 0.29 (0.08) | 1.82 | 0.32 (0.17)
IFGt-R | IFGop-R | 1.11 | 0.08 (0.07) | 2.27 | 0.33 (0.14)
IFGop-R | pSTG-R | 4.51* | 0.37 (0.08) | 2.55* | 0.37 (0.15)
SFG-R | VC-R | -0.04 | 0.00 (0.08) | 0.90 | 0.17 (0.18)

*denotes significant effects (FWE = 0.05 corrected for multiple comparisons). Deposited data: BEHAV_perf; SE_meg; DI_meg; SE_miS; DI_di; NBC_miS; NBC_di.

DOI: 10.7554/eLife.24763.015


Second, we asked whether and where the improvement in behavioral performance with an informative visual context (VI-VN) correlates with an enhancement in speech encoding or connectivity. This revealed no significant correlations between the visual enhancement of local speech MI and perceptual benefits (all T values < FWE = 0.05 threshold; Table 4). However, changes in both feed-forward (HG-R → IFGt-R; r = 0.28; Figure 5B) and feed-back connections (IFGop-R → pSTG-R; r = 0.37) along the ventral stream were significantly correlated with the multisensory perceptual benefit (FWE = 0.05).

Changes in speech entrainment are not a result of changes in the amplitude of brain activity

We verified that the reported condition effects on speech MI are not simply a by-product of changes in the overall oscillatory activity. To this end we calculated the condition-averaged Hilbert amplitude for each ROI and performed a GLM analysis for condition effects as for speech entrainment (FWE = 0.05 with correction across ROIs and frequency bands; Table 5; Figure 3—figure supplement 3).


Figure 5. Neuro-behavioral correlations. (A) Correlations between behavioral performance and condition-specific speech MI (perform. r), and correlations between the visual enhancement of performance and the visual enhancement in MI (vis. enhanc. r). (B) Same for DI. Only those ROIs or connections exhibiting significant correlations are shown. Error bars = ±SEM. See also Tables 2–3. Deposited data: BEHAV_perf; SE_meg; DI_meg; SE_miS; DI_di; NBC_miS; NBC_di.

DOI: 10.7554/eLife.24763.014


This revealed a reduction of oscillatory activity during the visual informative condition in the occipital cortex across many bands (VC-R, 4–48 Hz), in the inferior frontal cortex (IFGt-R and IFGop-R, 24–48 Hz), and in the pSTG-R at 4–8 Hz and 18–24 Hz. No significant effects of SNR or SNRxVIVN interactions were found (FWE = 0.05). Importantly, none of these VIVN effects overlapped with the significant changes in speech MI (0.25–4 Hz) and only the reduction in pSTG-R power overlapped with condition effects in connectivity. All in all, this suggests that the reported changes in speech encoding and functional connectivity are not systematically related to changes in the strength of oscillatory activity with acoustic SNR or visual context.
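A sketch of this control analysis is shown below: the band-limited Hilbert amplitude of a source signal is averaged per condition and can then be entered into the same SNR/VIVN GLM as the entrainment values. The band edges, sampling rate and random source signal are illustrative assumptions.

```python
# Sketch of the amplitude control: condition-averaged band-limited Hilbert amplitude.
import numpy as np
from scipy.signal import butter, sosfiltfilt, hilbert

def band_amplitude(x, fs, lo, hi, order=3):
    sos = butter(order, [lo / (fs / 2), hi / (fs / 2)], btype="band", output="sos")
    return np.abs(hilbert(sosfiltfilt(sos, x)))

fs = 150
source = np.random.randn(fs * 60)                       # one ROI source time course (placeholder)
amp_4_8 = band_amplitude(source, fs, 4.0, 8.0).mean()   # condition-averaged 4-8 Hz amplitude
print(f"mean 4-8 Hz amplitude: {amp_4_8:.3f}")
```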

Changes in directed connectivity do not reflect changes in phase- amplitude coupling

Cross-frequency coupling between the phase and amplitudes of different rhythmic brain signals has been implicated in mediating neural computations and communication (Canolty and Knight, 2010).

We asked whether the above results on functional connectivity are systematically related to specific patterns of phase-amplitude coupling (PAC). We first searched for significant condition-average PAC between each pair of ROIs across a wide range of frequency combinations. This revealed significant PAC within VC-R, within pSTG-R and within SMG-R, as well as significant coupling of the 18–24 Hz VC-R power with the 0.25–1 Hz IFGop-R phase (FWE = 0.05; see Table 6).

Table 5. Changes in band-limited source signal amplitude with experimental conditions. The table lists GLM T-statistics and participant-averaged standardized regression coefficients (and SEM) for significant VIVN effects on MEG source amplitude (FWE = 0.05 corrected across ROIs and frequency bands). Effects of SNR and SNRxVIVN interactions were also tested but were not significant. Deposited data: SE_meg; AMP_amp.

ROI | Band | T(18) | β (SEM)
pSTG-R | 4–8 Hz | 3.66 | 0.38 (0.09)
pSTG-R | 18–24 Hz | 4.11 | 0.40 (0.08)
IFGt-R | 24–36 Hz | 3.91 | 0.40 (0.06)
IFGt-R | 30–48 Hz | 4.49 | 0.39 (0.08)
IFGop-R | 24–36 Hz | 4.44 | 0.40 (0.07)
IFGop-R | 30–48 Hz | 4.14 | 0.41 (0.07)
VC-R | 4–8 Hz | 3.70 | 0.55 (0.08)
VC-R | 8–12 Hz | 4.53 | 0.70 (0.05)
VC-R | 12–18 Hz | 5.20 | 0.70 (0.05)
VC-R | 18–24 Hz | 5.57 | 0.66 (0.06)
VC-R | 24–36 Hz | 5.57 | 0.55 (0.08)
VC-R | 30–48 Hz | 4.54 | 0.46 (0.10)

DOI: 10.7554/eLife.24763.016

Table 6. Analysis of phase-amplitude coupling (PAC). The table lists the significant condition-averaged PAC values for all pairs of ROIs and frequency bands (FWE = 0.05 corrected across pairs of phase and power frequencies). SEM = standard error of participant average. None of these changed significantly with conditions (no GLM effects at FWE = 0.05). Deposited data: SE_meg.

Phase ROI (band) | Power ROI (band) | T(18) | PAC (SEM)
pSTG-R (1–4 Hz) | pSTG-R (8–12 Hz) | 3.26 | 0.22 (0.07)
SMG-R (4–8 Hz) | SMG-R (30–48 Hz) | 3.58 | 0.27 (0.07)
IFGop-R (0.25–1 Hz) | VC-R (18–24 Hz) | 3.08 | 0.22 (0.07)
VC-R (4–8 Hz) | VC-R (8–12 Hz) | 3.06 | 0.35 (0.11)
VC-R (1–4 Hz) | VC-R (12–18 Hz) | 3.44 | 0.48 (0.13)
VC-R (4–8 Hz) | VC-R (24–36 Hz) | 3.76 | 0.26 (0.07)

DOI: 10.7554/eLife.24763.017


However, we found no significant changes in PAC with experimental conditions, suggesting that the changes in functional connectivity described above are not systematically related to specific patterns of cross-frequency coupling.
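The cross-regional PAC analysis can be sketched as follows, using the normalized mean-vector-length index on the low-frequency phase of one ROI and the high-frequency amplitude of another. The exact PAC estimator and its permutation statistics in the paper may differ, and the filter bands and random signals below are placeholders.

```python
# Sketch of cross-regional phase-amplitude coupling with a mean-vector-length index.
import numpy as np
from scipy.signal import butter, sosfiltfilt, hilbert

def bandpass(x, fs, lo, hi, order=3):
    sos = butter(order, [lo / (fs / 2), hi / (fs / 2)], btype="band", output="sos")
    return sosfiltfilt(sos, x)

def pac_mvl(phase_sig, amp_sig, fs, phase_band, amp_band):
    """Normalized mean-vector-length coupling of low-frequency phase and high-frequency amplitude."""
    phase = np.angle(hilbert(bandpass(phase_sig, fs, *phase_band)))
    amp = np.abs(hilbert(bandpass(amp_sig, fs, *amp_band)))
    return np.abs(np.mean(amp * np.exp(1j * phase))) / np.mean(amp)

fs = 150
ifg, vc = np.random.randn(fs * 120), np.random.randn(fs * 120)   # placeholder ROI signals
# e.g. 0.25-1 Hz IFGop-R phase coupled to 18-24 Hz VC-R amplitude
print(f"PAC (MVL): {pac_mvl(ifg, vc, fs, (0.25, 1.0), (18.0, 24.0)):.4f}")
```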

Discussion

The present study provides a comprehensive picture of how acoustic signal quality and visual context interact to shape the encoding of acoustic and visual speech information and the directed functional connectivity along speech-sensitive cortex. Our results reveal a dominance of feed-forward pathways from auditory regions to inferior frontal cortex under favorable conditions, such as during high acoustic SNR. We also demonstrate the visual enhancement of acoustic speech encoding in auditory cortex, as well as non-trivial interactions of acoustic quality and visual context in premotor and in superior and inferior frontal regions. Furthermore, our results reveal the superposition of acoustic and visual speech signals (lip movements) in association regions and the dominance of visual speech representations in visual cortex. These patterns of local encoding were accompanied by changes in directed connectivity along the ventral pathway and from auditory to premotor cortex. Yet, the behavioral benefit arising from seeing the speaker’s face was not related to any region-specific visual enhancement of acoustic speech encoding. Rather, changes in directed functional connectivity along the ventral stream were predictive of the multisensory behavioral benefit.

Entrained auditory and visual speech representations in temporal, parietal and frontal lobes

We observed functionally distinct patterns of speech-to-brain entrainment along the auditory pathways. Previous studies on speech entrainment have largely focused on the auditory cortex, where entrainment to the speech envelope is strongest (Ding and Simon, 2013; Gross et al., 2013; Keitel et al., 2017; Mesgarani and Chang, 2012; Zion Golumbic et al., 2013a), and only few studies have systematically compared speech entrainment along auditory pathways (Zion Golumbic et al., 2013b). This was in part due to the difficulty of separating distinct processes reflecting entrainment when contrasting only few experimental conditions (e.g. forward and reversed speech [Ding and Simon, 2012; Gross et al., 2013]), or the difficulty of separating contributions from visual (i.e. lip movements) and acoustic speech signals (Park et al., 2016). Based on the susceptibility to changes in acoustic signal quality and visual context, the systematic use of region-specific temporal lags between stimulus and brain response, and the systematic analysis of both acoustic and visual speech signals, we here establish entrainment as a ubiquitous mechanism reflecting distinct acoustic and visual speech representations along auditory pathways.

Entrainment to the acoustic speech envelope was reduced with decreasing acoustic SNR in temporal, parietal and ventral prefrontal cortex, directly reflecting the reduction in behavioral performance in challenging environments. In contrast, entrainment was enhanced during low SNR in superior frontal and premotor cortex. While there is strong support for a role of frontal and premotor regions in speech processing (Du et al., 2014; Evans and Davis, 2015; Heim et al., 2008; Meister et al., 2007; Morillon et al., 2015; Rauschecker and Scott, 2009; Skipper et al., 2009; Wild et al., 2012), most evidence comes from stimulus-evoked activity rather than signatures of neural speech encoding. We directly demonstrate the specific enhancement of frontal (PMC, SFG) speech representations during challenging conditions. This enhancement is not directly inherited from the temporal lobe, as temporal regions exhibited either no visual facilitation (pSTG) or visual facilitation without an interaction with SNR (HG).

We also observed significant entrainment to the temporal trajectory of lip movements in visual cortex, the temporal lobe and frontal cortex (Figure 3—figure supplement 1). This confirms a previous study, which has specifically focused on the temporal coherence between brain activity and lip movements (Park et al., 2016). Importantly, by comparing the local encoding of both the acoustic and visual speech information, and conditioning out the visual signal from the speech MI, we found that sensory cortices and the temporal lobe provide largely independent representations of the acoustic and visual speech signals. Indeed, the information theoretic redundancy between acoustic and visual representations was small and was significant only in association regions (SFG, IFG, PMC). This suggests that early sensory cortices contain largely independent representations of acoustic and visual speech information, while association regions provide a superposition of auditory and visual speech representations. However, the condition effects on the acoustic representation in any of the analyzed regions did not disappear when factoring out the representation of lip movements, suggesting that these auditory and visual representations are differentially influenced by sensory context. These findings extend previous studies by demonstrating the co-existence of visual and auditory speech representations along auditory pathways, but also reiterate the role of PMC as one candidate region that directly links neural representations of lip movements with perception (Park et al., 2016).

Multisensory enhancement of speech encoding in the frontal lobe

Visual information from the speaker's face provides multiple cues that enhance intelligibility. In support of a behavioral multisensory benefit we found stronger entrainment to the speech envelope during an informative visual context in multiple bilateral regions. First, we replicated the visual enhancement of auditory cortical representations (HG) (Besle et al., 2008; Kayser et al., 2010; Zion Golumbic et al., 2013a). Second, visual enhancement of an acoustic speech representation was also visible in early visual areas, as suggested by prior studies (Nath and Beauchamp, 2011; Schepers et al., 2015). Importantly, our information theoretic analysis suggests that this representation of acoustic speech is distinct from the visual representation of lip dynamics, which co-exists in the same region. The visual enhancement of acoustic speech encoding in visual cortex was strongest when SNR was low, unlike the encoding of lip movements, which was not affected by acoustic SNR. Hence this effect is most likely explained by top-down signals providing acoustic feedback to visual cortices (Vetter et al., 2014). Third, speech representations in ventral prefrontal cortex were selectively involved during highly reliable multisensory conditions and were reduced in the absence of the speaker's face. These findings are in line with suggestions that the IFG facilitates comprehension (Alho et al., 2014; Evans and Davis, 2015; Hasson et al., 2007b; Hickok and Poeppel, 2007) and implements multisensory processes (Callan et al., 2014, 2003; Lee and Noppeney, 2011), possibly by providing amodal phonological, syntactic and semantic processes (Clos et al., 2014; Ferstl et al., 2008; McGettigan et al., 2012). Previous studies often reported enhanced IFG response amplitudes under challenging conditions (Guediche et al., 2014). In contrast, by quantifying the fidelity of speech representations, we here show that speech encoding is generally better during favorable SNRs. This discrepancy is not necessarily surprising, if one assumes that IFG representations are derived from those in the temporal lobe, which are also more reliable during high SNRs. Noteworthy, however, we found that speech representations within ventral IFG are selectively stronger during an informative visual context, even when discounting direct co-representations of lip movements. We thereby directly confirm the hypothesis that IFG speech encoding is enhanced by visual context.

Furthermore, we demonstrate the visual enhancement of speech representations in premotor regions, which could implement the mapping of audio-visual speech features onto articulatory representations (Meister et al., 2007; Morillon et al., 2015; Morís Fernández et al., 2015; Skipper et al., 2009; Wilson et al., 2004). We show that this enhancement is inversely related to acoustic signal quality. While this observation is in agreement with the notion that perceptual benefits are strongest under adverse conditions (Ross et al., 2007; Sumby and Pollack, 1954), there was no significant correlation between the visual enhancement of premotor encoding and behavioral performance. Our results thereby deviate from previous work that has suggested a driving role of premotor regions in shaping intelligibility (Alho et al., 2014; Osnes et al., 2011). Rather, we support a modulatory influence of auditory-motor interactions (Alho et al., 2014; Callan et al., 2004; Hickok and Poeppel, 2007; Krieger-Redwood et al., 2013; Morillon et al., 2015). In another study we recently quantified dynamic representations of lip movements, calculated when discounting influences of the acoustic speech, and reported that left premotor activity was significantly predictive of behavioral performance (Park et al., 2016). One explanation for this discrepancy may be the presence of a memory component in the present behavioral task, which may engage other brain regions (e.g. IFG) more than other tasks. Another explanation could be that premotor regions contain, besides the acoustic speech representation described here, complementary information about visual speech that is not directly available in the acoustic speech contour, and is either genuinely visual or correlated with more complex acoustic properties of speech. Further work is required to disentangle the multisensory nature of speech encoding in premotor cortex.
