1. Introduction
1.1. Co-speech movements, co-speech actions and co-speech gestures
Speech often co-occurs with simultaneous movement. We speak while we are performing a wide range of intentional actions like walking, cycling, driving, gardening, playing games or constructing something. These co-speech movements often involve our upper and/or lower limbs, which have been shown to be biomechanically coupled in their respective production mechanisms, resulting in a cross-modal temporal coordination which may underlie multimodal speech planning (Pouw, de Jonge-Hoekstra, Harrison, Paxton, & Dixon, 2021; Serré, Dohen, Fuchs, Gerber, & Rochet-Capellan, 2022). However, the exact nature and limits of this cross-modal coordination process are hitherto not well understood.
Gesture research also investigates spontaneous movements that accompany speech: co-speech gestures. In fact, communicative gestures are typically expected to be synchronized with affiliated speech to fulfill a communicative function (McNeill, 1992), but cf. Novack, Wakefield, and Goldin-Meadow (2016). Obviously, not all co-speech movements – for example, those mentioned above – qualify as instances of gesturing. Therefore, co-occurrence with speech appears to be a necessary but not sufficient condition for a co-speech movement to be interpreted as a gesture. However, the distinction between co-speech movements that are perceived as gestures and those that are not is not straightforward. In their study on the perceived quality of movements, Novack et al. (2016) showed that observers tend to interpret movements as gestural representations of actions when the movement did not literally act on objects, for example, by grasping them, but was performed near objects. More ambiguous movements, which could not directly be interpreted as goal-directed and intentional, tended to be seen as gestural representations of actions when they were accompanied by speech.
These findings corroborate the assumption that co-occurring speech indeed aids an interpretation of a movement as a gesture and are in line with Hostetter and Alibali’s (2008, 2019) model of gestures as ‘simulated actions.’ While such simulated actions would typically be interpreted as iconic or metaphoric gestures, for example, describing the shape or size of a referent mentioned in co-occurring speech, Hostetter and Alibali (2019) argue that their model can also be applied to deictics and beats, with deictics being the ‘simulation of touching.’
In order to understand whether intentional co-speech actions (e.g., throwing a dart while speaking) are synchronized differently with speech than simulated actions, or co-speech gestures (e.g., simulating how to throw a dart while speaking), Breckinridge Church, Kelly, and Holcombe (2014) conducted an experiment in which they compared the level of synchronization between speech and these two different types of co-speech movements. Their findings show that both gestures and actions tend to precede the onset of concurrent speech, and both conditions show strong temporal overlap of speaking and moving, but there is less temporal distance between movement onsets and speech onsets when speakers are gesturing.
So far, little is known about the ways in which these two types of co-speech movements (actions versus gestures) differ with respect to their temporal coordination with (partly) co-occurring speech.
1.2. Cross-modal coordination and synchronization of speech and gesture
When it comes to the exact way that speech and gesture are temporally coordinated, there is by now an extensive range of empirical studies pointing out that prosodic structure provides anchor points for temporal cross-modal synchronization in production and perception: In languages that use pitch accents as a means of indicating prominence, a strong temporal synchronization of pitch accents and gesture strokes has been reported (Esteve-Gibert & Prieto, 2013; Leonard & Cummins, 2010; Loehr, 2012; Treffner, Peter, & Kleidon, 2008; Yassinik, Renwick, & Shattuck-Hufnagel, 2004). Co-speech gestures also reflect and are tied to prosodic phrasing (Esteve-Gibert & Prieto, 2013; Rohrer, Prieto, & Delais-Roussarie, 2019; Türk & Calhoun, 2023). It should be noted, though, that cross-modal coordination does not always result in cross-modal synchrony. Rather, the various modalities are often coordinated sequentially, with gestures typically preceding the corresponding event in the speech stream. Such sequential effects have been described particularly between stroke onsets and the onsets of corresponding words or ‘lexical affiliates’ (Bergmann, Aksu, & Kopp, 2011; Chu & Hagoort, 2014; Esteve-Gibert, Borràs-Comes, Asor, Swerts, & Prieto, 2017).
The strong gesture–prosody link is not exclusive to manual gestures but also exists for head gestures (Esteve-Gibert et al., 2017). Furthermore, duration modulations in the prosodic domain (pitch accentuation and boundary marking) are paralleled in co-speech gesturing (Krivokapić, Tiede, & Tyrone, 2017), temporal disruption in one modality triggers prolongations in the other modality (Chu & Hagoort, 2014; Pouw & Dixon, 2019), and cognitive load triggers hesitations both in speech and in gesture (Betz, Bryhadyr, Türk, & Wagner, 2023). Also, a strong link between prosody and movement has been confirmed in perception tasks, where listeners have been shown to express prosodic structure in tapping or drumming behaviors (Parrell, Goldstein, Lee, & Byrd, 2014; Rathcke, Lin, Falk, & Dalla Bella, 2021; Wagner, Ćwiek, & Samlowski, 2019), albeit in highly individual manners (Bruggeman, Schade, Włodarczak, & Wagner, 2022).
1.3. The stability and potential function of cross-modal synchronization
Taken together, these findings are in line with a view in which prosodic structure provides the temporal scaffold for the cross-modal coordination described above (Pouw & Dixon, 2019). Wagner, Malisz, and Kopp (2014) argue that prosody serves as a particularly useful cross-modal anchor, since both prosody and gestures are loosely coupled with the corresponding verbal stream and serve similar functions such as highlighting and phrasing. One of the most well-described functions of synchronized gesture and speech is an increase in prosodic prominence on the lexical affiliate or corresponding pitch accent (AlMoubayed, Beskow, & Granström, 2009; Ambrazaitis & House, 2017; Krahmer & Swerts, 2007) as well as a generally improved intelligibility of the conveyed message (AlMoubayed et al., 2009), even though this picture appears to be more complex in less controlled data sets (Berger & Zellers, 2022).
Since gestures appear to be more strongly synchronized with speech than co-speech actions are (cf. Section 1.1), the question remains whether cross-modal synchronization serves a communicative function in its own right, consciously chosen as a communicative strategy as part of speech planning (Kisa, Goldin-Meadow, & Casasanto, 2022), or is rather the quasi-automatic consequence of their co-expressivity, caused by similar production processes and a biomechanical linkage, as supported by Pouw et al. (2021) and Serré et al. (2022).
There is indeed evidence that the production of co-speech movements is closely connected to the form and function of accompanying prosody beyond timing, as co-speech gestures (and maybe actions) are also linked to prosodic shape and the underlying information structure: Im and Baumann (2020) found a higher gesture frequency both on new, accessible or contrastive words and on highly prominent, rising pitch accents. Similarly, Kügler and Gregori (2023) and Wagner and Bryhadyr (2017) report a tighter synchrony of manual co-speech gestures or actions with corresponding pitch accents when these are information-structurally relevant. For Turkish, Türk and Calhoun (2023) show a connection between information-structural phrasing and gestural phrasing. For Dutch, repeated referential gestures are ‘reduced’ in gesture space (Hoetjes et al., 2015), mirroring findings about the less prominent prosodic expression of ‘given’ or predictable referents in various Germanic languages (Baumann, Röhr, & Grice, 2015; Baumann & Winter, 2018; Watson, Arnold, & Tanenhaus, 2008, inter alia). However, high prosodic prominence (e.g., triggered by information structure) does not necessarily lead to a tighter synchronization across modalities: In Australian English, eyebrow movements occur earlier when followed by a word in narrow focus, as compared to a word in wide focus (Kim, Cvejic, & Davis, 2014). This implies certain degrees of freedom in the way that the modalities are temporally coordinated in speech production.
There are further findings that are compatible with a perspective on temporal cross-modal synchronization as selectively fine-tuned according to communicative needs. In fact, the abovementioned strong cross-modal synchronization may be particularly reserved for ‘beat gestures,’ which do not carry referential meaning (Shattuck-Hufnagel & Ren, 2018). Rather, their main function may be a highlighting that resembles the corresponding prosodic prominence. This point of view has received support from studies showing that representational and interactive gestures are used more frequently when interlocutors can see each other, while beat gestures are used similarly often (Alibali, Heath, & Myers, 2001; Bavelas, Gerwing, Sutton, & Prevost, 2008), pointing to their strong cross-modal linkage in speech planning, irrespective of communicative benefit. Wagner and Bryhadyr (2017) report that co-speech actions tend to be produced overall less synchronously with corresponding verbal actions, as well as earlier, if interlocutors cannot see each other, but the authors concede that their design may have been confounded due to different recording setups across conditions.
In fact, many studies that have investigated cross-modal synchronization have focused on beat or deictic gestures, which tend to share a rather simple shape mostly reduced to a gesture ‘stroke’ that can be superimposed on more complex representational gestures such as iconics or metaphorical gestures. Representational gestures appear to be more loosely synchronized with the verbal channel: Unlike beat gestures (Leonard & Cummins, 2010; Özyürek, Willems, Kita, & Hagoort, 2007), iconics can be perceptually integrated by listeners even if they occur much later than their lexical affiliate (Kirchhof, 2017). This may even have consequences for the type of co-speech gesture selected by speakers: For French, which expresses information structure either syntactically or prosodically, Ferré (2014) found that syntactically marked information structure more often co-occurs with metaphorical gestures, while prosodically expressed information structure tends to be aligned with other types of co-speech gestures. In German, iconic gestures often accompany elements that are not focused (Kügler & Gregori, 2023), while Turkish speakers prefer to pair iconic and metaphoric gestures with foci and deictic gestures with topics (Türk & Calhoun, 2023).
Taken together, there is ample empirical evidence that prosody provides a major anchor for cross-modal synchronization, tied to prosodic highlighting. This prosodic highlighting often serves the expression of information structure. What is less clear is how automatic the process of cross-modal prosodic anchoring is: whether it can be activated selectively whenever prosodic highlighting – or a ‘beat function’ – serves communication, or whether the synchronization persists irrespective of its visibility and hence its potential communicative benefit. Also unclear is whether the degree of synchronization found between speech and co-speech gestures also applies to co-speech actions, possibly as a function of their potential communicative benefit. These two questions will be investigated in the subsequent empirical study.
1.4. The present study
In our analysis, we examine the assumption that prosody is a cross-modal anchor that serves as a temporal attractor for – practically any – co-speech action, thereby turning it into a potential co-speech gesture. Depending on communicative context, this temporal coordination may be selectively activated as a communicative resource, for example, (1) when listeners can see the co-speech action (Kisa et al., 2022) or (2) when the information structure of a message licenses an increased prosodic prominence.
Obviously, mutual visibility of co-speech movements is a prerequisite to their becoming potentially relevant in communication (Kisa et al., 2022). We therefore first investigate whether the mutual visibility of hand movements increases the cross-modal synchronization of speech and co-speech actions. Such a result would strengthen a position in which any co-speech action that co-expresses an ongoing verbalization serves as a communicative affordance, which can selectively be produced in a way that allows for its interpretation as a co-speech gesture. A useful strategy for this would be a stronger synchronization of the co-speech action with the verbalization it expresses.
Second, we investigate whether information structure-related prosodic modifications also trigger cross-modal synchronization of speech and co-speech actions. In line with previous research on gesture–prosody coordination, we expect that information structural conditions that cause an increase in prosodic prominence will also lead to a stronger cross-modal synchronization of prosody and co-speech actions, thereby turning them into potential co-speech gestures. This result would strengthen a view of tightly coupled cross-modal processing, where a prosodic anchor such as an increased prosodic prominence attracts movements in another modality, even if these were not originally intended as communicative gestures.
Our main hypotheses are that speakers perform co-speech actions in tighter synchrony with their corresponding verbalizations if
1. interlocutors can see each other’s relevant co-speech actions performed with their hands (H1), or
2. a verbal message licenses prosodic highlighting (prominence) due to its information structure (H2).
2. Methods
We test our hypotheses through an analysis of quasi-spontaneous, game-based multimodal interaction data, during which participants play a series of TicTacToe games on a shared vertical game board and verbalize their interactions while playing; this can be described as a multimodal version of the design in Watson et al. (2008). TicTacToe is a competitive game in which two players mark their respective moves on a shared 3 × 3 grid with their individual shape or color, with the aim of completing a vertical, horizontal or diagonal line of three identical marks while preventing the opponent from doing the same.
Below, we describe our data collection, annotation and feature extraction as well as the operationalization of our hypotheses. In particular, we will describe our operationalization of visibility of hands (cf. Section 2.2 and H1), information structure (cf. Section 2.3 and H2) and ‘tighter synchrony of co-speech actions and corresponding verbalizations’ (cf. Section 2.5, H1 and H2). Since we measure cross-modal synchrony based on prosodic anchors, it is necessary to understand the prosodic realizations of our data: H2 relies on the assumption that our manipulation of information structure has an effect on prosody. It is furthermore possible that mutual visibility of the hands influences prosodic realizations independently, thus being a potential confound in our synchronization analysis. For these reasons, we also perform an acoustic–phonetic analysis of the prosodic realizations prior to the synchronization analysis (cf. Section 2.7).
2.1. Recording setup
We recorded 40 participants (native speakers of the Northern German Standard Variety) forming 20 dyadic pairs. Interlocutors within dyads were of roughly equal social status, typically friends of similar age. As we had no clear-cut hypotheses concerning these factors, we did not collect data on participant gender, ethnicity, age, height or personality. The recordings of one speaker were excluded from further analyses due to technical problems that resulted in poor recording quality.
The audio recordings were carried out at the faculty’s recording studio using Sennheiser neckband microphones, and each participant was recorded in a separate audio channel. Due to the task (see below), little cross-talk occurred. Additionally, videos were recorded from five different camera angles: One camera recorded both participants from the side, two cameras captured both sides of the game board to allow an online assessment of the ongoing game moves, and two cameras captured the faces of the participants as close-ups in a 3/4 perspective between front and side (cf. Figure 1).

Figure 1. The recording setup in the full visibility condition.
Each player received a set of cutouts in the form of blue or red felt squares to mark their moves on a shared vertical grid (cf. Figure 1). The game board looked like a regular 3 × 3 TicTacToe grid, but with every cell visibly numbered. This numbering was introduced to enable the interlocutors to refer unambiguously to the different cells on the game board using the numbers 1–9. The participants were asked to use the felt squares to mark their individual game moves on the shared game board and to also verbalize their moves. This design ensures that each manual game move can be related to a matching verbalized game move. At the same time, it informs the other player about the current game move, whether they see it or not, and it can be played under different conditions of mutual visibility (cf. Section 2.2). In the conditions in which the players could not see, but only hear, the opponent’s game move, they also had to use felt squares to mark these moves on their own grid. It should also be noted that the design may reduce the frequency of co-speech gestures, as participants are highly involved in performing their manual game moves. Prosodically, a typical verbalized move is produced by placing a nuclear pitch accent (sentence accent) on the target of the move, which corresponds to one of the numbers available on the game board and is realized sentence-finally in the vast majority of cases (94%), for example,
Mein nächster Zug geht auf FÜNF.
‘My next move goes on FIVE’.
The verbalizations of move targets on the game board (‘1–9’) were later analyzed with respect to their prosodic realizations. However, speakers were not instructed to use a particular sentence structure or use specific words to refer to their targets. Thus, some speakers would only use single-word utterances, or many variations of the message above, for example, und jetzt FÜNF (‘and now FIVE’), or similar. To introduce variation across games, the initial move was preset (quasi-randomly) by the experimenter and loudly communicated to both participants prior to each game. The order of game openings rotated between participants.
The participants were not given any instructions (how) to coordinate their game move verbalizations with their corresponding manually performed game moves. They were simply asked to play the game in an explicitly bimodal fashion.
Due to the spontaneous character of the interaction, some dyads did not finish their interactions, and in a few cases participants forgot to verbalize their moves; both led to some data loss. Also, if a game did not end in a tie (as the majority did), participants obviously produced fewer than the maximal number of possible game moves. The full data set comprises 2736 verbalized moves in total.
2.2. Visibility conditions
In each dyad, participants performed four games of TicTacToe in each of four different visibility conditions. The order of these conditions shifted systematically between dyads, to control for order effects. The four different visibility conditions are specified below (cf. Figure 2):
1. Manual and facial visibility: transparent game board, full view of the interlocutor’s hands and face.
2. Manual visibility, no facial visibility: transparent game board, but obstructed view of the interlocutor’s head (with the help of a curtain).
3. Facial visibility, no manual visibility: nontransparent game board, but full view of the interlocutor’s head and face.
4. Neither manual nor facial visibility: nontransparent game board, obstructed view of the interlocutor’s head (with the help of a curtain).

Figure 2. Manipulation of the four different within-subject visibility conditions. 1. Full visibility (upper left): transparent game board between interlocutors; 2. manual visibility (upper right): transparent game board and curtain at head height; 3. facial visibility (lower left): nontransparent game board between interlocutors; and 4. no visibility (lower right): nontransparent game board and curtain at head height.
With respect to H1, we expect that the visibility of the hands should make cross-modal synchronization of manually performed and verbalized game moves potentially informative. We therefore expect a tighter cross-modal synchronization in conditions 1 (full visibility) and 2 (manual visibility) and contrast the factor manual visibility (conditions 1 and 2) with no manual visibility (conditions 3 and 4) in our subsequent analyses. Please note that although we controlled for facial visibility and manual visibility independently in our recordings, we refrain from studying the independent impact of facial visibility. This is because our main interest lies in the impact of the visibility of the hands on cross-modal synchronization, as the hands perform the co-speech actions, which are also co-expressive with the verbalizations of game moves. When seen, these actions automatically become communicatively relevant. While one could argue that lip movements also reveal the content of the acoustic speech signal, these cannot be desynchronized from the verbal channel and are less interesting from the perspective of cross-modal synchronization of co-expressive modalities. We do not consider the concentration on manual visibility a major drawback, as facial visibility and manual visibility are not confounded in our design: Both conditions of manual visibility contain data points with and without facial visibility between the interlocutors.
2.3. Information structure conditions
Largely following Watson et al. (2008), we used the TicTacToe game setting to disentangle two aspects of information structure, namely contextual unpredictability and importance. As mentioned above, the initial game moves were predefined by the experimenter and openly announced as the mandatory starting field to both participants. As these initial moves were simply repeated in the participants’ verbalizations, they were defined as fully predictable/least unpredictable (‘8’). The second move was annotated as most unpredictable (‘1’), as the participants still have many options to choose from on the game board. The following moves were annotated with decreasing unpredictability (or increasing predictability) (‘2–8’) in the course of the game, as the options on the game board become fewer. For subsequent statistical analysis, unpredictability was recoded in a binary fashion, with unpredictability values $>4$ coded as ‘predictable’ and values $<4$ as ‘unpredictable’. Importance was operationalized in a binary fashion, with moves that prevent or constitute a winning move annotated as ‘important’ and others as ‘unimportant’ (for the outcome of the game). As game-decisive moves tend to come later in the game and hence tend to be predictable, the presence of either feature may predict (to some degree) the absence of the other. If this were the case, we could have merged unpredictability and importance into a single variable (as done in Watson et al., 2008). However, an analysis of nominal association revealed a high degree of independence between the two factors (Theil’s $U = 0.018$, cf. Table 1), which were accordingly treated as separate. This result is less surprising than it may seem: early in the game, when many cells are still available, moves can already be decisive, while later in the game, non-decisive (but relatively predictable) moves may still occur. First moves of games and last moves of games ending in ties (the vast majority of last moves) are coded as both predictable and unimportant. For an illustration of unpredictability and importance within the game, cf. Figure 3.
Table 1. Contingency table of game moves classified as (un)predictable or (un)important in the full data set. Numbers in brackets indicate the reduced number of game moves that are used in the analyses of cross-modal synchronization (cf. 2.5)


Figure 3. Manipulation of importance and unpredictability in the recordings. The game move on the left (field number 5) is unpredictable but unimportant for the outcome of the game. The move on the right (field number 7) is more predictable, but game decisive, hence important.
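To make this operationalization concrete, the following R sketch shows how the binary recoding and the association check could be implemented. The data frame and column names (moves, move_index, wins_or_blocks) are illustrative assumptions rather than the script actually used for the paper, and Theil’s U is computed by hand from the entropies instead of via a dedicated package.

```r
# Hypothetical input: one row per verbalized game move, with move_index coded as
# in the text (1 = second move, most unpredictable; 8 = opening move, most
# predictable) and wins_or_blocks = TRUE if the move constitutes or prevents a win.
moves <- read.csv("moves.csv")

moves$unpredictability <- ifelse(moves$move_index > 4, "predictable", "unpredictable")
moves$importance       <- ifelse(moves$wins_or_blocks, "important", "unimportant")

# Theil's uncertainty coefficient U = I(X;Y) / H(Y): how much knowing one binary
# factor reduces uncertainty about the other (values near 0 = near-independence).
entropy <- function(p) { p <- p[p > 0]; -sum(p * log(p)) }
pxy <- prop.table(table(moves$unpredictability, moves$importance))
U   <- (entropy(rowSums(pxy)) + entropy(colSums(pxy)) - entropy(as.vector(pxy))) /
       entropy(colSums(pxy))
U  # the analysis reported in the text yields U = 0.018
```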
Differentiating importance and unpredictability is also motivated by findings on American English, where importance and unpredictability have been found to be expressed differently in the acoustic–prosodic domain (Watson et al., 2008), with unpredictability having effects on pitch and importance showing a marked increase in intensity. As pitch accent peaks have been found to be strong cross-modal anchors, such a functional differentiation in the acoustic–prosodic domain may have an independent influence on cross-modal synchronization.
With respect to our hypotheses, we regard both unpredictability and importance as factors of information-structural relevance. In line with our H2, we expect both conditions to increase prosodic prominence and hence strengthen the cross-modal synchronization between manually and verbally performed game moves.
2.4. Annotations
In each dyad, the verbalizations of the game move targets (numbers 1–9) as well as the corresponding move’s importance and unpredictability were annotated manually using Praat (Boersma & Weenink, 2025). We only annotated the manual game moves of the active player, not those movements with which a player tracks the opponent’s game moves when mutual visibility is lacking (cf. Section 2.1). In the vast majority of cases (94%), these targets corresponded to the final word of an utterance, coinciding with a (nuclear) accent. We did not perform manual annotations of precise pitch accent type or shape. Rather, we performed a sanity check on our prosodic realizations to rule out systematic deaccentuations and analyzed the f0 trajectories that point to qualitative differences in intonation contours with a functional relevance (cf. Section 2.5). It should be noted that the pitch peaks thus identified are not necessarily identical to the prominent tonal targets of pitch accents within the framework of autosegmental-metrical phonology. That is, a falling pitch accent with a low tonal mid-syllable target is likely to have its pitch peak measured at the beginning of the syllable, where the highest point of the falling contour is realized. This approach makes our results comparable to much existing research on cross-modal synchronization (cf. Section 1) but currently does not take into account whether different pitch accent types show preferences in their cross-modal alignment.
In order to determine the cross-modal synchronization between hand movements and prosody, we annotated the timing of the manually performed game moves. Each game move started with the participant initiating the movement toward the game board and ended at the point in time at which the felt square came into contact with the vertical game board; this endpoint also coincides with the maximum amplitude of the manual action. These movements therefore closely resemble the ‘stroke’ phase of a gesture. It should be noted that movement initializations were often difficult to determine, as players would often initiate a move and then rest, or change their minds about their game move target during execution. Movement annotations were carried out manually using ELAN (Wittenburg, Brugman, Russel, Klassmann, & Sloetjes, 2006).
2.5. Feature extraction and cross-modal synchronization measurement
Using a Praat script (Boersma & Weenink, 2025), we automatically extracted a series of measurements on the annotated verbalizations of target moves (i.e., the participants’ verbalizations of the numbers ‘1–9’) and manual performances of the target moves, including
• duration of the verbalized target move (ms);
• average f0 of the verbalized target move (st rel. to 1 Hz);
• pitch range of the verbalized target move (st);
• root-mean-square (RMS) intensity of the verbalized target move (dB);
• time and amplitude (st rel. to 1 Hz) of the maximal pitch excursion in the verbalized target move (= pitch accent peak);
• onset time of the verbalized game move target (= word initializations ‘1–9’);
• time of manual contact on the game board (= point of manual game move execution).
We also calculated the distances between the manually annotated points of manual game board contact (cf. Section 2.4) and (1) the pitch accent peaks of the corresponding target item verbalizations of the numbers 1–9 (henceforth: distPEAK) and (2) the starting points of these number verbalizations (henceforth: distWORDINIT). Pitch accent peaks and lexical word initializations were selected as anchor points, in line with previous work on cross-modal coordination (cf. Section 1). If a manual contact precedes the anchor (pitch peak or word initialization), this results in a negative distance; if a manual contact follows the anchor, this results in a positive distance (cf. Figure 4). Distances are measured in milliseconds. We chose the point of manual contact with the game board as the potential anchor in the domain of co-speech actions, as it corresponds most closely in form to the gesture apex in (beat) gestures, thereby serving as the most promising candidate anchor for cross-modal coordination in the manual domain. We refrained from using movement initializations as potential anchors, as these had often been vague and difficult to determine during manual annotation. Since the distance analyses are restricted to cases with an identifiable pitch accent peak, and our data show considerable amounts of creaky or (partially) unvoiced productions, the data set is reduced to $n = 1620$ game moves. Our two distance measures between potential anchors in the speech domain and corresponding co-speech actions are illustrated in Figure 4.

Figure 4. Illustration of distance measurements between pitch accent peaks (distPEAK), word initializations (distWORDINIT) and the time of manual contact with the game board when performing a game move. Both prosodic anchors (word initializations = dashed line; PA peak = dotted line) create different ‘0’ points on the time axis. Moves preceding those anchor points result in negative values, and moves following those anchor points result in positive values. In this illustration, a manual game move is executed following the word initialization (left), resulting in a positive value for the distWORDINIT measure, and precedes the PA peak (right), resulting in a negative value for the distPEAK measure.
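The two distance measures can be derived from the annotated time stamps in a few lines of R. The sketch below is illustrative only: the data frame and the column names (t_contact, t_peak, t_word_onset) are assumptions, not the original extraction script.

```r
# Minimal sketch of the two cross-modal distance measures (in ms), assuming one
# row per game move with annotated time stamps in seconds:
#   t_contact    = time of manual contact with the game board
#   t_peak       = time of the pitch accent peak in the verbalized target
#   t_word_onset = onset time of the verbalized target word ('1'-'9')
moves <- read.csv("moves.csv")  # hypothetical input file

# Negative values: manual contact precedes the prosodic anchor;
# positive values: manual contact follows it (cf. Figure 4).
moves$distPEAK     <- (moves$t_contact - moves$t_peak) * 1000
moves$distWORDINIT <- (moves$t_contact - moves$t_word_onset) * 1000

# Rows without a detectable pitch peak (e.g., creaky or unvoiced productions)
# drop out of the distPEAK analyses.
moves_peak <- subset(moves, !is.na(t_peak))
```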
Our analysis rests on the assumption that participants are to some degree flexible with respect to their placement of co-speech actions relative to the verbalized game move. If this were fixed, the prosodic shape of a pitch accent (e.g., with an early or late pitch peak) or the duration of the verbalized game move would determine the distance measures. We therefore checked the global distribution of co-speech actions relative to the corresponding verbalized target item on the game board and found a wide, neither normal nor uniform, distribution of co-speech actions (Kolmogorov–Smirnov, $D = 0.64$, $p < 0.001$), which is slightly skewed toward the initial half of the verbalized game move.
The acoustic features and measurements of cross-modal distance thus obtained served as dependent variables in the subsequent statistical analyses.
2.6. Statistical analyses
Throughout, we performed our statistical analyses using R (R Core Team, 2022), version 4.1.3, together with the (main) packages lme4 (Bates, Mächler, Bolker, & Walker, 2015), emmeans (Lenth, 2022) and buildmer (Voeten, 2022). Graphical plots were generated using ggplot2 (Wickham, 2016) and ggeffects (Lüdecke, 2018). Relevant data and analysis scripts can be accessed here: https://osf.io/bty8q/overview.
The following procedure was used: For each dependent variable, we determined an optimized model using the R package buildmer (Voeten, 2022). We first built a linear mixed-effects model that converged while using the maximal set of predictors, interactions and random effects with slopes and intercepts. In a second step, this model was reduced in a stepwise fashion until the least complex model was reached that did not compromise model accuracy (as determined by a significant change in model log-likelihood). As fixed factors, we used (1) manual visibility, (2) importance and (3) unpredictability. As random factors, we used (1) participant and (2) item (i.e., the numbers ‘1–9’). The models thus optimized were subsequently analyzed and are reported below. In cases of significant interactions, the main effects were further compared post hoc for the different factor levels involved in the interactions using emmeans. If the random factors did not remain in the optimal models, we calculated and reported linear models.
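As an illustration of this procedure, a minimal sketch for one dependent variable is given below. The data frame and predictor codings are assumptions (the actual scripts are available via the OSF link above); the buildmer, lme4 and emmeans calls follow the packages’ documented interfaces.

```r
library(buildmer)  # stepwise reduction of a maximal (generalized) linear mixed model
library(lme4)
library(emmeans)

dat <- read.csv("moves.csv")  # hypothetical data frame with one row per verbalized move

# Start from the maximal converging model (all fixed effects, interactions,
# random intercepts and slopes) and let buildmer reduce it stepwise, keeping
# only terms whose removal would significantly change the log-likelihood.
m_opt <- buildmer(
  log(duration_ms) ~ ManualVis * Importance * Unpredictability +
    (1 + ManualVis * Importance * Unpredictability | Participant) + (1 | Item),
  data = dat
)
final <- m_opt@model   # the reduced lmer/lm model object
summary(final)

# Post hoc comparisons for significant interactions, e.g., the effect of
# importance within each level of unpredictability:
emmeans(final, pairwise ~ Importance | Unpredictability)
```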
In one case, in which the residuals were nonnormally distributed and the interval-scaled dependent variable was bimodally distributed, we recoded the dependent variable into a categorical variable with two levels. This was achieved by calculating a local minimum within the bimodal distribution and using it to split the data. We then reanalyzed the data with the help of a generalized linear mixed-effects model, otherwise following the same procedure as described above.
The next sections describe the different analyses thus performed.
2.7. Influence of information structure and visibility on prosodic shape: a sanity check
In order to ensure that the task design indeed modified the information structure as intended, we checked for well-known effects of information structure on established acoustic–phonetic correlates of prosodic prominence on the target verbalizations (duration [ms], average f0 [st rel to 1 Hz], pitch range [st], RMS intensity [dB]), which were entered as the dependent variables in our models. For all dependent variables, a first set of derived models yielded nonnormally distributed residuals, as evidenced by a Q–Q plot. A manual inspection of our original data showed that these were likely to be caused by skewed data (positively skewed duration and positively and negatively skewed RMS intensity and mean pitch). We therefore applied log transformations to all dependent acoustic–prosodic variables prior to subsequent analyses, resulting in normally distributed residuals in the models.
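As a minimal illustration of this check (with the same hypothetical data frame and predictor names as in the sketch above), one could inspect the residuals of a model fitted to the raw measure and then refit it on the log-transformed values:

```r
library(lme4)

# Model on the raw duration values: the Q-Q plot of the residuals reveals skew.
m_raw <- lmer(duration_ms ~ ManualVis + Importance + Unpredictability +
                (1 | Participant) + (1 | Item), data = dat)
qqnorm(resid(m_raw)); qqline(resid(m_raw))

# Log-transform the skewed dependent variable and refit; the residuals of the
# refitted model should now be approximately normally distributed.
dat$duration_log <- log(dat$duration_ms)
m_log <- lmer(duration_log ~ ManualVis + Importance + Unpredictability +
                (1 | Participant) + (1 | Item), data = dat)
qqnorm(resid(m_log)); qqline(resid(m_log))
```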
Given that German and American English are both Germanic intonation languages that use prosodic prominence to mark information structure, we particularly expected longer verbalizations and potentially slightly higher pitch accent peaks in unpredictable game moves, and louder verbalizations of important target moves, in line with previous work on American English (Watson et al., 2008; Watson, Arnold, & Tanenhaus, 2010). It is well known that the shape of the intonation contour is used to differentiate levels of information structure and prominence in German (Baumann et al., 2015; Baumann & Röhr, 2015; Baumann & Winter, 2018). Since we did not analyze these shapes, we do not expect strong results for pitch features.
We also checked whether mutual visibility has any effect on prosodic expression. If this were the case, any subsequent findings on visibility influencing cross-modal synchronization might be partly explicable through its influence on prosodic prominence. Here, we expected a lack of mutual visibility of the hands to potentially enhance prosodic prominence, that is, to lead to louder and longer productions with higher pitch and larger pitch excursions, compensating for the lack of visibly accessible information.
2.8. Cross-modal synchronization as a function of information structure and visibility
In order to test whether information structure or mutual visibility had an influence on cross-modal synchronization, we built linear mixed-effects models with the two measures of cross-modal distance described in Figure 4 as dependent variables.
In line with our main hypotheses, we expect less cross-modal distance in the case of mutual visibility of the hands (H1) or information-structural relevance (unpredictable, important) of the game move (H2) for both of our distance measurements. A stronger cross-modal synchronization should therefore lead to distance values closer to 0.
First models revealed that the residuals for distPEAK were not normally distributed. Also, density plots showed a strong bimodal distribution of distPEAK (cf. Figure 5). Using the R package ggpmisc (Aphalo, 2022), we hence determined the local minimum between the two visual maxima in the distribution at −119 ms. Based on this, we split the data into a binary variable distPEAK(Bin), representing either pitch-peak-preceding movements (with the manual actions occurring around a local maximum of −242 ms relative to the corresponding pitch peak) or pitch-peak-synchronized movements (with manual actions occurring around a local maximum of −28 ms relative to the corresponding pitch peak). The resulting categorical variable with two levels (preceding, synchronized) was entered as a dependent variable into a generalized linear mixed model.

Figure 5. Density plot of distances between pitch accent peaks (‘0’) and the timing of corresponding manual game board contacts. The data show a strong bimodal distribution, with local maxima occurring 241 and 28 ms before the corresponding pitch peak position (dashed lines). The data are split into pitch peak preceding and pitch peak synchronized co-speech movements using the local minimum (solid line) between both maxima (preceding the corresponding pitch peak by 119 ms).
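A minimal sketch of this split and the subsequent binomial model is given below; it locates the local minimum with base R’s density() and which.min() rather than ggpmisc, and the data frame and column names are again assumptions carried over from the earlier sketches.

```r
library(lme4)

# Estimate the density of distPEAK and locate the local minimum between the
# two modes (the analysis in the text places it at about -119 ms).
d <- density(moves_peak$distPEAK, na.rm = TRUE)
in_range <- d$x > -242 & d$x < -28          # search window between the two reported modes
split_point <- d$x[in_range][which.min(d$y[in_range])]

# Binary recoding: manual contact well before the pitch peak vs. roughly
# synchronized with it.
moves_peak$distPEAK_bin <- factor(
  ifelse(moves_peak$distPEAK < split_point, "preceding", "synchronized")
)

# Generalized linear mixed model (binomial), analogous to the linear models above.
m_bin <- glmer(distPEAK_bin ~ Importance * Unpredictability +
                 (1 | Item) + (1 | Participant),
               data = moves_peak, family = binomial)
summary(m_bin)
```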
3. Results
3.1. Results on the influence of visibility and information structure on prosodic realizations
Note that all models report effects on the log-transformed dependent variables. For the prediction of the duration of the verbalized target moves, the following optimal model emerged:
$$ \mathrm{Duration}\sim 1+\mathrm{Unpredictability}+\mathrm{ManualVis}+\mathrm{Importance}+\left(1|\mathrm{Participant}\right)+\left(1|\mathrm{Item}\right) $$
The resulting model shows a main lengthening effect of unpredictability ($\beta = 0.06$, $se = 0.01$, $t = 5.4$, $p < 0.001$) and importance ($\beta = 0.03$, $se = 0.01$, $t = 2.45$, $p = 0.014$) as well as a main shortening effect of manual visibility on word duration ($\beta = -0.03$, $se = 0.01$, $t = -2.34$, $p = 0.02$).
For the prediction of the mean pitch of the verbalized target moves, the optimal model did not retain any random effects. The resulting linear model yielded a significant effect of importance, with unimportant words showing a lower average pitch ($\beta = -0.006$, $se = 0.003$, $t = -2.03$, $p = 0.04$).
For the prediction of the pitch range of the verbalized target moves, the following optimal model emerged:
$$ \mathrm{PitchRange}\sim 1+\mathrm{Unpredictability}+\mathrm{Importance}+\mathrm{ManualVis}+\left(1|\mathrm{Participant}\right)+\left(1|\mathrm{Item}\right) $$
The resulting model showed an increase in pitch range in unpredictable ($\beta = 0.16$, $se = 0.04$, $t = 4.65$, $p < 0.001$) or important ($\beta = 0.13$, $se = 0.04$, $t = 3.86$, $p < 0.001$) words and a decrease in pitch range if interlocutors could see each other’s hands ($\beta = -0.07$, $se = 0.03$, $t = -1.97$, $p = 0.05$).
For the prediction of RMS intensity, the following optimal model emerged:
$$ \mathrm{RMS}\sim 1+\mathrm{ManualVis}+\mathrm{Importance}+\mathrm{Unpredictability}+\left(1+\mathrm{ManualVis}+\mathrm{Unpredictability}+\mathrm{Importance}|\mathrm{Participant}\right) $$
The random effects point to a complex interaction of predictors and individual participants’ behaviors. The model summary shows that if speakers can see each other’s hands, they overall tend to speak less loudly ($\beta = -0.07$, $se = 0.02$, $t = -3.66$, $p < 0.001$). A manual inspection of our data showed that this trend is visible for 25 out of 39 participants.
Summing up, our analysis of the influence of information structure and manual visibility revealed the following:
• Unpredictable verbalized game moves are comparatively longer and have a larger pitch excursion.
• Important verbalized game moves are comparatively longer and show a higher average pitch and a larger pitch excursion.
• If speakers can see each other’s game moves, verbalized game moves are shorter, show less pitch excursion and are produced less loudly by the majority of speakers.
These results clearly show an effect of our information structure manipulation on the participants’ prosodic realizations in the expected directions, with important or unpredictable words being produced in a way that is likely to increase their perceptual prominence. The results differ from previous results on American English (Watson et al., 2008, 2010), which revealed an acoustic–phonetic cue specialization for different dimensions of information structure (unpredictable accents are longer and important ones are louder). Since we did not carry out an in-depth analysis of the pitch contours, we refrain from further interpretations in connection with information structure-related aspects.
Our results also show that manual visibility influences prosodic realizations. If interlocutors can see each other’s game moves, the – also visibly given – verbal information is realized less prominently across three dimensions of acoustic–prosodic prominence expression: pitch excursion, duration and RMS intensity.
3.2. Results on the influence of visibility and information structure on cross-modal distances
Since our distPEAK measure depends on the detection of a pitch accent peak, distances could only be determined in those cases where a pitch contour (and a peak) was detectable. This reduced the analyzed data in comparison with the prosodic analyses in Section 2.7. Generally, manual game moves clearly precede the pitch accent almost twice as often ($n = 1035$) as they are synchronized with it ($n = 576$). However, there is an overall tendency for manual game move actions to be realized in temporal proximity, within a window of 500 ms before their corresponding pitch peak (cf. Figure 5).
For the prediction of the distance between pitch accent peaks and corresponding manual actions (distPEAK), the optimal resulting model is
$$ \mathrm{distPEAK}\left(\mathrm{Bin}\right)\sim 1+\mathrm{Importance}+\mathrm{Unpredictability}+\mathrm{Importance}:\mathrm{Unpredictability}+\left(1|\mathrm{Item}\right)+\left(1|\mathrm{Participant}\right) $$
The fixed factors identify a main effect of unpredictability ($\beta = 0.65$, $se = 0.20$, $z = 3.22$, $p < 0.001$) and a significant interaction between unpredictability and importance ($\beta = -0.77$, $se = 0.27$, $z = -2.89$, $p = 0.004$). A post hoc pairwise comparison revealed that the main effect of unpredictability is exclusive to the context of important game moves. In those, unpredictable game moves have a higher probability of being realized late ($\beta = -0.72$, $se = 0.19$, $z = -3.72$, $p < 0.001$); that is, they have a high chance of being realized almost simultaneously with the corresponding pitch accent peak (cf. Figure 6). Manual visibility conditions did not have any effects.

Figure 6. Interaction plot showing the predicted probabilities (marginal means and 95% confidence intervals) of manual game moves to be closely aligned with their corresponding pitch accent peaks (i.e., synchronized distPEAK events). If a game move is important and unpredictable, this probability increases.
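Predicted probabilities of the kind shown in Figure 6 can be obtained with ggeffects, which the paper lists among its plotting packages; the model object and term names below are the assumptions carried over from the earlier glmer sketch.

```r
library(ggeffects)

# Marginal predicted probabilities of a 'synchronized' realization for each
# combination of unpredictability and importance, with 95% confidence intervals.
pred <- ggpredict(m_bin, terms = c("Unpredictability", "Importance"))
pred        # tabular output of marginal means and CIs
plot(pred)  # interaction plot comparable to Figure 6
```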
For the prediction of the distance between the beginnings of verbalizations and corresponding manual actions (distWORDINIT), the optimal resulting model is
$$ \mathrm{distWORDINIT}\sim 1+\mathrm{Unpredictability}+\mathrm{Importance}+\mathrm{Unpredictability}:\mathrm{Importance}+\mathrm{ManualVisi}+\left(1+\mathrm{ManualVisi}|\mathrm{Participant}\right)+\left(1|\mathrm{Item}\right) $$
The fixed-effects structure points to unimportant manual game moves being realized on average 384 ms earlier than important ones ($\beta = -384$, $se = 52.42$, $t = -7.36$, $p < 0.001$). If hands are visible, manual game moves occur significantly earlier ($\beta = -122$, $se = 50.4$, $t = 2.43$, $p = 0.021$). As the model also retained a by-participant random slope for manual visibility, this finding may be modulated to some degree by individual differences. Since the model also revealed a significant interaction between unpredictability and importance ($\beta = -403$, $se = 75.9$, $t = 5.32$, $p < 0.001$), a post hoc pairwise comparison was carried out for predictable and unpredictable contexts separately. This indicated that the effect of importance is restricted to predictable contexts, where unimportant and predictable co-speech actions are realized considerably earlier ($\beta = -384$, $se = 52.3$, $t = 7.34$, $p < 0.001$). In other words, game moves of little information-structural relevance (predictable and unimportant) are realized very early, and others later; game moves are also realized slightly earlier when they can be seen. All effects are illustrated in the interaction plot (cf. Figure 7), which also shows that the model generally predicts game moves to occur after the corresponding word has started being pronounced.

Figure 7. Interaction plot showing the predicted timing (marginal means and 95% confidence intervals) of manual game board contact relative to the corresponding initializations of verbalized game moves (distWORDINIT) in milliseconds. Predictable and unimportant moves are realized considerably earlier than important or unpredictable game moves. Manual game moves are generally realized earlier when interlocutors can see each other’s hands (ManualVisi).
4. Discussion
In this study, we investigated whether speakers perform co-speech actions in tight synchrony with verbalizations that co-express the content of the performed action (as a game move). We were particularly interested in the question of how stable the link between verbal and manual actions is under different circumstances. Here, we tested whether the degree of cross-modal synchrony depends on different levels of prosodic highlighting (due to information structural needs) and whether it is stable across different conditions of mutual visibility.
We argued that if prosody and co-speech actions are indeed strongly linked in their execution, their temporal coordination should remain fairly stable, regardless of variations in mutual visibility. Moreover, if changes in the level of prosodic prominence strengthen cross-modal synchronization, this would be evidence that prosodic prominence provides a flexible cross-modal anchor for co-speech movements, which can be strengthened or loosened depending on underlying linguistic functions such as information structure.
Our results indeed provide evidence for a strong cross-modal synchronization of prosody and co-speech actions: Across conditions, manual actions are performed in close proximity to the pitch accent peaks of the co-expressive verbal content, with manual actions typically preceding the prosodic events. At the same time, speakers showed a strong tendency to perform their manual action after they had started producing the corresponding word. Thus, even though our speakers were not explicitly instructed to coordinate their verbal and manual actions, they did – corroborating earlier findings by Breckinridge Church et al. (2014). The fact that our data show a particularly strong cross-modal synchronization may be caused by our study design, in which participants predominantly verbalized their game move targets with nuclear pitch accents, even when these were contextually given or unimportant. Also, the bimodal task did not allow participants the freedom to choose between a verbalization of a game move, a manual performance of a game move on the shared game board or both; they only had the freedom to synchronize their bimodal actions or not. Therefore, while we see that prosody acts as a strong temporal anchor for co-speech actions, we cannot shed further light on the question of whether prosodic prominence actually elicits synchronized co-speech actions or other movements that lend themselves to being interpreted as co-speech gestures.
Interestingly, for one of our distance measures (distPEAK), our data revealed a bimodal distribution within the cross-modal synchronization, with the majority of co-speech actions showing the typical ‘gesture lead’ effect (cf. Section 1). This temporal precedence of co-speech actions can be neutralized in favor of an almost perfect cross-modal synchronization of co-speech actions and pitch accent peaks, which preferably occurs in contexts where information structure licenses a particularly high prosodic prominence (unpredictable and important); this supports our H2. However, contrary to our hypothesis (H1), mutual visibility did not contribute to this cross-modal synchronization of pitch accent peaks and manual game moves. As the synchronization of pitch accent peaks and co-speech actions is largely independent of mutual visibility, it does not appear to be modified strategically only in those situations where it may serve a communicative benefit. Still, we consider it a safe assumption that many of our co-verbal actions are realized in a way that contributes to their being perceived as prosodically highlighted, similar to a ‘beat gesture,’ even if this highlighting may not be seen.
However, manual visibility did influence the timing of co-speech actions in relation to a different anchor point, namely the word onsets of co-expressive verbalizations (distWORDINIT). This result is more difficult to interpret in light of our hypotheses. Manual game moves were realized earlier, and closer to word onsets, if interlocutors could see each other’s hands. They were produced even earlier in cases where the information-structural relevance was low (predictable and unimportant) and hence the prosodic prominence of a verbalized game move was probably also low. An obvious interpretation of this finding would be that the effect is caused by a decrease in prosodic prominence on the verbalizations, as both manual visibility and a lack of informativeness led to a decrease in acoustic–prosodic prominence (cf. Section 2.7). However, if a lack of prosodic highlighting merely caused a temporal desynchronization of the two modalities, an expected result would be more overall variation in cross-modal coordination (Leonard & Cummins, 2010). Instead, we find a preference for a particular order, with manual actions being realized earlier than otherwise. We therefore argue that mutual visibility strengthens another anchor in cross-modal coordination, namely word onsets. The early realization of manual game moves leads to an early disambiguation of a co-expressive verbalization and reduces the communicative burden on the verbal channel (by making it redundant), which makes sense if listeners can actually see the game move. This interpretation is in line with the findings by Bergmann et al. (2011) on co-speech gestures, who found a stronger synchronization of gestures and word onsets if both channels were semantically congruent – as is the case in our design. We also know from research on the interpretation of co-speech actions that similar synchronizations increase the probability of co-speech actions being interpreted as representational co-speech gestures (Novack et al., 2016). We argue that these earlier produced co-speech actions fulfill a predominantly communicative function, by conveying that a certain game move is being realized, and have much in common with ‘representational gestures,’ as they probably increase communicative robustness (de Ruiter, Bangerter, & Dings, 2012).
However, it remains unclear why co-speech actions are realized substantially earlier if the verbal message is predictable and unimportant. Again, we would have expected to see that these contexts mostly lead to more overall variation within the cross-modal synchronization. At this point, we can only speculate that a weak prosodic prominence also weakens the ability of pitch accent peaks to serve as a cross-modal anchor, and word onsets serve as the alternative cross-modal anchor under such circumstances.
Alternatively, it is possible that individual pitch accent types also show a different distribution with respect to their cross-modal synchronization. We know that prosodic prominence is to a certain degree linked to different pitch accent shapes in German (Baumann & Röhr, 2015; Baumann & Winter, 2018), and individual pitch accents show different degrees of shape-related variation (Schweitzer, 2012). It could be the case that a prominence-related lack of stability with respect to pitch peak placement may also weaken the cross-modal synchronization. A closer look at the distributions of pitch accent types and cross-modal synchronization is warranted, but as our study is limited in this regard, we refrain from any further speculation at this point.
We also suspect that the result may be influenced by our study design, in which speakers started playing their initial, unimportant and predictable game moves right after they had been informed by the experimenter where to begin. Impressionistically, this game-opening phase constituted a part of the dialogue that involved less interaction between the interlocutors and was rather an interaction between the experimenter and the initial player. Follow-up studies are therefore necessary to investigate whether this finding really generalizes to other communicative situations. Obviously, this caveat applies to all of our results, as our data constitute highly controlled, semi-spontaneous material, in which speakers were ‘forced’ into bimodal communicative actions in a very limited domain.
Summing up, we find that our study provides support for the assumption that pitch accent peaks serve as a strong anchor for co-speech actions – and not just co-speech gestures – and that a high degree of prosodic prominence increases the strength of this anchoring. That way, cross-modal synchronization serves prosodic highlighting and allows for the production of co-speech actions that fulfill similar communicative functions as ‘beat gestures.’ Interestingly, this cross-modal synchronization appears to be largely independent of whether or not it actually benefits communication (as under mutual visibility), which supports the view of a strong and stable, but not strategically modified link between prosody and co-speech actions.
However, we see that when co-speech actions communicate meaning, for example, by visibly performed game moves, we find a stronger synchronization of these actions and word onsets in the speech stream. It is likely that these co-speech actions are strategically realized early, to help an early message disambiguation and to contribute to the overall robustness of communication, fulfilling a function similar to ‘representational gestures.’
The exact interplay between various cross-modal anchors and co-speech actions in the speech stream needs further exploration, in less controlled settings, with a more varied prosody, and more freedom of speakers to decide whether they want to move while speaking or not.
Data availability statement
The anonymous data extracted from the audio and video recordings described above (acoustic–prosodic features, cross-modal distances, visibility conditions, information structure, item and participant IDs) are accessible via the following link, together with the analysis scripts: https://osf.io/bty8q/overview.
Acknowledgements
I wish to thank Nataliya Bryhadyr and Marin Schröer for their help with the data recordings and annotations used in the presented analyses and the Special Issue Editors as well as two anonymous reviewers for their invaluable feedback and comments on earlier versions of this paper.
Funding statement
This research has been partially funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) – CRC-1646, project number 512393437 (A03), and TRR 318/1 2021–438445824 (A02).




