Zhuang (ISO 639-3: zha) is a group of languages belonging to the Tai language family (Diller, Edmondson & Luo 2008: 7), spoken by the Zhuang people, who form the largest minority group in China with a population of approximately 17 million. Most Zhuang speakers live in Guangxi Zhuang Autonomous Region, where they account for over 14.4 million permanent residents. It is estimated that more than 20 million people speak a variety of Zhuang (Wei, Qin & Wei 2009: 7), including some other ethnic minorities, such as the Yao and Maonan, who live in the same regions as the Zhuang. A small number of Zhuang speakers inhabit regions in provinces adjacent to Guangxi, such as Wenshan (Yunnan province) and Lianshan, in the northwest of Guangdong province (see the map in Figure 1).
Zhuang is conventionally divided into two branches, Northern Zhuang and Southern Zhuang, on ethnographic rather than strictly linguistic grounds, with the Youjiang River (右江) as an approximate demarcation line. That is, Northern Zhuang is located to the north of the river, which also means it is spoken to the north of Nanning, the capital of Guangxi, while Southern Zhuang dialects are located on the south side. The two branches of Zhuang belong to different branches of the Tai language family; see Figure 2, which is based on Diller et al. (2008). More specifically, Northern Zhuang includes eight dialect areas: Hongshui He dialect (红水河土语, the dialect considered in the present Illustration, ISO 639-3: zch), Guibian dialect (桂边土语), Guibei dialect (桂北土语), Lianshan dialect (连山土语), Qiubei dialect (丘北土语), Youjiang dialect (右江土语), Yongbei dialect (邕北土语), and Liujiang dialect (柳江土语). Southern Zhuang consists of five dialect regions: Zuojiang dialect (左江土语), Yongnan dialect (邕南土语), Wenma dialect (文马土语), Yanguang dialect (砚广土语), and Dejing dialect (德靖土语) (for the specific areas, see Figure 3). Northern Zhuang accounts for around two thirds of the entire Zhuang population (Wei & Qin 2006). Standard Zhuang is based on the Shuangqiao (双桥) dialect (itself part of Yongbei dialect; see Figure 2), spoken in Wuming, which belongs to the Northern group. Yang (2017) presents a phonetic description of Standard Zhuang.

Figure 1. The distribution of Zhuang languages (Castro, Hansen & Guangxi Zhuang Autonomous Region Minority Language Commission 2010).

Figure 2. Language genealogy of Zhuang.

Figure 3. The distribution of Zhuang languages (CASS Institute of Linguistics, CASS Institute of Ethnology and Anthropology & City University Language Information Science Research Center 2012).
Modern Zhuang has a writing system based on the Roman alphabet, which was initially devised by Chinese linguists after the founding of the People’s Republic of China. The complete Zhuang Language System was officially approved by the State Council in 1957, which marked the first time that Zhuang had a uniform writing system (for ancient Zhuang characters, see Bauer (2000)). After small revisions in 1982, the Zhuang Language System was widely promoted in Guangxi Zhuang Autonomous Region, and Zhuang dictionaries (e.g. Luo et al. 2005) also use this writing system (Zhang et al. 1999: 429–430). Literacy rates in written Zhuang are low, however.
Previous studies on the Zhuang dialects in Du’an County include a detailed morphological description and systematic illustrations of the phonetic characteristics of the sounds, tones, and rhythm of the Dongmiao (东庙) dialect (Li 2011). Earlier work on the implosives of this dialect was reported by Zao (1997).
The language consultant for the present Illustration is a 26-year-old female native speaker of Hongshui He Zhuang (Glottolog: east2363; Baima dialect), who is an elementary school teacher in Shenzhen, China. Zhuang is her first language. She was born and raised in Baima Town (part of Hechi City, Guangxi Zhuang Autonomous Region), using Zhuang to communicate with her family members and friends in her hometown on a daily basis. She has a positive attitude towards Zhuang and wrote her MA thesis (Gan 2021) about ideophones in Zhuang. When she speaks to her family and when she meets classmates from her hometown, she uses either Zhuang, Southwestern Mandarin or Standard Chinese (Putonghua).
The recordings were made in a sound-attenuated booth at Shenzhen University, using a Zoom H4n portable recorder with an AKG C520L head-mounted microphone at a sampling rate of 44.1 kHz. All recordings were manually segmented and transcribed. Acoustic measurements were obtained through Praat (Boersma & Weenink 2024), while R (R Core Development Team 2023 [v4.2.3]) was used for statistical analysis and data visualization. The phonological representations for all words in this Illustration are based on the description of Du’an in Zhang et al. (1999), with adjustments to reflect the phonology of Baima.
Consonants


As the consonant chart shows, the Baima dialect has 17 single consonants.
Plosives
Plosives in the Baima dialect occur at three places of articulation: bilabial, alveolar and velar. There is a two-way voicing contrast between bilabial and alveolar plosives, viz. /p/ vs. /b/ and /t/ vs. /d/. Figure 4 displays a boxplot of the voice onset time (VOT) for each of the five plosives. The black lines indicate that the medians for all five plosives fall between 0 and 30 milliseconds, characteristic of unaspirated voiceless plosives. However, the interquartile ranges for /b/ and /d/ are considerably larger than those of the other plosives, with the majority extending below zero. As illustrated in Figure 5, there are two distinct types of phonetic realizations for /b/ and /d/. In Figures 5a and 5b, the amplitude of voicing gradually increases during the stop closure towards the stop release, a characteristic of canonical implosives (Ladefoged & Maddieson 1996). In contrast, the /b/ token in Figure 5c shows no prevoicing and exhibits a similarly short VOT to the /t/ in Figure 5d and is therefore considered a voiceless egressive.

Figure 4. Boxplot for VOT of each plosive phoneme.

Figure 5. Examples of different stop types for Baima Zhuang with broad, phonemic transcription. (Words illustrated: (a): /daːt3/ ‘hot’ (T7 long), (b): /buːn33/ ‘bed’ (T5), (c): /baː23/ ‘pack basket’ (T4), (d): /teː41/ ‘he, she, it’ (T1)).
Table 1. Mean (standard deviation) VOT (ms) for implosive and egressive tokens for /b/ and /d/

To further examine the phonetic quality of /b/ and /d/, all tokens were auditorily coded as either egressive or implosive. Table 1 shows the mean and standard deviation in VOT (ms) for egressive and implosive tokens for both phonemes. The overall pattern is rather consistent between /b/ and /d/: the egressive tokens have short (< 30 ms) positive VOTs, whereas the implosive tokens have longer, negative VOTs.
For other members of the Tai language family, it has been documented (Zhang & Wei 1991) that implosive realizations of plosives often occur exclusively in odd-numbered tones. This pattern is largely borne out in our data as well, with more consistency for /d/ than /b/, as illustrated in Figure 6. Five of the eight words exhibiting voiceless /b/ production in the odd-numbered tone context also show the expected implosive production in other tokens, suggesting that the speaker retains the implosive variant in her repertoire. In addition, words with /p/ and /t/ onsets are predominantly found in odd-numbered tones, with 91.9% for /p/ and 73.5% for /t/. Consequently, the tonal cue serves to distinguish between potentially ambiguous voiced and voiceless phonemes with short positive VOT.
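The per-category summary reported in Table 1 can be sketched as follows. This is a minimal illustration only: the token values are invented for the example and are not the measurements behind Table 1, and the function name `summarize_vot` is hypothetical.

```python
from statistics import mean, stdev

# Hypothetical (category, VOT in ms) pairs for one phoneme.
# These values are made up for illustration; they are NOT the data in Table 1.
tokens = [
    ("implosive", -85.0),
    ("implosive", -60.0),
    ("implosive", -72.0),
    ("egressive", 12.0),
    ("egressive", 18.0),
    ("egressive", 9.0),
]

def summarize_vot(tokens):
    """Return {category: (mean VOT, sample SD, token count)}."""
    by_category = {}
    for category, vot_ms in tokens:
        by_category.setdefault(category, []).append(vot_ms)
    return {
        category: (mean(values), stdev(values), len(values))
        for category, values in by_category.items()
    }

summary = summarize_vot(tokens)
for category, (m, sd, n) in summary.items():
    print(f"{category}: mean={m:.1f} ms, sd={sd:.1f} ms, n={n}")
```

Negative values stand for prevoicing (voicing that starts before the stop release), so a negative mean for the implosive category and a short positive mean for the egressive category would match the pattern described above.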

Figure 6. Airstream mechanism for /b/ and /d/ across tone categories.
Nasals
Zhuang nasals occur in four places of articulation and are always voiced. The nasals /m n ŋ/ can be used in word-initial position or occur after the vowels /a e i o u ɯ/ in word-final position, where they typically nasalize the preceding vowels. The contrast between initial and final nasals is illustrated in (1):

The palatal nasal [ɲ] only occurs initially, as shown in (2) below. We tentatively analyze this as derived from an onset cluster /nj/ (without making claims as to any historical development involved here), which explains why it does not occur in final position, unlike the other nasals, since in final position clusters are not permitted.

Figure 7. Averaged spectra for each phoneme (with number of tokens), high-pass filtered at 300 Hz and low-pass filtered at 16,000 Hz. Gray areas present 95% confidence intervals.

Fricatives and affricates
Across Zhuang languages, the ‘s’ sound exhibits regional variation, including variants such as [θ], [s], [ɬ], and [ɹ] (Guangxi Language Commission Research Office 1994; Zhang et al. 1999). In the variety presented here, it is realized as [θ].
We present averaged spectra for all fricative phonemes and the frication portion of the affricate /tɕ/ in Figure 7. The pattern here is consistent with previous research regarding the distinction between non-sibilant and sibilant obstruents (Shadle et al. 2023). In particular, both sibilants demonstrate a clear spectral peak in the higher frequencies. Note that what we transcribe as alveolo-palatal in both /ɕ/ and /tɕ/ often sounds more like retracted [s̠] and [t̠s̠]. We nevertheless opt to use the alveolo-palatal symbols for typographical convenience and because some tokens do retain the palatalized quality typically associated with alveolo-palatals.
Approximants
Zhuang has three approximants: /j/, /w/, and /ɹ/, all of which can function as onsets. The first two approximants also serve as glides linking consonants and vowels, as discussed below.
As an onset, /w/ exhibits variation between labial-velar [w] and labiodental [ʋ]. A key distinction between these two variants is that [w] involves lip rounding, whereas [ʋ] does not; this rounding results in a slightly lower F2 for [w] compared to [ʋ] (Reetz & Jongman 2011). Figure 8 displays the spectrograms of the two variants in two productions of the same word. At the temporal midpoint, the F2 for [w] (left panel) is 672 Hz, while that for [ʋ] (right panel) is 1033 Hz.

Figure 8. Two different phonetic realizations of /wei24/ ‘seat’: (a) [wei24], (b) [ʋei24].
The ‘r’ sound exhibits notable regional variation in Northern Zhuang (Guangxi Language Commission Research Office 1994; Qin 1996; Zhang et al. 1999). As shown in Figure 9, in the Hongshui He (Baima) dialect, this phoneme displays a clear formant structure without frication, characterized by a very low F3, consistent with the acoustic features of the alveolar approximant [ɹ] (Ladefoged 1993). Discussion of other variants in Northern and Southern Zhuang is presented in Luo (2008: 321).

Figure 9. Examples of the /ɹ/ onset in Hongshui He (Baima) Zhuang. (Words illustrated: (a): /ɹaː231/ ‘sesame’, (b): /ɹuː231/ ‘boat’, (c): /ɹoːi41/ ‘to comb’).
Consonant–glide combinations
Zhuang has a small number of consonant–glide combinations, historically derived from consonant clusters (Li 1954), which are preserved in some dialects as pl-, ml-, kl- (e.g. in Binyang and Hengxian (Northern Zhuang)) or as pɣ-, mɣ-, kɣ- (e.g. in Laibin (Northern Zhuang) and Luzhai) (see Li 1994). In some areas, the original clusters have become completely simplified as p-, m-, k-, especially for younger speakers (see Liang 1987). Our speaker’s dialect possesses at least three such consonant–glide combinations, presented in (3), all of which only occur initially:

As noted above, the palatal nasal can be analyzed as a nasal plus /j/ cluster, since its distribution is different from the other nasals.
Apart from these examples, a glide /j/ can also be detected after the alveolo-palatal affricate /tɕ/ before the vowels /a u/. As shown in Figure 10, the formants at the consonant–vowel transition are approximately 300–400 Hz for F1 and 2,200–2,500 Hz for F2. These values suggest a high, front tongue position, which contrasts with the expected formant characteristics for the vowels /a/ and /u/. It is also notable that the transition from /j/ to /ai/ is less pronounced than for /aːu/ and /aː/. This difference can be attributed to the fronted and raised phonetic realization of /ai/, which is elaborated upon below.

Figure 10. Example words with /tɕ/ onset. (Words illustrated: (a) /tɕjaːu41/ ‘spider’, (b) /tɕjaː41/ ‘frozen stiff’, (c) /tɕjuː41/ ‘salt’, (d) /tɕjai231/ ‘love’).
Vowels
Monophthongs

Note that the symbol /a/ is used for an open, central vowel.

Zhuang has six monophthongs /i e a u ɯ o/. In addition, the back unrounded vowel /ɤ/ mainly occurs in words borrowed from Mandarin Chinese and does not appear in closed syllables; it can only be short. In a few regions, /ɤ/ varies with /ɯ/, e.g. in Rong’an and Sanjiang (Zhang et al. 1999), and there is also other cross-dialectal variation. The other five vowels show a phonemic length contrast. Vowel positions of all vowels (all available tokens) are plotted in Figure 11.

Figure 11. Scatterplot of F1 by F2 for monophthongs. Phoneme labels indicate the positions of mean F1 and F2 values for each vowel and the ellipses represent one standard deviation. See Table 2 for the number of tokens used.
Open syllables with a monophthong show no length contrast, but are phonetically long, as indicated in the examples above. Since Zhuang has a phonemic vowel length contrast in closed syllables, we assume that open syllables have long vowels phonemically. Examples of the length contrast in closed syllables will be presented below.
Table 2 illustrates the average length of vowels in open syllables, long vowels in closed syllables, and short vowels in closed syllables. Note that long vowels in closed syllables are roughly twice as long as short vowels in the same position, except for /e/ and /ɯ/, where fewer tokens are available. As noted in Table 2, the phoneme /ɤ/, probably a loan from Mandarin Chinese, does not appear in closed syllables.
Table 2. Mean (standard deviation) for duration (ms) by syllable structure and vowel length, with number of tokens

We also measured F1 and F2 in all vowels as a function of syllable position (open syllable, short in closed syllable, long in closed syllable): see Table 3. The results indicate that short vowels are more centralized than long vowels in this context. This pattern is further illustrated in Figure 12, which shows that the vowel space for long vowels in closed syllables (dashed line) is considerably larger than that for short vowels in closed syllables (dotted line).
Table 3. Mean (standard deviation) for F1 and F2 values (Hz) by syllable structure and vowel length


Figure 12. Mean F1 and F2 values for each vowel by syllable structure and vowel length.
Diphthongs
Hongshui He (Baima) Zhuang has 11 diphthongs (see Figure 13), all of which have offglides [i] or [u]. Examples are presented below:


As in many Zhuang varieties, the length contrast in diphthongs is preserved phonemically in the Baima dialect only when the first element is /a/ (Zhang et al. 1999). Table 4 presents the results of a series of acoustic comparisons for the two pairs /aːi/ vs. /ai/ and /aːu/ vs. /au/. Vector length was calculated as the Euclidean distance between the 20% and 80% temporal points, while trajectory length was the cumulative Euclidean distance over four vowel sections, namely 20–35%, 35–50%, 50–65%, and 65–80%. Both measures capture the amount of change from the beginning to the end of the vowel trajectory (Fox & Jacewicz 2009). The various comparisons shown in Table 4 indicate that vowel trajectory, rather than duration, is the main acoustic correlate of the vowel length contrast.
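The two trajectory measures defined above can be sketched as follows. This is a minimal illustration that assumes each vowel is represented by five (F1, F2) pairs in Hz, sampled at the 20%, 35%, 50%, 65% and 80% points. The function names and the example formant values are hypothetical and are not taken from Table 4.

```python
import math

def vector_length(points):
    """Euclidean distance between the first (20%) and last (80%) sample."""
    return math.dist(points[0], points[-1])

def trajectory_length(points):
    """Sum of Euclidean distances over consecutive sections
    (20-35%, 35-50%, 50-65%, 65-80%)."""
    return sum(math.dist(a, b) for a, b in zip(points, points[1:]))

# Hypothetical (F1, F2) samples in Hz for one diphthong token;
# invented for illustration, not measured data.
track = [
    (800.0, 1400.0),  # 20%
    (770.0, 1500.0),  # 35%
    (700.0, 1700.0),  # 50%
    (600.0, 1900.0),  # 65%
    (500.0, 2100.0),  # 80%
]

print(f"vector length:     {vector_length(track):.1f} Hz")
print(f"trajectory length: {trajectory_length(track):.1f} Hz")
```

Trajectory length can never be smaller than vector length. The two are equal only when the formant track moves in a straight line through the F1 × F2 plane, so the difference between them reflects how curved the trajectory is.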
Vowels followed by a nasal
Both short and long vowels can be followed by a coda nasal /m/, /n/ or /ŋ/. Examples of syllables of this type are given below:

Table 4. Acoustic differences (mean (standard deviation)) between long and short diphthongs.


Figure 13. Formant trajectory of Hongshui He (Baima) Zhuang diphthongs in the F1 × F2 plane, measured at 20%, 35%, 50%, 65% and 80% portions of the vowels.
The vowel length contrast in the last pair of examples was illustrated above (Tables 2 and 3). See also Figure 14.

Figure 14. Vowel length contrast in Zhuang before nasals: taŋ41 ‘light’ (vowel duration: 11.9ms) vs. taːŋ41 ‘soup’ (vowel duration: 26.0ms).
Vowels followed by stops
In checked syllables, vowels (short or long) are followed by stops with no audible release, at bilabial, alveolar and velar place of articulation, as presented below:

Prosodic features
Zhuang is a tone language. Standard Zhuang has eight tones (Wei 2015: 33). The numeric values of tones presented below represent Chao numerals (Chao 1930). In the Hongshui He (Baima) dialect, there are also eight phonemic tones. All tone values are presented in the table below, with the pitch contours for the example words shown in Figure 15. Tones 1 to 6 are classified as smooth tones, whereas Tones 7 and 8 are checked tones, occurring exclusively with codas -p, -t, and -k. The two checked tones are further divided into two sub-categories depending on vowel length, long or short.


Figure 15. Plot of smooth tones (left panel) vs. checked tones (right panel) in non-normalized time for example tokens. Pitch contours are loess-smoothed.

Figure 16. Pitch contours (loess-smoothed) of the eight lexical tones of Hongshui He Zhuang (Baima dialect). Gray areas represent 95% confidence intervals (with number of tokens).
Figure 16 depicts the pitch contours in normalized time to allow for a better comparison of the contours. Only the vocalic portion of each word was used in the f0 measurements. Pitch tracks were exported in 10 time-normalized subsections, and the f0 values were then log-transformed and converted into a T measure according to the formula in (4), consistent with recent research on tone language contours (Zhu 2004; Shi & Wang 2006; Lin, Yao & Luo 2021). Here, f0min and f0max denote the minimum and maximum f0 values from the speaker’s production. Theoretically, the T measure can range from 0 to 5, aligning with Chao’s (1930) five-point tonal representation system.
T1 and T3 are both falling tones, with T1 starting at a higher pitch. However, closer examination of individual tokens reveals a secondary cue to the tonal contrast: the position of the pitch maximum. As illustrated in Figure 17, T1 reaches its maximum around the 30% normalized time point and maintains it until 40%, whereas T3 reaches its peak at 20% and declines shortly thereafter. T2, which lies in the low-mid register, is the only convex tone in this variety. T4 and T6 are both rising tones with a low start; the former ends in the mid-pitch range, whereas the latter reaches the high-pitch range. Additionally, the two tones differ in their trajectories. T4 exhibits a steady rise across the vocalic portion. T6, in contrast, remains low until approximately the 50% normalized time point, followed by a sharp rise in the second half. T5 is the only level tone among the smooth tones. Across Zhuang languages, it is common for checked tones to exhibit pitch contours similar to those of the corresponding smooth tones in the same odd/even categories (Zhang et al. 1999: 25). This pattern is also observed in the Baima dialect, where three of the four checked tones correspond to contours in the smooth tones: T7 long to T5, T8 short to T6, and T8 long to T4. Notably, T7 short is the only tone in the dialect that occurs solely in the high register.
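The T-measure conversion described above can be sketched as follows. Formula (4) is not reproduced in this text, so the sketch assumes the form commonly used in the cited literature, T = 5 × (log f0 − log f0min) / (log f0max − log f0min), which matches the description (log-transformed f0, speaker minimum and maximum, range 0 to 5). The function names and the example contour are hypothetical.

```python
import math

def t_value(f0_hz, f0_min_hz, f0_max_hz):
    """Map an f0 value (Hz) onto a 0-5 scale using log-transformed f0.

    Assumed form: T = 5 * (log f0 - log f0min) / (log f0max - log f0min).
    """
    span = math.log10(f0_max_hz) - math.log10(f0_min_hz)
    return 5.0 * (math.log10(f0_hz) - math.log10(f0_min_hz)) / span

def contour_to_t(contour_hz, f0_min_hz, f0_max_hz):
    """Convert a time-normalized f0 contour (list of Hz values) to T values."""
    return [t_value(f0, f0_min_hz, f0_max_hz) for f0 in contour_hz]

# Hypothetical 10-point falling contour and speaker f0 range (Hz);
# invented for illustration, not measured data.
contour = [260.0, 265.0, 262.0, 250.0, 235.0, 220.0, 205.0, 190.0, 178.0, 170.0]
t_contour = contour_to_t(contour, f0_min_hz=150.0, f0_max_hz=300.0)
print([round(t, 2) for t in t_contour])
```

Because the scale is logarithmic, the geometric mean of f0min and f0max maps to T = 2.5, the midpoint of the scale.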
Syllable structure
The syllable structure of Hongshui He Zhuang (Baima dialect) is (C)(G)V(ː)(C), where the final consonant is either a voiceless stop or a nasal. Consonant–glide combinations were presented above in (3). A vowel-initial word may be preceded by a glottal stop, which is not phonemic.
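The template (C)(G)V(ː)(C) can be sketched as a regular expression over broad transcriptions with tone numerals removed. This is a rough illustration under several assumptions: the onset set contains only the consonants mentioned in the text above (the full consonant chart is not reproduced here, so the set is incomplete), the offglides [i] and [u] of diphthongs are treated as an optional part of the nucleus, and the name `is_wellformed` is hypothetical. The pattern over-generates, since it does not encode restrictions on which vowels and glides actually combine.

```python
import re

# Onsets mentioned in the text only; an incomplete, assumed inventory.
ONSET = r"(?:tɕ|[pbtdkmnŋɲθɕjwɹ])?"
GLIDE = r"[jw]?"                   # medial glides /j/ and /w/
NUCLEUS = r"[ieauɯoɤ]ː?[iu]?"      # vowel, optional length, optional offglide (assumption)
CODA = r"[ptkmnŋ]?"                # voiceless stop or nasal

SYLLABLE = re.compile(ONSET + GLIDE + NUCLEUS + CODA)

def is_wellformed(syllable):
    """True if the tone-stripped transcription fits the assumed template."""
    return SYLLABLE.fullmatch(syllable) is not None

# Forms taken from the figure captions above, with tone numerals removed.
for form in ["daːt", "buːn", "teː", "wei", "ɹoːi", "tɕjaːu", "taŋ", "taːŋ"]:
    print(form, is_wellformed(form))

# Forms that violate the template: an /s/ coda and an l-cluster onset.
for form in ["tas", "kla"]:
    print(form, is_wellformed(form))
```

Clusters such as kl-, which the text reports as preserved in other dialects, are rejected because only /j/ and /w/ are allowed in the glide slot.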
Transcription
In the supplementary material, we provide the version of the North Wind and the Sun in English that we used, its translation into Zhuang orthography (see above), and a broad phonemic transcription of the story in Hongshui He Zhuang (Baima dialect).
The North Wind and the Sun
One day, the North Wind and the Sun were disputing which was the stronger when a traveler came along, wrapped in a warm cloak. They agreed that the one who first succeeded in making the traveler take his cloak off should be considered stronger than the other. Then the North Wind blew as hard as he could, but the more he blew the more closely did the traveler fold his cloak around him; and at last the North Wind gave up the attempt. Then the Sun shined out warmly, and immediately the traveler took his cloak off. And so the North Wind was obliged to confess that the Sun was the stronger of the two.
Orthographic version
Rumzbaek caeuq daengngoenz
Miz baez ndeu, rumzbaek caeuq daengngoenz cingqcaih euq naeuz byawz haemq miz bonjsaeh. Gyoengqde cingqngamj raen miz bouxvunz ndeu byaij gvaq daeuj, boux vunz haenx daenj geu buhbangj ndeu. Gyoengqde couh naeuz lo, byawz ndaej hawj boux vunz haenx duet geu buhbangj de, couh suenq byawz haemq miz bonjsaeh. Yienghneix, rumzbaek couh buekmingh dwk ci. Byawz rox, de ci ndaej yied mengx, boux vunz haenx couh yied aeu buhbangj duk ndaet bonjfaenh. Doeklaeng, rumzbaek mbouj miz banhfap, cijndei cuengq fwngz. Ciep dwk, daengngoenz okdaeuj baez cik, boux vunz haenx couh doq dawz buhbangj bok roengzdaeuj lo. Yienghneix, rumzbaek cijndei nyinh saw lo.
Transcription

Abbreviation (following the Leipzig Glossing Rules)
DEL declarative
Acknowledgments
We would like to thank GAN Jinshan, who served as language consultant for this Illustration and actively participated in the analysis of the sound system. Naturally, all interpretations and any errors remain solely our own.
Funding information
This research was supported by the National Social Science Fund of China (grant 22&ZD299).





