Studio Flavi | Flavia van Tilburg

The influence of manipulated pitch contour and speaking rate of synthesized speech on perceived affect


Master Thesis

We assume that sounds produced by conversational user interfaces sound like human speech and can therefore convey emotions. However, for now, conversational user interfaces remain incapable of fully fulfilling these expectations. This Master thesis investigated the effect of dynamic pitch contour and static speaking rate manipulations on the perceived affect and emotions displayed by a conversational user interface.

Cause & motivation

Conversational user interfaces (CUIs) aim for turn-by-turn conversation between user and computer. Having CUIs display appropriate emotions has great potential to further advance these interactions. The flow of dialogue seems to be the most logical place to display emotion.

Pitch contour, an element of intonation, is an aspect of the flow of dialogue that is considered to ease the interaction between human and computer. The flow of dialogue is also affected by speaking rate. Therefore, the following research question was formulated:

What is the effect of pitch contour and speaking rate on perceived emotion of manipulated synthetic speech?


In a within-subjects design, 50 participants listened to 300 audio fragments of 10 different sentences. Each sentence was manipulated at 5 different speaking rates (70%, 85%, 100%, 115%, 130%) and with 6 different intonation patterns.
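The count of 300 fragments follows from fully crossing the three factors: 10 sentences x 5 speaking rates x 6 intonation patterns. A minimal sketch of that crossing is shown below. The sentence and pattern labels are hypothetical placeholders, since the sentences and contours are not named here, and the code illustrates the design only, not the tooling used in the thesis.

```python
from itertools import product

# Factor levels as reported in the text. The labels for sentences and
# intonation patterns are hypothetical placeholders (assumption).
SENTENCES = [f"sentence_{i}" for i in range(1, 11)]       # 10 sentences
SPEAKING_RATES = [70, 85, 100, 115, 130]                  # percent of original rate
PITCH_PATTERNS = [f"pattern_{i}" for i in range(1, 7)]    # 6 intonation patterns


def build_conditions():
    """Return one record per sentence x rate x pattern combination."""
    return [
        {"sentence": s, "rate_pct": r, "pitch_pattern": p}
        for s, r, p in product(SENTENCES, SPEAKING_RATES, PITCH_PATTERNS)
    ]


conditions = build_conditions()
print(len(conditions))  # 10 * 5 * 6 = 300
```

Listing the conditions this way makes it easy to check that every combination of sentence, speaking rate, and intonation pattern occurs exactly once.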

The dependent variables, valence, arousal, dominance, and discrete emotions, were measured using the Pick-A-Mood tool and the Self-Assessment Manikin tool. After listening to an audio fragment, the participant would rate the fragment on the dependent variables through the two scales.


All participants heard at least a slight difference between the different audio fragments with respect to both pitch contour and speaking rate. The semantic content of a sentence had a clear effect on valence but not on arousal.

Speaking Rate

A faster speaking rate is associated with a higher level of activity. So, for a CUI to be recognized as actively taking part in a social interaction, it must speak at an adequate pace.

Pitch Contour

All patterns ending on a raised pitch are associated with higher mean valence ratings. Higher mean ratings for dominance are associated with pitch contour patterns that end with a pitch below neutral. This is in line with human speech, where dominance is often conveyed by using a lower tone of voice. According to these results, this idea also applies to synthetic voices.

Where mean valence is highest, at the first pitch contour, mean dominance is lowest. This inverse relation continues throughout the whole series of contour patterns. Applied to the earlier mentioned use of deeper voices to convey power, these insights indicate that a voice perceived as powerful is simultaneously perceived as less pleasurable. For a CUI to have a social and pleasurable conversation, it thus seems important to avoid lower tones of voice.