
Affective Conversational User Interfaces

The influence of manipulated pitch contour and speaking rate of synthesized speech on perceived affect

PROBLEM STATEMENT


The commercialization of synthetic speech through conversational user interfaces is intensifying. As humans, we expect these voice interfaces to be fully attuned to the way we speak socially. It is thus expected that a conversational user interface can convey emotion when the situation requires it. However, for now, conversational user interfaces remain incapable of fully meeting these expectations.


— PROJECT NAME

Affective Conversational

User Interfaces


— ROLE

Master Thesis



— DATE

11/05/2021

This Master thesis investigated the effect of dynamic pitch contour and static speaking rate manipulations on the perceived affect and emotions displayed by a conversational user interface. Participants listened to a set of audio fragments and rated levels of valence, arousal, dominance, and discrete emotions. Analysis revealed a positive correlation between speaking rate and valence as well as between speaking rate and arousal. Additionally, pitch contour patterns ending on a low pitch were linked to higher dominance ratings, while patterns ending on a rising pitch were linked to higher valence ratings. The content of a sentence is of great importance for the perception of discrete emotions.

The six manipulated pitch contour patterns (pitch contour 1 through pitch contour 6).

CAUSE AND MOTIVATION

As interfaces developed, the latest generation emerged: conversational user interfaces (CUIs). These interfaces aim for a turn-by-turn conversation between user and computer. While it is widely acknowledged that counterintuitive interactions with these devices are common, the display of emotion by the conversational interface has great potential to resolve some of them. For CUIs, which have to rely solely on their voice, the flow of dialogue is the most apparent way to display emotion.


There are two main aspects within the flow of dialogue that could ease the interaction between human and computer: the chosen words and formation of sentences, and their intonation. Earlier research showed that predominantly the final part of the intonation pattern in human speech, together with speaking rate, affects perceived emotion (Mozziconacci and Hermes, 2001) as well as the affective message of the sentence. To shape the thesis, the final research question was formulated as follows:


What is the effect of pitch contour and speaking rate on perceived emotion of manipulated synthetic speech?


With the subsequent hypotheses substantiated by previous research:


H1: An increased speaking rate results in higher average ratings of arousal.

H2: An increased speaking rate results in higher average ratings of valence.


METHODS


The study entailed a within-subjects design to test for effects of pitch contour on perceived affect and of speaking rate on perceived affect. 50 participants listened to 300 audio fragments of 10 different sentences. Each sentence was manipulated at 5 different speaking rates (70%, 85%, 100%, 115%, 130%) and with 6 different intonation patterns, which can be seen higher up the page. These intonation patterns were based on assumptions, human speech, and earlier research (Chen et al., 2011; Turner et al., 2019).
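
As a minimal illustration of this design, the sketch below builds the full stimulus set by fully crossing the three factors; the sentence and contour labels are hypothetical placeholders, not the actual thesis materials.

from itertools import product

sentences = [f"sentence_{i}" for i in range(1, 11)]        # 10 affective sentences
speaking_rates = [0.70, 0.85, 1.00, 1.15, 1.30]            # relative to the original rate
pitch_contours = [f"contour_{c}" for c in range(1, 7)]     # the 6 intonation patterns

stimuli = [
    {"sentence": s, "rate": r, "contour": c}
    for s, r, c in product(sentences, speaking_rates, pitch_contours)
]
assert len(stimuli) == 300  # 10 x 5 x 6 fragments per participant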


The dependent variables, valence, arousal, dominance, and discrete emotions, were measured using two different scales: the Pick-A-Mood tool (Desmet et al., 2016) and the Self-Assessment Manikin tool (Bradley and Lang, 1994). The Pick-A-Mood tool is laid out according to the circumplex model of affect (Russell, 1980). This model reflects how valence, arousal, and dominance relate to each other and to specific emotions. After listening to an audio fragment, the participant rated the fragment on the dependent variables using the two scales.

Dimensions of affect with examples of various adjectives linked to valence and arousal ratings.

The Pick-A-Mood tool used to rate the discrete emotions perceived by participants.

The Self-Assessment Manikin Tool to rate the affective dimensions valence (top), arousal (middle), and dominance (bottom).

RESULTS


The study yielded results regarding, among other things, the effectiveness of the manipulations, the influence of the affective sentences on valence and arousal, and the effects of speaking rate and pitch contour on valence, arousal, and dominance. To compare the ratings for valence, arousal, and dominance properly, a scale from -2 to 2 was chosen for each attribute of the circumplex model of affect. Additionally, speaking rate is coded from 1 to 5, where 1 represents the slowest rate and 5 the fastest.
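
A minimal sketch of this coding is given below, assuming (hypothetically) that the raw responses were recorded on a 1 to 5 scale with the column names used here; it shifts the affect ratings to the -2 to 2 range and maps the speaking rates to their ordinal codes.

import pandas as pd

# Illustrative responses only; column names and values are assumptions.
ratings = pd.DataFrame({
    "rate_pct":  [70, 85, 100, 115, 130, 100],
    "valence":   [2, 3, 3, 4, 5, 3],     # raw 1-5 responses
    "arousal":   [1, 2, 3, 4, 5, 3],
    "dominance": [3, 3, 2, 2, 4, 3],
})

rate_code = {70: 1, 85: 2, 100: 3, 115: 4, 130: 5}
ratings["rate_code"] = ratings["rate_pct"].map(rate_code)
for dim in ("valence", "arousal", "dominance"):
    ratings[dim] = ratings[dim] - 3       # centre on 0, giving the -2 to 2 scale

print(ratings.groupby("rate_code")[["valence", "arousal"]].mean())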


As can be seen below, all participants heard at least a slight difference between the audio fragments with respect to both pitch contour and speaking rate. Also, while the ten affective sentences differ in their mean valence value, there is limited difference in their mean arousal value. The plot shows that the semantic content of a sentence has a clear effect on valence, implying that arousal is judged mainly on form, whereas valence is judged on both form and content.


SPEAKING RATE

For ratings of arousal, a positive association with speaking rate was confirmed, as hypothesized in H1. A faster speaking rate is associated with a higher level of activity, so for a CUI to be recognized as actively taking part in a social interaction, it must speak at an adequate pace. In addition, the positive association between valence and speaking rate was confirmed, as hypothesized in H2. Finally, the analysis yielded no significant effect of speaking rate on dominance.
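
One way to probe such associations is a rank correlation between the speaking-rate code and the mean ratings per rate, sketched below; the values are purely illustrative and the thesis may well have used a different statistical test (for example a mixed-effects model over all trials).

from scipy.stats import spearmanr

rate_codes   = [1, 2, 3, 4, 5]
mean_arousal = [-0.8, -0.4, 0.0, 0.3, 0.6]   # illustrative values, not the thesis data
mean_valence = [-0.3, -0.1, 0.1, 0.2, 0.4]   # illustrative values, not the thesis data

for name, values in (("arousal", mean_arousal), ("valence", mean_valence)):
    rho, p = spearmanr(rate_codes, values)
    print(f"speaking rate vs {name}: rho = {rho:.2f}, p = {p:.3f}")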

Distribution of the ten affective sentences across the valence and arousal dimensions of affect, zoomed in on the range -1 to 1 to make differences better visible. Displayed with the standard error per affective sentence. The numbers correspond to those mentioned above.

Extent to which the manipulations were noticed.

Trendlines for mean arousal per speaking rate and mean valence per speaking rate.

PITCH CONTOUR

The figures below display the six contours ordered by their mean valence rating. The differences between these ratings are minimal but nevertheless show that all patterns ending on a rising pitch are related to the higher mean valence ratings. The mean arousal ratings barely vary. Mean dominance ratings are more clearly distinguishable from each other: the higher mean ratings are associated with pitch contour patterns that end on a pitch below neutral. This result can be explained by the fact that humans often convey dominance, also known as power, through a lower tone of voice (Cheng et al., 2016). According to these results, this idea also applies to synthetic voices.
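
The per-contour figures referenced here come down to grouping the trial ratings by contour and ordering the groups by mean valence; a minimal sketch of that aggregation is given below, with hypothetical column names and illustrative values rather than the thesis data.

import pandas as pd

trials = pd.DataFrame({
    "contour":   ["contour_1", "contour_1", "contour_2", "contour_2"],
    "valence":   [0.4, 0.2, -0.1, 0.0],    # illustrative values on the -2 to 2 scale
    "dominance": [-0.3, -0.2, 0.2, 0.1],
})

per_contour = (trials.groupby("contour")[["valence", "dominance"]]
                     .agg(["mean", "sem"])   # mean and standard error per contour
                     .sort_values(("valence", "mean"), ascending=False))
print(per_contour)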


Where mean valence is highest for the first pitch contour presented, mean dominance is lowest, and this inverse relation continues throughout the whole series of contour patterns. Applied to the earlier-mentioned use of deeper voices to convey power, this indicates that perceived power comes at the cost of perceived pleasantness. For a CUI to hold a social and pleasurable conversation, it thus seems important to avoid lower tones of voice.


CONCLUSIONS

Overall, the findings suggest that different situations call for different pitch contour and speaking rate manipulations. When a CUI gives a speech in front of people, for instance, using the right speaking rate and pitch contour patterns is beneficial.


With the information found in this thesis, developers of conversational user interface technologies become better equipped to make decisions about voice attributes. Turn-by-turn concepts, on which communication with a CUI is based, have the opportunity to evolve into an intuitive and natural conversation, thereby diminishing the current complaints about delayed responses and misunderstood speech.
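
As one hedged illustration of how such voice-attribute decisions could be applied in practice, the sketch below wraps a sentence in SSML, whose prosody element exposes a relative speaking rate and, on engines that support it, a pitch contour; the concrete values are examples, not recommendations derived from the thesis data.

def build_ssml(text, rate_pct=115, contour=""):
    """Wrap a sentence in an SSML <prosody> element with a relative speaking
    rate and, optionally, a pitch contour of (position, pitch-change) pairs."""
    contour_attr = f' contour="{contour}"' if contour else ""
    return (f'<speak><prosody rate="{rate_pct}%"{contour_attr}>'
            f"{text}</prosody></speak>")

# Example: slightly faster speech ending on a rising pitch, the combination
# associated with higher valence ratings in this study.
print(build_ssml("Happy to help!", rate_pct=115,
                 contour="(0%,+0Hz) (80%,+0Hz) (100%,+30Hz)"))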


For a full report including more analyses, discussion of the results and limitations, please contact me personally.

Mean valence ratings according to pitch contour, sorted from high to low for valence ratings. Error bars indicate standard errors.

Mean arousal ratings according to pitch contour, sorted from high to low for valence ratings. Error bars indicate standard errors.

Mean dominance ratings according to pitch contour, sorted from high to low for valence ratings. Error bars indicate standard errors.


LEARNINGS

Limiting Yourself – This thesis focused solely on manipulating pitch contour and speaking rate. However, there are many more parameters of speech that could have been manipulated. Learning how to scope and limit myself was very important during the process; otherwise this thesis would have become a never-ending project.


Durability of People – I became more aware of what I ask of people who participate in research. Participants explained that the large number of audio fragments made them tired, fatigued, and even disengaged. While I felt I needed a lot of data from them, this may have lowered the quality of the results.


Academics to Implementation – A lot of your work can be substantiated by numbers. However, if you don't relate these numbers back to implementation in daily life, the numbers mean little. This thesis forced me to make that step and to focus on bringing the findings back to implementation.


LITERATURE LIST

Bradley, M. M. and Lang, P. J. (1994). Measuring emotion: The self-assessment manikin and the semantic differential. Journal of Behavior Therapy and Experimental Psychiatry, 25(1):49–59.


Chen, Y., Liu, C., and Jin, S.-H. (2011). Acoustic features of English sentences produced by native and non-native speakers. Journal of the Acoustical Society of America, 130:2523–2523.


Cheng, J., Tracy, J., Ho, S., and Henrich, J. (2016). Listen, follow me: Dynamic vocal signals of dominance predict emergent social rank in humans. Journal of Experimental Psychology: General, 145(5):536–547.


Desmet, P., Vastenburg, M., and Romero, N. (2016). Mood measurement with Pick-A-Mood: Review of current methods and design of a pictorial self-report scale. Journal of Design Research, 14(3):241–279.


Mozziconacci, S. and Hermes, D. (2001). Role of intonation patterns in conveying emotion in speech.


Russell, J. A. (1980). A circumplex model of affect. Journal of Personality and Social Psychology, 39(6):1161.


Turner, D., Bradlow, A., and Cole, J. (2019). Perception of pitch contours in speech and nonspeech. pages 2275–2279.