Emotion

Temporary emotional conditions such as amusement, anger, contempt, grief, sympathy, and suspicion affect prosody. Just as a film director explains the emotional context of a scene to her actors to motivate their most convincing performance, so TTS systems need to provide information on the simulated speaker's state of mind.

These are relatively unstable properties, somewhat independent of character as defined above. That is, one could imagine a speaker with any combination of social/dialect/gender/age characteristics being in any of a number of emotional states that have been found to have prosodic correlates, such as anger, grief, happiness, etc. Emotion in speech is actually an important area for future research.

A large number of high-level factors go into determining emotional effects in speech. Among these are point of view (can the listener interpret what the speaker is really feeling or expressing); spontaneous vs. symbolic (e.g., acted emotion vs. real feeling); culture-specific vs. universal; basic emotions and compositional emotions that combine basic feelings and effects; and strength or intensity of emotion.

We can draw a few preliminary conclusions from existing research on emotion in speech [34]: Speakers vary in their ability to express emotive meaning vocally in controlled situations. Listeners vary in their ability to recognize and interpret emotions from recorded speech. Some emotions are more readily expressed and identified than others.

Similar intensity of two emotions can lead to confusing one with the other. An additional complication in expressing emotion is that the phonetic correlates appear not to be limited to the major prosodic variables (F0, duration, energy) alone. Besides these, phonetic effects in the voice such as jitter (inter-pitch-period microvariation), or the mode of excitation may be important [24].
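To make the notion of jitter concrete, here is a minimal sketch that measures it from a sequence of pitch-period durations. It assumes one common definition (mean absolute difference between consecutive periods, divided by the mean period); the text does not specify a formula, and the function name and sample values are illustrative only.

```python
from statistics import mean


def local_jitter(periods_s):
    """Relative cycle-to-cycle variation of pitch periods.

    Assumed definition: mean absolute difference between consecutive
    periods, divided by the mean period. Periods are in seconds.
    """
    if len(periods_s) < 2:
        raise ValueError("need at least two pitch periods")
    diffs = [abs(b - a) for a, b in zip(periods_s, periods_s[1:])]
    return mean(diffs) / mean(periods_s)


# Illustrative period sequences around 100 Hz (10 ms periods).
steady = [0.010, 0.010, 0.010, 0.010]
shaky = [0.010, 0.011, 0.009, 0.011]

print(f"steady voice jitter: {local_jitter(steady):.4f}")
print(f"shaky voice jitter:  {local_jitter(shaky):.4f}")
```

A perfectly periodic voice gives zero under this measure, and the value grows as consecutive periods differ more.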

In a formant synthesizer supported by extremely sophisticated controls, and with sufficient data for automatic learning, such voice effects might be simulated. In a typical time-domain synthesizer, the lower-level phonetic details are not directly accessible, and only F0, duration, and energy are available.

Some basic emotions that have been studied in speech include:

Anger, though well studied in the literature, may be too broad a category for coherent analysis. One could imagine a threatening kind of anger with a tightly controlled F0, low in the range and near monotone, while a more overtly expressive type of tantrum could be correlated with a wide, raised pitch range.

Joy is generally correlated with an increase in pitch and pitch range, along with an increase in speech rate. Smiling generally raises F0 and formant frequencies and can be well identified by untrained listeners.

Sadness generally has normal or lower-than-normal pitch realized in a narrow range, with a slow rate and tempo. It may also be characterized by slurred pronunciation and irregular rhythm.

Fear is characterized by high pitch in a wide range, variable rate, precise pronunciation, and irregular voicing (perhaps due to a disturbed respiratory pattern).
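The tendencies described above can be sketched as a small table of adjustments applied to the variables a time-domain synthesizer exposes (F0 and duration). Only the directions of change come from the text; every number, label, and function name below is a hypothetical placeholder, not a measured value or an established method.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProsodyAdjustment:
    pitch_shift: float  # multiplier on the speaker's baseline F0
    range_scale: float  # multiplier on F0 excursions around the baseline
    rate_scale: float   # multiplier on speaking rate (>1 means faster)


# Directions follow the text; the magnitudes are made-up placeholders.
ADJUSTMENTS = {
    "neutral": ProsodyAdjustment(1.00, 1.0, 1.00),
    "cold_anger": ProsodyAdjustment(0.90, 0.3, 1.00),  # low, near monotone
    "hot_anger": ProsodyAdjustment(1.20, 1.6, 1.00),   # wide, raised range
    "joy": ProsodyAdjustment(1.15, 1.4, 1.10),         # higher, wider, faster
    "sadness": ProsodyAdjustment(0.95, 0.6, 0.85),     # lower, narrow, slow
    "fear": ProsodyAdjustment(1.30, 1.5, 1.00),        # variable rate not modeled
}


def adjust_f0(f0_hz, baseline_hz, emotion):
    """Shift the baseline and scale excursions around it."""
    adj = ADJUSTMENTS[emotion]
    new_base = baseline_hz * adj.pitch_shift
    return [new_base + (f - baseline_hz) * adj.range_scale for f in f0_hz]


def adjust_durations(durations_s, emotion):
    """A faster rate shortens each phone; a slower rate lengthens it."""
    rate = ADJUSTMENTS[emotion].rate_scale
    return [d / rate for d in durations_s]


contour = [100.0, 120.0, 110.0]  # illustrative F0 targets in Hz
for name in ("joy", "sadness"):
    print(name, adjust_f0(contour, 110.0, name), adjust_durations([0.08], name))
```

A multiplicative scheme like this cannot capture slurred pronunciation, irregular rhythm, or irregular voicing, which is the limitation the text notes for synthesizers that expose only F0, duration, and energy.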

SYMBOLIC PROSODY

Abstract or symbolic prosodic structure is the link between the infinite multiplicity of pragmatic, semantic, and syntactic features of an utterance and the relatively limited set of F0, phone durations, energy, and voice quality. The output of the prosody module of Figure 15.2 is a set of real values of F0 over time and real values for phoneme durations. Symbolic prosody deals with:

Breaking the sentence into prosodic phrases, possibly separated by pauses, and assigning labels, such as emphasis, to different syllables or words within each prosodic phrase.

Words are normally spoken continuously, unless there are specific linguistic reasons to signal a discontinuity. The term juncture refers to prosodic phrasing, that is, where words cohere and where prosodic breaks (pauses and/or special pitch movements) occur. Juncture effects, expressing the degree of cohesion or discontinuity between adjacent words, are determined by physiology (running out of breath), phonetics, syntax, semantics, and pragmatics. The primary phonetic means of signaling juncture are:

Silence insertion. This is discussed in Section 15.4.1.

Characteristic pitch movements in the phrase-final syllable. This is discussed in Section 15.4.4.

Lengthening of a few phones in the phrase-final syllable. This is discussed in Section 15.5.

Irregular voice quality such as vocal fry. This is discussed in Chapter 16.

[Figure: labels include parsed text and phone string, symbolic prosody, pauses, prosodic phrases, accent, tone, speaking style, prosody attributes, pitch range, prominence, declination.]
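As a rough illustration of how juncture might be represented symbolically before any phonetic cue is chosen, the sketch below attaches a break label to each word and groups words into prosodic phrases. It uses punctuation as a stand-in for the syntactic, semantic, and pragmatic factors named in the text, which is a deliberate simplification; the label names ("none", "minor", "major") and all identifiers are hypothetical.

```python
import re
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    break_after: str  # "none", "minor", or "major" (hypothetical labels)


# Assumption: punctuation is the only evidence used for juncture here.
BREAK_FOR_PUNCT = {",": "minor", ";": "minor", ":": "minor",
                   ".": "major", "?": "major", "!": "major"}


def annotate_breaks(sentence):
    """Attach a break label to each whitespace-separated word."""
    words = []
    for token in sentence.split():
        core, punct = re.match(r"^(.*?)([,;:.?!]*)$", token).groups()
        if not core:
            continue  # token was punctuation only
        label = BREAK_FOR_PUNCT[punct[-1]] if punct else "none"
        words.append(Word(core, label))
    return words


def prosodic_phrases(words):
    """Group words into phrases, closing a phrase at every break."""
    phrases, current = [], []
    for word in words:
        current.append(word.text)
        if word.break_after != "none":
            phrases.append(current)
            current = []
    if current:
        phrases.append(current)
    return phrases


annotated = annotate_breaks("When the rain stopped, we went outside.")
for phrase in prosodic_phrases(annotated):
    print(" ".join(phrase))
```

A later stage could then decide how to realize each label, for example with inserted silence, a phrase-final pitch movement, or phrase-final lengthening.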

