Description
Characters can synthesize speech and match that speech with lip syncing, either from prerecorded audio or from a text-to-speech (TTS) engine. Note that the <speech> tag is a request to generate speech; it does not itself contain a full description of the resulting behavior, since the audio and viseme timings are produced by the speech engine or read from prerecorded data.
Requirements
A character must have a Face Definition set up with a set of visemes. If the character does not have a Face Definition with visemes that match those of the speech engine, the speech audio will still be played, but the character will not perform any lip synchronization.
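A Face Definition is typically built through SmartBody's Python scripting interface before the character is created. The following is a minimal sketch; the motion names (ChrBrad@open, etc.) and the face definition name are illustrative placeholders, and the viseme names must match those produced by your speech engine:

# Minimal sketch: map viseme names to facial pose motions.
# All motion and character names below are illustrative placeholders.
faceDef = scene.createFaceDefinition("BradFace")
faceDef.setFaceNeutral("ChrBrad@face_neutral")
faceDef.setViseme("open", "ChrBrad@open")
faceDef.setViseme("W", "ChrBrad@W")
faceDef.setViseme("wide", "ChrBrad@wide")

brad = scene.createCharacter("ChrBrad", "")
brad.setFaceDefinition(faceDef)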
To play prerecorded audio, SmartBody requires that the sound file (.wav, .au) be placed in a directory alongside an XML file that includes the visemes and timings for that audio. Both the XML file and the audio file must have the same name, with different suffixes (the suffix for the XML file will be .xml). Please see the documentation on Prerecorded Audio for more details.
For text-to-speech (TTS), a TTS relay must be running, or the characters must use the internal Festival TTS engine. Please see the section on Using Text-To-Speech for more details.
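For example, a character can be routed to a running TTS relay from the Python interface. A rough sketch, assuming a relay is already running and a character named ChrBrad exists (the voice code is an illustrative relay voice name):

brad = scene.getCharacter("ChrBrad")
brad.setVoice("remote")              # send speech requests to the TTS relay
brad.setVoiceCode("MicrosoftAnna")   # illustrative; available voices depend on the relay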
Usage
<speech>hello, my name is Utah</speech>
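In a running scene, the same request can be issued from the Python interface; a sketch assuming a character named ChrBrad exists:

bml.execBML('ChrBrad', '<speech>hello, my name is Utah</speech>')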
Speech BML can be described either using plain text (type="text/plain") or SSML (type="application/ssml+xml"). If no type attribute is specified, the BML realizer assumes type="text/plain". Note that some speech relay systems support SSML tags that let you specify loudness, prosody, speech breaks, and so forth, but this depends on the capabilities of those relays, not on SmartBody.
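As an illustration, a sketch of an SSML request issued from Python (the <break> and <prosody> values are illustrative, and whether they are honored depends on the relay in use):

bml.execBML('ChrBrad',
    '<speech type="application/ssml+xml">'
    'hello, <break time="500ms"/> my name is '
    '<prosody rate="slow">Utah</prosody>'
    '</speech>')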
Note that the purpose of the <speech> tag is either to request that the TTS engine generate the audio and the subsequent viseme timings, or to retrieve an existing audio file and its timings and play them. Please refer to the section on Using Speech for more details.
Synchronizing Speech Using a Text-to-Speech (TTS) Engine
In order to synchronize other behaviors with speech using TTS, the speech content must be marked with <mark name=""> tags as follows:
<speech type="application/ssml+xml" id="myspeech"> <mark name="T0"/>hello <mark name="T1"/> <mark name="T2"/>my <mark name="T3"/> <mark name="T4"/>name <mark name="T5"/> <mark name="T6"/>is <mark name="T7"/> <mark name="T8"/>Utah <mark name="T9"/> </speech> <head type="NOD" start="myspeech:T4"/>
The <mark> tags are instructions to the text-to-speech engine to replace those markers with the actual word timings, which the BML realizer then uses for synchronization.
The above command places synchronization markers before and after the spoken text, which allows you to coordinate other behaviors, in this case a head nod, with various points during the speech. Note that the <mark> marker immediately before a word is coordinated with the start of that word, while the marker after the word is coordinated with the end of that word. In the above example, the character will start the head nod at the same time that the word 'name' begins to be uttered. In addition, other behaviors can reference the implicit start and end synchronization points of a speech, which correspond to the moment the first word is spoken and the moment just after the last word is spoken, respectively.
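For instance, a sketch that ties behaviors to those implicit points (the character and gaze target names are illustrative):

bml.execBML('ChrBrad',
    '<speech id="s1" type="text/plain">hello, my name is Utah</speech>'
    '<gaze target="ChrRachel" start="s1:start"/>'  # begins as the first word is spoken
    '<head type="NOD" start="s1:end"/>')           # begins after the last word ends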
Synchronizing Speech Using Prerecorded Audio
Speech that uses prerecorded audio can also be synchronized with other behaviors. When using prerecorded audio, we assume that the speech timings for the utterance are already known and have been recorded into an XML file. This XML file also includes the visemes and their respective timings, synchronized with the words, like the following:
<bml>
    <speech type="application/ssml+xml">
        <sync id="T0" time=".1"/>hello <sync id="T1" time=".2"/>
        <sync id="T2" time=".35"/>my <sync id="T3" time=".4"/>
        <sync id="T4" time=".6"/>name <sync id="T5" time=".72"/>
        <sync id="T6" time=".9"/>is <sync id="T7" time="1.07"/>
        <sync id="T8" time="1.4"/>Utah <sync id="T9" time="1.8"/>
        <lips viseme="_" articulation="1.0" start="0" ready="0.0132" relax="0.0468" end="0.06"/>
        <lips viseme="Z" articulation="1.0" start="0.06" ready="0.0952" relax="0.1848" end="0.22"/>
        <lips viseme="Er" articulation="1.0" start="0.22" ready="0.2442" relax="0.3058" end="0.33"/>
        <lips viseme="D" articulation="1.0" start="0.33" ready="0.3586" relax="0.4314" end="0.46"/>
        <lips viseme="OO" articulation="1.0" start="0.46" ready="0.4644" relax="0.4756" end="0.48"/>
        <lips viseme="oh" articulation="1.0" start="0.48" ready="0.4888" relax="0.5112" end="0.52"/>
    </speech>
</bml>
The above XML file resides in a directory, and the audio file must exist in that same directory with the same name and a .wav extension. To use such data:
<speech ref="myspeech"/>
where the files myspeech.bml and myspeech.wav are in the location designated for audio files (based on the media path and the voice code; please see the section on configuring Prerecorded Speech for Characters for details on where this location should be).
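A rough sketch of that configuration from the Python interface (the media path and folder name are illustrative and must match your own layout):

scene.setMediaPath("/path/to/media")   # illustrative media path
brad = scene.getCharacter("ChrBrad")
brad.setVoice("audiofile")             # use prerecorded audio instead of TTS
brad.setVoiceCode("Sounds")            # folder under the media path holding the myspeech files
bml.execBML('ChrBrad', '<speech ref="myspeech"/>')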
To coordinate behaviors with the prerecorded speech, include <mark> tags in the same way you would with TTS. For example:
<speech type="application/ssml+xml" id="myspeech"> <mark name="T0"/>hello <mark name="T1"/> <mark name="T2"/>my <mark name="T3"/> <mark name="T4"/>name <mark name="T5"/> <mark name="T6"/>is <mark name="T7"/> <mark name="T8"/>Utah <mark name="T9"/> </speech> <head type="NOD" start="myspeech:T4"/>
Note that the name attribute of the <mark> tags in your BML command must match the id attribute of the <sync> tags in the XML file. In this example, the head nod will occur at time 0.4, since that is the time of the matching <sync> entry in the XML file.
Please note that prerecorded audio can contain instructions for individual visemes using <lips> tags, as in the above example. In that case, each viseme has an explicit start and end time. Alternatively, the BML file can contain instructions to play an arbitrary curve for each viseme using the <curves> tag, as follows:
<bml>
    <speech type="application/ssml+xml">
        <sync id="T0" time=".1"/>hello <sync id="T1" time=".2"/>
        <sync id="T2" time=".35"/>my <sync id="T3" time=".4"/>
        <sync id="T4" time=".6"/>name <sync id="T5" time=".72"/>
        <sync id="T6" time=".9"/>is <sync id="T7" time="1.07"/>
        <sync id="T8" time="1.4"/>Utah <sync id="T9" time="1.8"/>
        <curves>
            <curve name="NG" num_keys="9">0.645000 0.000000 0.000000 0.000000 0.710728 0.607073 0.000000 0.000000 1.011666 0.000000 0.000000 0.000000 1.994999 0.000000 0.000000 0.000000 2.044607 0.020316 0.000000 0.000000 2.061665 0.000000 0.000000 0.000000 2.211665 0.000000 0.000000 0.000000 2.266716 0.262691 0.000000 0.000000 2.311665 0.000000 0.000000 0.000000</curve>
            <curve name="Er" num_keys="3">0.028333 0.000000 0.000000 0.000000 0.184402 0.998716 0.000000 0.000000 0.261667 0.000000 0.000000 0.000000</curve>
            <curve name="F" num_keys="3">0.445000 0.000000 0.000000 0.000000 0.578196 0.982450 0.000000 0.000000 0.661667 0.000000 0.000000 0.000000</curve>
            <curve name="Th" num_keys="8">1.361666 0.000000 0.000000 0.000000 1.450796 0.852155 0.000000 0.000000 1.545923 0.000000 0.000000 0.000000 1.622119 1.000000 0.000000 0.000000 1.694999 0.000000 0.000000 0.000000 2.194999 0.000000 0.000000 0.000000 2.308685 0.990691 0.000000 0.000000 2.411665 0.000000 0.000000 0.000000</curve>
            <curve name="Z" num_keys="3">-0.221667 0.000000 0.000000 0.000000 0.028095 0.995328 0.000000 0.000000 0.195000 0.000000 0.000000 0.000000</curve>
        </curves>
    </speech>
</bml>
Note that the curve data is a flat list of keys, each in the format: time, value, tangent1, tangent2 (the tangent data can be ignored). Each <curve> expresses an activation curve for a particular viseme.
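To make that layout concrete, here is a small standalone helper (not part of SmartBody) that splits a curve's flat list of numbers into (time, value) pairs, discarding the tangents:

def parse_curve(raw):
    # Keys are flat quadruples: time, value, tangent1, tangent2.
    vals = [float(v) for v in raw.split()]
    return [(vals[i], vals[i + 1]) for i in range(0, len(vals), 4)]

keys = parse_curve("0.028333 0.000000 0.000000 0.000000 "
                   "0.184402 0.998716 0.000000 0.000000 "
                   "0.261667 0.000000 0.000000 0.000000")
# keys == [(0.028333, 0.0), (0.184402, 0.998716), (0.261667, 0.0)]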
Parameters
Parameter | Description | Example |
---|---|---|
type | Type of content: either text/plain or application/ssml+xml. Default is text/plain. | <speech type="text/plain"> or <speech type="application/ssml+xml"> |
ref | Speech file reference. Determines which sound file and XML file associated with the speech to use. | <speech ref="greeting"/> |
<mark name=""/> | Marker used to identify timings of speech before the actual timings are known. Typically the names are "Tn", where n is a whole number incremented before and after each word: T0, T1, T2, etc. | <speech type="text/plain"> <mark name="T0"/>Four <mark name="T1"/> <mark name="T2"/>score <mark name="T3"/> <mark name="T4"/>and <mark name="T5"/> <mark name="T6"/>seven <mark name="T7"/> <mark name="T8"/>years <mark name="T9"/> <mark name="T10"/>ago <mark name="T11"/> </speech> |