Using High-Level Actions

The first tutorial in this series mentioned that character applications typically consist of multiple segments, each characterized by a texture image and a data file for decoding that texture. We can switch smoothly from one segment to another by ensuring that each successive segment begins in the same state as the previous one left off. For simplicity, most segments begin and end with a single "default" state, or pose, of the character.

In some applications, such as video, there is a continuous stream of action and audio to be delivered. In others, such as interactive agents or gaming, we can distinguish between an "idle" mode and a "presenting" mode. During idle mode, segments are typically chosen at random, on a timer, from a repertoire. The textures and data for idle mode are normally in the browser cache, and can be played back with minimal bandwidth and CPU power. During presenting mode, the character runs through one or more segments, typically one sentence in length, from a queue.

By limiting the size of a segment to one sentence, we can keep the time required to create and load each segment to a minimum. Furthermore, by loading a few items into a play queue, the client can easily predict what comes next, and can pre-fetch the resources for the next segment even as the current one is playing. Text-to-Speech is somewhat costly to generate, and the latency increases with the size of the request. Requesting one sentence at a time keeps the latency manageable, and the texture sizes reasonable.
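As a rough illustration, a play queue can overlap loading and playback along these lines. The loadSegment and playSegment helpers here are placeholders, not Character API calls; they stand in for whatever code fetches a segment's texture and data file and renders it:

async function loadSegment(sentence) { /* fetch the texture image and data file for this sentence */ }
async function playSegment(resources) { /* animate the segment using the loaded resources */ }

async function playQueue(sentences) {
  let next = loadSegment(sentences[0]);              // begin loading the first segment
  for (let i = 0; i < sentences.length; i++) {
    const resources = await next;
    if (i + 1 < sentences.length)
      next = loadSegment(sentences[i + 1]);          // prefetch the next segment...
    await playSegment(resources);                    // ...while this one plays
  }
}

playQueue(["First sentence.", "Second sentence.", "Third sentence."]);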

We can distinguish between two types of actions. Automatic actions, such as blinking or random head movement, occur continuously, whether the character is presenting or idle. Deliberate actions, such as hand gestures or emotional responses, are normally associated with the message being delivered. While they can be chosen at random, in general they need to be chosen and timed to match the message.

Authoring a character presentation normally begins with a script. As a practical matter, it can become quite tedious to insert XML actions for automatic actions. Even deliberate actions can be problematic, both because deliberate movements often consist of several individual actions involving different parts of the body (e.g. combining Look Right and Gesture Right), and because the availability of actions can vary somewhat from one character style to another.

To simplify the authoring of character scripts, you often want to associate a single, high-level action with an entire sentence, essentially indicating the manner in which the sentence is spoken. For example, scripts appear as follows in the Character Builder service:

The Look At User selection means that no deliberate action takes place, and an empty text field indicates the action is performed silently. Each line in the script tells the character what to Do and what to Say.

If you are using the Character Builder's Agent module to dynamically present material using JavaScript calls from the surrounding web page, then the equivalent code would be as follows:

myAgent.dynamicPlay({say:"This is a sentence with no deliberate action."});
myAgent.dynamicPlay({do:"gesture-right", say:"This is a sentence in which the character looks and gestures towards the right."});
myAgent.dynamicPlay({do:"happy"});

The same dynamicPlay() API is also available on the Character API Reference Implementation, for clients that use the Character API directly.

It is important to realize that this high-level action system provides the underlying implementation with a license to realize the actual Character API XML tags as it sees fit, with knowledge of the range of animations available for the given character. Additionally, automatic actions, such as blinking, or subtle head movements, can also be added by the system.

For example, the actual XML tags that the above script will pass to the Character API might be as follows:

This is a <blink/> sentence with no deliberate action.
<par><lookleft/><gestureleft/></par>This is a sentence in which <handsbyside/> the character <lookuser/> looks and <blink/> gestures towards the right.
<bigsmile/>

One thing to note is that low-level actions always use Stage Right and Stage Left in their action names, whereas high-level actions use House Right and House Left, for authoring convenience. (House Right, the audience's right, corresponds to the character's Stage Left, which is why "gesture-right" expands to <gestureleft/>.)

To summarize, both the Character Builder and the Character API Reference Implementation provide a high-level action system that lets you pick the manner in which each sentence is spoken. The system translates this into the XML tags that represent the low-level "micro" actions that the Character API uses to realize the action for a particular character. This allows animators to focus on creating smaller, more reusable units of animation while providing content creators with a simple-to-use and stable set of higher-level actions that is implemented optimally across multiple character styles. It is important to note that high-level tagging is a layer built on top of the Character API.



How it Works

The code to implement high-level actions can be found in the charapiclient.js file in the Reference Implementation. To a first approximation, the conversion from high-level to low-level actions occurs via a simple lookup table that takes into account the character style.
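As an illustrative sketch only (the table contents, style key, and function name below are not the actual charapiclient.js code), the mapping might look something like this for one character style:

// Illustrative lookup - the real table differs and varies by character style.
const highLevelActions = {
  "MyStyle": {                                           // hypothetical character style key
    "look-right":    "<lookleft/>",                      // house right maps to stage left
    "gesture-right": "<par><lookleft/><gestureleft/></par>",
    "happy":         "<bigsmile/>"
  }
};

function expandHighLevelAction(style, action) {
  const table = highLevelActions[style] || {};
  return table[action] || "";                            // unknown actions expand to nothing
}

expandHighLevelAction("MyStyle", "gesture-right");       // "<par><lookleft/><gestureleft/></par>"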

For the insertion of automatic actions, a seeded random number generator is used. The seed is computed by hashing the text of the sentence. This means that for a given sentence, e.g. "This is a sentence with no deliberate action.", the process always results in the same random expansion, e.g. "This is a <blink/> sentence with no deliberate action." If you change even a single character, the expansion will likely be different. For example, "This is a sentence with no deliberate action!" (note the different ending punctuation) might always expand to "This is a sentence with no deliberate <blink/> action." (note the different position of the blink tag). The use of the seeded random number generator improves caching - without it, a hundred views of the same sentence might result in a hundred slightly different requests to the Character API.
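A minimal sketch of the idea, using an illustrative hash and generator rather than the ones actually used in charapiclient.js:

// The same sentence always yields the same seed, and therefore the same
// "random" placement of automatic actions such as <blink/>.
function hashString(s) {
  let h = 0;
  for (let i = 0; i < s.length; i++)
    h = (h * 31 + s.charCodeAt(i)) >>> 0;                // simple 32-bit rolling hash
  return h;
}

function seededRandom(seed) {
  return function() {                                    // returns values in [0, 1)
    seed = (seed * 1664525 + 1013904223) >>> 0;          // linear congruential step
    return seed / 4294967296;
  };
}

const rand = seededRandom(hashString("This is a sentence with no deliberate action."));
// rand() now produces the same sequence on every request, so the sentence always
// expands the same way and the resulting Character API request caches well.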

You can use the character testers to get a sense for the low-level actions available for each stock character.

We stressed that high-level actions should match the content being spoken. In certain domains it is possible to choose the high-level actions automatically, based on keywords and context. For example, a virtual newscaster can glance down at her notes between headlines, or pick an emotional response (e.g. sad, amused, or concerned) based on sentiment analysis of the story. This could be as simple as detecting the presence of keywords, such as "died", "prank", or "tensions".
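A toy sketch of such keyword-based selection might look like the following; the keyword lists are illustrative, and the action names follow the examples above:

// Toy keyword-based chooser - keyword lists are illustrative only.
function chooseAction(sentence) {
  const s = sentence.toLowerCase();
  if (/\b(died|killed|tragedy)\b/.test(s)) return "sad";
  if (/\b(prank|joke|stunt)\b/.test(s)) return "amused";
  if (/\b(tensions|crisis|standoff)\b/.test(s)) return "concerned";
  return null;                                           // no deliberate action
}

const headline = "Tensions rise as trade talks stall.";
const action = chooseAction(headline);
myAgent.dynamicPlay(action ? {do: action, say: headline} : {say: headline});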



Speech Tags

If you are using Text-to-Speech, the low-level XML action tags are stripped out of the sentence before it goes to the TTS engine. TTS systems typically allow the text to be marked up with further XML tags using the SSML (Speech Synthesis Markup Language) standard. For example, the tag <break time="1s"/> introduces a silent pause in the actual audio. Modern TTS systems use context to generate the correct inflection and prosody, and this works well the vast majority of the time; however, it occasionally becomes necessary to use SSML tags for more control. For example, you might want to pronounce 123 as "one, two, three" rather than "one-hundred-twenty-three". In SSML, you would write <say-as interpret-as="digits">123</say-as> to provide an explicit hint. You can sometimes avoid an SSML tag by rewriting the text, e.g. "1, 2, 3"; however, the same text is also often used for closed-captioning, where you might want it to still appear as "123". It is good practice to use clear prose that is easy to read and easy to speak, and then annotate it only where necessary using speech tags. Unfortunately, SSML support can vary from one TTS engine to another, and the precise XML syntax can be complicated and unforgiving.

The Character Builder and Character API Reference Implementation use square-bracket speech tags as an alternative to SSML. For example, you can use "A long [silence 1s] pause", or "The prefix [digits]123[/digits]", and the result is converted to the SSML tags introduced above.
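As a rough sketch of the conversion for just these two tags (the actual implementation covers many more tags and edge cases; see KB113 for the full list):

// Rough sketch - converts two square-bracket speech tags to their SSML equivalents.
function speechTagsToSSML(text) {
  return text
    .replace(/\[silence (\d+(?:\.\d+)?(?:s|ms))\]/g, '<break time="$1"/>')
    .replace(/\[digits\](.*?)\[\/digits\]/g, '<say-as interpret-as="digits">$1</say-as>');
}

speechTagsToSSML("A long [silence 1s] pause");        // 'A long <break time="1s"/> pause'
speechTagsToSSML("The prefix [digits]123[/digits]");  // 'The prefix <say-as interpret-as="digits">123</say-as>'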

Speech tags can be used inside the Character Builder's Say field:

and also in dynamic speech:

myAgent.dynamicPlay({say:"A long [silence 1s] pause"});

For a complete list of speech tags, please see KB113.



Custom Commands

The Character Builder script editor uses <cmd/> tags to time additional actions, such as advancing a slideshow. In most cases the commands you need will be available in a dropdown selector:

If you are using the Agent module with dynamicPlay or the Character API Reference Implementation, then you can also embed a command tag directly in your Say text, similar to a speech tag, by using the [cmd] tag. You can include arguments in the tag that will be passed to the event handler in the containing web page.

myAgent.dynamicPlay({say: 'Here [cmd type="reveal" target="hint"] is a hint'})
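On the web-page side, you would listen for the command and act on its arguments. The event name and payload shape below are assumptions for illustration only; consult the Agent module or Reference Implementation documentation for the exact interface:

// Hypothetical event wiring - the actual event or callback name may differ.
myAgent.addEventListener("command", function(e) {
  if (e.type === "reveal")
    document.getElementById(e.target).style.display = "block";  // e.g. reveal the hint element
});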



Low-Level Action Tags

In some cases you might want to invoke a low-level Character API action that is not exposed as a high-level action. Or, you might want to invoke a specific combination of low-level actions with more precise timing. Any square-bracket tags that are not explicitly recognized as speech tags or commands are translated to XML tags for consideration by the Character API. For example, you might know that your character has a custom <grabcoffee/> action.

myAgent.dynamicPlay({say: 'I have my [grabcoffee] mug right here!'})

The resulting Character API action would be:

<say>I have my <grabcoffee/> mug right here!</say>


Next Steps

This tutorial has introduced the high-level action and speech tagging system implemented by the Character Builder, and made available as part of the Character API Reference Implementation. The next tutorial in this series will focus on how the Character API can be used to create mp4 videos.







