Media Semantics - Talking AI Avatar Platform

Building a Caching Server

In the first tutorial in this series we used client-side code to call the Character API directly. This is good for rapid prototyping, but it is actually not the way we recommend using the Character API in a production environment.

Chances are your application already talks to an existing server. Why not let that server call the Character API on the browser's behalf and cache the results? Not only would this lead to reduced usage of the Character API, but it would enable additional server-side processing to occur, such as Text-to-Speech.

The Character API Reference Implementation contains a localhost server built using Node.js and the Express library. It also exposes an 'animate' endpoint for your client, but this version calls AWS Polly and the Character API only when necessary to populate its cache. All cached requests are returned immediately. This tutorial shows how you can move it to an EC2 web instance.

Running it on EC2

The most basic AWS EC2 "t2 micro" server is free-tier eligible, and provides ample processing power for this application. We will assume that you have apache installed, for example by working through the first few steps in this AWS LAMP tutorial. You should also have Node.js installed, as described here.

Begin by following the installation instructions in the Reference Implementation Readme, resulting in a new 'animate' endpoint running on http://localhost:3000/animate on your instance.

It is convenient to host a Node application within Apache by way of a ProxyPass entry in your Apache config file. You can use Apache to host static files, such as a website, while passing certain api calls, such as 'animate', to Node.js for processing. This Apache config file line will map anything under /node to a Node.js app running on port 3000:

ProxyPass /node http://localhost:3000

You can add this line to the end of the Apache config file using:

$ sudo nano /etc/httpd/conf/httpd.conf

Then restart apache with

$ sudo systemctl restart httpd

(On debian, the file is /etc/apache2/apache2.conf and you can restart using "sudo service apache2 restart". If the ProxyPass statement is not recognized, try "sudo a2enmod proxy_http" to enable the proxy module. Note that you can use any port. To check if port 3000 is used, you can use "lsof -i :3000".)

You likely want to install a runner utility such as pm2 to manage your Express service.

$ npm install pm2 -g

then

$ pm2 start server.js

Key pm2 commands during development are:

pm2 restart server
pm2 log server

In production, consider running pm2 with two or more instances with "pm2 start server.js -i 2". This can provide more throughput and allow you to deploy changes with no downtime using "pm2 reload server". You will also want to run "pm2 startup" and follow the instructions to setup your Node service to start automatically on a reboot of your instance.

How the cache works

The easiest place to cache files is on local disk. The Reference Implementation uses the "cache" directory under server.js.

When 'animate' is called, there are 3 different files that are produced: an mp3 audio file, a png or jpeg texture file, and a json file containing the instructions for decoding the texture file. In the Reference Implementation, the client can ask for all three files more or less simultaneously from the 'animate' endpoint. The urls will differ only in the &type parameter, which can be one of &type=audio, &type=image, &type=data.

The first thing we do when we receive the request is to build the complete set of parameters that we will send to the Character API. We then create a 16 character hash from the information in the parameters. We are careful to exclude the &type parameter when building the hash. This hash is effectively a unique signature for this request, and will become the base for our cache file name.

app.get('/animate', function (req, res) {

    // Start with our fixed parameters
    var o = {
        "character":"SusanHead",
        "version":"3.1",
        "return":"true",
        "recover":"true",
        "format":"png",
        "width":250,
        "height":200,
        "webgl":"true",
        "fps":"24",
        "quality":"95",
        "backcolor":"ffffff"
    };
    
    // Add to that all the parameters that are variable, from the client
    if (req.query.do) o.do = req.query.do;
    if (req.query.say) o.say = req.query.say;
    
    // Now compute a hash based on all the CharacterAPI parameters
    var crypto = require('crypto');
    var hash = crypto.createHash('md5');
    for (var key in o)
        hash.update(o[key]||'');
    var filebase = hash.digest("hex"); // the base for our filename
    
    ...

From the filename base, we can obtain several different file names in our cache:

function targetFile(filebase, type, format) {
    if (type == "audio") return cachePrefix + filebase + ".mp3";
    else if (type == "image") return cachePrefix + filebase + "." + format;
    else if (type == "data") return cachePrefix + filebase + ".json";
    else if (type == "lock") return cachePrefix + filebase + ".lock";
}

The Reference client can request the texture file as well as the json and audio files associated with a segment more or less simultaneously. Here is how we deal with this on the server:

    lockFile.lock(targetFile(filebase, "lock"), {}, function() {
        let file = targetFile(filebase, type, format);
        fs.exists(file, function(exists) {
            if (exists) {
                lockFile.unlock(targetFile(filebase, "lock"), function() {
                    // optional: "touch" each file we return
                    let time = new Date();
                    fs.utimes(file, time, time, () => { 
                        finish(req, res, filebase, type, o.format);
                    });
                });
            }
            else {
                // Cache miss - do the work to produce matching .mp3, .png, and .json files
                ...
                lockFile.unlock(targetFile(filebase, "lock"), function() {
                    finish(req, res, filebase, type, o.format);
                });
            }
        });
    });    
});

All requests begin by asking for a lock based on the file base. If the required file is already in the cache, the lock can quickly be released and the file is returned, with no additional work. If the file is not in the cache, the work is performed to create up to three files in the cache, all sharing the same file base, but with different file extensions. The lock is then released and the requested file type is returned. During the generation any other requests for other file types will wait their turn to obtain the lock. When they finally do, they will find that their file is cached and can be returned rapidly. In a production implementation it is not uncommon for a single Node.js server to be fielding hundreds of requests simultaneously - the file locks ensures that no duplicate work is done and that partially complete files are never returned.

The finish() function then streams the file out to the client.

function finish(req, res, filebase, type, format) {
    var frstream = fs.createReadStream(targetFile(filebase, type, format));
    res.statusCode = "200";
    if ((req.get("Origin") || "").indexOf("localhost") != -1) res.setHeader('Access-Control-Allow-Origin', req.get("Origin"));
    res.setHeader('Cache-Control', 'max-age=31536000, public'); // 1 year (long!)
    res.setHeader('content-type', targetMime(type, format));
    frstream.pipe(res);        
}

The simple disk-based cache scheme in the Reference Implementation has an important flaw, namely it can grow in an unbounded fashion, leading to exhaustion of all the space on your local disk. A crude way to solve this is to run a 'cron' job on a daily basis that deletes the cache. Alternatively, a simple LRU (Least Recently Used) cache can be implemented by deleting only the older files. This works in the Reference server because a cache hit still results in the file being "touched", i.e. the last-modified timestamp is updated to the current time.

Another flaw with the disk-based cache is that it does not scale as well if you use a load balancer to distribute your load among multiple servers. When scaling out, we recommend switching to a memory-based caching scheme using Redis and AWS Elastic Cache. This results in greater access speed, but also allows the work performed by one instance to be shared by the other instances in your cluster.

The People Builder service is also a client of the Character API. In addition to using these techniques, it further reduces latency by moving files to AWS S3 and AWS CloudFront. If your application is an authoring system that lets users create static character scripts, then a Publish button can be the signal to perform this type of longer-term storage and distribution optimization.

More on Text-to-Speech

While you could conceivably have a Text-to-Speech engine installed directly on your server, the modern way is to simply use another cloud service. The Character API does not provide a Text-to-Speech api, however there are several vendors who do. In particular, the Amazon Polly service is inexpensive and excellent quality. Moreover, it allows you to cache the resulting audio files, which is essential for our purposes.

In the previous tutorial we showed how we could start an audio file concurrently with the animation. This audio file was pre-processed in order to obtain the lipsync information, which was passed along with the &lipsync parameter. With TTS, we simply ask the server for this audio file, along with the rest of the animation. The server will first look at the 'say' parameter to see if there is any speech.

If there is, it pulls out the text and sends it to Polly to obtain the audio. Next, it gets the lipsync information. At this point it can call the Character API animate() endpoint as before, with the lipsync information.

To add TTS we'll need to add some more modules:

$ npm install aws-sdk
$ npm install aws-config
$ npm install node-zip

To set up our call to Amazon Polly, we need to add the following logic at the top of the file (substitute your own API keys as appropriate):

var AWS = require('aws-sdk');
var awsConfig = require('aws-config');

AWS.config = awsConfig({
  region: 'us-east-1',
  maxRetries: 3,
  accessKeyId: 'xxxxxxxxxxxxxx',
  secretAccessKey: 'xxxxxxxxxxxxxx',
  timeout: 15000
});

Here is some logic to call Polly:

    // Get the e.g. "[say]Look [cmd] here.[/say]" -> "Look here."
    var textOnly = removeAllButSpeechTags(o.say);
    
    // Call Polly
    var polly = new AWS.Polly();
    var pollyData = {
      OutputFormat: 'mp3',
      Text: textOnly,
      VoiceId: "Joanna"
    };
    polly.synthesizeSpeech(pollyData, function (err, data) {
    
        // Write the audio
        fs.writeFile(targetFile(filename, "audio"), data.AudioStream, function (err) {
        
    });

In the case of the Amazon Polly service, there is a feature called Speech Marks which lets you download the "viseme", or lipsync, information for a voice request. To get both audio and Speech Mark information, you'll need to call the Polly API twice with the exact same text, once requesting an output in mp3 format, and once requesting an output in JSON format.

If you are using AWS Polly voices, then you will find that requesting Speech Marks results in slightly better quality than the LipSync API. This makes sense, since the phoneme data is an intermediate step in the Speech Synthesis process, so the resulting speech and lipsync data are guaranteed to match very faithfully. By contrast, the Character API lipsync information is generated from acoustic models that don't take into account the language or the text. Not all speech vendors expose phoneme information, but you should definitely use it when it is available.

The TTS call is likely to be the slowest component of a request. For example Polly requests for a single sentence are on the order of a few hundred milliseconds. If you look at the actual code in server.js, you will see that the two calls are performed in parallel so as to reduce the overall latency.

Locking it down

Normally a server protects its endpoints with an authentication system. For example the client might be handed a token as part of the login process, and that token is included with each api call. If the token is invalid then the server immediately fails the request. If your application includes a login, then you should certainly add authentication to your endpoint.

But what if your application appears on the public web, without authentication? This is hard problem in general, but there are some mitigations.

The Character API normally uses CORS to restrict the domains that can consume resources generated by the API. These domains can be set in the Character API Control Panel. It is a good idea to protect your caching server endpoint using CORS as well. To do so, modify the finish() function to look for an Origin header in the request - if it comes from your domain, then pass that origin back in the CORS Access-Control-Allow-Origin header.

if ((req.get("Origin")||"").indexOf("yourdomain.com") != -1) res.setHeader('Access-Control-Allow-Origin', req.get("Origin"));

You are now protected from someone lifting your client code and placing it on another domain. Note however that when server side code consumes a Character API resource, it is NOT subject to CORS protection. The Character API contains some measures to throttle run-away usage of your api and to detect other malicious activity, but ultimately one of the best protections in this case is to fix as many of the animation parameters as possible in your server code, to make it less attractive as a generic source of animation.

Next Steps

This tutorial has showed you the key elements in creating your own 'animate' endpoint that will call other endpoints as necessary to populate a cache. This cache maps urls containing animation parameters to the image, data, and audio resources they represent. Because your server will invariably serve many duplicate requests, you will decrease the latency for these requests and save on overall infrastructure costs.

The next tutorial will focus on how the Character API animation actions can be used to generate video.