Haiku Camera: Take photo, hear a haiku. Using Reddit, AWS Rekognition, and Polly.

Three weeks ago Amazon had their annual AWS re:Invent event, where new products were launched. I wanted a quick way to test some new products, and picked something practical: Take a photo, understand what’s in the photo using Amazon Rekognition, find a relevant haiku on Reddit /r/haiku, and read it out loud using Amazon Polly. Why not.

Here’s a video demonstrating Haiku Camera:

I wrote a simple web application in ASP.Net MVC, using the AWS .Net SDK. They work quickly, and the NuGet packages for all new services were ready for use shortly after the announcement.

Amazon Rekognition

Rekognition’s DetectLabels can receive a photo and return a list of things it identified in the photo.
The API is straightforward and works nicely. While developing I tested it mostly with photos I took on my trip two years ago, and overall it did quite well, finding relevant keywords for every photo.
The code is fairly short (not handling the case of >5MB images):

public async Task<List<string>> DescribeImage(System.Drawing.Image image)
{
    using (var stream = new MemoryStream())
    {
        image.Save(stream, ImageFormat.Jpeg);
        stream.Seek(0, SeekOrigin.Begin);
        var rekognition = new AmazonRekognitionClient();
        var content = await rekognition.DetectLabelsAsync(
        new DetectLabelsRequest
        {
            MaxLabels = 5,
            Image = new Amazon.Rekognition.Model.Image
            {
                Bytes = stream,
            },
        });
        return content.Labels.Select(l => l.Name).ToList();
    }
}

Here are a few good examples using the Rekognition demo page (AWS account required):

As I’ve said, I had a specific goal – I wanted to find a haiku related to the photo. I limited the API to five keywords, assuming that would be enough, and focusing only on the most relevant features in the photo. I could have also used the confidence to remove less likely labels, but I chose not to bother.

After using enough test photos, I noticed I was getting a lot of the same keywords, for example:

It was obvious:

Unfortunately, as you probably already know, people

Horse ebooks, July 25, 2012

All of these photos have the keywords People, Person, and Human. Arguably, it is only useful in the photo of dancing people, where the people are really the subject. I search for a haiku based on all keywords, and people are a popular subject among haiku poets. People are spamming my results, and I keep getting the same haiku.
Additionally, the photos of the lion statue and Disneyland have exactly the same labels, adding Art, Sculpture, and Statue.

Confidence is not enough

Besides correctness, another issue is percision and specificness. Consider the results ["bird", "penguin"], or ["food", "sushi"]. It is clear to us, people-person-humans, that Bird ⊃ Penguin, and Food ⊃ Sushi – but how can we choose the more specific word automatically? If I’m using a black-box product like Amazon Rekognition, I probably don’t have the resources to build my own corpus. Furthermore, this data is clearly already contained in Rekognition – but it is not exposed in any way. This is a complementary service – had I used Rekognition to tag many photos and build an index, and wanted to answer questions like “find all photos with [Bird]”, I would not have had this problem. There is difficulty in choosing the best label when describing the content of a single photo.
I did not test it, but AWS has a new chatbot service called Amazon Lex (currently at limited preview) – maybe a chatbot service can help and choose the more specific words.

Technically correct

What’s in this photo?

Image source getyourguide.com.

Image source getyourguide.com.

Ask a few people, and chances are they’ll say The Eiffel Tower, or Paris.
Rekognition gives us {"Architecture", "Tower"}. Now, if a person gave you this answer there are two options: either they’re 3 and know what architecture is, or they have a superb sense of humor. And that’s really the problem: without proper names, Rekognition is being a smart-ass.

Rekognition – conclusion

Rekognition works well enough, and is easy to use. Its face recognition capabilities seem much more advanced than its modest image labeling and content identification.

What could be better:

  • A simple way to represent hierarchy between keywords.
  • Besides confidence, expose the relevance of the keyword. As reference, this is how Elasticsearch handles relevance:

    Inverse document frequency: How often does each term appear in the index? The more often, the less relevant. Terms that appear in many documents have a lower weight than more-uncommon terms

    This just makes sense. If you have "People" in 60% of the photos (not a real number), you can be confident that a photo will have people in it, but it would not be very interesting.

  • Relevance can also be influenced by the composition of the photo: is there a huge tower in the center and a few small people below it? Maybe the tower is more interesting. It would be nice if the API returned relative areas of the labels, or bounding-boxes where possible (also useful for cropping).
  • A few proper names would be nice.

It is fair to mention that this is a first version, and even a small update can make much more useful for my use case.

Finding a haiku

This was the easy part. There are many collections of haiku on the interent. I chose Reddit /r/haiku because Reddit has a simple RSS API for content, a build-in search engine, and a huge variety of crazy creative haiku.

var keywords = String.Join(" OR ", subject.Select(s => $"({s})"));
var url = $"https://www.reddit.com/r/haiku/search.rss?q=title%3A({Uri.EscapeUriString(keywords)})&restrict_sr=on&sort=relevance&t=all";
var all = XDocument.Load(url).Descendants().Where(e => e.Name.LocalName == "title")
    .Select(e => e.Value).Where(h => h?.Count('/'.Equals) == 2).ToList();
// retrun a random haiku.

Using the keywords I build a URL for the search API. The filter looks like "title:((Dog) OR (Bear) OR (Giant Panda))".
If I used these haiku publicly or commercially (!?), I would have also check the license, and would have extracted the author and link to the haiku.

Amazon Polly

Another new service is Amazon Polly, which is a voice synthesizer: it accepts text and returns a spoken recording of that text.

public async Task CreateMp3(string text, string targetMp3FilePath)
{
    var polly = new AmazonPollyClient();
    var speak = await polly.SynthesizeSpeechAsync(new SynthesizeSpeechRequest
    {
        OutputFormat = OutputFormat.Mp3,
        Text = text,
        VoiceId = "Joanna",
    });
    using (var fileStream = File.Create(targetMp3FilePath))
    {
        speak.AudioStream.CopyTo(fileStream);
    }
}

Again, the code is simple, and Polly’s SynthesizeSpeech works easily. Polly has a variety of English voices and accents, including British, American, Australian, Welsh, and Indian, and I pick a random English one each time.
I am not the target audience, but I found the American and British voices to be clearer and of higher quality than the other English voices. (I mean in Amazon Polly, of course. Not in general.)

A minor challenge was to get the punctuation right. The poems in Reddit are presented as three lines separated by slashes and usually spaces. For example:

of all the virtues / patience is the single most / irritating one

This is not a format that is too suitable for Polly. Polly pretty much ignores the slash when it is surrounded by spaces, but reads a verbal “slash” when it is not surrounded by spaces (“warrior/poet”).
Haiku tend to have minimal punctuation, but in cases the punctuation is there I prefer to keep it. When there is no punctuation at the end of a line I add commas and periods:

of all the virtues,
patience is the single most,
irritating one.

This is not ideal, but renders nicely in Polly. I add newlines just for presentation. Newlines are ignored by Polly, as far as I could see.

Polly is obviously not meant for reading haiku, but in this case its quirks are part of the charm.
I did not try SSML at all – it probably requires better semantic understanding than I have.

Other

This is fairly little code to achieve what I wanted – understand what’s in a photo, find a haiku, and play it. I wrapped it all in a small mobile-friendly web page:

  • Turns out an <input type="file"> field can trigger the mobile camera. That’s neat.
  • I used CSS for all styling and animations. CSS can do a lot.
  • There is just a little JavaScript. Most of it deals with updating elements and CSS classes – it is not as fun as using an MVVM/MVC framework.

See also

Thanks!

Advertisements

Two Years with Amazon Simple Workflow (SWF)

AWS

June 12 mark two years of us using Amazon Simple Workflow Service (SWF) in production, and I thought I’d share the experience.

First, let’s get this out of the way:

What is SWF not?

  • SWF does not execute any code.
  • SWF does not contain the logic of the workflow.
  • SWF does not allow you to draw a workflow or a state machine.

So what is it?

SWF is a web service that keeps the state of your workflow.
That’s pretty much it.

What are we using it for?

Our project is based on C#. We are using the AWS API directly (using the .Net SDK).
If you are using Java Or Ruby amazon provider a higher level library for SWF called Flow Framework. For C#, I wrote what I needed myself, or simply used the “low level” API.
Out project processes a large number of files daily, and it was my task to convert our previous batch-based solution to SWF.

How does it work?

SWF is based on polling. Your code runs on your machines on AWS or on-premises – it doesn’t matter. Your code is polling for tasks from the SWF API (where they wait in queues), receives a task, executes it, and sends the result back to the SWF API.
SWF then issues new tasks to your code, and keeps the history of the workflow (state).

If you’ve read any of the documentation, you probably know there are two kind of tasks: Activity Tasks (processed by workers), and Decision Tasks (process by The Decider). This API naturally encourages and leads you to a nice design of your software, where different components do different things.

Workers

Workers handle Activity Tasks.
Workers are simple components that actually do the work of the workflow. These are the building blocks of the workflow, and typically do one simple thing:

  • Take an S3 path as input and calculate the hash of the file.
  • Add a row to the database.
  • Send an email.
  • Take a S3 path to an image and create a thumbnail.

All of my workers implement a simple interface:

public interface IWorker<TInput, TOutput>
{
    Task<TOutput> Process(TInput input);
}

An important property of workers is that all the data it needs to perform its task is included in its input.

The Decider

When I first read about SWF I had a concept of tiny workers and deciders working together like ants to achieve a greater goal. Would that it were so simple.
While workers are simple, each type of workflow has a decider with this operation:

  • Poll for a decision task.
  • Receive a decision task with all new events since the previous decision task.
  • Optically load the entire workflow history to get context.
  • Make multiple decisions based on all new events.

For a simple linear workflow this isn’t a problem. The decider code is typically:

if workflow started
  Schedule activity task A
else if activity task A finished
  Schedule activity task B
else if activity task B finished
  Schedule activity task C
else if activity task C finished
  Complete workflow execution.

However, when the logic of the workflow is complicated a decider may be required to handle this event:

Since the previous decision, Activity Task A completed successfully with this result, Activity Task B failed, we wouldn’t start the child workflow you’ve requested because you were rate limited, oh, and that timer you’ve set up yesterday finally went off.

This is a pretty complicated scenario. The Decider has just one chance of reacting to these events, and they all come at the same time. There are certainly many approaches here, but either way the decider is a hairy piece of code.

Code Design

As I’ve mentioned earlier, my workers are simple, and don’t use the SWF API directly – I have another class to wrap an IWorker. This is a big benefit because any programmer can write a worked (without knowing anything of SWF), and because it is easy to reuse the code in any context. When the worker fails I expect it to simply throw an exception – my wrapper class registers the exception as an activity task failure.

To make writing complicated deciders easier I’ve implemented helper functions to get the history of the workflow, parse it, and make new decisions. My decider is separated to a base class that uses the SWF API, and child classes (one for each workflow type) that accept the workflow history and return new decisions. My deciders do not connect to a database or any external resource, and have no side-effects (excepts logs and the SWF API, of course). This allows me to easily unit-test the decider logic – I can record a workflow history at a certain point to JSON, feed it to the decider, and see what decisions it makes. I can also tweak the history to make more test cases easily. These tests are important to me because the decided can contain a lot of logic.

Scalability

In either case, for deciders and for workers, I keep zero state in the class instance. All state comes from the workflow history in the decider’s case, and task input in the worker’s case. There is no communication between threads and no shared memory. This approach makes writing scalable programs trivial: there are no locks and no race conditions. I can have as many processes running in as many machines as I’d like, and it just works – there is no stage of discovery or balancing. As a proof of concept, I even ran some of the treads in Linux (using Mono), and it all worked seamlessly.

Retries

The Flow Framework has built-in retries, but it only took me a few hours to implement retries to failed activity tasks, and a few more hours to add exponential backoff. This works nicely – the worker doesn’t know anything. The decider schedules another activity tasks or fails the workflow. The retry will wait a few minutes, and may run in another server. This does prove itself, and many errors are easily resolved.

Timeouts

SWF has many types of timeouts, and I’ve decided early on that I would use them everywhere. Even on manual steps we have timeouts of a few days.
Timeouts are important. They are the only way the workflow can detect a stuck worker or decider tasks because a process crashed. They also encourage you to think about your business process a little better – what does it mean when it takes four days to handle a request? Can we do something automatically?
Another good (?) property of timeouts is that timeouts can purge your queues when the volume gets too high for your solution.

Integration with other AWS services

Lambda

SWF can execute an AWS Lambda instead of an activity task, which is a compelling idea. It saves the trouble of writing the worker, polling for tasks, and reduces the overhead of a large number of threads and open connections. In the simple worker examples I gave above, all of them can be written as Lambda functions (except maybe adding a database row, depending on your database and architecture). The combination of Lambda serverless execution and SWF state-fullness can make a robust, and trivially scalabale system.
But – while you can use Lambda to replace your workers, you still need to implement a decider the uses the API, and runs as a process in your servers. This is a shame. Decision tasks are quick and self contained, and deciders can easily be implemented as Lambda functions – if they didn’t have to poll for tasks.
I predict Amazon are going to add this feature: allowing AWS Lambda to work as a decider is a logical next step, and can make SWF a lot more appealing.

CloudWatch

CloudWatch show accumulated metadata about your workflows and activity tasks. For example, this chart shows the server CPU (blue) and executions of an Activity Task (orange):
CloudWatch - CPU and an Activity Task
This is nice for seeing exclusion time and how the system handles large volumes. The downside is that while it should accumulated data – there is no drill-down. I can clearly see 25 “look for cats in images” workflows failed, but there is no way of actually seeing them. More on that below.

What can be better

Rate Limiting and Throttling

More specifically, limiting number of operations per second. I don’t get rate limiting. Mostly, rate limiting feels like this:
Little Britain - Computer Says No

I understand rate limiting can be useful, and it’s a good option when faulty code is running amok. However, even when I just started it felt like the SWF rate limiting was too trigger-happy. As a quick example – if I have a workflow that is setting a timer, and I start that workflow several hundreds of times, some workflow will fail setting the timer because of a rate limit. I then have to ask for a timer again and again until I succeed. I can’t even wait before asking for a timer again because, well, waiting means setting a timer… (to add insult to injury, the request to set the time is removed from the history, so I can’t really know exactly which timer failed)
For this reason when I’ve implemented exponential backoff between failures I didn’t use timers at all – I used a dummy activity task with a short schedule-to-start timeout. Activity tasks are not rate-limited per time (looking at the list again – this statement doesn’t look accurate, but that list wasn’t public at the time).
I just don’t get the point. The result isn’t better for Amazon or for the programmers. I understand the motive behind rate limiting, but it should be better tuned.

SWF Monitoring API

The API used for searching workflows is very limiting. A few examples:

  • Find all workflows of type Customer Request – Supported.
  • Find all failed workflows – Supported.
  • Find all failed workflows of type Customer Request – Not supported.
  • Find all workflows that used the “Send SMS” activity task – Nope.
  • Find the 6 workflows where the “Send SMS” activity task timed out – No way, not even close.

This can get frustrating. CloudWatch can happily report 406 workflows used the “Send SMS” activity task between 13:00 and 13:05, and 4 activity tasks failed. There is no way of finding these workflows.
So sure, it isn’t difficult to implement it myself (we do have logs), but a feature like this is missing.

The AWS Console

The AWS management console in poor in general. The UI is dated and riddled with small bugs and oversights: JavaScript based links do not allow middle-clicking, bugs when the workflow history is too big, or missing links where they are obvious, like clicking on RunId of parent or child workflow, number of decision task should link to that decision, link from queue name can count of pending tasks, etc.
And of course, the console is using the API, so everything the API cannot do, the console can’t either.
Working with the console leaves a lot to be desired.

Community

There is virtually no noteworthy discussion on SWF. I’m not sure that’s important.

Conclusion

While SWF has its quirks, I am confident and happy with our solution.