Two Years with Amazon Simple Workflow (SWF)


June 12 marks two years of us using Amazon Simple Workflow Service (SWF) in production, and I thought I’d share the experience.

First, let’s get this out of the way:

What is SWF not?

  • SWF does not execute any code.
  • SWF does not contain the logic of the workflow.
  • SWF does not allow you to draw a workflow or a state machine.

So what is it?

SWF is a web service that keeps the state of your workflow.
That’s pretty much it.

What are we using it for?

Our project is based on C#. We are using the AWS API directly (via the .NET SDK).
If you are using Java or Ruby, Amazon provides a higher-level library for SWF called the Flow Framework. For C#, I wrote what I needed myself, or simply used the “low level” API.
Our project processes a large number of files daily, and it was my task to convert our previous batch-based solution to SWF.

How does it work?

SWF is based on polling. Your code runs on your machines, on AWS or on-premises – it doesn’t matter. Your code polls the SWF API for tasks (which wait in queues), receives a task, executes it, and sends the result back to the SWF API.
SWF then issues new tasks to your code, and keeps the history of the workflow (its state).
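For a sense of what this looks like, here is a minimal sketch of an activity-worker polling loop using the AWS SDK for .NET (the domain, task list name, and DoWork are placeholders of mine; error handling and cancellation are omitted):

using System.Threading.Tasks;
using Amazon.SimpleWorkflow;
using Amazon.SimpleWorkflow.Model;

// A minimal activity-worker loop: poll, execute, report back.
static async Task RunWorkerLoop(IAmazonSimpleWorkflow swf)
{
    while (true)
    {
        // Long poll: holds the connection for up to 60 seconds, and returns
        // an empty task if nothing was queued in that window.
        var response = await swf.PollForActivityTaskAsync(new PollForActivityTaskRequest
        {
            Domain = "my-domain",                             // placeholder
            TaskList = new TaskList { Name = "my-task-list" } // placeholder
        });

        var task = response.ActivityTask;
        if (string.IsNullOrEmpty(task?.TaskToken))
            continue; // the poll timed out with no work

        string result = DoWork(task.Input); // placeholder – your code, not SWF’s

        await swf.RespondActivityTaskCompletedAsync(new RespondActivityTaskCompletedRequest
        {
            TaskToken = task.TaskToken,
            Result = result
        });
    }
}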

If you’ve read any of the documentation, you probably know there are two kinds of tasks: Activity Tasks (processed by workers) and Decision Tasks (processed by the Decider). This API naturally encourages and leads you to a nice design of your software, where different components do different things.

Workers

Workers handle Activity Tasks.
Workers are simple components that actually do the work of the workflow. They are the building blocks of the workflow, and each typically does one simple thing:

  • Take an S3 path as input and calculate the hash of the file.
  • Add a row to the database.
  • Send an email.
  • Take an S3 path to an image and create a thumbnail.

All of my workers implement a simple interface:

using System.Threading.Tasks;

// A worker turns one task's input into its output, asynchronously.
public interface IWorker<TInput, TOutput>
{
    Task<TOutput> Process(TInput input);
}

An important property of workers is that all the data a worker needs to perform its task is included in its input.
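As an illustration, the first worker from the list above might look roughly like this (a sketch only – the input and output types are hypothetical, and real code would add logging and cancellation):

using System;
using System.Security.Cryptography;
using System.Threading.Tasks;
using Amazon.S3;

// Hypothetical input/output types – note that everything the worker
// needs arrives in the input.
public record S3HashInput(string Bucket, string Key);
public record S3HashOutput(string Sha256Hex);

public class FileHashWorker : IWorker<S3HashInput, S3HashOutput>
{
    private readonly IAmazonS3 _s3 = new AmazonS3Client();

    public async Task<S3HashOutput> Process(S3HashInput input)
    {
        // Stream the object from S3 and hash it without buffering the whole file.
        using var response = await _s3.GetObjectAsync(input.Bucket, input.Key);
        using var sha = SHA256.Create();
        var hash = await sha.ComputeHashAsync(response.ResponseStream);
        return new S3HashOutput(Convert.ToHexString(hash));
    }
}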

The Decider

When I first read about SWF I had a concept of tiny workers and deciders working together like ants to achieve a greater goal. Would that it were so simple.
While workers are simple, each type of workflow has a decider whose loop looks like this:

  • Poll for a decision task.
  • Receive a decision task with all new events since the previous decision task.
  • Optionally load the entire workflow history to get context.
  • Make multiple decisions based on all new events.

For a simple linear workflow this isn’t a problem. The decider code is typically:

if workflow started
  Schedule activity task A
else if activity task A finished
  Schedule activity task B
else if activity task B finished
  Schedule activity task C
else if activity task C finished
  Complete workflow execution.
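Against the raw .NET API, that linear decider could be sketched like this (simplified: it ignores failures, timers, and history paging, and the helper names are mine):

using System;
using System.Collections.Generic;
using System.Linq;
using Amazon.SimpleWorkflow;
using Amazon.SimpleWorkflow.Model;

// Turn history events into decisions for a linear A -> B -> C workflow.
static List<Decision> Decide(List<HistoryEvent> history)
{
    var decisions = new List<Decision>();

    // Map scheduled-event IDs to activity names, so we can tell which task completed.
    var scheduled = history
        .Where(e => e.EventType == EventType.ActivityTaskScheduled)
        .ToDictionary(e => e.EventId,
                      e => e.ActivityTaskScheduledEventAttributes.ActivityType.Name);

    foreach (var e in history) // in practice, only the new events
    {
        if (e.EventType == EventType.WorkflowExecutionStarted)
        {
            decisions.Add(ScheduleActivity("A"));
        }
        else if (e.EventType == EventType.ActivityTaskCompleted)
        {
            var name = scheduled[e.ActivityTaskCompletedEventAttributes.ScheduledEventId];
            if (name == "A")      decisions.Add(ScheduleActivity("B"));
            else if (name == "B") decisions.Add(ScheduleActivity("C"));
            else decisions.Add(new Decision
            {
                DecisionType = DecisionType.CompleteWorkflowExecution,
                CompleteWorkflowExecutionDecisionAttributes =
                    new CompleteWorkflowExecutionDecisionAttributes()
            });
        }
    }
    return decisions;
}

static Decision ScheduleActivity(string name) => new Decision
{
    DecisionType = DecisionType.ScheduleActivityTask,
    ScheduleActivityTaskDecisionAttributes = new ScheduleActivityTaskDecisionAttributes
    {
        ActivityType = new ActivityType { Name = name, Version = "1" },
        ActivityId = Guid.NewGuid().ToString()
    }
};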

However, when the logic of the workflow is complicated, a decider may be required to handle an event like this:

Since the previous decision: Activity Task A completed successfully with this result, Activity Task B failed, we didn’t start the child workflow you requested because you were rate limited, oh, and that timer you set up yesterday finally went off.

This is a pretty complicated scenario. The Decider has just one chance to react to these events, and they all arrive at the same time. There are certainly many approaches here, but either way the decider is a hairy piece of code.

Code Design

As I’ve mentioned earlier, my workers are simple, and don’t use the SWF API directly – I have another class that wraps an IWorker. This is a big benefit, because any programmer can write a worker (without knowing anything about SWF), and because it is easy to reuse the code in any context. When a worker fails, I expect it to simply throw an exception – my wrapper class registers the exception as an activity task failure, as sketched below.
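The wrapper is roughly this shape (a simplified sketch – WorkerHost is my name for it here, and serialization is naively System.Text.Json):

using System;
using System.Text.Json;
using System.Threading.Tasks;
using Amazon.SimpleWorkflow;
using Amazon.SimpleWorkflow.Model;

// Runs one activity task through an IWorker and reports the outcome,
// so the worker itself never sees the SWF API.
public class WorkerHost<TInput, TOutput>
{
    private readonly IAmazonSimpleWorkflow _swf;
    private readonly IWorker<TInput, TOutput> _worker;

    public WorkerHost(IAmazonSimpleWorkflow swf, IWorker<TInput, TOutput> worker)
        => (_swf, _worker) = (swf, worker);

    public async Task Handle(ActivityTask task)
    {
        try
        {
            var input = JsonSerializer.Deserialize<TInput>(task.Input);
            var output = await _worker.Process(input);
            await _swf.RespondActivityTaskCompletedAsync(new RespondActivityTaskCompletedRequest
            {
                TaskToken = task.TaskToken,
                Result = JsonSerializer.Serialize(output)
            });
        }
        catch (Exception ex)
        {
            // A thrown exception becomes an activity task failure; the decider
            // sees it in the history and can decide to retry or fail the workflow.
            await _swf.RespondActivityTaskFailedAsync(new RespondActivityTaskFailedRequest
            {
                TaskToken = task.TaskToken,
                Reason = ex.GetType().Name,
                Details = ex.Message
            });
        }
    }
}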

To make writing complicated deciders easier, I’ve implemented helper functions to get the history of the workflow, parse it, and make new decisions. My decider is separated into a base class that uses the SWF API, and child classes (one for each workflow type) that accept the workflow history and return new decisions. My deciders do not connect to a database or any external resource, and have no side effects (except logs and the SWF API, of course). This allows me to easily unit-test the decider logic – I can record a workflow history at a certain point to JSON, feed it to the decider, and see what decisions it makes. I can also tweak the history to create more test cases easily. These tests are important to me because the decider can contain a lot of logic.
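A decider test then looks something like this (a sketch using xUnit and the Decide function sketched earlier; my real tests load recorded JSON histories instead of building them inline):

using System.Collections.Generic;
using Amazon.SimpleWorkflow;
using Amazon.SimpleWorkflow.Model;
using Xunit;

public class LinearDeciderTests
{
    [Fact]
    public void SchedulesActivityAWhenTheWorkflowStarts()
    {
        // In the real tests this history is deserialized from a recorded JSON file.
        var history = new List<HistoryEvent>
        {
            new HistoryEvent { EventId = 1, EventType = EventType.WorkflowExecutionStarted }
        };

        var decisions = Decide(history); // no SWF call is made anywhere in this test

        Assert.Single(decisions);
        Assert.Equal(DecisionType.ScheduleActivityTask, decisions[0].DecisionType);
    }
}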

Scalability

In either case, for deciders and for workers, I keep zero state in the class instance. All state comes from the workflow history in the decider’s case, and from the task input in the worker’s case. There is no communication between threads and no shared memory. This approach makes writing scalable programs trivial: there are no locks and no race conditions. I can have as many processes running on as many machines as I’d like, and it just works – there is no stage of discovery or balancing. As a proof of concept, I even ran some of the threads on Linux (using Mono), and it all worked seamlessly.
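Scaling out is then just starting more identical loops – for example (a sketch, reusing the RunWorkerLoop sketched earlier):

using System.Linq;
using System.Threading.Tasks;
using Amazon.SimpleWorkflow;

// Forty independent polling loops in one process; other machines simply run
// the same code against the same task lists. No coordination is needed.
static Task RunManyLoops(IAmazonSimpleWorkflow swf, int count = 40) =>
    Task.WhenAll(Enumerable.Range(0, count).Select(_ => RunWorkerLoop(swf)));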

Retries

The Flow Framework has built-in retries, but it only took me a few hours to implement retries for failed activity tasks, and a few more hours to add exponential backoff. This works nicely – the worker doesn’t know anything about it. The decider schedules another activity task, or fails the workflow. The retry waits a few minutes, and may run on another server. This has proven itself – many errors are easily resolved by a retry.
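In the decider, the backoff decision can be sketched like this (my names; I use a dummy activity instead of a timer – see the Rate Limiting section below for why):

using System;
using Amazon.SimpleWorkflow;
using Amazon.SimpleWorkflow.Model;

// Schedule a "NoOp" activity on a task list nobody polls. Its
// schedule-to-start timeout fires after the backoff delay, the timeout
// event wakes the decider, and the decider retries the real task.
static Decision BackoffViaDummyTask(int attempt)
{
    // Exponential backoff: 1, 2, 4, 8... minutes, capped.
    int delaySeconds = 60 * (1 << Math.Min(attempt, 6));
    return new Decision
    {
        DecisionType = DecisionType.ScheduleActivityTask,
        ScheduleActivityTaskDecisionAttributes = new ScheduleActivityTaskDecisionAttributes
        {
            ActivityType = new ActivityType { Name = "NoOp", Version = "1" }, // placeholder
            ActivityId = $"backoff-{attempt}-{Guid.NewGuid()}",
            TaskList = new TaskList { Name = "unpolled-backoff-list" },       // placeholder
            ScheduleToStartTimeout = delaySeconds.ToString() // timeouts are strings, in seconds
        }
    };
}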

Timeouts

SWF has many types of timeouts, and I decided early on that I would use them everywhere. Even on manual steps we have timeouts of a few days.
Timeouts are important. They are the only way the workflow can detect a stuck worker or a decider task lost because a process crashed. They also encourage you to think about your business process a little harder – what does it mean when it takes four days to handle a request? Can we do something automatically?
Another good (?) property of timeouts is that they can purge your queues when the volume gets too high for your solution.
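The timeouts are set per activity when it is scheduled, as duration strings in seconds – for example (the values here are arbitrary):

using Amazon.SimpleWorkflow.Model;

var attributes = new ScheduleActivityTaskDecisionAttributes
{
    ActivityType = new ActivityType { Name = "SendEmail", Version = "1" }, // placeholder
    ActivityId = "send-email-42",      // placeholder
    ScheduleToStartTimeout = "3600",   // 1 hour waiting in the queue
    StartToCloseTimeout = "300",       // 5 minutes once a worker picks it up
    ScheduleToCloseTimeout = "3900",   // total budget, end to end
    HeartbeatTimeout = "60"            // worker must report progress every minute
};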

Integration with other AWS services

Lambda

SWF can execute an AWS Lambda function instead of an activity task, which is a compelling idea. It saves the trouble of writing the worker and polling for tasks, and reduces the overhead of a large number of threads and open connections. All of the simple worker examples I gave above could be written as Lambda functions (except maybe adding a database row, depending on your database and architecture). The combination of Lambda’s serverless execution and SWF’s statefulness can make a robust, trivially scalable system.
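Scheduling a Lambda function is a single decision from the decider – something like this (the function name and input are placeholders):

using Amazon.SimpleWorkflow;
using Amazon.SimpleWorkflow.Model;

var decision = new Decision
{
    DecisionType = DecisionType.ScheduleLambdaFunction,
    ScheduleLambdaFunctionDecisionAttributes = new ScheduleLambdaFunctionDecisionAttributes
    {
        Id = "thumbnail-1",        // placeholder
        Name = "create-thumbnail", // placeholder Lambda function name
        Input = "{\"S3Path\":\"images/cat.jpg\"}",
        StartToCloseTimeout = "300"
    }
};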
But – while you can use Lambda to replace your workers, you still need to implement a decider that uses the API and runs as a process on your servers. This is a shame. Decision tasks are quick and self-contained, and deciders could easily be implemented as Lambda functions – if they didn’t have to poll for tasks.
I predict Amazon is going to add this feature: allowing AWS Lambda to act as a decider is a logical next step, and could make SWF a lot more appealing.

CloudWatch

CloudWatch shows accumulated metrics about your workflows and activity tasks. For example, this chart shows the server CPU (blue) and executions of an Activity Task (orange):
[Chart: CloudWatch – CPU and an Activity Task]
This is nice for seeing execution time and how the system handles large volumes. The downside is that while it shows accumulated data, there is no drill-down. I can clearly see that 25 “look for cats in images” workflows failed, but there is no way of actually seeing them. More on that below.

What can be better

Rate Limiting and Throttling

More specifically: limiting the number of operations per second. I don’t get SWF’s rate limiting. Mostly, rate limiting feels like this:
[Image: Little Britain – “Computer says no”]

I understand rate limiting can be useful, and it’s a good option when faulty code is running amok. However, even when I had just started, SWF’s rate limiting felt too trigger-happy. As a quick example – if I have a workflow that sets a timer, and I start that workflow several hundred times, some workflows will fail to set the timer because of a rate limit. I then have to ask for a timer again and again until I succeed. I can’t even wait before asking for a timer again because, well, waiting means setting a timer… (to add insult to injury, the request to set the timer is removed from the history, so I can’t even know exactly which timer failed)
For this reason, when I implemented exponential backoff between failures, I didn’t use timers at all – I used a dummy activity task with a short schedule-to-start timeout (the trick sketched in the Retries section above). Activity tasks are not rate-limited per time (looking at the list again – this statement doesn’t look accurate, but that list wasn’t public at the time).
I just don’t get the point. The result isn’t better for Amazon or for the programmers. I understand the motive behind rate limiting, but it should be better tuned.

SWF Monitoring API

The API used for searching workflows is very limited. A few examples:

  • Find all workflows of type Customer Request – Supported.
  • Find all failed workflows – Supported.
  • Find all failed workflows of type Customer Request – Not supported.
  • Find all workflows that used the “Send SMS” activity task – Nope.
  • Find the 6 workflows where the “Send SMS” activity task timed out – No way, not even close.

This can get frustrating. CloudWatch can happily report that 406 workflows used the “Send SMS” activity task between 13:00 and 13:05, and that 4 activity tasks failed – yet there is no way of finding these workflows.
So sure, it isn’t difficult to implement this myself (we do have logs), but a feature like this is missing.

The AWS Console

The AWS management console is poor in general. The UI is dated and riddled with small bugs and oversights: JavaScript-based links that don’t allow middle-clicking, bugs when the workflow history is too big, and missing links where they would be obvious – clicking the RunId of a parent or child workflow, linking a decision task’s number to that decision, linking a queue name to its count of pending tasks, etc.
And of course, the console uses the API, so anything the API cannot do, the console can’t do either.
Working with the console leaves a lot to be desired.

Community

There is virtually no noteworthy discussion of SWF anywhere. I’m not sure that’s important.

Conclusion

While SWF has its quirks, I am confident in our solution and happy with it.


16 thoughts on “Two Years with Amazon Simple Workflow (SWF)”

  1. We used SWF on my old team and had a terrible experience with it. Can you elaborate on why you chose SWF over MapReduce or some other parallel computing framework? I think that if the average execution time of a workflow is less than 5 minutes, then you can convert a decider almost directly into an MR job and get better performance.

    Here was our list of complaints:
    – Each team member needed an individual workflow domain.
    – Need to cancel local workflow from the UI if you killed the runners via the IDE.
    – Always needed to update the workflow version if an activity API changed.
    – Testing was ugly.
    – Bad activity scheduling with large inputs. If we tried to run 100x workflows to backfill, then activities would time out because there wasn’t enough hardware. MR doesn’t have this problem.
    – We ran into rate limits all the time, even after increasing the limit multiple times.

    Here is what we liked:
    – Activity retries. MR can only retry the entire workflow.
    – Graphical interface that shows activity failures. In MR one must rely on logs.

    • Our use case is parallel processing of multiple, unrelated files, not parallel computing of large tasks. We didn’t consider Map Reduce mainly because we were focused on .Net compatible frameworks. Alternatives to SWF would have been, for example, Windows Workflow Foundation, or a solution around a messaging framework (e.g. RabbitMQ).
      You are definitely right about the performance, but execution time is a secondary priority to us compared to robustness.
      I almost included it in the post – AWS requires at least 3 HTTPS requests for anything you want to do, which can become an overhead for small tasks.

      As for your pains – some of them are non-issues for us. Maybe you’ve used the Flow Framework which probably offers less flexibility, but here’s how we handle it:

      • Our system supports multiple environments per computer, one for each developer/QA. We have a prefix for each environment (used for workflow type and task list names) and just one domain for everything. I chose this approach because there is a limit of 100 domains (only enough for about 2 years for us). Our CI scripts set this prefix automatically.
      • Our scenario is processing a large number of unrelated files, and each file gets a new execution. If you killed a runner while debugging, which happens, you can just run a new file (common), or wait a few minutes until the task times out, and we retry it (rare).
      • I don’t use versions at all. Everything is version “1”.
        Breaking changes are only a problem with long-running workflows, and I solve it in two iterations (2 sprints). For example: Original: {"Minutes":5}, Hybrid: {"Minutes":5, "Seconds":300}, Refactored: {"Seconds":300}. This requires too much thinking, and sometimes I get it wrong. Tests help.
      • Testing is particularly elegant, and I have some of my best tests around SWF:
        • Workers are just regular classes. Easy to test.
        • Deciders are tested using serialized JSON. We can test the core logic of the system – for example: “Assert the workflow doesn’t fail if a row was not added to the database yet”.

        In both cases we don’t invoke SWF at all while testing (unit tests).

      Large input and processing time can be a problem. We mitigate it using very long timeouts for slow activities (16 hours schedule-to-start in one case). Thinking about timeouts and their implications can be a strength, and I tend to think of them as a positive mechanism in our flow.

      I’ve mentioned how I don’t get rate limiting, and feel it is too restrictive for a supposedly scalable framework.

  2. Hi Kobi,

    Great article. Just had a quick question. How do you scale the workers? Do you create many tiny EC2 instances? Or have one instance with multiple threads running the same worker action?

    • Thanks!

      We don’t have auto-scaling in place yet, but it is not expected to be difficult (?). The plan is to have 0 to n servers depending on current load.
      Currently we have a couple of EC2 instances running several processes, and each process runs many threads. In my case most threads are idle most of the time, so it would be a waste to have an instance for each worker. It also depends on the activity – some workers are waiting for an external resource like a database action or S3 file upload – these don’t take many local resources (almost no CPU or memory), and you can place many of them on the same machine.

      • Thanks a lot Kobi. Appreciate the quick reply. Made my day (or night). 🙂

        Your scaling is similar to what I was thinking, but I wasn’t sure, and getting a detailed reply from an engineer who has already gone through this is great. I’ll take this approach and see how it works.

        I just wish that AWS allows deploying .NET workers in Elastic Beanstalk, would be so much easier and easily scalable. But unfortunately, AWS doesn’t support that (yet).

        Thanks again!

  3. Dear Kobi,

    Any chance that you could share the code ?
    (base decider, helper function for history, base worker etc..)

    Cheers,
    Wim

    • Hi Wim!
      As you can imagine, the code isn’t mine, it belongs to my company. I’m not sure they’ll be willing to open source it.
      I did mention it in the past and they didn’t hate the idea. Maybe I’ll ask again, but either way it will take time (at least a few important people would need to approve something like that).

  4. Hi,

    Nice article.
    I had one question. I have a use-case that creates multiple workflows all the time, and SWF just starts them all. However, I want to restrict the count of running workflows, and I need this in FIFO fashion. I can do this via queueing, but would like to know if it is possible in SWF.
    One way is to create a waiting activity that just waits for current executions to finish; however, on completion all of them may be picked up, or it won’t respect the FIFO order (it will pick up any one). Let me know your thoughts.
    Thanks.

    • Hello Sakshi.

      I’m not sure this is the best place for these questions, but anyway, I can think of a few options:

      • Have a central workflow scheduling the smaller workflows.
      • Similar: Have each workflow start the next workflow.
      • Each workflow can signal the next workflow.
      • Use task priority. Note that priority applies between identical activity tasks, not whole workflows, and is not guaranteed to be accurate.
  5. Thanks for the article. How do you handle activities that require waiting? For instance, if you have limits on concurrency, or things like a maximum of 1 data pipeline instance? Would you name your workflow after a composite key of the thing that needs to be bottlenecked, and then somehow retry the workflow when it fails? Or how would you pause a workflow if another asynchronous workflow must happen first? (Like not processing daily data loads until a certain once-daily data load has completed.) Thanks

    • Hi Joe, thanks.
      Our workflows tend to focus on a large number of small tasks, not daily bulk operations – we moved away from that approach a few years ago because it caused bottlenecks. Still:

      1. I do tend to have predictable workflow execution IDs (composite keys) – this allows us to use signals, and make sure the same workflow doesn’t start twice.
      2. The workflow and its execution ID can be used as a synchronization lock between different workflows when you need one (I try very hard to model the code in a way that doesn’t require locking though).
      3. Scheduling is only partly handled by SWF, by using timers (and timeouts). Most of the scheduling, and finding the right “host” for your code, is left up to the developer. There are several options I can think of, for example:
        • Only running the synchronized task in a single thread.
        • Or similarly: only running the daily-bulk tasks at certain times, and the other tasks at other times. You can simply not call PollForActivityTask for the tasks you don’t want to run; they will wait for later.
        • Using multiple workflow executions, but using the same execution ID as a lock (as you’ve suggested).
        • Having a central long-running workflow that schedules a single task at a time, and communicate with it.
        • Implementing an external locking mechanism (though that would be working against SWF, not with it)

      Thanks,
      Kobi
