Two Years with Amazon Simple Workflow (SWF)


June 12 marks two years of us using Amazon Simple Workflow Service (SWF) in production, and I thought I’d share the experience.

First, let’s get this out of the way:

What is SWF not?

  • SWF does not execute any code.
  • SWF does not contain the logic of the workflow.
  • SWF does not allow you to draw a workflow or a state machine.

So what is it?

SWF is a web service that keeps the state of your workflow.
That’s pretty much it.

What are we using it for?

Our project is based on C#. We use the AWS API directly (via the .NET SDK).
If you are using Java or Ruby, Amazon provides a higher-level library for SWF called the Flow Framework. For C#, I wrote what I needed myself, or simply used the “low-level” API.
Our project processes a large number of files daily, and it was my task to convert our previous batch-based solution to SWF.

How does it work?

SWF is based on polling. Your code runs on your own machines – on AWS or on-premises, it doesn’t matter. Your code polls the SWF API for tasks (where they wait in queues), receives a task, executes it, and sends the result back to the SWF API.
SWF then issues new tasks to your code, and keeps the history of the workflow (its state).

If you’ve read any of the documentation, you probably know there are two kinds of tasks: Activity Tasks (processed by workers) and Decision Tasks (processed by the decider). This API naturally encourages and leads you to a nice design of your software, where different components do different things.


Workers handle Activity Tasks.
Workers are simple components that actually do the work of the workflow. These are the building blocks of the workflow, and typically do one simple thing:

  • Take an S3 path as input and calculate the hash of the file.
  • Add a row to the database.
  • Send an email.
  • Take an S3 path to an image and create a thumbnail.
  • …

All of my workers implement a simple interface:

public interface IWorker<TInput>
{
    Task Process(TInput input);
}

An important property of workers is that all the data a worker needs to perform its task is included in its input.
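As an illustration, here is what a worker of this shape might look like. This is a hypothetical, simplified stand-in: a real worker from the list above would download an S3 object and hash its content, but hashing the input string directly keeps the sketch self-contained and free of SWF/S3 plumbing:

```csharp
using System;
using System.Security.Cryptography;
using System.Text;
using System.Threading.Tasks;

public interface IWorker<TInput>   // the interface from above, repeated for completeness
{
    Task Process(TInput input);
}

// Hypothetical worker: everything it needs to do its job arrives in its input.
public class HashWorker : IWorker<string>
{
    public string LastHash { get; private set; }

    public Task Process(string input)
    {
        using (var sha = SHA256.Create())
        {
            byte[] digest = sha.ComputeHash(Encoding.UTF8.GetBytes(input));
            LastHash = BitConverter.ToString(digest).Replace("-", "").ToLowerInvariant();
        }
        return Task.CompletedTask;
    }
}
```

Because the worker depends only on its input and knows nothing of SWF, it is trivial to test and can run anywhere.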

The Decider

When I first read about SWF I had a concept of tiny workers and deciders working together like ants to achieve a greater goal. Would that it were so simple.
While workers are simple, each workflow type has a decider that repeatedly performs this sequence:

  • Poll for a decision task.
  • Receive a decision task with all new events since the previous decision task.
  • Optionally load the entire workflow history to get context.
  • Make multiple decisions based on all new events.

For a simple linear workflow this isn’t a problem. The decider code is typically:

if workflow started
  Schedule activity task A
else if activity task A finished
  Schedule activity task B
else if activity task B finished
  Schedule activity task C
else if activity task C finished
  Complete workflow execution.

However, when the logic of the workflow is complicated, a decider may be required to handle an event like this:

Since the previous decision: Activity Task A completed successfully with this result, Activity Task B failed, we couldn’t start the child workflow you requested because you were rate-limited, oh, and that timer you set up yesterday finally went off.

This is a pretty complicated scenario. The Decider has just one chance of reacting to these events, and they all come at the same time. There are certainly many approaches here, but either way the decider is a hairy piece of code.
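To make that concrete, here is a stripped-down sketch of the decision logic for the linear workflow above, extended with one retry path. Event and decision types are plain strings here – the real .NET SDK uses HistoryEvent and Decision objects with many attributes – so treat this as a model of the control flow, not working SWF code:

```csharp
using System.Collections.Generic;

public static class LinearDecider
{
    // All events since the previous decision task arrive together,
    // and the decider gets exactly one chance to react to each of them.
    public static List<string> Decide(IEnumerable<string> newEvents)
    {
        var decisions = new List<string>();
        foreach (var ev in newEvents)
        {
            switch (ev)
            {
                case "WorkflowExecutionStarted": decisions.Add("Schedule:A"); break;
                case "ActivityCompleted:A":      decisions.Add("Schedule:B"); break;
                case "ActivityCompleted:B":      decisions.Add("Schedule:C"); break;
                case "ActivityCompleted:C":      decisions.Add("CompleteWorkflowExecution"); break;
                case "ActivityFailed:B":         decisions.Add("StartTimer:RetryB"); break; // back off, then retry
                case "TimerFired:RetryB":        decisions.Add("Schedule:B"); break;
            }
        }
        return decisions; // one response to SWF can carry several decisions
    }
}
```

Even in this toy version you can see why a real decider gets hairy: every combination of simultaneous events has to produce a sensible set of decisions in a single pass.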

Code Design

As I’ve mentioned earlier, my workers are simple, and don’t use the SWF API directly – I have another class to wrap an IWorker. This is a big benefit because any programmer can write a worker (without knowing anything about SWF), and because it is easy to reuse the code in any context. When a worker fails I expect it to simply throw an exception – my wrapper class registers the exception as an activity task failure.
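The core of that wrapper can be sketched in a few lines. This is a hypothetical reduction: the real class would also poll for the task and deserialize its input, and would then report the result below to SWF via RespondActivityTaskCompleted or RespondActivityTaskFailed:

```csharp
using System;
using System.Threading.Tasks;

public interface IWorker<TInput>   // as defined earlier
{
    Task Process(TInput input);
}

public static class WorkerRunner
{
    // Runs a worker and translates the outcome into a completed/failed result,
    // so the worker itself never has to know about SWF.
    public static async Task<(bool Completed, string FailureReason)> RunAsync<T>(
        IWorker<T> worker, T input)
    {
        try
        {
            await worker.Process(input);
            return (true, null);
        }
        catch (Exception ex)
        {
            return (false, ex.Message); // reported to SWF as an activity task failure
        }
    }
}

// Demo worker that always fails, to show the translation.
public class FailingWorker : IWorker<string>
{
    public Task Process(string input) => throw new InvalidOperationException("boom");
}
```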

To make writing complicated deciders easier I’ve implemented helper functions to get the history of the workflow, parse it, and make new decisions. My decider is separated into a base class that uses the SWF API, and child classes (one per workflow type) that accept the workflow history and return new decisions. My deciders do not connect to a database or any external resource, and have no side effects (except logs and the SWF API, of course). This allows me to easily unit-test the decider logic – I can record a workflow history at a certain point to JSON, feed it to the decider, and see what decisions it makes. I can also tweak the history to make more test cases easily. These tests are important to me because the decider can contain a lot of logic.


In either case, for deciders and for workers, I keep zero state in the class instance. All state comes from the workflow history in the decider’s case, and from the task input in the worker’s case. There is no communication between threads and no shared memory. This approach makes writing scalable programs trivial: there are no locks and no race conditions. I can have as many processes running on as many machines as I’d like, and it just works – there is no stage of discovery or balancing. As a proof of concept, I even ran some of the threads on Linux (using Mono), and it all worked seamlessly.


The Flow Framework has built-in retries, but it only took me a few hours to implement retries for failed activity tasks, and a few more hours to add exponential backoff. This works nicely – the worker doesn’t know anything. The decider schedules another activity task or fails the workflow. The retried task will wait a few minutes, and may run on another server. This approach has proven itself – many errors are easily resolved this way.
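The backoff schedule itself is a one-liner to compute; the interesting part is where the delay lives (a timer, or a dummy activity’s schedule-to-start timeout, as described further down). A sketch, with made-up base and cap values rather than the ones we actually use:

```csharp
using System;

public static class Backoff
{
    // Hypothetical policy: 30s, 60s, 120s, ... doubling per attempt, capped at one hour.
    public static TimeSpan Delay(int attempt, double baseSeconds = 30, double capSeconds = 3600)
    {
        double seconds = baseSeconds * Math.Pow(2, attempt);
        return TimeSpan.FromSeconds(Math.Min(seconds, capSeconds));
    }
}
```

SWF expresses these durations as strings of whole seconds, so the decider would place the computed delay into the relevant timeout attribute when scheduling the retry.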


SWF has many types of timeouts, and I’ve decided early on that I would use them everywhere. Even on manual steps we have timeouts of a few days.
Timeouts are important. They are the only way the workflow can detect a stuck worker or decider task when a process crashes. They also encourage you to think about your business process a little harder – what does it mean when it takes four days to handle a request? Can we do something automatically?
Another good (?) property of timeouts is that timeouts can purge your queues when the volume gets too high for your solution.

Integration with other AWS services


SWF can execute an AWS Lambda function instead of an activity task, which is a compelling idea. It saves the trouble of writing the worker and polling for tasks, and reduces the overhead of a large number of threads and open connections. All of the simple worker examples I gave above could be written as Lambda functions (except maybe adding a database row, depending on your database and architecture). The combination of Lambda’s serverless execution and SWF’s statefulness can make a robust, trivially scalable system.
But – while you can use Lambda to replace your workers, you still need to implement a decider that uses the API and runs as a process on your servers. This is a shame. Decision tasks are quick and self-contained, and deciders could easily be implemented as Lambda functions – if they didn’t have to poll for tasks.
I predict Amazon are going to add this feature: allowing AWS Lambda to work as a decider is a logical next step, and can make SWF a lot more appealing.


CloudWatch shows aggregated metrics about your workflows and activity tasks. For example, this chart shows server CPU (blue) against executions of an activity task (orange):
CloudWatch - CPU and an Activity Task
This is nice for seeing execution time and how the system handles large volumes. The downside is that while it shows aggregated data, there is no drill-down. I can clearly see that 25 “look for cats in images” workflows failed, but there is no way of actually seeing them. More on that below.

What can be better

Rate Limiting and Throttling

More specifically, limiting the number of operations per second. I don’t get rate limiting. Mostly, rate limiting feels like this:
Little Britain - Computer Says No

I understand rate limiting can be useful, and it’s a good option when faulty code is running amok. However, even when I had just started, the SWF rate limiting felt too trigger-happy. As a quick example – if I have a workflow that sets a timer, and I start that workflow several hundred times, some workflows will fail to set the timer because of the rate limit. I then have to ask for a timer again and again until I succeed. I can’t even wait before asking for a timer again because, well, waiting means setting a timer… (To add insult to injury, the request to set the timer is removed from the history, so I can’t really know which timer failed.)
For this reason when I’ve implemented exponential backoff between failures I didn’t use timers at all – I used a dummy activity task with a short schedule-to-start timeout. Activity tasks are not rate-limited per time (looking at the list again – this statement doesn’t look accurate, but that list wasn’t public at the time).
I just don’t get the point. The result isn’t better for Amazon or for the programmers. I understand the motive behind rate limiting, but it should be better tuned.

SWF Monitoring API

The API used for searching workflows is very limited. A few examples:

  • Find all workflows of type Customer Request – Supported.
  • Find all failed workflows – Supported.
  • Find all failed workflows of type Customer Request – Not supported.
  • Find all workflows that used the “Send SMS” activity task – Nope.
  • Find the 6 workflows where the “Send SMS” activity task timed out – No way, not even close.

This can get frustrating. CloudWatch can happily report that 406 workflows used the “Send SMS” activity task between 13:00 and 13:05, and that 4 of those activity tasks failed – but there is no way of finding those workflows.
So sure, it isn’t difficult to implement it myself (we do have logs), but a feature like this is missing.

The AWS Console

The AWS management console is poor in general. The UI is dated and riddled with small bugs and oversights: JavaScript-based links that don’t allow middle-clicking, bugs when the workflow history is too big, and missing links where they would be obvious – clicking the RunId of a parent or child workflow, the number of a decision task linking to that decision, a queue name linking to its count of pending tasks, etc.
And of course, the console is using the API, so everything the API cannot do, the console can’t either.
Working with the console leaves a lot to be desired.


There is virtually no noteworthy discussion of SWF online. I’m not sure that’s important.


While SWF has its quirks, I am confident and happy with our solution.

2018 Update

An important comment is that SWF doesn’t seem to be in active development. From the FAQs – When should I use Amazon SWF vs. AWS Step Functions?:

AWS customers should consider using Step Functions for new applications. If Step Functions does not fit your needs, then you should consider Amazon Simple Workflow (SWF).

AWS will continue to provide the Amazon SWF service, Flow framework, and support all Amazon SWF customers.

So it still works, and our code still works, but SWF is not getting any new features. This is certainly something to consider when choosing a major component of your system.

What is better in 2018 is visibility into rate limits: there are CloudWatch metrics that show your limit, usage, and throttled events, and there is a structured support form for requesting rate-limit increases.


31 thoughts on “Two Years with Amazon Simple Workflow (SWF)”

  1. We used SWF on my old team and had a terrible experience with it. Can you elaborate on why you chose SWF over MapReduce or some other parallel computing framework? I think that if the average execution time of a workflow is less than 5 minutes, then you can almost directly convert a decider into an MR job and have better performance.

    Here was our list of complaints:
    – Each team member needed an individual workflow domain.
    – Need to cancel local workflow from the UI if you killed the runners via the IDE.
    – Always needed to update the workflow version if an activity API changed.
    – Testing was ugly.
    – Bad activity scheduling with large inputs. If we tried to run 100x workflows to backfill, activities would time out because there wasn’t enough hardware. MR doesn’t have this problem.
    – We ran into rate limits all the time, even after increasing the limit multiple times.

    Here is what we liked:
    – Activity retries. MR can only retry the entire workflow.
    – Graphical interface that shows activity failures. In MR one must rely on logs.

    • Our use case is parallel processing of multiple, unrelated files, not parallel computing of large tasks. We didn’t consider Map Reduce mainly because we were focused on .Net compatible frameworks. Alternatives to SWF would have been, for example, Windows Workflow Foundation, or a solution around a messaging framework (e.g. RabbitMQ).
      You are definitely right about the performance, but execution time is a secondary priority to us compared to robustness.
      I almost included it in the post – AWS requires at least 3 HTTPS requests for anything you want to do, which can become an overhead for small tasks.

      As for your pains – some of them are non-issues for us. Maybe you’ve used the Flow Framework which probably offers less flexibility, but here’s how we handle it:

      • Our system supports multiple environments per computer, one for each developer/QA. We have a prefix for each environment (used for workflow type and task list names) and just one domain for everything. I chose this approach because there is a limit of 100 domains (enough for only about 2 years for us). Our CI scripts set this prefix automatically.
      • Our scenario is processing a large number of unrelated files, and each file gets a new execution. If you killed a runner while debugging, which happens, you can just run a new file (common), or wait a few minutes until the task times out, and we retry it (rare).
      • I don’t use versions at all. Everything is version “1”.
        Breaking changes are only a problem with long-running workflows, and I solve them in two iterations (2 sprints). For example – Original: {"Minutes":5}, Hybrid: {"Minutes":5, "Seconds":300}, Refactored: {"Seconds":300}. This requires too much thinking, and sometimes I get it wrong. Tests help.
      • Testing is particularly elegant, and I have some of my best tests around SWF:
        • Workers are just regular classes. Easy to test.
        • Deciders are tested using serialized JSON. We can test the core logic of the system – for example: “Assert the workflow doesn’t fail if a row was not added to the database yet”.

        In both cases we don’t invoke SWF at all while testing (unit tests).

      Large input and processing time can be a problem. We mitigate it using very long timeouts for slow activities (16 hours schedule-to-start in one case). Thinking about timeouts and their implications can be a strength, and I tend to think of them as a positive mechanism in our flow.

      I’ve mentioned how I don’t get rate limiting, and feel it is too restrictive for a supposedly scalable framework.

  2. Hi Kobi,

    Great article. Just had a quick question: how do you scale the workers? Do you create many tiny EC2 instances, or have one instance with multiple threads running the same worker action?

    • Thanks!

      We don’t have auto-scaling in place yet, but it is not expected to be difficult (?). The plan is to have 0 to n servers depending on current load.
      Currently we have a couple of EC2 instances running several processes, with each process running many threads. In my case most threads are idle most of the time, so it would be a waste to have an instance per worker. It also depends on the activity – some workers wait on an external resource like a database action or an S3 file upload; they consume almost no local resources (CPU or memory), and you can place many of them on the same machine.

      • Thanks a lot Kobi. Appreciate the quick reply. Made my day (or night). 🙂

        Your scaling is similar to what I was thinking, but wasn’t sure and to get a detailed reply from an engineer who has already gone thru’ this technique is great. I’ll take this approach and see how it works.

        I just wish that AWS allows deploying .NET workers in Elastic Beanstalk, would be so much easier and easily scalable. But unfortunately, AWS doesn’t support that (yet).

        Thanks again!

  3. Dear Kobi,

    Any chance that you could share the code ?
    (base decider, helper function for history, base worker etc..)


    • Hi Wim!
      As you can imagine, the code isn’t mine, it belongs to my company. I’m not sure they’ll be willing to open source it.
      I did mention it in the past and they didn’t hate the idea. Maybe I’ll ask again, but either way it will take time (at least a few important people need to approve something like that).

  4. Hi,

    Nice article.
    I had one question. I have a use-case that creates multiple workflows all the time, and SWF just starts them all. However, I want to restrict the count of running workflows, and I need this in FIFO fashion. I could do this via a queue, but would like to know if it is possible in SWF.
    One way is to create a waiting activity that just waits for current executions to finish; however, on completion all of them may be picked up, or FIFO order won’t be respected (any one may be picked). Let me know your thoughts.

    • Hello Sakshi.

      I’m not sure this is the best place for these questions, but anyway, I can think of a few options:

      • Have a central workflow scheduling the smaller workflows.
      • Similar: Have each workflow start the next workflow.
      • Each workflow can signal the next workflow.
      • Use task priority. Note that priority applies between identical activity tasks, not whole workflows, and is not guaranteed to be strict.
  5. Thanks for the article. How do you handle activities that require waiting? For instance, if you have limits on concurrency or things like a maximum data pipeline instance of 1? Would you name your workflow after a compositekey of the thing that needs to be bottlenecked and then somehow retry the workflow when it fails? Or how would you pause a workflow if another asynchronous workflow must happen? (Like don’t process daily data loads until a certain once daily data load has completed.) Thanks

    • Hi Joe, thanks.
      Our workflows tend to focus on a large number of small tasks, not daily bulk operations – we moved away from that approach a few years ago because it caused bottlenecks. Still:

      1. I do tend to have predictable workflow execution IDs (composite keys) – this allows us to use signals, and make sure the same workflow doesn’t start twice.
      2. The workflow and its execution ID can be used as a synchronization lock between different workflows when you need one (I try very hard to model the code in a way that doesn’t require locking though).
      3. Scheduling is only partly handled by SWF, by using timers (and timeouts). Most of the scheduling, and finding the right “host” for your code, is left up to the developer. There are several options I can think of, for example:
        • Only running the synchronized task in a single thread.
        • Or similarly: only running the daily-bulk tasks at a certain time, and the other tasks on another times. You can simply not call PollForActivityTask for the tasks you don’t want to run, they will wait for later.
        • Using multiple workflow executions, but using the same execution ID as a lock (as you’ve suggested).
        • Having a central long-running workflow that schedules a single task at a time, and communicate with it.
        • Implementing an external locking mechanism (though that would be working against SWF, not with it)


  6. Hi Kobi,

    Your article is really helpful. I am also working on an application using SWF. My current concern is we have a long running workflow which usually takes 2-3 hours to complete. I am worried what happens if we need to change the activity version if the activity API is changed. What would happen to the activities already queued with the older version?
    Similarly if I change the workflow version, how should I handle when the new version is getting deployed?

    • Hello Vishakha, Thanks,
      If you are using SWF directly, you have some flexibility here:

      I don’t think the version matters that much (except for monitoring), because you can’t poll for a specific version either way. We use tests to ensure compatibility between different versions. The model used for JSON needs some thinking – for example you may want to have an intermediate version that supports both the new and old API. For example, when renaming a property, keep the old property as an alias.

  7. Hi Kobi,

    I was evaluating SWF for implementing all the long-running jobs behind our REST API. We have quite a number of such APIs, which are async and take time to complete.

    I need to provide a framework where not everyone has to know the details of SWF – anyone should be able to move these sync APIs to SWF.

    I was thinking about taking this approach.

    Have only one workflow -> with one activity(Job job) -> (This activity is the whole workflow which we want to move from sync REST to SWF) -> process it -> store the result back to SWF or may be to some storage.

    Here others do not need to worry about the internals; they just need to provide their Job implementation and use WorkflowClient to submit the Job to SWF, where it will be pulled by the WorkflowWorker.

    So my question is: do you think this is an acceptable or right approach?

    Hope to hear soon from you.

    • Hello!
      This can definitely work, and I’ve also done something similar in one case. The benefit is that it’s easier for anyone to create their own workflow, even without learning how SWF works or using its API. There are considerable downsides, like losing the ability to monitor individual activity tasks, retry logic, or the queuing of slow activities.

      • Sorry I was on leave for couple of weeks. Thank you for your quick reply.

        Yes, we are aware that we will be losing a considerable amount of capability if we follow this approach.

        For long-queue jobs, I thought of using the retry capability of SWF to overcome the issue:

        We have some jobs which take around 12 hours to complete (because we hit some API and wait for an external service to finish), so the approach I thought I would take is: save the intermediate state of the job and use retry (a capability of SWF) to re-queue it.
        The job will keep throwing RetryException until it is finished or failed, and SWF will retry for as long as it is configured to. So I hope I won’t be starving the jobs waiting in the queue to be processed.

        I hope I am not abusing SWF with this approach 🙂

  8. Hey Kobi,
    I will appreciate your feedback on my just-released open source C#/.NET library, Guflow. It allows seamless development of workflows and activities with Amazon SWF. A look at some of its examples should give you an idea of its capabilities; Guflow also comes with documentation and a tutorial.

  9. This is very instructive.

    Could you give a little more detail on how you serialize the decider history for testing? What is the type of data that you are recording and playing back?

    • Thanks. This is pretty straightforward – I take a Decision object and serialize it to JSON using Newtonsoft Json.NET, and save it to a file. Then I can deserialize it and run the decider logic on the same Decision object (without sending the result to SWF, of course).
      The AWS SDK has weird enum-like static values that don’t work well with Jil (another popular serializer), but Json.NET seems to support them somehow.

  10. Hi Kobi,

    I have a question regarding scaling a decider. Specifically, do you have a separate fleet/EC2 instance dedicated to handling decider events, or are the activity tasks and decider running on the same EC2 instance? I’m currently working on a cryptocurrency trading bot and plan to use SWF for the main orchestration. In my current design I’m leaning towards separating deciders from activity tasks because, in general, decoupling the two would decrease the blast radius when changes are made, and allow more flexibility if I need to scale them separately.


    • Hello Sim!
      There are several points here.
      First – functionally, I have the flexibility to run them on the same or on different instances; it doesn’t really matter. Depending on your workflows, the decider uses some CPU for loading the history and parsing JSON, but this isn’t a big deal and wouldn’t prevent sharing an instance.
      I never saw a need to scale the deciders – they’re quick enough anyway, and the rate limit discourages parallelism and speed to a degree. I never have decision tasks waiting in queues.
      There is inherent coupling between deciders and activity tasks – it is the decider’s job to build input for the activity task, and then understand its output. A change in the format of the activity task requires a change in the decider.

  11. Hi there,
    We’re thinking of building a service with AWS SWF as its backend. Temporal is likely not offering us their managed cloud service so we have no choice.

    We have two concerns:
    1. Is there any risk of deprecation or killing of the service in the next 4 years?
    2. How easy is it to increase quotas?

    • Hi Abnik!

      As far as I understand SWF still works, but it was effectively deprecated five years ago (even if this was never done officially). Nobody can tell you what the plans are four years into the future – AWS will probably not commit either – but it doesn’t look like they’re killing it. Still, there’s a risk of SWF not getting the latest tooling – e.g. PowerShell didn’t have SWF commands last time I checked, and other tools probably made the same decision.

      Increasing quotas was easy last time I did it (4 years ago – I have left that company since then); there’s a form.

      If you already have an AWS account, you should also try the support tickets – they work very well and can connect you to a support expert.
