About Kobi

Hello, I'm Kobi. I'm 35 years old and live in Israel with my wife Suzi. I work as a .Net developer at Orbograph, mainly using ASP.Net MVC. Before that I worked at HP as a SharePoint developer.

First impressions of Amazon Kendra, AWS’s new Enterprise Search Engine

I did a quick hackathon proof of concept using Amazon Kendra, AWS’s new service launched in May – an enterprise search engine (more on that later) that uses natural language.

We’re using Confluence wiki for internal documentation. People are encouraged to participate, and we end up having thousands of pages. When it comes to looking for information, the search is… well…

Google auto-complete says it's bad.

My goal for the three-day hackathon was to see if Amazon Kendra can beat the Confluence search.

The good

It works, it finds quality results, and it’s quick to set up. Kendra is using natural language both for extracting information from documents, and for parsing the search query and finding results. I was impressed that it knew how to surface the high-quality pages out of the 7,500 pages it indexed, and in many cases could highlight exact answers inside these documents. Kendra can answer questions like “what offices do we have in tokyo?“, “what is mongodb?“, and can handle jargon like “define {thing}” or even “what team owns {internal-tool}?” – all of this was picked up from our documentation.

I used the S3 interface for loading documents into Kendra. Each PDF file is paired with an optional metadata JSON file, which is useful for setting the category and the source URL.
It took me a little over a day to export 7,500 PDFs from Confluence – the slowest part of my project. A couple more hours went into generating the metadata files. Uploading all files to S3 was quick, and then it took about 30 minutes to create the Amazon Kendra Index, configure an S3 data source, and then 4 minutes to load the data. Once loaded, the data was ready for use. I used the AWS Console to configure Amazon Kendra, which also includes an example search form – more than enough to evaluate the results and demonstrate Kendra’s capabilities (they even supply React components).
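
For reference, a metadata file is a small JSON document that accompanies the PDF in the bucket. I’m reproducing the shape from memory, so treat the field names as a sketch and check the Kendra documentation – but _source_uri and _category are the attributes I found useful:

{
    "Title": "Offices and Facilities",
    "ContentType": "PDF",
    "Attributes": {
        "_source_uri": "https://wiki.example.com/display/IT/Offices",
        "_category": "IT"
    }
}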

Who’s the target audience? What’s special about enterprise search?

Large organizations have multiple knowledge management and sharing tools – wiki for technical documentations, internal blogs, training, chats (like Slack or Teams), documents (like Google Drive), and more. Each tool comes with its own search engine. A common problem is that employees don’t know where the data they need lives, so a unified search engine makes a lot of sense.

Kendra has another feature that is required for a company search engine and that you will never find in a public search engine like Google or DuckDuckGo – per-document permissions. Each user is expected to be authenticated, and to find only the documents they are allowed to see. (At my first job I took part in a project that introduced a company search engine, only to shut it down on the same day – because suddenly many badly-kept secrets were readily available. The search worked – turns out the permissions had always been broken…)

Finally, the pricing and quotas – with an enterprise edition starting at $5,000 a month and limited to 40,000 daily searches – only make sense for large organizations with a predictable number of internal users.

What I’d like to see next

I worked with Kendra for a short while and didn’t dive in too deeply, but there are some features that would be more than nice to have:

  • Integration with industry-standard tools: For now, Kendra has connectors for “S3, SharePoint, Salesforce, ServiceNow, RDS databases, One Drive, and many more coming later this year“. Missing here are tools like Confluence or Google’s G Suite. This list is biased toward Microsoft services – no doubt targeting a certain kind of customer. It is also not clear how these connectors work when the data is on-premises, although a PrivateLink interface is provided if your data lives in AWS.
  • Support for comments, context, and hierarchy: Kendra can ingest whole documents and parse them well, but not all data is equal. It is missing support for any kind of link between documents. A comment that doesn’t repeat information from the page is meaningless on its own, and chats rely on the discussion around them. There is currently no way of modeling this in Kendra, and the pricing is not friendly to this use case either – a short comment counts the same as a full document. You can sort of get around it by including comments as part of the page (breaking direct links), but I doubt this would fit a tool like Slack. For comparison, Elasticsearch can model relations between documents.
  • Visibility into accuracy: Looking into query results, there is no indication of the confidence of the results. Was the query well understood? Did we find good answers or only poor matches? This data would enrich the results and allow more use cases (for example, a Slack bot that answers questions when the confidence is high – like they did for Lex). The closest thing here is the TopAnswer attribute.
  • Better fine-tuning: I was relieved I didn’t have to tweak any settings, or define taxonomy and stop words – steps that are not always easy or clear. Kendra does have settings for boosting documents based on fields, but if you need finer control, currently it isn’t there.
  • Planned features: Autocomplete, suggestions for related searches or corrections, and user feedback used for incremental learning.

Conclusion

It’s good! Results look promising, and it will be even better with multiple data sources. I’m going for it.

Popular numbers used in leaked passwords

Last time we saw some popular fictional passwords, and how they are used in Exploit.in – a collection of 739,861,478 leaked passwords. I wanted to take another look at some popular numbers and how they are used in passwords.

π

How many digits of Pi have you memorized? I’ve got 6 digits after the decimal: 3.141592. Not bad, but could be better. Let’s see how far the Exploit.in community takes it:
17,713 passwords have exactly 7 decimal places of Pi (1415926). 1,626 have 10 decimal digits. The record is a single password with 37 digits of Pi – 1415926535 8979323846 2643383279 502884179. The last two digits in that password are cunningly reversed (79 instead of 97), so unfortunately they don’t count.

Here’s a breakdown for the digits of Pi in passwords:

Math constants

Same as with Pi, I’m only considering digits after the decimal point:

Value Digits Passwords
e 7182818284 337
e 71828182845904523536 7
√2 4142135623 114
√3 7320508075 21
Phi 6180339887 86
Euler Mascheroni 5772156649 1

Didn’t make it: Conway’s constant, Khinchin’s constant, and the Glaisher–Kinkelin constant.

Fibonacci sequence

One advantage of the Fibonacci sequence is that if you forget your password you can always calculate it. 25,316 unforgettable passwords contain 8 Fibonacci numbers (1 through 21). Only 4 passwords make it all the way to the 15th number, 610:

Physics and Chemistry constants

Number Digits Passwords
Light Speed Meters/Second 299792458 1988
Light Speed Miles/Second 186282 298
Light Speed Kilometers/Second 299792 2561
Light Speed Miles/Hour 670616629 1
Avogadro 6.02×10²³ 6021023 259
Avogadro 6.022×10²³ 60221023 73

Avogadro passwords that make good use of special characters include m602x1023e and 602*1023.

Pop-culture numbers

Unsurprisingly, numbers that appear on TV or song lyrics make their appearance in passwords. I couldn’t find too many examples:

Music

Tommy Tutone 867-5309 / Jenny 8675309 (Exact password) 13894
Tommy Tutone 868-5309 / Jenny 8675310 (contain) 28231
Queens of the Stone Age Regular John 16278263789 7

LOST

40,430 passwords are exactly 4815162342. An additional 34,318 passwords contain 4815162342, and another 1,058 passwords have The Numbers with separators, for example L48O1516S2342T.

See also

For a more serious analysis of password trends and common patterns:

Speak Friend and Enter – Do people actually use movie passwords?

Can you think of a famous password, such as one used in a book or a movie? Do people see passwords being typed on screen and think “hey, now that is a pretty good password…”? I downloaded the Exploit.in combo list containing over seven hundred million leaked logins and tried to see how often that happens.

Famous passwords used in real leaked passwords:

The clear winner here is Mission Impossible, with 948 passwords containing AW96B6. A nice second is TrustNo1 from The X-Files, which appears in 629 passwords. Next we have 386 passwords containing Swordfish – a classic password from 1932 with its own Wikipedia page.
The safe combination from Inception, 528491 is used in 168 passwords, sometimes accompanied by character names like 528491c0bb or eames528491.

[Image: the server room scene from Mission Impossible]

“Damn it, Ethan, you don’t have to press Caps Lock for each letter…”

I thought many The Lord of the Rings fans would choose “Speak Friend” as their password – but not even one person did. The Elvish word for “friend”, Mellon, appears 107 times. We also have 6 YouShallNotPass, and even one SpamShallNotPass.
Finally, there are 11 My Preciouses, including My Cat Precious and My Precious Girl (which still count! We have to be fair toward other phrases that didn’t get this level of scrutiny).

Movie or Book Title Password Count
Mission Impossible AW96B6 948
The X-Files TrustNo1 629
Horse Feathers Swordfish 386
Die Hard Akagi (Whole word) 241
Inception 528491 (Exact number) 168
LotR Mellon 107
Arabian Nights Open Sesame 37
Watchmen Rameses 2 18
Casino Royale 836547 (Exact number) 14
LotR My Precious 11
LotR Shall Not Pass 7
Watchmen Rameses II 6
National Treasure Valley Forge 3
Who Framed Roger Rabbit Walt sent me 1
The Net Natoar23ae 1
Tron Reindeer Flotilla 0
Matrix Zion 0101 0
LotR Cannot Pass 0

Harry Potter

I didn’t plan on having so many Harry Potter passwords, but the good people at Wikia made it too easy – they have a page with dozens of passwords used in the books. I’ve split them into three types:

Harry Potter passwords that are also Latin or other known phrases or snacks:

Password Count
Acid Pops 377
Fizzy Pop 259
Baubles 241
Alea iacta est 179
Catweazle 135
Quid Agis 31

Harry Potter Pass-phrases

Password Count
Dumbledore 23
Balderdash 8
Studious Success 6
Tapeworm 3
Caput Draconis 2
Flibbertigibbet 2
Pig Snout 1
Fortuna Major 1

Passwords based on Harry Potter spells:

Spell Brief Description Count
Accio (Contained in password as a whole word) Fetch something. Reasonable password. 885
Accio (The whole password is just “accio” with no additional characters.) 555
Stupefy 223
Alohomora Open locks. Makes sense. 68
Imperio 35
Crucio 17
Avada Kedavra 15
Expelliarmus 9
Sectumsempra 5
Impedimenta 3
Reparo 3
Expecto Patronum 2
Confundus 1

No muggle has used any of the following passwords:
Wattlebird, Oddsbodikins, Scurvy Cur, Banana Fritters, Fairy Lights, Mimbulus Mimbletonia, Abstinence, Dilligrout, Sherbet lemon, Cockroach cluster, Fizzing Whizzbees, Toffee Eclairs, Chocolate Frogs, Dissendium, Slytherins are Supreme, Facta non verba, Sea Serpent, This Password is Absurd, Libraries Liberate, Dragon’s Egg, Light against Darkness, Chops and Gravy, Dashing Cadogan, Forget Me Never, Lunartickle, Surreptitiousness, Wanglewort, up to no good, nor Mischief Managed.

I see “This Password is Absurd” is up for grabs, so I’m calling it – this is my password now.

Technical Details

I found the combo list on Google. It took me about an hour before I knew the term “combo” or the name “Exploit.in”. I will not link to it from here, though.
The combo contains 111 files with lines in the format {email}:{password}, for example:

jane@example.com:12345
bob@example.com:pa$$word

From each line I’ve kept just the password and the TLD. I’ve loaded all the data into an SQLite database, and grouped identical rows – keeping the count, of course. Having the data in SQL made analyzing it very easy.
To automate some of the queries I used KNIME – not so much for its analytics, but mainly because it automatically saves results to disk.
[Image: the KNIME workflow used for the password queries]

Password comparisons were all case-insensitive, and multi-word passwords are filtered using SQL LIKE, as in where password like '%fizzy%pop%'.
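
For illustration, the core of the analysis looks roughly like this – the table and column names are made up, the real schema was whatever I happened to create:

-- Group identical rows, keeping the count.
CREATE TABLE grouped AS
SELECT password, tld, COUNT(*) AS cnt
FROM raw_logins
GROUP BY password, tld;

-- LIKE is case-insensitive for ASCII in SQLite, so this also matches FIZZY POP.
SELECT SUM(cnt) FROM grouped WHERE password LIKE '%fizzy%pop%';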

Bonus

2,289,587 Spaceballs fans use 12345 as the first number in their passwords. 601,874 have 12345 as their entire password.

Entity Framework 6 – Bulk Insert and Returning of Generated Primary Keys

Bit of a code dump today. I was tasked with making database insertions quicker. On a good day we have millions of rows, which we inserted using Entity Framework – inefficiently, one row at a time.

I started by working with the EFUtilities library, which looked very promising – it uses Entity Framework’s metadata and model to create a temporary table, bulk inserts into it, and copies the data to the real table. The following code expands the capabilities of the library:

  • Improved thread-safety.
  • A mode for ignoring failures – sometimes we want to insert millions of rows, and we don’t really care about a few invalid ones.
  • The main motivation – return the generated primary keys (or IDs) of the inserted rows. EFUtilities leaves all IDs as 0, which is very limiting.

Continue reading

Regular expression balancing groups – now also in Go!

Go comes with the terse regex engine re2, which should be very fast but is missing a lot of common regex features (like lookarounds). This is probably one of the reasons Doug Clark ported the .Net regex engine to Go, and it looks like he did a good job of it: the regexp2 library.

I did a quick test to see if my maze-solving regex works, and it did! This is impressive, especially considering none of my patterns ever worked on Mono until they merged in the .Net Framework implementation.

Installing the library, assuming you already have Go set up (“...” is correct despite looking like a mistake, by the way):

go get github.com/dlclark/regexp2/...

My first ever Go program:

package main

import (
	"fmt"

	"github.com/dlclark/regexp2"
)

/*
This Go program is using the library Regex2 by Doug Clark ( https://github.com/dlclark ),
which is a port of the .Net regex engine to Go.
Install using:
go get github.com/dlclark/regexp2/...
*/

func solveMaze(maze string) {
	pattern := `
(?=(?<Pos>.*?(?<Step>S)))
\A
(?:
    # Find next position
    (?:
        (?<End>)
        (?<=               # Stop when Pos reached E.
            \A(?=\k<Pos>(?<=E))    
        .*)
        |
        (?<Right>)
        (?<=                
            \A              # Go back to start.
            (?=\k<Pos>      # Go to the current position.
                (?<Step>[ E])
                (?<=(?<NewPos>\A.*))   # Set new Position.
            )    
        .*)
        |
        (?<Left>)
        (?<=
            \A(?=\k<Pos>(?<=
                (?<Step>[ E])
                (?<=(?<NewPos>\A.*))   # Set new Position
                .                   # One step before the current POS
            ))    
        .*)
        |
        (?<Down>)
        (?<=
            \A(?=\k<Pos>
                (?>
                (?-s:                   # Dot should not match new lines
                    (?<=^(?<X>.)*.)       # Count characters to the left. -1 character for current position.
                    .*\n                  # Go to end of this line
                    (?<-X>.)*(?(X)(?!))   # Find same position as above character
                )
                )
                (?<Step>[ E])
                (?<=(?<NewPos>\A.*))   # Set new Position
            )    
        .*)
        |
        (?<Up>)
        (?<=
            \A(?=\k<Pos>
                (?<=        # Lookbehind - read pattern from bottom to top: http://stackoverflow.com/q/13389560/7586
                    (?-s:                   # Dot does not match new lines
                        (?=
                            (?>(?<-X>.)*(?(X)(?!)))  # Find same position as lower character
                            (?<Step>[ E])
                            (?s:(?<=(?<NewPos>\A.*)))   # Set new Position
                        )
                        (?>
                        ^.*\n              # Go to start of the previous line
                        ^(?<X>.)*.         # Count characters to the left. -1 character for current position.
                        )
                    )
                )           # End of Lookbehind
            )    
        .*)
    )
. # progress the match.
    # Check the new position is unique - it wasn't matched before.
    # None of the other positions are equal to the new position.
    (?(End)|
        (?!(?!      # Rollback all changes after matching (Reset Pos)
            (?<=\A
                (?:
                    (?<=\A
                        (?=
                            \k<Pos>             # Move to Pos
                            (?<!\k<NewPos>)     # Check the Pos is different from NewPos
                            (?<-Pos>)           # Pop Pos.
                        )
                    .*)
                .)*
                . # one extra for the new position
            )
        ))
        # Push NewPos to Pos
        (?<=\A
            (?=
                (?<Pos>\k<NewPos>) # Set NewPos and Temp to Pos
                (?<-NewPos>)       # Pop NewPos. No real reason.
            )
        .*)
    )
)+?
(?(End)|(?!))
`

	//re := regexp2.MustCompile(`(?<A>..)(?<-A>..)(?<-A>.)`, 0)
	re := regexp2.MustCompile(pattern, regexp2.IgnorePatternWhitespace|regexp2.Multiline|regexp2.Singleline)
	if isMatch, _ := re.MatchString(maze); isMatch {
		//do something
		fmt.Println("Solvable!")
	} else {
		fmt.Println("Nope.")
	}
}

func testMazes() {
	goodMazes := []string{
		`
██████████
█S█      █
█ █ █ █ ██
█   █ █ E█
██████████`}

	badMazes := []string{
		`
██████████
█S█      █
█ ███ █ ██
█   █ █ E█
██████████`}

	fmt.Println("These mazes are good:")
	for _, maze := range goodMazes {
		solveMaze(maze)
	}
	fmt.Println("Bad Mazes:")
	for _, maze := range badMazes {
		solveMaze(maze)
	}
}

func main() {
	testMazes()
}

run:

go run RegexMaze.go

and output (with a few more tests):

These mazes are good:
Solvable!
Solvable!
Solvable!
Bad Mazes:
Nope.
Nope.
Nope.

I copied the pattern to Go, and it just worked. Nice!

Getting Stack Overflow votes by choosing a cuter icon

I have posted 1,386 answers on Stack Overflow. That’s quite a lot.
A few of my answers are popular – they pop up on Google as top results. Some of these get voted on every day. Even when I am not active on Stack Overflow, I still get a few dozen reputation points almost daily. Lucky me!
Recently I got an idea – are people more likely to vote if they like the user icon? Can changing my icon get me more votes each day?
Currently I have over 93K reputation points – would I have gotten to 100,000 already had I used a different icon all these years?
I checked.

I picked my 30 most popular answers and tracked the votes they were getting each hour. I also tracked how many views and votes their questions were getting.
Each week on Sunday I changed my user icon. Here are the icons I used:

  • The flag of the United States of America. The greatest country in the world.
  • The flag of India.
  • A photo of a plush fox from Japan.
  • My wife Suzi wearing a hat. She volunteered.
  • Me and my regular icon. Looking at it again, it is a little over-exposed, and I look a little pale.

Results

[Chart: total votes per icon]
Both flags performed poorly, while the Fox and Suzi Wearing a Hat did best.
It also makes sense to look at how many views we got each day:
[Charts: views per day and votes per view]
The number of views is similar each week. Still, relative to the total number of views, it looks like Fox performed a little better than Suzi Wearing a Hat. This trend also appears on a daily breakdown:
[Chart: answer votes per day]

Flags

I picked the flags of the USA and India mainly because of the time difference between them – there is almost no overlap in working hours between India and the USA. I wanted to see if the icon affected the times I was getting votes. That didn’t work:
The greatest nation in the world, and the biggest democracy in the world

Question Statistics

The question voting statistics weren’t constant either. This reflects poorly on the previous results – it is possible it is all random.
[Charts: question votes per view and question votes per day]

Is this a thing? Time for Round II!

I took the Fox and the USA flag to a second round, to see how they did:
[Charts: total votes and votes per view, round II]
Even after a month the fox and flag had consistent results!

Conclusion?

I think it worked! Changing the icon seems to affect how many votes I’m getting.
Thanks!

[Image: the fox icon on a mug]

Disclaimers

  1. Maybe the numbers are too small for a meaningful conclusion.
  2. I don’t even know enough statistics to write a Hello World program in R.
  3. There was an outlier during the second USA flag run – one question received a lot of views. It doesn’t affect the results much, even when zeroed out.
  4. USA Flag re-run got a down-vote. I didn’t count it.
  5. On 2017-06-01 the icon Kobi stole one vote from Suzi Wearing a Hat. Its score should be a little higher.
  6. All times suffer from an off-by-one error: each hour at X:17 I took statistics that belong to X-1:00. I don’t really care.

Bonus:
[Image: a Stack Overflow comment]

Haiku Camera: Take photo, hear a haiku. Using Reddit, AWS Rekognition, and Polly.

Three weeks ago Amazon had their annual AWS re:Invent event, where new products were launched. I wanted a quick way to test some new products, and picked something practical: Take a photo, understand what’s in the photo using Amazon Rekognition, find a relevant haiku on Reddit /r/haiku, and read it out loud using Amazon Polly. Why not.

Here’s a video demonstrating Haiku Camera:

I wrote a simple web application in ASP.Net MVC, using the AWS .Net SDK. Amazon moves quickly – the NuGet packages for all the new services were ready for use shortly after the announcement.

Amazon Rekognition

Rekognition’s DetectLabels can receive a photo and return a list of things it identified in the photo.
The API is straightforward and works nicely. While developing I tested it mostly with photos I took on my trip two years ago, and overall it did quite well, finding relevant keywords for every photo.
The code is fairly short (not handling the case of >5MB images):

public async Task<List<string>> DescribeImage(System.Drawing.Image image)
{
    using (var stream = new MemoryStream())
    {
        image.Save(stream, ImageFormat.Jpeg);
        stream.Seek(0, SeekOrigin.Begin);
        var rekognition = new AmazonRekognitionClient();
        var content = await rekognition.DetectLabelsAsync(
        new DetectLabelsRequest
        {
            MaxLabels = 5,
            Image = new Amazon.Rekognition.Model.Image
            {
                Bytes = stream,
            },
        });
        return content.Labels.Select(l => l.Name).ToList();
    }
}
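
For larger images the API expects a reference to an object in S3 instead of raw bytes. I didn’t need it here, but it would look something like this (the bucket and key are placeholders):

var content = await rekognition.DetectLabelsAsync(
new DetectLabelsRequest
{
    MaxLabels = 5,
    Image = new Amazon.Rekognition.Model.Image
    {
        // Over 5MB? Upload to S3 first and reference the object.
        S3Object = new Amazon.Rekognition.Model.S3Object
        {
            Bucket = "my-photos-bucket",
            Name = "photos/eiffel.jpg",
        },
    },
});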

Here are a few good examples using the Rekognition demo page (AWS account required):

As I’ve said, I had a specific goal – I wanted to find a haiku related to the photo. I limited the API to five keywords, assuming that would be enough, and focusing only on the most relevant features in the photo. I could have also used the confidence to remove less likely labels, but I chose not to bother.

After using enough test photos, I noticed I was getting a lot of the same keywords, for example:

It was obvious:

Unfortunately, as you probably already know, people

Horse ebooks, July 25, 2012

All of these photos have the keywords People, Person, and Human. Arguably, it is only useful in the photo of dancing people, where the people are really the subject. I search for a haiku based on all keywords, and people are a popular subject among haiku poets. People are spamming my results, and I keep getting the same haiku.
Additionally, the photos of the lion statue and Disneyland have exactly the same labels, adding Art, Sculpture, and Statue.

Confidence is not enough

Besides correctness, another issue is precision and specificity. Consider the results ["bird", "penguin"], or ["food", "sushi"]. It is clear to us, people-person-humans, that Bird ⊃ Penguin, and Food ⊃ Sushi – but how can we choose the more specific word automatically? If I’m using a black-box product like Amazon Rekognition, I probably don’t have the resources to build my own corpus. Furthermore, this data is clearly already contained in Rekognition – but it is not exposed in any way. This is the complementary use case – had I used Rekognition to tag many photos and build an index, and wanted to answer questions like “find all photos with [Bird]”, I would not have had this problem. The difficulty is in choosing the best label when describing the content of a single photo.
I did not test it, but AWS has a new chatbot service called Amazon Lex (currently in limited preview) – maybe a chatbot service could help choose the more specific words.

Technically correct

What’s in this photo?

Image source getyourguide.com.


Ask a few people, and chances are they’ll say The Eiffel Tower, or Paris.
Rekognition gives us {"Architecture", "Tower"}. Now, if a person gave you this answer there are two options: either they’re 3 and know what architecture is, or they have a superb sense of humor. And that’s really the problem: without proper names, Rekognition is being a smart-ass.

Rekognition – conclusion

Rekognition works well enough, and is easy to use. Its face recognition capabilities seem much more advanced than its modest image labeling and content identification.

What could be better:

  • A simple way to represent hierarchy between keywords.
  • Besides confidence, expose the relevance of the keyword. As reference, this is how Elasticsearch handles relevance:

    Inverse document frequency: How often does each term appear in the index? The more often, the less relevant. Terms that appear in many documents have a lower weight than more-uncommon terms

    This just makes sense. If you have "People" in 60% of the photos (not a real number), you can be confident that a photo will have people in it, but it would not be very interesting.

  • Relevance can also be influenced by the composition of the photo: is there a huge tower in the center and a few small people below it? Maybe the tower is more interesting. It would be nice if the API returned relative areas of the labels, or bounding-boxes where possible (also useful for cropping).
  • A few proper names would be nice.

It is fair to mention that this is a first version, and even a small update could make it much more useful for my use case.

Finding a haiku

This was the easy part. There are many collections of haiku on the internet. I chose Reddit /r/haiku because Reddit has a simple RSS API for content, a built-in search engine, and a huge variety of crazy creative haiku.

var keywords = String.Join(" OR ", subject.Select(s => $"({s})"));
var url = $"https://www.reddit.com/r/haiku/search.rss?q=title%3A({Uri.EscapeUriString(keywords)})&restrict_sr=on&sort=relevance&t=all";
var all = XDocument.Load(url).Descendants().Where(e => e.Name.LocalName == "title")
    .Select(e => e.Value).Where(h => h?.Count('/'.Equals) == 2).ToList();
// return a random haiku.

Using the keywords I build a URL for the search API. The filter looks like "title:((Dog) OR (Bear) OR (Giant Panda))".
If I used these haiku publicly or commercially (!?), I would have also checked the license, and would have extracted the author and a link to the haiku.

Amazon Polly

Another new service is Amazon Polly, which is a voice synthesizer: it accepts text and returns a spoken recording of that text.

public async Task CreateMp3(string text, string targetMp3FilePath)
{
    var polly = new AmazonPollyClient();
    var speak = await polly.SynthesizeSpeechAsync(new SynthesizeSpeechRequest
    {
        OutputFormat = OutputFormat.Mp3,
        Text = text,
        VoiceId = "Joanna",
    });
    using (var fileStream = File.Create(targetMp3FilePath))
    {
        speak.AudioStream.CopyTo(fileStream);
    }
}

Again, the code is simple, and Polly’s SynthesizeSpeech works easily. Polly has a variety of English voices and accents, including British, American, Australian, Welsh, and Indian, and I pick a random English one each time.
I am not the target audience, but I found the American and British voices to be clearer and of higher quality than the other English voices. (I mean in Amazon Polly, of course. Not in general.)
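
Picking the voice is a small piece of code. A sketch using Polly’s DescribeVoices – the list of language codes is mine and may not be complete:

private static readonly string[] EnglishLanguageCodes = { "en-US", "en-GB", "en-AU", "en-GB-WLS", "en-IN" };
private static readonly Random Rng = new Random();

public async Task<VoiceId> PickRandomEnglishVoice()
{
    var polly = new AmazonPollyClient();
    var voices = new List<Voice>();
    foreach (var code in EnglishLanguageCodes)
    {
        // DescribeVoices returns the voices available for a language code.
        var response = await polly.DescribeVoicesAsync(new DescribeVoicesRequest { LanguageCode = code });
        voices.AddRange(response.Voices);
    }
    return voices[Rng.Next(voices.Count)].Id;
}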

A minor challenge was to get the punctuation right. The poems on Reddit are presented as three lines separated by slashes and usually spaces. For example:

of all the virtues / patience is the single most / irritating one

This format is not well suited for Polly. Polly pretty much ignores a slash when it is surrounded by spaces, but reads a verbal “slash” when it is not surrounded by spaces (“warrior/poet”).
Haiku tend to have minimal punctuation, but in cases where the punctuation is there I prefer to keep it. When there is no punctuation at the end of a line I add commas and a period:

of all the virtues,
patience is the single most,
irritating one.

This is not ideal, but renders nicely in Polly. I add newlines just for presentation. Newlines are ignored by Polly, as far as I could see.
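
The whole normalization is just a few lines. A rough sketch (the method name is mine):

// Turn "line one / line two / line three" into text Polly reads nicely.
private static string PrepareForPolly(string redditTitle)
{
    var lines = redditTitle.Split('/')
        .Select(line => line.Trim())
        .Where(line => line.Length > 0)
        .ToList();

    for (int i = 0; i < lines.Count; i++)
    {
        // Keep existing punctuation; otherwise add a comma, or a period on the last line.
        if (!char.IsPunctuation(lines[i][lines[i].Length - 1]))
        {
            lines[i] += i == lines.Count - 1 ? "." : ",";
        }
    }

    // The newlines are only for presentation - Polly seems to ignore them.
    return string.Join(Environment.NewLine, lines);
}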

Polly is obviously not meant for reading haiku, but in this case its quirks are part of the charm.
I did not try SSML at all – it probably requires better semantic understanding than I have.

Other

This is fairly little code to achieve what I wanted – understand what’s in a photo, find a haiku, and play it. I wrapped it all in a small mobile-friendly web page:

  • Turns out an <input type="file"> field can trigger the mobile camera. That’s neat.
  • I used CSS for all styling and animations. CSS can do a lot.
  • There is just a little JavaScript. Most of it deals with updating elements and CSS classes – it is not as fun as using an MVVM/MVC framework.

See also

Thanks!

Filling a large rectangle with smaller rectangles using regular expressions

We have some rectangles:

Four rectangles. I: 2x2, II: 2x1, III: 1x1, IV: 3x1

We also have a bigger, empty rectangle:

Empty area, 3x2

Goal

The goal is to fill the whole empty rectangle with some of the smaller rectangles, and to do it using a .Net regular expression (obviously).
We may use each rectangle only once, and we are allowed to rotate a rectangle as needed.
For example, these are valid solutions:

Use rectangle I and II rotated

Use rectangle II, III, and IV

But this one isn’t:
Use rectangle I and III

Input

We’ll represent the rectangles as alphanumeric blocks, and the empty rectangle as a block of tildes (~). There is an empty line between the rectangles:

aa
aa

bb

c

ddd

~~~
~~~


I chose this representation because it is relatively simple, and it makes the pattern easier to follow: ~ is part of the target rectangle, \w is a small rectangle.

An important comment regarding solution representation

There are many ways of representing a solution, and this affects the way our algorithm is written and works.
Consider this solution again:

Use rectangle II, III, and IV

Either of these work:
(I’m using Roman numerals for the small rectangles: I, II, III, and IV)

  • Rectangle II goes in position (1, 1), rectangle III goes in position (3, 1), and rectangle IV goes in position (1, 2). (We will not be solving it like this, though.)
  • Rectangle I is not used. Rectangle II is used first, rectangle III is second, and rectangle IV is third. (We will not be solving it like this either.)
  • First use rectangle II, then rectangle III, and then rectangle IV. (This is the one we’ll use.)

This is a little nuanced. The third option works best here because it fits backtracking very well.
When I say “next free position” I mean in character order: we go over all tildes from left to right, line by line, until we find one that isn’t already occupied.
Using this representation, the solution above can be written as (II, III, IV).

The pattern

\A
(?=(?<NextPos>[^~]*))       # \k<NextPos> always matches the position *before* the next free tilde.
(?:
    (?:
        (?<IndexLength>.)+?             # Match n characters. We will use rectangle number n in position <Index>.
                                        # 0 is not an option - we need some shape to be first, second, third, etc.
        (?<=\A(?=(?<Index>(?<-IndexLength>.)+)).*)    # Capture into <Index> the first n characters.
        (?<=\A(?<CurrentIndex>\k<Index>).*)           # Copy Index into CurrentIndex
        (?=.*\Z                                       # Ensure <Index> is unique. We cannot use the same rectangle twice
            (?<!(?=\A\k<Index>(?<=\A\k<CurrentIndex>))\A.*(?<-Index>.)+) 
        )
        (?(IndexLength)(?!)|)   # Not needed, just an assert.
        
        #Copy the shape of rectangle <Index> to the target area.
        #Find rectangle number <Index>
        (?<=(?=.(?<IndexLength>.)*?(?<=\A\k<Index>))\A.*)    # Populate <IndexLength> again.
             # ^-- we are skipping over one character. We want to reach rectangle number <IndexLength>,
             #     so we're skipping over <IndexLength>-1 rectangles
        (?<=(?=\s*(?<-IndexLength>(?:\w+\r?\n)+\r?\n)*(?<=(?<RectangleStart>.*))\w)\A.*) # Capture starting position of this rectangle.
        (?(IndexLength)(?!))
        (?:
            (?<Rotate>)     # Capture 0 characters into <Rotate>. Indicates that we are not rotating this rectangle.
            |
            (?<Rotate>.)    # Capture 1 character into <Rotate>. Indicates that we are rotating this rectangle.
            (?<TempRotate>) # Also mark <TempRotate>. This is a flag that is cleared for each rectangle. Allows conditional matching.
        )

        (?<=(?=\k<NextPos>(?<=(?<X>~*)))\A.*)   # Init <X> with the number of tildes to the left of the starting position.

        # Basically `(?:\w+\n)+\n` to match the whole rectangle.
        (?<=(?=\k<RectangleStart>   # Go to the position before the first character in the current rectangle.
            (?<Rectangle>           # Capture this rectangle, just so we have it while printing the solution.
                (?=(?(TempRotate)(?<TempHeight>\w)|(?<TempWidth>\w))+\r?\n) 
                (?:  
                    (?<Solution>\w)+
                    (?(TempRotate)(?<TempWidth>\r?\n)|(?<TempHeight>\r?\n))
                )+
            )
            \r?\n        # Match until we reach the end of this rectangle.
        )\A.*)

        (?<=(?=[^~]+(?<Width>(?<-TempWidth>~)+))\A.*)(?(TempWidth)(?!)) # Capture as many tildes as the current width.
        (?=(?<-TempHeight>\S*(?<Height>\r?\n))+)(?(TempHeight)(?!))     # Capture newlines into stack <Height>.
        (?(TempRotate)(?<-TempRotate>))                                 # Clear <TempRotate>.

        # Try to fit the rectangle into empty positions in the target rectangle.
        (?<=(?=\k<NextPos>
            (?:
                (?:                         # Match tildes
                    ~
                    (?<=(?<Filled>\A.*))               # Push the current position to <Filled>
                    (?<=(?<TempCurrentFilled>\A.*))    # Also push the current position to <TempCurrentFilled>
                    (?=.*\Z                            # Ensure <Filled> is unique. No overlap between rectangles.
                        (?<!(?=\A\k<Filled>(?<=\A\k<TempCurrentFilled>))\A.*(?<-Filled>.)+) 
                    )
                )+?
                (?<=^\k<X>\k<Width>)        # Match exactly <Width> tildes.
                ~*(?<-Height>\k<Height>\k<X>|\r?\n)     # Match until the same position on the next line (or just the last line).
            )+
            (?(Height)(?!))
        )\A.*)

        # Find the next free position - .
        (?<=(?=.*?
            (?<=(?<NextPos>\A.*))       # <NextPos> is the position before the next free tilde.
            ~
            (?<=(?<TempNextPos>\A.*))   # We compare it to <Filled>, which is the position including the tilde.
            (?=.*\Z
                (?<!(?=\A\k<Filled>(?<=\A\k<TempNextPos>))\A.*(?<-Filled>.)*) 
            )
            |
            .*\r?\n\Z (?<Done>)         # If we cannot find another empty position it means we are done. Set the <Done> flag.
        )\A.*)
        
    )
)+
(?(Done)|(?!))

The idea

At its heart, the pattern works like this:

^(?<IndexLength>.+)+

This is fairly simple, and is the main trick used here.
The pattern matches characters from the beginning of the string. It doesn’t matter what it matches, just how many characters and how many times. The group IndexLength is captured multiple times – as many as the number of rectangles used in the solution. Its length indicates which rectangle is used. In our example, when the solution is (II, III, IV), the group will have three captures with lengths 2, 3, and 4: (..)(...)(....)

A few more examples:

Solution Rectangles Indices IndexLength Captures
Use rectangle II, III, and IV (II, III, IV) (..)(...)(....)
Use rectangle I and II rotated (I, II⟳) (.)(..⟳)
Use rectangle I and III (I, III) (.)(...)

The (.+)+ pattern usually leads to catastrophic backtracking – but here it allows us to do an exhaustive search: we check all possible orders of the rectangles. To avoid catastrophe we have a large number of asserts that fail as quickly as possible when a match cannot be found.

Tricks and Comments

  • I’m using \r?\n everywhere because that’s what string literals contain.
  • The regex is written in a procedural-programming style – similar to statements and variables. I probably could have written it with more nesting and fewer groups, but I find it easier to write and understand this way.
  • In many groups we need to point to a position in the string. To do that we take all characters from the beginning of the string until the position, and push them to a stack. For example, this is done in group NextPos that points to where the next rectangle should go, and Filled that contains occupied positions.
  • In many cases I use (?<=(?=\k<NextPos>…)\A.*) – this is a lookbehind to the beginning of the string, and a lookahead to our position. I’ve placed \A “after” the lookahead because lookbehinds are matched from end to start.
  • There is no relation between the actual characters we match (a few characters from the beginning) and the actual logic flow. This is very similar to the maze-solving regex.
  • Uniqueness – In three places we’re checking that a new capture is different from all previous captures: We choose an unused rectangle, we fill the rectangle into unoccupied tildes, and then we find the next empty tilde. This is done with this pattern:

    (?=.*\Z
    	(?<!(?=\A\k<Filled>(?<=\A\k<TempNextPos>))\A.*(?<-Filled>.)*) 
    )

    This is fairly neat. The pattern is based on Martin Ender’s answer on Programming Puzzles & Code Golf (by the way, this is the third time today that I’ve linked one of Martin’s answers – thanks!).
    This is a negative lookbehind (remember – read from the end) which:

    • (?<-Filled>.)* – Allow popping from the stack.
    • \A.* – go to the beginning.
    • (?=\A\k<Filled>(?<=\A\k<TempNextPos>)) – see that the item in the stack is different from our new value.

    Because this is a negative lookbehind all options are checked – we check the new value against all values in the stack.
    Because this is a negative lookaround the changes are discarded – we don’t have to copy the stack, and we are not losing the captured values.

  • Rotation is done as a simple alternation, followed by conditional matches on width/height.
  • We can tweak the pattern to change its behavior. For example, we can quite easily allow multiple usages of the same rectangle, or disable rotation. We can also remove the last assertion – (?(Done)|(?!)): this causes the pattern to match the first few rectangles, and is a nice way of debugging the behavior of the pattern.
  • There’s a subtle problem here: we are using input characters to represent the order of the rectangles we are using. We need an increasing number of characters for each additional rectangle. If we have a large enough number of small rectangles we may run out of input characters and the match will fail. We can fix it by allowing additional spaces at the end, for example. While this is a bug, it is not too interesting.
  • This problem was the second programming assignment I got at university (14 years ago, in Java, not regex).

Getting the Solution

The result is a single Match. The match is successful when there is a solution.
This is already pretty good.
We are also capturing these groups:

IndexLength – As shown in the table above, for each rectangle used in the solution there’s a capture of IndexLength. The length of the group indicates the index of the rectangle.
Rectangle – Captures the rectangles in the order they were used.
Rotate – For each rectangle, captures one character if the rectangle is rotated, and an empty capture if it isn’t rotated.
NextPos – Captures the position of each rectangle in the solution, on the target rectangle – the position before the tilde where the rectangle should be placed.
SolutionChar – Contains each character as used, in the order they were iterated. In our example with solution (I, II) the captures of SolutionChar would be {a,a,a,a,b,b}.
Filled – Contains all tildes in the target rectangle in the order they were filled. This is the same order as SolutionChar, and we can use both groups to paint the solution if we wanted.

Source Code

Source code and test cases can be found on GitHub: https://github.com/kobi/RecreationalRegex
Rectangles were drawn in HTML/CSS using JS Bin: http://jsbin.com/bafaxijiqa/edit?html,css,output


Thanks!

Two Years with Amazon Simple Workflow (SWF)


June 12 marks two years of us using Amazon Simple Workflow Service (SWF) in production, and I thought I’d share the experience.

First, let’s get this out of the way:

What is SWF not?

  • SWF does not execute any code.
  • SWF does not contain the logic of the workflow.
  • SWF does not allow you to draw a workflow or a state machine.

So what is it?

SWF is a web service that keeps the state of your workflow.
That’s pretty much it.

What are we using it for?

Our project is based on C#. We are using the AWS API directly (via the .Net SDK).
If you are using Java or Ruby, Amazon provides a higher-level library for SWF called the Flow Framework. For C#, I wrote what I needed myself, or simply used the “low level” API.
Our project processes a large number of files daily, and it was my task to convert our previous batch-based solution to SWF.

How does it work?

SWF is based on polling. Your code runs on your machines – on AWS or on-premises, it doesn’t matter. Your code polls the SWF API for tasks (which wait in queues), receives a task, executes it, and sends the result back to the SWF API.
SWF then issues new tasks to your code, and keeps the history of the workflow (its state).
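
With the “low level” API the polling loop is not much code. A minimal sketch of the worker side – the domain, task list, and DoTheWork are placeholders, not my production code:

public async Task PollForActivityTasks()
{
    var swf = new AmazonSimpleWorkflowClient();
    while (true)
    {
        // Long poll - returns after up to a minute, possibly with no task.
        var response = await swf.PollForActivityTaskAsync(new PollForActivityTaskRequest
        {
            Domain = "my-domain",
            TaskList = new TaskList { Name = "my-task-list" },
            Identity = Environment.MachineName,
        });

        var task = response.ActivityTask;
        if (string.IsNullOrEmpty(task?.TaskToken))
        {
            continue; // The poll timed out with no task - poll again.
        }

        try
        {
            var result = await DoTheWork(task.Input); // Placeholder for the actual work.
            await swf.RespondActivityTaskCompletedAsync(new RespondActivityTaskCompletedRequest
            {
                TaskToken = task.TaskToken,
                Result = result,
            });
        }
        catch (Exception e)
        {
            await swf.RespondActivityTaskFailedAsync(new RespondActivityTaskFailedRequest
            {
                TaskToken = task.TaskToken,
                Reason = e.GetType().Name,
                Details = e.Message,
            });
        }
    }
}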

If you’ve read any of the documentation, you probably know there are two kinds of tasks: Activity Tasks (processed by workers), and Decision Tasks (processed by the Decider). This API naturally encourages and leads you to a nice design of your software, where different components do different things.

Workers

Workers handle Activity Tasks.
Workers are simple components that actually do the work of the workflow. These are the building blocks of the workflow, and typically do one simple thing:

  • Take an S3 path as input and calculate the hash of the file.
  • Add a row to the database.
  • Send an email.
  • Take an S3 path to an image and create a thumbnail.

All of my workers implement a simple interface:

public interface IWorker<TInput>
{
    Task Process(TInput input);
}

An important property of a worker is that all the data it needs to perform its task is included in its input.
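
For example, a hypothetical worker for the first bullet above – the input class and the worker are made up, but this is the general shape:

public class FileHashInput
{
    public string Bucket { get; set; }
    public string Key { get; set; }
}

public class FileHashWorker : IWorker<FileHashInput>
{
    public async Task Process(FileHashInput input)
    {
        var s3 = new AmazonS3Client();
        using (var response = await s3.GetObjectAsync(input.Bucket, input.Key))
        using (var sha256 = SHA256.Create())
        {
            // Everything the worker needs comes from the input - no shared state.
            var hash = sha256.ComputeHash(response.ResponseStream);
            Console.WriteLine(BitConverter.ToString(hash));
        }
    }
}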

The Decider

When I first read about SWF I had a concept of tiny workers and deciders working together like ants to achieve a greater goal. Would that it were so simple.
While workers are simple, each type of workflow has a decider that operates like this:

  • Poll for a decision task.
  • Receive a decision task with all new events since the previous decision task.
  • Optionally load the entire workflow history to get context.
  • Make multiple decisions based on all new events.

For a simple linear workflow this isn’t a problem. The decider code is typically:

if workflow started
  Schedule activity task A
else if activity task A finished
  Schedule activity task B
else if activity task B finished
  Schedule activity task C
else if activity task C finished
  Complete workflow execution.

However, when the logic of the workflow is complicated, a decider may be required to handle an event like this:

Since the previous decision, Activity Task A completed successfully with this result, Activity Task B failed, we wouldn’t start the child workflow you’ve requested because you were rate limited, oh, and that timer you’ve set up yesterday finally went off.

This is a pretty complicated scenario. The Decider has just one chance of reacting to these events, and they all come at the same time. There are certainly many approaches here, but either way the decider is a hairy piece of code.

Code Design

As I’ve mentioned earlier, my workers are simple, and don’t use the SWF API directly – I have another class to wrap an IWorker. This is a big benefit, because any programmer can write a worker (without knowing anything about SWF), and because it is easy to reuse the code in any context. When a worker fails I expect it to simply throw an exception – my wrapper class registers the exception as an activity task failure.

To make writing complicated deciders easier I’ve implemented helper functions to get the history of the workflow, parse it, and make new decisions. My decider is separated into a base class that uses the SWF API, and child classes (one for each workflow type) that accept the workflow history and return new decisions. My deciders do not connect to a database or any external resource, and have no side effects (except logs and the SWF API, of course). This allows me to easily unit-test the decider logic – I can record a workflow history at a certain point to JSON, feed it to the decider, and see what decisions it makes. I can also tweak the history to create more test cases easily. These tests are important to me because the decider can contain a lot of logic.
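
A typical test looks roughly like this. CustomerRequestDecider, LoadHistory, and the history file are hypothetical names, but HistoryEvent, Decision, and DecisionType are the real SDK types:

[Test]
public void WhenTheHashActivityFails_ScheduleARetry()
{
    // A workflow history recorded from a real run and saved as JSON.
    List<HistoryEvent> history = LoadHistory("Histories/hash-activity-failed.json");

    var decider = new CustomerRequestDecider();
    List<Decision> decisions = decider.Decide(history);

    Assert.AreEqual(1, decisions.Count);
    Assert.AreEqual(DecisionType.ScheduleActivityTask, decisions[0].DecisionType);
}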

Scalability

In either case, for deciders and for workers, I keep zero state in the class instance. All state comes from the workflow history in the decider’s case, and from the task input in the worker’s case. There is no communication between threads and no shared memory. This approach makes writing scalable programs trivial: there are no locks and no race conditions. I can have as many processes running on as many machines as I’d like, and it just works – there is no stage of discovery or balancing. As a proof of concept, I even ran some of the threads on Linux (using Mono), and it all worked seamlessly.

Retries

The Flow Framework has built-in retries, but it only took me a few hours to implement retries for failed activity tasks, and a few more hours to add exponential backoff. This works nicely – the worker doesn’t know anything about it. The decider schedules another activity task or fails the workflow. The retry will wait a few minutes, and may run on another server. This has proven itself, and many errors are easily resolved.
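
The backoff calculation itself is tiny – a sketch with made-up constants:

// Attempt 1 waits one minute, attempt 2 two minutes, attempt 3 four minutes, and so on, up to a cap.
private static TimeSpan RetryDelay(int failedAttempts)
{
    var delay = TimeSpan.FromMinutes(Math.Pow(2, failedAttempts - 1));
    var cap = TimeSpan.FromHours(1);
    return delay < cap ? delay : cap;
}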

Timeouts

SWF has many types of timeouts, and I decided early on that I would use them everywhere. Even on manual steps we have timeouts of a few days.
Timeouts are important. They are the only way the workflow can detect a stuck worker or decider task when a process has crashed. They also encourage you to think about your business process a little better – what does it mean when it takes four days to handle a request? Can we do something automatically?
Another good (?) property of timeouts is that they can purge your queues when the volume gets too high for your solution.

Integration with other AWS services

Lambda

SWF can execute an AWS Lambda function instead of an activity task, which is a compelling idea. It saves the trouble of writing the worker and polling for tasks, and reduces the overhead of a large number of threads and open connections. All of the simple worker examples I gave above could be written as Lambda functions (except maybe adding a database row, depending on your database and architecture). The combination of Lambda’s serverless execution and SWF’s statefulness can make a robust, trivially scalable system.
But – while you can use Lambda to replace your workers, you still need to implement a decider that uses the API and runs as a process on your servers. This is a shame. Decision tasks are quick and self-contained, and deciders could easily be implemented as Lambda functions – if they didn’t have to poll for tasks.
I predict Amazon is going to add this feature: allowing AWS Lambda to work as a decider is a logical next step, and could make SWF a lot more appealing.

CloudWatch

CloudWatch shows aggregated metadata about your workflows and activity tasks. For example, this chart shows the server CPU (blue) and executions of an Activity Task (orange):
CloudWatch - CPU and an Activity Task
This is nice for seeing execution time and how the system handles large volumes. The downside is that while it shows accumulated data, there is no drill-down. I can clearly see 25 “look for cats in images” workflows failed, but there is no way of actually seeing them. More on that below.

What can be better

Rate Limiting and Throttling

More specifically, limiting the number of operations per second. I don’t get rate limiting. Mostly, rate limiting feels like this:
Little Britain - Computer Says No

I understand rate limiting can be useful, and it’s a good option when faulty code is running amok. However, even when I had just started, it felt like the SWF rate limiting was too trigger-happy. As a quick example – if I have a workflow that sets a timer, and I start that workflow several hundred times, some workflows will fail to set the timer because of a rate limit. I then have to ask for a timer again and again until I succeed. I can’t even wait before asking for a timer again because, well, waiting means setting a timer… (To add insult to injury, the request to set the timer is removed from the history, so I can’t really know exactly which timer failed.)
For this reason, when I implemented exponential backoff between failures I didn’t use timers at all – I used a dummy activity task with a short schedule-to-start timeout (sketched below). Activity tasks are not rate-limited per time (looking at the list again – this statement doesn’t look accurate, but that list wasn’t public at the time).
I just don’t get the point. The result isn’t better for Amazon or for the programmers. I understand the motive behind rate limiting, but it should be better tuned.
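
The “dummy timer” is a regular ScheduleActivityTask decision – schedule an activity on a task list nobody polls, and let the schedule-to-start timeout fire, which wakes the decider up again. A sketch (the activity type and task list names are mine):

var wait = new Decision
{
    DecisionType = DecisionType.ScheduleActivityTask,
    ScheduleActivityTaskDecisionAttributes = new ScheduleActivityTaskDecisionAttributes
    {
        ActivityType = new ActivityType { Name = "Wait", Version = "1" },
        ActivityId = "wait-" + Guid.NewGuid(),
        TaskList = new TaskList { Name = "nobody-polls-this-list" },
        ScheduleToStartTimeout = "120", // Seconds - this is the actual retry delay.
        ScheduleToCloseTimeout = "120",
    },
};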

SWF Monitoring API

The API for searching workflows is very limited. A few examples:

  • Find all workflows of type Customer Request – Supported.
  • Find all failed workflows – Supported.
  • Find all failed workflows of type Customer Request – Not supported.
  • Find all workflows that used the “Send SMS” activity task – Nope.
  • Find the 6 workflows where the “Send SMS” activity task timed out – No way, not even close.

This can get frustrating. CloudWatch can happily report 406 workflows used the “Send SMS” activity task between 13:00 and 13:05, and 4 activity tasks failed. There is no way of finding these workflows.
So sure, it isn’t difficult to implement it myself (we do have logs), but a feature like this is missing.

The AWS Console

The AWS management console is poor in general. The UI is dated and riddled with small bugs and oversights: JavaScript-based links that don’t allow middle-clicking, bugs when the workflow history is too big, and missing links where they are obvious – clicking the RunId of a parent or child workflow, linking the number of a decision task to that decision, linking a queue name to its count of pending tasks, and so on.
And of course, the console is using the API, so everything the API cannot do, the console can’t either.
Working with the console leaves a lot to be desired.

Community

There is virtually no noteworthy discussion on SWF. I’m not sure that’s important.

Conclusion

While SWF has its quirks, I am confident and happy with our solution.


2018 Update

An important comment is that SWF doesn’t seem to be in active development. From the FAQs – When should I use Amazon SWF vs. AWS Step Functions?:

AWS customers should consider using Step Functions for new applications. If Step Functions does not fit your needs, then you should consider Amazon Simple Workflow (SWF).

AWS will continue to provide the Amazon SWF service, Flow framework, and support all Amazon SWF customers

So it still works, and our code still works, but SWF is not getting any new features. This is certainly something to consider when choosing a major component of your system.

What is better in 2018 is the visibility into rate limits: there are CloudWatch metrics that show your limit, usage, and throttled events, and there is a structured support form for requesting rate limit increases.