First impressions of Amazon Kendra, AWS’s new Enterprise Search Engine

I did a quick hackathon proof of concept using Amazon Kendra, AWS’s new service launched in May – an enterprise search engine (more on that later) that uses natural language.

We’re using a Confluence wiki for internal documentation. People are encouraged to participate, and we’ve ended up with thousands of pages. When it comes to looking for information, the search is… well…

Google auto-complete says it's bad.

My goal for the three-day hackathon was to see if Amazon Kendra can beat the Confluence search.

The good

It works, it finds quality results, and it’s quick to set up. Kendra uses natural language both for extracting information from documents and for parsing the search query and finding results. I was impressed that it knew how to surface the high-quality pages out of the 7,500 pages it indexed, and in many cases it could highlight exact answers inside these documents. Kendra can answer questions like “what offices do we have in tokyo?”, “what is mongodb?”, and can handle jargon like “define {thing}” or even “what team owns {internal-tool}?” – all of this was picked up from our documentation.
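
I only evaluated the results through the console’s built-in search page (more on that below), but for reference, querying an index from code looks roughly like this – a minimal sketch, assuming the AWSSDK.Kendra NuGet package and a placeholder index ID:

// A sketch, not production code: ask Kendra a natural-language question
// and return the title and excerpt of each result.
public async Task<List<string>> Ask(string question)
{
    var kendra = new AmazonKendraClient();
    var response = await kendra.QueryAsync(new QueryRequest
    {
        IndexId = "YOUR-INDEX-ID", // placeholder – taken from the Kendra console
        QueryText = question,      // e.g. "what offices do we have in tokyo?"
    });
    return response.ResultItems
        .Select(item => $"{item.DocumentTitle.Text}: {item.DocumentExcerpt.Text}")
        .ToList();
}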

I used the S3 interface for loading documents into Kendra. Each PDF file is paired with an optional metadata JSON file, which I used to set the category and the source URL.
It took me a little over a day to export 7,500 PDFs from Confluence – the slowest part of my project. A couple more hours went into generating the metadata files. Uploading all files to S3 was quick, and then it took about 30 minutes to create the Amazon Kendra Index, configure an S3 data source, and then 4 minutes to load the data. Once loaded, the data was ready for use. I used the AWS Console to configure Amazon Kendra, which also includes an example search form – more than enough to evaluate the results and demonstrate Kendra’s capabilities (they even supply React components).
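
For reference, each metadata file is named after its document with a .metadata.json suffix, and looks roughly like this – a sketch from memory (check the S3 data source documentation for the exact field names); _category and _source_uri are Kendra’s reserved attributes:

{
    "Title": "How to request a new laptop",
    "ContentType": "PDF",
    "Attributes": {
        "_category": "IT",
        "_source_uri": "https://wiki.example.com/pages/12345"
    }
}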

Who’s the target audience? What’s special about enterprise search?

Large organizations have multiple knowledge management and sharing tools – a wiki for technical documentation, internal blogs, training, chats (like Slack or Teams), documents (like Google Drive), and more. Each tool comes with its own search engine. A common problem is that employees don’t know where the data they need lives, so a unified search engine makes a lot of sense.

Kendra has another feature that is required for a company search engine and that you will never find in a public search engine like Google or DuckDuckGo – per-document permissions. Each user is expected to be authenticated and to find only the documents they are allowed to see. (At my first job I took part in a project that introduced a company search engine, only to shut it down on the same day – because suddenly many badly-kept secrets were readily available. The search worked – it turned out the permissions had always been broken…)

Finally, the pricing and quotas – with an enterprise edition starting at $5,000 a month and limited to 40,000 daily searches – only make sense for a large organization with a predictable number of internal users.

What I’d like to see next

I worked with Kendra for a short while and didn’t dive in too deeply, but there are some features that would be more than nice to have:

  • Integration with industry-standard tools: For now, Kendra has connectors for “S3, SharePoint, Salesforce, ServiceNow, RDS databases, One Drive, and many more coming later this year“. Missing here are tools like Confluence or Google’s G Suite. This list is biased toward Microsoft services – no doubt targeting a certain kind of customer. It is also not clear how these connectors work when the data is on-premises, although a PrivateLink interface is provided if your data lives in AWS.
  • Support for comments, context, and hierarchy: Kendra can ingest whole documents and parse them well, but not all data is equal. It is missing support for any kind of link between documents. A comment that doesn’t repeat information from the page is meaningless on its own, and chats rely on the discussion around them. There is currently no way of modeling this in Kendra, and the pricing is not friendly to this use case either – a short comment counts the same as a full document. You can sort of get around it by including comments as part of the page (breaking direct links), but I doubt this would fit a tool like Slack. For comparison, Elasticsearch can model relations between documents.
  • Visibility into accuracy: Looking at query results, there is no indication of the confidence of the results. Was the query well understood? Did we find good answers or only poor matches? This data would enrich the result and allow more uses (for example, a Slack bot that answers questions only when the confidence is high – like they did for Lex). The closest thing here is the TopAnswer attribute.
  • Better fine-tuning: I was relieved I didn’t have to tweak any settings or define a taxonomy and stop words – steps that are not always easy or clear. Kendra does have settings for boosting documents based on fields, but if you need finer control, it currently isn’t there.
  • Planned features: Auto-complete, suggestions for related searches or corrections, and user feedback used for incremental learning.

Conclusion

It’s good! Results look promising, and it will be even better with multiple data sources. I’m going for it.

Haiku Camera: Take photo, hear a haiku. Using Reddit, AWS Rekognition, and Polly.

Three weeks ago Amazon had their annual AWS re:Invent event, where new products were launched. I wanted a quick way to test some new products, and picked something practical: Take a photo, understand what’s in the photo using Amazon Rekognition, find a relevant haiku on Reddit /r/haiku, and read it out loud using Amazon Polly. Why not.

Here’s a video demonstrating Haiku Camera:

I wrote a simple web application in ASP.Net MVC, using the AWS .Net SDK. The AWS team works quickly – the NuGet packages for all the new services were ready for use shortly after the announcement.

Amazon Rekognition

Rekognition’s DetectLabels can receive a photo and return a list of things it identified in the photo.
The API is straightforward and works nicely. While developing I tested it mostly with photos I took on my trip two years ago, and overall it did quite well, finding relevant keywords for every photo.
The code is fairly short (not handling the case of >5MB images):

public async Task<List<string>> DescribeImage(System.Drawing.Image image)
{
    using (var stream = new MemoryStream())
    {
        image.Save(stream, ImageFormat.Jpeg);
        stream.Seek(0, SeekOrigin.Begin);
        var rekognition = new AmazonRekognitionClient();
        var content = await rekognition.DetectLabelsAsync(
        new DetectLabelsRequest
        {
            MaxLabels = 5,
            Image = new Amazon.Rekognition.Model.Image
            {
                Bytes = stream,
            },
        });
        return content.Labels.Select(l => l.Name).ToList();
    }
}
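
The request is limited to 5 MB of image bytes; for larger photos, one option (a sketch, not something Haiku Camera does) is to upload the photo to S3 first and point Rekognition at the object instead of sending the bytes:

// Sketch only: reference a photo already uploaded to S3
// (bucket and key here are hypothetical).
var content = await rekognition.DetectLabelsAsync(new DetectLabelsRequest
{
    MaxLabels = 5,
    Image = new Amazon.Rekognition.Model.Image
    {
        S3Object = new Amazon.Rekognition.Model.S3Object
        {
            Bucket = "haiku-camera-photos",
            Name = "uploads/photo.jpg",
        },
    },
});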

Here are a few good examples using the Rekognition demo page (AWS account required):

As I’ve said, I had a specific goal – finding a haiku related to the photo. I limited the API to five keywords, assuming that would be enough and would keep the focus on the most relevant features in the photo. I could have also used the confidence to remove less likely labels, but I chose not to bother.
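
Had I bothered, the filtering would be a one-liner on top of the labels that come back – something like this, inside DescribeImage above (the 80% threshold is arbitrary):

// Keep only labels Rekognition is reasonably confident about.
return content.Labels
    .Where(l => l.Confidence > 80)
    .Select(l => l.Name)
    .ToList();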

After using enough test photos, I noticed I was getting a lot of the same keywords, for example:

It was obvious:

Unfortunately, as you probably already know, people

Horse ebooks, July 25, 2012

All of these photos have the keywords People, Person, and Human. Arguably, these are only useful for the photo of dancing people, where people really are the subject. I search for a haiku based on all keywords, and people are a popular subject among haiku poets – so People spams my results, and I keep getting the same haiku.
Additionally, the photos of the lion statue and Disneyland have exactly the same labels, adding Art, Sculpture, and Statue.
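
A crude workaround – a sketch of an idea, not something I implemented – is to keep a hand-picked list of overly generic labels and drop them before searching for a haiku:

// Hand-picked list of labels that show up in almost every photo
// (my choice of words, not something Rekognition exposes).
private static readonly HashSet<string> GenericLabels =
    new HashSet<string> { "People", "Person", "Human" };

public static List<string> RemoveGenericKeywords(List<string> keywords) =>
    keywords.Where(k => !GenericLabels.Contains(k)).ToList();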

Confidence is not enough

Besides correctness, another issue is precision and specificity. Consider the results ["bird", "penguin"], or ["food", "sushi"]. It is clear to us, people-person-humans, that Bird ⊃ Penguin, and Food ⊃ Sushi – but how can we choose the more specific word automatically? If I’m using a black-box product like Amazon Rekognition, I probably don’t have the resources to build my own corpus. Furthermore, this data is clearly already contained in Rekognition – but it is not exposed in any way. The complementary use case is well covered: had I used Rekognition to tag many photos and build an index, and wanted to answer questions like “find all photos with [Bird]”, I would not have had this problem. The difficulty is in choosing the best label when describing the content of a single photo.
I did not test it, but AWS has a new chatbot service called Amazon Lex (currently in limited preview) – maybe a chatbot service could help choose the more specific words.

Technically correct

What’s in this photo?

Image source getyourguide.com.

Ask a few people, and chances are they’ll say The Eiffel Tower, or Paris.
Rekognition gives us {"Architecture", "Tower"}. Now, if a person gave you this answer there are two options: either they’re three years old and somehow know what architecture is, or they have a superb sense of humor. And that’s really the problem: without proper names, Rekognition is being a smart-ass.

Rekognition – conclusion

Rekognition works well enough, and is easy to use. Its face recognition capabilities seem much more advanced than its modest image labeling and content identification.

What could be better:

  • A simple way to represent hierarchy between keywords.
  • Besides confidence, expose the relevance of each keyword. For reference, this is how Elasticsearch handles relevance:

    Inverse document frequency: How often does each term appear in the index? The more often, the less relevant. Terms that appear in many documents have a lower weight than more-uncommon terms

    This just makes sense. If you have "People" in 60% of the photos (not a real number), you can be confident that a photo will have people in it, but it would not be very interesting.

  • Relevance can also be influenced by the composition of the photo: is there a huge tower in the center and a few small people below it? Maybe the tower is more interesting. It would be nice if the API returned relative areas of the labels, or bounding-boxes where possible (also useful for cropping).
  • A few proper names would be nice.

It is fair to mention that this is a first version, and even a small update could make it much more useful for my use case.

Finding a haiku

This was the easy part. There are many collections of haiku on the internet. I chose Reddit /r/haiku because Reddit has a simple RSS API for content, a built-in search engine, and a huge variety of crazy creative haiku.

var keywords = String.Join(" OR ", subject.Select(s => $"({s})"));
var url = $"https://www.reddit.com/r/haiku/search.rss?q=title%3A({Uri.EscapeUriString(keywords)})&restrict_sr=on&sort=relevance&t=all";
var all = XDocument.Load(url).Descendants().Where(e => e.Name.LocalName == "title")
    .Select(e => e.Value).Where(h => h?.Count('/'.Equals) == 2).ToList();
// return a random haiku.

Using the keywords I build a URL for the search API. The filter looks like "title:((Dog) OR (Bear) OR (Giant Panda))".
If I used these haiku publicly or commercially (!?), I would have also checked the license, and would have extracted the author and a link to the haiku.

Amazon Polly

Another new service is Amazon Polly, which is a voice synthesizer: it accepts text and returns a spoken recording of that text.

public async Task CreateMp3(string text, string targetMp3FilePath)
{
    var polly = new AmazonPollyClient();
    var speak = await polly.SynthesizeSpeechAsync(new SynthesizeSpeechRequest
    {
        OutputFormat = OutputFormat.Mp3,
        Text = text,
        VoiceId = "Joanna",
    });
    using (var fileStream = File.Create(targetMp3FilePath))
    {
        speak.AudioStream.CopyTo(fileStream);
    }
}

Again, the code is simple, and Polly’s SynthesizeSpeech works easily. Polly has a variety of English voices and accents, including British, American, Australian, Welsh, and Indian, and I pick a random English one each time.
I am not the target audience, but I found the American and British voices to be clearer and of higher quality than the other English voices. (I mean in Amazon Polly, of course. Not in general.)
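
The hard-coded "Joanna" above is a simplification; picking a random voice is just a list and a Random – a sketch with a few of the English voices (the Polly console has the full list):

// A few of Polly's English voices (not the complete list); pick one per haiku.
private static readonly string[] EnglishVoices =
    { "Joanna", "Joey", "Amy", "Brian", "Nicole", "Russell", "Raveena", "Geraint" };
private static readonly Random Rng = new Random();

private static string PickVoice() => EnglishVoices[Rng.Next(EnglishVoices.Length)];
// ...and in the request: VoiceId = PickVoice(),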

A minor challenge was to get the punctuation right. The poems on Reddit are presented as three lines separated by slashes, usually with spaces around them. For example:

of all the virtues / patience is the single most / irritating one

This format is not well suited to Polly. Polly pretty much ignores a slash when it is surrounded by spaces, but reads out a verbal “slash” when it is not (“warrior/poet”).
Haiku tend to have minimal punctuation, but when punctuation is there I prefer to keep it. When a line has no punctuation at its end, I add commas and periods:

of all the virtues,
patience is the single most,
irritating one.

This is not ideal, but renders nicely in Polly. I add newlines just for presentation. Newlines are ignored by Polly, as far as I could see.
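
The transformation itself is only a few lines – roughly this (a sketch of the idea, not my exact code):

// Split on the slashes, keep existing punctuation,
// and add commas/periods only where a line has none.
public static string FormatForPolly(string haiku)
{
    var lines = haiku.Split('/')
        .Select(line => line.Trim())
        .Where(line => line.Length > 0)
        .ToArray();
    for (var i = 0; i < lines.Length; i++)
    {
        if (!char.IsPunctuation(lines[i].Last()))
        {
            // Commas inside the poem, a period at the end.
            lines[i] += i == lines.Length - 1 ? "." : ",";
        }
    }
    // Newlines are only for presentation; Polly ignores them anyway.
    return string.Join("\n", lines);
}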

Polly is obviously not meant for reading haiku, but in this case its quirks are part of the charm.
I did not try SSML at all – it probably requires better semantic understanding than I have.

Other

This is fairly little code to achieve what I wanted – understand what’s in a photo, find a haiku, and play it. I wrapped it all in a small mobile-friendly web page:

  • Turns out an <input type="file"> field can trigger the mobile camera (see the snippet after this list). That’s neat.
  • I used CSS for all styling and animations. CSS can do a lot.
  • There is just a little JavaScript. Most of it deals with updating elements and CSS classes – it is not as fun as using an MVVM/MVC framework.
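
For reference, the file input from the first bullet – the accept (and optional capture) attributes hint to mobile browsers that the camera is a valid source:

<!-- On most mobile browsers this opens the camera, or a chooser that includes it. -->
<input type="file" accept="image/*" capture="environment">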

Thanks!