.Net Regular Expressions – Finding Acronyms, and Reversing the Stack

A recent Stack Overflow question asked if you could (not should) use regular expressions to find acronyms, specifically of the form “Original Poster (OP)” – words followed by the acronym in parentheses.

Well, my first try was this:

\b((?<Acronym>\w)\w*\W+)+
\((?<-Acronym>\k<Acronym>)+\)
(?(Acronym)(?!))

Seems simple – the first line captures the words and pushes each first letter to the stack. The second line pops and matches it, and the last line makes sure there aren’t any extra letters. Seems nice, but wrong. The first letter on the stack in this case comes from the last word, so it matches a reversed acronym – “Oops, Wrong (WO)”.

What I had to do is to reverse the stack. I came up with this regex:

\b((?<Acronym>\w)\w*\W+)+
(?<=(?<-Acronym>.(?=.*?(?<Reverse>\k<Acronym>)))+)(?(Acronym)(?!))
\((?<-Reverse>\k<Reverse>)+\)
(?(Reverse)(?!))

Now, I’m not sure that’s the best way, but it works nicely. The second line is the only thing interesting – I won’t explain it too much, because nobody is reading it. Basically, I match every letter on the stack, and push it to a second stack. I match a dot for each letter because the engine has trouble matching a zero-width expression multiple times (though it works with {5}, for example, but not + or {1,5} – it only tries one). I can match backwards because I know I had at least that many letters, and can look forward because I’m optimistic – I expect to match these letters later, so if they aren’t there, I might as well fail now.

Source Code

Source code and test cases can be found on GitHub: https://github.com/kobi/RecreationalRegex

Advertisements

2 thoughts on “.Net Regular Expressions – Finding Acronyms, and Reversing the Stack

  1. I know this is an old post, but I find the approach and use of .NET specific backtracking features pretty interesting.

    While I’m still working on figuring out exactly how that middle line works, I was wondering if you’ve got a second to provide some insight on how the following use cases might be accomplished:

    1.) Something like International Organization for Standardization (ISO) where there isn’t an exactly matching number of characters AND the words are out of order
    2.) An acronym defined prior to the definition like (ATM) Automatic Teller Machine
    3.) Where the capitalized letter isn’t the first one as in: eXtensible Markup Language (XML)

    My current implementation, more or less, looks like this. If I use only the acronym group and a sliding bag of words running across the text in a loop I can then look forward and backwards in the text. Not “quite right”, but close.

    
    (?
      (?:
      (?:
        (?[A-Z][\w\-]+
                  (?:[\s][\w\-]+)?(?:[,\s]{0,}[A-Z][\w\-]+)+
                )[\s.,]{0,}
                )
      )
      (?\()
      (?:(?\b(?:[A-Z][\.]){2,})|(?<acronym>\b(?:[A-Z\&][a-z]{0,}){2,}\d{0,}\b))
      (?\)
      )
    )
    
    • Interesting questions.
      Some of your post was lost because HTML was stripped, I can edit your comment if you post the code elsewhere.

      For 3 I’d use a different approach than this post:

      (?=(\((?<Letter>\w)+\)))
      (?<=((\w*(?<-Letter>\k<Letter>)\w*\s+)+))
      (?(Letter)(?!))
      

      This does not match the whole text, but it does capture it. First I match the acronym, and then match backward for the words. This is easier because I don’t have to reverse the stack, and would work just as well for the original post. This way matching the letter in the middle of the word is possible, and pretty easy.

      Case 2, acronym being before the full words isn’t particularly interesting, unless we want to have one pattern for both cases. There are some tricks to achieve this, but I think it’s an overkill for a pattern this simple – it is probably better to just write a part of the pattern twice, or use two patterns.

      Case 1 is the trickiest. You lose the benefits of using a stack when you allow a different order. I don’t see another way than a loop on the stack and lookaround, like you said. It’s also important to have the right letter count – for example not to match “Hello World (HWWH)”.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s