Solving mazes using regular expressions

Can the .Net regex engine find its way through a maze?

A maze?

We have a text based maze. We can walk on spaces, and need to find a path from the start point (S) to the end point (E). Anything else in a wall.

Examples:

██████████
█S█      █
█ █ █ █ ██
█   █ █ E█
██████████
███████████████████████████S█████████████
█         █ █           █ █   █     █   █
█ ███ ███ █ █ ███████ █ █ ███ █ ███ ███ █
█ █ █ █     █ █   █   █ █   █   █ █ █   █
█ █ █ █████ █ █ ███ ███ ███ █████ █ █ ███
█ █ █     █ █   █   █   █     █ █   █   █
█ █ █████ █████ █ █████ █ ███ █ █ █████ █
█ █     █     █ █     █   █   █ █       █
█ █████ █████ █ █████ █████ ███ ███████ █
█       █   █ █   █ █ █           █   █ █
█████ ███ █ █ ███ █ █ █████████ ███ █ █ █
█ █   █   █     █   █     █     █   █   █
█ █ ███ ███████████ █████ █ █████ ███████
█ █   █           █ █   █ █   █ █ █     █
█ ███ ███████████ █ ███ █ ███ █ █ █████ █
█   █     █ █     █     █   █   █ █     █
█ █ █████ █ █ █████████ ███ █████ █ █████
█ █   █   █ █   █       █   █     █     █
█ █ ███ ███ ███ █ ███████ ███ █████████ █
█ █           █         █               █
█████████████████████E███████████████████
█S█
█ ███████
█       █
███████ █
      █ █
      █E█

Strategy

We are going to implement the following algorithm for solving a maze:

  1. Find the start point.
  2. Pick a new position to go to. We have four choices: right, left, down, or up.
    The new position is invalid if it crosses our current path through the maze (the path cannot have loops).
  3. If we reached the End, we are done.
  4. If we cannot advance, undo the last choice and make a new one.
  5. Go to step 2.

As you may have noticed, this is backtracking: we’re making choices and reverting them when we know they were wrong.

Continue reading

.Net Regular Expressions – Finding Decimal Numbers that are Divisible by Three

It’s very easy to check a decimal number is divisible by three using a simple DFA with 3 states.

A regex, therefor, is possible, but not too pretty (source):

(?:[0369]|
[147](?:[0369]*[147][0369]*[258])*(?:[0369]*[258]|[0369]*[147][0369]*[147])|
[258](?:[0369]*[258][0369]*[147])*(?:[0369]*[147]|[0369]*[258][0369]*[258])
)*

Example: http://www.rubular.com/r/ZcRDblHg8M

Here’s another approach, using .Net’s stacks as a simple counter:

\b
(?>             # No regrets - don't backtrack on if/else decisions.
    [0369]      # = 0 (mod 3)
    |
    [147]       # = 1 (mod 3)
    (?:         # if possible pop 2, else push 1
        (?<-Sum>){2}|(?<Sum>)
    )
    |
    [258]       # = 2 (mod 3)
    (?:         # if possible pop 1, else push 2
        (?<-Sum>)|(?<Sum>){2}
    )
)+
\b
(?(Sum)(?!)) # Assert nothing's left in the stack

Why? Well, I was bored while I shaved. Luckily, this regex is simple enough for Mono. Working example: http://ideone.com/Yp6Ti (ok, maybe not, mono is missing 111222)

.Net Regular Expressions – Using the Stack State to Understand Numeric Values

It is common knowledge that regular expressions should handle text and not values. A recent stack overflow question got me thinking though – it is possible to use .Net regular expressions to understand numbers or other values while matching the pattern?
Regular expressions can be used to perform numerical tasks, but that is usually when working in unary base.
It turns out this is possible – .Net keeps a stack for every capture of every matched group while matching the pattern, and that state is available for use while matching. The idea is simple: we can represent numbers as depth of the stack, so the number 0 is an empty stack, 6 is a stack with 6 captures, and so forth.

(?>
    (?=[0-9])   # optimization - don't multiply when we don't have a digit.
    # multiply the content of the stack by 10
    # for each item on Stack, push 10 items to a Temp stack.
    (?(Decimal)
        (?<-Decimal>
            (?<Temp>){10}
        )
    ){100000}
    (?(Decimal)(?!))
    # Push all items from Temp back to Stack
    (?(Temp)
        (?<-Temp>
            (?<Decimal>)
        )
    ){100000}
    (?(Temp)(?!))
    # match a digit, and push its value to the stack
    (?:
        0                 |
        1 (?<Decimal>)    |
        2 (?<Decimal>){2} |
        3 (?<Decimal>){3} |
        4 (?<Decimal>){4} |
        5 (?<Decimal>){5} |
        6 (?<Decimal>){6} |
        7 (?<Decimal>){7} |
        8 (?<Decimal>){8} |
        9 (?<Decimal>){9}
    )
)+

The idea is very simple: when we see a new digit, we multiply the depth of the stack by 10, and add the number represented by the new digit. The value of the number can be verified using:

match.Groups["Decimal"].Captures.Count

A curious bit here is the use of the loop to copy stacks:

(?(Temp)
    (?<-Temp>
        (?<Decimal>)
    )
){100000}

I’d expect this to be enough:

(?<-Temp> (?<Decimal>) )*
(?(Temp)(?!))

It turns out the above loop is only executed once, and the condition always fails. It is probably a documented optimization, I’ll look more into that later. As a proof of concept, the workaround should do.

It is even possible to perform basic arithmetic operations on these stacks such as adding, subtracting, multiplying and such from within the regex engine, but that may be a few extra steps too many.
It should go without saying, of course, that regex isn’t a good option here – this is for recreational use. The run time and complexity are far from ideal.

See also:

.Net Regular Expressions – Finding Acronyms, and Reversing the Stack

A recent Stack Overflow question asked if you could (not should) use regular expressions to find acronyms, specifically of the form “Original Poster (OP)” – words followed by the acronym in parentheses.

Well, my first try was this:

\b((?<Acronym>\w)\w*\W+)+
\((?<-Acronym>\k<Acronym>)+\)
(?(Acronym)(?!))

Seems simple – the first line captures the words and pushes each first letter to the stack. The second line pops and matches it, and the last line makes sure there aren’t any extra letters. Seems nice, but wrong. The first letter on the stack in this case comes from the last word, so it matches a reversed acronym – “Oops, Wrong (WO)”.

What I had to do is to reverse the stack. I came up with this regex:

\b((?<Acronym>\w)\w*\W+)+
(?<=(?<-Acronym>.(?=.*?(?<Reverse>\k<Acronym>)))+)(?(Acronym)(?!))
\((?<-Reverse>\k<Reverse>)+\)
(?(Reverse)(?!))

Now, I’m not sure that’s the best way, but it works nicely. The second line is the only thing interesting – I won’t explain it too much, because nobody is reading it. Basically, I match every letter on the stack, and push it to a second stack. I match a dot for each letter because the engine has trouble matching a zero-width expression multiple times (though it works with {5}, for example, but not + or {1,5} – it only tries one). I can match backwards because I know I had at least that many letters, and can look forward because I’m optimistic – I expect to match these letters later, so if they aren’t there, I might as well fail now.

Source Code

Source code and test cases can be found on GitHub: https://github.com/kobi/RecreationalRegex

.Net Regex – Matching Mixed Balanced Parentheses

Yesterday I got thinking about matching different types balanced parentheses using .net regular expression. I assumed it was similar to matching quotes or a single kind of parentheses, but soon realized it isn’t quite that simple. The main problem is looking into the stack – when I see a closed curly brace, for example, how can I tell the top of the stack has an open one?

A quick search brought up Linguistic Forms’ Regex Balancing Group in Depth, which has a clever approach to the problem in the best use I’ve seen yet to balancing groups, and taught me a few new tricks. I’d recommend reading it. The tl;dr version is this: when I see a closed bracket I look-behind my current position using whatever was matched since the last open bracket, revisit it, and check it again. So, if I see I had a { (by matching it again), I know I should now match a }.

While reading the post I had an interesting idea – who says I must push what I matched to the stack? What if wanted to push an arbitrary string to the stack? When I see an open bracket I really want to push a closing bracket – but how can I?
The trick is to use a look-ahead: find the next occurrence of the character I want to push, and push it:

{(?=.*?(<Stack>}))

Next, when I want to match a closing brace, I already have the right one in stack.
Using this approach, here’s a regex that matches tokens with matching balanced parentheses of 3 different typed:

(
    [^(){}\[\]]+
    | \( (?=[^)]*  (?<Stack> \) ) )
    | \[ (?=[^\]]* (?<Stack> \] ) )
    | \{ (?=[^}]*  (?<Stack> \} ) )
    | \k<Stack> (?<-Stack>)
)+?
(?(Stack) (?!))

Of course, this approach does have limitations – you may not find the character you’d like to push (which might be a good thing – it allows you to fail early). It also gets much trickier if you want to balance something more complicated than constant strings, but that’s another subject.

Happy matching.

Source Code

Source code and test cases can be found on GitHub: https://github.com/kobi/RecreationalRegex

Using .Net Regex Balancing Groups to Match Words in Fibonacci Lengths

\b
(?<A>\w)+\s+
(?<A>\w)+\s+
(?<-A>\w)+
(?(A)(?!))
\b

The first example captures three words. Besides spaces and word boundaries, the interesting bits read:

  • (?<A>\w)+ – Capture the first and second word. Push each letter to the A stack. A note here is the it might have been more correct to write (?:\w(?<A>))+, so I don’t push a value I don’t use to the stack, but I think the (?<A>\w) is clearer. I could have also compressed it to ((?<A>\w)+\s+){2}, but then it would have been even less readable.
  • (?<-A>\w)+ – On the third word, this will only match as many letters as matched before. An alternative syntax here would be (\w(?<-A>))+.
  • (?(A)(?!)) – Fail the match if there are still letters in the A stack. This one is a must, in case the third word is shorter than the first two.
  • \b – Last but the least, I need to make sure I match at word boundaries.

Note that In the above regex I don’t care if the second word is longer or of equal length to the first. If this is a problem for you, it can be checked using the following pattern. I won’t explain it too much, but the idea here is to push two letters to the B stack for each letter you remove from the A stack:

\b
(?<A>\w)+   # First word
\s+
(?<B-A>(?<B>)\w)+  # Second word - consume all A's, push an extra B
(?(A)(?!))   # Make sure A is finished
(?<B>\w)*    # Add more B's (or none)
\s+
(?<-B>\w)+ # Third word
(?(B)(?!))
\b

Finally, here’s a regex to capture a sequence of words, where the length of each word is equal to the sum of the lengths of the two previous words:

\b
(?:
    (?<A>\w)+   # First word - push each letter
    \s+
    (?=
        (?<A>\w)+  # Second word - push each letter
        \s+
        (?<-A>\w)+ # Third word - pop letters
        (?(A)(?!))
        \b
    )    #Look Ahead
)+
\w+\s+\w+\b #Capture last two words (already looked at them)

This might look a bit strange: I match a single word at a time, and use the look-ahead to check the following two words. The result is a sequence of words whose lengths make a Fibonacci series.

Source Code

Source code and test cases can be found on GitHub: https://github.com/kobi/RecreationalRegex

See Also:
Regex Balancing Group in Depth – An excellent introduction to balancing groups and the basic concepts and tools used in .Net regular expressions.