Regular expression balancing groups – now also in Go!

Go comes with the terse regex engine re2, which should be very fast, but is missing a lot of common regex features (like lookarounds). This is probably one of the reasons Doug Clark has ported the .Net Regex engine to Go, and looks like he did a good job at it: regexp2 library.

I did a quick test to see if my maze-solving regex works, and it did! This is impressive, specially considering none of my patterns ever worked on Mono until they merged with the .Net Framework implementation.

Installing the library, assuming you already have Go set up (“...” is correct despite looking like a mistake, by the way):

go get github.com/dlclark/regexp2/...

My first ever Go program:

package main

import (
	"fmt"

	"github.com/dlclark/regexp2"
)

/*
This Go program is using the library Regex2 by Doug Clark ( https://github.com/dlclark ),
which is a port of the .Net regex engine to Go.
Install using:
go get github.com/dlclark/regexp2/...
*/

func solveMaze(maze string) {
	pattern := `
(?=(?<Pos>.*?(?<Step>S)))
\A
(?:
    # Find next position
    (?:
        (?<End>)
        (?<=               # Stop when Pos reached E.
            \A(?=\k<Pos>(?<=E))    
        .*)
        |
        (?<Right>)
        (?<=                
            \A              # Go back to start.
            (?=\k<Pos>      # Go to the current position.
                (?<Step>[ E])
                (?<=(?<NewPos>\A.*))   # Set new Position.
            )    
        .*)
        |
        (?<Left>)
        (?<=
            \A(?=\k<Pos>(?<=
                (?<Step>[ E])
                (?<=(?<NewPos>\A.*))   # Set new Position
                .                   # One step before the current POS
            ))    
        .*)
        |
        (?<Down>)
        (?<=
            \A(?=\k<Pos>
                (?>
                (?-s:                   # Dot should not match new lines
                    (?<=^(?<X>.)*.)       # Count characters to the left. -1 caracter for current position.
                    .*\n                  # Go to end of this line
                    (?<-X>.)*(?(X)(?!))   # Find same position as above character
                )
                )
                (?<Step>[ E])
                (?<=(?<NewPos>\A.*))   # Set new Position
            )    
        .*)
        |
        (?<Up>)
        (?<=
            \A(?=\k<Pos>
                (?<=        # Lookbehind - read pattern from bottom to top: http://stackoverflow.com/q/13389560/7586
                    (?-s:                   # Dot does not match new lines
                        (?=
                            (?>(?<-X>.)*(?(X)(?!)))  # Find same position as lower character
                            (?<Step>[ E])
                            (?s:(?<=(?<NewPos>\A.*)))   # Set new Position
                        )
                        (?>
                        ^.*\n              # Go to start of the previous line
                        ^(?<X>.)*.         # Count characters to the left. -1 caracter for current position.
                        )
                    )
                )           # End of Lookbehind
            )    
        .*)
    )
. # progress the match.
    # Check the new position is unique - it wasn't matched before.
    # None of the other positions are equal to the new postion.
    (?(End)|
        (?!(?!      # Rollback all changes after matching (Reset Pos)
            (?<=\A
                (?:
                    (?<=\A
                        (?=
                            \k<Pos>             # Move to Pos
                            (?<!\k<NewPos>)     # Check the Pos is different from NewPos
                            (?<-Pos>)           # Pop Pos.
                        )
                    .*)
                .)*
                . # one extra for the new postion
            )
        ))
        # Push NewPos to Pos
        (?<=\A
            (?=
                (?<Pos>\k<NewPos>) # Set NewPos and Temp to Pos
                (?<-NewPos>)       # Pop NewPos. No real reason.
            )
        .*)
    )
)+?
(?(End)|(?!))
`

	//re := regexp2.MustCompile(`(?<A>..)(?<-A>..)(?<-A>.)`, 0)
	re := regexp2.MustCompile(pattern, regexp2.IgnorePatternWhitespace|regexp2.Multiline|regexp2.Singleline)
	if isMatch, _ := re.MatchString(maze); isMatch {
		//do something
		fmt.Println("Solvable!")
	} else {
		fmt.Println("Nope.")
	}
}

func testMazes() {
	goodMazes := []string{
		`
██████████
█S█      █
█ █ █ █ ██
█   █ █ E█
██████████`}

	badMazes := []string{
		`
██████████
█S█      █
█ ███ █ ██
█   █ █ E█
██████████`}

	fmt.Println("These mazes are good:")
	for _, maze := range goodMazes {
		solveMaze(maze)
	}
	fmt.Println("Bad Mazes:")
	for _, maze := range badMazes {
		solveMaze(maze)
	}
}

func main() {
	testMazes()
}

run:

go run RegexMaze.go

and output (with a few more tests):

These mazes are good:
Solvable!
Solvable!
Solvable!
Bad Mazes:
Nope.
Nope.
Nope.

I copied the pattern to Go, and it just worked. Nice!

Filling a large rectangle with smaller rectangles using regular expressions

We have some rectangles:

Four rectangles. I: 2x2, II: 2x1, III: 1x1, IV: 3x1

We also have a bigger, empty rectangle:

Empty area, 3x2

Goal

The goal is to fill the whole empty rectangle with some of the smaller rectangles, and to do it using a .Net regular expression (obviously).
We may use each rectangle only once, and we are allowed to rotate a rectangle as needed.
For example, these are valid solutions:

Use rectangle I and II rotated

Use rectangle II, III, and IV

But this one isn’t:
Use rectangle I and III

Input

We’ll represent the rectangles as alphanumeric blocks, and the empty rectangle as a block of tildes (~). There is an empty line between the rectangles:

aa
aa

bb

c

ddd

~~~
~~~


I chose this representation because it is relatively simple, and it makes the pattern easier to follow: ~ is part of the target rectangle, \w is a small rectangle.

An important comment regarding solution representation

There are many ways of representing a solution, and this affects the way our algorithm is written and works.
Consider this solution again:

Use rectangle II, III, and IV

Either of these work:
(I’m using roman numeral for the small rectangles: I, II, III, and IV)

  • (not this) Rectangle number II goes on position (1, 1), rectangle number III goes on position (3,1), and rectangle number IV goes on position (1,2). (we will not be solving it like this though)
  • (not this) Rectangle I is not used. Rectangle II is used first, Rectangle III is second, and Rectangle IV is third. (we will not be solving it like this either)
  • First use Rectangle II, then Rectangle III, and then Rectangle IV. (this is the one we’ll use)

This is a little nuanced. The third option works best here because it fits backtracking very well.
When I say “next free position” I mean in character order: we go over all tildes from left to right, line by line, until we find one that isn’t already occupied.
Using this representation, the solution above can be written as (II, III, IV).

The pattern

\A
(?=(?<NextPos>[^~]*))       # \k<NextPos> always matches the position *before* the next free tilde.
(?:
    (?:
        (?<IndexLength>.)+?             # Match n characters. We will use rectangle number n in position <Index>.
                                        # 0 is not an option - we need some shape to be first, second, third, etc.
        (?<=\A(?=(?<Index>(?<-IndexLength>.)+)).*)    # Capture into <Index> the first n characters.
        (?<=\A(?<CurrentIndex>\k<Index>).*)           # Copy Index into CurrentIndex
        (?=.*\Z                                       # Ensure <Index> is unique. We cannot use the same rectangle twice
            (?<!(?=\A\k<Index>(?<=\A\k<CurrentIndex>))\A.*(?<-Index>.)+) 
        )
        (?(IndexLength)(?!)|)   # Not needed, just an assert.
        
        #Copy the shape of rectangle <Index> to the target area.
        #Find rectangle number <Index>
        (?<=(?=.(?<IndexLength>.)*?(?<=\A\k<Index>))\A.*)    # Populate <IndexLength> again.
             # ^-- we are skipping over one character. We want to reach rectangle number <IndexLength>,
             #     so we're skiping over <IndexLength>-1 rectangles
        (?<=(?=\s*(?<-IndexLength>(?:\w+\r?\n)+\r?\n)*(?<=(?<RectangleStart>.*))\w)\A.*) # Capture starting position of this rectangle.
        (?(IndexLength)(?!))
        (?:
            (?<Rotate>)     # Capture 0 characters into <Rotate>. Indicates that we are not rotating this rectangle.
            |
            (?<Rotate>.)    # Capture 1 character into <Rotate>. Indicates that we are rotating this rectangle.
            (?<TempRotate>) # Also mark <TempRotate>. This is a flag that is cleared for each rectangle. Allows conditional matching.
        )

        (?<=(?=\k<NextPos>(?<=(?<X>~*)))\A.*)   # Init <X> with the number of tildes to the left of the starting position.

        # Basically `(?:\w+\n)+\n` to match the whole rectangle.
        (?<=(?=\k<RectangleStart>   # Go to the position before the first character in the current rectangle.
            (?<Rectangle>           # Capture this rectangle, just so we have it while printing the solution.
                (?=(?(TempRotate)(?<TempHeight>\w)|(?<TempWidth>\w))+\r?\n) 
                (?:  
                    (?<Solution>\w)+
                    (?(TempRotate)(?<TempWidth>\r?\n)|(?<TempHeight>\r?\n))
                )+
            )
            \r?\n        # Match until we reach the end of this rectangle.
        )\A.*)

        (?<=(?=[^~]+(?<Width>(?<-TempWidth>~)+))\A.*)(?(TempWidth)(?!)) # Capture as many tildes as the current width.
        (?=(?<-TempHeight>\S*(?<Height>\r?\n))+)(?(TempHeight)(?!))     # Capture newlines into stack <Height>.
        (?(TempRotate)(?<-TempRotate>))                                 # Clear <TempRotate>.

        # Try to fit the rectangle into empty positions in the target rectangle.
        (?<=(?=\k<NextPos>
            (?:
                (?:                         # Match tildes
                    ~
                    (?<=(?<Filled>\A.*))               # Push the current position to <Filled>
                    (?<=(?<TempCurrentFilled>\A.*))    # Also push the current position to <TempCurrentFilled>
                    (?=.*\Z                            # Ensure <Filled> is unique. No overlap between rectangles.
                        (?<!(?=\A\k<Filled>(?<=\A\k<TempCurrentFilled>))\A.*(?<-Filled>.)+) 
                    )
                )+?
                (?<=^\k<X>\k<Width>)        # Match exactly <Width> tildes.
                ~*(?<-Height>\k<Height>\k<X>|\r?\n)     # Match until the same position on the net line (or just the last line).
            )+
            (?(Height)(?!))
        )\A.*)

        # Find the next free position - .
        (?<=(?=.*?
            (?<=(?<NextPos>\A.*))       # <NextPos> is the position before the next free tilde.
            ~
            (?<=(?<TempNextPos>\A.*))   # We compare it to <Filled>, which is the position including the tilde.
            (?=.*\Z
                (?<!(?=\A\k<Filled>(?<=\A\k<TempNextPos>))\A.*(?<-Filled>.)*) 
            )
            |
            .*\r?\n\Z (?<Done>)         # If we cannot find more an empty position it means we are done. Set the <Done> flag.
        )\A.*)
        
    )
)+
(?(Done)|(?!))

The idea

At its heart, the pattern works like this:

^(?<IndexLength>.+)+

This is fairly simple, and is the main trick used here.
The pattern matches characters from the beginning of the string. It doesn’t matter what it matches, just how many characters and how many times. The group IndexLength is captured multiple times – as many as the number of rectangles used in the solution. Its length indicates which rectangle is used. In our example, when the solution is (II, III, IV), the group will have three captures with lengths 2, 3, and 4: (..)(...)(....)

A few more examples:

Solution Rectangles Indices IndexLength Captures
Use rectangle II, III, and IV (II, III, IV) (..)(...)(....)
Use rectangle I and II rotated (I, II⟳) (.)(..⟳)
Use rectangle I and III (I, III) (.)(...)

The (.+)+ pattern usually leads to catastrophic backtracking – but here it allows us to do an exhaustive search: we check all possible orders of the rectangles. To avoid catastrophe we have a large number of asserts that fail as quickly as possible when a match cannot be found.

Tricks and Comments

  • I’m using \r?\n everywhere because that’s what string literals contain.
  • The regex is written in a procedural-programming style – similar to statements and variables. I probably could have written it with more nesting and fewer groups, but I find it easier to write and understand this way.
  • In many groups we need to point to a position in the string. To do that we take all characters from the beginning of the string until the position, and push them to a stack. For example, this is done in group NextPos that points to where the next rectangle should go, and Filled that contains occupied positions.
  • In many cases I use (?<=(?=\k<NextPos>…)\A.*) – this is a lookbehind to the beginning of the string, and a lookahead to our position. I’ve placed \A “after” the lookahead because lookbehinds are matched from end to start.
  • There is no relation between the actual characters we match (a few characters from the beginning) and the actual logic flow. This is very similar to the maze-solving regex.
  • Uniqueness – In three places we’re checking that a new capture is different from all previous captures: We choose an unused rectangle, we fill the rectangle into unoccupied tildes, and then we find the next empty tilde. This is done with this pattern:

    (?=.*\Z
    	(?<!(?=\A\k<Filled>(?<=\A\k<TempNextPos>))\A.*(?<-Filled>.)*) 
    )

    This is fairly neat. The pattern is based on Martin Ender‘s answer on Programming Puzzles & Code Golf (by the way, this is the third time today that I’ve linked one of Martin’s answers – thanks!).
    This is a negative lookbehind (remember – read from the end) which:

    • (?<-Filled>.)* – Allow popping from the stack.
    • \A.* – go to the beginning.
    • (?=\A\k<Filled>(?<=\A\k<TempNextPos>)) – see that the item in the stack is different from our new value.

    Because this is a negative lookbehind all options are checked – we check the new value against all values in the stack.
    Because this is a negative lookaround the changes are discarded – we don’t have to copy the stack, and we are not losing the captured values.

  • Rotation is done as a simple alternation, followed by conditional matches on width/height.
  • We can tweak the pattern to change its behavior. For example, we can quite easily allow multiple usages of the same rectangle, or disable rotation. We can also remove the last assertion – (?(Done)|(?!)): this causes the pattern to match the first few rectangles, and is a nice way of debugging the behavior of the pattern.
  • There’s a subtle problem here: we are using input characters to represent the order of the rectangles we are using. We need an increasing number of characters for each additional rectangle. If we have a large enough number of small rectangles we may run out of input characters and the match will fail. We can fix it by allowing additional spaces at the end, for example. While this is a bug, it is not too interesting.
  • This problem was the second programming assignment I got at university (14 years ago, in Java, not regex).

Getting the Solution

The result is a single Match. The match is successful when there is a solution.
This is already pretty good.
We are also capturing these groups:

IndexLength As shown in the table above, for each rectangle used in the solution there’s a capture of IndexLength. The length of the group indicates the index of the rectangle.
Rectangle Captures the rectangles in the order they were used.
Rotate For each rectangle – captures one character if the rectangle is rotated, and an empty capture if it isn’t rotated.
NextPos Captures the position of each rectangle in the solution, on the target rectangle. The position before the tilde where the rectangle should be placed.
SolutionChar Contains each character as used in the order they were iterated. In our example with solution (I, II) the captures of SolutionChar would be {a,a,a,a,b,b}.
Filled Contains all tildes in the target rectangle in the order they were filled. This is the same order as SolutionChar, and we can use both groups to paint the solution if we wanted.

Source Code

Source code and test cases can be found on GitHub: https://github.com/kobi/RecreationalRegex
Rectangles were drawn in HTML/CSS using JS Bin: http://jsbin.com/bafaxijiqa/edit?html,css,output


Thanks! Thanks!

Solving mazes using regular expressions

Can the .Net regex engine find its way through a maze?

A maze?

We have a text based maze. We can walk on spaces, and need to find a path from the start point (S) to the end point (E). Anything else in a wall.

Examples:

██████████
█S█      █
█ █ █ █ ██
█   █ █ E█
██████████
███████████████████████████S█████████████
█         █ █           █ █   █     █   █
█ ███ ███ █ █ ███████ █ █ ███ █ ███ ███ █
█ █ █ █     █ █   █   █ █   █   █ █ █   █
█ █ █ █████ █ █ ███ ███ ███ █████ █ █ ███
█ █ █     █ █   █   █   █     █ █   █   █
█ █ █████ █████ █ █████ █ ███ █ █ █████ █
█ █     █     █ █     █   █   █ █       █
█ █████ █████ █ █████ █████ ███ ███████ █
█       █   █ █   █ █ █           █   █ █
█████ ███ █ █ ███ █ █ █████████ ███ █ █ █
█ █   █   █     █   █     █     █   █   █
█ █ ███ ███████████ █████ █ █████ ███████
█ █   █           █ █   █ █   █ █ █     █
█ ███ ███████████ █ ███ █ ███ █ █ █████ █
█   █     █ █     █     █   █   █ █     █
█ █ █████ █ █ █████████ ███ █████ █ █████
█ █   █   █ █   █       █   █     █     █
█ █ ███ ███ ███ █ ███████ ███ █████████ █
█ █           █         █               █
█████████████████████E███████████████████
█S█
█ ███████
█       █
███████ █
      █ █
      █E█

Strategy

We are going to implement the following algorithm for solving a maze:

  1. Find the start point.
  2. Pick a new position to go to. We have four choices: right, left, down, or up.
    The new position is invalid if it crosses our current path through the maze (the path cannot have loops).
  3. If we reached the End, we are done.
  4. If we cannot advance, undo the last choice and make a new one.
  5. Go to step 2.

As you may have noticed, this is backtracking: we’re making choices and reverting them when we know they were wrong.

Continue reading

.Net Regular Expressions – Finding Decimal Numbers that are Divisible by Three

It’s very easy to check a decimal number is divisible by three using a simple DFA with 3 states.

A regex, therefor, is possible, but not too pretty (source):

(?:[0369]|
[147](?:[0369]*[147][0369]*[258])*(?:[0369]*[258]|[0369]*[147][0369]*[147])|
[258](?:[0369]*[258][0369]*[147])*(?:[0369]*[147]|[0369]*[258][0369]*[258])
)*

Example: http://www.rubular.com/r/ZcRDblHg8M

Here’s another approach, using .Net’s stacks as a simple counter:

\b
(?>             # No regrets - don't backtrack on if/else decisions.
    [0369]      # = 0 (mod 3)
    |
    [147]       # = 1 (mod 3)
    (?:         # if possible pop 2, else push 1
        (?<-Sum>){2}|(?<Sum>)
    )
    |
    [258]       # = 2 (mod 3)
    (?:         # if possible pop 1, else push 2
        (?<-Sum>)|(?<Sum>){2}
    )
)+
\b
(?(Sum)(?!)) # Assert nothing's left in the stack

Why? Well, I was bored while I shaved. Luckily, this regex is simple enough for Mono. Working example: http://ideone.com/Yp6Ti (ok, maybe not, mono is missing 111222)

.Net Regular Expressions – Using the Stack State to Understand Numeric Values

It is common knowledge that regular expressions should handle text and not values. A recent stack overflow question got me thinking though – it is possible to use .Net regular expressions to understand numbers or other values while matching the pattern?
Regular expressions can be used to perform numerical tasks, but that is usually when working in unary base.
It turns out this is possible – .Net keeps a stack for every capture of every matched group while matching the pattern, and that state is available for use while matching. The idea is simple: we can represent numbers as depth of the stack, so the number 0 is an empty stack, 6 is a stack with 6 captures, and so forth.

(?>
    (?=[0-9])   # optimization - don't multiply when we don't have a digit.
    # multiply the content of the stack by 10
    # for each item on Stack, push 10 items to a Temp stack.
    (?(Decimal)
        (?<-Decimal>
            (?<Temp>){10}
        )
    ){100000}
    (?(Decimal)(?!))
    # Push all items from Temp back to Stack
    (?(Temp)
        (?<-Temp>
            (?<Decimal>)
        )
    ){100000}
    (?(Temp)(?!))
    # match a digit, and push its value to the stack
    (?:
        0                 |
        1 (?<Decimal>)    |
        2 (?<Decimal>){2} |
        3 (?<Decimal>){3} |
        4 (?<Decimal>){4} |
        5 (?<Decimal>){5} |
        6 (?<Decimal>){6} |
        7 (?<Decimal>){7} |
        8 (?<Decimal>){8} |
        9 (?<Decimal>){9}
    )
)+

The idea is very simple: when we see a new digit, we multiply the depth of the stack by 10, and add the number represented by the new digit. The value of the number can be verified using:

match.Groups["Decimal"].Captures.Count

A curious bit here is the use of the loop to copy stacks:

(?(Temp)
    (?<-Temp>
        (?<Decimal>)
    )
){100000}

I’d expect this to be enough:

(?<-Temp> (?<Decimal>) )*
(?(Temp)(?!))

It turns out the above loop is only executed once, and the condition always fails. It is probably a documented optimization, I’ll look more into that later. As a proof of concept, the workaround should do.

It is even possible to perform basic arithmetic operations on these stacks such as adding, subtracting, multiplying and such from within the regex engine, but that may be a few extra steps too many.
It should go without saying, of course, that regex isn’t a good option here – this is for recreational use. The run time and complexity are far from ideal.

See also:

.Net Regular Expressions – Finding Acronyms, and Reversing the Stack

A recent Stack Overflow question asked if you could (not should) use regular expressions to find acronyms, specifically of the form “Original Poster (OP)” – words followed by the acronym in parentheses.

Well, my first try was this:

\b((?<Acronym>\w)\w*\W+)+
\((?<-Acronym>\k<Acronym>)+\)
(?(Acronym)(?!))

Seems simple – the first line captures the words and pushes each first letter to the stack. The second line pops and matches it, and the last line makes sure there aren’t any extra letters. Seems nice, but wrong. The first letter on the stack in this case comes from the last word, so it matches a reversed acronym – “Oops, Wrong (WO)”.

What I had to do is to reverse the stack. I came up with this regex:

\b((?<Acronym>\w)\w*\W+)+
(?<=(?<-Acronym>.(?=.*?(?<Reverse>\k<Acronym>)))+)(?(Acronym)(?!))
\((?<-Reverse>\k<Reverse>)+\)
(?(Reverse)(?!))

Now, I’m not sure that’s the best way, but it works nicely. The second line is the only thing interesting – I won’t explain it too much, because nobody is reading it. Basically, I match every letter on the stack, and push it to a second stack. I match a dot for each letter because the engine has trouble matching a zero-width expression multiple times (though it works with {5}, for example, but not + or {1,5} – it only tries one). I can match backwards because I know I had at least that many letters, and can look forward because I’m optimistic – I expect to match these letters later, so if they aren’t there, I might as well fail now.

Source Code

Source code and test cases can be found on GitHub: https://github.com/kobi/RecreationalRegex

.Net Regex – Matching Mixed Balanced Parentheses

Yesterday I got thinking about matching different types balanced parentheses using .net regular expression. I assumed it was similar to matching quotes or a single kind of parentheses, but soon realized it isn’t quite that simple. The main problem is looking into the stack – when I see a closed curly brace, for example, how can I tell the top of the stack has an open one?

A quick search brought up Linguistic Forms’ Regex Balancing Group in Depth, which has a clever approach to the problem in the best use I’ve seen yet to balancing groups, and taught me a few new tricks. I’d recommend reading it. The tl;dr version is this: when I see a closed bracket I look-behind my current position using whatever was matched since the last open bracket, revisit it, and check it again. So, if I see I had a { (by matching it again), I know I should now match a }.

While reading the post I had an interesting idea – who says I must push what I matched to the stack? What if wanted to push an arbitrary string to the stack? When I see an open bracket I really want to push a closing bracket – but how can I?
The trick is to use a look-ahead: find the next occurrence of the character I want to push, and push it:

{(?=.*?(<Stack>}))

Next, when I want to match a closing brace, I already have the right one in stack.
Using this approach, here’s a regex that matches tokens with matching balanced parentheses of 3 different typed:

(
    [^(){}\[\]]+
    | \( (?=[^)]*  (?<Stack> \) ) )
    | \[ (?=[^\]]* (?<Stack> \] ) )
    | \{ (?=[^}]*  (?<Stack> \} ) )
    | \k<Stack> (?<-Stack>)
)+?
(?(Stack) (?!))

Of course, this approach does have limitations – you may not find the character you’d like to push (which might be a good thing – it allows you to fail early). It also gets much trickier if you want to balance something more complicated than constant strings, but that’s another subject.

Happy matching.

Source Code

Source code and test cases can be found on GitHub: https://github.com/kobi/RecreationalRegex

Using .Net Regex Balancing Groups to Match Words in Fibonacci Lengths

\b
(?<A>\w)+\s+
(?<A>\w)+\s+
(?<-A>\w)+
(?(A)(?!))
\b

The first example captures three words. Besides spaces and word boundaries, the interesting bits read:

  • (?<A>\w)+ – Capture the first and second word. Push each letter to the A stack. A note here is the it might have been more correct to write (?:\w(?<A>))+, so I don’t push a value I don’t use to the stack, but I think the (?<A>\w) is clearer. I could have also compressed it to ((?<A>\w)+\s+){2}, but then it would have been even less readable.
  • (?<-A>\w)+ – On the third word, this will only match as many letters as matched before. An alternative syntax here would be (\w(?<-A>))+.
  • (?(A)(?!)) – Fail the match if there are still letters in the A stack. This one is a must, in case the third word is shorter than the first two.
  • \b – Last but the least, I need to make sure I match at word boundaries.

Note that In the above regex I don’t care if the second word is longer or of equal length to the first. If this is a problem for you, it can be checked using the following pattern. I won’t explain it too much, but the idea here is to push two letters to the B stack for each letter you remove from the A stack:

\b
(?<A>\w)+   # First word
\s+
(?<B-A>(?<B>)\w)+  # Second word - consume all A's, push an extra B
(?(A)(?!))   # Make sure A is finished
(?<B>\w)*    # Add more B's (or none)
\s+
(?<-B>\w)+ # Third word
(?(B)(?!))
\b

Finally, here’s a regex to capture a sequence of words, where the length of each word is equal to the sum of the lengths of the two previous words:

\b
(?:
    (?<A>\w)+   # First word - push each letter
    \s+
    (?=
        (?<A>\w)+  # Second word - push each letter
        \s+
        (?<-A>\w)+ # Third word - pop letters
        (?(A)(?!))
        \b
    )    #Look Ahead
)+
\w+\s+\w+\b #Capture last two words (already looked at them)

This might look a bit strange: I match a single word at a time, and use the look-ahead to check the following two words. The result is a sequence of words whose lengths make a Fibonacci series.

Source Code

Source code and test cases can be found on GitHub: https://github.com/kobi/RecreationalRegex

See Also:
Regex Balancing Group in Depth – An excellent introduction to balancing groups and the basic concepts and tools used in .Net regular expressions.