We have some rectangles:
We also have a bigger, empty rectangle:
The goal is to fill the whole empty rectangle with some of the smaller rectangles, and to do it using a .Net regular expression (obviously).
We may use each rectangle only once, and we are allowed to rotate a rectangle as needed.
For example, these are valid solutions:
But this one isn’t:
We’ll represent the rectangles as alphanumeric blocks, and the empty rectangle as a block of tildes (~). There is an empty line between the rectangles:
I chose this representation because it is relatively simple, and it makes the pattern easier to follow:
~ is part of the target rectangle,
\w is a small rectangle.
An important comment regarding solution representation
There are many ways of representing a solution, and this affects the way our algorithm is written and works.
Consider this solution again:
Either of these work:
(I’m using roman numeral for the small rectangles: I, II, III, and IV)
- (not this) Rectangle number II goes on position (1, 1), rectangle number III goes on position (3,1), and rectangle number IV goes on position (1,2). (we will not be solving it like this though)
- (not this) Rectangle I is not used. Rectangle II is used first, Rectangle III is second, and Rectangle IV is third. (we will not be solving it like this either)
- First use Rectangle II, then Rectangle III, and then Rectangle IV. (this is the one we’ll use)
This is a little nuanced. The third option works best here because it fits backtracking very well.
When I say “next free position” I mean in character order: we go over all tildes from left to right, line by line, until we find one that isn’t already occupied.
Using this representation, the solution above can be written as
(II, III, IV).
(?=(?<NextPos>[^~]*)) # \k<NextPos> always matches the position *before* the next free tilde.
(?<IndexLength>.)+? # Match n characters. We will use rectangle number n in position <Index>.
# 0 is not an option - we need some shape to be first, second, third, etc.
(?<=\A(?=(?<Index>(?<-IndexLength>.)+)).*) # Capture into <Index> the first n characters.
(?<=\A(?<CurrentIndex>\k<Index>).*) # Copy Index into CurrentIndex
(?=.*\Z # Ensure <Index> is unique. We cannot use the same rectangle twice
(?(IndexLength)(?!)|) # Not needed, just an assert.
#Copy the shape of rectangle <Index> to the target area.
#Find rectangle number <Index>
(?<=(?=.(?<IndexLength>.)*?(?<=\A\k<Index>))\A.*) # Populate <IndexLength> again.
# ^-- we are skipping over one character. We want to reach rectangle number <IndexLength>,
# so we're skiping over <IndexLength>-1 rectangles
(?<=(?=\s*(?<-IndexLength>(?:\w+\r?\n)+\r?\n)*(?<=(?<RectangleStart>.*))\w)\A.*) # Capture starting position of this rectangle.
(?<Rotate>) # Capture 0 characters into <Rotate>. Indicates that we are not rotating this rectangle.
(?<Rotate>.) # Capture 1 character into <Rotate>. Indicates that we are rotating this rectangle.
(?<TempRotate>) # Also mark <TempRotate>. This is a flag that is cleared for each rectangle. Allows conditional matching.
(?<=(?=\k<NextPos>(?<=(?<X>~*)))\A.*) # Init <X> with the number of tildes to the left of the starting position.
# Basically `(?:\w+\n)+\n` to match the whole rectangle.
(?<=(?=\k<RectangleStart> # Go to the position before the first character in the current rectangle.
(?<Rectangle> # Capture this rectangle, just so we have it while printing the solution.
\r?\n # Match until we reach the end of this rectangle.
(?<=(?=[^~]+(?<Width>(?<-TempWidth>~)+))\A.*)(?(TempWidth)(?!)) # Capture as many tildes as the current width.
(?=(?<-TempHeight>\S*(?<Height>\r?\n))+)(?(TempHeight)(?!)) # Capture newlines into stack <Height>.
(?(TempRotate)(?<-TempRotate>)) # Clear <TempRotate>.
# Try to fit the rectangle into empty positions in the target rectangle.
(?: # Match tildes
(?<=(?<Filled>\A.*)) # Push the current position to <Filled>
(?<=(?<TempCurrentFilled>\A.*)) # Also push the current position to <TempCurrentFilled>
(?=.*\Z # Ensure <Filled> is unique. No overlap between rectangles.
(?<=^\k<X>\k<Width>) # Match exactly <Width> tildes.
~*(?<-Height>\k<Height>\k<X>|\r?\n) # Match until the same position on the net line (or just the last line).
# Find the next free position - .
(?<=(?<NextPos>\A.*)) # <NextPos> is the position before the next free tilde.
(?<=(?<TempNextPos>\A.*)) # We compare it to <Filled>, which is the position including the tilde.
.*\r?\n\Z (?<Done>) # If we cannot find more an empty position it means we are done. Set the <Done> flag.
At its heart, the pattern works like this:
This is fairly simple, and is the main trick used here.
The pattern matches characters from the beginning of the string. It doesn’t matter what it matches, just how many characters and how many times. The group
IndexLength is captured multiple times – as many as the number of rectangles used in the solution. Its length indicates which rectangle is used. In our example, when the solution is
(II, III, IV), the group will have three captures with lengths 2, 3, and 4:
A few more examples:
(.+)+ pattern usually leads to catastrophic backtracking – but here it allows us to do an exhaustive search: we check all possible orders of the rectangles. To avoid catastrophe we have a large number of asserts that fail as quickly as possible when a match cannot be found.
Tricks and Comments
- I’m using
\r?\neverywhere because that’s what string literals contain.
- The regex is written in a procedural-programming style – similar to statements and variables. I probably could have written it with more nesting and fewer groups, but I find it easier to write and understand this way.
- In many groups we need to point to a position in the string. To do that we take all characters from the beginning of the string until the position, and push them to a stack. For example, this is done in group
NextPosthat points to where the next rectangle should go, and
Filledthat contains occupied positions.
- In many cases I use
(?<=(?=\k<NextPos>…)\A.*)– this is a lookbehind to the beginning of the string, and a lookahead to our position. I’ve placed
\A“after” the lookahead because lookbehinds are matched from end to start.
- There is no relation between the actual characters we match (a few characters from the beginning) and the actual logic flow. This is very similar to the maze-solving regex.
- Uniqueness – In three places we’re checking that a new capture is different from all previous captures: We choose an unused rectangle, we fill the rectangle into unoccupied tildes, and then we find the next empty tilde. This is done with this pattern:
(?=.*\Z (?<!(?=\A\k<Filled>(?<=\A\k<TempNextPos>))\A.*(?<-Filled>.)*) )
This is a negative lookbehind (remember – read from the end) which:
(?<-Filled>.)*– Allow popping from the stack.
\A.*– go to the beginning.
(?=\A\k<Filled>(?<=\A\k<TempNextPos>))– see that the item in the stack is different from our new value.
Because this is a negative lookbehind all options are checked – we check the new value against all values in the stack.
Because this is a negative lookaround the changes are discarded – we don’t have to copy the stack, and we are not losing the captured values.
- Rotation is done as a simple alternation, followed by conditional matches on width/height.
- We can tweak the pattern to change its behavior. For example, we can quite easily allow multiple usages of the same rectangle, or disable rotation. We can also remove the last assertion –
(?(Done)|(?!)): this causes the pattern to match the first few rectangles, and is a nice way of debugging the behavior of the pattern.
- There’s a subtle problem here: we are using input characters to represent the order of the rectangles we are using. We need an increasing number of characters for each additional rectangle. If we have a large enough number of small rectangles we may run out of input characters and the match will fail. We can fix it by allowing additional spaces at the end, for example. While this is a bug, it is not too interesting.
- This problem was the second programming assignment I got at university (14 years ago, in Java, not regex).
Getting the Solution
The result is a single
Match. The match is successful when there is a solution.
This is already pretty good.
We are also capturing these groups:
||As shown in the table above, for each rectangle used in the solution there’s a capture of
||Captures the rectangles in the order they were used.|
||For each rectangle – captures one character if the rectangle is rotated, and an empty capture if it isn’t rotated.|
||Captures the position of each rectangle in the solution, on the target rectangle. The position before the tilde where the rectangle should be placed.|
||Contains each character as used in the order they were iterated. In our example with solution (I, II) the captures of
||Contains all tildes in the target rectangle in the order they were filled. This is the same order as