It is common knowledge that regular expressions should handle text and not values. A recent stack overflow question got me thinking though – it is possible to use .Net regular expressions to understand numbers or other values while matching the pattern?
Regular expressions can be used to perform numerical tasks, but that is usually when working in unary base.
It turns out this is possible – .Net keeps a stack for every capture of every matched group while matching the pattern, and that state is available for use while matching. The idea is simple: we can represent numbers as depth of the stack, so the number 0 is an empty stack, 6 is a stack with 6 captures, and so forth.
(?>
(?=[0-9]) # optimization - don't multiply when we don't have a digit.
# multiply the content of the stack by 10
# for each item on Stack, push 10 items to a Temp stack.
(?(Decimal)
(?<-Decimal>
(?<Temp>){10}
)
){100000}
(?(Decimal)(?!))
# Push all items from Temp back to Stack
(?(Temp)
(?<-Temp>
(?<Decimal>)
)
){100000}
(?(Temp)(?!))
# match a digit, and push its value to the stack
(?:
0 |
1 (?<Decimal>) |
2 (?<Decimal>){2} |
3 (?<Decimal>){3} |
4 (?<Decimal>){4} |
5 (?<Decimal>){5} |
6 (?<Decimal>){6} |
7 (?<Decimal>){7} |
8 (?<Decimal>){8} |
9 (?<Decimal>){9}
)
)+
The idea is very simple: when we see a new digit, we multiply the depth of the stack by 10, and add the number represented by the new digit. The value of the number can be verified using:
match.Groups["Decimal"].Captures.Count
A curious bit here is the use of the loop to copy stacks:
(?(Temp)
(?<-Temp>
(?<Decimal>)
)
){100000}
I’d expect this to be enough:
(?<-Temp> (?<Decimal>) )* (?(Temp)(?!))
It turns out the above loop is only executed once, and the condition always fails. It is probably a documented optimization, I’ll look more into that later. As a proof of concept, the workaround should do.
It is even possible to perform basic arithmetic operations on these stacks such as adding, subtracting, multiplying and such from within the regex engine, but that may be a few extra steps too many.
It should go without saying, of course, that regex isn’t a good option here – this is for recreational use. The run time and complexity are far from ideal.
See also:
- https://gist.github.com/940593 – Sample program that captures a decimal number followed by the same number in binary base.
- How to determine if a number is a prime with regex?