The implementation of std::regex appears to be NFA based [1] which is surprising...

billyoneal · on Sept 17, 2019

That is near the top of the list of things to replace when we can break ABI next. It's... not amazing.

We regularly recommend customers use Boost if you need an API compatible version, or something like RE2 if you just need a fast regex engine.

Mathnerd314 · on Sept 17, 2019

https://github.com/microsoft/STL/blob/master/stl/inc/regex#L... is the code. Looks pretty simple, check if it matches the backreferences and error/continue.

The handling of repetition is more revealing, it doesn't appear to have "states" at all but rather is just the usual backtracking: https://github.com/microsoft/STL/blob/master/stl/inc/regex#L...

So the "NFA" is really an AST for the regex rather than an automaton based on Thompson's algorithm or similar.

asdfasgasdgasdg · on Sept 17, 2019

I think you're thinking of DFAs, which don't support backreferences. NFAs do. Right?

ridiculous_fish · on Sept 17, 2019

Maybe? But in theory any NFA can be reduced to a DFA through powerset construction, so there must be something else going on.

burntsushi · on Sept 17, 2019

Jeffrey Friedl's book on regexes unapologetically popularized the incorrect use of the terms "NFA" and "DFA" to characterize the different implementations of regex engines. Translating to something more sensible, "NFA" means "unbounded backtracking," where as "DFA" means "Thompson NFA simulation." In this vernacular, as far as I can tell, there is no word to describe an actual DFA.

PCRE's docs continue this confusion. For example, they expose a DFA API... But it's not actually a DFA!

One might say this is similar to how "regex" doesn't necessarily imply "regular," which is true. That's definitely a battle that has been lost long ago. But NFA and DFA are even more specific jargon terms, and the way in which they have been redefined by the Friedls of the world has led to exactly your confusion over and over again.

ridiculous_fish · on Sept 17, 2019

Thank you, that clears everything up. NFA as used here means something completely different from NFAs in CS literature.

burntsushi · on Sept 17, 2019

Unfortunately, yes. Although there is a connection. If you squint, you can see parts of the NFA. Indeed, it is very easy to simulate an NFA with unbounded backtracking (taking exponential worst case time). All you need to do from there is add a few op codes for look around and backreferences. Where you end up still looks like you're simulating an NFA, but the actual implementation has a lot more power.

To be fair, the same thing happens with the Thompson NFA simulation. It's not too hard to add a little extra power to it in the form of memory in order to track capturing group locations. It becomes more powerful than an NFA with this addition, but retains its linear time bound. This is why I've normally seen this addition called the Pike VM. (At least, Russ Cox and I both use than name.)

For completeness, you can also use bounded backtracking to simulate an NFA. It keeps track of all states visited per input byte, so it uses a lot of extra memory, but maintains a linear time bound.

bregma · on Sept 17, 2019

Any NFA that does not have backreferences can be converted to a DFA. If you have backreferences in a regex, you must use an NFA to drive it.

Also, backreferences stand a chance of turning an NFA into NIA. Turns out it's not possible to programmatically determine which backreference-containing regexes will halt or not.

wahern · on Sept 17, 2019

That's only for regular languages. Backreferences aren't regular.

ridiculous_fish · on Sept 17, 2019

Yes, hence my question about using NFAs for std::regex!