Hacker Newsnew | past | comments | ask | show | jobs | submitlogin

The implementation of std::regex appears to be NFA based [1] which is surprising because backreferences are part of the standard.

How are backreferences implemented here?

1: https://github.com/microsoft/STL/blob/master/stl/inc/regex



That is near the top of the list of things to replace when we can break ABI next. It's... not amazing.

We regularly recommend customers use Boost if you need an API compatible version, or something like RE2 if you just need a fast regex engine.


https://github.com/microsoft/STL/blob/master/stl/inc/regex#L... is the code. Looks pretty simple, check if it matches the backreferences and error/continue.

The handling of repetition is more revealing, it doesn't appear to have "states" at all but rather is just the usual backtracking: https://github.com/microsoft/STL/blob/master/stl/inc/regex#L...

So the "NFA" is really an AST for the regex rather than an automaton based on Thompson's algorithm or similar.


I think you're thinking of DFAs, which don't support backreferences. NFAs do. Right?


Maybe? But in theory any NFA can be reduced to a DFA through powerset construction, so there must be something else going on.


Jeffrey Friedl's book on regexes unapologetically popularized the incorrect use of the terms "NFA" and "DFA" to characterize the different implementations of regex engines. Translating to something more sensible, "NFA" means "unbounded backtracking," where as "DFA" means "Thompson NFA simulation." In this vernacular, as far as I can tell, there is no word to describe an actual DFA.

PCRE's docs continue this confusion. For example, they expose a DFA API... But it's not actually a DFA!

One might say this is similar to how "regex" doesn't necessarily imply "regular," which is true. That's definitely a battle that has been lost long ago. But NFA and DFA are even more specific jargon terms, and the way in which they have been redefined by the Friedls of the world has led to exactly your confusion over and over again.


Thank you, that clears everything up. NFA as used here means something completely different from NFAs in CS literature.


Unfortunately, yes. Although there is a connection. If you squint, you can see parts of the NFA. Indeed, it is very easy to simulate an NFA with unbounded backtracking (taking exponential worst case time). All you need to do from there is add a few op codes for look around and backreferences. Where you end up still looks like you're simulating an NFA, but the actual implementation has a lot more power.

To be fair, the same thing happens with the Thompson NFA simulation. It's not too hard to add a little extra power to it in the form of memory in order to track capturing group locations. It becomes more powerful than an NFA with this addition, but retains its linear time bound. This is why I've normally seen this addition called the Pike VM. (At least, Russ Cox and I both use than name.)

For completeness, you can also use bounded backtracking to simulate an NFA. It keeps track of all states visited per input byte, so it uses a lot of extra memory, but maintains a linear time bound.


Any NFA that does not have backreferences can be converted to a DFA. If you have backreferences in a regex, you must use an NFA to drive it.

Also, backreferences stand a chance of turning an NFA into NIA. Turns out it's not possible to programmatically determine which backreference-containing regexes will halt or not.


That's only for regular languages. Backreferences aren't regular.


Yes, hence my question about using NFAs for std::regex!




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: