<a href="</a>">try this</a>

unlinkr · on Sept 8, 2019

That is not XHTML.

hk__2 · on Sept 8, 2019

What about <a>this  </a> ?

unlinkr · on Sept 9, 2019

Yes you can tokenize this with a regular expression and extract the valid start and end tags.

If comments in XHTML could nest, you would have a problem, but they cannot.

hk__2 · on Sept 9, 2019

> Yes you can tokenize this with a regular expression and extract the valid start and end tags.

So you need more than a regular expression, hence your premise is incorrect.

unlinkr · on Sept 9, 2019

No, you don't need more than a regular expression. If you want to extract elements, i.e. match start tags to their corresponding end tags, then you need a stack-based parser. But to extract only the start tags (which is what the question asks), a regular expression is sufficient.

The original question is a question about tokenization, not parsing, which is why a regular expression is sufficient.
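
To make the tokenization claim concrete, here is a minimal sketch in Python. The pattern and the `start_tags` helper are illustrative assumptions, not something posted in the thread, and the name and attribute rules are simplified compared to the real XML grammar. It treats comments, end tags, and start tags as alternative tokens and keeps only the start tag names.

```python
import re

# Hypothetical tokenizer pattern (an assumption for illustration).
# Alternatives are tried in order: comment, end tag, start tag.
TOKEN = re.compile(
    r"""
    <!--.*?-->                                      # comment; XML comments do not nest
  | </(?P<end>[A-Za-z_][\w:.-]*)\s*>                # end tag
  | <(?P<start>[A-Za-z_][\w:.-]*)                   # start tag name
      (?:\s+[\w:.-]+\s*=\s*(?:"[^"<]*"|'[^'<]*'))*  # attributes with quoted values
      \s*/?>                                        # optional self-closing slash
    """,
    re.VERBOSE | re.DOTALL,
)

def start_tags(text):
    """Return the names of all start tags, in document order."""
    return [m.group("start") for m in TOKEN.finditer(text) if m.group("start")]

print(start_tags('<p class="x">hi <br/> <!-- <b> --> </p>'))
```

No stack is involved: the `<b>` inside the comment is consumed as part of the comment token, and nothing here tries to pair `<p>` with `</p>`. Pairing them is the step that would need a parser.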

nurettin · on Sept 8, 2019

unlinkr · on Sept 8, 2019

That is a valid XHTML tag (if I remember correctly) and can be matched perfectly fine by a regex.

nurettin · on Sept 8, 2019

Perhaps something like "([^"]*)" could skip what is inside the string literal. But if the string literal contains "<input", then where you start parsing becomes very important.

unlinkr · on Sept 9, 2019

That pattern would indeed match a quoted string. I don't see how it would matter if the quoted string contains something like "<input". It can contain anything except a quote character.
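
A quick sketch of that point in Python, using the pattern from the comment above (the sample `text` is made up for illustration): the character class `[^"]` accepts `<` like any other non-quote character, so a literal containing "<input" is matched as one unit.

```python
import re

# The quoted-string pattern discussed above.
QUOTED = re.compile(r'"([^"]*)"')

# Made-up sample: the first literal contains something that looks like a tag.
text = 'value="<input type=text>" name="q"'

literals = QUOTED.findall(text)
print(literals)
```

With a single capturing group, `findall` returns the contents of each literal, here `<input type=text>` and `q`.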

nurettin · on Sept 9, 2019

It just makes the starting offset of the regex input important.

unlinkr · on Sept 9, 2019

Sure. But that would be true for any parsing technique. No parser known to man would be able to produce a valid parse if you start it in the middle of a quoted string!
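
The offset issue can be shown with the same quoted-string pattern. This is a small sketch with made-up input: scanning from the beginning finds both literals, while scanning from an offset inside the first literal mistakes its closing quote for an opening one.

```python
import re

QUOTED = re.compile(r'"([^"]*)"')

# Made-up sample input with two quoted literals.
text = 'a="one" b="two"'

# Scanning from the beginning sees both literals.
from_start = [m.group(1) for m in QUOTED.finditer(text)]

# Scanning from an offset inside the first literal (the "n" of "one")
# treats the closing quote of "one" as an opening quote.
mid = text.index("n")
from_middle = [m.group(1) for m in QUOTED.finditer(text, mid)]

print(from_start)
print(from_middle)
```

The second scan captures the text between the two literals instead of the literals themselves, which is the failure any tokenizer would show when started at that position.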