
<a href="</a>">try this</a>


That is not XHTML.


What about <a>this <!-- </a> --> </a> <!-- </a> -->?


Yes, you can tokenize this with a regular expression and extract the valid start and end tags.

If comments in XHTML could nest, you would have a problem. But this is not the case.


> Yes, you can tokenize this with a regular expression and extract the valid start and end tags.

So you need more than a regular expression, hence your premise is incorrect.


No, you don't need more than a regular expression. If you want to extract elements, i.e. match start tags to the corresponding end tags, then you need a stack-based parser. But just to extract the start tags (which is the question) a regular expression is sufficient.

The original question is a question about tokenization, not parsing, which is why a regular expression is sufficient.
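
To make the tokenization point concrete, here is a minimal sketch in Python. It is not a complete XHTML tokenizer: the names TOKEN and start_tags are made up for illustration, the tag and attribute name character classes are a simplification of real XML names, and CDATA sections and processing instructions are ignored. It only shows that a single regular expression with alternatives for comments, end tags, and start tags with quoted attribute values handles the examples posted in this thread.

```python
import re

# One alternative per token type. Comments are listed first so that a "</a>"
# inside "<!-- ... -->" is consumed as part of the comment token.
# Assumption: simplified name syntax, no CDATA or processing instructions.
TOKEN = re.compile(r"""
    (?P<comment><!--.*?-->)
  | (?P<end></[A-Za-z][\w:.-]*\s*>)
  | (?P<start><[A-Za-z][\w:.-]*(?:\s+[\w:.-]+\s*=\s*(?:"[^"]*"|'[^']*'))*\s*/?>)
""", re.VERBOSE | re.DOTALL)


def start_tags(text):
    """Return the start tags in text, skipping comments and end tags."""
    return [m.group("start") for m in TOKEN.finditer(text)
            if m.lastgroup == "start"]


if __name__ == "__main__":
    for sample in ('<a href="</a>">try this</a>',
                   'What about <a>this <!-- </a> --> </a> <!-- </a> -->?',
                   '<input value="how about this? />"/>'):
        print(start_tags(sample))
```

Matching each start tag to its end tag would still need a stack on top of this token stream, which is the tokenizing versus parsing distinction made above.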


<input value="how about this? />"/>


That is a valid XHTML tag (if I remember correctly) and can be matched perfectly fine by a regex.


Perhaps something like "([^"]*)" could skip what is inside the string literal. But if the string literal contains "<input", where you start parsing becomes very important.


That pattern would indeed match a quoted string. I don't see how it would matter if the quoted string contains something like "<input". It can contain anything except a quote character.
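
As a quick illustration, assuming Python's re module and a made-up sample string, the pattern treats "<input" inside the quotes like any other characters:

```python
import re

# The pattern from the comment above: a quote, any run of non-quote
# characters (captured), then the closing quote.
QUOTED = re.compile(r'"([^"]*)"')


def quoted_values(text):
    """Return the contents of every double-quoted string in text."""
    return QUOTED.findall(text)


if __name__ == "__main__":
    # Hypothetical sample; the first literal contains "<input".
    sample = '<a title="see <input> here" href="x">'
    print(quoted_values(sample))
```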


It just makes the starting offset of the regex input important.


Sure. But that would be true for any parsing technique. No parser known to man would be able to produce a valid parse if you start it in the middle of a quoted string!
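
A small sketch of the offset issue, again assuming Python's re module, with a hypothetical helper quoted_from and a made-up sample string. Scanning from the start pairs the quotes correctly. Scanning from inside the first literal pairs its closing quote with the next opening quote, so the result is wrong, as it would be for any scanner started at that position.

```python
import re

QUOTED = re.compile(r'"([^"]*)"')


def quoted_from(text, offset):
    """Return the quoted-string contents found when scanning from offset."""
    return [m.group(1) for m in QUOTED.finditer(text, offset)]


if __name__ == "__main__":
    sample = '<a title="see <input> here" href="x">'
    # Scan from the beginning of the input.
    print(quoted_from(sample, 0))
    # Scan from the middle of the first string literal.
    print(quoted_from(sample, sample.index("<input")))
```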



