Hacker News

I found it extremely helpful. The sheer emphasis of the reply made me very curious about why the idea of using regex on XML is so bonkers.

- I will never forget that regex can't parse XHTML

- the reason being, regex is insufficiently powerful

- when I first saw this post, I knew little about regex under the hood; it sent me down a wiki hole of FSMs, pushdown automata and Turing machines

- this misconception is apparently common enough to be madness-inducing to those that know better

- use a hecking XML parser instead

It almost reminds me of a Bill Nye sketch. Teaching through a bit of non-sequitur and absurdism.
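To make the "insufficiently powerful" point concrete, here is a minimal sketch of what a parser gives you that a single regular expression does not: a tree you can recurse over to arbitrary depth. The sample markup and the `max_depth` helper are made up for illustration; only `xml.etree.ElementTree` from the Python standard library is assumed.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document, not taken from the original question.
SAMPLE = "<div><p>one <b>two</b></p><div><p>three</p></div></div>"

def max_depth(elem, depth=1):
    """Return the deepest nesting level under elem (the root counts as 1)."""
    return max([depth] + [max_depth(child, depth + 1) for child in elem])

root = ET.fromstring(SAMPLE)
print(max_depth(root))  # prints 3
```

Tracking nesting like this needs a stack (here, the call stack), which is the pushdown-automaton part that a finite state machine lacks.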



The problem with the answer is that it is wrong. The question is about identifying start-tags in XHTML. This is a question of tokenization and can be solved with a regular expression. Indeed, most parsers use regular expressions for the tokenization stage. It is exactly the right tool for the job!
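As a rough sketch of what token-level matching could look like (this is my own simplified pattern, not the one from the blog post or the original question): it assumes well-formed XHTML where every attribute value is quoted, and it does not try to skip comments or CDATA sections. The names `START_TAG` and `start_tags` are hypothetical.

```python
import re

# Simplifying assumptions: well-formed XHTML, quoted attribute values,
# no comments or CDATA sections in the input.
START_TAG = re.compile(
    r"""<([A-Za-z_][\w:.-]*)            # tag name
        (?:\s+[\w:.-]+\s*=\s*           # attribute name and '='
           (?:"[^"]*"|'[^']*'))*        # quoted attribute value
        \s*(/?)>                        # optional '/' before '>'
    """,
    re.VERBOSE,
)

def start_tags(text):
    """Yield (name, has_trailing_slash) for each start tag found in text."""
    for m in START_TAG.finditer(text):
        yield m.group(1), m.group(2) == "/"

sample = '<p class="a>b">Hi <br/> <a href="x">link</a></p>'
print(list(start_tags(sample)))
# prints [('p', False), ('br', True), ('a', False)]
```

Note that the `>` inside the quoted attribute value does not end the tag, and end tags such as `</a>` are never matched because a tag name cannot start with `/`.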

Furthermore, the asker specifically needs to distinguish between start tags and self-closing start tags. This is a token-level difference which is typically not exposed by XHTML parsers. So saying "use a parser" is less than helpful.
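A small illustration of the "not exposed by parsers" point, using Python's standard `xml.etree.ElementTree` as one example parser (others may behave differently, and some expose source positions). The two inputs and the `shape` helper are made up for this sketch.

```python
import xml.etree.ElementTree as ET

# Two hypothetical inputs that differ only at the token level.
self_closing = ET.fromstring("<p><br/></p>")
explicit_pair = ET.fromstring("<p><br></br></p>")

def shape(elem):
    """Summarize an element as (tag, text, children), ignoring source syntax."""
    return (elem.tag, elem.text, [shape(child) for child in elem])

# Once parsed, nothing in the tree records which spelling was used.
print(shape(self_closing) == shape(explicit_pair))  # prints True
```

So if the task is specifically to tell `<br/>` apart from `<br>`, the resulting tree alone will not answer it.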

I have elaborated a bit in a blog post: https://www.cargocultcode.com/solving-the-zalgo-regex/



