To be fair it *is* unconstructive. If you read the question carefully, it is cle...

lloeki · on Jan 31, 2019

Regexes are indeed a perfectly fine answer when you have the guarantee no corner cases will show up in the content, and I did and still do use regexes to quickly extract data form well-known HTML/XML as a quick hack (curl|grep). Otherwise you're much better served by using a parser and select nodes with xpath/css.

The question doesn't specify if the file to match against is unique/one-shot or if it's a general case. Without that info you can largely assume it has to handle any input. The regex will get unwieldy since you have to account for corner cases like:

   <!-- <a href="foo"> -->
   <div bar='<a href="foo">'></div>

goto11 · on Jan 31, 2019

The second line is not a corner case, that is simply not legal XHTML. You cannot have an unescaped < in an attribute value. You will need to take comments (and DTD's and CData) into consideration of course, but you can do that in a regex.

In any case, how would you use xpath or CSS to identity self-closing tags? They operate on the parsed tree, not on the token level, and the question is about identifying specific tokens.

boomlinde · on Jan 31, 2019

Maybe not very constructive, but I think it's a technically fair answer given the question. The person asking is not intending to match individual tokens one by one to feed into a parser, but simply to use a regular expression to extract all instances of a set of opening tags in a whole document. The trivial solution he proposes, while perfectly sufficient for some subset of documents, quickly breaks in the general case when you consider comments and CDATA sections. For that you need to maintain an understanding of the whole document.

That said, this answer frequently gets linked in discussions even where using regular expressions is an entirely valid approach.

goto11 · on Jan 31, 2019

How is it technically fair? The answer is objectively wrong - you can tokenize XHTML using regexes. You cannot use a parser, since a parser does not emit tokens but emit the element tree and abstracts away syntactic details like the difference between <x></x> and <x />.

A technically fair answer would be to point out that the regex would have to take other tokens like comments, CData etc. into consideration, so it is more like a five-line regex than a one-line regex. If someone recommended a XHTML tokenizer or other tool which could solve the OP's task, that would also be a great answer.

boomlinde · on Jan 31, 2019

> How is it technically fair? The answer is objectively wrong - you can tokenize XHTML using regexes.

Yes, but that you can tokenize XHTML using regular expressions is not the same thing as being able to use a single regular expression to extract XHTML tokens. Remember that context free languages are a superset of regular expressions. I don't personally know enough about the XHTML syntax to say off the bat whether the syntax can be described with a regular expression, but generally a recursive definition of valid syntax is not possible to express with regular expressions.

> You cannot use a parser, since a parser does not emit tokens but emit the element tree and abstracts away syntactic details like the difference between <x></x> and <x />.

You can use a parser, just not any XHTML parser. The parser would need to be constructed with the objectives in mind, to parse into a data structure that doesn't abstract these details away.

That said, maybe an even simpler solution exists, such as to use several regular expressions to first remove comment and CDATA before matching. I'm not immediately aware of any other cases that would cause problems for the trivial match suggested in the question post.

Izkata · on Jan 31, 2019

I think you're missing the point - if anything the reasons you gave would be cause for it to be downvoted, because it's still an anwer, just a bad one. The reason it would be deleted as unconstructive is the creativity, which is discouraged in the push for professionalism.

goto11 · on Jan 31, 2019

But it is not downvoted. It is heavily upvoted despite being wrong and misleading. Because it is fun and snarky so lots of people upvote it regardless whether they even understand the issue or not.

itsfun10213123 · on Jan 31, 2019

>Parsers typically use regexes for the tokenization stage - indeed, what else would you use?

This is completely wrong. One can also just write their own tokenizer reading one character at a time with a state machine. It's trivial compared to the complexity of the rest of the parser.

grumdan · on Jan 31, 2019

A standard state machine with no memory (other than the current state) is equivalent in expressivity to regexes (in fact regexes with back-references are more expressive); even if the state machine is non-deterministic.

goto11 · on Jan 31, 2019

The question is not about parsing. It is about tokenizing XHTML. So you are suggesting to write a hand-rolled tokenizer instead of using regexes for tokenization? Why is that better? That is exactly the kind of task regexes excel at.

sergiosgc · on Jan 31, 2019

A regex is a state machine. You can code the state machine by hand, but that does not invalidate the previous statement.

mpax · on Feb 6, 2019

Depends on how you look at it.

Regex is a family of languages each of which can have various implementations. You could have a regex implementation that instead uses mutually recursive functions etc.

What is true is that regexes are typically not turing complete and can be represented with simple state machines.

mayniac · on Jan 31, 2019

>The answers are ridiculing the OP for asking a totally reasonable question.

Isn't this fairly common for stackoverflow nowadays?

linker3000 · on Jan 31, 2019

Hell yeah:

I was ridiculed for posting a query about a C++ concept I was trying to learn from one of the authoritative books on the subject - I just couldn't 'get' the syntax being explained.

I persevered and then someone chimed in that, hey, there was a typo in the book's example!