
Regexes are indeed a perfectly fine answer when you have the guarantee that no corner cases will show up in the content, and I did and still do use regexes to quickly extract data from well-known HTML/XML as a quick hack (curl|grep). Otherwise you're much better served by using a parser and selecting nodes with XPath/CSS.

The question doesn't specify if the file to match against is unique/one-shot or if it's a general case. Without that info you can largely assume it has to handle any input. The regex will get unwieldy since you have to account for corner cases like:

   <!-- <a href="foo"> -->
   <div bar='<a href="foo">'></div>
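To make the point concrete, here is a minimal sketch that runs a naive link regex and Python's standard `html.parser` over those two lines. The pattern `NAIVE_LINK` and the helpers `LinkCollector`, `regex_links` and `parser_links` are illustrative names, not anything from the original question.

```python
import re
from html.parser import HTMLParser

# The two corner-case lines from the comment above.
SAMPLE = """<!-- <a href="foo"> -->
<div bar='<a href="foo">'></div>"""

# A deliberately naive pattern of the kind used in quick curl|grep hacks.
NAIVE_LINK = re.compile(r'<a\s+href="([^"]*)"')


class LinkCollector(HTMLParser):
    """Collects href values from real <a> start tags only."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.hrefs.extend(value for name, value in attrs if name == "href")


def regex_links(text):
    """Return every href the naive regex believes it found."""
    return NAIVE_LINK.findall(text)


def parser_links(text):
    """Return hrefs of <a> elements as seen by the parser."""
    collector = LinkCollector()
    collector.feed(text)
    collector.close()
    return collector.hrefs


if __name__ == "__main__":
    print("regex :", regex_links(SAMPLE))
    print("parser:", parser_links(SAMPLE))
```

The regex reports a link inside the comment and another inside the attribute value, while the parser sees no `<a>` element at all.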


The second line is not a corner case; it is simply not legal XHTML. You cannot have an unescaped < in an attribute value. You will need to take comments (and DTDs and CDATA) into consideration of course, but you can do that in a regex.
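A rough sketch of what that could look like in Python: a single alternation that consumes comments and CDATA sections first, so that only self-closing tags outside them are reported. The names `TOKEN` and `self_closing_tags` are made up for illustration, and the sketch assumes well-formed input with quoted attribute values; DTDs and processing instructions are not handled here.

```python
import re

# Alternation order matters: comments and CDATA are consumed as whole tokens,
# so a "<br/>" inside them is never offered to the self-closing branch.
TOKEN = re.compile(
    r"""
      (?P<comment><!--.*?-->)
    | (?P<cdata><!\[CDATA\[.*?\]\]>)
    | (?P<selfclosing>
          <[A-Za-z_][\w:.-]*                                # tag name
          (?:\s+[\w:.-]+\s*=\s*(?:"[^"]*"|'[^']*'))*        # quoted attributes
          \s*/>
      )
    """,
    re.VERBOSE | re.DOTALL,
)


def self_closing_tags(xml_text):
    """Return self-closing tags that occur outside comments and CDATA."""
    return [
        match.group("selfclosing")
        for match in TOKEN.finditer(xml_text)
        if match.lastgroup == "selfclosing"
    ]


if __name__ == "__main__":
    doc = '<!-- <br/> --><root><br/><![CDATA[<hr/>]]><img src="a.png" /><p></p></root>'
    print(self_closing_tags(doc))
```

The `<br/>` inside the comment and the `<hr/>` inside the CDATA section are skipped because those branches match first.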

In any case, how would you use XPath or CSS to identify self-closing tags? They operate on the parsed tree, not on the token level, and the question is about identifying specific tokens.
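As a small illustration, using Python's `xml.etree.ElementTree` as one example of a tree-building parser: once parsed, `<a/>` and `<a></a>` produce the same tree, so nothing that queries the tree can tell them apart. The helper `tree_shape` is a hypothetical name used only for this sketch.

```python
import xml.etree.ElementTree as ET


def tree_shape(xml_text):
    """Summarise a parsed document as (tag, text, child count) per element."""
    root = ET.fromstring(xml_text)
    return [(element.tag, element.text, len(element)) for element in root.iter()]


if __name__ == "__main__":
    self_closed = tree_shape("<r><a/></r>")
    open_close = tree_shape("<r><a></a></r>")
    # Both spellings yield the same element tree.
    print(self_closed == open_close)
```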



