
out of curiosity: what would you use for the URL validation?

(but i get this is difficult: https://gist.github.com/dperini/729294 )



Well, it's easy to see why people are led astray. In this case, RFC 3986, which defines what a URI actually is, itself proposes a regex[0]. It even claims, and I quote, 'the "first-match-wins" algorithm is identical to the "greedy" disambiguation method used by POSIX regular expressions'. Of course, it doesn't actually validate anything... it just performs rudimentary field extraction. Trying to do anything beyond that with regex is madness.
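To make the "field extraction, not validation" point concrete, here is a minimal Python sketch. The pattern is the one from RFC 3986 Appendix B; the names URI_FIELDS and split_uri are made up for this illustration.

```python
import re

# Regex copied from RFC 3986, Appendix B. It splits a URI reference into
# fields; it does not validate them.
URI_FIELDS = re.compile(
    r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?"
)


def split_uri(text):
    """Return the five Appendix B fields (hypothetical helper)."""
    m = URI_FIELDS.match(text)
    # Every part of the pattern is optional, so the match always succeeds.
    return {
        "scheme": m.group(2),
        "authority": m.group(4),
        "path": m.group(5),
        "query": m.group(7),
        "fragment": m.group(9),
    }


print(split_uri("http://www.ics.uci.edu/pub/ietf/uri/#Related"))
print(split_uri("not a url at all"))
```

The second call succeeds too: the whole string simply lands in the path field, which is why a match tells you nothing about validity.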

The ABNF grammar is fairly simple though and, I think, free of ambiguities. I've had luck converting it straight into PEG form. It's not a trivial or wholly useful endeavour though, and, if you do so, remember to check the errata. For HTTP you'll also have to add in the changes from RFC 7230[1].

Oh, and of course, none of this validates DNS names, their labels, etc. for length, the "LDH rule", or the public suffix list[2], or IP addresses to check whether they have publicly routable prefixes.
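For a rough idea of what those extra checks look like, here is a sketch in Python. It assumes the usual DNS limits (253 characters per name, 63 per label) and uses the standard ipaddress module; is_ldh_hostname and is_public_ip are hypothetical names, and nothing here consults the public suffix list.

```python
import ipaddress
import string

# Letters, digits, hyphen: the "LDH" alphabet.
LDH = set(string.ascii_letters + string.digits + "-")


def is_ldh_hostname(host):
    """Check DNS length limits and the letters-digits-hyphen rule."""
    if host.endswith("."):
        host = host[:-1]  # tolerate a single trailing root dot
    if not host or len(host) > 253:
        return False
    for label in host.split("."):
        if not 1 <= len(label) <= 63:
            return False
        if label[0] == "-" or label[-1] == "-":
            return False
        for ch in label:
            if ch not in LDH:
                return False
    return True


def is_public_ip(text):
    """True only for literal IP addresses in globally routable ranges."""
    try:
        return ipaddress.ip_address(text).is_global
    except ValueError:
        return False  # not an IP literal at all
```

Under these assumptions, is_ldh_hostname("exa_mple.com") and is_public_ip("192.168.1.1") both come back False, even though both strings would pass the grammar.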

Bottom line is, if you want to validate a URL, the best thing to do, much like e-mail, is to just try and GET it.
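A minimal sketch of "just try and GET it" using Python's standard library. The function name is_reachable, the 5-second timeout, and the http/https-only restriction are all assumptions of this example; the scheme check is there because urlopen will otherwise happily open file: URLs.

```python
import http.client
import urllib.request
from urllib.parse import urlsplit


def is_reachable(url, timeout=5.0):
    """Return True if a GET on the URL yields a 2xx response (hypothetical helper)."""
    try:
        scheme = urlsplit(url).scheme
    except ValueError:
        return False
    if scheme not in ("http", "https"):
        return False  # refuse file:, ftp:, data:, and friends
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (OSError, ValueError, http.client.HTTPException):
        # URLError, HTTPError and timeouts are all OSError subclasses.
        return False
```

If the URLs come from users, the fetch itself becomes something to protect (redirects, internal addresses), so you'd want checks like the routable-prefix one in front of it.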

[0] https://tools.ietf.org/html/rfc3986#appendix-B

[1] https://tools.ietf.org/html/rfc7230#section-2.7.1

[2] https://publicsuffix.org/


The general rule is that any tricky security code should have as many eyeballs on it as possible, so I'd probably start by seeing if my framework had solid support for this, or failing that, seeing what popular libraries exist for my platform.

In this particular case, the Google Caja project[1] is a good starting place for most HTML/JS/CSS sanitization needs (although the project has a much larger scope than just that). I think the 'sanitizer' package on npm is a fairly popular wrapper/port of its basic sanitizing code, and I believe ruby/php/python have their own, but I couldn't name them offhand. But it would depend on the exact attack vector you're trying to stop, e.g., XSS, remote shell via filename params, etc.

If I had to write it myself, I'd probably go for something as braindead as possible; probably a bunch of nested loops backed by some thorough tests. Regexps are great for magic one liners, but magic one liners are antithetical to good security.

[1]: https://developers.google.com/caja/


I prefer straight up imperative code (or, better yet, whatever Url/Uri API that is provided by the platform).
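As one example of leaning on the platform API, here is a sketch with Python's urllib.parse. parse_http_url is a hypothetical helper, and accepting only http and https is an assumption of this example, not a general rule.

```python
from urllib.parse import urlsplit


def parse_http_url(text):
    """Return (scheme, host, port, path), or None if it isn't an http(s) URL."""
    try:
        parts = urlsplit(text)
        port = parts.port  # raises ValueError for a non-numeric or out-of-range port
    except ValueError:
        return None
    if parts.scheme not in ("http", "https") or not parts.hostname:
        return None
    return parts.scheme, parts.hostname, port, parts.path


print(parse_http_url("https://Example.com:8443/a?b=1"))
print(parse_http_url("javascript:alert(1)"))
```

The rest is plain if-statements on the parsed fields, which is easy to read and easy to test.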



