What’s the difference between this HTML snippet:
Both of them look like simple Google searches (though they could have been anything; Google is just an example). One of them appends an extra “&foo=0” to the end of the URL; the other appends “&copy=0” instead.
Only the second snippet is valid in HTML 4.01 Strict, but that snippet doesn’t work the way you might expect. Neither snippet is valid in XHTML.
Give up? Click on these:
The first URL searches for “html,” but the other URL searches for “html©=0.”
Two weird things are happening here.
- Note that “&copy” is an HTML entity for the copyright symbol “©.” It would have been more obvious if the URL had used a semicolon, like this:
or if we’d used a more traditional HTML entity like this:
- The second weird thing is a quirk in the HTML specification on character references:
Note. In SGML, it is possible to eliminate the final “;” after a character reference in some cases (e.g., at a line break or immediately before a tag). In other circumstances it may not be eliminated (e.g., in the middle of a word). We strongly suggest using the “;” in all cases to avoid problems with user agents that require this character to be present.
As a result, all modern browsers (FF3, IE7, Opera 9, Safari 3.1) will helpfully notice possible entities like “&copy” and “&lt” and replace them with “©” and “<”; they assume you forgot the semicolon. This applies to all of the HTML entities, even the obscure ones like &empty “∅”, &not “¬”, &reg “®”, &sub “⊂”, and &lang “〈”. (Bizarrely, &Copy is left alone as “&Copy” but &copy is replaced with “©”.)
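The lenient decoding described above can be reproduced outside a browser. Here is a minimal sketch using Python's standard `html.unescape`, which also accepts the older entity names without a trailing semicolon. The query strings are hypothetical stand-ins modeled on the two examples, and a real browser may apply slightly different rules, particularly inside attribute values.

```python
from html import unescape

# Hypothetical query strings modeled on the article's two examples.
plain = "q=html&foo=0"
risky = "q=html&copy=0"
capitalized = "q=html&Copy=0"

# "foo" is not an entity name, so the text passes through unchanged.
decoded_plain = unescape(plain)

# "copy" is an entity name, so "&copy" is decoded even without the ";".
decoded_risky = unescape(risky)

# Entity names are case-sensitive, so "&Copy" is left alone.
decoded_capitalized = unescape(capitalized)

for before, after in [(plain, decoded_plain),
                      (risky, decoded_risky),
                      (capitalized, decoded_capitalized)]:
    print(f"{before!r} -> {after!r}")
```

Only the middle string changes: the parameter name is swallowed and a literal “©” ends up in the query.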
We think there are two valuable lessons to learn from this story. The first lesson you may already know:
- The correct way to write a URL with a query parameter is to HTML escape the URL, replacing every & with &amp; like this:
That’s also the only way to make the snippet XHTML compliant.
- Don’t use URL query parameters whose names are HTML entities. Never create a web service that accepts a query parameter like “&lang=en”. After all, there’s no way to know when your users might want to copy & paste your URLs into a blog, forum, or HTML email. Even if developers are clever enough to HTML escape href links, not everyone will be, and you can save everybody some trouble by avoiding the dangerous entities altogether.
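Both lessons can be sketched in a few lines of Python. The helper names `href_attribute` and `entity_named_params` and the example URLs are hypothetical, chosen only for illustration. The check uses the HTML 4 entity table from the standard library and only catches exact name matches, so treat it as a starting point rather than a complete audit.

```python
from html import escape
from html.entities import name2codepoint
from urllib.parse import parse_qsl, urlsplit


def href_attribute(url):
    """Return url escaped for use inside an HTML href attribute."""
    # escape() turns "&" into "&amp;" (and also escapes <, >, and quotes).
    return escape(url, quote=True)


def entity_named_params(url):
    """List query parameter names that are also HTML 4 entity names."""
    query = urlsplit(url).query
    return [name
            for name, _value in parse_qsl(query, keep_blank_values=True)
            if name in name2codepoint]  # case-sensitive, like entity names


if __name__ == "__main__":
    # Hypothetical URLs, not taken from the original snippets.
    print(href_attribute("http://www.google.com/search?q=html&copy=0"))
    print(entity_named_params("http://example.com/api?q=html&lang=en&foo=0"))
```

A check like `entity_named_params` could run when a web service's parameters are first designed, which is the cheapest moment to rename something like `lang`.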