URL Query Parameters and HTML Entities: The Case of the Missing Semicolon

What’s the difference between this HTML snippet:

    <a href="http://www.google.com/search?q=html&foo=0">foo=0</a>

and this?

    <a href="http://www.google.com/search?q=html&copy=0">copy=0</a>

Both of them look like simple Google searches (though they could have been anything; Google is just an example). One of them appends an extra “&foo=0” to the end of the URL; the other appends “&copy=0” instead.

Only the second snippet is valid in HTML 4.01 Strict, but that snippet doesn’t work the way you might expect. Neither snippet is valid in XHTML.

Give up? Try opening both links in a browser.

The first URL searches for “html,” but the other URL searches for “html©=0.”

Two weird things are happening here.

  • The first is that “&copy;” is the HTML entity for the copyright symbol “©.” It would have been more obvious if the URL had used a semicolon, like this:

        <a href="http://www.google.com/search?q=html&copy;=0">copy;=0</a>

    or if we’d used a more familiar HTML entity like this:

        <a href="http://www.google.com/search?q=html&quot;=0">quot;=0</a>
  • The second weird thing is a quirk in the HTML specification on character references:

    Note. In SGML, it is possible to eliminate the final “;” after a character reference in some cases (e.g., at a line break or immediately before a tag). In other circumstances it may not be eliminated (e.g., in the middle of a word). We strongly suggest using the “;” in all cases to avoid problems with user agents that require this character to be present.

    As a result, all modern browsers (FF3, IE7, Opera 9, Safari 3.1) will helpfully notice possible entities like “&copy” and “&lt” and replace them with “©” and “<” … they assume you forgot the semicolon. This applies to all of the HTML entities, even the obscure ones like &empty “∅”, &not “¬”, &reg “®”, &sub “⊂”, and &lang “⟨”. (Bizarrely, &Copy is left alone as “&Copy” but &COPY is replaced with “©”.)
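The same lenient decoding can be reproduced outside a browser. The sketch below uses `html.unescape` from Python's standard library, which applies the HTML5 rules for text content; it is an approximation of what the browsers above did, not a model of any particular browser, and attribute handling in current browsers may differ. The helper name `decode_href` is made up for this illustration.

```python
from html import unescape
from urllib.parse import parse_qs, urlsplit

# The two href values from the snippets above, exactly as written in the HTML source.
hrefs = [
    "http://www.google.com/search?q=html&foo=0",
    "http://www.google.com/search?q=html&copy=0",
]


def decode_href(raw_href):
    """Decode character references leniently, then parse the resulting query string."""
    url = unescape(raw_href)  # "&copy" becomes "©" even without the ";"
    return url, parse_qs(urlsplit(url).query)


for raw in hrefs:
    url, params = decode_href(raw)
    print(url, params)

# Case matters: "&COPY" is a recognized legacy name, "&Copy" is not.
print(unescape("&COPY"), unescape("&Copy"))
```

With the first href the query still has two parameters, `q` and `foo`. With the second, the `copy` parameter disappears and its text is folded into the value of `q`, which is why the search term changes.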

We think there are two valuable lessons to learn from this story. The first lesson you may already know:

  1. The correct way to write a URL with a query parameter is to HTML-escape the URL, replacing every & with &amp; like this:

        <a href="http://www.google.com/search?q=html&amp;copy=0">copy=0</a>

    That’s also the only way to make the snippet XHTML compliant.

  2. Don’t use URL query parameters whose names are HTML entities. Never create a web service that accepts a query parameter like “&lang=en”. After all, there’s no way to know when your users might want to copy & paste your URLs into a blog, forum, or HTML email. Even if you are careful to HTML-escape your own href links, not everyone will be, and you can save everybody some trouble by avoiding the dangerous entities altogether.
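Both lessons can be sketched in a few lines of Python using only the standard library. The function names `href_for` and `risky_param_names` are hypothetical, chosen for this example. The name check relies on Python's `html.entities.html5` table and only catches exact matches, so it will miss names that merely begin with an entity name; treat it as a rough lint, not a guarantee.

```python
from html import escape, unescape
from html.entities import html5
from urllib.parse import urlencode


def href_for(base_url, params):
    """Build a URL with a query string and HTML-escape it for use in an href."""
    url = base_url + "?" + urlencode(params)
    return escape(url, quote=True)  # every "&" becomes "&amp;"


def risky_param_names(names):
    """Return the names that match an HTML entity name, with or without ';'."""
    return [n for n in names if n in html5 or n + ";" in html5]


href = href_for("http://www.google.com/search", {"q": "html", "copy": "0"})
print('<a href="%s">copy=0</a>' % href)

# Decoding the escaped attribute gives back the URL that was intended.
print(unescape(href))

# Lesson 2: flag parameter names that are better avoided in a public API.
print(risky_param_names(["foo", "copy", "lang"]))
```

Escaping at the point where the URL is written into HTML, instead of when the URL is built, keeps the raw URL usable for redirects and logs.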

Discussion

  • http://blog.caffeinatedsoftware.com Robbie

    This reminds me of an IE 6 bug I uncovered last year. IE 6 converted URL-encoded ampersands in the query string (%26) into literal ampersands (&). Needless to say, this little glitch completely fouls up your page’s query string processing.

  • sweavo

    The real lesson to be learned is that a SPECIFICATION should be SPECIFIC. The root of this difficulty is that the specification is permissive and wooly. /rant
