21 September 2005

Ampersandectomy

For a while now, a few of us at www.att.com have been trying to crack the case of getting ampersands to appear properly encoded as &amp; once our Content Management System slices and dices said content and makes beautiful julienned fries. (I mean files.) Try as we might, there always seem to be a few that magically transform into a bare & despite our best efforts. “It’s not unlike playing whack-a-mole, y’see.”

So this has got to be a CMS gaffe, right? This isn’t rocket science. All we’re looking for is well-formed, valid markup at the end of the day. If we enter plain text, we can rest easy knowing & becomes &amp; (or &#38;) at markup generation time. Or, if we enter some XHTML outright (not exactly what I’d call “content” in the purest sense but let’s not pick nits - besides that’s another topic anyway) the CMS editor kindly performs its own sanity checks before passing it on.

So that’s it. We’re covered!

Alas, not quite, as we learned on Friday. The culprit, it turns out, is not the CMS per se. Rather, it’s the XSLT processor used by the CMS.

“Must be some super-special proprietary closed system! It figures.” Wait - actually, no. It’s not. Hold on to your hats, it’s (drum roll) … the Apache Software Foundation’s very own … Xalan! (Surprised? Well, I sure was.)

Here’s the problem in a nutshell:

Let’s suppose I have markup that includes URI attribute values containing two or more query string parameters. Something like <a href="http://my.web.server/file?param1=a&amp;param2=b">...</a>. Notice that I used &amp; to separate each name/value pair. Now I realize I could go the more ecologically friendly route and use the lone semicolon as a separator (which, truth be told, I’d actually prefer). Let’s just say for the sake of argument that I need to peacefully coexist with both kinds of separators. Mostly because, for the sake of reality, I actually do need to peacefully coexist with ; and &amp;.

The next ingredient is the XSLT itself, wherein we can set the output method. Should we set it to plain text? Nah, too easy. Let’s pick something more sensible, perhaps an output method of xhtml, or even html. Sure! Let’s pick html, why don’t we. Aside from that, all we do is pass along the markup from point a to point b.
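For the sake of illustration, a pass-through stylesheet of this kind might look something like the sketch below. This is not the actual stylesheet from the CMS; it assumes nothing beyond the standard identity template plus the html output method.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- The output method in question. -->
  <xsl:output method="html"/>

  <!-- Identity template: copy every node and attribute through as-is. -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>
```

Feed it a document containing the link from the previous paragraph and the only thing left to vary is the processor.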

Now, if we were to pass all of this through Xalan, what do you suppose would happen? Should we expect to see those escaped ampersands left as-is in the final output? Presuming we’ve set up our template properly, I’d say yes we should.

But it is not to be. Instead, Xalan manages to convert each of them to a lone &!

Wait, it gets better. Take that same exact attribute value - the one with the ampersands in it - and assign it to title, or alt, or any other non-URI attribute.

Same result, right? Nope. It emerges unscathed! Really. No fooling.

Bug XALANJ-611 tells all. I won’t repeat the discussion here. All I can say is it spells out the situation rather well, plus it is flagged as a “major bug” and is still present in the latest release from what I can tell.

Right. On to plan B. (There’s always a plan B.) What about libxslt? Works fine. Saxon? A-OK. XMLSpy? Nooooo problem.

But can I use any of these with my CMS? Sadly, no. The CMS is shipped with Xalan goodness baked right in. In fact, I’ll even go so far as to reveal that it has been modified by the vendor. “Ah-ha, so it is the vendor’s fault!” Nope. Thanks to Marc Liyanage’s TestXSLT, I’ve verified I can still replicate the unwanted behavior using a fresh install of Xalan, even the latest release. Besides, there’s already a bug filed, plus we wouldn’t want to break the warranty seal and hack the poor CMS to pieces now, would we?

Ah well. So much for Plan B. [Pauses, looks up to the heavens, shakes both fists in the air, cries out, camera POV from above.] “Ahhhhhhhhh!”

I wonder if there is a chance this can be patched without a lot of muss and fuss? Is it actually low-hanging fruit? How ‘bout it, friendly neighborhood Xalan Developers? This one’s been around since late October 2001, and is currently reopened. “Can we fix it?”

Meanwhile, there is a (potentially) happy ending here. I conveniently skipped xhtml output mode in the above scenario. What happens if we try it? No problem at all (to Xalan’s credit)! Very well then. For html output mode, we know a surprise is in store.

As for www.att.com, it’s now clear that our XSLT has been set to html output mode all this time. Whoops. I recommended that we switch to xhtml output mode and kick the tires (and hold the xml declaration for now, thank you). We are, after all, generating what should be well-formed and valid xhtml. Hopefully there are no other gotchas in store by going this route. Right?
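In stylesheet terms, that recommendation could come down to a one-line change in xsl:output. A sketch only, with two assumptions flagged: the rest of the stylesheet is reduced to an identity template for brevity, and the processor has to accept xhtml as an output method (XSLT 1.0 itself only defines xml, html and text; xhtml was standardized later, in XSLT 2.0).

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Assumption: the processor supports the "xhtml" method.
       omit-xml-declaration="yes" holds the XML declaration. -->
  <xsl:output method="xhtml" omit-xml-declaration="yes"/>

  <!-- Identity template standing in for the real templates. -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>
```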

Oh, c’mon, you know there will be gotchas.

So far, we’ve only found one. Turns out we still have a rather unhealthy dose of historical content that uses named anchors in conjunction with fragment IDs, like so: <a name="id"></a>. (Yes, this is considered verboten nowadays. No, we’re unable to go on a search and replace mission for the time being.) At any rate, because there is essentially no text node (to be displayed), or even if there were only whitespace within, it emerges simplified as <a name="id" />.

Pray tell, what browser do you suppose has a problem applying CSS to this sanely? (Starts with an I … ends with an R …)

Now what? We keep that distillation from occurring. For now, I’ve got the XSLT forcing a space in between when there is neither a text node nor nested elements to be found, like so:


<xsl:template match="a[not(normalize-space(.)) and not(*)]">
  <xsl:copy>
   <xsl:copy-of select="@*"/>
   <xsl:text><![CDATA[ ]]></xsl:text>
  </xsl:copy>
</xsl:template>

We’re going to kick the tires on this for a bit, allowing it to be introduced with content updates and, if there are no other surprises in between now and then, apply it across the board.

UPDATE: Yes, there are surprises! How about empty <textarea></textarea> or <script></script> blocks? The latter case is especially common when including page behaviors.

For Plan B, then, we’re still looking for the “empty” case as before, only this time we let through the ten XHTML 1.0 Strict elements that are expected to be empty outright. The rest get the one space treatment from before. Here’s the XSLT I’ve submitted for this go-round:


<xsl:template match="*[not(normalize-space(.)) and not(*)]">
  <xsl:choose>
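    <!-- Test the matched element itself (self::), not its siblings (../). -->
    <xsl:when test="self::area|self::base|self::br|self::col|self::hr|self::img|self::input|self::link|self::meta|self::param">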
      <xsl:copy>
        <xsl:copy-of select="@*"/>
        <xsl:apply-templates/>
      </xsl:copy>
    </xsl:when>
    <xsl:otherwise>
      <xsl:copy>
        <xsl:copy-of select="@*"/>
        <xsl:text><![CDATA[ ]]></xsl:text>
      </xsl:copy>
    </xsl:otherwise>
  </xsl:choose>
</xsl:template>

Perhaps this can be simplified/refactored somewhat. In any event, so far we’re getting far better results with this revision.
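One possible simplification, offered as an untested sketch rather than what was submitted: fold the list of always-empty elements into the match pattern, so the template only fires for elements that need the space. It uses the self:: axis to test the matched element itself, and it assumes an identity template elsewhere in the stylesheet picks up everything this pattern skips, and that the elements are in no namespace.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Assumed fallback: copy everything else through unchanged. -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- "Empty" elements, minus the ten that are supposed to be empty,
       get a single space so they are not collapsed to <foo />. -->
  <xsl:template match="*[not(normalize-space(.)) and not(*)]
                        [not(self::area or self::base or self::br
                          or self::col or self::hr or self::img
                          or self::input or self::link or self::meta
                          or self::param)]">
    <xsl:copy>
      <xsl:copy-of select="@*"/>
      <xsl:text> </xsl:text>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>
```

The xsl:choose goes away because the ten elements never match in the first place; they simply fall through to the identity template, which has the lower default priority.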

Posted by joe at 12:30 AM
