Not for the first time, I've been caught out while trying to write a simple XSLT stylesheet to pick information out of XML files. Hopefully this is the last time. I'm blogging this so I can refer to it again later; hopefully it will be useful to you too.
If your XSLT is behaving strangely, there are two golden rules to remember.
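XSLT has built-in template rules that apply whenever none of your own templates matches a node (including one that copies text() nodes to the output verbatim). If you're deliberately trying to filter some content out of the file, then you'll want to override these.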
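<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
  <!-- Start at the document root and process everything beneath it -->
  <xsl:template match="/">
    <xsl:apply-templates />
  </xsl:template>
  <!-- Override the built-in rule so text nodes are not copied to the output -->
  <xsl:template match="text()" />
  <!-- Now pick out the elements you are interested in -->
</xsl:stylesheet>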
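To show what filling in that last placeholder comment might look like, here is a minimal sketch. The element name title is purely hypothetical (it is not from any real file of mine), and the sketch assumes the input elements are in no namespace, since an unprefixed name in an XSLT 1.0 match pattern only matches elements without a namespace.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
  <xsl:output method="text" />

  <!-- Override the built-in rule so stray text nodes are dropped -->
  <xsl:template match="text()" />

  <!-- Hypothetical element name: write each title's text on its own line -->
  <xsl:template match="title">
    <xsl:value-of select="." />
    <xsl:text>&#10;</xsl:text>
  </xsl:template>
</xsl:stylesheet>
```

The built-in rules for the root and for elements still walk the whole tree, so there is no need to spell out the path to each title. Note that xsl:value-of reads the string value directly, which is why the empty text() template does not suppress the title's own content.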
Bearing these two simple rules in mind could help to prevent a lot of head-scratching when dealing with XSLT that appears to be behaving against all logic.
Comments
Your blog for "Creating Japanese HTML Help (chm) using Author-It"
Hi,
We just had this problem, and your blog was the only source of information that provided a solution. It's been a long time since you posted the issue, and you probably already installed the upgrade with the fix. If you didn't upgrade, or if you are interested in seeing what AIT tech support said about it:
..., in 5.3.211 Author-it forced all Web formatted published output to use UTF-8 regardless of what was in the templates, this was a misguided attempt to fix a problem in all previous releases of Author-it 5 where ANSI was forced on all Web formatted publishing regardless of what was in the templates originally. It was thought that UTF-8 and Unicode would solve everyone's problems, but it turns out a number of people actually need the older ANSI format for some files. So this was further corrected in release 5.3.316 where Author-it was made to honour what is in the templates and in situations where no template is being used, default to UTF-8 and Unicode (because it's good for all languages and most outputs).
The current release is 5.3.530 and the fix added in 5.3.316 is still valid. More information on this can be found in the release notes for 5.3.316, and in the HTML Template Encoding topic in the Knowledge Centre.
Japanese HTML Help still "broken" in Author-It
Hi,
Thanks for your feedback.
The latest version of Author-It as of 18/5/10 is still broken as far as I am concerned. Yes, you can specify a code page in the template and Author-It will usually generate output in that format. However, the output is encoded to ANSI first, so even if you specify UTF-8 as the code page, you LOSE any special characters! They come out as question marks before you have a chance to encode them into HTML entities. You get the same problem with Shift_JIS: for example, a copyright symbol (©) will come out as a c.
In summary, any symbol that doesn't exist in your target ANSI code page will be forcibly translated into a character that is in that code page. That is true even if you select UTF-8 as the code page (presumably it then uses only the ASCII range, 0-127)!
There might be a solution where you publish to HTML instead of HTML Help (which seems to honour UTF-8 for all Unicode characters), then copy the files and use Rainbow (http://okapi.sourceforge.net/) to translate the topics to Shift_JIS while preserving the special characters. But I haven't tested that fully yet because I'm still hoping Author-It fixes the software.
I have a support call with Author-It but I seem to be the only person complaining about this issue so there's no word on if/when they will fix it. Can I really be the only person who wants to produce chm files containing symbols that don't appear in the target character set?