Encoding

1. Another Source of information on encoding
2. Encoding special characters
3. How to parse internal data which is in UTF-8 format.
4. Encode URL inside HTML Anchor Tag
5. Netscape 6 and XSLT
6. output encoding iso-8859-1
7. Special Characters in URLs
8. Encoding URL in XSLT
9. xml encoding
10. Retrieve the encoding?
11. Encoding problems
12. XML and HTML in output
13. Encoding terminology
14. Encoding
15. Encoding problems (reported as character entity problems)

1.

Another Source of information on encoding

Michael Beddow

The best source for all encoding-related issues is Mike Brown's explanation on his web page.

2.

Encoding special characters

Mike Brown

> As I was told &#153; is the best way to achieve the (TM) symbol;
> the majority of major browsers display it as TM.

Do not rely on what people tell you; consult the actual specifications at w3.org. In XML, XSLT, and HTML, &#153; is *by definition* ISO/IEC 10646-1:1993 character number 153, which is not a trademark symbol. HTML also happens to define &trade; as an entity reference.

Browsers are letting you get away with &#153; because old versions were lax about what numeric character references meant. It is for backward compatibility. So yes, you are right, it is the 'best' way in that it is the most reliable... for as long as browsers support this broken notation... but it is not 'best' as in 'correct in any way whatsoever'.

> If you use the unicode value, it does not work on some IEs 
> (but it did work on my linux :-)

Did you actually try it in the transformation? You will not get &#8482; in the HTML just because you put &#8482; in the stylesheet. The stylesheet is not a literal specification for output. Try it and see, with output method="html". I almost guarantee you will not see &#8482; in the output.

> I managed to make xalan render it as &#153; if I used encoding="us-ascii".

Then this is bug #48571249854175123484 in Xalan. There are half a dozen Xalan bugs posted here every week, it seems like.

> I do not wanna use encoding="windows-1252" because than it only works on
> windows-1252.

The output is going to be bits and bytes in *some* encoding.

In HTML you are allowed to represent the trademark symbol as one of these references *only*:

  &trade;
  &#8482;
  &#x2122;

(It is certain browsers that let you use &#153;)

Since the references themselves consist entirely of ASCII characters, and since almost all encodings are supersets of ASCII, the references can be used in all encodings.

Also, instead of using the reference, *if the encoding supports it*, you can use the directly encoded character. That is,

  if encoding is:   byte sequence for that character is:
  ===============   ====================================
  utf-8             0xE2 0x84 0xA2
  windows-1252      0x99
  iso-8859-1        n/a. must use reference. e.g.,
                    0x26 0x74 0x72 0x61 0x64 0x65 0x3B
                    &    t    r    a    d    e    ;
  us-ascii          n/a. must use reference.
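The table above can be checked with any language that exposes codecs; here is a Python sketch (the codec names are Python's aliases for these encodings):

```python
# U+2122 is the single trademark character.
tm = "\u2122"

# Directly encoded character, where the encoding supports it:
assert tm.encode("utf-8") == b"\xe2\x84\xa2"
assert tm.encode("windows-1252") == b"\x99"

# iso-8859-1 and us-ascii have no trademark character, so a plain
# encode fails; a serializer must fall back to a character reference.
try:
    tm.encode("iso-8859-1")
except UnicodeEncodeError:
    pass  # expected: the character is unmappable

# The "xmlcharrefreplace" error handler emits the all-ASCII
# numeric reference instead (8482 decimal = 0x2122):
assert tm.encode("iso-8859-1", "xmlcharrefreplace") == b"&#8482;"
```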

However in XSLT you have no way of demanding that the output be certain bytes or certain references. You must rely on the wisdom of the XSLT processor's output method to convert the character you want (and there is only one trademark character) to the right sequence of bits, according to the encoding you asked for in xsl:output.

> Characters are being rendered according to
>         a) input encoding
>         b) input form (escaped/non-escaped)

no.

the xml document is typically a bit sequence like 110101010101010111010101111110010101010101010111111...

  these represent ISO/IEC 10646-1:1993 (UCS) (~Unicode) characters like

  <?xml version="1.0" encoding="utf-8"?>
  <doc>
    <element attribute="cdata">character&#32;data</element>
  </doc>

this mapping of bits to UCS characters is the encoding (essentially). the encoding declaration in the XML declaration is only for helping to determine the encoding. once the document is decoded, it is irrelevant. it is at that point all UCS characters.

after decoding the document, the xml parser resolves character and certain entity references, turning them into UCS characters too. in the example above, &#32; becomes the space character.

the UCS characters at this level imply the logical structures: elements, attributes, character data. these structures are reported by the parser to the application (the XSLT processor).

so you see, you can say &#32; or &#x20; or refer to an entity that you defined as the space character, or put the encoded bits for the character into the binary document ... it doesn't matter; it all means the same thing, once it goes through the parser. the XSLT processor only knows about the single space character that was meant, not the 5 characters '&#32;'. those were just 'physical' markup.
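the point that references are mere 'physical' markup can be demonstrated with any conforming parser; for instance, Python's ElementTree (used here purely as a convenient XML parser):

```python
import xml.etree.ElementTree as ET

# Three physically different spellings of the same logical content:
docs = [
    "<doc>character data</doc>",       # literal space
    "<doc>character&#32;data</doc>",   # decimal character reference
    "<doc>character&#x20;data</doc>",  # hexadecimal character reference
]

# After parsing, the application sees identical character data:
texts = {ET.fromstring(d).text for d in docs}
assert texts == {"character data"}
```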

now consider that the stylesheet is itself an xml document that is parsed just like the source document. the xslt processor acts on the logical structures. the stylesheet is not a literal specification for output. it is only a representation of how to build the result tree. character references in the stylesheet are just an abstraction for the individual characters that will actually be manipulated by the processor.

the stylesheet's instructions result in the creation of a node tree -- the result tree. depending on what you put in the xsl:output element's 'method' and 'encoding' attributes, this tree will be serialized in different ways. the serialization for xml and html output methods will be as bits in the given encoding. the method might affect whether, say, UCS character 160 (non-breaking space) is output as the encoded bits for the single character number 160, or as the encoded bits for the character sequence '&nbsp;', or as the encoded bits for the character sequence '&#160;' or '&#xA0;'.

I wrote a lot about this at http://www.skew.org/xml/tutorial/ (An excellent reference btw - DaveP) because I was disappointed that XML books make very little effort to address these issues. Concepts like encoding and logical structures should come first. Syntax and code samples come last, and are almost inconsequential, once you understand the principles at work. Instead, everyone teaches these things backward, and you end up with situations like this, where your impression of the meaning of a character reference is shaped by the way HTML user agents behave(d).

I think you are under the impression that character references are related to the encoding of the document. They are not. They are by definition, in both HTML and XML, references to characters in one specific repertoire.

3.

How to parse internal data which is in UTF-8 format.

Richard Tobin


> The problem is that the parser reports XML parsing error
> for a character less than 0x20. Is this correct?

Yes.  The legal XML characters are TAB, CR, LF, 0x20-0xd7ff,
0xe000-0xfffd, and 0x10000-0x10ffff.

See production [2] in section 2.2 of the XML spec.

> UTF8 range is between 0-127 (?) so a character like 0x1C is a valid
> one (?).

The problem is not that they aren't legal in UTF-8 or Unicode, but
that they aren't legal in XML.

If you want to represent illegal characters in an XML document, you
will have to come up with some encoding of your own.  For example, you
could replace them with elements like this:

 <character code="27"/>

and perhaps use entities to refer to them:

 <!ENTITY esc '<character code="27"/>'>

or you could replace them with characters from the Unicode private
use area, which are legal in XML:

 <!ENTITY nul "&#xe000;">
 <!ENTITY esc "&#xe01b;">

In either case you would have to process the parser output to get the
real characters.
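A sketch of the private-use-area approach (the U+E000 offset matches the entities above; the helper names are arbitrary, not part of any standard):

```python
# Map the XML-illegal C0 control characters (everything below 0x20
# other than TAB, CR, LF) into the private use area, and back again.
LEGAL_CONTROLS = {0x09, 0x0A, 0x0D}

def hide_controls(s: str) -> str:
    return "".join(
        chr(0xE000 + ord(c))
        if ord(c) < 0x20 and ord(c) not in LEGAL_CONTROLS
        else c
        for c in s
    )

def restore_controls(s: str) -> str:
    return "".join(
        chr(ord(c) - 0xE000) if 0xE000 <= ord(c) < 0xE020 else c
        for c in s
    )

raw = "before\x1bafter"        # contains ESC, illegal in XML 1.0
safe = hide_controls(raw)      # "before\ue01bafter", legal in XML
assert restore_controls(safe) == raw
```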

It is certainly inconvenient that XML prohibits these characters.

            

4.

Encode URL inside HTML Anchor Tag

Mike Brown

> Is there a function/transformation available in XSL that will allow me
> to encode URLs in HTML files?
	

There is no built-in function for this purpose, no.

If your XSLT processor supports extension functions and is Java based, as I believe yours is, you can simply invoke the encode() method of java.net.URLEncoder, passing it the string to encode.

The example I have below is for James Clark's XT, and also works as-is with SAXON. I'm sure by just adjusting the 'url' namespace declaration it will work with Cocoon/Xalan equally well:

<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"    
   version="1.0"
   xmlns:url="http://www.jclark.com/xt/java/java.net.URLEncoder"
   exclude-result-prefixes="url">

  <xsl:output method="html" indent="yes"/>
  
  <xsl:template match="/">
    <xsl:variable name="x" select="'encode me #1 superstar?'"/>
    <xsl:if test="function-available('url:encode')">
      <a href="http://www.skew.org/printenv?foo={url:encode($x)}">
        <xsl:value-of select="$x"/> 
      </a>
    </xsl:if>
  </xsl:template>

</xsl:stylesheet>

Some XSLT processors (namely SAXON) will automatically URL-encode the values of href attributes. Even though you can turn this off in SAXON, there are reasons why this is undesirable in general and I don't recommend that other processors adopt this practice.

The only other alternative is to stick with pure XSLT and write a stylesheet that does the encoding via substring lookups and tail recursion. This is a rather daunting task considering there is no easy way to determine the UTF-8 sequence for a given character.

(The URL-encoding algorithm is, roughly, replace certain reserved characters with their UTF-8 sequences, expressed as '%xx' for each octet, where xx is the hexadecimal representation of the octet; with the option of using '+' instead of '%20' for spaces. As you seem to already understand, this translation applies only to certain parts of the URI while the URI is constructed, not afterward, which is one of the reasons why SAXON's behavior is not desirable.)
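The algorithm described in that parenthesis is, for comparison, what Python's urllib.parse implements:

```python
from urllib.parse import quote, quote_plus

# Reserved/unsafe ASCII characters become %xx; the rest pass through:
assert quote("encode me #1 superstar?", safe="") == \
    "encode%20me%20%231%20superstar%3F"

# The '+' option for spaces, as used in form data:
assert quote_plus("encode me #1 superstar?") == "encode+me+%231+superstar%3F"

# Non-ASCII characters become the %xx form of their UTF-8 octets:
assert quote("\u00e9") == "%C3%A9"   # small e with acute accent
```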

5.

Netscape 6 and XSLT

Mike Brown

> The XML I am transforming (on the server) looks
> something like this: "()&^&^%^&^)%^(%********(*)^(&" 
> when viewed in  Netscape 6.
>  
> Could it be that the only problem is my 'encoding' attribute value in the
> processing instruction?

I am guessing that you are actually producing UTF-16 encoded output, and Netscape is having a hard time with it. Are you using MSXML3 or higher to do the transformation?

Under certain circumstances, you need to set the output encoding in your code that calls the transformation, because MSXML won't honor what is in the xsl:output instruction in the stylesheet.

First make sure you have specified your desired encoding in the stylesheet:

  <xsl:output method="html" encoding="iso-8859-1" indent="no"/>

(substitute iso-8859-1 with whatever is appropriate for your content; utf-8 is another option)

If you have the ability to do so, make sure you are also setting the HTTP Content-Type header for the HTML that you are serving with a charset parameter like this:

  Content-Type: text/html; charset=iso-8859-1

If you don't have this much control over the server, use a meta tag in your HTML (i.e., have the stylesheet put it in there):

<html>
  <head>
    <title>foo</title>
    <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1"
/>
  </head>
  <body>
  ...
  </body>
</html>

With this Content-Type info, Netscape should be able to properly decode the document. Without it, Netscape is guessing, and you run the risk of getting garbage.

It would be helpful to figure out what encoding you are actually producing. If you set IE to auto-select the encoding, you can load up the page and then go back to View > Encoding and see what it says.

6.

output encoding iso-8859-1

Mike Brown

> <?xml version="1.0" encoding="utf-8"?>
> <?xml-stylesheet type="text/xsl" href="Untitled2.xsl"?>
> <start>
> á °
> </start>

The email was iso-8859-1 encoded. In other words, "á" (Latin small letter a with acute) is byte 0xE1 and "°" (degree sign) is byte 0xB0. I'm guessing that your original file is iso-8859-1 encoded, too.

Your XML is misdeclaring its encoding. It is an error to say it is utf-8 encoded when it is actually iso-8859-1. The bytes 0xE1 0x20 0xB0 work out to an invalid UTF-8 sequence and it shouldn't even be parseable XML, but apparently your parser doesn't care.

&#6192; = &#x1830; which is equivalent to the bytes 0xE1 0xA0 0xB0 in utf-8. I'd say your parser is being very liberal with its interpretation of the bytes.

> What character reference is the &#6192?  
> This is supposed to be ISO-8859-1
> isn't it?

The 7 characters "&" "#" "6" "1" "9" "2" ";" are encoded in the output as their 7 respective iso-8859-1 bytes, as per your xsl:output instruction, yes. What "&#6192;" means, however, in the context of an XML or HTML document, is the single character known as MONGOLIAN LETTER SA.

>  Then how come I can't seem to find the character code for 6192

Maybe because you weren't looking at The Unicode Standard at unicode.org, or the Letter Database at http://www.eki.ee/letter/, or at the standard that is referenced by both the XML and HTML specs: ISO/IEC 10646-1.

> And also, what happened to the 2 distinct characters from the
> source xml?

Your 3 characters (including the space in between them) became 3 bytes in the encoding supported by the editor that made the file. When read back in by an XML parser under the assumption that utf-8 was the character map used, and taking into account the fact that your parser is apparently very forgiving of the illegal byte sequence, the 3 bytes together imply 1 abstract character -- that Mongolian character that you probably won't find in any font. When this character is copied to the result tree in your XSL transformation, it retains its identity as a single character. When the result tree is serialized as iso-8859-1 bytes and the HTML syntax, it is impossible to represent this character as anything other than "&#6192;" or "&#x1830;".
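The byte arithmetic in that explanation can be verified directly; in Python, for example:

```python
# UTF-8 for U+1830 (MONGOLIAN LETTER SA, decimal 6192) is E1 A0 B0:
assert "\u1830".encode("utf-8") == b"\xe1\xa0\xb0"
assert ord("\u1830") == 6192

# The iso-8859-1 bytes for "á", space, "°", read strictly as UTF-8,
# are rejected (0x20 cannot be a continuation byte); a strict
# parser would refuse the document:
try:
    b"\xe1\x20\xb0".decode("utf-8")
except UnicodeDecodeError:
    pass  # expected
```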

Michael Beddow adds:

To sort out this minefield you need to take a good look at Mike Brown's explanation of encoding issues at Mike Brown's page.

Also, your mention of "copying and pasting from a Unicode font in Microsoft Word" suggests you may need to find out more about another chamber of horrors: the way that Win9X/ME on the one hand and NT4/W2K on the other internally represent Unicode characters and handle their transfer via the clipboard and COM in very different ways, even when running identical applications with identical fonts. Sorry, but I don't know a handy single site for that info (though it's splattered all over various bits of the MSDN site and CDs).

If you use standard text editors, you can't safely cut and paste between documents in different encodings without hitting the sort of problems that Mike Brown has explained. But if you use XML Spy (and maybe others, I don't know) AND run under NT or W2K, if you are editing an XML document with a declared encoding of utf-8 (or no declaration, so utf-8 is default) and you paste into it characters cut from a document in another encoding (possibly in another editor) then XMLSpy handles the clipboard in such a way that it transparently converts the encoding for you. This won't work under W9x or ME though.

The original questioner then adds:

Thanks very much for the detailed answer Mike, this is all starting to make sense.

There were a couple of points that make this stuff easier to understand. Forgive me if this is stating the obvious, but it took me a while to synthesize this... the info is scattered all over the place.

1) The "&#xxx;" notation in XML and HTML files are character references, which refer to the decimal value of the character in the Unicode character set. This is entirely different from the encoding scheme that the document declares. If the encoding scheme says ISO-8859-1 these character references still refer to Unicode character values.

2) The encoding scheme is supposed to declare the actual byte encoding of the doc. That's all.

3) It is non-trivial to manage content with extended characters across a number of different applications and operating systems... Clearly, in my case, strange stuff happened to the byte ordering during the "cut and paste" process, and, as well, I am not sure if the apps I was using to view the content were able to make sense of the UTF-8 multibyte characters anyway. Rather than assume this will work you really need to discuss each application individually.

My conclusion for now is that the safest way to do manage content with international characters is to use the character references as discussed in #1. Unfortunately, this won't result in a WYSIWYG editing system, but that's a small price to pay for increased portability of the content, across all kinds of editors and OS's.

Mike answers

Yes, it's safest in that by only using non-ASCII characters by reference, your document is at a low level 100% ASCII and thus cannot be misread, no matter which encoding it is assumed to be in (a few obscure ones notwithstanding).

One of the features of XML, though, is that there are relatively unambiguous rules for determining the encoding, and for declaring it, so that this shouldn't be necessary. Of course, your system is only as strong as its weakest link, so if you have some unknown factor like a web browser's dubious mechanisms for encoding and transmitting the text input in an HTML form, you cannot reliably process the data without making some dangerous assumptions.

7.

Special Characters in URLs

Mike Brown




> I have encoding set to iso-8859-1 for
>the xsl:stylesheet and the xsl:output elements, why would it replace it with
>the UTF-8 values? How should go about doing this, with characters not valid
>for the URL replaced with their HTTP/URL replacement (e.g. " " replaced with
>"+", "+" with "%2B" etc)?

The content of the document should be encoded in ISO 8859-1, yes. But a URI is interpreted by a URI resolver such as a Web server, not an XML parser, so the rules for encoding are different. The rules for URI encoding are described in RFC 2718 (§2.2.5) and in the W3C internationalization guidelines (er... somewhere).

IE 5, at least, interprets this correctly, as does (IIRC) Mozilla 5 and Netscape 6, and Opera 4.

> The URIs are interpreted by the Web Server/Web browser but I need
> them to be generated correctly by the XSLT processor -- to comply with the
> HTTP-standard (e.g. no white space in URLs). Is there a way to achieve
> this?

With respect to the encoding:

The encoding of the document as a whole has no bearing on the %-style escaping of characters in a URI. So for example if you have in your stylesheet

<xsl:output method="html" encoding="iso-8859-1"/>
   and
<a href="http://skew.org/printenv?greeting={greeting}">click</a>

and your XML has:

   <greeting>&#161;Hola!</greeting>

then your output should end up like:

<a href="http://skew.org/printenv?greeting=%C2%A1Hola!">click</a>

You may have thought that the last 6 characters of that URI reference would be bytes like:

    ¡  H  o  l  a  !
    A1 48 6F 6C 61 21  <-- iso-8859-1 bytes

because if you just did <xsl:value-of select="greeting"/> that is precisely what you would get.

The reason it changes when the XSL processor emits it in an href attribute is because of this clause in the XSLT spec: "The html output method should escape non-ASCII characters in URI attribute values using the method recommended in Section B.2.1 of the HTML 4.0 Recommendation". And that section says to use UTF-8 as the basis for the %-escaping of the URI. This means you likely get this in the output:

    %  C  2  %  A  1  H  o  l  a  !
    25 43 32 25 41 31 48 6F 6C 61 21  <-- iso-8859-1 bytes, still

See, you *did* get iso-8859-1 output like you asked for. The UTF-8-ness is actually at a higher level of abstraction.

Note that this escaping happens *only* for non-ASCII characters (U-00000080 and higher). So it does not affect those ASCII characters that are reserved or disallowed in a URI, like " ", among others.

Even if the XSLT processor failed to do the UTF-8 based escaping of non-ASCII characters, the HTML user agents are supposed to do it when interpreting the URI reference anyway.

Of course your problem is on the server end. Chances are, you are coding using an API that expects iso-8859-1 as the basis for the URL escaping, which is perfectly reasonable to do, especially in light of the fact that browsers tend to send URL-encoded form data with the URL-escaping being based on the actual encoding of the document containing the form (rather, the encoding that the browser is assuming the containing document is using; this is user-overridable).

If you make the containing document utf-8 instead of iso-8859-1, you can assume that all the escaping is UTF-8 based, and then you can convert the misinterpreted-as-iso-8859-1 strings you get from the form data API back to iso-8859-1 bytes and then read these bytes back into a string using utf-8 interpretation.
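That "convert back" step at the end is a two-line repair in most languages; a Python sketch (assuming the form API handed you a string it wrongly decoded as iso-8859-1):

```python
# A UTF-8 encoded "ñ" (bytes C3 B1) misread as iso-8859-1 shows up
# as the two characters "Ã±". Re-encode as iso-8859-1 to recover the
# original bytes, then re-decode them as UTF-8:
garbled = "\u00c3\u00b1"             # "Ã±"
fixed = garbled.encode("iso-8859-1").decode("utf-8")
assert fixed == "\u00f1"             # "ñ"
```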

Your other option is to avoid putting the raw non-ASCII characters in the URI refs in the first place. If you absolutely must have %A1 for inverted exclamation mark, then the only way to ensure this is to make your stylesheet put %A1 in the result tree. You can do this using an extension function (ideal) or with a clever recursive template.

Re: escaping of ASCII characters like " " (space), you must also control this in your stylesheet. If you want "+" or "%20" (the latter is preferable), then have your stylesheet explicitly put that in the result tree.

See also: skew.org. The W3C XML spec explains what a character reference is, and that the reference for what numbers mean what characters is ISO/IEC 10646, not ISO/IEC 8859-1. Since the coded character set defined by ISO/IEC 10646 is identical to that of Unicode, you can use Unicode as a reference. unicode.org should be helpful. eki.ee is very useful, too.

Mike later adds:

As long as you are sticking to the iso-8859-1 range, I would use the static method java.net.URLEncoder.encode() as an extension function. Something like this example that I posted to xsl-list last year:

<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"    
   version="1.0"
   xmlns:url="http://whatever/java/java.net.URLEncoder"
   exclude-result-prefixes="url">

  <xsl:output method="html" indent="yes"/>
  
  <xsl:template match="/">
    <xsl:variable name="x" select="'encode me #1 superstar?'"/>
    <xsl:if test="function-available('url:encode')">
      <a href="http://www.skew.org/printenv?foo={url:encode($x)}">
        <xsl:value-of select="$x"/> 
      </a>
    </xsl:if>
  </xsl:template>

</xsl:stylesheet>

Of course, you'll have to adapt this to your stylesheet and make the "x" variable pick up the string from your source tree instead of using 'encode me #1 superstar?'.

It should be noted that URLEncoder.encode() only uses the lower byte of each Java char, so in effect, it only works unambiguously for \u0000 through \u00FF. So if you have, for example, the bullet character &#x2022; in your XML, the Java char is \u2022 and you'll get back from the method the sequence %22, which is actually a double quote. As long as you only pass it Strings consisting of characters solely from the iso-8859-1 range, which subsets Unicode and supersets ASCII, you'll be fine.
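The lower-byte behaviour described there is easy to see as code point arithmetic:

```python
# Taking only the low byte of U+2022 (bullet) yields 0x22, which is
# the code point of the double quote character:
assert ord("\u2022") & 0xFF == 0x22
assert chr(ord("\u2022") & 0xFF) == '"'
# So an encoder that drops the high byte emits %22 -- a quote --
# rather than a correct escape for the bullet character.
```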

8.

Encoding URL in XSLT

Mike Brown


> Is there a quick way to encode URLs in XSL? 

> Ex: Convert http://myurl.com/document.html?param1=foo1&param2=foo2 to
> http%3A%2F%2Fmyurl.com%2Fdocument.html%3Fparam1%3Dfoo1%26param2%3Dfoo2 

To apply 'URL-encoding' to a string, so that you can safely embed that string in the path part of a URI, (you don't really want to apply it to an entire URI, unless that's the string being embedded), you can use either an extension function or an XSLT template that parses the string and encodes as necessary.

First, you should note that URL-encoding is only 100% safe for characters that fall in the ASCII range (32-126, or 0-127 if you want to count control characters). Outside of this range, it gets tricky, because URIs were not intended to encapsulate arbitrary binary data. However, if the string contains some non-ASCII characters, newer standards are recommending that the UTF-8 bytes for those characters should be the basis for the encoding. For example, a small e with acute accent would be %C3%A9. This gives you the freedom to have any of the million+ possible Unicode characters in your string.

Depending on what you're doing, though, you'll find that many applications that process URIs are in fact expecting that ISO-8859-1, not UTF-8, is the basis for the encoding. The e with acute accent would need to be encoded as %E9, for example. This limits you to a very small range of Unicode in your string, just the first 256 characters out of 1.1 million, but this might be enough for you, I don't know.
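Both conventions can be seen side by side in Python's urllib.parse.quote, which lets you choose the encoding that underlies the %-escaping:

```python
from urllib.parse import quote

# UTF-8 basis (the newer recommendation): e with acute accent
# becomes two escaped octets...
assert quote("\u00e9") == "%C3%A9"

# ...iso-8859-1 basis (what many older applications expect): one octet.
assert quote("\u00e9", encoding="iso-8859-1") == "%E9"
```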

If you want to make the iso-8859-1 assumption, or if you don't care either way because you're only dealing with ASCII strings, then that makes life easy. With that being the case...

If your XSLT processor is Java based, it probably has a namespace reserved for invoking static methods in arbitrary classes in your classpath, and this would work:

<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"    
   version="1.0"
   xmlns:url="http://whatever/java/java.net.URLEncoder"
   exclude-result-prefixes="url">

  <xsl:output method="html" indent="yes"/>

  <xsl:param name="str" select="'encode me #1 superstar?'"/>
  
  <xsl:template match="/">
    <xsl:if test="function-available('url:encode')">
      <a href="http://skew.org/printenv?foo={url:encode($str)}">
        <xsl:value-of select="$str"/> 
      </a>
    </xsl:if>
  </xsl:template>

</xsl:stylesheet>

If you want a more portable, pure XSLT 1.0 approach, here's some voodoo for you: skew.org

9.

xml encoding

Joel Konkle-Parker


 > <?xml version="1.0" encoding="Cp1252"?>

'Cp1252' is a Java-internal alias, not a registered charset name; the registered name is 'windows-1252'. The list of registered names is at iana.org.

10.

Retrieve the encoding?

Jeni Tennison



> Is it possible to retrieve the encoding of an XML-file with XSL??

No it isn't. The encoding is part of the physical structure of the file rather than the logical structure. XSLT only has access to the logical structure of an XML document.

11.

Encoding problems

J.Pietschmann



> I'm receiving an XML file with encoding="ISO-8859-15", when I transform the
> file I received javax.xml.transform.TransformerException. java.io
> UnsupportedEncodingException ISO8859_15

It's the parser's job to recognize the encoding and act accordingly. You can use any JAXP 1.0 conformant parser with Xalan (I suppose you won't run into the few dark corners of the spec). Most people using JDK 1.3.1 use Xerces as parser if they work with Xalan, I guess you do so too.

No XML parser is required to support encodings other than UTF-8 and UTF-16. So you can't just blame them for not supporting 8859-15, which is relatively new.

Your options:

- File an RFE with Xerces or whatever parser you use and wait until they implement support for 8859-15. If they just use the runtime library's facilities, they may tell you they won't do it, though.
- Get another JAXP 1.0 compliant parser which supports 8859-15.
- Use a re-encoding and stream-editing tool to adapt the XML encoding (including the encoding declaration) before you feed it to the parser.

12.

XML and HTML in output

Larry Mason




>How to ensure browser recognises the encoding?

The key is not so much the encoding on the xsl:output but rather to add a meta element to inform the browser of the encoding.

<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
version="1.0">
  <!-- change output method to be xml since result is both html and xml -->
  <xsl:output method="xml" encoding="utf-8" omit-xml-declaration="yes"/>
  <xsl:template match="/">
    <HTML>
     <HEAD>
       <META http-equiv="Content-Type" content="text/html; charset=UTF-8"
/>
     </HEAD>
      <BODY>

             <p><xsl:text>&#160;</xsl:text>grandparent</p>
      </BODY>
    </HTML>
  </xsl:template>
</xsl:stylesheet>

13.

Encoding terminology

Eliot Kimber



> What is wierd, is that a friend of mine tried to change the encoding 
> in his editor of one of my stylesheets. But it would not let him. The 
> editor said that the stylesheet had characters in it that violated the 
> encoding choice. My question comes from his experience with my 
> stylesheet.

Many, if not most, non-Unicode encodings do not have a 100% coverage of the Unicode characters. This means that a Unicode document may contain characters that do not exist in some target non-Unicode encoding. That was probably the case here.

However, the numeric character references should not have caused this.

That is, at the level of the encoding of the *file* the numeric character references are interpreted as the characters "&", "#", "x", etc., not the (abstract) character they represent in XML land. These characters are all within the base ASCII range and therefore should be in every encoding you might want to use.

It is only in the XML parser that the reference is converted from the sequence of characters "&", "#', "x", etc. to a single *Unicode* character.

It's important to remember that, regardless of how the bytes of the XML file are written to disk (the character encoding of the file), once parsed, all XML documents are, by definition, sequences of Unicode characters. Thus, even if I defined "Eliot's personal encoding" and put all my XML files in it and hacked my favorite parser to understand it, once parsed, the XML data provided to other applications by the XML parser would be Unicode characters.
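For instance, the same logical document serialized in two different encodings parses to identical Unicode character data (Python's ElementTree used as a convenient conforming parser):

```python
import xml.etree.ElementTree as ET

# "é" (U+00E9): one byte in iso-8859-1, two bytes in utf-8.
latin = b"<?xml version='1.0' encoding='iso-8859-1'?><doc>\xe9</doc>"
utf8 = b"<?xml version='1.0' encoding='utf-8'?><doc>\xc3\xa9</doc>"

# Once parsed, the file encoding is gone; both yield the same character:
assert ET.fromstring(latin).text == ET.fromstring(utf8).text == "\u00e9"
```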

14.

Encoding

David Carlisle

I'm transforming a number of xml documents into html. They're all exactly the same, except that their encoding can vary.

What i want to do be able to do is use the same stylesheet for each one, just telling the processor to output to encoding xxx conditionally, based on whatever the encoding of doc being transformed is.

Answer

In xslt2 you have more flexibility but in 1.0 you may need to use multiple stylesheets.

In XSLT 2.0 you can use <xsl:result-document encoding="{$param}"/>

 Right now I'm using numerous stylesheets, with only
 one element changed, <xsl:output encoding="xxx" ..../>
 which really seems like quite a kludge, and is a
 hassle to keep them synced. The style sheets all use
 the same encoding, Shift_JIS.

The stylesheets don't need to be copies that you need to keep in sync, they just need to be 2 line stylesheets that a) set the output encoding and b) xsl:import the stylesheet that does the work.
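Such a two-line wrapper might look like this (work.xsl is a placeholder for the stylesheet that does the real work):

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
   version="1.0">
  <xsl:import href="work.xsl"/>
  <xsl:output encoding="Shift_JIS"/>
</xsl:stylesheet>
```

One wrapper per target encoding, each importing the same worker stylesheet, keeps everything in sync automatically.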

Alternatively to using these wrapper stylesheets, many XSLT 1.0 engines allow the encoding to be specified in the API that's calling the transform, effectively overriding the xsl:output attributes. Details depend on the XSL engine and API, of course.

15.

Encoding problems (reported as character entity problems)

Tony Graham



> Latest effort: I tried using encoding="utf-8" for all levels: my
> original xml, my xsl output, and the input to ZSL's index, & I also
> saved my xml file as utf-8 format, and used the Spanish n inside my
> xml, i.e. ñ rather than &#241;. Doing that, the Spanish n was
> preserved through the xsl output, but ZSL stores it as: Ã±, & that's
> also how my browser displays it.

That is UTF-8, but your browser thinks it's ISO-8859-1.

The generalisation is that if a character from the Latin-1 Supplement block comes out as two characters where the first character is an accented "A", then you are probably reading UTF-8 as ISO-8859-1.

If you go to Richard Ishida's excellent Unicode Code Converter [1] and enter 241 in the "Decimal code points" box, you'll see that it's "C3 B1" in UTF-8.

If you then go to Richard Ishida's excellent UniView [2], you can suss out that "C3 B1" as two ISO-8859-1 characters would be "Ã±".
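The same conversion can be reproduced programmatically; in Python, for instance:

```python
# ñ (U+00F1, decimal 241) encoded as UTF-8 is the two bytes C3 B1;
# read back as iso-8859-1, those bytes display as "Ã±":
assert "\u00f1".encode("utf-8") == b"\xc3\xb1"
assert b"\xc3\xb1".decode("iso-8859-1") == "\u00c3\u00b1"  # "Ã±"
```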

[1] http://rishida.net/scripts/uniview/conversion

[2] http://rishida.net/scripts/uniview/uniview.php?codepoints=F1