Invalid Characters and Escape Rules

This article describes how the FOR XML clause handles invalid XML characters, and lists
the escape rules for characters that are invalid in XML names.

SQL Server entitizes invalid XML characters when they’re returned by FOR XML queries that
don’t use the TYPE directive.

Although XML 1.0 conformant parsers raise parse errors whether or not these characters
are entitized, the entitized form is better aligned with XML 1.1 and potentially with
future versions of the XML standard. It also makes debugging simpler, because the code
point of the invalid character becomes visible.

For users of XML tools, no workaround is required, because the XML parser fails either way
at the point where the invalid characters occur in the data stream. If you use non-XML tools,
this behavior can require you to update your programming logic to search for these characters
in their entitized form.
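
For non-XML tooling, the search described above might look like the following minimal Python sketch. It assumes the entitized form is a hexadecimal numeric character reference such as `&#x1F;`. The function names `is_valid_xml10_char` and `find_invalid_chars` are hypothetical helpers introduced here for illustration and are not part of any SQL Server API.

```python
import re
from typing import List, Tuple

# Hexadecimal numeric character references, for example &#x1F; or &#x01;.
# Assumption: invalid characters are entitized in this form.
_HEX_REF = re.compile(r"&#x([0-9A-Fa-f]+);")


def is_valid_xml10_char(cp: int) -> bool:
    """Return True if the code point matches the XML 1.0 Char production."""
    return (
        cp in (0x09, 0x0A, 0x0D)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def find_invalid_chars(text: str) -> List[Tuple[int, int]]:
    """Return (offset, code point) pairs for invalid characters,
    whether they appear raw or as entitized values."""
    hits = []
    # Entitized occurrences: decode the reference and test the code point.
    for match in _HEX_REF.finditer(text):
        cp = int(match.group(1), 16)
        if not is_valid_xml10_char(cp):
            hits.append((match.start(), cp))
    # Raw occurrences: test each character directly.
    for offset, ch in enumerate(text):
        if not is_valid_xml10_char(ord(ch)):
            hits.append((offset, ord(ch)))
    return sorted(hits)
```

Checking both forms keeps the same logic usable on output that was produced with or without entitization.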

The following white space characters are entitized differently in FOR XML queries to preserve
their presence through round-tripping:

- In element content and attributes: hex(0D) (carriage return)
- In attribute content: hex(09) (tab), hex(0A) (line feed)

These characters are preserved in the output, and a parser won’t normalize them.
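
The effect of parser normalization can be illustrated with a short Python sketch that uses the standard library's `xml.etree.ElementTree` parser. The sample `row` element and its `note` attribute are made up for this illustration. The sketch assumes the white space characters are written as the character references `&#x09;`, `&#x0A;`, and `&#x0D;`, and compares them with the same characters written literally.

```python
import xml.etree.ElementTree as ET

# Literal tab and line feed in an attribute, literal CR LF in element content.
literal = ET.fromstring('<row note="a\tb\nc">x\r\ny</row>')

# The same characters written as numeric character references.
entitized = ET.fromstring('<row note="a&#x09;b&#x0A;c">x&#x0D;\ny</row>')

# Attribute-value normalization replaces literal white space with spaces,
# and line-end handling collapses a literal CR LF pair in content.
print(repr(literal.get("note")), repr(literal.text))

# Character references are not subject to that normalization,
# so the original characters come back from the parser.
print(repr(entitized.get("note")), repr(entitized.text))
```

Comparing the two printed lines shows which characters the parser altered in the literal form and which it kept in the entitized form.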

SQL Server names that contain characters that are invalid in XML names, such as spaces, are
translated into XML names by replacing each invalid character with an escaped numeric
entity encoding.
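
As a rough illustration of this translation, the Python sketch below escapes each invalid character as `_xHHHH_`, where `HHHH` is the character's four-digit hexadecimal code point. Several things here are assumptions made for brevity: the helper name `escape_xml_name` is hypothetical, the set of valid name characters is approximated with ASCII only, and code points beyond the Basic Multilingual Plane and underscores that precede an `x` are not given any special treatment.

```python
import re

# ASCII-only approximation of XML name characters (an assumption;
# the XML 1.0 specification allows many more Unicode characters).
_NAME_START = re.compile(r"[A-Za-z_:]")
_NAME_CHAR = re.compile(r"[A-Za-z0-9_:.\-]")


def escape_xml_name(name: str) -> str:
    """Escape characters that are invalid in an XML name as _xHHHH_."""
    parts = []
    for index, ch in enumerate(name):
        # The first character has a stricter rule than the rest.
        pattern = _NAME_START if index == 0 else _NAME_CHAR
        if pattern.fullmatch(ch):
            parts.append(ch)
        else:
            parts.append("_x{:04X}_".format(ord(ch)))
    return "".join(parts)


print(escape_xml_name("Order Details"))  # Order_x0020_Details
print(escape_xml_name("2ndColumn"))      # _x0032_ndColumn
```

The space (code point 0020) and the leading digit (code point 0032) are both replaced, while the remaining characters pass through unchanged.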

There are only two non-alphabetic characters that can begin an XML name: the colon (:)
and the underscore (_). Because the colon is already reserved for namespaces, the
underscore is chosen as the escape character. Following are the escape rules that are used for
encoding:
