Standards

  • ZWJ&XML


    Unicode Technical Report #20 (Unicode in XML and other Markup Languages) specifies that zero-width joiners/non-joiners (ZWJ and ZWNJ) are suitable for use within markup. But when an XML file whose tags are written in Malayalam using ZWJs (in Malayalam, ZWJ is used to form certain characters) is parsed, an error is reported saying that the tag contains an invalid character. Can anyone tell me what is wrong?
  • No.1 | | 598 bytes | |

    Jose wrote:
    Unicode Technical Report #20 (Unicode in XML and other Markup Languages)
    specifies that
    Zero-width Joiners/ nonjoiners (ZWJ and ZWNJ) are suitable for use within
    the markup. But when an xml file with the tags written in Malayalam
    using ZWJs (In Malayalam ZWJ is used to form certain characters) an
    error is reported that the tag contained an invalid character.
    Can anyone tell me what will be wrong?

    Can you include or link to a short example? Note that ZWJ is legal in
    the content of an XML document, but not in the names of elements.
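
    The distinction between content and names can be tried out with a small Python sketch. It assumes the standard-library `xml.etree.ElementTree` parser; the sample strings and the helper `parses` are made up for illustration and are not taken from the original question.

```python
import xml.etree.ElementTree as ET

ZWJ = "\u200d"  # ZERO WIDTH JOINER

# Malayalam NA + VIRAMA + ZWJ: a sequence of the kind the question describes.
sample = "\u0d28\u0d4d" + ZWJ


def parses(xml_text):
    """Return True if the parser accepts xml_text as well-formed."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# ZWJ in character data: accepted, and preserved in the parsed text.
in_content = "<word>" + sample + "</word>"
root = ET.fromstring(in_content)
print(ZWJ in root.text)

# ZWJ in an element name: the outcome depends on which edition of the XML
# name rules the underlying parser implements, so it is only printed here.
in_name = "<" + sample + "/>"
print(parses(in_name))
```

    The second result is deliberately left unstated: a parser that follows the older XML 1.0 name tables should reject the name, while one that follows the XML 1.1 name rules should accept it.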
  • No.2 | | 2030 bytes | |

    Wed, 13 Sep 2006, Jose wrote:

    Unicode Technical Report #20 (Unicode in XML and other Markup
    Languages) specifies that
    Zero-width Joiners/ nonjoiners (ZWJ and ZWNJ) are suitable for use within
    the markup.

    Yes, for affecting ligature and joining behavior. I mention this because
    there is a popular word processor that uses ZWJ and ZWNJ quite
    inappropriately for line break control.

    Of course, the statement is of a general nature: those characters are in
    principle suitable for use in marked-up text. It does not guarantee or
    prescribe that a particular markup system allows them or that they will be
    interpreted by their Unicode semantics.

    But when an xml file with the tags written in Malayalam
    using ZWJs (In Malayalam ZWJ is used to form certain characters) an
    error is reported that the tag contained an invalid character.

    Reported by which program? I first suspected that you may have tried to
    enter these characters but they do not appear correctly in the declared or
    implied character encoding.

    But reading again, I notice that you are referring to _tags_ and might
    actually mean the use of characters in element or attribute names, as
    opposed to their use in content between tags. UTR #20 discusses the
    latter, i.e. what you can use in document content proper - together with
    markup, not _inside_ markup (tags).

    The use of characters in element and attribute names is governed by the
    rules of each markup language, basically by its _identifier_ syntax.
    Generally, and in XML 1.0, control characters are excluded in that syntax,
    and ZWJ and ZWNJ are format control characters by definition (General
    Category: Cf). Thus, an attempt to use them in element names would violate
    well-formedness constraints, and an XML parser would report an error - not
    about an invalid character per se but about a syntax error.
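
    The General Category mentioned above can be looked up directly. A minimal sketch, assuming Python's standard `unicodedata` module; the helper name `describe` is hypothetical.

```python
import unicodedata


def describe(code_point):
    """Return 'U+XXXX NAME: General_Category' for one code point."""
    ch = chr(code_point)
    return f"U+{code_point:04X} {unicodedata.name(ch)}: {unicodedata.category(ch)}"


# ZWNJ and ZWJ are both reported with General Category Cf (Format).
for cp in (0x200C, 0x200D):
    print(describe(cp))
```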

    In XML 1.1, ZWJ and ZWNJ are allowed in identifiers, but this is probably
    of little practical value.
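
    For reference, the XML 1.1 name rules can be written down as a short checker. This is only a sketch: the ranges are transcribed from the NameStartChar and NameChar productions of the XML 1.1 Recommendation, and the function names are made up for illustration.

```python
# NameStartChar ranges from the XML 1.1 Recommendation.
NAME_START_RANGES = [
    (0x3A, 0x3A), (0x41, 0x5A), (0x5F, 0x5F), (0x61, 0x7A),
    (0xC0, 0xD6), (0xD8, 0xF6), (0xF8, 0x2FF), (0x370, 0x37D),
    (0x37F, 0x1FFF), (0x200C, 0x200D), (0x2070, 0x218F),
    (0x2C00, 0x2FEF), (0x3001, 0xD7FF), (0xF900, 0xFDCF),
    (0xFDF0, 0xFFFD), (0x10000, 0xEFFFF),
]

# Extra characters allowed after the first one: "-", ".", digits,
# U+00B7, combining marks U+0300..U+036F, and U+203F..U+2040.
NAME_EXTRA_RANGES = [
    (0x2D, 0x2E), (0x30, 0x39), (0xB7, 0xB7),
    (0x300, 0x36F), (0x203F, 0x2040),
]


def _in_ranges(ch, ranges):
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in ranges)


def is_name_start_char(ch):
    return _in_ranges(ch, NAME_START_RANGES)


def is_name_char(ch):
    return is_name_start_char(ch) or _in_ranges(ch, NAME_EXTRA_RANGES)


def is_xml11_name(name):
    """Return True if name matches the XML 1.1 Name production."""
    if not name:
        return False
    return is_name_start_char(name[0]) and all(is_name_char(c) for c in name[1:])


# Malayalam NA + VIRAMA + ZWJ is accepted, because U+200C..U+200D
# is listed among the name-start ranges.
print(is_xml11_name("\u0d28\u0d4d\u200d"))
```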
  • No.3 | | 490 bytes | |

    Jukka K. Korpela scripsit:

    In XML 1.1, ZWJ and ZWNJ are allowed in identifiers, but this is
    probably of little practical value.

    It has the merit that it allows identifiers to be spelled correctly
    in the various languages that *require* ZWJ or ZWNJ or both; Persian
    and several Indic languages come to mind.

    If you meant simply that XML 1.1 is not widely adopted, and it is
    therefore of little practical value to write documents in it, I
    must sadly agree.
  • No.4 | | 1172 bytes | |

    As I recall, the problem with XML 1.1 adoption was that XML 1.1 was
    not fully backwards compatible with XML 1.0: there were XML 1.0
    documents that were not valid XML 1.1. That being the case, people
    just didn't see it worthwhile to have two incompatible parsers.

    As for ZWJ/NJ - the original intent was for these to not make any
    semantic difference. There is a UTC action to collect cases where they
    are being used to make a clear semantic difference (eg XXX means "sea
    gull" and XX<ZWNJ>X means "republican"), so any feedback on such cases
    would be useful.

    Mark

    On 9/13/06, John Cowan <cowan (AT) ccil (DOT) org> wrote:
    --
    Jukka K. Korpela scripsit:

    In XML 1.1, ZWJ and ZWNJ are allowed in identifiers, but this is
    probably of little practical value.

    It has the merit that it allows identifiers to be spelled correctly
    in the various languages that *require* ZWJ or ZWNJ or both; Persian
    and several Indic languages come to mind.

    If you meant simply that XML 1.1 is not widely adopted, and it is
    therefore of little practical value to write documents in it, I
    must sadly agree.
  • No.5 | | 1391 bytes | |

    Mark Davis scripsit:

    As I recall, the problem with XML 1.1 adoption was that XML 1.1 was
    not fully backwards compatible with XML 1.0: there were XML 1.0
    documents that were not valid XML 1.1.

    In the sense that "XML 1.0" names a countably infinite set of abstract
    objects, true; in the sense that "XML 1.0" names a set
    of texts physically fixed in a tangible medium, I venture to doubt it.
    Specifically, I doubt that any Real World XML 1.0 documents contained
    any instances of U+007F through U+009F not as character references.
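
    Whether a given document is affected can be checked mechanically. The sketch below assumes the document has already been decoded to a Python string; the name `literal_c1_controls` is hypothetical. It looks for literal characters in U+007F through U+009F, skipping U+0085, which XML 1.1 permits literally.

```python
import re

# Literal characters in U+007F..U+009F that XML 1.1 accepts only as
# character references. U+0085 (NEL) is left out of the class on purpose.
RESTRICTED_C1 = re.compile("[\u007f-\u0084\u0086-\u009f]")


def literal_c1_controls(text):
    """Return (offset, 'U+XXXX') for each literal restricted character."""
    return [
        (m.start(), f"U+{ord(m.group()):04X}")
        for m in RESTRICTED_C1.finditer(text)
    ]


# A character reference is plain ASCII in the source, so it is not flagged.
print(literal_c1_controls("<a>&#x80;</a>"))
# A literal U+0080 is flagged together with its offset.
print(literal_c1_controls("<a>\u0080</a>"))
```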

    In exactly the same sense, Unicode 2.0 was not backward compatible with
    Unicode 1.1, a fact which does not seem to have seriously impeded its
    adoption.

    The issues with XML 1.1 were in fact political; I say no more.

    As for ZWJ/NJ - the original intent was for these to not make any
    semantic difference. There is a UTC action to collect cases where they
    are being used to make a clear semantic difference (eg XXX means "sea
    gull" and XX<ZWNJ>X means "republican"), so any feedback on such cases
    would be useful.

    IIRC the leading case is the plural ending in Persian. It's not just
    a matter of a clear semantic difference: there is no semantic difference
    between "they're" and "theyre" in English, but the latter is unambiguously
    wrong in the standard orthography.
