Validating XML with nsgmls: doesn't catch illegal characters
Arvin Portlock
2004-08-15 18:31:57 UTC
I'm using nsgmls v. 1.3.4 to process both SGML and XML files
which is why I'm not using an XML parser. I notice XML validators
catch certain illegal characters while nsgmls does not. For
example both RXP and Xerces object to 0xc0 but nsgmls does not,
even with -wxml:

RXP:
Error: Input error: Illegal UTF-8 start byte <0xc0> at file
offset 2618 in unnamed entity at line 44 char 8 of
file:///C:/PROGRA~1/rxp/badchar.xml

Is there a way to configure nsgmls, say by modifying the
SGML declaration, so that it will catch these illegal character
errors? My sample file is here:

http://www.geocities.com/_ratty/badchar.xml

(Geocities may munge this up so it may not be useable).

Arvin
Peter Flynn
2004-08-16 22:46:48 UTC
Post by Arvin Portlock
> I'm using nsgmls v. 1.3.4 to process both SGML and XML files
> which is why I'm not using an XML parser. I notice XML validators
> catch certain illegal characters while nsgmls does not. For
> example both RXP and Xerces object to 0xc0 but nsgmls does not,
> Error: Input error: Illegal UTF-8 start byte <0xc0> at file
> offset 2618 in unnamed entity at line 44 char 8 of
> file:///C:/PROGRA~1/rxp/badchar.xml
> Is there a way to configure nsgmls, say by modifying the
> SGML declaration, so that it will catch these illegal character

If I download that file, byte offset 2618 is not an error. That
byte is the "n" of [Kevin C.] Knox. Have you changed that file
since you posted your message?
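One way to check what actually sits at a reported offset is to scan the raw bytes directly. The following is a minimal sketch, not what RXP does internally: the helper name `first_invalid_utf8` and the sample bytes are made up for illustration, and the column is counted as zero-based bytes from the start of the line, which may not match RXP's "char" convention.

```python
def first_invalid_utf8(data: bytes):
    """Return (offset, line, column, byte) of the first byte that cannot be
    decoded as UTF-8, or None if the whole input decodes cleanly.

    offset and column are zero-based byte counts; line is one-based.
    """
    try:
        data.decode("utf-8")
        return None
    except UnicodeDecodeError as exc:
        offset = exc.start
        # Count the newlines before the bad byte to get its line number.
        line = data.count(b"\n", 0, offset) + 1
        # Bytes between the start of that line and the bad byte.
        column = offset - (data.rfind(b"\n", 0, offset) + 1)
        return offset, line, column, data[offset]


# Hypothetical sample: a lone 0xC0 byte right after <head> on line 2.
sample = b"<a>\n<head>\xc0x</head>\n</a>\n"
result = first_invalid_utf8(sample)
print(result)
```

Running the same function over the bytes of the downloaded file (for example `open("badchar.xml", "rb").read()`) would show whether the copy on the server still contains the byte the parser complained about.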

There *is* some obvious garbage in the document:

<head>ÀÁÂÃÄÅContact Information</head>

(if this gets mangled, that is <head> followed by six capital As,
each with a diacritic: in order, grave, acute, circumflex, tilde,
umlaut, and ring; followed by the word Contact).

But that is not an XML error.

What that error message usually means is that you have got a wrong
encoding declaration for the set of characters you have used, and
the way they have been stored. Fix that and you've fixed the problem.
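The effect of the encoding declaration can be sketched with Python's `xml.etree.ElementTree`, which is built on the expat parser. This assumes the six accented capitals are stored as the single bytes 0xC0 to 0xC5, as they would be in ISO-8859-1; the thread does not confirm how the original file was saved, and the helper name `try_parse` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Six bytes 0xC0..0xC5: in ISO-8859-1 these are capital A with grave, acute,
# circumflex, tilde, umlaut and ring. None of them is a valid UTF-8 sequence
# on its own.
raw = bytes(range(0xC0, 0xC6)) + b"Contact Information"


def try_parse(declaration: bytes) -> str:
    """Parse a one-element document with the given XML declaration (may be empty)."""
    doc = declaration + b"<head>" + raw + b"</head>"
    try:
        return ET.fromstring(doc).text
    except ET.ParseError as exc:
        return f"rejected: {exc}"


# No declaration: XML defaults to UTF-8, so the parser rejects byte 0xC0.
no_decl = try_parse(b"")
# Declared as ISO-8859-1: the same bytes are six legal characters.
latin1 = try_parse(b'<?xml version="1.0" encoding="ISO-8859-1"?>')

print(no_decl)
print(latin1)
```

The bytes are identical in both runs; only the declared encoding changes whether the parser treats them as well-formed text.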

///Peter
--
"The cat in the box is both a wave and a particle"
-- Terry Pratchett, introducing quantum physics in _The Authentic Cat_
Arvin Portlock
2004-08-18 15:15:16 UTC
Post by Peter Flynn
> Post by Arvin Portlock
> > I'm using nsgmls v. 1.3.4 to process both SGML and XML files
> > which is why I'm not using an XML parser. I notice XML validators
> > catch certain illegal characters while nsgmls does not. For
> > example both RXP and Xerces object to 0xc0 but nsgmls does not,
> > Error: Input error: Illegal UTF-8 start byte <0xc0> at file
> > offset 2618 in unnamed entity at line 44 char 8 of
> > file:///C:/PROGRA~1/rxp/badchar.xml
> > Is there a way to configure nsgmls, say by modifying the
> > SGML declaration, so that it will catch these illegal character
> If I download that file, byte offset 2618 is not an error. That
> byte is the "n" of [Kevin C.] Knox. Have you changed that file
> since you posted your message?
> <head>ÀÁÂÃÄÅContact Information</head>
> (if this gets mangled, that is <head> followed by six capital As,
> each with a diacritic: in order, grave, acute, circumflex, tilde,
> umlaut, and ring; followed by the word Contact).
> But that is not an XML error.
> What that error message usually means is that you have got a wrong
> encoding declaration for the set of characters you have used, and
> the way they have been stored. Fix that and you've fixed the problem.
> ///Peter

The difference is probably some conversion problem related to FTPing
an MSDOS file to a Unix file system. I probably should have gzip'd it
first.
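A text-mode transfer that rewrites CRLF line endings as LF is one way byte offsets can shift between two copies of the same file. The sketch below only illustrates that arithmetic; whether this is what happened to badchar.xml is not known, and the function name and sample data are made up.

```python
def offset_after_crlf_to_lf(data: bytes, offset: int) -> int:
    """Map a byte offset in CRLF data to the matching offset after every
    CRLF pair has been rewritten as a single LF.

    Each CRLF before the offset removes one byte, so the offset moves back
    by the number of CRLF pairs that precede it.
    """
    return offset - data.count(b"\r\n", 0, offset)


# Hypothetical DOS-style sample with the byte of interest on line 3.
dos = b"line1\r\nline2\r\nX"
unix = dos.replace(b"\r\n", b"\n")

dos_offset = dos.index(b"X")
unix_offset = offset_after_crlf_to_lf(dos, dos_offset)
print(dos_offset, unix_offset)
```

By the same arithmetic, a byte on line 44 of a CRLF file would sit 43 bytes earlier in an LF-only copy, so an offset reported against one copy need not point at the same character in the other.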

Anyway, you are saying that the error with the document is actually
an incorrect encoding declaration? I wonder what kind of encoding
declaration could possibly make it valid. Both RXP and Xerces catch
this error but nsgmls does not. It sounds like I simply can't use
nsgmls to validate XML documents after all.

Thanks for taking the time to look at this.

Best regards,

Arvin
Peter Flynn
2004-09-07 22:30:18 UTC
Arvin Portlock wrote:
[...]
Post by Arvin Portlock
> The difference is probably some conversion problem related to FTPing
> an MSDOS file to a Unix file system. I probably should have gzip'd it
> first.

Can you email it to me? (Edit my address before sending.)

Post by Arvin Portlock
> Anyway, you are saying that the error with the document is actually
> an incorrect encoding declaration?

That's what it looked like.

Post by Arvin Portlock
> I wonder what kind of encoding
> declaration could possibly make it valid. Both RXP and Xerces catch
> this error but nsgmls does not.

I didn't see any error in the file.

Post by Arvin Portlock
> It sounds like I simply can't use
> nsgmls to validate XML documents after all.

On the contrary, I'd always believe nsgmls before rxp or Xerces. James
simply doesn't produce bad software.

///Peter
--
"The cat in the box is both a wave and a particle"
-- Terry Pratchett, introducing quantum physics in _The Authentic Cat_