Implicit "stickyness" of tags, DTD problem

Peter Flynn

2006-08-13 21:03:27 UTC

Post by Tin Gherdanarra
Dear c.t.s.,
I have inherited a large SGML repository with
bogus tagging, i.e. tagging that does not fit
the DTD. It would be easy to fix the wrong
tagging in the SGML, but the management is a
little nervous about changing the data.

They should be even more nervous about running a
broken application.

Post by Tin Gherdanarra
They want
me to fix the DTD. I'm sceptical whether this is
possible at all. The problem is the declaration
and usage of two tags, <OL> and <P>.
These tags have basically the same semantics
as in HTML, which misled some people into
thinking that they can use these tags in
their SGML just as in HTML.
The declaration for these tags looks like
this.
<!ELEMENT p - o (%ptext;)+ -->
A <P> does not require a </P>.
<!ELEMENT ol - - (li)+ -->
An <OL> does not require a </OL>.

Yes it does. That's what the second minus sign means.
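
To make the reading of those two characters concrete, here is a minimal Python sketch that pulls the omitted-tag minimization indicators out of an element declaration. The function name and the regular expression are illustrative assumptions, not part of any SGML toolkit, and the sketch only handles a declaration for a single element name.

```python
import re

def tag_minimization(decl):
    """Return (name, start_tag_omissible, end_tag_omissible) for a simple
    <!ELEMENT ...> declaration. '-' means the tag is required, 'O' or 'o'
    means it may be omitted (given OMITTAG YES in the SGML declaration)."""
    m = re.match(r"<!ELEMENT\s+(\S+)\s+([-oO])\s+([-oO])\s", decl, re.IGNORECASE)
    if m is None:
        raise ValueError("no minimization indicators found")
    name, start, end = m.groups()
    return name.lower(), start in "oO", end in "oO"

print(tag_minimization("<!ELEMENT p - o (%ptext;)+ -->"))
print(tag_minimization("<!ELEMENT ol - - (li)+ -->"))
```

For the two declarations quoted above, the end-tag comes out as omissible for p and required for ol.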

Post by Tin Gherdanarra
This means that
<OL>
<LI>
textextext
<LI>
textextext
<LI>
textextext
is perfectly legal,

Nope. It's invalid.

Post by Tin Gherdanarra
and so is
<P>
whatever
<P>
whatever

Yes, that's OK.

Post by Tin Gherdanarra
Either tag can occur within a superordinate
tag, let's call it <FOO>, the declaration
<!ELEMENT foo - - (ol|sl)+ +(h|p) -->
Consequently, we can put a <P> or an <OL>
into <FOO>. However, we can't put an <OL>

OK.

Post by Tin Gherdanarra
<FOO>
<P>
<OL>
<LI>
textext
My first thought was to include
<OL>
in the declaration for
<!ELEMENT p...
but this made the parser complain about an
"sx:... content model is ambiguous: when no tokens have been
matched, both the 1st and 2nd occurrences of "OL" are
possible

Because OL is already somewhere in the expansion of ptext.
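
A rough way to check for this kind of clash before running the parser is to expand the parameter entity by hand and look for repeated names. The Python sketch below does that for a flat or-group only. The expansion of ptext used here is invented purely for illustration (the real one has not been posted), and the function names are hypothetical.

```python
import re
from collections import Counter

def expand(model, entities):
    """Replace %name; references with their replacement text until none remain."""
    pattern = re.compile(r"%([A-Za-z][\w.-]*);")
    while pattern.search(model):
        model = pattern.sub(lambda m: entities[m.group(1)], model)
    return model

def repeated_tokens(or_group, entities):
    """List names that occur more than once in a flat (a | b | c) group."""
    flat = expand(or_group, entities)
    names = [t.strip().lower() for t in flat.strip("()+* ").split("|")]
    return sorted(n for n, count in Counter(names).items() if count > 1)

# Hypothetical expansion, NOT the real ptext from the DTD in question.
ENTITIES = {"ptext": "#PCDATA | ol | em"}

print(repeated_tokens("(%ptext; | ol)+", ENTITIES))
```

If OL turns up in the result, adding it to the content model of P a second time is what triggers the ambiguity message.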

Post by Tin Gherdanarra
I understand that omission tags lead to problems when
<FOO>
<P>
<OL>
<LI>
textextext
<LI>
textextext
<LI>
textextext

We need to see the expansion of ptext to know this.

Post by Tin Gherdanarra
<OL>
<LI>
textextext
<LI>
textextext

</FOO>
I don't know much about DTDs, but this looks
like a "shift/reduce-conflict" from classical
compiler construction to me. The parser does
"Close the <P> immediately after the first
<OL> because the second <OL> belongs at the
same level as the first (i.e. to <FOO>)"

Possibly.

Post by Tin Gherdanarra
"Close the <P> after the second <OL>
because the second <OL> belongs at the
same level as the first <OL>, and is thus
under <P>, not under <FOO>"

Ditto.

Post by Tin Gherdanarra
This insight moves the problem from allowing
<OL>s in <P>s to allowing <OL>s in <P>s AND
DECLARING: "whenever you find an <OL> as a sibling
in an unclosed <P>, it belongs to the <P>, not
the superordinate <OL>."

That's what the parser will assume.

You should first test all this by running onsgmlnorm which will
normalize the instance and insert all the missing end-tags where
the parser deduces they belong. That will show you what the
current position is.
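
As a toy illustration of why the parser behaves this way, the Python sketch below infers end-tags with one simplified rule: an open element stays open for as long as the next token is allowed in its content. It only checks set membership, not a real content model, and it assumes the end-tags of P, OL and LI are all omissible, which is not what the posted DTD says for OL. The tables and the function name are hypothetical.

```python
def normalize(tokens, allowed, omissible_end):
    """Insert implied end-tags into a flat list of start-tags, end-tags and text."""
    stack, out = [], []

    def close_top():
        name = stack.pop()
        if name not in omissible_end:
            raise ValueError(f"end-tag for <{name}> may not be omitted")
        out.append(f"</{name}>")

    for tok in tokens:
        if tok.startswith("</"):
            name = tok[2:-1].lower()
            if name not in stack:
                raise ValueError(f"{tok!r} closes nothing")
            while stack[-1] != name:      # imply the omitted inner end-tags
                close_top()
            stack.pop()
            out.append(f"</{name}>")
            continue
        is_tag = tok.startswith("<")
        name = tok[1:-1].lower() if is_tag else "#text"
        # Close open elements only while the token is NOT allowed in them.
        while stack and name not in allowed[stack[-1]]:
            close_top()
        if not stack and out:
            raise ValueError(f"{tok!r} is not allowed anywhere")
        if is_tag:
            stack.append(name)
            out.append(f"<{name}>")
        else:
            out.append(tok)
    while stack:
        close_top()
    return "".join(out)

OMISSIBLE = {"p", "ol", "li"}  # assumption: all declared with an omissible end-tag
STRICT = {"foo": {"ol", "p"}, "p": {"#text"}, "ol": {"li"}, "li": {"#text"}}
STICKY = {**STRICT, "p": {"#text", "ol"}}  # hypothetical: OL added to P's content

toks = ["<foo>", "<p>", "<ol>", "<li>", "text", "<ol>", "<li>", "text", "</foo>"]
print(normalize(toks, STRICT, OMISSIBLE))
print(normalize(toks, STICKY, OMISSIBLE))
```

With the STRICT table both lists end up as children of FOO; with the STICKY table the open P swallows both of them, which is the behaviour described above. The real normalizer output is still the authority on what the actual DTD does.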

Post by Tin Gherdanarra
I dare say that there is no way to express
such priorities or implicit stickyness of certain
tags in a DTD, or is there? [No?]

Yes.

Post by Tin Gherdanarra
HTML can handle such ambiguities because they
are not ambiguous in HTML. The parser tacitly assumes
priorities, or stickyness of nested unclosed tags.

It's not relevant whether it's HTML or not.

///Peter