Implicit "stickiness" of tags, DTD problem
Tin Gherdanarra
2006-08-08 14:00:07 UTC
Dear c.t.s.,

I have inherited a large SGML repository with
bogus tagging, i.e. tagging that does not fit
the DTD. It would be easy to fix the wrong
tagging in the SGML, but the management is a
little nervous about changing the data. They want
me to fix the DTD. I'm sceptical whether this is
possible at all. The problem is the declaration
and usage of two tags, <OL> and <P>.
These tags have basically the same semantics
as in HTML, which misled some people into
thinking that they can use these tags in
their SGML just as in HTML.

The declaration for these tags looks like
this.

Here is the one for the paragraph element:
<!ELEMENT p - o (%ptext;)+ -->

A <P> does not require a </P>.

<!ELEMENT ol - - (li)+ -->

An <OL> does not require a </OL>.

This means that

<OL>
<LI>
textextext
<LI>
textextext
<LI>
textextext

is perfectly legal, and so is

<P>
whatever
<P>
whatever

Either tag can occur within a superordinate
tag, let's call it <FOO>, the declaration
looks like this:

<!ELEMENT foo - - (ol|sl)+ +(h|p) -->

Consequently, we can put a <P> or an <OL>
into <FOO>. However, we can't put an <OL>
into a <P>:



<FOO>
<P> <!-- *** BOGUS *** -->
<OL>
<LI>
textext


My first thought was to include

<OL>

in the declaration for

<!ELEMENT p...

but this made the parser complain about an
ambiguous declaration:

"sx:... content model is ambiguous: when no tokens have been
matched, both the 1st and 2nd occurrences of "OL" are
possible"

I understand that omitted tags lead to problems when
nesting elements. For example:

<FOO>
<P>
<OL>
<LI>
textextext
<LI>
textextext
<LI>
textextext

<!-- Does the </P> belong here? -->


<OL>
<LI>
textextext
<LI>
textextext

<!-- Does the </P> belong here? -->

</FOO>

I don't know much about DTDs, but this looks
like a "shift/reduce-conflict" from classical
compiler construction to me. The parser does
not know which rule to apply:
"Close the <P> immediately after the first
<OL> because the second <OL> belongs at the
same level as the first <P> (i.e. to <FOO>)"
or
"Close the <P> after the second <OL>
because the second <OL> belongs at the
same level as the first <OL>, and is thus
under <P>, not under <FOO>"

This insight moves the problem from allowing
<OL>s in <P>s to allowing <OL>s in <P>s AND
DECLARING: "whenever you find an <OL> as a sibling
in an unclosed <P>, it belongs to the <P>, not
the superordinate <FOO>."


I dare say that there is no way to express
such priorities or implicit stickiness of certain
tags in a DTD, or is there? [No?]

HTML can handle such ambiguities because they
are not ambiguous in HTML. The parser tacitly assumes
priorities, or stickiness of nested unclosed tags.

Anything wrong with my reasoning?

Thanks for your attention
Tin
--
Lisp can't scratch, because Lisp is liquid
Peter Flynn
2006-08-13 21:03:27 UTC
Post by Tin Gherdanarra
Dear c.t.s.,
I have inherited a large SGML repository with
bogus tagging, i.e. tagging that does not fit
the DTD. It would be easy to fix the wrong
tagging in the SGML, but the management is a
little nervous about changing the data.
They should be even more nervous about running a
broken application.
Post by Tin Gherdanarra
They want
me to fix the DTD. I'm sceptical whether this is
possible at all. The problem is the declaration
and usage of two tags, <OL> and <P>.
These tags have basically the same semantics
as in HTML, which misled some people into
thinking that they can use these tags in
their SGML just as in HTML.
The declaration for these tags looks like
this.
<!ELEMENT p - o (%ptext;)+ -->
A <P> does not require a </P>.
<!ELEMENT ol - - (li)+ -->
An <OL> does not require a </OL>.
Yes it does. That's what the second minus sign means.
Post by Tin Gherdanarra
This means that
<OL>
<LI>
textextext
<LI>
textextext
<LI>
textextext
is perfectly legal,
Nope. It's invalid.
Post by Tin Gherdanarra
and so is
<P>
whatever
<P>
whatever
Yes, that's OK.
Post by Tin Gherdanarra
Either tag can occur within a superordinate
tag, let's call it <FOO>, the declaration
looks like this:
<!ELEMENT foo - - (ol|sl)+ +(h|p) -->
Consequently, we can put a <P> or an <OL>
into <FOO>. However, we can't put an <OL>
into a <P>:
OK.
Post by Tin Gherdanarra
<FOO>
<P> <!-- *** BOGUS *** -->
<OL>
<LI>
textext
My first thought was to include
<OL>
in the declaration for
<!ELEMENT p...
but this made the parser complain about an
ambiguous declaration:
"sx:... content model is ambiguous: when no tokens have been
matched, both the 1st and 2nd occurrences of "OL" are
possible"
Because OL is already somewhere in the expansion of ptext.
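
To make that concrete, here is a minimal Python sketch. The expansion of ptext is not shown anywhere in this thread, so the value used below is purely an assumption, and duplicate_tokens is a hypothetical helper that only looks at a flat OR group rather than a full content model.

```python
def duplicate_tokens(or_group, entities):
    """Return names that occur more than once in a flat OR group.

    `entities` maps parameter entity names to their replacement text;
    references of the form %name; are expanded before checking.
    """
    expanded = or_group
    for name, value in entities.items():
        expanded = expanded.replace(f"%{name};", value)

    seen, dupes = set(), []
    for token in expanded.split("|"):
        token = token.strip().lower()
        if token in seen and token not in dupes:
            dupes.append(token)
        seen.add(token)
    return dupes


# Assumed expansion of %ptext; (hypothetical, not taken from the real DTD).
ENTITIES = {"ptext": "#PCDATA | em | ol"}

if __name__ == "__main__":
    # Adding "ol" next to %ptext; repeats a token that is already there.
    print(duplicate_tokens("%ptext; | ol", ENTITIES))
```

If the real ptext already contains OL, then OL only needs to stay where it is; adding it a second time is what triggers the ambiguity message.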
Post by Tin Gherdanarra
I understand that omitted tags lead to problems when
nesting elements. For example:
<FOO>
<P>
<OL>
<LI>
textextext
<LI>
textextext
<LI>
textextext
<!-- Does the </P> belong here? -->
We need to see the expansion of ptext to know this.
Post by Tin Gherdanarra
<OL>
<LI>
textextext
<LI>
textextext
<!-- Does the </P> belong here? -->
</FOO>
I don't know much about DTDs, but this looks
like a "shift/reduce-conflict" from classical
compiler construction to me. The parser does
not know which rule to apply:
"Close the <P> immediately after the first
<OL> because the second <OL> belongs at the
same level as the first <P> (i.e. to <FOO>)"
Possibly.
Post by Tin Gherdanarra
"Close the <P> after the second <OL>
because the second <OL> belongs at the
same level as the first <OL>, and is thus
under <P>, not under <FOO>"
Ditto.
Post by Tin Gherdanarra
This insight moves the problem from allowing
<OL>s in <P>s to allowing <OL>s in <P>s AND
DECLARING: "whenever you find an <OL> as a sibling
in an unclosed <P>, it belongs to the <P>, not
the superordinate <FOO>."
That's what the parser will assume.

You should first test all this by running osgmlnorm (OpenSP's
version of sgmlnorm), which will normalize the instance and
insert all the missing end-tags where the parser deduces they
belong. That will show you what the current position is.
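
As a rough illustration of that rule (not a substitute for running the normalizer), the Python sketch below inserts omitted end-tags the way a parser would: an element whose end-tag may be omitted is closed only when the next token is not allowed inside it. The DTD table is a simplified assumption. It treats OL as already permitted inside P, flattens the inclusion on FOO into ordinary children, and invents an omissible end-tag for LI, whose declaration is not shown in this thread.

```python
# Simplified, assumed DTD: element -> (allowed children, end-tag omissible?)
DTD = {
    "foo": ({"ol", "sl", "h", "p"}, False),
    "p": ({"#text", "ol"}, True),   # assumption: ol is part of %ptext;
    "ol": ({"li"}, False),          # "- -": end-tag required
    "li": ({"#text"}, True),        # assumption: declaration not shown
}


def normalize(tokens):
    """Return the token list with omitted end-tags inserted.

    Tokens are start-tags like "<p>", end-tags like "</p>", or text.
    """
    stack, out = [], []

    def close_top():
        # Implicitly close the innermost open element, if that is legal.
        name = stack[-1]
        if not DTD[name][1]:
            raise ValueError(f"end-tag for <{name}> may not be omitted")
        out.append(f"</{stack.pop()}>")

    for tok in tokens:
        if tok.startswith("</"):
            name = tok[2:-1].lower()
            if name not in stack:
                raise ValueError(f"unexpected {tok}")
            while stack[-1] != name:
                close_top()
            stack.pop()
            out.append(f"</{name}>")
            continue
        is_tag = tok.startswith("<")
        child = tok[1:-1].lower() if is_tag else "#text"
        # Close the current element only if it cannot contain this token.
        while stack and child not in DTD[stack[-1]][0]:
            close_top()
        if is_tag:
            stack.append(child)
            out.append(f"<{child}>")
        else:
            out.append(tok)
    return out


if __name__ == "__main__":
    example = ["<foo>", "<p>", "<ol>", "<li>", "a", "<li>", "b", "</ol>",
               "<ol>", "<li>", "c", "</ol>", "</foo>"]
    print("".join(normalize(example)))
```

Fed the example from the post with the required </OL> end-tags present, this keeps the second <OL> inside the still-open <P> and closes the <P> only at </FOO>. Leaving out an </OL> raises an error, in line with the "- -" declaration.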
Post by Tin Gherdanarra
I dare say that there is no way to express
such priorities or implicit stickiness of certain
tags in a DTD, or is there? [No?]
Yes.
Post by Tin Gherdanarra
HTML can handle such ambiguities because they
are not ambiguous in HTML. The parser tacitly assumes
priorities, or stickiness of nested unclosed tags.
It's not relevant whether it's HTML or not.

///Peter
