Tin Gherdanarra
2006-08-08 14:00:07 UTC
Dear c.t.s.,
I have inherited a large SGML repository with
bogus tagging, i.e. tagging that does not fit
the DTD. It would be easy to fix the wrong
tagging in the SGML, but the management is a
little nervous about changing the data. They want
me to fix the DTD. I'm sceptical whether this is
possible at all. The problem is the declaration
and usage of two tags, <OL> and <P>.
These tags have basically the same semantics
as in HTML, which misled some people into
thinking that they can use these tags in
their SGML just as in HTML.
The declaration for these tags looks like
this.
Here is the one for the paragraph element:
<!ELEMENT p - o (%ptext;)+ -->
A <P> does not require a </P>.
<!ELEMENT ol - - (li)+ -->
An <OL> does not require a </OL>.
This means that
<OL>
<LI>
textextext
<LI>
textextext
<LI>
textextext
is perfectly legal, and so is
<P>
whatever
<P>
whatever
Either tag can occur within a superordinate
tag, let's call it <FOO>, whose declaration
looks like this:
<!ELEMENT foo - - (ol|sl)+ +(h|p) -->
Consequently, we can put a <P> or an <OL>
into <FOO>. However, we can't put an <OL>
into a <P>:
<FOO>
<P> <!-- *** BOGUS *** -->
<OL>
<LI>
textext
My first thought was to include
<OL>
in the declaration for
<!ELEMENT p...
but this made the parser complain about an
ambiguous declaration:
"sx:... content model is ambiguous: when no tokens have been
matched, both the 1st and 2nd occurrences of "OL" are
possible"
I understand that omissible end tags lead to problems
when such elements are nested. For example:
<FOO>
<P>
<OL>
<LI>
textextext
<LI>
textextext
<LI>
textextext
<!-- Does the </P> belong here? -->
<OL>
<LI>
textextext
<LI>
textextext
<!-- Does the </P> belong here? -->
</FOO>
I don't know much about DTDs, but to me this looks
like a "shift/reduce conflict" from classical
compiler construction. The parser does
not know which rule to apply:
"Close the <P> immediately after the first
<OL> because the second <OL> belongs at the
same level as the first <P> (i.e. to <FOO>)"
or
"Close the <P> after the second <OL>
because the second <OL> belongs at the
same level as the first <OL>, and is thus
under <P>, not under <FOO>"
This insight moves the problem from allowing
<OL>s in <P>s to allowing <OL>s in <P>s AND
DECLARING: "whenever you find an <OL> inside
an unclosed <P>, it belongs to the <P>, not to
the superordinate <FOO>."
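To make that "sticky" rule concrete, here is a minimal Python sketch
of a toy parser. It is not a real SGML parser and does not read a DTD.
The names CONTENT, OMISSIBLE_END and parse are made up for illustration,
and the content table is my assumption of what the changed DTD would
allow. The policy it applies is: keep the current element open as long
as the next start tag is allowed in it, and imply an end tag only when
it is not.

```python
# Toy model only. CONTENT is a hypothetical table of allowed children,
# standing in for the content models of a DTD.
CONTENT = {
    "FOO": {"P", "OL"},
    "P": {"OL"},          # the change under discussion: <OL> allowed in <P>
    "OL": {"LI"},
    "LI": set(),
}
# Assumption: elements whose end tag may be left out in the data.
OMISSIBLE_END = {"P", "OL", "LI"}


def parse(tokens, content):
    """Build a nested (name, children) tree from start/end tag names."""
    root = ("ROOT", [])
    stack = [root]
    for tok in tokens:
        if tok.startswith("/"):
            name = tok[1:]
            if name not in [n for n, _ in stack]:
                raise ValueError(f"unmatched end tag: {tok}")
            # An explicit end tag closes everything up to its element.
            while stack.pop()[0] != name:
                pass
        else:
            # Sticky rule: stay in the open element while the new child
            # is allowed there; otherwise imply an omissible end tag.
            while stack[-1][0] != "ROOT" and tok not in content[stack[-1][0]]:
                if stack[-1][0] not in OMISSIBLE_END:
                    raise ValueError(f"<{tok}> not allowed in <{stack[-1][0]}>")
                stack.pop()
            node = (tok, [])
            stack[-1][1].append(node)
            stack.append(node)
    return root[1]


# The example from above: <FOO><P><OL><LI><LI><LI><OL><LI><LI></FOO>
tokens = ["FOO", "P", "OL", "LI", "LI", "LI", "OL", "LI", "LI", "/FOO"]
foo = parse(tokens, CONTENT)[0]
print([name for name, _ in foo[1]])          # ['P']
print([name for name, _ in foo[1][0][1]])    # ['OL', 'OL']
```

With this policy the second <OL> ends up under the <P>. If the table
says that <P> may not contain <OL> (as in the original DTD), the same
code closes the <P> at the first <OL> and both lists become children
of <FOO>.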
I dare say that there is no way to express
such priorities or implicit stickiness of certain
tags in a DTD, or is there? [No?]
HTML can handle such ambiguities because they
are not ambiguous in HTML. The parser tacitly assumes
priorities, or stickiness of nested unclosed tags.
Anything wrong with my reasoning?
Thanks for your attention
Tin
--
Lisp cannot scratch, because Lisp is liquid