Discussion:
Algorithm for expanding minimized tags
Benjamin Niemann
2005-05-03 18:07:04 UTC
Hello,

I'm working on an SGML parser and struggle with tag minimization. E.g.

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN">
<html>
<title>foo</title>
<p>bar</p>
</html>

where <head>, </head>, <body> and </body> are implicitly inserted. Is the
algorithm used to determine how tags are implicitly opened and closed
documented anywhere? I guess it's in the ISO 8879:1986 document, but it is
not freely available. Does anyone have (online) references to such
documentation? I searched Google for a while, but without success.

Thanks in advance,
Benjamin Niemann
--
Benjamin Niemann
Email: pink at odahoda dot de
WWW: http://www.odahoda.de/
Jan Roland Eriksson
2005-05-03 20:35:38 UTC
Post by Benjamin Niemann
I'm working on an SGML parser and struggle with tag minimization.
Hmm... there is a fully recognized one available at...

<http://www.jclark.com/sp/>

...no need to roll your own really.
Post by Benjamin Niemann
E.g.
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN">
<html>
<title>foo</title>
<p>bar</p>
</html>
where <head>, </head>, <body> and </body> are implicitly inserted. Is the
algorithm used to determine how tags are implicitly opened and closed
documented anywhere?
Yes, the information your parser needs to imply and insert missing
elements is in the external subset that your...

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN">

...refers to.

Your own parser must be made in a way that it can use a full SGML
declaration, concatenated with a declaration subset, further
concatenated with your own document instance, before it even understands
shit about what you want it to do :-)
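
In other words, your parser has to consume something like this before it
can infer anything. A rough Python sketch of that input assembly (the file
names and the parser class are placeholders, not any real library):

# Hypothetical sketch: assemble the complete parser input as described
# above -- SGML declaration, then the declaration subset (DTD), then the
# document instance.
def build_parser_input(sgml_decl_path, dtd_path, instance_path):
    parts = []
    for path in (sgml_decl_path, dtd_path, instance_path):
        with open(path) as f:
            parts.append(f.read())
    return "\n".join(parts)

# parser = MySGMLParser()   # placeholder class, not a real API
# parser.feed(build_parser_input("html.decl", "strict.dtd", "page.html"))
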
--
Rex
Benjamin Niemann
2005-05-03 21:50:20 UTC
Post by Jan Roland Eriksson
Post by Benjamin Niemann
I'm working on an SGML parser and struggle with tag minimization.
Hmm... there is a fully recognized one available at...
<http://www.jclark.com/sp/>
...no need to roll your own really.
I want to implement various additional features - more verbose error
handling (nsgmls' output is often too technical and not well suited for
non-geeks) or warning messages for valid but (for various reasons)
problematic constructs.
And I really don't want to move back to C++ hell, so hacking SP is not an
option for me.
Post by Jan Roland Eriksson
Post by Benjamin Niemann
E.g.
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN">
<html>
<title>foo</title>
<p>bar</p>
</html>
where <head>, </head>, <body> and </body> are implicitly inserted. Is the
algorithm used to determine how tags are implicitly opened and closed
documented anywhere?
Yes, the information your parser needs to imply and insert missing
elements is in the external subset that your...
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN">
...refers to.
What I am searching for is the *algorithm* - how to decide which tag the
parser has to infer, e.g. upon encountering the <p>.
I guess this is just a handful of simple rules, and I suppose it is
unambiguously defined in the standard - and this is the piece of
information I am looking for.
Post by Jan Roland Eriksson
Your own parser must be made in a way that it can use a full SGML
declaration, concatenated with a declaration subset, further
concatenated with your own document instance, before it even understands
shit about what you want it to do :-)
Correct - and this has already been done :-)
I wouldn't even think about rolling my own parser if I were unaware of the
(approximate) complexity of such a project.
But many other people would be - so your advice is legitimate ;)
--
Benjamin Niemann
Email: pink at odahoda dot de
WWW: http://www.odahoda.de/
Nick Kew
2005-05-03 23:23:11 UTC
Post by Benjamin Niemann
I want to implement various additional features - more verbose error
handling (nsgmls' output is often too technical and not well suited for
non-geeks) or warning messages for valid but (for various reasons)
problematic constructs.
You can do that within OpenSP. I've done it for Page Valet (error
messages adapted to the problem of validation for web-heads). At one
time I had it hacked much further to add optional HTML accessibility
analysis, then AccessValet (using a different parser) forked off as
a separate tool.
Post by Benjamin Niemann
And I really don't want to move back to C++ hell, so hacking SP is not an
option for me.
Now that's an argument with which I can sympathise.
Post by Benjamin Niemann
Post by Benjamin Niemann
E.g.
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN">
<html>
<title>foo</title>
<p>bar</p>
</html>
If it's really HTML you're dealing with, you might want to look at the
HTMLparser module of libxml2, or even HTML Tidy. But neither of those
is an SGML parser.
Post by Benjamin Niemann
What I am searching for is the *algorithm* - how to decide which tag the
parser has to infer, e.g. upon encountering the <p>.
I guess this is just a handful of simple rules, and I suppose it is
unambiguously defined in the standard - and this is the piece of
information I am looking for.
It's in the DTD.

Again, for HTML see above.
--
Nick Kew
Benjamin Niemann
2005-05-04 14:40:43 UTC
Post by Nick Kew
Post by Benjamin Niemann
I want to implement various additional features - more verbose error
handling (nsgmls' output is often too technical and not well suited for
non-geeks) or warning messages for valid but (for various reasons)
problematic constructs.
You can do that within OpenSP. I've done it for Page Valet (error
messages adapted to the problem of validation for web-heads). At one
time I had it hacked much further to add optional HTML accessibility
analysis, then AccessValet (using a different parser) forked off as
a separate tool.
I did evaluate my options. Thankfully there are no deadlines or budgets I
have to worry about - so I can choose the implementation that seems most
fun for me.
Post by Nick Kew
Post by Benjamin Niemann
And I really don't want to move back to C++ hell, so hacking SP is not an
option for me.
Now that's an argument with which I can sympathise.
;)
Post by Nick Kew
Post by Benjamin Niemann
Post by Benjamin Niemann
E.g.
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN">
<html>
<title>foo</title>
<p>bar</p>
</html>
If it's really HTML you're dealing with, you might want to look at the
HTMLparser module of libxml2, or even HTML Tidy. But neither of those
is an SGML parser.
See above.
Post by Nick Kew
Post by Benjamin Niemann
What I am searching for is the *algorithm* - how to decide which tag the
parser has to infer, e.g. upon encountering the <p>.
I guess this is just a handful of simple rules, and I suppose it is
unambiguously defined in the standard - and this is the piece of
information I am looking for.
It's in the DTD.
The DTD only contains the *information* about which tags have optional
start/end tags, which tags can be contained in which other tags, etc.

But (as so often happens) just after I posted the question, another possible
way to solve the problem popped into my head - in this case a Google query I
had not yet tried, which pointed me to
http://daedalus.cs.berkeley.edu/software/pub/glomop/msreport.doc , which in
section 4.2 contains what I would call an algorithm.
--
Benjamin Niemann
Email: pink at odahoda dot de
WWW: http://www.odahoda.de/
Trevor Jenkins
2005-05-04 21:13:23 UTC
Post by Benjamin Niemann
Post by Nick Kew
Post by Benjamin Niemann
What I am searching for is the *algorithm* - how to decide which tag the
parser has to infer, e.g. upon encountering the <p>.
I guess this is just a handful of simple rules, and I suppose it is
unambiguously defined in the standard - and this is the piece of
information I am looking for.
It's in the DTD.
The DTD only contains the *information* about which tags have optional
start/end tags, which tags can be contained in which other tags, etc.
You need to look at the content models for each element (at this stage they
are not yet tags). If the content model contains only commas (formally the
seq connector) then you can infer precisely which element is required next.
The moment you allow the other two connectors (and, or) you have a set of
elements that might be applicable. Add in the occurrence indicators (opt,
plus, rep) and you have a set of possible elements. Finally, if your element
declaration includes ranked groups or exceptions, then you've really got a
wonderful erector set to play with.
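
To make the seq-only case concrete, here is a rough Python sketch (toy
data structures of my own, not from any real parser):

# A seq-only content model as an ordered list of (element_name, optional)
# pairs.  With only the seq connector you can tell exactly which element
# is required next; once '&', '|' and the occurrence indicators come in,
# you get a set of candidates instead.
def required_next(model, matched_count):
    """Return the single element required next, or None if the next
    slot is optional or the model is exhausted."""
    remaining = model[matched_count:]
    if remaining and not remaining[0][1]:
        return remaining[0][0]
    return None

# HTML 4.01 declares HTML with the content model (HEAD, BODY):
html_model = [("HEAD", False), ("BODY", False)]
print(required_next(html_model, 0))  # -> HEAD
print(required_next(html_model, 1))  # -> BODY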

As to how this can be handled outside of SP, try looking at the Elisp code
for the Emacs SGML editing mode. You might also want to read and inwardly
digest Charles Goldfarb's The SGML Handbook, as there are implementation
hints in there.

Now if your proposed alternative to James Clark's well established parsers
is going to handle RANK, DATATAG, CONCUR, and SUBDOC features in addition
to the OMITTAG feature I'd be interested in using it.

Regards, Trevor

<>< Re: deemed!
Benjamin Niemann
2005-05-05 11:21:03 UTC
Post by Trevor Jenkins
Post by Benjamin Niemann
Post by Nick Kew
Post by Benjamin Niemann
What I am searching for is the *algorithm* - how to decide which tag the
parser has to infer, e.g. upon encountering the <p>.
I guess this is just a handful of simple rules, and I suppose it is
unambiguously defined in the standard - and this is the piece of
information I am looking for.
It's in the DTD.
The DTD only contains the *information* about which tags have optional
start/end tags, which tags can be contained in which other tags, etc.
You need to look at the content models for each element (at this stage they
are not yet tags). If the content model contains only commas (formally the
seq connector) then you can infer precisely which element is required next.
The moment you allow the other two connectors (and, or) you have a set of
elements that might be applicable. Add in the occurrence indicators (opt,
plus, rep) and you have a set of possible elements. Finally, if your element
declaration includes ranked groups or exceptions, then you've really got a
wonderful erector set to play with.
As to how this can be handled outside of SP, try looking at the Elisp code
for the Emacs SGML editing mode. You might also want to read and inwardly
digest Charles Goldfarb's The SGML Handbook, as there are implementation
hints in there.
I already considered this, but at the moment this investment is a bit too
large for my current project (we are a bit short of money...)
Post by Trevor Jenkins
Now if your proposed alternative to James Clark's well established parsers
is going to handle RANK, DATATAG, CONCUR, and SUBDOC features in addition
to the OMITTAG feature I'd be interested in using it.
As others have already suspected, my parser will only have to deal with HTML.
Though I want to do the parsing the 'strict' SGML way, I will only
implement the features that browsers support or that are required for
validation. Everything else - even if valid SGML - should be considered an
error in this use case.
Extending the module into a full-featured (and pure Python) SGML parser
might be an interesting challenge, but does not have priority at the moment.
--
Benjamin Niemann
Email: pink at odahoda dot de
WWW: http://www.odahoda.de/
Jan Roland Eriksson
2005-05-05 20:23:54 UTC
[...]
Post by Trevor Jenkins
Now if your proposed alternative to James Clark's well established parsers
is going to handle RANK, DATATAG, CONCUR, and SUBDOC features in addition
to the OMITTAG feature I'd be interested in using it.
...my parser will only have to deal with HTML ... parsing the 'strict' SGML
way, I will only implement the features that browsers support or that are
required for validation...
Everything else - even if valid SGML - should be considered an error
in this use case.
Hmm, I wonder how you are going to determine which features browsers
actually support?

Questions like ... which browser brands? what generations of those
browsers? ... should be considered.

What would you do with older elements that are not in the current HTML 4.01
rec but are still "supported" in some major browsers?

Finally, there is an issue with some browsers of today that still do not
correctly parse (or even care about) some elements defined in HTML 4.01.
--
Rex
Benjamin Niemann
2005-05-05 21:22:15 UTC
Post by Jan Roland Eriksson
[...]
Post by Trevor Jenkins
Now if your proposed alternative to James Clark's well established
parsers is going to handle RANK, DATATAG, CONCUR, and SUBDOC features in
addition to the OMITTAG feature I'd be interested in using it.
...my parser will only have to deal with HTML ... parsing the 'strict'
SGML way, I will only implement the features that browsers support or that
are required for validation...
Everything else - even if valid SGML - should be considered an error
in this use case.
Hmm, I wonder how you are going to determine which features browsers
actually support?
Some valid SGML features that should not be used in HTML are already
identified in the HTML spec (e.g. Appendix B of the 4.01 spec).
Post by Jan Roland Eriksson
Questions like ... which browser brands? what generations of those
browsers? ... should be considered.
What would you do with older elements that are not in the current HTML 4.01
rec but are still "supported" in some major browsers?
Finally, there is an issue with some browsers of today that still do not
correctly parse (or even care about) some elements defined in HTML 4.01.
The parser will be the core of the application. Another thing that is
needed is a 'database of patterns' - constructs that are known to cause
problems in certain browsers. It should basically capture the knowledge that
a web developer/designer should have in order to build robust and
future-proof pages.
Building such a database will require a lot of work and research, of
course :)
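
Roughly what I have in mind, as a Python sketch (the element interface
and the example rule are made up just for illustration):

# The 'database of patterns': a list of checks run over the parsed tree.
# Assumes a hypothetical element interface with .name, .children and a
# .start_tag_implied flag set by the parser.
PATTERNS = [
    {
        "id": "implied-tbody",
        "message": "TBODY start-tag was implied; browsers are known to "
                   "build different trees here",
        "applies": lambda elem: elem.name == "TBODY" and elem.start_tag_implied,
    },
]

def run_patterns(element, report):
    """Walk the parsed tree and report every pattern that matches."""
    for rule in PATTERNS:
        if rule["applies"](element):
            report(rule["id"], rule["message"])
    for child in element.children:
        run_patterns(child, report)
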
--
Benjamin Niemann
Email: pink at odahoda dot de
WWW: http://www.odahoda.de/
Joe English
2005-05-04 20:06:30 UTC
Post by Benjamin Niemann
I'm working on an SGML parser and struggle with tag minimization. E.g.
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN">
<html>
<title>foo</title>
<p>bar</p>
</html>
where <head>, </head>, <body> and </body> are implicitly inserted. Is the
algorithm used to determine how tags are implicitly opened and closed
documented anywhere? I guess it's in the ISO 8879:1986 document [...]
As a matter of fact, it's not.

The Standard doesn't actually specify how a parser is to infer
missing tags. Instead, it describes *when tags can be removed*
from an otherwise normalized document; the algorithm for
reinserting omitted tags is left as an exercise to the implementor.

End-tag inference is pretty straightforward: if you see a
start-tag for an element that's not legal at the current
point in the content model, insert an end-tag for the current
element (and issue a diagnostic if the element wasn't declared
with "- O" or "O O").
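
In rough Python, that rule looks something like this (my own names;
allowed_here() and end_tag_omissible() stand in for real content-model
and DTD lookups):

# Close open elements until the incoming start-tag becomes legal.
def infer_end_tags(open_stack, incoming, allowed_here, end_tag_omissible,
                   emit):
    while open_stack and not allowed_here(open_stack[-1], incoming):
        closed = open_stack.pop()
        emit("</%s>" % closed)
        if not end_tag_omissible(closed):
            emit("warning: required end-tag for %s was omitted" % closed)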

Start-tag inference is a lot more complicated. It boils down
to determining when an element is "contextually required",
which is not at all easy to figure out from the language
in the Standard. I've got some notes on this lying around
somewhere...

However, if you only need to deal with HTML, you can get
by with a few simple heuristics; the only elements that
allow start-tag omission are HTML, HEAD, BODY, and TBODY,
and these are always contextually required where legal.
IOW: if the first thing you see isn't an <HTML> start-tag,
insert one; if the next thing you see isn't a <HEAD> start-tag,
insert one; then if you see something that can only appear
in the BODY, insert a </HEAD> end-tag and a <BODY> start-tag.
Once you've seen or inserted a <BODY> start-tag, you only
need to worry about end-tag inference (which is easy)
and omitted <TBODY> start-tags.
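
A rough Python sketch of those heuristics (HEAD_CONTENT abbreviates the
set of elements legal inside HEAD; end-tag inference and the TBODY case
are left out):

HEAD_CONTENT = {"TITLE", "ISINDEX", "BASE", "SCRIPT", "STYLE", "META",
                "LINK", "OBJECT"}

def infer_document_structure(start_tags):
    """Yield start-tag names with HTML, HEAD, /HEAD and BODY implied."""
    seen_html = seen_head = in_body = False
    for name in (t.upper() for t in start_tags):
        if not seen_html:
            seen_html = True
            if name == "HTML":
                yield name
                continue
            yield "HTML"
        if not seen_head:
            seen_head = True
            if name == "HEAD":
                yield name
                continue
            yield "HEAD"
        if not in_body and name not in HEAD_CONTENT:
            yield "/HEAD"
            in_body = True
            if name == "BODY":
                yield name
                continue
            yield "BODY"
        yield name

# list(infer_document_structure(["TITLE", "P"]))
# -> ['HTML', 'HEAD', 'TITLE', '/HEAD', 'BODY', 'P']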


--Joe English
Joe English
2005-05-04 20:50:59 UTC
[...] I've got some notes on this lying around somewhere [...]
See also:
<URL: http://groups.google.com/groups?selm=***@trystero.art.com >


--je
Nick Kew
2005-05-04 21:23:05 UTC
Post by Joe English
However, if you only need to deal with HTML, you can get
by with a few simple heuristics; the only elements that
allow start-tag omission are HTML, HEAD, BODY, and TBODY,
and these are always contextually required where legal.
IOW: if the first thing you see isn't an <HTML> start-tag,
insert one; if the next thing you see isn't a <HEAD> start-tag,
insert one; then if you see something that can only appear
in the BODY, insert a </HEAD> end-tag and a <BODY> start-tag.
Once you've seen or inserted a <BODY> start-tag, you only
need to worry about end-tag inference (which is easy)
and omitted <TBODY> start-tags.
A few years back this was an issue, when we looked at how
different parsers normalise HTML. Specifically, I was providing
an OpenSP-based webservice, and Jim Ley programmed up client
interfaces for it in a couple of browsers. So the question
was whether the different parsers could be expected to
produce identical trees, and if not, what to do about it.

My recollection of this is woefully inadequate, but Jim recently
wrote a note about it at
http://jibbering.com/discussion/fuzzy-pointers.html
in which he singles out <tbody> as the implied element that
regularly produced ambiguity. For anyone interested in the gory
detail, there's more on the subject in the lists.w3.org archives,
specifically the WAI/ER group round about late-2001 and 2002.
--
Nick Kew
Benjamin Niemann
2005-05-05 11:53:28 UTC
Post by Joe English
Post by Benjamin Niemann
I'm working on an SGML parser and struggle with tag minimization. E.g.
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN">
<html>
<title>foo</title>
<p>bar</p>
</html>
where <head>, </head>, <body> and </body> are implicitly inserted. Is the
algorithm used to determine how tags are implicitly opened and closed
documented anywhere? I guess it's in the ISO 8879:1986 document [...]
As a matter of fact, it's not.
The Standard doesn't actually specify how a parser is to infer
missing tags. Instead, it describes *when tags can be removed*
from an otherwise normalized document; the algorithm for
reinserting omitted tags is left as an exercise to the implementor.
End-tag inference is pretty straightforward: if you see a
start-tag for an element that's not legal at the current
point in the content model, insert an end-tag for the current
element (and issue a diagnostic if the element wasn't declared
with "- O" or "O O").
Start-tag inference is a lot more complicated. It boils down
to determining when an element is "contextually required",
which is not at all easy to figure out from the language
in the Standard. I've got some notes on this lying around
somewhere...
Thanks for your pointers. I'll have a look at them.
Post by Joe English
However, if you only need to deal with HTML, you can get
by with a few simple heuristics; the only elements that
allow start-tag omission are HTML, HEAD, BODY, and TBODY,
and these are always contextually required where legal.
IOW: if the first thing you see isn't an <HTML> start-tag,
insert one; if the next thing you see isn't a <HEAD> start-tag,
insert one; then if you see something that can only appear
in the BODY, insert a </HEAD> end-tag and a <BODY> start-tag.
Once you've seen or inserted a <BODY> start-tag, you only
need to worry about end-tag inference (which is easy)
and omitted <TBODY> start-tags.
The "omitted tag minimization parameter" -- the pair
of hyphens-or-Os following the element type name in an <!ELEMENT ...>
declaration -- *has no effect* on the tag inference procedure.
My target is indeed HTML, but I would like to solve the general case before
I fall back to hardcoded rules. I will give your hints from 1997 a try and
see if I can turn this into code.


Thanks to you and the other posters for your help.
--
Benjamin Niemann
Email: pink at odahoda dot de
WWW: http://www.odahoda.de/