Discussion:
Batch converter Word to SGML
Jany Quintard
2005-09-08 12:27:56 UTC
Does anyone know if there exists a batch converter able to transform Word
files into SGML (most probably (X)HTML)?
I can use OpenOffice to convert Word files into pretty clean HTML files,
but I need to process each file by hand, and I would like some automatic
process, ideally running under Unix/Linux or cygwin.

Any hints?

Jany
Simo Melenius
2005-09-08 13:10:29 UTC
Post by Jany Quintard
Does anyone know if there exist a batch converter able to transform
Word files into SGML (most probably (X)HTML). I can use OpenOffice
to convert Word files into pretty clean HTML files,
If this is true, then I believe the worst part is behind.
(Additionally, a small utility program to tidy HTML files further
might possibly be handy.)
Post by Jany Quintard
but I need to process each file by hand, and I would like some
automatic process, ideally running under Unix/Linux or cygwin.
You can do this mechanically at least, modulo correct doctypes and
DTDs:

xmllint has options to read in documents using an HTML parser (--html)
and output them with an XML serializer (--xmlout). Also, OpenSP comes with
`osx` that reads an SGML document and outputs it as an XML document.

I assume that OpenOffice generates fairly good but not always 100%
valid HTML.

xmllint might be a good start. I would guess the input HTML must be
pretty valid to keep OpenSP from bailing out (at least with default
settings). I don't know what parser xmllint uses but it might be more
lenient than OpenSP.
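
As a rough sketch of the xmllint route described above, the functions below re-serialize each HTML file in a directory as XML. The --html and --xmlout options are xmllint's own; the function names and the .xhtml output naming are assumptions made for illustration, and the output is not guaranteed to validate against any particular DTD.

```shell
#!/bin/sh
# Derive the output name for an HTML input file.
# The .xhtml suffix is an arbitrary choice for this sketch.
xhtml_name() {
    printf '%s.xhtml\n' "${1%.html}"
}

# Parse every .html file in a directory (default: current directory)
# with xmllint's HTML parser and write it back out as XML.
convert_all_html() {
    for f in "${1:-.}"/*.html; do
        [ -e "$f" ] || continue   # skip the literal pattern when nothing matches
        xmllint --html --xmlout "$f" > "$(xhtml_name "$f")"
    done
}
```

Calling `convert_all_html .` in a directory of OpenOffice-exported files would leave a .xhtml file beside each .html file. Parser warnings go to stderr, so they can be collected separately if needed.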


br,
S
--
***@iki.fi -- Today is the car of the cdr of your life.
Jany Quintard
2005-09-08 15:49:14 UTC
Post by Simo Melenius
Post by Jany Quintard
Does anyone know if there exist a batch converter able to transform
Word files into SGML (most probably (X)HTML). I can use OpenOffice
to convert Word files into pretty clean HTML files,
If this is true, then I believe the worst part is behind.
(Additionally, a small utility program to tidy HTML files further
might possibly be handy.)
I am used to html-tidy and SGML utilities (OpenJade and DSSSL), and my
only problem here is to find a tool which I could use on a command line,
such as in:

for file in *.doc
do
    # mysupertool stands for the converter I am looking for
    mysupertool "$file" | tidy -m -i -raw > "$(basename "$file" .doc).html"
    # nifty process to output SG/XML files
done

The pain here is that I can't (or am not aware that I can) use
OpenOffice this way.
I have to do the transformation interactively, and I need to process a
lot of files automagically. I would even be ready to use DOS command
lines, if needed ;-)

Jany
Simo Melenius
2005-09-08 20:56:18 UTC
Post by Jany Quintard
I am used to html-tidy and sgml utilities (Openjade and DSSSL) and my
only problem is here to find a tool which I could use on a command
Ah, I probably misread something. And it wasn't even late. Hrhmpf.
Post by Jany Quintard
mysupertool $file | tidy -m -i -raw > $(basename $file .doc).html
The pain here, is that I can't (or am not aware that I can) use
Openoffice this way.
OpenOffice has a GUI-driven mass-converter built in. If it's a
one-time mass conversion operation, that should be enough.

Alternatively, an OpenOffice macro could be programmed to do batches
of load+save operations or invoke OOo's own batch converter. There
seemed to be various such macros available through Google.

Of course, an even more die-hard way would be OpenOffice's UNO API,
which has bindings to various (scripting) languages, including Java and
Python, using which you could script OpenOffice completely :-)

Googling for "openoffice batch converter" seemed to find several
interesting hits.
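
For comparison, here is a minimal sketch of a command-line batch run. It assumes a present-day LibreOffice (the OpenOffice successor), whose `soffice` binary accepts --headless and --convert-to; that is an assumption beyond this thread, and the OpenOffice versions discussed here would need a macro or UNO script instead. The function name and directory arguments are made up for the example.

```shell
#!/bin/sh
# Convert every .doc file in src_dir to HTML in out_dir.
# Assumes LibreOffice's soffice is on PATH (not the 2005-era OpenOffice).
convert_docs() {
    src_dir=$1
    out_dir=$2
    mkdir -p "$out_dir" || return 1
    for doc in "$src_dir"/*.doc; do
        [ -e "$doc" ] || continue   # nothing to do when the glob matches no files
        soffice --headless --convert-to html --outdir "$out_dir" "$doc"
    done
}
```

Something like `convert_docs ./word ./html` would then write one HTML file per Word file, ready to be passed through tidy.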


HTH,
br,
S
--
***@iki.fi -- Today is the car of the cdr of your life.
miseryman
2005-09-08 22:13:02 UTC
Post by Jany Quintard
Does anyone know if there exist a batch converter able to transform Word
files into SGML (most probably (X)HTML).
I can use OpenOffice to convert Word files into pretty clean HTML files,
but I need to process each file by hand, and I would like some automatic
process, ideally running under Unix/Linux or cygwin.
You might take a look at Antiword, which has an option to output to XML
(DocBook): http://www.winfield.demon.nl/
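
A small sketch of how Antiword could be driven in a batch: `-x db` is its option for DocBook XML output, written to standard output. The function names and the .xml output naming are assumptions for the example.

```shell
#!/bin/sh
# Derive the DocBook output name for a Word file (naming is illustrative).
docbook_name() {
    printf '%s.xml\n' "${1%.doc}"
}

# Convert every .doc file in a directory (default: current directory)
# to DocBook XML using antiword's -x db option.
convert_all_docs() {
    for doc in "${1:-.}"/*.doc; do
        [ -e "$doc" ] || continue   # skip the literal pattern when nothing matches
        antiword -x db "$doc" > "$(docbook_name "$doc")"
    done
}
```

Running `convert_all_docs .` would leave a .xml file beside each .doc file.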
--
Raf
William F Hammond
2005-09-14 19:27:47 UTC
Post by miseryman
Post by Jany Quintard
Does anyone know if there exist a batch converter able to transform Word
files into SGML (most probably (X)HTML).
I can use OpenOffice to convert Word files into pretty clean HTML files,
but I need to process each file by hand, and I would like some automatic
process, ideally running under Unix/Linux or cygwin.
You might take a look at Antiword, which has an option to output to XML
(DocBook): http://www.winfield.demon.nl/
Abiword, now part of Gnome Office, comes with a suite of scriptable
tools for various conversions. One of them is called "wvHtml".

See http://www.gnome.org/gnome-office/
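
As an illustration, wvHtml takes an input Word file and an output file name, so a batch run might look like the sketch below. The function name and directory layout are assumptions, and the --targetdir option is used as described in the wvHtml manual page; check your installed version before relying on it.

```shell
#!/bin/sh
# Convert every .doc file in src_dir to HTML in out_dir with wvHtml.
wv_convert_all() {
    src_dir=$1
    out_dir=$2
    mkdir -p "$out_dir" || return 1
    for doc in "$src_dir"/*.doc; do
        [ -e "$doc" ] || continue   # nothing to do when the glob matches no files
        wvHtml --targetdir="$out_dir" "$doc" "$(basename "$doc" .doc).html"
    done
}
```

For example, `wv_convert_all ./word ./html` would write one HTML file per Word file into ./html.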

-- Bill

Peter Flynn
2005-09-08 22:16:01 UTC
Post by Jany Quintard
Does anyone know if there exist a batch converter able to transform Word
files into SGML (most probably (X)HTML).
I can use OpenOffice to convert Word files into pretty clean HTML files,
but I need to process each file by hand, and I would like some automatic
process, ideally running under Unix/Linux or cygwin.
The only reliable commercial product I have used for this was EBT's
DynaTag (Windows GUI and batch, Unix only batch). This was sold when
EBT ceased trading, and the purchaser (Enigma) doesn't seem to understand
SGML or XML, and has no clue what the product is. I believe it is still
theoretically available from them, but it's barely mentioned on their site.

However, I also believe it is still available, in a newer form, from Red
Bridge Software, who operate from the same premises as EBT used to in
Providence, RI. Ask 'em.

It was based on the Rainbow DTD, designed as an exchange format between
wordprocessors, so that you could write a translation from Product X to
Rainbow, and then use someone else's product to convert *from* Rainbow to
their format. This never really happened, but I have a copy of the original
Rainbow software still -- probably illegally -- at
ftp://ftp.ucc.ie/pub/sgml/rainbow/

DynaTag displays your Word file and lets you map Word named styles to XML
output elements you make up. It can handle nesting, splitting, and
encapsulation, and produces a DTD to accompany what you export. Having done
this for a sample batch, you can then let it rip on a big folder-full. The
GUI for training is Windows only, but the specs could be used with their
Unix batch engine. Having got it into your homebrew but representative XML
format, you can then write an XSLT transformation to turn it into whatever
output you need.

Caveat: like most Word-to-XML converters, the input Word *must* use named
styles for its formatting. If everything is just marked "Normal" then you
have only the fonts to guide you, and your conversion quality will suffer.

///Peter
Jany Quintard
2005-09-10 14:00:23 UTC
Thanks for the answers.

Jany