Is there a need for (semi-)automatic text-to-XML conversion?

Peter Flynn

2005-12-23 22:26:08 UTC

Post by f***@cogitoergosum.com
Hi, I'm new to this newsgroup and I'd like to ask some questions and
get opinions about this subject.
I'm developing an application focused on a very specific task: cleaning
and labelling text documents with user-defined structural tags (title,
cite, date, paragraph, itemList, ...). It performs the typical
pre-processing tasks needed in computational linguistics in order to
work with big corpora using statistical tools.

"User-defined"? Is there a standard for corpus linguistics? Like TEI?

Post by f***@cogitoergosum.com
But I'm worried that this field is too small/specific. I chose it
because it's a field that I know and where I have some contacts, *but*
I'm not sure whether university research departments are able to spend
money on purchasing software, or maybe they are too used to the
free/open source world.

Some of them have money, some don't. But IMHE they are well used to
using Open Source software, and there is plenty available.

Post by f***@cogitoergosum.com
For this reason I'm looking for some other field where the task of
adding structural labels to text is needed (specifically converting
unstructured, format-oriented documents to structured,
function-oriented XML documents). Maybe some area of publishing, but
I think that they will not be interested in "small" desktop
applications.

Who is "they"?

Post by f***@cogitoergosum.com
Please, have any of you worked for or heard about a business with
this kind of need? Do you think that there is demand for legacy
document conversion in small business?

Lots of us work in or close to this field. There certainly is a demand
for this, but it's very small, especially in small businesses. It is
currently faster and cheaper to send the whole corpus to a company in
the Indian subcontinent or on the Pacific Rim and have it rekeyed or
scanned into XML there. In general, companies are not interested in
legacy documents, and there is little or no business case for them to
be. It's different if someone is suing you over some documented event
that happened in the distant past, but in those cases I suspect the
companies are only too glad for the documents to remain inaccessible.

If there were any interest in preserving them, they wouldn't have used
WordPerfect, Lotus, or Word formats (or whatever) to store them in
the first place, would they? :-)

Academic research, especially literary and historical research; some
library projects; and some publishing-oriented preservation projects
are more likely to have a demand for this software -- but they don't
have large sums of money to spend on it, and it is arguable that if
they are publicly-funded they should perhaps not be spending that
money this way. And in those fields there are a lot of people who are
very expert in doing these conversions.

You seem to be confused about your objective: you say "...in small
business" but in the preceding paragraph you say that "...they will
not be interested in 'small' desktop applications."

Post by f***@cogitoergosum.com
- Importing from the main document formats (TXT, HTML, RTF, others?).

Those are three very unlikely candidates as there is already
software to handle them in many ways.

Legacy obsolete binary wordprocessing and DTP formats are the hardest
to deal with, especially when they reside on obsolete media.

Post by f***@cogitoergosum.com
- GUI-based interactive labelling (active learning techniques,
similar to OCR programs).

I just posted about this the other day: see the thread "looking for a

Post by f***@cogitoergosum.com
- Interactive labelling used to "train" the program by automatic
induction of statistical rules (based on textual, lexical,
typographical and structural properties of the block).

The IR people have been trying to do this for decades.
I may be biased in favour of markup, but I really don't see any progress
here.

Post by f***@cogitoergosum.com
- After training, the labeller can be used for batch processing in
fully autonomous mode.

This is what DynaTag's batch mode does (see reference to thread above).

Post by f***@cogitoergosum.com
- Exporting to user-defined XML (any standard? DocBook? TEI?)

Very, very hard to do in the first pass, because the sequence and
structure may simply not match. Much easier if you use an interim
markup structure, made for the job, and do a final conversion to
the target vocabulary afterwards.
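
A minimal sketch of that two-pass idea in Python, using only the standard
library. The interim vocabulary here (a flat run of <block type="...">
elements) is invented for illustration, and the target element names
(body, div, head, p) are only loosely TEI-like; a real target schema
would need much more than this.

```python
import xml.etree.ElementTree as ET

# Hypothetical interim markup: a flat sequence of labelled blocks,
# which is roughly what a first-pass labeller can produce reliably.
INTERIM = """<doc>
  <block type="title">Introduction</block>
  <block type="paragraph">First paragraph.</block>
  <block type="paragraph">Second paragraph.</block>
  <block type="title">Method</block>
  <block type="paragraph">Third paragraph.</block>
</doc>"""


def to_target(interim_xml: str) -> ET.Element:
    """Second pass: group the flat blocks into nested div/head/p elements."""
    body = ET.Element("body")
    current = None  # the div currently being filled
    for block in ET.fromstring(interim_xml).iter("block"):
        kind = block.get("type")
        # Every title opens a new div; so does leading text with no title.
        if kind == "title" or current is None:
            current = ET.SubElement(body, "div")
        if kind == "title":
            ET.SubElement(current, "head").text = block.text
        else:
            ET.SubElement(current, "p").text = block.text
    return body


print(ET.tostring(to_target(INTERIM), encoding="unicode"))
```

The point of the split is that the first pass only has to get the labels
right, while the nesting and ordering rules of the target vocabulary live
in one small, separately testable step.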

Post by f***@cogitoergosum.com
- A lot of small cleaning and normalization tasks: removing headers,
de-hyphenation, reconstruction of paragraphs with broken lines,
removing non-textual or decorative elements (such as ASCII art), ...

Yes, very useful, and something that a lot of conversion software
is very bad at.
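
As a rough illustration of two of those tasks (de-hyphenation and
rebuilding paragraphs from hard-wrapped lines), here is a small Python
sketch. The function names are made up for this example, and the rules
are deliberately naive: it assumes blank lines separate paragraphs and
that a hyphen at a line end always marks a split word, which is not true
of genuine compounds.

```python
import re
from typing import List


def dehyphenate(text: str) -> str:
    """Join words split by a hyphen at a line end, e.g. 'conver-' + 'sion'."""
    # Heuristic: only join when lowercase letters sit on both sides.
    # This will wrongly merge real compounds broken at the hyphen.
    return re.sub(r"([a-z])-[ \t]*\n[ \t]*([a-z])", r"\1\2", text)


def rebuild_paragraphs(text: str) -> List[str]:
    """Treat blank lines as paragraph breaks and unwrap the lines between."""
    paragraphs = []
    for chunk in re.split(r"\n\s*\n", text):
        lines = [line.strip() for line in chunk.splitlines() if line.strip()]
        if lines:
            paragraphs.append(" ".join(lines))
    return paragraphs


def clean(text: str) -> List[str]:
    """Run de-hyphenation first, then paragraph reconstruction."""
    return rebuild_paragraphs(dehyphenate(text))
```

For example, clean("Legacy docu-\nment conver-\nsion is\nhard.\n\nNext one.\n")
gives two paragraphs with the split words rejoined. Removing running
headers or ASCII art needs document-specific rules and is not attempted
here.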

Post by f***@cogitoergosum.com
I think that legacy document conversion may be a need for many
businesses, but I'm not able to find them. Maybe some of you can give
me a clue?

Let us know if you find any businesses who are interested. With the
obvious caveats already mentioned, legacy documents simply are not
interesting for businesses.

///Peter

--
XML FAQ: http://xml.silmaril.ie/