f***@cogitoergosum.com
2005-12-23 13:17:11 UTC
Hi, I'm new to this newsgroup and I'd like to ask some questions and get
opinions about this subject.
I'm developing an application focused on a very specific task: cleaning
and labelling text documents with user-defined structural tags (title,
cite, date, paragraph, itemList, ...). It performs the typical
pre-processing tasks needed in computational linguistics in order to
work with big corpora and statistical tools.
But I'm worried that this field is too small/specific. I chose it
because it's a field that I know and where I had some contacts, *but* I'm
not sure whether university research departments are able to spend
money on software, or maybe they are too used to the free/open
source world.
For this reason I'm looking for some other field where the task of
adding structural labels to text is needed (specifically, converting
unstructured, format-oriented documents to structured,
function-oriented XML documents). Maybe some area of publishing, but I
think they will not be interested in "small" desktop applications.
Has any of you worked for or heard about a business with
this kind of need? Do you think there is demand for legacy
document conversion in small businesses?
Some info about the application:
- Importing from the main document formats (TXT, HTML, RTF, others?).
- GUI-based interactive labelling (active learning techniques,
similar to OCR programs).
- Interactive labelling is used to "train" the program by automatic
induction of statistical rules (based on textual, lexical,
typographical and structural properties of the block).
- After training, the labeller can be used for batch processing in a
fully autonomous mode.
- Exporting to user-defined XML (any standard? DocBook? TEI?)
- A lot of small cleaning and normalization tasks: removing headers,
de-hyphenation, reconstruction of paragraphs with broken lines,
removing non-textual or decorative elements (such as ASCII art), ...
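To make the cleaning tasks in the list above more concrete, here is a
minimal Python sketch of de-hyphenation, paragraph reconstruction and
export to a simple XML wrapper. It is only an illustration, not the
actual code of the application: the function names, the <document> and
<paragraph> tag names, and the two heuristics (a hyphen at the end of a
line followed by a lowercase letter marks a broken word; blank lines
separate paragraphs) are all assumptions.

```python
import re
from xml.sax.saxutils import escape


def dehyphenate(text: str) -> str:
    """Join words split across lines, e.g. 'docu-\\nment' -> 'document'.

    Assumed heuristic: a hyphen right before a newline, followed by a
    lowercase letter, marks a broken word rather than a real hyphen.
    """
    return re.sub(r"(\w)-\n(?=[a-z])", r"\1", text)


def rebuild_paragraphs(text: str) -> list[str]:
    """Merge hard-wrapped lines into paragraphs.

    Assumed heuristic: one or more blank lines separate paragraphs.
    """
    blocks = re.split(r"\n\s*\n", text)
    # Collapse all internal whitespace (including newlines) to single spaces.
    return [" ".join(block.split()) for block in blocks if block.strip()]


def to_xml(paragraphs: list[str], tag: str = "paragraph") -> str:
    """Wrap each paragraph in a (hypothetical) user-defined tag."""
    body = "\n".join(f"  <{tag}>{escape(p)}</{tag}>" for p in paragraphs)
    return f"<document>\n{body}\n</document>"


if __name__ == "__main__":
    raw = (
        "Legacy docu-\nment conversion is a com-\nmon need.\n"
        "\n"
        "Second para-\ngraph & more."
    )
    print(to_xml(rebuild_paragraphs(dehyphenate(raw))))
```

Note that the hyphen rule is deliberately naive: it would also join
genuine compounds such as "well-" / "known" broken at a line end, which
is one of the cases where lexical or statistical evidence would be
needed.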
I think that legacy document conversion may be a need for many
businesses, but I'm not able to find them; maybe some of you can give
me a clue?
Thanks very much in advance.