FW: Re: wvare DOC -> HTML filter

From: Job, James <JJob(at)not-real.ESD.WA.GOV>
Date: Mon May 24 2004 - 17:34:41 GMT
-----Original Message-----
From: Job, James 
Sent: Monday, May 24, 2004 10:04 AM
To: 'swish-e@sunsite.berkeley.edu'
Subject: RE: [SWISH-E] Re: wvare DOC -> HTML filter


#1 WVWARE outputs HTML, whereas CATDOC outputs plain text.  My understanding
is that the HTML2 parser is the default/preferred indexer in Swish-e.

#2 Since it converts Word hyperlinks to <a> tags, SWISH-E should be able to
crawl other documents linked from inside the .DOC (a darn good reason to use
it).

#3 It appears to be very fast (GNU/win32 port; just make sure you pass "-1"
to disable WMF image file creation).  It's hard for me to see any speed
difference between the two (both run in under a second on my system on a
60 KB doc)...  I should write a script to load test it though.

#4 In manually converting DOCs to HTML, it does a very presentable job
(tables, typestyles, etc).

#5 Indexed description in SWISH is very nice: (example)

newsletter 6.doc  rank: 633
Newsletter # 6 April 23, 2004 SSON Project Timeline The SSON Core team has
produced a schedule of the tasks and major milestones that will be required
for implementation on July 1st. You will note that even though July 1st is
Employment Security's official transition date many project tasks will occur
during the following months. The reason is that the new AFRS formatted data
only starts accumulating in the system after July 1st. Many AFRS processes
and the reporting classes will need at least a ... 

Last Modified Date: 2004-05-04 13:22:03 Pacific Daylight Time 
Document Size: 4336 
Document Path: http://insideesd.esd.wa.gov/asd/sson/newsletter 6.doc
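
For the load test mentioned under #3, something along these lines might do.
This is only a rough sketch in Python, not something I have run against real
documents.  It assumes both "wvWare" and "catdoc" are on the PATH under those
names, and the function names (time_command, compare_filters) are made up for
illustration.

```python
import subprocess
import time


def time_command(cmd, repeats=20, run=subprocess.run):
    """Return the mean wall-clock seconds per run of cmd over `repeats` runs."""
    start = time.perf_counter()
    for _ in range(repeats):
        # Discard the converter output; only the elapsed time matters here.
        run(cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL, check=True)
    return (time.perf_counter() - start) / repeats


def compare_filters(doc_path, repeats=20, run=subprocess.run):
    """Time wvWare (with -1, as in #3) and catdoc on the same document."""
    # Assumption: both executables are reachable on the PATH under these names.
    commands = {
        "wvWare": ["wvWare", "-1", doc_path],
        "catdoc": ["catdoc", doc_path],
    }
    return {name: time_command(cmd, repeats, run) for name, cmd in commands.items()}
```

Calling compare_filters("some.doc", repeats=50) returns a dict of mean seconds
per conversion for each filter.  Process start-up cost is included in each
measurement, which is what the indexer would see anyway.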

Finally, someone so inclined could use WVWARE to add a g**gle-like "view in
HTML" option to their swish.cgi (get the file, call the filter & return the
resulting HTML).
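
The "call the filter & return the resulting HTML" part could look roughly
like the sketch below.  It is Python rather than anything taken from
swish.cgi, and it rests on assumptions: that the .doc has already been
fetched to a local path, that "wvWare" is on the PATH, and that it writes the
HTML to standard output.  The function names are hypothetical.

```python
import subprocess


def doc_to_html(doc_path, run=subprocess.run):
    """Convert a Word document to HTML by calling wvWare and capturing stdout."""
    # "-1" disables WMF image file creation, as noted in #3.
    result = run(["wvWare", "-1", doc_path], stdout=subprocess.PIPE, check=True)
    return result.stdout


def view_as_html_response(doc_path, run=subprocess.run):
    """Build a minimal CGI response: header line, blank line, converted HTML."""
    body = doc_to_html(doc_path, run)
    return b"Content-Type: text/html\r\n\r\n" + body
```

A real CGI would also have to check that the requested path is one of the
indexed documents before handing it to the converter.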

I haven't tried this on Linux (yet), but it should work (as long as the
filter can find wvware).

James Job
jjob@esd.wa.gov


-----Original Message-----
From: Bill Moseley [mailto:moseley@hank.org] 
Sent: Saturday, May 22, 2004 10:16 PM
To: Multiple recipients of list
Subject: [SWISH-E] Re: wvare DOC -> HTML filter


On Sat, May 22, 2004 at 08:31:07PM -0700, Job, James wrote:
> Here is what it took (on Win2003 Server using SWISH-E 2.4.2):
> 
> 1.  Download & Install "Complete package, except sources" (setup) from
>     http://gnuwin32.sourceforge.net/packages/wv.htm (2mb- Jan 2004).
> 2.  Add "C:\Program Files\GnuWin32\bin" to your system PATH.
> 3.  Create Doc2html.pm in c:\swish-e\lib\swish-e\perl\SWISH\filters\ as
>     follows (a quickly hacked Doc2txt.pm):
>
> #================================================================================
> package SWISH::Filters::Doc2html;

Yea!  A new filter.

Did you compare this with how well (or not well) catdoc does? When would
someone want this over catdoc?

Thanks!

-- 
Bill Moseley
moseley@hank.org



Received on Mon May 24 10:34:41 2004