Re: Parsing doc, xls and excel files with swish-e and libxml2

From: David L Norris <dave(at)not-real.webaugur.com>
Date: Mon Jun 27 2005 - 20:16:56 GMT
On Mon, 2005-06-27 at 12:04 -0700, Animesh Bansriyar wrote:
> root@laptop:/tmp# /usr/local/swish-e/bin/swish-e -i /opt/work_data/Neolinux.doc -v 20
> Indexing Data Source: "File-System"
> Indexing "/opt/work_data/Neolinux.doc"
> 
> Checking file "/opt/work_data/Neolinux.doc"...
>   Neolinux.doc - Using DEFAULT (HTML2) parser -  (290 words)

Word documents are enclosed in a binary OLE wrapper document, but some
files contain plain-text portions.  Parsing a Word document with the
HTML parser may or may not result in complete nonsense.  The fact that
it works at all is simply good fortune on your part.

The reason it does not work on Windows is surely due to the way
Microsoft's C runtime processes files.  Windows treats text and binary
files differently.  UNIX systems do not.  Swish-e expects text input and
therefore tells Windows to send it text.  Since MS Word docs are almost
entirely binary, Windows probably does not bother to send Swish-e the
document at all.
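
The text/binary difference can be observed with a few lines of C.  This
is only a sketch, not code from Swish-e: count_bytes is a hypothetical
helper, and whether the Windows build of Swish-e opens files in text
mode is my assumption above, not something checked here.  In text mode
Microsoft's C runtime translates CR-LF pairs to LF and treats a Ctrl-Z
byte (0x1A) as end of file, so a binary file is usually cut short.

```c
#include <assert.h>
#include <stdio.h>

/* Count how many bytes the C runtime returns when `path` is opened
 * with the given fopen mode ("r" = text, "rb" = binary).
 * Returns -1 if the file cannot be opened. */
long count_bytes(const char *path, const char *mode)
{
    FILE *fp;
    long n = 0;

    assert(path != NULL && mode != NULL);

    fp = fopen(path, mode);
    if (fp == NULL)
        return -1;

    /* Read until the runtime reports EOF.  In Windows text mode this
     * can happen well before the real end of the file. */
    while (fgetc(fp) != EOF)
        n++;

    fclose(fp);
    return n;
}
```

On a UNIX system count_bytes(path, "r") and count_bytes(path, "rb")
return the same number.  On Windows the "r" count for a .doc file should
come out much smaller than the "rb" count.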

You really do need to filter the Word documents.  And if you plan to
perform phrase searches, the program that filters the Word documents
really does need to understand OLE containers (e.g. catdoc, wvWare, or
antiword).  Otherwise your Word document may be scrambled because
OLE containers are often written in append mode.
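
If you need to decide which files to hand to such a filter, one option
is to look for the OLE2 compound-file signature rather than trust the
file extension.  The following is a minimal sketch under that
assumption.  looks_like_ole is a hypothetical helper and is not part of
Swish-e or of any of the filters named above.  It only identifies the
container; it does not extract text.  Note that the signature itself
contains a 0x1A byte, which is the Ctrl-Z character.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Eight-byte signature found at the start of an OLE2 compound file
 * (the container used by Word .doc and Excel .xls files). */
static const unsigned char OLE_MAGIC[8] = {
    0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1
};

/* Returns 1 if `path` starts with the OLE2 signature, 0 if it does
 * not, and -1 if the file cannot be opened. */
int looks_like_ole(const char *path)
{
    unsigned char head[sizeof OLE_MAGIC];
    size_t got;
    FILE *fp;

    assert(path != NULL);

    /* Open in binary mode so the runtime does not alter the bytes. */
    fp = fopen(path, "rb");
    if (fp == NULL)
        return -1;

    got = fread(head, 1, sizeof head, fp);
    fclose(fp);

    return got == sizeof head &&
           memcmp(head, OLE_MAGIC, sizeof head) == 0;
}
```

A file for which this returns 1 would then be passed through catdoc,
wvWare, or antiword before indexing.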

> But here is the problem: I have been trying to do the exact thing on Windows 
> but have failed to do this so far. On windows there was no libpcre and I had
> to do a lot of ugly hacks to get everything compiled properly under MinGW and
> msys.

What sort of hacks?  I've never built Swish-e using MinGW under Windows.
All of my Windows builds are made on a Debian-based system using the
mingw32, mingw32-binutils, and mingw32-runtime packages.  The script
swish-e/src/win32/build.sh will compile Swish-e under MinGW, setting all
the required configure options.  You would need to modify some portions
of the build scripts to compile Swish-e natively under Windows.


To build Swish-e you need the development headers and libraries for
pcre, zlib, and libxml2 compiled for Windows.  These development
packages can be found on their respective websites.  Here is a copy of
my build directory which contains a buildable version of swish-e 2.4.3:
http://www.webaugur.com/wares/files/swish-e/swish-build-2005-03-29.tar.gz


Is there any reason you need to compile Swish-e at all?  The Windows
installer allows you to remove components during installation.

-- 
 David Norris
  http://www.webaugur.com/dave/
  ICQ - 412039
Received on Mon Jun 27 13:16:58 2005