Re: cygwin: email archive indexing problem

From: Bill Moseley <moseley(at)not-real.hank.org>
Date: Mon Nov 26 2001 - 17:37:43 GMT
At 04:58 PM 11/26/01 +0100, lanz+usenet@wsl.ch wrote:
>  Bill> The indexing script for the swish-e archive indexes hypermail
>  Bill> archive files, which are html docs.  I decided to go the quick
>  Bill> and easy route and just use regular expression matching.  See
>  Bill> http://swish-e.org/Discussion/search/index_hypermail.pl
>
>Hm, I'll try that later (I don't know perl). Thank you very much for
>the script (idea).

The script is very small, so you should be able to figure it out.  Post
again if you have questions.

[We (swish-e users) should build a library of scripts that will parse
various things, such as mail archives].


>With an entry like
>
>FileMatch filename contains "^\d+$"
>
>I get "err: Failed to complie regular expression '^d+$',
>pattern. Error: 167970536" with my cygwin compiled swish-e system
>(daily snapshot) under WinNT. Similarly for FileRules entries. An
>entry

As noted in a previous post today, use the pre-compiled binary from the
swish-e.org download page.

I can't get the cygwin version to compile on my Windows ME machine:

http.c: In function `get':
http.c:393: warning: implicit declaration of function `sleep'
http.c: In function `lgetpid':
http.c:502: warning: implicit declaration of function `getpid'
http.c: In function `http_indexpath':
http.c:685: `unlink' undeclared (first use in this function)
http.c:685: (Each undeclared identifier is reported only once
http.c:685: for each function it appears in.)
make[1]: *** [http.o] Error 1
make[1]: Leaving directory `/cygdrive/c/home/swish-e/src'
make: *** [swish-e] Error 2



>NoContents .overview .temp .prop
>
>seems to be ignored (at least for the .temp and .prop files; see
>comment below).
>
>
>  >> swish-e seems to scan index.swish-e.temp and
>  >> index.swish.e.prop.temp, or what does the "Warning: Substitute
>  >> possible embedded null character(s) in file index.swish-e" (and
>  >> index.swish-e.temp, index.swish-e.prop, index.swish-e.prop.temp)
>  >> mean? I have set "NoContents .swish-e .temp .prop" in my config
>  >> file.
>
>  Bill> The embedded null message means that your document probably
>  Bill> has an embedded null and was thus truncated.  (It really means
>  Bill> that the files system said that the document was X bytes long,
>  Bill> but strlen(buf) says it's Y bytes, and Y < X.)
>
>  Bill> It might also mean you are trying to index binary data that
>  Bill> contains a null.
>
>I did ask swish-e to create the index in the indexed directory, and
>the error message concerning the embedded null was on the swish-e
>generated index temporary files! I store my index file in a different
>directory now. ;-)

Ah, I see.  I usually use IndexOnly or -i and specify what I'm indexing, so I
don't index the index files by mistake...


>
>  Bill> If you use the libxml2 parser you won't have this problem with
>  Bill> HTML docs.
>
>I have mail messages. Text files!?

Yes, but you would use -S prog to parse the mail messages into fields
(metanames and properties in swish), and then use either the HTML or XML
parser to index them.  For example, the index_hypermail.pl script parses
out the subject, author, author email, and date into tags like
            <meta name="subject" content="foo">
so the HTML parser is used.  You could just as easily format the output as XML:

<all>
   <subject>foo</subject>
   <author>bas</author>
   ...
</all>
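
A minimal sketch of what such a -S prog program might write for one message, in C since swish-e itself is C (any language works). The function name format_record is made up for illustration. The header names (Path-Name, Content-Length, Document-Type) are given as recalled from the swish-e documentation for the prog method, so check them against the docs for your version.

```c
#include <stdio.h>
#include <string.h>

/* Build one -S prog record: headers, a blank line, then an XML body.
 * Returns the number of bytes written to out, or -1 if a buffer is too small.
 * Assumption: subject and author contain no characters that need XML
 * escaping (&, <, >); a real program would escape them first. */
int format_record(char *out, size_t outsz, const char *path,
                  const char *subject, const char *author)
{
    char body[1024];
    int blen = snprintf(body, sizeof body,
                        "<all>\n"
                        "   <subject>%s</subject>\n"
                        "   <author>%s</author>\n"
                        "</all>\n",
                        subject, author);
    if (blen < 0 || (size_t)blen >= sizeof body)
        return -1;

    /* Content-Length is the size of the body only, not the headers. */
    int n = snprintf(out, outsz,
                     "Path-Name: %s\n"
                     "Content-Length: %d\n"
                     "Document-Type: XML\n"
                     "\n"
                     "%s",
                     path, blen, body);
    if (n < 0 || (size_t)n >= outsz)
        return -1;
    return n;
}
```

The program would write each record to stdout with fwrite() or fputs(), one record per mail message.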


As I mentioned in a second post, the embedded null warning occurs only with
the original parsers (HTML, XML, and TXT).  That's because for those parsers
the *entire* file is read into a buffer in memory.  That buffer is then used
like a string.  If the string length < file size, the buffer is scanned for
nulls, and they are replaced with newlines.
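
The check described above can be sketched in a few lines of C. This illustrates the idea only and is not the actual swish-e source. The function name fix_embedded_nulls is invented, and it assumes the caller has read file_size bytes into buf and appended a terminating '\0'.

```c
#include <stddef.h>
#include <string.h>

/* buf holds file_size bytes read from the file, followed by a '\0'.
 * If the C string is shorter than the file, replace each embedded null
 * with a newline so the whole buffer is usable as one string.
 * Returns the number of bytes replaced. */
size_t fix_embedded_nulls(char *buf, size_t file_size)
{
    size_t replaced = 0;

    if (strlen(buf) < file_size) {          /* Y < X: something cut it short */
        for (size_t i = 0; i < file_size; i++) {
            if (buf[i] == '\0') {
                buf[i] = '\n';
                replaced++;
            }
        }
    }
    return replaced;
}
```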

When you build with libxml2 (which is a library for parsing HTML and XML)
you add three new parsers, HTML2, XML2, and TXT2.  TXT2 doesn't use
libxml2, but it uses the same parser.c code for reading from files in 2K
chunks as HTML2 and XML2.
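
For comparison, here is a rough sketch of chunked reading in C. It is not the parser.c code; read_in_chunks and count_chunk are hypothetical names. The point it shows is that each chunk is passed on with its exact length, so nothing depends on strlen() and a null byte in the data does not shorten the input.

```c
#include <stdio.h>

#define CHUNK_SIZE 2048   /* the "2K chunks" mentioned above */

/* Read fp in CHUNK_SIZE pieces and hand each piece, with its exact
 * length, to handle().  Returns the total number of bytes read. */
size_t read_in_chunks(FILE *fp,
                      void (*handle)(const char *chunk, size_t len, void *ctx),
                      void *ctx)
{
    char chunk[CHUNK_SIZE];
    size_t total = 0, n;

    while ((n = fread(chunk, 1, sizeof chunk, fp)) > 0) {
        handle(chunk, n, ctx);
        total += n;
    }
    return total;
}

/* Example handler: counts how many chunks were delivered.
 * ctx must point to a size_t counter. */
void count_chunk(const char *chunk, size_t len, void *ctx)
{
    (void)chunk;
    (void)len;
    ++*(size_t *)ctx;
}
```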


>  Bill> BTW - Many people do this:
>
>  Bill> IndexOnly .html .htm NoContents .gif .jpeg
>
>  Bill> But swish will never see the .gif and .jpeg since it's only
>  Bill> looking at .htm and .html.
>
>What is the IndexOnly syntax for just indexing files with NO extension
>name? My mail messages are stored in files named NNNNN, where the N
>are digits.

You need to use FileRules, as you were trying.  But if you are using -S
prog, your program does the filtering itself with a regular expression.
index_hypermail.pl does this:

sub wanted {
    return if -d;
    return  unless /^\d+\.html$/;
    ...

which ignores directories and indexes only files that match the regular
expression.
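
For files named NNNNN with no extension, the same name test could look like the sketch below if the -S prog program were written in C. The function name is_message_file is invented. It covers only the filename match, not the directory check, and it uses the POSIX regcomp()/regexec() interface. POSIX extended regular expressions do not define \d, so the pattern spells the digit class out as [0-9].

```c
#include <regex.h>
#include <stddef.h>

/* Return 1 if name consists only of digits (e.g. "12345"), else 0. */
int is_message_file(const char *name)
{
    regex_t re;

    if (regcomp(&re, "^[0-9]+$", REG_EXTENDED | REG_NOSUB) != 0)
        return 0;                       /* pattern failed to compile */

    int match = (regexec(&re, name, 0, NULL, 0) == 0);
    regfree(&re);
    return match;
}
```

A real program would compile the pattern once and reuse it for every file rather than recompiling on each call.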

>I use the latest development version of swish-e. The problems are
>caused by the sometimes very large mail attachments embedded in the
>mail message files (usually base64 encoded). It would be nice to have
>a simple option (filter) in the swish-e configuration file, which
>would prevent scanning embedded mail attachments (I mean the base64
>encoded parts of a mail message with Content-Type:
>Multipart-Mixed). 

That's why you need perl ;)  If you use -S prog you can really control what
swish indexes.  So you can throw away the attachments if needed.  It's
really the way to go.  This kind of thing is exactly why -S prog was added.
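
To make "throw away the attachments" concrete, here is a deliberately simplified sketch in C of a filter that drops the body of every base64 part from a multipart message before it is handed to swish-e. The function name strip_base64_parts is made up, the caller is assumed to have already extracted the boundary string from the Content-Type header, and nested multiparts and folded header lines are not handled. A proper MIME parser does all of this correctly.

```c
#include <string.h>
#include <strings.h>   /* strncasecmp (POSIX) */

/* Copy the message text in `in` to `out`, dropping the body of every MIME
 * part whose headers declare base64 encoding.  `boundary` is the multipart
 * boundary without the leading "--".  Returns 0 on success, -1 if out is
 * too small. */
int strip_base64_parts(const char *in, const char *boundary,
                       char *out, size_t outsz)
{
    static const char hdr[] = "Content-Transfer-Encoding:";
    const size_t hlen = sizeof hdr - 1;
    size_t blen = strlen(boundary), used = 0;
    int base64_part = 0;   /* headers of the current part said base64 */
    int skipping = 0;      /* currently inside a base64 body */

    while (*in) {
        const char *eol = strchr(in, '\n');
        size_t len = eol ? (size_t)(eol - in) + 1 : strlen(in);

        /* A boundary line starts a new part (or ends the message). */
        if (len >= blen + 2 && in[0] == '-' && in[1] == '-' &&
            strncmp(in + 2, boundary, blen) == 0) {
            base64_part = 0;
            skipping = 0;
        }

        if (!skipping) {
            if (used + len >= outsz)
                return -1;
            memcpy(out + used, in, len);
            used += len;

            if (strncasecmp(in, hdr, hlen) == 0) {
                const char *p = in + hlen;
                while (*p == ' ' || *p == '\t')
                    p++;
                if (strncasecmp(p, "base64", 6) == 0)
                    base64_part = 1;
            } else if (base64_part && (in[0] == '\n' || in[0] == '\r')) {
                skipping = 1;   /* blank line: headers end, body begins */
            }
        }
        in += len;
    }
    out[used] = '\0';
    return 0;
}
```

The part headers are kept, so the attachment's Content-Type is still available for indexing; only the encoded data is dropped.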

I don't know what you are trying to index, but you might look at:

http://search.cpan.org/doc/ERYQ/MIME-tools-5.411a/lib/MIME/Tools.pm
http://search.cpan.org/doc/VPARSEVAL/Mail-MboxParser-0.23/MboxParser.pm


>That means, I could set 
>
>TranslateCharacters :ascii7:
>WordCharacters 0123456789abcdefghijklmnopqrstuvwxyz.-
>
>and search for the string "Zürich"?

Here's how you test such things:

> cat c
TranslateCharacters :ascii7:
WordCharacters 0123456789abcdefghijklmnopqrstuvwxyz.-

> cat 1.html
Zürich

> ./swish-e -c c -i 1.html -T indexed_words -v 0
Indexing Data Source: "File-System"
    Adding:[swishdefault:1]   'zurich'   Pos:1  Stuct:0x1 ( FILE )
Indexing done!

> ./swish-e -w 'Zürich' -H 0
1000 1.html "1.html" 8




Bill Moseley
mailto:moseley@hank.org
Received on Mon Nov 26 17:38:40 2001