Re: How does the prog method work?

From: Bill Moseley <moseley(at)not-real.hank.org>
Date: Thu Aug 09 2001 - 23:22:22 GMT
At 04:01 PM 08/09/01 -0700, Scott Schultz wrote:
>Maybe I'm dense, but Perl has always been just so much
>obfuscation to me. I can't tell from the sample programs
>how the program called by the prog method informs Swish-E
>that one document has finished and that another is beginning.
>
>In other words, my understanding of the input program is
>that it feeds a stream of documents to Swish-E, which Swish-E
>indexes. However, the examples make it look as if the prog
>method program is nothing more than a sort of filter wrapper.
>
>Which is correct? If it IS used as an input stream, how does
>it demarc the documents?

Length.  It's kind of like an HTTP POST.  Swish reads the headers
line by line, and one of the headers is a Content-Length.  When swish reads
a blank line, it knows the content follows, and it then reads exactly
Content-Length bytes.  The plan is to expand the number of headers that can
be passed to swish.
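
To make the framing concrete, here is a rough sketch of a reader for that kind of stream, written in Perl. It is not Swish-E's own code, only an illustration of "headers, blank line, then Content-Length bytes". The sub name read_doc and the in-memory test stream are made up for the example; the header names are the ones used in the sample script.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not part of Swish-E): read one document from $fh.
# Headers are read line by line up to a blank line, then exactly
# Content-Length bytes of content follow. Returns an empty list at
# end of stream.
sub read_doc {
    my ($fh) = @_;
    my %header;
    while ( defined( my $line = <$fh> ) ) {
        chomp $line;
        last if $line eq '';    # blank line: the content follows
        my ( $name, $value ) = split /:\s*/, $line, 2;
        $header{$name} = $value;
    }
    return unless %header;      # nothing left in the stream

    my $want = $header{'Content-Length'};
    die "missing Content-Length\n" unless defined $want;

    my $got = read( $fh, my $content, $want );
    die "short read\n" unless defined $got && $got == $want;
    return ( \%header, $content );
}

# Build a two-document stream in memory and read it back.
my $stream = '';
for my $n ( 1 .. 2 ) {
    my $doc = "<html><body>doc $n</body></html>\n";
    $stream .= "Content-Length: " . length($doc) . "\n"
             . "Path-Name: File$n\n\n"
             . $doc;
}

open my $fh, '<', \$stream or die "open: $!";
while ( my ( $header, $content ) = read_doc($fh) ) {
    printf "%s: %d bytes\n", $header->{'Path-Name'}, length $content;
}
```

Nothing marks the end of a document except the byte count, so the next header line starts immediately after the last content byte.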

Perl is very fast.  You can use -S prog at -S fs speed.  What kills perl is
compiling the script for every document.  With swish calling the perl
program once, and the program then sending all the docs, you only pay the
expense of compiling perl once.

Does that make sense?

How about this:

> cat fast.pl
#!/usr/local/bin/perl -w
use strict;


for ( 1..50000 ) {

    my $doc = <<EOF;
<html>
<head>
    <title>This is doc $_!</title>
</head>
<body>
This is the text text and more $_ text
</body>
</html>
EOF


    my $path = "File$_";
    my $size = length $doc;
    my $mtime = time;

    print <<EOF;
Content-Length: $size
Last-Mtime: $mtime
Path-Name: $path

EOF

    print $doc;
}

> ./swish-e -S prog -i ./fast.pl 
Indexing Data Source: "External-Program"
Indexing ./fast.pl..
Removing very common words...
no words removed.
Writing main index...
Sorting words ...
Sorting 50007 words alphabetically
Writing header ...
Writing index entries ...
  Writing word text: Complete
  Writing word hash: Complete
  Writing word data: Complete
50007 unique words indexed.
Writing file list ...
Property Sorting complete.                                         
Writing sorted index ...
50000 files indexed.  6227788 total bytes.
Elapsed time: 00:00:19 CPU time: 00:00:18
Indexing done!

Wow! 50,000 "files" in under twenty seconds.
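
The same layout works for real files. Below is a minimal sketch, assuming the paths arrive on the script's command line (how they get there is left open). The sub name emit_doc is made up for the example. The files are read in raw mode so that length() counts bytes, which is what Content-Length has to match.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: write one file to $out as headers, a blank line,
# and then the content. Returns 1 on success, 0 if the file is skipped.
sub emit_doc {
    my ( $out, $path ) = @_;

    # Raw mode: no layers, so length() below is a byte count.
    open my $in, '<:raw', $path
        or do { warn "skipping $path: $!\n"; return 0 };

    my $mtime   = ( stat $in )[9];
    my $content = do { local $/; <$in> };    # slurp the whole file
    close $in;
    $content = '' unless defined $content;

    print {$out} "Content-Length: ", length($content), "\n",
                 "Last-Mtime: $mtime\n",
                 "Path-Name: $path\n",
                 "\n",
                 $content;
    return 1;
}

binmode STDOUT;    # keep the bytes on the way out unchanged
emit_doc( \*STDOUT, $_ ) for @ARGV;
```

Perl is started once, and every document goes down the same pipe, so the compile cost is paid a single time however many paths are passed in.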




Bill Moseley
mailto:moseley@hank.org
Received on Thu Aug 9 23:22:49 2001