At 08:56 PM 01/10/02 -0800, Frank Heasley wrote:
>>>* they are named, in sequence, 10000.htm to 19999.htm
>>>* I want to index an arbitrary subset of 1000 of those files
>>>
>>>A regex won't match.
>><<<<
>>Well, I wonder if you could /1[0-9][0-9][0-9][0-9]\.htm/
>
>you could, but that would match all of your files.
Sorry, I was doing too many things at once. I thought you wanted to match
that range.
>> my ($num) = /(\d+)\.html?/;
>> return unless $num && $num > 1235 && $num < 8712;
>
>Umm... I think you're trying to produce a list of files here, which is not
>our problem... we already know what the list is - it's a pre-assigned set
>of files.
Again, I was thinking of a range of numbers.
If you already know which files you want, list them with IndexDir in the
config. I've done it LOTS of times when I'm trying to narrow down a bug to a
single source file. I have listed thousands of files, and then kept cutting
the list in half until I found the problem.
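
As a rough illustration of the IndexDir approach, the sketch below turns a known list of files into IndexDir lines and splits the list in half for bisecting. The helper names (indexdir_lines, halves) and the sample file list are made up for this example, and it assumes the file names contain no spaces, so that several can share one IndexDir line.

```python
def indexdir_lines(paths, per_line=10):
    """Yield IndexDir directives, a few paths per line to keep lines short."""
    for i in range(0, len(paths), per_line):
        yield "IndexDir " + " ".join(paths[i:i + per_line])


def halves(paths):
    """Split the list in two so each half can be indexed on its own."""
    mid = len(paths) // 2
    return paths[:mid], paths[mid:]


if __name__ == "__main__":
    # Hypothetical list; in practice this is whatever set of files you know.
    files = [f"{n}.htm" for n in range(10000, 10020)]
    first, second = halves(files)
    # Paste these lines into the config, index, and see whether the problem
    # shows up; if not, repeat with the other half.
    print("\n".join(indexdir_lines(first)))
```

Whichever half still shows the problem gets passed to halves() again, until a single file is left.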
If your list of files is generated, then the -S prog approach will be good,
since you can put the code in to determine what files to index on the fly.
Sorry for not reading your question more carefully!
>I assume there's some mechanism using swishspider that retrieves and
>indexes web files one by one. Would that, perhaps, be an approach?
Well, not swishspider. swishspider only fetches documents; it's not really
a spider (unlike spider.pl, used with -S prog, which is a real spider).
If you mean -S prog, sure. That's exactly how it works. You write a
program to fetch the files (records, URLs, whatever) and feed them to swish.
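
A minimal sketch of such a program, written in Python here only for illustration (-S prog runs any executable, and spider.pl itself is Perl). For each document it prints a Path-Name header and a Content-Length header (in bytes), then a blank line, then the document itself. The selection rule in pick_files (a seeded random sample of 1000 names from 10000.htm to 19999.htm) is a stand-in; swap in whatever decides your real subset. The function names and the directory argument are assumptions of this example, not part of swish.

```python
import os
import random
import sys


def pick_files(first=10000, last=19999, count=1000, seed=0):
    """Return `count` names from first.htm .. last.htm (hypothetical rule)."""
    rng = random.Random(seed)
    numbers = sorted(rng.sample(range(first, last + 1), count))
    return [f"{n}.htm" for n in numbers]


def format_document(path, content, mtime=None):
    """Build one record: headers, a blank line, then the raw document bytes."""
    headers = [f"Path-Name: {path}", f"Content-Length: {len(content)}"]
    if mtime is not None:
        headers.append(f"Last-Mtime: {int(mtime)}")
    return ("\n".join(headers) + "\n\n").encode("utf-8") + content


def main(directory="."):
    out = sys.stdout.buffer  # write bytes so Content-Length stays exact
    for name in pick_files():
        path = os.path.join(directory, name)
        if not os.path.isfile(path):
            continue  # skip names that do not exist on disk
        with open(path, "rb") as fh:
            content = fh.read()
        out.write(format_document(path, content, os.path.getmtime(path)))


if __name__ == "__main__":
    main(*sys.argv[1:2])
```

If this were saved as an executable script called, say, pick.py (a made-up name), it would be run with something like: swish-e -S prog -i ./pick.py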
--
Bill Moseley
mailto:moseley@hank.org
Received on Fri Jan 11 05:10:36 2002