I modified the swish.cgi and swish.conf files and I have made some progress.
The links no longer have the NULL statement. However, the files are still
inaccessible. When I check the URL for a file, it points to the cgi-bin
directory, when in reality the file is in the documentation directory.
The swish.cgi file is located in the cgi-bin directory, and the swish.conf
file is in the documentation directory.
When I created the index, I was in the documentation directory, and the
command I used was:
/usr/local/bin/swish-e -c swish.conf -v 3
I've included the two files in this e-mail.
The 'spaces' that I mentioned in the previous e-mail refer to the filenames.
For example, one file that has been indexed is:
Windows Workstation Environment Variables for IDL.pdf
-----Original Message-----
From: swish-e@sunsite.berkeley.edu
[mailto:swish-e@sunsite.berkeley.edu]On Behalf Of Bill Moseley
Sent: Tuesday, January 06, 2004 4:55 PM
To: Multiple recipients of list
Subject: [SWISH-E] Re: Unable to retrieve documents
On Tue, Jan 06, 2004 at 10:17:06AM -0800, Kaplan, Andrew H. wrote:
> I have set up our webserver such that the swish.cgi page comes up when
> a person wants to retrieve a document. When the text is entered the
> results screen does appear with the appropriate links to the documents
> in question. However, users are unable to access the documents.
Seems like if they can't be accessed then they are not appropriate
links.
> The results screen does show the names of the files with their extensions,
> ie: pdf, doc, etc. Immediately under the files the word NULL appears in
> parentheses.
That NULL is in the FAQ. See the swish.cgi docs.
> The information about the file including its modification date, size, and
> path also appears. Clicking on the file causes the error screen
>
> Not Found -- The requested url was not found on this server
> to appear.
Well, that's just a web server issue -- you have to make sure the paths
point to the right locations.
You can rewrite the path when indexing (in the swish-e config file)
with ReplaceRules, and you can also prepend text to each path with a
setting in the swish.cgi config file.
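A minimal sketch of the first option, assuming the documents are served under
a /documentation/ URL path. That prefix and the swish.conf.example file name
are hypothetical, so substitute your own. It writes a one-rule config fragment
using the prepend form of ReplaceRules:

```shell
# Sketch only: /documentation/ is an assumed URL prefix and
# swish.conf.example is a hypothetical file name.
cat > swish.conf.example <<'EOF'
# Prepend the assumed URL prefix to every path stored in the index.
ReplaceRules prepend /documentation/
EOF

# Show the rule that would be applied at indexing time.
grep '^ReplaceRules' swish.conf.example
```

The index has to be rebuilt before the rule takes effect. For the second
option, the corresponding swish.cgi setting is believed to be prepend_path;
treat that name as an assumption and confirm it in the swish.cgi docs for
your version.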
> The files that are being indexed are either Adobe pdf, MS-Word doc, MS-Excel
> xls, and htm documents. They all have spaces between the words in their
> titles. The server itself has the catdoc, xls2csv, and xpdf programs
> installed.
Space between the words in their "titles"? Or do you mean file names? I
suspect you mean file names. You don't give much detail so I can't know for
sure, but here's an example of indexing a file with a space in its name.
Notice that the href is correct:
moseley@bumby:~/apache$ echo "hello" > "file with space.txt"
moseley@bumby:~/apache$ swish-e -i "file with space.txt" -v0
moseley@bumby:~/apache$ GET http://localhost/apache/swish.cgi?query=hello | grep txt
<dt>1 <a href="file%20with%20space.txt">file with space.txt</a> <small>-- rank: <b>1000</b></small></dt>
<tr><td><small>Document Path:</small></td><td><small> <b>file with space.txt</b></small></td></tr>
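To make the encoding in that href concrete, here is a small sketch that turns
a file name with spaces into the form a link needs. It uses the file name
from the original report and handles spaces only; other reserved characters
would need their own escapes, and this is not how swish.cgi itself does it.

```shell
# Sketch: percent-encode the spaces in a file name for use in an href.
# Only spaces are handled here; this is an illustration, not swish.cgi's code.
name="Windows Workstation Environment Variables for IDL.pdf"
href=$(printf '%s' "$name" | sed 's/ /%20/g')

# The encoded form goes in the href; the plain name is what the user sees.
printf '<a href="%s">%s</a>\n' "$href" "$name"
```

If the link in the results page is encoded like this and still returns Not
Found, the space is not the problem and the directory part of the URL is the
place to look.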
> What do I need to do to correct this problem? Thanks.
Something like the above few lines that demonstrate the problem.
Here's another example with spidering:
moseley@bumby:~/apache$ cp test.pdf "test pdf with spaces.pdf"
moseley@bumby:~/apache$ /usr/local/lib/swish-e/spider.pl default http://localhost/apache/test%20pdf%20with%20spaces.pdf | swish-e -S prog -i stdin -v0
/usr/local/lib/swish-e/spider.pl: Reading parameters from 'default'
Summary for: http://localhost/apache/test%20pdf%20with%20spaces.pdf
Total Bytes: 12,593 (12593.0/sec)
Total Docs: 1 (1.0/sec)
Unique URLs: 1 (1.0/sec)
moseley@bumby:~/apache$ GET http://localhost/apache/swish.cgi?query=the | grep pdf
<dt>1 <a href="http://localhost/apache/test%20pdf%20with%20spaces.pdf">http://localhost/apache/test pdf with spaces.pdf</a> <small>-- rank: <b>1000</b></small></dt>
<tr><td><small>Document Path:</small></td><td><small> <b>http://localhost/apache/test pdf with spaces.pdf</b></small></td></tr>
--
Bill Moseley
moseley@hank.org
*********************************************************************
Due to deletion of content types excluded from this list by policy,
this multipart message was reduced to a single part, and from there
to a plain text message.
*********************************************************************
Received on Wed Jan 7 14:55:20 2004