Re: [swish-e] Swish-e not indexing doc or PDF files

From: Dr Michael Daly <gp(at)not-real.holisticgp.com.au>
Date: Wed Feb 13 2008 - 00:02:33 GMT
Could this be a permissions issue?
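
One rough way to check, run as the same account that runs swish-e, is sketched below in Python. It only tests whether a local copy of the PDF is readable and whether the default temp directory is writable. The file name `25.pdf` is a hypothetical local copy, and the idea that the filter needs a writable temp directory is an assumption on my part, not something confirmed in this thread.

```python
import os
import tempfile


def check_access(path):
    """Report whether the current user can see and read `path`."""
    return {
        "exists": os.path.exists(path),
        "readable": os.access(path, os.R_OK),
    }


def temp_dir_writable():
    """Return True if a temp file can be created and written in the default temp dir."""
    try:
        with tempfile.NamedTemporaryFile() as handle:
            handle.write(b"test")
        return True
    except OSError:
        return False


if __name__ == "__main__":
    # "25.pdf" is a hypothetical local copy of the problem document.
    print("local PDF:", check_access("25.pdf"))
    print("temp dir writable:", temp_dir_writable())
```

If both checks pass for that account, permissions on the indexing machine are probably not the cause.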

Rgds

> Hi,
>
> I am using spider.pl to crawl. I have only one PDF on the entire intranet
> as a test. I have tried both the domain and the IP address in the hyperlink.
> I did some extensive testing yesterday. The strange thing is that if I run
> pdftotext or pdftohtml directly on a local file, it generates the
> output correctly.
> It seems to have a big problem opening the PDF when running under swish-e.
> This same PDF can be opened directly from a browser (as a binary file),
> and as stated before it opens when applying pdftotext and pdftohtml
> directly in cmd.
> Here's the pdftohtml error:
>
>  (523 words)
> http://*****.au/dsdweb/v4/apps/web/secure/docs/25.pdf - Using HTML2 parser -
> Error: Couldn't open file ''http://*****.au/dsdweb/v4/apps/web/secure/docs/25.pdf''
>  (no words indexed)
>
> Also, I am not sure how to turn on the -T debugging. Can you assist me
> with this?
> Verbose is active.
>
> Thanks.
> Liam.
>
> -----Original Message-----
> From: users-bounces@lists.swish-e.org
> [mailto:users-bounces@lists.swish-e.org] On Behalf Of Peter Karman
> Sent: Tuesday, 12 February 2008 1:01 PM
> To: Swish-e Users Discussion List
> Subject: Re: [swish-e] Swish-e not indexing doc or PDF files
>
>
>
> Liam Buchanan wrote on 2/11/08 6:26 PM:
>> Hi,
>> Hope someone can suggest a solution to this frustrating problem.
>> We are running swish-e on our development server, which indexes our
>> production intranet server. The problem is that the indexer cannot
>> process .doc or PDF files. When the crawl reaches a hyperlink to a
>> PDF or .doc file, the process halts and produces the error message
>> below (under output). Before running swish-e, we connect to our
>> production server via a proxy (ntlmaps).
>
> It isn't clear to me how you are aggregating your documents. spider.pl?
> Some other crawler?
>
> The FileFilter config can work at odds with the SWISH::Filter stuff in
> spider.pl, effectively trying to convert non-text files twice.
>
> Try indexing a single troublesome document. Break down the process:
> fetching the doc, feeding it to swish-e, etc. Turn on verbosity and the
> -T debugging options.
>
> --
> Peter Karman  .  http://peknet.com/  .  peter(at)not-real.peknet.com
> _______________________________________________
> Users mailing list
> Users@lists.swish-e.org
> http://lists.swish-e.org/listinfo/users
>


Dr Michael Daly MB, BS
GradDip(Integrative Medicine), GradCert(Evidence Based Practice),
M Bus(Information Innovation), GradDip(Document Management)
03 9521 0352
0413 879 029
http://www.holisticgp.com.au/contactdetails.htm

Received on Tue Feb 12 19:02:32 2008