Indexing files without an extension

From: dennis lastor <dennis.lastor(at)not-real.gmail.com>
Date: Tue Feb 07 2006 - 03:27:11 GMT
I am trying to index a wiki page that contains links to other wiki pages
without extensions.

For example one of the pages could be http://internal_site/Page_With_Text

I have read through several of the FAQs and threads but have not been able
to find anything on this topic.  I have no trouble indexing PDFs, DOCs, TXT,
HTML, etc., and everything works GREAT!  I would just like to index these
pages without extensions.

I am using the "prog" method by running:

swish-e -S prog -c swish.conf

My swish.conf looks like:

# Example for spidering
# Use the "spider.pl" program included with Swish-e
IndexDir spider.pl


#Path to filters
FilterDir /tool/bin/


# Define what sites to index.  Just add to the bottom of this

SwishProgParameters default http://Internal_Site/WegPage1 \
                            http://Internal_Site/WebPage2 \
                            http://Internal_Site/WebPage3





# ? DefaultContents HTML2
IndexContents HTML* .htm .html .shtml .pdf .doc .ppt .xls
StoreDescription HTML* <body> 300


# Look at PDFs
#FileFilter .pdf /tool/bin/pdftotext   "'%p' -"




#Break the word up into stemmed words
FuzzyIndexingMode Stemming_en


# Show ALL info while indexing
IndexReport 3


#compress
CompressPositions yes

Whenever I run swish-e it correctly indexes all of the PDFs, etc., but
not the internal wiki pages (without extensions); for those it says
there are no unique words to index.
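
For reference, the commented-out DefaultContents line in the config above
hints at one possible direction. A minimal sketch, assuming the wiki pages
are served as HTML and that spider.pl is actually fetching them (neither
is confirmed here):

```
# Assumption: the extensionless wiki pages are HTML.
# IndexContents maps only the listed extensions to a parser;
# DefaultContents names the parser for documents that match none of them.
DefaultContents HTML*
IndexContents HTML* .htm .html .shtml .pdf .doc .ppt .xls
```

If the spider never retrieves the pages in the first place, this setting
alone would not change the result.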

I am also not sure if the 'CompressPositions yes' will compress the index
files or not.

Any help would be greatly appreciated.  Swish-e has been invaluable in
indexing our tech documents, and I would love to have it index these
wiki pages, where most of our documents exist.

Thanks again!
Dennis



Received on Mon Feb 6 19:27:15 2006