Re: More clues on indexing problem

From: Bill Moseley <moseley(at)not-real.hank.org>
Date: Tue Sep 24 2002 - 03:23:39 GMT
At 07:30 AM 09/23/02 -0700, H Vernon Leighton wrote:
>As I said before, if you leave obeyRobotsNoIndex off, then swish-e will
>index all of the pages in the directory tree. However, with obeyRobots set
>to yes, it will index all of the pages properly, but it will not return all
>of them from a legitimate search. We have tried -T INDEXED_WORDS and other
>tests, and the pages are apparently being indexed, just not returned from
>the search. 

..

>Say the word "testword" is in three documents: sample1.html, sample2.html
>and sample3.html. Say that swish-e indexes them in that exact order.
>
>If sample2.html has the tag <meta name="robots" content="noindex">, then in
>INDEXED_WORDS, both sample1.html and sample3.html appear in the index under
>"testword." However, if you do the search:
>
>swish-e -w testword -f swish_test.index
>
>you will only get sample3.html returned from the search.

Is this what you are describing?  sample2.html has the noindex tag:

~/swish-e.2.2-patches/src > cat sample?.html
<html>
<head>
<title>This is sample1</title>
</head>
<body>
testword
</body>
</html>

<html>
<head>
<title>This is sample2</title>
<meta name="robots" content="noindex">
</head>
<body>
testword
</body>
</html>

<html>
<head>
<title>This is sample3</title>
</head>
<body>
testword
</body>
</html>

Here's the config file:

~/swish-e.2.2-patches/src > cat c
IndexContents HTML2 .html
obeyRobotsNoIndex yes

Now index, and see that sample2.html is skipped:

~/swish-e.2.2-patches/src > ./swish-e -c c -i sample1.html sample2.html sample3.html -v3
Parsing config file 'c'
Indexing Data Source: "File-System"
Indexing "sample1.html"

Checking file "sample1.html"...
  sample1.html - Using HTML2 parser -  (4 words)
Indexing "sample2.html"

Checking file "sample2.html"...
  sample2.html - Using HTML2 parser -  (Skipped due to Robots Excluion Rule in meta tag)
Indexing "sample3.html"

Checking file "sample3.html"...
  sample3.html - Using HTML2 parser -  (4 words)

Removing very common words...
no words removed.
Writing main index...
Sorting words ...
Sorting 5 words alphabetically
Writing header ...
Writing index entries ...
  Writing word text: Complete
  Writing word hash: Complete
  Writing word data: Complete
5 unique words indexed.
4 properties sorted.                                              
2 files indexed.  297 total bytes.  11 total words.
Elapsed time: 00:00:00 CPU time: 00:00:00
Indexing done!

Now search.  I get sample1 and sample3 as expected.

~/swish-e.2.2-patches/src > ./swish-e -w testword
# SWISH format: 2.2.1
# Search words: testword
# Number of hits: 2
# Search time: 0.000 seconds
# Run time: 0.050 seconds
1000 sample3.html "This is sample3" 86
1000 sample1.html "This is sample1" 86
.

>If, however, the
>"noindex" tag appears in sample1.html and not in sample2.html, then you
>will get sample2.html and sample3.html returned in the search. If the
>"noindex" tag appears in sample3.html only, then you will get no results
>from the search for "testword". With obeyRobotsNoIndex off, you will get
>all three no matter where the noindex tag is. 

Interesting.  I don't see that at all in the test setup I'm using above.

There is special code we use to back out a partially indexed file (when
"noindex" is found).  That is the only place where I would suspect a
problem could be happening.

Can you put together a set of sample files, like the ones above, that
demonstrates this?
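
A minimal sketch of one way to build such a test set, assuming a POSIX
shell and a swish-e binary on the PATH (the transcript above ran ./swish-e
from the source directory instead).  The make_samples helper and the
case1..case3 directory names are made up for illustration; the config lines
and swish-e options are the ones already shown in this message.  It writes
three copies of the sample files, with the noindex tag in sample1, sample2
and sample3 respectively, then indexes and searches each set if swish-e is
available:

```shell
#!/bin/sh
# Hypothetical helper: write sample1..3.html into directory $1, putting the
# robots "noindex" meta tag only in the sample whose number is $2.
make_samples() {
    dir=$1
    noindex=$2
    mkdir -p "$dir" || return 1
    for n in 1 2 3; do
        {
            printf '<html>\n<head>\n<title>This is sample%s</title>\n' "$n"
            if [ "$n" = "$noindex" ]; then
                printf '%s\n' '<meta name="robots" content="noindex">'
            fi
            printf '</head>\n<body>\ntestword\n</body>\n</html>\n'
        } > "$dir/sample$n.html"
    done
}

# One directory per position of the noindex tag (first, middle, last).
for pos in 1 2 3; do
    make_samples "case$pos" "$pos"
    # Same two-line config file as in the transcript above.
    printf 'IndexContents HTML2 .html\nobeyRobotsNoIndex yes\n' > "case$pos/c"
    # Index and search only if swish-e is installed (assumed to be on PATH).
    if command -v swish-e >/dev/null 2>&1; then
        ( cd "case$pos" &&
          swish-e -c c -i sample1.html sample2.html sample3.html -v3 &&
          swish-e -w testword )
    fi
done
```

If the search in any of the three directories does not return exactly the
two files without the tag, that directory is the test case to send.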

Thanks,

BTW --

>IndexOnly .html .htm

..

>NoContents .doc .gif .js .pdf .php .txt .xml 

Those won't be indexed, because IndexOnly limits indexing to .html and
.htm files.
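
If the intent is to have the file names of those other types indexed
without their contents, the suffixes would presumably also have to be
listed in IndexOnly.  A sketch, written as a shell here-document in the
style of the "cat c" above; the file name swish_nocontents.conf is
hypothetical, and this assumes NoContents only applies to files that
IndexOnly lets through:

```shell
# Write a hypothetical config file; both directives are the ones quoted above.
cat > swish_nocontents.conf <<'EOF'
# Let the extra suffixes through the IndexOnly filter...
IndexOnly .html .htm .doc .gif .js .pdf .php .txt .xml
# ...so that NoContents has files of those types to act on.
NoContents .doc .gif .js .pdf .php .txt .xml
EOF
```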



-- 
Bill Moseley
mailto:moseley@hank.org
Received on Tue Sep 24 03:27:15 2002