Re: More clues on indexing problem

From: H Vernon Leighton <vleighto(at)not-real.Princeton.EDU>
Date: Tue Sep 24 2002 - 12:22:56 GMT
Dear Swish-e list,

  Well, I apologize if I have made blanket statements that prove not to be
correct in all cases. We are still having a problem, but it is, of course,
not as simple as I would like to believe. 

  I had been slipping the word "testword" in and out of large, complicated
pages that were already on our site. So in my previous email, I gave the
example of simple1.html, etc., but I had not created the simplest case of
very small pages. I was only using those names to make my description
clear. 

So, in order to show Bill exactly what the output was, I created three
real sample files and put the word "testword" in each of them. When I ran
my "noindex" test on that small group, the indexer behaved itself: if
sample2.html had the "noindex" tag, both 1 and 3 were returned. That output
is the first batch listed below. So Swish-e does work as advertised under
some circumstances.

So I went back and reinserted "testword" into some of the production
pages, and the erroneous behavior came back. Because of that, I have
created a second batch of output, which shows what happens when the
"noindex" tag is in one of the real production pages on the site. Perhaps
the header include or the JavaScript is causing some of the problem?

####  First batch: testword just in sample pages. a00002.html does not have
the word "testword".

##Copy of files:

<html>
<head>
<title>sample</title>
</head>
<body>
testword
</body>
</html>

##For sample2.html, I inserted:

<meta name="robots" content="noindex">
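For anyone who wants to recreate this small test set, here is a minimal
sketch in Python. The function name write_samples and the placement of the
meta tag just before the title are my assumptions for illustration; the
page body is the sample file shown above.

```python
from pathlib import Path
import tempfile

# Template for the sample page shown above. {meta} is either empty or the
# robots tag. Placing it before <title> is an assumption, not something
# stated in the original message.
PAGE = """<html>
<head>
{meta}<title>sample</title>
</head>
<body>
testword
</body>
</html>
"""

NOINDEX = '<meta name="robots" content="noindex">\n'


def write_samples(directory, noindex_names=("sample2.html",)):
    """Write sample1.html..sample3.html into directory; return their paths."""
    directory = Path(directory)
    directory.mkdir(parents=True, exist_ok=True)
    paths = []
    for n in (1, 2, 3):
        name = f"sample{n}.html"
        meta = NOINDEX if name in noindex_names else ""
        path = directory / name
        path.write_text(PAGE.format(meta=meta), encoding="utf-8")
        paths.append(path)
    return paths


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        for p in write_samples(tmp):
            # Print each file name and whether it carries the noindex tag.
            print(p.name, "noindex" in p.read_text(encoding="utf-8"))
```

Pointing the indexer at the resulting directory should then reproduce the
small, well-behaved case.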

##Part of output from indexing:

bash-2.03$ swish-e -c swish_test.conf -f swish_test.index 
Indexing Data Source: "File-System"
Indexing "/usr/local/apache/htdocs/data"

In dir "/usr/local/apache/htdocs/data/a00002":
  sample1.html - Using HTML2 parser -  (2 words)
  a00002.html - Using HTML2 parser -  (Skipped due to Robots Excluion Rule
in meta tag)
  sample2.html - Using HTML2 parser -  (Skipped due to Robots Excluion Rule
in meta tag)
  sample3.html - Using HTML2 parser -  (2 words)

Removing very common words...
no words removed.
Writing main index...
Sorting words ...
Sorting 1850 words alphabetically
Writing header ...
Writing index entries ...
  Writing word text: Complete
  Writing word hash: Complete
  Writing word data: Complete
1850 unique words indexed.
5 properties sorted.						  
40 files indexed.  11755964 total bytes.  13400 total words.
Elapsed time: 00:00:01 CPU time: 00:00:01
Indexing done!
bash-2.03$ 

## Here is the search and response:

bash-2.03$ swish-e -w testword -f swish_test.index 
# SWISH format: 2.2
# Search words: testword
# Number of hits: 2
# Search time: 0.001 seconds
# Run time: 0.112 seconds
200 /data/a00002/sample3.html "sample" 76
200 /data/a00002/sample1.html "sample" 76
.

##Comment: As you can see, this works correctly.
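
To compare search results between runs without reading them by eye, the
result lines can be parsed mechanically. This is only a sketch: parse_hits
is a hypothetical helper, and it assumes each hit line has the shape seen
above (a number, a path or URL without spaces, a quoted title, and a
trailing number, which I take to be the rank and the file size).

```python
import re

# One hit line looks like: 200 /data/a00002/sample3.html "sample" 76
# Assumption: the path contains no spaces.
HIT = re.compile(r'^(\d+)\s+(\S+)\s+"(.*)"\s+(\d+)$')


def parse_hits(output):
    """Return (rank, path, title, size) tuples for each hit line in output."""
    hits = []
    for line in output.splitlines():
        match = HIT.match(line.strip())
        if match:  # header lines ("# ...") and the final "." do not match
            rank, path, title, size = match.groups()
            hits.append((int(rank), path, title, int(size)))
    return hits


# The search response from the first batch, copied from above.
SAMPLE = """# SWISH format: 2.2
# Search words: testword
# Number of hits: 2
# Search time: 0.001 seconds
# Run time: 0.112 seconds
200 /data/a00002/sample3.html "sample" 76
200 /data/a00002/sample1.html "sample" 76
.
"""

if __name__ == "__main__":
    for rank, path, title, size in parse_hits(SAMPLE):
        print(rank, path, title, size)
```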

#### Second batch: Now the bad case:

I have placed the word "testword" in /data/a00001/a00001.html, in all of
the sample pages, and in /data/a00002/a00002.html. I slipped the "noindex"
tag into both sample2.html and a00002.html.

##The top of the a00002.html file:

<html>
<head>
<meta content="text/html; charset=UTF-8" http-equiv="Content-Type">
<title>testword DataSet a00002</title>
<script src="/includes/standard.js"></script>
<meta name="robots" content="noindex">
<meta name="description" content="This data collection offers information
on ...">
</head>
<body bgcolor="#FFFFFF" text="#000000">
<!--#include virtual="/includes/header.html"-->
<table width="700" border="0" cellspacing="0" cellpadding="0">
<tr><td width="135" valign="top">
...
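
To double-check which pages really carry the tag, independently of the
indexer, something like the following sketch could be run over each file.
It uses Python's standard html.parser; the names RobotsMetaParser and
has_noindex are hypothetical. It only reports what Python's parser sees
and says nothing about how the Swish-e HTML2 parser reads the same page.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Records whether a <meta name="robots"> tag lists "noindex"."""

    def __init__(self):
        super().__init__()
        self.noindex = False

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # HTMLParser lowercases tag and attribute names; values may be None.
        attributes = {name: (value or "") for name, value in attrs}
        if attributes.get("name", "").lower() != "robots":
            return
        directives = [d.strip().lower()
                      for d in attributes.get("content", "").split(",")]
        if "noindex" in directives:
            self.noindex = True


def has_noindex(html_text):
    """Return True if the HTML text contains a robots noindex meta tag."""
    parser = RobotsMetaParser()
    parser.feed(html_text)
    return parser.noindex


if __name__ == "__main__":
    head = (
        '<html>\n<head>\n'
        '<title>testword DataSet a00002</title>\n'
        '<script src="/includes/standard.js"></script>\n'
        '<meta name="robots" content="noindex">\n'
        '</head>\n'
    )
    print(has_noindex(head))
```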


##The relevant output from the indexing:

bash-2.03$ swish-e -c swish_test.conf -f swish_test.index 
Indexing Data Source: "File-System"
Indexing "/usr/local/apache/htdocs/data"

In dir "/usr/local/apache/htdocs/data/a00001":
  a00001.html - Using HTML2 parser -  (328 words)

In dir "/usr/local/apache/htdocs/data/a00002":
  sample1.html - Using HTML2 parser -  (2 words)
  a00002.html - Using HTML2 parser -  (Skipped due to Robots Excluion Rule
in meta tag)
  sample2.html - Using HTML2 parser -  (Skipped due to Robots Excluion Rule
in meta tag)
  sample3.html - Using HTML2 parser -  (2 words))

Removing very common words...
no words removed.
Writing main index...
Sorting words ...
Sorting 1850 words alphabetically
Writing header ...
Writing index entries ...
  Writing word text: Complete
  Writing word hash: Complete
  Writing word data: Complete
1850 unique words indexed.
5 properties sorted.						  
40 files indexed.  11755972 total bytes.  13401 total words.
Elapsed time: 00:00:01 CPU time: 00:00:01
Indexing done!

##Output from the search:

bash-2.03$ swish-e -w testword -f swish_test.index 
# SWISH format: 2.2
# Search words: testword
# Number of hits: 1
# Search time: 0.001 seconds
# Run time: 0.111 seconds
1000 http://www.cpanda.org/data/a00002/sample3.html "sample" 76
.

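##Comment: As you can see, this does not work correctly. Only sample3.html
is returned. The "noindex" in a00002.html seems to block the pages indexed
before it (a00001.html and sample1.html), even though neither has the tag.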

#####
Received on Tue Sep 24 12:26:27 2002