Re: [swish-e] Swish-E IgnoreMetaTags does not work

From: Peter Karman <peter(at)not-real.peknet.com>
Date: Fri Jul 30 2010 - 15:16:07 GMT

Dennis Gerasimenko wrote on 7/27/10 12:50 PM:
> Hi.
> 
> I am running Swish-E 2.4.7 (on RHEL5) and I am trying to skip a few HTML
> tags (specifically “script” as in <script ...></script>) inside HEAD and
> BODY, while parsing HTML files but, despite configuration directive
> “IgnoreMetaTags script style link select”, tag “script” is still being
> parsed. That generates many errors such as:
> 
> error: Unexpected end tag : dt '<dt>' + listingName + '</dt>' +

The error seems to come from libxml2. I was able to reproduce it (see the end of
this email).

I'm not sure if this is a bug in swish-e or not. But to work around it, I would
probably just strip out the <script> tag content before handing the file to
swish-e. If you're using spider.pl or DirTree.pl, you could probably add a
simple regex to strip out the contents of the <script> tags with:

 $buf =~ s,<script[^>]*>.*?</script>,,sgi;   # .*? also matches an empty <script src=...></script>

See the 'filter_content' callback in spider.pl and the process_file() function
in DirTree.pl.
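
As a rough, standalone sketch of that workaround: the helper name
strip_script_tags() is made up for illustration, and the argument order of
the filter_content callback (URI, server, response, reference to the content)
is an assumption based on the spider.pl documentation, so check it against
your version. The pattern uses .*? so that an empty element such as
<script src="x.js"></script> is matched on its own instead of swallowing
everything up to the next </script>.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: remove every <script>...</script> element,
# including empty ones, and return the cleaned markup.
sub strip_script_tags {
    my ($html) = @_;
    $html =~ s{<script\b[^>]*>.*?</script\s*>}{}sgi;
    return $html;
}

# Sketch of a spider.pl 'filter_content' callback. The argument order is an
# assumption; the content is assumed to arrive as a scalar reference.
sub filter_content {
    my ( $uri, $server, $response, $content_ref ) = @_;
    $$content_ref = strip_script_tags($$content_ref);
    return 1;    # true value: go ahead and index the document
}

# Quick check against markup similar to the test case in this thread.
my $html = <<'HTML';
<html>
 <head>
  <title>i have script</title>
  <script type="text/javascript">
   var foo = '<foo>bar</foo>';
  </script>
 </head>
 <body>
  <p>hello world</p>
 </body>
</html>
HTML

print strip_script_tags($html);
```

Run on its own, the script prints the page with the <script> element removed,
so libxml2 never sees the markup embedded in the JavaScript string.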

Or use swish3 with a custom Aggregator where you filter the content yourself:

 % swish3 -S MyAggregator -c conffile -i files

where MyAggregator looks like:

 package MyAggregator;
 use strict;
 use base qw( SWISH::Prog::Aggregator::FS );

 sub init {
    my $self = shift;
    $self->SUPER::init(@_);
    $self->set_filter( \&my_filter );
 }

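 # Called for each document before it is parsed.
 sub my_filter {
    my $doc = shift;
    my $buf = $doc->content;
    # .*? (not .+?) so an empty <script src="..."></script> matches on its own
    $buf =~ s,<script[^>]*>.*?</script>,,sgi;
    $doc->content($buf);
    return $doc;
 }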
 1;


My test case is below.

[karpet@pekmac:~/tmp/nometa]$ cat script.html
<html>
 <head>
  <title>i have script</title>
  <script type="text/javascript">
   //<!-- noindex -->//
   var foo = '<foo>bar</foo>';
   //<!-- index -->//
  </script>
 </head>
 <body>
  <p>hello world</p>
 </body>
</html>

[karpet@pekmac:~/tmp/nometa]$ cat conf
# Ignore select HTML tag
IgnoreMetaTags script style link select

[karpet@pekmac:~/tmp/nometa]$ swish-e -c conf -i script.html -T indexed_words parsed_words parsed_tags parsed_text properties -v9
Parsing config file 'conf'
Indexing Data Source: "File-System"
Indexing "script.html"

Checking file "script.html"...
  script.html - Using DEFAULT (HTML2) parser - i have script
White-space found word 'i'
    Adding:[1:swishdefault(1)]   'i'   Pos:5  Stuct:0x7 ( HEAD TITLE FILE )
White-space found word 'have'
    Adding:[1:swishdefault(1)]   'have'   Pos:6  Stuct:0x7 ( HEAD TITLE FILE )
White-space found word 'script'
    Adding:[1:swishdefault(1)]   'script'   Pos:7  Stuct:0x7 ( HEAD TITLE FILE )
<script> (meta [no meta name defined] *Start Ignore*)
<script> (property [no meta name defined] *Start Ignore*)
script.html:6: error: Unexpected end tag : foo
   var foo = '<foo>bar</foo>';
                            ^
</script> (meta) end ignore
</script> (property) end ignore

hello world
White-space found word 'hello'
    Adding:[1:swishdefault(1)]   'hello'   Pos:14  Stuct:0x9 ( BODY FILE )
White-space found word 'world'
    Adding:[1:swishdefault(1)]   'world'   Pos:15  Stuct:0x9 ( BODY FILE )

 (5 words)
          swishdocpath: 6 ( 11) S: "script.html"
            swishtitle: 7 ( 13) S: "i have script"
          swishdocsize: 8 (  8) N: "225"
     swishlastmodified: 9 (  8) D: "2010-07-30 09:54:53 CDT"

Removing very common words...
no words removed.
Writing main index...
Sorting words ...
Sorting 5 words alphabetically
Writing header ...
Writing index entries ...
  Writing word text: Complete
  Writing word hash: Complete
  Writing word data: Complete
5 unique words indexed.
4 properties sorted.
1 file indexed.  225 total bytes.  5 total words.
Elapsed time: 00:00:00 CPU time: 00:00:00
Indexing done!

[karpet@pekmac:~/tmp/nometa]$ swish3 -S MyAggregator -c conf -i script.html -v
Indexing Data Source: "External-Program"
Indexing "stdin"
Removing very common words...
no words removed.
Writing main index...
Sorting words ...
Sorting 5 words alphabetically
Writing header ...
Writing index entries ...
  Writing word text: Complete
  Writing word hash: Complete
  Writing word data: Complete
5 unique words indexed.
4 properties sorted.
1 file indexed.  105 total bytes.  5 total words.   # NOTE: byte count is lower (105 vs. 225) because the <script> content was stripped
Elapsed time: 00:00:00 CPU time: 00:00:00
Indexing done!
1 documents in 00:00:00


-- 
Peter Karman  .  http://peknet.com/  .  peter(at)not-real.peknet.com
Received on Fri Jul 30 11:16:10 2010