Hi,
so I found the script (and found that my memory is not so good): it is
only a filter, and I used HTML2 rather than XML. But the main thing is, it works!
Hope that helps.
Config:
IndexFile sxw.index
IncludeConfigFile .\conf\common.config
IndexDir d:/temp/test
IndexOnly .sxw
DefaultContents HTML
IndexContents HTML2 .sxw .xml .sxwrtf #<-- here
IndexContents TXT .txt
FollowSymLinks yes
IndexComments no
UndefinedMetaTags index #<-- here?
UndefinedXMLAttributes index
StoreDescription HTML2 <office:body> 1000 #<-- here
StoreDescription HTML <body> 1000
StoreDescription TXT 1000
FileFilter .sxw 'perl -w ./filter/simplesxw.pl' '"%P"'
Script:
#! "C:\Perl\bin\perl.exe"
use strict;
#simple tool to extract text from sxw files
#receives path to file, outputs the concatenated xml files (in utf8 encoding)
#OOo stores files in utf8, if you want something else, you must convert the file yourself
#my solution was not portable /not even very quick/, so I did not add it
my $input_file = shift || die "Usage: $0 <filename>\n";
my $slash = "\\";
#this is important, path to unzip utility "-p" says that it should
#pipe output to STDOUT
my $_util = "unzip -p";
#what files do you want to extract from sxw file, only two of them are important
my $_conf = "meta.xml content.xml";
our $MYLOGFILE = 'sxwlog.txt';
#print STDERR " $input_file\n\n###!--------------------------------";
my @new = split(/\//, $input_file);
$input_file = join ($slash, @new);
#print STDERR "new name is - $input_file\n\n";
print STDERR "$_util $input_file $_conf\n";
#on my windows version of unzip - "|" seems not to work; hope you linux guys have better system
#the trailing "|" opens the command for reading, so the loop below can see its output
open (INPUT, "$_util \"$input_file\" $_conf |") || die "can't open $input_file: $!";
while (<INPUT>) {
print "$_";
#print(replaceChar($_)); #you may do something with the contents
# do something with $_
}
save_to_file("OK: $input_file\n");
close(INPUT) || die "can't close $input_file: $!";
sub save_to_file {
my $str = shift;
if ($MYLOGFILE) {
open (MYFILE, ">>$MYLOGFILE") || die "Check MYLOGFILE: $!";
print (MYFILE "$str");
close (MYFILE) || die "Can't close $MYLOGFILE: $!";
}
else {
print STDERR $str;
}
}
#you have a chance to do some cleaning inside of xml file
sub replaceChar {
my $str = shift;
for ( $str ) {
# s/&[sS]caron;/š/go;
}
return $str;
}
1;
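
The core of the script above (turn the forward slashes into backslashes, then read what "unzip -p" writes) can be sketched roughly like this. It is only a sketch, not the script I actually run: the helper names to_windows_path and unzip_command are made up for illustration, and it assumes an unzip that understands "-p" is on the PATH.

```perl
use strict;
use warnings;

# Hypothetical helper (not in the script above): swap "/" for "\",
# the same thing the split/join pair does.
sub to_windows_path {
    my ($path) = @_;
    (my $win = $path) =~ s{/}{\\}g;
    return $win;
}

# Hypothetical helper: build the unzip command line for the given
# archive and the members to extract ("-p" sends them to STDOUT).
sub unzip_command {
    my ($file, @members) = @_;
    return sprintf('unzip -p "%s" %s', to_windows_path($file), join(' ', @members));
}

# Only runs when a file name is given on the command line.
if (@ARGV) {
    my $cmd = unzip_command($ARGV[0], 'meta.xml', 'content.xml');
    # '-|' opens the command for reading, so <$in> yields its output
    open(my $in, '-|', $cmd) or die "can't run $cmd: $!";
    print while <$in>;
    close($in) or die "unzip failed for $ARGV[0]: $! (exit status $?)";
}
```

Run it as "perl sketch.pl d:/temp/test/file.sxw" (the file name is just an example); without an argument it only defines the two helpers.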
Philip Young wrote:
> Hey,
>
> I'm having a lot of frustration trying to get the meta.xml (document
> properties) and the content.xml indexed. I would like the
> content to be indexed into the "swishdefault" category (normal indexed
> content) and the document properties indexed with
> "UndefinedMetaTags auto".
>
> So I'm just looking for a quick and dirty way to accomplish this task.
> Originally I thought of concatenating the two .xmls to be indexed
> like so:
>
> FileFilterMatch "/usr/bin/unzip" "-p \"%p\" meta.xml content.xml" /\.(sxw|sxc|sxi|odt)$/i
>
> This line compiles and indexes with no syntax errors. But the problem
> is it does not seem to index properly.
>
> Anyone got any ideas on how to get the meta.xml and content.xml indexed?
>
> My swish.conf file is located below.
>
> Thank you,
>
>
> Philip Young
>
> -- swish.conf --
> IndexDir /var/www/test
> IndexFile /var/www/test/index.swish-e
> IndexName Documents
> IndexOnly .xml .htm .html .txt .doc .rtf .sxw .sxc .sxi .odt
> DefaultContents TXT
> SwishProgParameters -S fs
>
> ReplaceRules replace /var/www/test /test
> ExtractPath subject regex !^/test/([^/]+)/.*$!$1!
>
> # Allow extra searching by title, path
> metanames swishtitle swishdocpath
> UndefinedMetaTags auto
>
> IndexContents TXT* .pdf
> FileFilter .pdf "/usr/bin/pdftotext" "'%p' -"
> #SWISH::Filter .pdf "/usr/bin/pdftotext" "'%p' -"
>
> IndexContents TXT* .doc
> FileFilter .doc "/usr/bin/catdoc" "-s8859-1 -d8859-1 '%p'"
> #SWISH::Filter .doc "/usr/bin/catdoc" "-s8859-1 -d8859-1 '%p'"
>
> IndexContents TXT* .rtf
> FileFilter .rtf "/usr/bin/catdoc" "'%p'"
> #SWISH::Filter .rtf "/usr/bin/catdoc" "'%p'"
>
> FileFilterMatch "/usr/bin/unzip" "-p \"%p\" meta.xml" /\.(sxw|sxc|sxi|odt)$/i
> IndexContents XML* .sxw .sxc .sxi .odt
> StoreDescription XML* <text:p>
>
> FileFilterMatch "/usr/bin/unzip" "-p \"%p\" content.xml" /\.(sxw|sxc|sxi|odt)$/i
> IndexContents XML* .sxw .sxc .sxi .odt
> StoreDescription XML* <text:p>
>
>
>
>
Received on Tue May 31 00:13:30 2005