Re: Some Questions about 2.2RC1 and XML

From: Bill Humphries <whump(at)not-real.apple.com>
Date: Sat Sep 07 2002 - 03:01:50 GMT
On Thursday, September 5, 2002, at 06:19 PM, Bill Moseley wrote:

>> 	"NOTE: Entities within XML files and files parsed with libxml2 are
>> converted regardless of this setting."
>
> Right, that option only works for HTML docs (docs parsed by html.c).  XML,
> XML2 both convert entities.
>
>> My current workaround for this is to build an XML result string, then pass
>> it through Tidy (http://tidy.sourceforge.net/) to re-escape entities.
>
> What needs to be escaped besides < and >?  Seems like it would be slow to
> spawn an external program to do this.  If you are using perl then things like
> CGI.pm and HTML::Entities can do this work.

We plan to process the results with XSLT (which is used for the rest of
the site's presentation layer), so we need all the entities converted
back to their Unicode equivalents. This is all being done in PHP, so there
may be a string function handy; I need to glance back at the man pages.
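
To make the re-escaping step concrete, here is a minimal sketch of what has to happen to a stored property before it goes into an XML result string for XSLT. It is written in Python purely for illustration (the site itself uses PHP), and the function name is hypothetical. It escapes the markup characters and turns anything outside ASCII into numeric character references, which an XSLT processor can read without a DTD.

```python
from xml.sax.saxutils import escape


def reescape(text: str) -> str:
    """Make a search-result string safe to embed in an XML document.

    Hypothetical helper, not part of swish-e: escapes &, < and >, then
    replaces non-ASCII characters with numeric character references.
    """
    escaped = escape(text)  # & -> &amp;, < -> &lt;, > -> &gt;
    return escaped.encode("ascii", "xmlcharrefreplace").decode("ascii")


if __name__ == "__main__":
    # The breadcrumb text from the sample page, as it looks once the
    # parser has converted the &gt; entities back to literal characters.
    print(reescape("HRWeb >> sandbox >> dsAuth Test"))
```

This avoids spawning an external program per result, which was the concern raised above.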

>
>> 2) I'm indexing XML source documents in the file system. I can use the
>> configuration to use the first 100 characters of the document's root
>> element, 'page', as the description:
>>
>> PropertyNamesMaxLength 100 swishdescription
>> PropertyNameAlias swishdescription page
>>
>> However, when swish-e constructs the index, it's taking the attribute
>> values, as well as the text nodes of 'page'.
>
> Can you put together a small example?

Here's the relevant portion of the config file:

MetaNames page.title container.title container.access
XMLClassAttributes page.title container.access
PropertyNameAlias swishtitle page.title
PropertyNamesMaxLength 100 swishdescription
PropertyNameAlias swishdescription page
UndefinedMetaTags index
UndefinedXMLAttributes ignore
# But only index the .xml files
IndexOnly .xml
IndexContents XML2 .xml

Then an example file to index:

<?xml version="1.0"?>
<page changed="false" name="dsAuthTest" new="false" title="dsAuth Test">
   <container access="employee" changed="false" new="false" 
shorttitle="Employee" title="Section One">
     <para changed="false" new="false">Content viewable by all 
employees.</para>
   </container>
   <breadcrumb>
   <a href="/areas/hrweb/employee/">HRWeb</a>
   &gt;&gt;
   <a href="/areas/hrweb/employee/sandbox/">sandbox</a>
   &gt;&gt; dsAuth Test</breadcrumb>
</page>

Then when I search

% /usr/local/bin/swish-e -f index.swish-e -w dsAuth -x '%p\n%d'
# SWISH format: 2.2rc1
# Search words: dsAuth
# Number of hits: 1
# Search time: 0.001 seconds
# Run time: 0.104 seconds
/areas/hrweb/employee/sandbox/dsauthtest.html
false dsAuthTest false dsAuth Test employee false false Employee 
Section One false false Content vie.

As you can see, the description is pulling in the attribute values
ahead of the text nodes, so they use up nearly all of the 100
characters before the actual content ("Content vie.") begins.
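
For comparison, this is roughly the description I was expecting: text nodes only, whitespace collapsed, cut at 100 characters. The sketch below is Python with the standard library's ElementTree, not anything swish-e does internally, and the function name is made up for illustration.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<?xml version="1.0"?>
<page changed="false" name="dsAuthTest" new="false" title="dsAuth Test">
   <container access="employee" changed="false" new="false"
shorttitle="Employee" title="Section One">
     <para changed="false" new="false">Content viewable by all
employees.</para>
   </container>
   <breadcrumb>
   <a href="/areas/hrweb/employee/">HRWeb</a>
   &gt;&gt;
   <a href="/areas/hrweb/employee/sandbox/">sandbox</a>
   &gt;&gt; dsAuth Test</breadcrumb>
</page>
"""


def text_description(xml_text: str, limit: int = 100) -> str:
    """Return the first `limit` characters of the document's text nodes.

    Attribute values are never visited, because itertext() walks only
    element text and tail text.
    """
    root = ET.fromstring(xml_text)
    text = " ".join("".join(root.itertext()).split())  # collapse whitespace
    return text[:limit]


if __name__ == "__main__":
    print(text_description(SAMPLE))
```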

>> I'd also like to specify a location in the document to use as the
>> description, i.e. /page/section[1]/para[1].
>>
>> The workaround here would be to use the prog method to load pages and use
>> some xpath tool to extract that location and use as the page description.
>
> I'm not 100% clear what you want, but using -S prog with the available
> tools will probably give you the most control.

That's probably the way to go. It adds some expense on the indexing
side, but I'm only indexing on the order of 3,000 pages, so that's no
great burden.
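
A rough sketch of what such a -S prog feeder might look like, again in Python for illustration only. Several things here are assumptions: the helper names are hypothetical, the added <description> element is my own invention (the config would then need something like "PropertyNameAlias swishdescription description" instead of aliasing 'page'), and the record layout of Path-Name and Content-Length headers followed by a blank line and the document is the -S prog format as I understand it from the swish-e docs. The sample file uses 'container' rather than 'section', so the location is written accordingly.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<?xml version="1.0"?>
<page name="dsAuthTest" title="dsAuth Test">
   <container access="employee" title="Section One">
     <para changed="false" new="false">Content viewable by all employees.</para>
   </container>
</page>
"""


def first_para(xml_text: str, location: str = "./container[1]/para[1]") -> str:
    """Return the whitespace-normalized text found at `location`, or ''."""
    node = ET.fromstring(xml_text).find(location)
    if node is None:
        return ""
    return " ".join("".join(node.itertext()).split())


def prog_record(path: str, xml_text: str) -> bytes:
    """Build one record: headers, a blank line, then the document.

    The <description> child is a hypothetical element added only so the
    indexer has a text-only node to use as the description.
    """
    root = ET.fromstring(xml_text)
    ET.SubElement(root, "description").text = first_para(xml_text)
    body = ET.tostring(root, encoding="utf-8")
    # Content-Length counts bytes of the body, not characters.
    header = "Path-Name: %s\nContent-Length: %d\n\n" % (path, len(body))
    return header.encode("utf-8") + body


if __name__ == "__main__":
    import sys
    sys.stdout.buffer.write(prog_record("dsauthtest.xml", SAMPLE))
```

Keeping the .xml extension on Path-Name should let the existing IndexContents line pick the XML2 parser, though I have not verified that.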

Thanks.

-- whump
Received on Sat Sep 7 03:05:23 2002