Re: [swish-e] XML parsing not returning Title

From: Robinson Craig <Craig.Robinson(at)not-real.nrw.qld.gov.au>
Date: Mon Dec 03 2007 - 23:57:59 GMT
>On 11/28/2007 08:18 AM, Peter Karman wrote:
>> 
>> On 11/27/2007 04:56 PM, Robinson Craig wrote:
>> 
>>> I've run the same config and files with 2.4.5 (current stable release)
>>> installed on a DEV box (in readiness for deployment out to PROD), with
>>> the same result (incidentally with no parsing errors).
>>>
>> 
>> what indexing method (-S) are you using? Can you paste the exact command you
>> are using to index?
>> 
>
>nevermind. I see the issue.
>
>You need to add:
>
> PropertyNameAlias swishtitle title
>
>to your config.
>
>The HTML parser knows about the special '<title>' tagset and uses that for the
>swishtitle property. The XML parser doesn't know about it. Since you are
>indexing .pdf files with the HTML parser (is that what you really want?), the
>.pdf docs get the title magic, but the .html docs (or anything else parsed with
>the XML parser) needs a little help knowing which tag to use as the swishtitle.
>
>
>-- 
>Peter Karman  .  peter(at)not-real.peknet.com  .  http://peknet.com/
>

Hi Peter,

I did try "PropertyNameAlias swishtitle title", and the result was
unexpected. The HTML page (parsed using XML2) now returns the title
correctly, but the PDF file (parsed using HTML2) now has the content
of the PDF in the "title" field (see attached text file:
PDF_title_contents.txt).

However, what really interests me is the comment "indexing .pdf
files with the HTML parser (is that what you really want?)". I suspect
that my approach is somewhat non-standard :-).

What we are doing is converting the PDF to HTML using "pdftotext
-htmlmeta" (part of Xpdf) and then indexing the result using HTML2. From
what I have gleaned from around the place, this seems to be one way of
doing it. What would be the alternative? Or, better still, what is the
"standard" way of indexing PDF metadata as well as content? I have been
reading (in the forum) about how SWISH::Filters::Pdf2HTML uses 'pdfinfo'
(also part of Xpdf). I haven't really investigated using spider.pl (which
uses Pdf2HTML by default) yet, as I am trying to do this from the file
system, but would that be considered the more standard approach?
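
For reference, the conversion step described above can be sketched roughly
as follows. This is only an illustration, not our actual setup: the function
names and the example file name are hypothetical, and it assumes Xpdf's
pdftotext is on the PATH. The "-htmlmeta" option wraps the extracted text in
simple HTML and adds the PDF's metadata to the head, and the trailing "-"
sends the output to stdout.

```python
import subprocess


def build_pdftotext_command(pdf_path):
    """Return the argument list for converting one PDF to simple HTML on stdout.

    Hypothetical helper for illustration; only the pdftotext options are real.
    """
    # -htmlmeta: emit simple HTML with the PDF's meta information in the head
    # "-":       write the result to stdout instead of a file
    return ["pdftotext", "-htmlmeta", str(pdf_path), "-"]


def pdf_to_html(pdf_path):
    """Run pdftotext and return the generated HTML as bytes.

    Raises subprocess.CalledProcessError if pdftotext exits with an error.
    """
    result = subprocess.run(
        build_pdftotext_command(pdf_path),
        check=True,
        stdout=subprocess.PIPE,
    )
    return result.stdout
```

Calling pdf_to_html("report.pdf") (a made-up file name) would return the
HTML that is then handed to the HTML2 parser.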

Thanks for your assistance.

Cheers, Craig



 


Received on Mon Dec 3 18:58:05 2007