Nick scribbled on 5/10/05 4:43 PM:
> I was thinking that, but I didn't know how to do it right. I'm not that
> familiar with Perl regexes. What is this doing to split it? My concern
> was that the filename might contain a '/' or a '\' char in it and I didn't
> know how to reliably split it.
A quick test shows that my original doesn't work for long path names: the
non-greedy (.+?) stops at the first '/' it can, so the rest of the path stays
in the title.
I don't have a Windows box to test on, so I don't know which path separator
the filter uses under Windows: '\' or '/'.
This should catch either, I would think:
$content =~ s,<title>(.+)[\\/]([^<]+)</title>,<title>$2</title>,i;
That says, in English:
match '<title>' followed by one or more characters, up to a / or a \
(the \ is escaped since it is a special char), followed by one or more 'not <'
characters, followed by '</title>'.
The .+ is greedy, so it grabs as much as it can and then backs off to the last
/ or \ in the path, which leaves only the file name in $2.
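
For anyone who wants to check the behaviour without the filters installed,
here is a minimal standalone sketch. The sub name strip_title_path and the
sample titles are made up for illustration and are not part of either module;
only the two substitutions are taken from this thread.

```perl
use strict;
use warnings;

# Hypothetical helper (not in XLtoHTML.pm or pp2html.pm): reduce a path
# inside <title>...</title> to its last component.
sub strip_title_path {
    my ($content) = @_;
    # Greedy (.+) backtracks to the LAST / or \ before the file name.
    $content =~ s,<title>(.+)[\\/]([^<]+)</title>,<title>$2</title>,i;
    return $content;
}

# The earlier non-greedy version, kept here only for comparison.
sub strip_title_path_old {
    my ($content) = @_;
    $content =~ s,<title>(.+?)/([^<]+)</title>,<title>$2</title>,i;
    return $content;
}

# Made-up sample titles: a Unix path, a Windows path, and a bare name.
my @samples = (
    '<title>/tmp/some/dir/foo1234</title>',
    '<title>C:\\temp\\docs\\report.ppt</title>',
    '<title>foo1234</title>',
);

print strip_title_path($_), "\n" for @samples;
# <title>foo1234</title>
# <title>report.ppt</title>
# <title>foo1234</title>      (no separator, so left unchanged)

print strip_title_path_old( $samples[0] ), "\n";
# <title>some/dir/foo1234</title>   (stops at the first usable '/')
```

A title with no path separator is left untouched, since the pattern needs at
least one / or \ to match.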
Try it out and see if it works for you. If it does, I'll make the change and
check it in.
>
>>how about retaining at least the file name without the leading path?
>>
>> my $content = $self->run_ppthtml( $doc->fetch_filename ) || return;
>>+ $content =~ s,<title>(.+?)/([^<]+)</title>,<title>$2</title>,i;
>>
>>Nick scribbled on 5/10/05 9:12 AM:
>>
>>>These two modules create titles inconsistent with the other ones. This is
>>>due to the filtering programs using the full path as the title.
>>>
>>>Obviously it would be best to have a "real" document title, but if we
>>>can't have that I think that it would be better to use only the name of
>>>the file itself, not the full path. This way it would be consistent
>>>between all the modules.
>>>
>>>I see this comment in pp2html.pm so I don't think I'm too off base here:
>>>
>>>Currently produces document titles like /tmp/foo1234. Need to alter
>>>to pass actual document title.
>>>
>>>
>>>Below are diffs for both modules. I realize that this isn't best (it
>>>would be nice to have a "real" title), but I think it is better than it
>>>was before.
>>>
>>>
>>>--- XLtoHTML.pm 2004-10-02 18:09:14.000000000 -0500
>>>+++ XLtoHTML.pm.patched 2005-05-10 09:08:18.000000000 -0500
>>>@@ -37,6 +37,9 @@
>>> # update the document's content type
>>> $doc->set_content_type( 'text/html' );
>>>
>>>+ # remove the full path in the title
>>>+ $content_ref =~ s/<title>.*<\/title>/<title><\/title>/i;
>>>+
>>> # If filtered must return either a reference to the doc or a pathname.
>>> return \$content_ref;
>>>
>>>
>>>--- pp2html.pm 2005-03-23 23:55:06.000000000 -0600
>>>+++ pp2html.pm.patched 2005-05-10 09:08:11.000000000 -0500
>>>@@ -15,6 +15,10 @@
>>> my $content = $self->run_ppthtml( $doc->fetch_filename ) || return;
>>> # update the document's content type
>>> $doc->set_content_type( 'text/html' );
>>>+
>>>+ # remove the full path in the title
>>>+ $content =~ s/<title>.*<\/title>/<title><\/title>/i;
>>>+
>>> return \$content;
>>> }
>>>
>>
>>--
>>Peter Karman . http://peknet.com/ . peter(at)not-real.peknet.com
>>
--
Peter Karman . http://peknet.com/ . peter(at)not-real.peknet.com
Received on Tue May 10 14:51:46 2005