On Fri, 4 Dec 1998, Jacques Delsemme wrote:
> Comment tags can contain other tags, and your regular expression only
> removes characters up to the next > (as it should, for all the other
> tags cannot contain other tags).
Oh, thanks for the explanation: makes sense.
> There was also an error in the regular expression I provided to remove
> comments. It should have a question mark after the .* to match only up
> to the first -->, that is:
>
> s/<!--.*?-->//gi;
OK. Nit: you don't need the 'i'.
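
For the record, a small sketch of the difference the question mark makes. The sample string and variable names are made up for illustration. The 'i' is unnecessary because the pattern contains no letters.

```perl
use strict;
use warnings;

# Made-up sample input with two comments, the first containing a tag.
my $html = "a<!-- one <b>x</b> -->b<!-- two -->c";

# Greedy .* runs from the first <!-- to the LAST -->, eating the "b".
(my $greedy = $html) =~ s/<!--.*-->//g;

# Non-greedy .*? stops at the FIRST --> after each <!--.
(my $lazy = $html) =~ s/<!--.*?-->//g;

print "$greedy\n";   # prints "ac"
print "$lazy\n";     # prints "abc"
```

Note that '.' does not match a newline by default, so a comment that spans several lines would also need the /s modifier (and the whole file in one string).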
> Another glitch I ran into concerns binary files. ... Is there an easy way
> to recognize if a file is binary?
Yes, the Perl -B file test. (See p. 85 of Programming Perl,
2nd ed.)
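
A minimal sketch of how that might look, assuming the file names arrive on the command line. skip_binary is a hypothetical helper name, not something from your script.

```perl
use strict;
use warnings;

# skip_binary is a hypothetical helper: it keeps only plain files
# that Perl's -B heuristic does not classify as binary.
# -B is a guess, not a guarantee: Perl examines the start of the file
# for NUL bytes or a large share of odd characters.
sub skip_binary {
    return grep { -f $_ && !-B $_ } @_;
}

# Print the names that survive the filter, one per line.
print "$_\n" for skip_binary(@ARGV);
```

One caveat: an empty file tests true for both -T and -B, so empty files get skipped here too, which is probably what you want when extracting a description.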
> It'd be tedious to list all of them (.gif, .jpg, .exe, ...), and filter them
> out before getting the description.
Wouldn't simply checking the filename extension for /\.txt$/
work? You most likely don't want the first 50 words of a
PostScript file (say) even though a PostScript file is a text
file.
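
Something along these lines, with made-up file names, is what I have in mind:

```perl
use strict;
use warnings;

# Made-up file names for illustration.
my @names = ("notes.txt", "paper.ps", "logo.gif", "README.TXT");

# Keep only names ending in ".txt" (case-sensitive as written).
my @text = grep { /\.txt$/ } @names;

print "@text\n";   # prints "notes.txt"
```

Here the 'i' modifier would matter: with /\.txt$/i the README.TXT entry would be kept as well.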
- Paul
Received on Fri Dec 4 16:11:15 1998