Tassilo v. Parseval
Newcomers to Perl often want to know how to parse HTML: for instance, to extract the text between <p> and </p> tags, or to extract content by assembling and following hyperlinks.
HTML is treacherous in that it looks as though it could be handled with just a few regular expressions. Even when you slurp the whole file and work on large strings, sooner or later regular expressions won't be enough.
The HTML::Parser module provides powerful mechanisms for extracting content, tags and tag attributes from any HTML stream.
The subclassing approach that HTML::Parser offers is worth knowing as it is a general technique (used by other Perl modules as well). The idea behind it requires only a bit of understanding of OOP concepts.
HTML::Parser is a class that provides a few methods that you will be using verbatim, such as parse(), parse_file() or eof(). What they do is walk through the HTML, and once they have identified a certain HTML construct (a start or end tag, plain text etc.) they trigger methods (a bit like callbacks) and pass them whatever they have identified. Those callback methods are the ones you have to provide.
In order to make this whole thing work, you create a subclass of HTML::Parser. This subclass will inherit all the methods from HTML::Parser (most notably the parse() family). Some methods, however, you will have to override (that is: replace them so that they suit your needs). Quite naturally, it makes sense to override the callbacks because those are the parts you want to customize.
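The subclass-and-override idea can be tried out without HTML::Parser at all. The following is only a minimal sketch using core Perl: the classes ToyWalker and CollectingWalker and the methods walk() and on_word() are made up for illustration and are not part of any module. The base class owns the driver method and an empty default callback; the subclass replaces only the callback.

```perl
use strict;
use warnings;

# Hypothetical base class: walk() is the driver, on_word() is the callback.
package ToyWalker;

sub new {
    my ($class) = @_;
    return bless {}, $class;
}

# The driver: splits the input and triggers the callback for each word.
sub walk {
    my ($self, $string) = @_;
    $self->on_word($_) for split ' ', $string;
    return $self;
}

# Default callback: deliberately empty, meant to be overridden.
sub on_word { }

# The subclass inherits new() and walk() and overrides only the callback.
# (-norequire because ToyWalker lives in this same file, not in ToyWalker.pm.)
package CollectingWalker;
use parent -norequire, 'ToyWalker';

sub on_word {
    my ($self, $word) = @_;
    push @{ $self->{words} }, uc $word;
}

package main;

my $walker = CollectingWalker->new;
$walker->walk("start text end");
print join(",", @{ $walker->{words} }), "\n";   # prints START,TEXT,END
```

Calling walk() on a plain ToyWalker object runs without errors but does nothing visible, because the default on_word() is empty.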
So take this subclass:
    package MyParser;
    use base qw(HTML::Parser);
That's a fully functional subclass of HTML::Parser. Now you create an object of this class and see what happens when it parses a file:
    package main;

    my $parser = MyParser->new;
    $parser->parse_file("file.html");
When you put the two code fragments above in a file and run it, you'll notice that nothing appears to be happening.
But you'll also notice that you don't get any errors, such as complaints about calling non-existent methods. That's because the two methods you call on $parser, namely new() and parse_file(), were inherited from HTML::Parser.
Further above I said that parse_file() would trigger those callbacks, but seemingly it doesn't (because nothing is happening). Actually, MyParser::parse_file() does call them. As you did not override them, it calls the default methods HTML::Parser::start/end/text/etc (after all, those methods were inherited by 'MyParser'). Those methods are empty (which you can confirm by having a look at the source code of HTML/Parser.pm).
In order to make your parser do something useful, you provide those methods yourself:
    package MyParser;
    use base qw(HTML::Parser);

    # we use these three variables to count something
    our ($text_elements, $start_tags, $end_tags);

    # here HTML::Parser::text/start/end are overridden
    sub text  { $text_elements++ }
    sub start { $start_tags++ }
    sub end   { $end_tags++ }

    package main;

    # Test the parser
    my $html = <<EOHTML;
    <html>
    <head>
    <title>Bla</title>
    </head>
    <body>
    Here's the body.
    </body>
    </html>
    EOHTML

    my $parser = MyParser->new;
    $parser->parse( $html );   # parse() is also inherited from HTML::Parser

    print <<EOREPORT;
    text elements: $MyParser::text_elements
    start tags : $MyParser::start_tags
    end tags : $MyParser::end_tags
    EOREPORT

    __END__
    text elements: 7
    start tags : 4
    end tags : 4
In the previous example we used the parse_file() method to parse the HTML in the file "file.html", but here, for clarity, we use the parse() method to parse the HTML contained in the $html variable.
So it appears MyParser::text() has been called 7 times (7 apparently because HTML::Parser also reports the white-space between tags as text), and start() and end() four times each (which makes sense: you have <html>, <head>, <title> and <body> plus their corresponding closing tags).
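To see where the text events come from, it can help to record every chunk that text() receives and print it with the newlines made visible. This is only a sketch: the class name TextDumper and the @chunks array are made up for illustration. It also calls eof(), the HTML::Parser method that signals the end of the document, so that text still buffered at the end of the input gets reported; the exact number of chunks may therefore differ from the count above.

```perl
use strict;
use warnings;

package TextDumper;
use base qw(HTML::Parser);

our @chunks;    # every piece of text handed to the callback

sub text {
    my ($self, $text) = @_;
    push @chunks, $text;
}

package main;

my $parser = TextDumper->new;
$parser->parse("<html>\n<head>\n<title>Bla</title>\n</head>\n</html>\n");
$parser->eof;   # flush any text still buffered at the end of the input

for my $chunk (@TextDumper::chunks) {
    (my $visible = $chunk) =~ s/\n/\\n/g;   # show newlines as \n
    print "[$visible]\n";
}
```

Most of the printed chunks consist of nothing but a newline, which is why a counter in text() ends up higher than the number of "real" text elements.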
The above parser only does counting. But the callback methods (i.e. text(), start() and end()) are called with arguments, which we chose to ignore above.
The first argument is always the 'MyParser' object (as always with Perl methods). The additional arguments are the ones you are really interested in: they are the broken-down elements of the HTML.
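As a sketch of how those arguments can be put to use, here is a parser that collects the text between <p> and </p> tags. It assumes, as I understand the HTML::Parser documentation for this style of subclassing, that end() receives the tag name as its second argument and text() receives the text itself. The class name ParagraphText and the hash keys in_p, buffer and paragraphs are made up for illustration.

```perl
use strict;
use warnings;

package ParagraphText;
use base qw(HTML::Parser);

# An opening <p> starts a fresh buffer.
sub start {
    my ($self, $tagname) = @_;
    if ($tagname eq 'p') {
        $self->{in_p}   = 1;
        $self->{buffer} = '';
    }
}

# Text is only collected while we are inside a <p> element.
sub text {
    my ($self, $text) = @_;
    $self->{buffer} .= $text if $self->{in_p};
}

# A closing </p> stores the buffer as one finished paragraph.
sub end {
    my ($self, $tagname) = @_;
    if ($tagname eq 'p' && $self->{in_p}) {
        push @{ $self->{paragraphs} }, $self->{buffer};
        $self->{in_p} = 0;
    }
}

package main;

my $parser = ParagraphText->new;
$parser->parse("<html><body><h1>Title</h1><p>First.</p><p>Second.</p></body></html>");
$parser->eof;

print "$_\n" for @{ $parser->{paragraphs} };
```

The text is appended to a buffer rather than stored per call because the parser is free to deliver one stretch of text in several pieces. The heading "Title" is skipped since it does not appear inside a <p> element.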
    package MyParser;
    use base qw(HTML::Parser);

    # This parser only looks at opening tags
    sub start {
        my ($self, $tagname, $attr, $attrseq, $origtext) = @_;
        if ($tagname eq 'a') {
            print "URL found: ", $attr->{ href }, "\n";
        }
    }

    package main;

    my $html = <<EOHTML;
    <html>
    <body>
    <a href="http://www.first.com" target="bla">One link</a>
    <a href="http://www.second.com">Second link</a>
    </body>
    </html>
    EOHTML

    my $parser = MyParser->new;
    $parser->parse( $html );

    __END__
    URL found: http://www.first.com
    URL found: http://www.second.com
The above is essentially a cheap link extractor. The interesting part is the start-callback, shown here with one extra line that also prints the attribute names:
    sub start {
        my ($self, $tagname, $attr, $attrseq, $origtext) = @_;
        if ($tagname eq 'a') {
            print "URL found: ", $attr->{ href }, "\n";
            print " all attributes: @$attrseq\n";
        }
    }
It is called with five arguments. $self is the object itself, $tagname is the name of the start tag, $attr is a hash-reference containing the attributes as key/value pairs, $attrseq is an array-reference which lists the attribute keys in the order in which they appeared in the tag, and finally $origtext is the original text of the tag as it appeared in the HTML snippet.
The start-callback will be called four times for the given HTML string. It will only do something when it encounters an <a> tag:
if ($tagname eq 'a') {
In this case it looks up the value of the 'href' attribute:
print "URL found: ", $attr->{ href }, "\n";
Additionally, it prints all the attributes in the order in which they appeared:
print " all attributes: @$attrseq\n";
For the first <a> tag, this is "href target". For the second one, only "href".
You simply ignore all the stuff you are not interested in. The above parser doesn't care about end-tags or plain text. It only looks at the start-tags to find links in the HTML document.
It's quite easy to integrate more complicated logic into a parser, for instance if you need to parse other documents when they are referenced in an attribute. This parser can be made to work recursively: whenever it encounters a link to another document, it retrieves this document, parses it for more links and follows them as well (until it has walked through the whole www ;-):
    package MyParser;
    use base qw(HTML::Parser);
    use LWP::Simple ();

    sub start {
        my ($self, $tagname, $attr) = @_;
        if ($tagname eq 'a') {
            my $url = $attr->{ href };
            print "URL found: $url\n";
            # make a new parser to parse
            # the document referenced by $url
            my $p = MyParser->new;
            $p->parse( LWP::Simple::get($url) );
        }
    }
This parser will probably never stop because it doesn't keep track of the websites it has already parsed. However, it's not very hard to prevent infinite recursion:
    # ...
    my %already_parsed;

    sub start {
        my ($self, $tagname, $attr) = @_;
        if ($tagname eq 'a') {
            my $url = $attr->{ href };
            print "URL found: $url\n";
            return if $already_parsed{ $url };
            # not yet parsed
            $already_parsed{ $url }++;
            MyParser->new->parse( LWP::Simple::get($url) );
        }
    }
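The bookkeeping can be tried out without touching the network. The following sketch is a variation under stated assumptions, not the code above: the class LinkWalker, the in-memory %pages "web", and the helpers fetch() and crawl() are all made up for illustration, with fetch() standing in for LWP::Simple::get (which returns undef when a document cannot be retrieved). It also skips <a> tags that have no href attribute and marks the starting page as parsed.

```perl
use strict;
use warnings;

package LinkWalker;
use base qw(HTML::Parser);

# Hypothetical in-memory "web" so the sketch runs offline.
my %pages = (
    'a.html' => '<a href="b.html">to b</a> <a name="top">no href</a>',
    'b.html' => '<a href="a.html">back to a</a> <a href="missing.html">gone</a>',
);

our @visited;         # pages in the order they were parsed
my %already_parsed;   # shared by all parser objects

# Stand-in for LWP::Simple::get: undef means "could not be fetched".
sub fetch {
    my ($url) = @_;
    return $pages{$url};
}

sub crawl {
    my ($class, $url) = @_;
    return if $already_parsed{$url}++;   # mark first, so link cycles stop
    my $content = fetch($url);
    return unless defined $content;      # nothing to parse
    push @visited, $url;
    my $p = $class->new;
    $p->parse($content);
    $p->eof;
}

sub start {
    my ($self, $tagname, $attr) = @_;
    return unless $tagname eq 'a' && defined $attr->{href};
    ref($self)->crawl( $attr->{href} );
}

package main;

LinkWalker->crawl('a.html');
print "parsed: @LinkWalker::visited\n";   # prints parsed: a.html b.html
```

Although a.html and b.html link to each other, each is parsed exactly once, and the link to the missing page is skipped quietly. A real crawler would additionally have to turn relative links into absolute URLs before fetching them.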
Hopefully the above is already all you need to write your first HTML::Parser based program. It takes a little time to get used to event-based approaches so you might want to experiment a bit with it. Once you have grokked it, you'll realize how convenient and powerful HTML::Parser is.
Tassilo
perldoc base
perldoc HTML::Parser
perldoc -f package