
Using the HTML::Parser module

Tassilo v. Parseval

Newcomers to Perl often want to know how to parse HTML. For instance, to extract the text between <p> and </p> tags, or to extract content by assembling and following hyperlinks.

HTML is treacherous in that it looks as though it could be handled with just a few regular expressions. Even when you slurp the whole file and work on large strings, sooner or later regular expressions won't be enough.

The HTML::Parser module provides powerful mechanisms for extracting content, tags and tag attributes from any HTML stream.

Subclassing

The subclassing approach that HTML::Parser offers is worth knowing as it is a general technique (used by other Perl modules as well). The idea behind it requires only a bit of understanding of OOP concepts.

HTML::Parser is a class that provides a few methods that you will be using verbatim, such as parse(), parse_file() or eof(). What they do is walk through the HTML, and once they have identified a certain HTML construct (a start or end tag, plain text etc.) they trigger methods (they are a bit like callbacks) and pass them the pieces they have identified. Those callback methods are the ones you have to provide.

In order to make this whole thing work, you create a subclass of HTML::Parser. This subclass will inherit all the methods from HTML::Parser (most notably the various parse methods). Some methods, however, you will have to override (that is: replace them so that they suit your needs). Quite naturally, it makes sense to override the callbacks because those are the parts you want to customize.

So take this subclass:

    package MyParser;
    use base qw(HTML::Parser);

That's a fully functional subclass of HTML::Parser. Now you create an object of this class and see what happens when it parses a file:

    package main;
    my $parser = MyParser->new;
    $parser->parse_file("file.html");

When you put the two code fragments above in a file and run it, you'll notice that nothing appears to be happening.

But you'll also notice that you don't get any errors about calling non-existent methods. That's because the two methods you call on $parser, namely new() and parse_file(), were inherited from HTML::Parser.

Further above I said that parse_file() would trigger those callbacks, but seemingly it doesn't (because nothing is happening). Actually, MyParser::parse_file() does call them. As you did not override them, it calls the default methods HTML::Parser::start(), end(), text() etc. (after all, those methods were inherited by 'MyParser'). Those methods are empty, which you can confirm by looking at the source code of HTML/Parser.pm.

Providing methods

In order to make your parser do something useful, you provide those methods yourself:

    package MyParser;
    use base qw(HTML::Parser);

    # we use these three variables to count something
    our ($text_elements, $start_tags, $end_tags);

    # here HTML::Parser::text/start/end are overridden
    sub text	{ $text_elements++  }
    sub start	{ $start_tags++	    }
    sub end	{ $end_tags++	    }

    package main;

    # Test the parser

    my $html = <<EOHTML;
    <html>
	<head>
	    <title>Bla</title>
	</head>
	<body>
	Here's the body.
	</body>
    </html>
    EOHTML

    my $parser = MyParser->new;
    $parser->parse( $html );	# parse() is also inherited from HTML::Parser

    print <<EOREPORT;
    text elements:  $MyParser::text_elements
    start tags   :  $MyParser::start_tags
    end tags     :  $MyParser::end_tags
    EOREPORT

    __END__
    text elements: 7
    start tags   : 4
    end tags     : 4

In the previous example we used the parse_file() method to parse the HTML in the file "file.html", but here for clarity we use the parse() method to parse the HTML contained in the $html variable.

So MyParser::text() has been called 7 times (7 because HTML::Parser also reports the runs of white-space between tags as text), and start() and end() four times each (which makes sense: you have <html>, <head>, <title> and <body> plus their corresponding closing tags).
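If the white-space-only text elements get in the way, the text callback can simply skip them. The following is only a sketch: the class name VisibleTextCounter is made up for this example, and it assumes (as in the parsers above) that new() is called without arguments so that the method callbacks are used. For the same HTML as above it should count 2 text elements, "Bla" and the body text.

```perl
package VisibleTextCounter;    # hypothetical class name, not part of HTML::Parser
use strict;
use warnings;
use base qw(HTML::Parser);

my $text_elements = 0;

# count a text element only if it contains a non-white-space character
sub text {
    my ($self, $text) = @_;
    $text_elements++ if $text =~ /\S/;
}

package main;

my $html = <<EOHTML;
<html>
    <head>
        <title>Bla</title>
    </head>
    <body>
    Here's the body.
    </body>
</html>
EOHTML

my $parser = VisibleTextCounter->new;
$parser->parse( $html );
$parser->eof;    # signal the end of the document

print "non-blank text elements: $text_elements\n";
```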

The above parser only does counting. But the callback methods (i.e. text(), start() and end()) are called with arguments, which we chose to ignore above.

The first argument is always the 'MyParser' object (as always with Perl methods). The additional arguments are the ones you are really interested in: they are the broken-down elements of the HTML.
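To give an idea of how those arguments can be put to use, here is a rough sketch of the task mentioned at the very beginning: collecting the text between <p> and </p> tags. The class name ParagraphText and the two variables are invented for this example. start() and end() only look at the tag name they receive, and text() appends the text it receives while a paragraph is open.

```perl
package ParagraphText;    # hypothetical class name
use strict;
use warnings;
use base qw(HTML::Parser);

my @paragraphs;           # collected text, one entry per <p> element
my $inside_p = 0;         # true while we are between <p> and </p>

sub start {
    my ($self, $tagname) = @_;
    if ($tagname eq 'p') {
        $inside_p = 1;
        push @paragraphs, '';
    }
}

sub text {
    my ($self, $text) = @_;
    # tags nested inside the paragraph (like <b>) are ignored,
    # but the text inside them still arrives here
    $paragraphs[-1] .= $text if $inside_p;
}

sub end {
    my ($self, $tagname) = @_;
    $inside_p = 0 if $tagname eq 'p';
}

package main;

my $parser = ParagraphText->new;
$parser->parse('<p>First one.</p><div>skipped</div><p>Second <b>bold</b> one.</p>');
$parser->eof;             # signal the end of the document

print "$_\n" for @paragraphs;
```

This is deliberately simple: it does not handle nested or unclosed <p> tags, which real-world HTML may well contain.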

Next Parser

    package MyParser;
    use base qw(HTML::Parser);

    # This parser only looks at opening tags
    sub start { 
	my ($self, $tagname, $attr, $attrseq, $origtext) = @_;
	if ($tagname eq 'a') {
	    print "URL found: ", $attr->{ href }, "\n";
	}
    }

    package main;

    my $html = <<EOHTML;
    <html>
	<body>
	    <a href="http://www.first.com" target="bla">One link</a>
	    <a href="http://www.second.com">Second link</a>
	</body>
    </html>
    EOHTML

    my $parser = MyParser->new;
    $parser->parse( $html );
    __END__
    URL found: http://www.first.com
    URL found: http://www.second.com

The above is essentially a cheap link extractor. The interesting part is the start-callback, shown here with one extra line that also prints the attribute names:

    sub start {
	my ($self, $tagname, $attr, $attrseq, $origtext) = @_;
	if ($tagname eq 'a') {
	    print "URL found: ", $attr->{ href }, "\n";
	    print "  all attributes: @$attrseq\n";
	}
    }

It is called with five arguments. $self is the object itself, $tagname is the name of the start tag, $attr is a hash-reference containing the attributes as key/value pairs, $attrseq is an array-reference which lists the attribute keys in the order in which they appeared in the tag, and finally $origtext is the original text of the tag as it appeared in the HTML snippet.
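To see what these arguments look like for a concrete tag, you can record them and print them out. This is just a sketch; the class name ShowStartArgs and the @seen array are made up for the purpose. Note that with the default settings the tag name and the attribute names arrive in lower case, while $origtext keeps the tag exactly as it was written.

```perl
package ShowStartArgs;    # hypothetical class name
use strict;
use warnings;
use base qw(HTML::Parser);

my @seen;                 # one hash-ref per start tag

sub start {
    my ($self, $tagname, $attr, $attrseq, $origtext) = @_;
    push @seen, {
        tag   => $tagname,
        # walk the attributes in their original order
        attrs => join(', ', map { "$_=$attr->{$_}" } @$attrseq),
        text  => $origtext,
    };
}

package main;

my $parser = ShowStartArgs->new;
$parser->parse('<A HREF="http://www.first.com" Target=bla>One link</A>');
$parser->eof;

for my $s (@seen) {
    print "tagname : $s->{tag}\n";
    print "attrs   : $s->{attrs}\n";
    print "origtext: $s->{text}\n";
}
```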

The start-callback will be called four times for the given HTML string. It will only do something when it encounters an <a> tag:

    if ($tagname eq 'a') {

In this case it looks up the value of the 'href' attribute:

    print "URL found: ", $attr->{ href }, "\n";

Additionally, it prints all the attributes in the order in which they appeared:

    print "  all attributes: @$attrseq\n";

For the first <a> tag, this is "href target". For the second one, only "href".

You simply ignore all the stuff you are not interested in. The above parser doesn't care about end-tags or plain text. It only looks at the start-tags to find links in the HTML document.

It's quite easy to integrate more complicated logic into a parser, for instance if you need to parse other documents when they are referenced in an attribute. This parser can be made to work recursively: whenever it encounters a link to another document, it retrieves this document, parses it for more links and follows them as well (until it has walked through the whole WWW ;-):

    package MyParser;
    use base qw(HTML::Parser);
    use LWP::Simple ();

    sub start {
	my ($self, $tagname, $attr) = @_;
	if ($tagname eq 'a') {
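	    my $url = $attr->{ href };
	    return unless defined $url;	# <a> tag without href
	    print "URL found: $url\n";

	    # make a new parser to parse
	    # the document referenced by $url

	    my $content = LWP::Simple::get($url);
	    return unless defined $content;	# download failed
	    my $p = MyParser->new;
	    $p->parse( $content );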
	}
    }

This parser will probably never stop because it doesn't keep track of the URLs it has already parsed. However, it's not very hard to prevent infinite recursion:

    ...
    my %already_parsed;
    sub start {
	my ($self, $tagname, $attr) = @_;
	if ($tagname eq 'a') {
	    my $url = $attr->{ href };
	    return unless defined $url;	# <a> tag without href
	    print "URL found: $url\n";
	    return if $already_parsed{ $url }++;	# seen before

	    # not yet parsed
	    my $content = LWP::Simple::get($url);
	    MyParser->new->parse( $content ) if defined $content;
	}
    }
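
One more thing to watch out for: the href values are used exactly as they appear in the document, so relative links (like "page2.html") cannot be fetched as they are. The sketch below shows one possible way around that, assuming the URI module is installed (LWP depends on it). The class name LinkCollector and the base address are invented for this example. It only collects the links and does not download anything.

```perl
package LinkCollector;    # hypothetical class name
use strict;
use warnings;
use base qw(HTML::Parser);
use URI;

# assumed address of the document being parsed
my $base = 'http://www.first.com/docs/index.html';

my %seen;                 # URLs we have already collected
my @links;                # absolute URLs in order of appearance

sub start {
    my ($self, $tagname, $attr) = @_;
    return unless $tagname eq 'a' && defined $attr->{ href };

    # turn a relative link into an absolute one
    my $url = URI->new_abs($attr->{ href }, $base)->as_string;
    push @links, $url unless $seen{ $url }++;
}

package main;

my $parser = LinkCollector->new;
$parser->parse(<<'EOHTML');
<a href="page2.html">two</a>
<a href="/top.html">top</a>
<a href="page2.html">two again</a>
<a name="here">no href at all</a>
EOHTML
$parser->eof;

print "$_\n" for @links;
```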

Conclusion

Hopefully the above is already all you need to write your first HTML::Parser-based program. It takes a little time to get used to event-based approaches, so you might want to experiment a bit with it. Once you have grokked it, you'll realize how convenient and powerful HTML::Parser is.

Tassilo

See also

    perldoc base
    perldoc HTML::Parser
    perldoc -f package