
Retrieving web pages (LWP)

In this tutorial you will learn how to retrieve the source for web pages. The first example covers simply retrieving the page and storing it either in a variable or a file. The second example shows the more complex possibilities available.

Solution 1: LWP::Simple

This first example uses the very friendly LWP::Simple module. This module allows you to request a URL and either store the HTML in a variable, print it, or write it to a file.

In this example we are retrieving the HTML to a variable:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use LWP::Simple;

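    # get returns undef on failure; test with defined so that an
    # empty page is not mistaken for an error
    my $content = get('http://www.perlmeme.org');
    die 'Unable to get page' unless defined $content;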

    exit 0;

The LWP::Simple module provides only a functional interface - that is, there is no object oriented interface to use.

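You can also use LWP::Simple to print the web page source directly to STDOUT. It is the same as the previous example except that we use getprint instead of get. Because getprint returns the HTTP status code of the response (a true value even for errors such as 404), the script tests that code with is_success, which LWP::Simple also exports:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use LWP::Simple;

    # getprint returns the HTTP status code, e.g. 200 or 404
    my $status = getprint('http://www.perlmeme.org');
    die "Unable to get page (status $status)" unless is_success($status);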

    exit 0;

The third example shows how to get the web page source and write it directly to a file, using LWP::Simple. It uses the getstore function, which writes the web page source to the given filename and returns the HTTP status code of the response:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use LWP::Simple;

    # getstore returns the HTTP status code, so test it with is_success
    my $status = getstore('http://www.perlmeme.org', 'test.html');
    die "Unable to get page (status $status)" unless is_success($status);

    exit 0;
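
Since getprint and getstore return an HTTP status code rather than a plain true or false value, it can help to see how such codes are classified. The following is a minimal sketch that needs no network access; the sample codes are chosen only for illustration. It loads the helper functions from HTTP::Status directly (LWP::Simple exports the same functions).

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTTP::Status qw(is_success status_message);

# Sample status codes, chosen only for illustration
my @codes = (200, 404, 500);

my %outcome;
for my $code (@codes) {
    # is_success is true only for codes in the 2xx range
    $outcome{$code} = is_success($code) ? 'success' : 'failure';
    print "$code ", status_message($code), ": $outcome{$code}\n";
}
```

Any non-zero status code is true in Perl, which is why a test such as is_success is needed rather than a bare "or die".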

Solution 2: LWP

If you want to do more with the web page source than store it, you may want to consider using the full object oriented LWP::UserAgent interface. The package Bundle::LWP contains the standard LWP modules that you will need.

Firstly, to start your script:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use LWP::UserAgent;

For the Lazy (this is a good thing), you most likely also want to use:

    use HTTP::Request::Common qw(POST);

You can import the GET function instead if you do not need POST.

    use HTTP::Request::Common qw(GET);

Define your User Agent

You then need to define your user agent:

    my $ua = LWP::UserAgent->new;

This is the object that acts as a browser and makes requests and receives responses.

Define the request

Next you need to create the request object that will be used to request the URL. Since we are using the HTTP::Request::Common module, we can use the POST function it exports. It accepts a URL as its first parameter, followed by a reference to an array of form fields (name and value pairs) to be sent to that URL.

    my $req = POST 'http://www.perlmeme.org', [];

Or passing in form arguments:

    my $req = POST 'http://www.perlmeme.org', [name => 'Bob', age => 24];

The GET function is used in a similar way, but takes no form arguments:

    my $req = GET 'http://www.perlmeme.org';

You can also pass header data to the GET and POST functions.
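
As a rough sketch of passing headers, the example below builds one GET and one POST request and prints them without sending anything. The Accept-Language header is only an illustrative choice. For POST, the headers follow the URL and the form data is introduced with the Content key.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTTP::Request::Common qw(GET POST);

# Header name/value pairs follow the URL
my $get = GET 'http://www.perlmeme.org',
    'Accept-Language' => 'en';

# For POST, the form data can be given with the Content key
my $post = POST 'http://www.perlmeme.org',
    'Accept-Language' => 'en',
    Content           => [name => 'Bob', age => 24];

# as_string shows each request as text; nothing is sent here
print $get->as_string, "\n";
print $post->as_string, "\n";
```

Printing a request with as_string is a convenient way to check what will be sent before handing it to the user agent.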

Making the request

Once you have defined your request object, use the UserAgent to make the request:

    my $res = $ua->request($req);

The request method returns an HTTP::Response object. This object contains the status code of the response, and the content of the page if the request was successful.

The response

You can check if the request was successful by using the is_success method:

    if ($res->is_success) {
        print $res->content;
    } else {
        print $res->status_line . "\n";
    }
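
To show what else a response object offers, here is a small sketch that builds an HTTP::Response by hand, standing in for the object that the request method would return. The 404 status and the HTML body are made up for the illustration, and no network access is needed.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTTP::Response;

# A hand-built response standing in for what $ua->request would return
my $res = HTTP::Response->new(404, 'Not Found');
$res->header('Content-Type' => 'text/html');
$res->content('<h1>No such page</h1>');

my $summary;
if ($res->is_success) {
    $summary = 'OK: ' . length($res->content) . ' bytes';
} else {
    # status_line combines the numeric code and the message
    $summary = 'Failed: ' . $res->status_line;
}

print "$summary\n";
print 'Numeric code: ', $res->code, "\n";
print 'Content type: ', $res->content_type, "\n";
```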

User Agents

If you want your program to be represented as a particular agent, for example Mozilla 8.0, you can set this using the agent method:

    $ua->agent('Mozilla/8.0');

Or, to be represented as Internet Explorer:

    $ua->agent('Mozilla/4.0 (compatible; MSIE 5.0; Windows 95)');
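
The agent method can also be called without an argument to read the current value. The sketch below shows this, using the made-up name MyCrawler/0.1. When the string passed to agent ends with a space, LWP appends its own default identifier to it.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use LWP::UserAgent;

my $ua = LWP::UserAgent->new;

# With no argument, agent returns the current value. By default
# this is "libwww-perl/" followed by the LWP version number.
my $default = $ua->agent;

# 'MyCrawler/0.1' is a hypothetical name. The trailing space asks
# LWP to append its default identifier to the string.
$ua->agent('MyCrawler/0.1 ');
my $custom = $ua->agent;

print "Default: $default\n";
print "Custom:  $custom\n";
```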

Proxies

For whatever reason, you may want your requests to be made through a proxy. You can set different proxies for different protocols. Here is an example of setting a proxy for the ftp protocol:

    $ua->proxy(ftp => 'http://some.proxy.com');
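
A slightly fuller sketch of proxy configuration follows. The proxy address and host names are placeholders, not real servers. It sets one proxy for two protocols, lists hosts that should be contacted directly, and reads a setting back; no request is made.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use LWP::UserAgent;

my $ua = LWP::UserAgent->new;

# Hypothetical proxy address, for illustration only
my $proxy_url = 'http://proxy.example.com:8080/';

# An array reference sets the same proxy for several protocols
$ua->proxy(['http', 'ftp'], $proxy_url);

# Hosts in these domains are contacted directly, not via the proxy
$ua->no_proxy('localhost', 'intranet.example.com');

# Called with only a scheme, proxy returns the current setting
print 'http proxy: ', $ua->proxy('http'), "\n";
print 'ftp proxy:  ', $ua->proxy('ftp'), "\n";
```

If the proxy settings are already held in environment variables such as http_proxy, calling $ua->env_proxy loads them instead.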

Cookies

Sometimes you will want your program to store the cookies set by the web pages it retrieves. The LWP bundle provides an HTTP::Cookies module that will handle cookies for you. You need to use this module:

    use HTTP::Cookies;

And then set up a cookie_jar:

    $ua->cookie_jar(
        HTTP::Cookies->new(
            file => 'mycookies.txt',
            autosave => 1
        )
    );

LWP::UserAgent will now automatically store the cookies in the specified file, and the cookies will be available to future requests.
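
As a variation, cookie_jar also accepts a plain hash reference, which is passed on to HTTP::Cookies->new. The sketch below uses that form and adds the ignore_discard option, since cookies marked to be discarded at the end of a session are otherwise not written to the file. Whether you want such cookies saved is an assumption about your use case.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use LWP::UserAgent;

my $ua = LWP::UserAgent->new;

# A hash reference is handed to HTTP::Cookies->new for us
$ua->cookie_jar({
    file           => 'mycookies.txt',
    autosave       => 1,
    ignore_discard => 1,    # also save session cookies
});

# The jar is an ordinary HTTP::Cookies object
my $jar = $ua->cookie_jar;
print 'Cookie jar class: ', ref($jar), "\n";
```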

SSL

If you are requesting any URLs over SSL (for example, an https page) you will first need to install an appropriate SSL module. LWP has supported two: Crypt::SSLeay and IO::Socket::SSL. Older versions of LWP preferred Crypt::SSLeay; in LWP 6 and later, https support is provided by the separate LWP::Protocol::https distribution, which prefers IO::Socket::SSL. Once SSL support is installed, you can request SSL-encrypted URLs just like other URLs.
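
One way to find out whether SSL support is installed, sketched here under the assumption that you would rather check up front than handle a failed request, is to ask the user agent which protocols it supports. No request is made.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use LWP::UserAgent;

my $ua = LWP::UserAgent->new;

# is_protocol_supported is true when LWP can load a handler
# for the given scheme
my $http_ok  = $ua->is_protocol_supported('http')  ? 1 : 0;
my $https_ok = $ua->is_protocol_supported('https') ? 1 : 0;

print $https_ok
    ? "https URLs can be requested\n"
    : "No SSL support found; https requests would fail\n";
```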

Working example

Below is a working script that requests a URL and, if successful, prints the contents to standard output.

    #!/usr/bin/perl
    use strict;
    use warnings;
    use LWP::UserAgent;
    use HTTP::Request::Common qw(GET);
    use HTTP::Cookies;

    my $ua = LWP::UserAgent->new;

    # Define user agent type
    $ua->agent('Mozilla/8.0');

    # Cookies
    $ua->cookie_jar(
        HTTP::Cookies->new(
            file => 'mycookies.txt',
            autosave => 1
        )
    );

    # Request object
    my $req = GET 'http://www.perlmeme.org';

    # Make the request
    my $res = $ua->request($req);

    # Check the response
    if ($res->is_success) {
        print $res->content;
    } else {
        print $res->status_line . "\n";
    }

    exit 0;

See also

    perldoc LWP::Simple
    perldoc lwpcook
    perldoc LWP

Revision: 1.4