Perl

  • Regex...HTML::Parser...Getting webpage data?

    5 answers - 904 bytes

    I'm pretty new to Perl. My past experience has been in modifying other
    people's code to do what I want, but now I'm trying to write my own to
    do a specific task that I can't find code for, and I am having issues.
    I am trying to retrieve data from a webpage, say for example, the price
    of a 2006 1oz Silver American Eagle in the 20-99 price break quantity.
    Should I use Regex to do that or would I be better off with
    HTML::Parser? I've attempted Regex since I seem to understand it better
    but haven't had much success in getting it to pull the right price.
    HTML::Parser I understand even less than Regex, but I've read that it's
    a more reliable way of pulling webpage data. I can't seem to find
    "easy" to understand documentation on it though, so I'm even farther
    away from getting it to work than Regex. Any advice?
  • No.1 | | 2500 bytes | |

    Wesley Bresson wrote:
    > I am trying to retrieve data from a webpage [...] Should I use Regex
    > to do that or would I be better off with HTML::Parser? [...]

    Two Web questions in one day! It's hard to know exactly how you're going to
    write your code, Wesley, but the stuff below should be a good starter. It
    pulls in the web site and parses it using HTML::TreeBuilder. It looks for all
    table row <tr> elements that contain exactly five table data <td> elements,
    which is all the item details plus a few stragglers. The real item data has
    an item number in the format #9999 in the second <td> element, so ignore
    everything that's not like that. Finally the description and price are pulled
    from the relevant elements, and the numeric price value extracted with a
    regex. Everything that falls within your price bracket is then printed. I
    didn't restrict it to 2006 stuff as there weren't any at the time I wrote
    this, but it's easy to see how to do it I hope.

    HTH,

    Rob

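    use strict;
    use warnings;

    use LWP::Simple;
    use HTML::TreeBuilder;

    # URL of the price page (not preserved in this post)
    my $html = get '';
    die "Could not fetch the page\n" unless defined $html;

    my $tree = HTML::TreeBuilder->new_from_content($html);

    my @tr = $tree->find_by_tag_name('tr');

    foreach my $tr (@tr) {

        # Item rows have exactly five <td> cells
        my @td = $tr->find_by_tag_name('td');
        next unless @td == 5;

        # Cells 2, 3 and 5 hold the item number, description and price
        my ($number, $desc, $price) = map { $_->as_trimmed_text } @td[1, 2, 4];
        next unless $number =~ /#\d+/;

        # Pull the numeric part out of the price text
        my ($dollars) = $price =~ /\$([\d.]+)/;
        next unless defined $dollars and $dollars >= 20 and $dollars < 100;

        print $desc, "\n", $price, "\n\n";
    }

    # Free the parse tree
    $tree->delete;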
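
    The 2006 restriction mentioned above could be sketched roughly as follows.
    This is only an illustration: the sample rows, the item numbers, the prices
    and the helper name items_matching are all made up here, and the real page
    may lay its cells out differently. It reuses the same five-cell test and
    adds one more "next unless" on the description.

    ```perl
    use strict;
    use warnings;
    use HTML::TreeBuilder;

    # Hypothetical sample rows standing in for the real price page
    my $sample = join "\n",
        '<table>',
        '<tr><td>1</td><td>#1001</td><td>2005 1oz Silver American Eagle</td><td>20-99</td><td>$14.25</td></tr>',
        '<tr><td>2</td><td>#1002</td><td>2006 1oz Silver American Eagle</td><td>20-99</td><td>$14.75</td></tr>',
        '<tr><td>Item</td><td>Price</td></tr>',
        '</table>';

    # Return [description, price] pairs for item rows whose
    # description matches the regex $wanted (hypothetical helper)
    sub items_matching {
        my ($html, $wanted) = @_;

        my $tree = HTML::TreeBuilder->new_from_content($html);
        my @found;

        foreach my $tr ($tree->find_by_tag_name('tr')) {
            my @td = $tr->find_by_tag_name('td');
            next unless @td == 5;

            my ($number, $desc, $price) = map { $_->as_trimmed_text } @td[1, 2, 4];
            next unless $number =~ /#\d+/;
            next unless $desc =~ $wanted;    # the extra year filter

            push @found, [$desc, $price];
        }

        $tree->delete;
        return @found;
    }

    print "$_->[0]: $_->[1]\n" for items_matching($sample, qr/\b2006\b/);
    ```

    Passing the pattern in as a qr// value keeps the row-walking code the same
    whether you filter on a year, an item name, or anything else in the
    description.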
  • No.2 | | 1139 bytes | |

    08/03/2006 01:27 PM, Wesley Bresson wrote:
    > [...] I've read that it's a more reliable way of pulling webpage
    > data? I can't seem to find "easy" to understand documentation on it
    > though [...]

    Parsing HTML is not "easy." Correctly parsing HTML using
    regexes is impossible. HTML::Parser is probably your best chance.

    If you need a tutorial, search for one.
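
    For orientation, a minimal HTML::Parser sketch might look like the one
    below. It is not taken from any tutorial or from the site in question: the
    sample markup and the helper name td_texts are invented for illustration.
    It uses the version 3 handler interface, where you register callbacks for
    start tags, end tags and text, and here it simply collects the text found
    inside each <td> cell.

    ```perl
    use strict;
    use warnings;
    use HTML::Parser;

    # Collect the text of every <td> cell in $html (hypothetical helper)
    sub td_texts {
        my ($html) = @_;
        my @cells;
        my $in_td = 0;

        my $parser = HTML::Parser->new(
            api_version => 3,
            # Start tag: open a new, empty cell when a <td> begins
            start_h => [ sub {
                my ($tag) = @_;
                if ($tag eq 'td') { $in_td = 1; push @cells, '' }
            }, 'tagname' ],
            # End tag: stop collecting at </td>
            end_h => [ sub {
                my ($tag) = @_;
                $in_td = 0 if $tag eq 'td';
            }, 'tagname' ],
            # Text may arrive in several chunks, so append to the current cell
            text_h => [ sub {
                my ($text) = @_;
                $cells[-1] .= $text if $in_td;
            }, 'dtext' ],
        );

        $parser->parse($html);
        $parser->eof;
        return @cells;
    }

    # Hypothetical sample markup
    my @cells = td_texts(
        '<table><tr><td>#1002</td><td>2006 Silver Eagle</td><td>$14.75</td></tr></table>'
    );
    print join(' | ', @cells), "\n";
    ```

    HTML::Parser only reports events, so tracking which element you are in is
    up to you. HTML::TreeBuilder builds on it and hands back a tree instead.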
  • No.3 | | 2040 bytes | |


    Thanks for your example script using HTML::TreeBuilder. However, I'm trying
    to figure out why it appears to grab some items but not others. I've removed
    the $20-100 limitation (I didn't need it, I really just need to poll one
    item) but am still missing some of the items. For example, the most obvious
    are the two 1986-2006 eagles at the top of the page: the script grabs one but
    not the other. Any idea why? Does it have to do with it looking for the 5
    td's?
  • No.4 | | 334 bytes | |

    08/04/2006 02:25 PM, Wesley Bresson wrote:

    > Thanks for your example script using HTML::TreeBuilder, however I'm
    > trying to figure out why it appears to grab some items but not others.
    > [...]

    What appears to grab some items but not others? You didn't
    show anyone your program, so how can they comment on it?
  • No.5 | | 2128 bytes | |

    Wesley Bresson wrote:
    > [...] the script grabs one but not the other. Any idea why? Does it
    > have to do with it looking for the 5 td's?

    Hello Wesley.

    The script fails because the site is an appalling example of HTML and
    HTML::TreeBuilder cannot parse it successfully. There are many spurious closing
    tags without matching opening ones, as well as a lot of missing closing tags;
    the page as a whole simply doesn't hold together.

    I have managed to establish that the HTML tables containing the pricing
    information will parse on their own, so I offer this hack to get the information
    you need. It works by scanning the input and extracting just the pricing tables,
    then submitting these to HTML::TreeBuilder. It's not pretty but it will probably
    suffice for what you need. Please buy from these people: they need your money
    for better Web development staff!

    Cheers,

    Rob

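    use strict;
    use warnings;

    use LWP::Simple;
    use HTML::TreeBuilder;

    # URL of the price page (not preserved in this post)
    my $html = get '';
    die "Could not fetch the page\n" unless defined $html;

    my @newhtml;

    # Nesting depth inside a pricesTable table (0 when outside one)
    my $in_table = 0;

    foreach (split /\n/, $html) {

        next if /^\s*<!\s*$/;

        # Start counting at a pricesTable, and count tables nested inside it
        if (m%<table\b%) {
            $in_table++ if /"pricesTable"/ or $in_table;
        }

        # Keep every line of the table, up to and including its closing tag
        if ($in_table) {
            push @newhtml, $_;
            $in_table-- if m%</table\b%;
        }
    }

    my $tree = HTML::TreeBuilder->new_from_content(join '', @newhtml);

    my @table = $tree->look_down(_tag => 'table', id => 'pricesTable');

    foreach my $table (@table) {

        my @content = $table->content_list;

        foreach my $elem (@content) {
            next unless ref $elem;    # skip bare text nodes
            print $elem->as_trimmed_text, "\n";
        }
    }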
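
    Since only one item needs polling, the same idea can be narrowed down to a
    single row. The sketch below is an assumption-laden illustration, not the
    real page: the sample lines, item numbers, prices and the helper names
    price_tables and price_of are invented, and it assumes the price sits in
    the last cell of the row. It isolates the pricesTable lines with the same
    depth counter, then uses look_down with a code reference to find the first
    row whose text matches.

    ```perl
    use strict;
    use warnings;
    use HTML::TreeBuilder;

    # Hypothetical page lines, with stray and unclosed tags around the table
    my @page = (
        '</div></font>',
        '<table id="pricesTable">',
        '<tr><td>#1001</td><td>1986-2006 Silver Eagle Set</td><td>$310.00</td></tr>',
        '<tr><td>#1002</td><td>2006 1oz Silver American Eagle</td><td>$14.75</td></tr>',
        '</table>',
        '<p>unclosed paragraph',
    );

    # Keep only the lines that belong to a pricesTable table
    sub price_tables {
        my (@lines) = @_;
        my $depth = 0;
        my @kept;

        foreach my $line (@lines) {
            if ($line =~ m%<table\b%) {
                $depth++ if $line =~ /"pricesTable"/ or $depth;
            }
            if ($depth) {
                push @kept, $line;
                $depth-- if $line =~ m%</table\b%;
            }
        }
        return join "\n", @kept;
    }

    # Return the last cell's text from the first row matching $wanted, or undef
    sub price_of {
        my ($html, $wanted) = @_;

        my $tree = HTML::TreeBuilder->new_from_content($html);
        my $row  = $tree->look_down(_tag => 'tr', sub { $_[0]->as_text =~ $wanted });

        my $price;
        if ($row) {
            my @td = $row->find_by_tag_name('td');
            $price = $td[-1]->as_trimmed_text if @td;
        }

        $tree->delete;
        return $price;
    }

    my $price = price_of(price_tables(@page), qr/2006 1oz Silver American Eagle/);
    print defined $price ? "$price\n" : "not found\n";
    ```

    Note that as_text joins the cell texts with nothing between them, so the
    pattern should match on the description itself rather than rely on word
    boundaries at the cell edges.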
