Regex...HTML::Parser...Getting webpage data?

I'm pretty new to Perl. My past experience has been in modifying other
people's code to do what I want, but now I'm trying to write my own for a
specific task that I can't find code for, and I'm having issues. I am trying
to retrieve data from a webpage, say for example, the price of a 2006 1oz
Silver American Eagle in the 20-99 price break quantity. Should I use Regex
to do that or would I be better off with HTML::Parser? I've attempted Regex
since I seem to understand it better, but haven't had much success in getting
it to pull the right price. HTML::Parser I understand even less than Regex,
but I've read that it's a more reliable way of pulling webpage data. I can't
seem to find easy-to-understand documentation on it though, so I'm even
farther away from getting it to work than Regex. Any advice?
No.1
Wesley Bresson wrote:
> [...]
Two Web questions in one day! It's hard to know exactly how you're going to
write your code, Wesley, but the stuff below should be a good starter. It pulls
in the web site and parses it using HTML::TreeBuilder. It looks for all table
row <tr> elements that contain exactly five table data <td> elements, which is
all the item details plus a few stragglers. The real item data has an item
number in the format #9999 in the second <td> element, so ignore everything
that's not like that. Finally the description and price are pulled from the
relevant elements, and the numeric price value is extracted with a regex.
Everything that falls within your price bracket is then printed. I didn't
restrict it to 2006 stuff as there weren't any at the time I wrote this, but
it's easy to see how to do it, I hope.
HTH,
Rob
use strict;
use warnings;
use LWP::Simple;
use HTML::TreeBuilder;

my $html = get '';
die "Couldn't fetch the page\n" unless defined $html;
my $tree = HTML::TreeBuilder->new_from_content($html);

# Item rows are the <tr> elements holding exactly five <td> cells.
my @tr = $tree->find_by_tag_name('tr');
foreach my $tr (@tr) {
    my @td = $tr->find_by_tag_name('td');
    next unless @td == 5;
    my ($number, $desc, $price) = map { $_->as_trimmed_text } @td[1, 2, 4];
    # Real items carry an item number such as #9999 in the second cell.
    next unless $number =~ /#\d+/;
    my ($dollars) = $price =~ /\$([\d.]+)/;
    next unless defined $dollars and $dollars >= 20 and $dollars < 100;
    print $desc, "\n", $price, "\n\n";
}
$tree->delete;
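
The price-bracket check, and the 2006 restriction Rob leaves as an exercise, can be tried on made-up rows without touching the network. This is only a sketch: the sample rows, the in_bracket helper and the /\b2006\b/ description test are assumptions for illustration, not data or code from the site.

```perl
use strict;
use warnings;

# Hypothetical rows: [item number, description, price text].
my @rows = (
    [ '#1001', '2006 1oz Silver American Eagle',  '$14.25' ],
    [ '#1002', '2006 1/10oz Gold American Eagle', '$72.50' ],
    [ '#1003', '1986 1oz Silver American Eagle',  '$31.00' ],
    [ 'Item',  'Description',                     'Price'  ],
);

# True if the price text holds a dollar amount in the range [$min, $max).
sub in_bracket {
    my ($price, $min, $max) = @_;
    my ($dollars) = $price =~ /\$([\d.]+)/;
    return (defined($dollars) && $dollars >= $min && $dollars < $max) ? 1 : 0;
}

my @matches;
foreach my $row (@rows) {
    my ($number, $desc, $price) = @$row;
    next unless $number =~ /^#\d+$/;     # skip header and straggler rows
    next unless $desc =~ /\b2006\b/;     # keep only 2006 items
    next unless in_bracket($price, 20, 100);
    push @matches, $desc;
    print "$desc\n$price\n\n";
}
```

Keeping the numeric test in one small function makes it easy to drop the bracket later, which matters if only a single item needs polling.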
No.2
08/03/2006 01:27 PM, Wesley Bresson wrote:
> [...]
Parsing HTML is not "easy." Correctly parsing HTML using
regexes is impossible. HTML::Parser is probably your best chance.
If you need a tutorial, search for one.
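
For a feel of what HTML::Parser looks like in use, here is a minimal sketch that collects table cell text through event handlers. It assumes the HTML::Parser module is installed; the sample HTML and the variable names are made up for illustration and have nothing to do with the real page.

```perl
use strict;
use warnings;
use HTML::Parser;

# Made-up HTML standing in for a fetched page.
my $html = <<'HTML';
<table>
<tr><td>#1001</td><td>Sample item A</td><td>$14.25</td></tr>
<tr><td>#1002</td><td>Sample item B</td><td>$72.50</td></tr>
</table>
HTML

my (@rows, @cells);
my $in_td = 0;

my $parser = HTML::Parser->new(
    api_version => 3,
    # Called for each opening tag with its lower-cased name.
    start_h => [ sub {
        my ($tag) = @_;
        if    ($tag eq 'tr') { @cells = () }
        elsif ($tag eq 'td') { $in_td = 1; push @cells, '' }
    }, 'tagname' ],
    # Called for text; 'dtext' delivers it with entities decoded.
    text_h => [ sub {
        my ($text) = @_;
        $cells[-1] .= $text if $in_td;
    }, 'dtext' ],
    # Called for each closing tag.
    end_h => [ sub {
        my ($tag) = @_;
        if    ($tag eq 'td') { $in_td = 0 }
        elsif ($tag eq 'tr') { push @rows, [@cells] if @cells }
    }, 'tagname' ],
);

$parser->parse($html);
$parser->eof;

print join(' | ', @$_), "\n" foreach @rows;
```

The parser only reports events, so the script has to track where it is in the document itself. That bookkeeping is what the tree-building modules layered on top of HTML::Parser do for you.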
No.3
Thanks for your example script using HTML::TreeBuilder. However, I'm trying
to figure out why it appears to grab some items but not others. I've removed
the $20-100 limitation (I didn't need it, I really just need to poll one
item) but am still missing some of the items. The most obvious example is the
two 1986-2006 Eagles at the top of the page: the script grabs one but not the
other. Any idea why? Does it have to do with it looking for the five <td>
elements?
No.4
08/04/2006 02:25 PM, Wesley Bresson wrote:
> Thanks for your example script using HTML::TreeBuilder, however I'm
> trying to figure out why it appears to grab some items but not others.
> [...]
What appears to grab some items but not others? You didn't
show anyone your program, so how can they comment on it?
No.5
Wesley Bresson wrote:
> [...]
Hello Wesley.
The script fails because the site is an appalling example of HTML and
HTML::TreeBuilder cannot parse it successfully. There are many spurious closing
tags without matching opening ones, as well as a lot of missing closing tags;
the page as a whole simply doesn't hold together.
I have managed to establish that the HTML tables containing the pricing
information will parse on their own, so I offer this hack to get the information
you need. It works by scanning the input and extracting just the pricing tables,
then submitting these to HTML::TreeBuilder. It's not pretty but it will probably
suffice for what you need. Please buy from these people: they need your money
for better Web development staff!
Cheers,
Rob
use strict;
use warnings;
use LWP::Simple;
use HTML::TreeBuilder;

my $html = get '';
die "Couldn't fetch the page\n" unless defined $html;

# Keep only the lines that belong to the pricing tables.
my @newhtml;
my $in_table = 0;
foreach (split /\n/, $html) {
    next if /^\s*<!\s*$/;    # skip stray "<!" lines
    if (m%<table\b%) {
        $in_table++ if /"pricesTable"/ or $in_table;
    }
    if ($in_table) {
        push @newhtml, $_;
        $in_table-- if m%</table\b%;
    }
}

my $tree = HTML::TreeBuilder->new_from_content(join "\n", @newhtml);
my @table = $tree->look_down(_tag => 'table', id => 'pricesTable');
foreach my $table (@table) {
    foreach my $elem ($table->content_list) {
        next unless ref $elem;    # plain text nodes are not objects
        print $elem->as_trimmed_text, "\n";
    }
}
$tree->delete;
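
The line-scanning step can be checked in isolation on a small made-up page. The sketch below uses core Perl only; the sample HTML and the $depth and @kept names are assumptions for illustration, and like the script above it relies on each table tag sitting on its own line.

```perl
use strict;
use warnings;

# Made-up page: stray markup around one table carrying the "pricesTable" id.
my $html = <<'HTML';
</div></font>
<table id="layout"><tr><td>navigation</td></tr></table>
<table id="pricesTable">
<tr><td>#1001</td><td>$14.25</td></tr>
</table>
</b><p>footer
HTML

my @kept;
my $depth = 0;    # greater than zero while inside a pricing table
foreach my $line (split /\n/, $html) {
    # Start counting at a pricing table, and count tables nested inside it.
    $depth++ if $line =~ /<table\b/ and ($depth or $line =~ /"pricesTable"/);
    next unless $depth;
    push @kept, $line;
    $depth-- if $line =~ m{</table\b};
}

print "$_\n" foreach @kept;
```

Only the three lines of the pricing table survive, so the stray closing tags around it never reach the tree builder.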