15 Million RAW
14 answers - 589 bytes -

Hi,
What I have to do is:
1- Read a line from an input file
2- Validate the row (for example: is the second char == 2?)
3- Split the line
4- Write the validated and split row to an output file in a different
order (for example: the last 2 digits have to be written as the first 2 digits)
I have to loop over this for 15 MILLION lines!!!!
How long does Perl take?
The program I wrote takes 25 sec for 10,000 lines, which is too much.
Do you have any suggestions?
Do I have to think in a parallel way? (fork ?!?!)
Thanks
Lorenzo
No.1 | | 150 bytes |
| 
Lorenzo Caggioni wrote:
The program I wrote takes 25 sec for 10,000 lines, which is too much.
How quickly do you need it to run if 25 seconds is too long?
No.2 | | 394 bytes |
| 
Lorenzo Caggioni wrote:
Hi,
Hello,
What I have to do is:
1- Read a line from an input file
2- Validate the row (for example: is the second char == 2?)
3- Split the line
4- Write the validated and split row to an output file in a different
order (for example: the last 2 digits have to be written as the first 2 digits)
Have you done that? Where is it?
John
No.3 | | 2691 bytes |
| 
On Thu, 24 Nov 2005, Pierre Smolarek wrote:
Lorenzo Caggioni wrote:
The program I wrote takes 25 sec for 10,000 lines, which is too much.
How quickly do you need it to run if 25 seconds is too long?
If 10,000 lines take 25 seconds, you're doing 400 lines per second.
At that rate, 15,000,000 lines will take 37,500 seconds, or 10h25m.
While asking for a firmer definition for "faster" is a fair question,
it's fair to assume that he wants to do better than 10.4 hours :-)
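The estimate above can be reproduced with a few lines of Perl. This is only a sketch of the arithmetic using the figures quoted in the thread; the variable names are made up for illustration.

```perl
use strict;
use warnings;

# Figures quoted in the thread.
my $sample_lines   = 10_000;
my $sample_seconds = 25;
my $total_lines    = 15_000_000;

my $rate    = $sample_lines / $sample_seconds;   # lines per second
my $seconds = $total_lines / $rate;              # projected total run time
my $hours   = int($seconds / 3600);
my $minutes = int(($seconds % 3600) / 60);

# Prints: 400 lines/s, 37500 s total, about 10h25m
printf "%d lines/s, %d s total, about %dh%02dm\n",
    $rate, $seconds, $hours, $minutes;
```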
That said, the canned answer applies here. If the problem is --
1 Read a line from an input file
2 Validate the row (for example: is the second char == 2?)
3 Split the line
4 Write the validated and split row to an output file in a
different order (for example: the last 2 digits have to be written as
the first 2 digits)
-- then, in order to give *any* constructive advice, we need:
* to see the code in question
* to know if the code has been benchmarked
If we can't see the code, we can't possibly offer useful suggestions.
If we don't have benchmark info to know what part of the code is taking
so long, we can't even speculate as to where to start optimizing things.
One of the suggestions in Damian Conway's _Perl Best Practices_ is a
simple piece of advice: "Don't Optimize Code -- Benchmark It". For
details, look over this excerpt from the book:
It's sound advice. The book's next suggestion -- which I can't seem to
find a reference to online, so you're just going to have to find a copy
of the book itself -- is "Don't optimize data structures -- measure
them." This is also sound advice. If you use a module like Devel::Size
to determine how space is being allocated, you can get a better sense of
where you might be choking on data and, in turn, have a sense of where
you need to fix things.
Once you've used such tools to map out how your program is consuming
time and space, you can start making decisions about how to reduce that
consumption, by speeding up critical sections, reducing memory use, or
just throwing more RAM and CPU at the problem if you're starved there
and software optimizations seem like they might not be enough. But until
you've figured out where the time is being spent, or what system
resource is being exhausted, you can't properly address the problem.
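As a rough illustration of "figuring out where the time is being spent", the sketch below times the validate, split, and reorder steps separately with the core Time::HiRes module. The record layout, the sample data, and the `%elapsed` bookkeeping are all assumptions made up for this example; they are not taken from the original program.

```perl
use strict;
use warnings;
use Time::HiRes qw(gettimeofday tv_interval);

# Hypothetical sample data: 10,000 fixed-width records held in memory.
my @lines = map { sprintf "A2%06d%02d", $_, $_ % 100 } 1 .. 10_000;

my %elapsed = (validate => 0, split => 0, reorder => 0);
my @out;

for my $line (@lines) {
    my $t0 = [gettimeofday];
    my $ok = substr($line, 1, 1) eq '2';          # step 2: validate
    $elapsed{validate} += tv_interval($t0);
    next unless $ok;

    $t0 = [gettimeofday];
    my ($head, $tail) = unpack 'A8 A2', $line;    # step 3: split
    $elapsed{split} += tv_interval($t0);

    $t0 = [gettimeofday];
    push @out, $tail . $head;                     # step 4: last 2 chars first
    $elapsed{reorder} += tv_interval($t0);
}

# Report the accumulated time per step.
printf "%-8s %.4f s\n", $_, $elapsed{$_} for sort keys %elapsed;
```

Per-line timers add overhead of their own, so the numbers are only useful for comparing steps against each other. A real profiler gives a more complete picture.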
Really, you could do a whole lot worse than by just getting a copy of
_Perl Best Practices_ and using its advice to rewrite your program from
scratch. Almost everyone could improve their code this way :-)
No.4 | | 2892 bytes |
| 
Here's my two penn'orth.
Avoid regex. While it's powerful, it's also expensive.
Short but sweet
Gary
No.5 | | 580 bytes |
| 
On Fri, 25 Nov 2005, Gary Stainburn wrote:
Here's my two penn'orth.
Avoid regex. While it's powerful, it's also expensive.
Short but sweet
And useful!
Because we know that regular expressions are the problem here, right?
Err, wait, we haven't seen any code, or any benchmarks, so we don't.
Efficient regexes run efficiently.
Inefficient regexes run inefficiently.
Measuring can help identify potential problems.
But in this case, we don't even know if that's where the problem lies.
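One way to replace guessing with measurement is the core Benchmark module. The sketch below compares a regex check against a substr check for the "is the second char 2?" test from the original question. The sample record and the sub names are invented for illustration, and the results will vary by machine and Perl version.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

my $line = 'A2345678901234567890';   # made-up sample record

# Two candidate ways to test whether the second character is '2'.
sub check_regex  { return $line =~ /^.2/ ? 1 : 0 }
sub check_substr { return substr($line, 1, 1) eq '2' ? 1 : 0 }

# Run each candidate for at least one CPU second and print a comparison table.
cmpthese(-1, {
    regex  => \&check_regex,
    substr => \&check_substr,
});
```

Whichever variant wins here, the difference only matters if the profile shows that this check is where the program actually spends its time.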
No.6 | | 1040 bytes |
| 
Attached you can find the code and an input file to try it with.
I'm sorry if the code is not really commented and not very clear, but
I had to delete some lines because it is based on a database.
Now the program can run without any DB.
You can also find a profile of the program.
Thanks
Lorenzo
No.7 | | 1988 bytes |
| 
Lorenzo Caggioni on Friday, 25 November 2005 11:04:
Attached you can find the code and an input file to try it with.
I'm sorry if the code is not really commented and not very clear,
but I had to delete some lines because it is based on a database.
From a quick look at the code, I see optimization potential
(some changes may have quite an effect, others may not) in:
a) main::SplitRowByLength:
instead of substr, you could try (and benchmark) direct extraction of the fields
with a single regex along the lines of my @fields=$line=~/(.{1})(.{4})/;
unpack may be better; I don't have much experience with it.
b) in the top level while loop:
avoid the repeated eval (I can't see a purpose for it). I may have
overlooked something, but why
$xFieldValue = '($cdr[0]';
$xFieldValue .= ',\@cdr,\$cdrsline,\$dbh)';
eval ("".$xFieldValue);
instead of a simple
($cdr[0],\@cdr,\$cdrsline,\$dbh)
(where the ref to $dbh is unnecessary since it is an object, and $cdr[0]
could be replaced by a preceding my $cdr0=$cdr[0], then using $cdr0)
?
Then, store a value in a my variable first instead of repeating the same hash lookup
several times. E.g. $globalParameters{"FileFieldDelimiter"} is used many times.
c) generally
Avoid string interpolation where it is not necessary (hash keys, around
integers, to the left of '=>' etc.)
d) shorten some subs
sub fmtCurrencyCodeTEST {
my($xCurr) = "EUR";
return $xCurr;
}
=>
sub fmtCurrencyCodeTEST {'EUR'}
sub fmtTLGATTR2_int_natTEST {
my ($xServiceCode,$xInputCDR) = @_;
return $xInputCDR->[20];
}
=>
sub fmtTLGATTR2_int_natTEST {$_[1]->[20]}
etc.
e) fmtTLGConvertDateTEST
Here the many substr calls could be avoided.
Since I'm still a beginner, be careful with my advice.
Hopefully worth at least 2 cents,
joe
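To make suggestion a) concrete, here is a small sketch that extracts the same fixed-width fields with substr, with a single regex, and with unpack. The record layout (1-char type, 4-char year, 3-char currency) is invented for the example, since the attached program is not shown in the thread. The last lines also illustrate the "my variable instead of repeated hash lookup" idea from b), with a stand-in `%globalParameters`.

```perl
use strict;
use warnings;

my $line = 'X2005EUR';   # hypothetical record: type, year, currency

# substr: one call per field
my @by_substr = (substr($line, 0, 1), substr($line, 1, 4), substr($line, 5, 3));

# single regex with one capture group per field
my @by_regex = $line =~ /^(.{1})(.{4})(.{3})/;

# unpack with a fixed-width template ('A' strips trailing spaces, 'a' keeps them)
my @by_unpack = unpack 'A1 A4 A3', $line;

# Look the delimiter up once, outside any loop, instead of on every use.
my %globalParameters = (FileFieldDelimiter => '|');   # stand-in for the real hash
my $delim = $globalParameters{FileFieldDelimiter};

print join($delim, @by_unpack), "\n";   # X|2005|EUR
```

All three produce the same list here, so which one to keep is a question for a benchmark on realistic data.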
No.8 | | 2840 bytes |
| 
Making the subs shorter may help processing speed a little,
but it will make things a lot more difficult for the person who takes
over the maintenance. When you know what you are doing and why, the code is easy to
read, but when you get a big program written like that and are asked to
support it, you will go looking for the guy who wrote it and give him a
good old kick because of all the headaches he cost you.
So unless this is a very personal script that will never be handed over
to anyone, and your memory is good enough to remember what you are doing
where and why, please do not write subs like that unless you
are very good at documenting your code as you write it.
No.9 | | 2452 bytes |
| 
Rob Coops on Friday, 25 November 2005 14:13:
Making the subs shorter may help processing speed a little,
but it will make things a lot more difficult for the person who takes
over the maintenance. When you know what you are doing and why, the code is easy to
read, but when you get a big program written like that and are asked to
support it, you will go looking for the guy who wrote it and give him a
good old kick because of all the headaches he cost you.
So unless this is a very personal script that will never be handed over
to anyone, and your memory is good enough to remember what you are doing
where and why, please do not write subs like that unless you
are very good at documenting your code as you write it.
Hi Rob
[see inline]
On 11/25/05, John Doe <security.department (AT) tele2 (DOT) ch> wrote:
Lorenzo Caggioni on Friday, 25 November 2005 11:04:
Attached you can find the code and an input file to try it with.
[...]
d) shorten some subs
sub fmtCurrencyCodeTEST {
my($xCurr) = "EUR";
return $xCurr;
}
=>
sub fmtCurrencyCodeTEST {'EUR'}
sub fmtTLGATTR2_int_natTEST {
my ($xServiceCode,$xInputCDR) = @_;
return $xInputCDR->[20];
}
=>
sub fmtTLGATTR2_int_natTEST {$_[1]->[20]}
etc.
[...]
OK, making subs shorter with fewer local variables won't improve performance
significantly. That's why I listed it at the end :-)
Concerning maintainability, I don't see many problems in my examples, since
there is no obfuscation of algorithms and such, only direct access to the
arguments - the difference is not very big.
And of course a sub should be documented:
- purpose
- side effects
- parameter description
- description of the return values
- (etc.)
Compare:
# purpose: return currency
# in: --
# out: constant string 'EUR'
#
sub fmtCurrencyCodeTEST {
my($xCurr) = "EUR";
return $xCurr;
}
# purpose: return currency
# in: --
# out: constant string 'EUR'
#
sub fmtCurrencyCodeTEST {'EUR'}
In this example, you could even omit the comments, since the purpose of
the sub is obvious.
Have a look into the perl source; you will find lots of such examples.
greetings,
joe
No.10 | | 2994 bytes |
| 
I made some changes to the program (deleted the eval, adjusted the subs).
Now the program takes less than 3 sec, but it loses all its structure.
The main thing that increased performance was deleting the eval("fun name").
I did it that way because the name of the function is retrieved from a
database.
Is there another way to call a function whose name is retrieved from a variable?
Any suggestions?
Thanks
No.11 | | 935 bytes |
| 
On Nov 25, Lorenzo Caggioni said:
I made some changes to the program (deleted the eval, adjusted the subs).
Now the program takes less than 3 sec, but it loses all its structure.
The main thing that increased performance was deleting the eval("fun name").
I did it that way because the name of the function is retrieved from a
database.
Is there another way to call a function whose name is retrieved from a variable?
Yes, it's called a dispatch table:
my %functions = (
    abc => \&do_this,
    def => \&do_that,
    ghi => \&do_something_else,
);
Those \& things are REFERENCES to functions. So you do:
while (my @row = get_stuff_from_database()) {
# assuming $row[0] is abc or def or ghi
# that is, $row[0] holds the nickname of the function
my $code = $functions{$row[0]};
$code->(@arguments);
}
So when $row[0] is 'abc', we call do_this(). Etc.
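A self-contained version of the dispatch-table idea might look like the sketch below. The formatter subs, the table keys, and the `call_by_name` helper are hypothetical stand-ins, since the real function names come from a database that is not shown in the thread. The lookup also dies on an unknown name, which a bare eval of a string would not do as clearly.

```perl
use strict;
use warnings;

# Hypothetical formatter subs standing in for the real ones.
sub fmt_currency { return 'EUR' }
sub fmt_attr     { my ($cdr) = @_; return $cdr->[2] }

# Map the names stored in the database to code references.
my %dispatch = (
    fmtCurrency => \&fmt_currency,
    fmtAttr     => \&fmt_attr,
);

# Look the name up and call the sub, failing loudly on unknown names.
sub call_by_name {
    my ($name, @args) = @_;
    my $code = $dispatch{$name}
        or die "Unknown function name '$name'\n";
    return $code->(@args);
}

my @cdr = qw(a b c);
print call_by_name('fmtCurrency', \@cdr), "\n";   # EUR
print call_by_name('fmtAttr',     \@cdr), "\n";   # c
```

Because the table is built once, each call costs a hash lookup and a sub call rather than a fresh compile of a string, which fits the speedup reported after the eval was removed.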
No.12 | | 372 bytes |
| 
Lorenzo Caggioni:
Please don't toppost, and cut all the text that you don't react on.
Is there another way to call a function whose name is retrieved from a
variable?
If the set of functions is limited, use if:
if ('abc' eq $func) {
    abc();
} elsif ('def' eq $func) {
    def();
}
Or put them in a hash.
No.13 | | 2240 bytes |
| 
Chris Devers wrote:
One of the suggestions in Damian Conway's _Perl Best Practices_ is a
simple piece of advice: "Don't Optimize Code -- Benchmark It". For
details, look over this excerpt from the book:
It's sound advice. The book's next suggestion -- which I can't seem to
find a reference to online, so you're just going to have to find a copy
of the book itself -- is "Don't optimize data structures -- measure
them." This is also sound advice. If you use a module like Devel::Size
to determine how space is being allocated, you can get a better sense of
where you might be choking on data and, in turn, have a sense of where
you need to fix things.
There's also Profil (Devel::Profil) to find out where you are spending
those 25 seconds.
No.14 | | 1151 bytes |
| 
Lorenzo Caggioni wrote:
Attached you can find the code and an input file to try it with.
I'm sorry if the code is not really commented and not very clear, but
I had to delete some lines because it is based on a database.
Now the program can run without any DB.
You can also find a profile of the program.
Others have mentioned optimizations but I noticed a few errors:
89 if ($InvalidReason eq undef)
You cannot use the value undef in a comparison; that should be:
if ( ! defined $InvalidReason )
And:
311 @{$inputCDR_HASH{"0"}} = @{$xInputCDR} if $xInputCDR != undef;
which should be:
@{$inputCDR_HASH{"0"}} = @{$xInputCDR} if defined $xInputCDR;
And:
392 return $globalParameters{"GNV_INTERF_MDIFIER"}{"11"}{"NATTLG"} if $xServiceCode = 9510;
393 return $globalParameters{"GNV_INTERF_MDIFIER"}{"10"}{"INTTLG"} if $xServiceCode = 9520;
If you had warnings enabled then perl would have warned you that you are doing
an assignment (=) instead of a comparison (==). You should have these two lines at the
beginning of your program:
use warnings;
use strict;
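The two corrections can be tried in isolation with the short sketch below. The variable names and the value 9520 are made up for the example; only the `defined` test and the `==` comparison reflect the fixes described above.

```perl
use strict;
use warnings;

my $invalid_reason;          # left undefined on purpose
my $service_code = 9520;

# defined() tests for undef without triggering an "uninitialized" warning.
my $is_unset = defined($invalid_reason) ? 0 : 1;

# == compares numbers; a single = would assign 9510 and always be true.
my $is_national = ($service_code == 9510) ? 1 : 0;

print "unset=$is_unset national=$is_national\n";   # unset=1 national=0
```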
John