Perl

  • 15 Million RAW


    Hi,
    What I have to do is:
    1- Read a line from an input file
    2- Validate the row (for example: is the second char == 2?)
    3- Split the line
    4- Write the validated and split row to an output file in a different
    order (for example: the last 2 digits have to be written as the first 2 digits)
    I have to loop over 15 MILLION lines!
    How long should Perl take?
    The program I wrote takes 25 sec for 10,000 lines, which is too much.
    Do you have any suggestions?
    Do I have to think in a parallel way (fork?)
    Thanks
    Lorenzo
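
For reference, the four steps in the question can be sketched roughly as below. This is only an illustration under assumed conditions: the 7-character record with fields of width 1, 4 and 2, and the names `convert_line` and `convert_stream`, are hypothetical and do not come from the poster's program.

```perl
use strict;
use warnings;

# Hypothetical fixed-width layout: fields of 1, 4 and 2 characters.
my $TEMPLATE = 'a1 a4 a2';

# Returns the reordered record, or undef when the line fails validation.
sub convert_line {
    my ($line) = @_;
    return if length($line) != 7;            # wrong size for the assumed layout
    return if substr($line, 1, 1) ne '2';    # validate: second char must be '2'
    my ($flag, $body, $tail) = unpack $TEMPLATE, $line;   # split the line
    return join '', $tail, $flag, $body;     # last 2 chars are written first
}

# Streams $in to $out one line at a time, so memory use stays flat
# no matter how many lines the input holds.
sub convert_stream {
    my ($in, $out) = @_;
    my $written = 0;
    while (my $line = <$in>) {
        chomp $line;
        my $converted = convert_line($line);
        next unless defined $converted;      # skip rows that fail validation
        print {$out} $converted, "\n";
        $written++;
    }
    return $written;
}
```

Both subs take plain filehandles, so the same code works on real files opened with `open my $in, '<', $path` and on in-memory strings when testing.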
  • No.1 | | 150 bytes | |

    Lorenzo Caggioni wrote:
    > The program I wrote takes 25 sec for 10,000 lines, which is too much.

    How quickly do you need it to run, if 25 seconds is too long?
  • No.2 | | 394 bytes | |

    Lorenzo Caggioni wrote:
    > Hi,

    Hello,

    > What I have to do is:
    >
    > 1- Read a line from an input file
    > 2- Validate the row (for example: is the second char == 2?)
    > 3- Split the line
    > 4- Write the validated and split row to an output file in a different
    > order (for example: the last 2 digits have to be written as the first 2 digits)

    Have you done that? Where is it?

    John
  • No.3 | | 2691 bytes | |

    On Thu, 24 Nov 2005, Pierre Smolarek wrote:

    > Lorenzo Caggioni wrote:
    > > The program I wrote takes 25 sec for 10,000 lines, which is too much.
    >
    > How quickly do you need it to run, if 25 seconds is too long?

    If 10,000 lines take 25 seconds, you're doing 400 lines per second.

    At that rate, 15,000,000 lines will take 37,500 seconds, or 10h25m.

    While asking for a firmer definition for "faster" is a fair question,
    it's fair to assume that he wants to do better than 10.4 hours :-)

    That said, the canned answer applies here. If the problem is --

    1 Read a line from an input file
    2 Validate the row (for example: is the second char == 2?)
    3 Split the line
    4 Write the validated and split row to an output file in a
    different order (for example: the last 2 digits have to be written
    as the first 2 digits)

    -- then, in order to give *any* constructive advice, we need:

    * to see the code in question
    * to know if the code has been benchmarked

    If we can't see the code, we can't possibly offer useful suggestions.

    If we don't have benchmark info to know what part of the code is taking
    so long, we can't even speculate as to where to start optimizing things.

    One of the suggestions in Damian Conway's _Perl Best Practices_ is a
    simple piece of advice: "Don't optimize code -- benchmark it." For
    details, look over this excerpt from the book:

    It's sound advice. The book's next suggestion -- which I can't seem to
    find a reference to online, so you're just going to have to find a copy
    of the book itself -- is "Don't optimize data structures -- measure
    them." This is also sound advice. If you use a module like Devel::Size
    to determine how space is being allocated, you can get a better sense of
    where you might be choking on data and, in turn, have a sense of where
    you need to fix things.

    Once you've used such tools to map out how your program is consuming
    time and space, you can start making decisions about how to reduce that
    consumption, by speeding up critical sections, reducing memory use, or
    just throwing more RAM and CPU at the problem if you're starved there
    and software optimizations seem like they might not be enough. But until
    you've figured out where the time is being spent, or what system
    resource is being exhausted, you can't properly address the problem.

    Really, you could do a whole lot worse than by just getting a copy of
    _Perl Best Practices_ and using its advice to rewrite your program from
    scratch. Almost everyone could improve their code this way :-)
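
The extrapolation earlier in this post (400 lines per second, 37,500 seconds, 10h25m) can be reproduced with a small helper, and the same idea gives a first coarse measurement before reaching for a real profiler. A minimal sketch: the sub names `projected_seconds`, `as_hours_minutes` and `time_sample` are made up for illustration; only `Time::HiRes` is a real core module.

```perl
use strict;
use warnings;
use Time::HiRes ();

# Projects the total run time in seconds from a measured sample.
sub projected_seconds {
    my ($sample_lines, $sample_seconds, $total_lines) = @_;
    my $rate = $sample_lines / $sample_seconds;   # lines per second
    return $total_lines / $rate;
}

# Formats a number of seconds as "XhYYm" (seconds are truncated).
sub as_hours_minutes {
    my ($seconds) = @_;
    my $minutes = int($seconds / 60);
    return sprintf '%dh%02dm', int($minutes / 60), $minutes % 60;
}

# Runs a code reference once and returns the elapsed wall-clock seconds.
sub time_sample {
    my ($code) = @_;
    my $start = Time::HiRes::time();
    $code->();
    return Time::HiRes::time() - $start;
}

# The figures from the thread: 10,000 lines in 25 seconds, 15,000,000 lines in total.
my $total = projected_seconds(10_000, 25, 15_000_000);
print "$total seconds, or ", as_hours_minutes($total), "\n";
```

Wrapping one stage of the program at a time in `time_sample` (read, validate, split, write) is a crude way to see which stage dominates.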
  • No.4 | | 2892 bytes | |

    Here's my two penn'orth.

    Avoid regex. While it's powerful, it's also expensive.

    Short but sweet

    Gary

  • No.5 | | 580 bytes | |

    On Fri, 25 Nov 2005, Gary Stainburn wrote:

    > Here's my two penn'orth.
    >
    > Avoid regex. While it's powerful, it's also expensive.
    >
    > Short but sweet

    And useful!

    Because we know that regular expressions are the problem here, right?

    Err, wait, we haven't seen any code, or any benchmarks, so we don't.

    Efficient regexes run efficiently.

    Inefficient regexes run inefficiently.

    Measuring can help identify potential problems.

    But in this case, we don't even know if that's where the problem lies.
  • No.6 | | 1040 bytes | |

    Attached you can find the code and an input file to try it with.

    I'm sorry that the code is not really commented and not very clear, but
    I had to delete some lines because it is based on a database.

    Now the program can run without any DB.
    I have also attached a profile of the program.

    Thanks

    Lorenzo

  • No.7 | | 1988 bytes | |

    Lorenzo Caggioni wrote on Friday, 25 November 2005, 11:04:
    > Attached you can find the code and an input file to try it with.
    >
    > I'm sorry that the code is not really commented and not very clear,
    > but I had to delete some lines because it is based on a database.

    From a quick look at the code, I see optimization potential
    (some may have quite an effect, others may not) in:

    a) main::SplitRowByLength:

    Instead of substr, you could try and benchmark direct extraction of the fields
    with a single regex along the lines of

    my @fields = $line =~ /(.{1})(.{4})/;

    unpack may be better; I don't have much experience with it.

    b) in the top level while loop:

    Avoid the repeated eval (I can't see a purpose for it). I may have
    overlooked something, but why

    $xFieldValue = '($cdr[0]';
    $xFieldValue .= ',\@cdr,\$cdrsline,\$dbh)';
    eval ("".$xFieldValue);

    instead of a simple

    ($cdr[0],\@cdr,\$cdrsline,\$dbh)

    (where the ref to $dbh is unnecessary since it is an object, and $cdr[0]
    could be replaced by a preceding my $cdr0=$cdr[0] and then using $cdr0)

    ?

    Then, assign a hash value to a my variable once instead of repeating the same
    hash lookup several times. E.g. $globalParameters{"FileFieldDelimiter"} is
    used many times.

    c) generally

    Avoid string interpolation where it is not necessary (hash keys, around
    integers, left of '=>' etc.)

    d) shorten some subs

    sub fmtCurrencyCodeTEST {
    my($xCurr) = "EUR";
    return $xCurr;
    }
    =>
    sub fmtCurrencyCodeTEST {'EUR'}

    sub fmtTLGATTR2_int_natTEST {
    my ($xServiceCode,$xInputCDR) = @_;
    return $xInputCDR->[20];
    }
    =>
    sub fmtTLGATTR2_int_natTEST {$_[1]->[20]}

    etc.

    e) fmtTLGConvertDateTEST

    Here the many substr calls could be avoided.

    Since I'm still a beginner, be careful with my advice;
    hopefully it's worth at least 2 cents,

    joe
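
Point a) suggests benchmarking substr against a single regex and unpack. A rough way to do that with the core Benchmark module is sketched here; the 7-character record and the three `by_*` subs are assumptions for illustration, not the poster's SplitRowByLength.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Hypothetical 7-character record: fields of width 1, 4 and 2.
my $line = '12abc99';

# Three ways to cut the same fixed-width fields out of a line.
sub by_substr {
    my ($l) = @_;
    return (substr($l, 0, 1), substr($l, 1, 4), substr($l, 5, 2));
}

sub by_regex {
    my ($l) = @_;
    return $l =~ /\A(.{1})(.{4})(.{2})/s;   # list context returns the captures
}

sub by_unpack {
    my ($l) = @_;
    return unpack 'a1 a4 a2', $l;
}

# Prints a comparison table; a negative count means "run each
# candidate for at least that many CPU seconds".
sub compare_extractors {
    my ($count) = @_;
    cmpthese($count, {
        substr => sub { my @f = by_substr($line) },
        regex  => sub { my @f = by_regex($line) },
        unpack => sub { my @f = by_unpack($line) },
    });
    return;
}
```

Calling `compare_extractors(-1)` prints the rates and relative percentages. Which variant wins depends on the Perl version and the real record layout, so it is worth measuring on actual data rather than trusting a rule of thumb.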
  • No.8 | | 2840 bytes | |

    Making the subs shorter may help the processing speed a little,
    but it will make things a lot more difficult for the person who takes
    over the maintenance. When you know what you are doing and why, it is easy to
    read, but when you get a big program written like that and are asked to
    support it, you will go looking for the guy who wrote it and give him a
    good old kick for all the headaches he cost you.
    So unless this is a very personal script that will never be handed over
    to anyone, and your memory is good enough to remember what you are doing
    where and why, please do not write subs like that unless you
    are very good at documenting your code as you write it.

  • No.9 | | 2452 bytes | |

    Rob Coops wrote on Friday, 25 November 2005, 14:13:
    > Making the subs shorter may help the processing speed a little,
    > but it will make things a lot more difficult for the person who takes
    > over the maintenance. When you know what you are doing and why, it is easy to
    > read, but when you get a big program written like that and are asked to
    > support it, you will go looking for the guy who wrote it and give him a
    > good old kick for all the headaches he cost you.
    > So unless this is a very personal script that will never be handed over
    > to anyone, and your memory is good enough to remember what you are doing
    > where and why, please do not write subs like that unless you
    > are very good at documenting your code as you write it.

    Hi Rob

    [see inline]

    > On 11/25/05, John Doe <security.department (AT) tele2 (DOT) ch> wrote:
    > > Lorenzo Caggioni wrote on Friday, 25 November 2005, 11:04:
    > > > Attached you can find the code and an input file to try it with.
    > >
    > > [...]
    > > d) shorten some subs
    > >
    > > sub fmtCurrencyCodeTEST {
    > > my($xCurr) = "EUR";
    > > return $xCurr;
    > > }
    > > =>
    > > sub fmtCurrencyCodeTEST {'EUR'}
    > >
    > > sub fmtTLGATTR2_int_natTEST {
    > > my ($xServiceCode,$xInputCDR) = @_;
    > > return $xInputCDR->[20];
    > > }
    > > =>
    > > sub fmtTLGATTR2_int_natTEST {$_[1]->[20]}
    > >
    > > etc.
    > > [...]

    True, making subs shorter with fewer local variables won't improve performance
    significantly. That's why I listed it at the end :-)

    Concerning maintainability, I don't see much of a problem in my examples, since
    there is no obfuscation of algorithms and such, only direct access to the
    arguments - the difference is not very big.

    And of course a sub should be documented:
    - purpose
    - side effects
    - parameter description
    - description of the return values
    - (etc.)

    Compare:

    # purpose: return currency
    # in: --
    # out: constant string 'EUR'
    #
    sub fmtCurrencyCodeTEST {
    my($xCurr) = "EUR";
    return $xCurr;
    }

    # purpose: return currency
    # in: --
    # out: constant string 'EUR'
    #
    sub fmtCurrencyCodeTEST {'EUR'}

    In this example, you could even omit the comments, since the purpose of the
    sub is obvious.

    Have a look at the perl source; you will find lots of such examples.

    greetings,

    joe
  • No.10 | | 2994 bytes | |

    I made some changes to the program (deleted the eval, adjusted the subs).

    Now the program takes less than 3 sec, but it loses all its structure.

    The main thing that increased performance was deleting the eval("fun name").
    I did it that way because the name of the function is retrieved from a
    database.
    Is there another way to call a function whose name is retrieved from a variable?

    Any suggestions?

    Thanks

  • No.11 | | 935 bytes | |

    On Nov 25, Lorenzo Caggioni said:

    > I made some changes to the program (deleted the eval, adjusted the subs).
    >
    > Now the program takes less than 3 sec, but it loses all its structure.
    >
    > The main thing that increased performance was deleting the eval("fun name").
    > I did it that way because the name of the function is retrieved from a
    > database.
    > Is there another way to call a function whose name is retrieved from a variable?

    Yes, it's called a dispatch table:

    my %functions = (
        abc => \&do_this,
        def => \&do_that,
        ghi => \&do_something_else,
    );

    Those \& things are REFERENCES to functions. So you do:

    while (my @row = get_stuff_from_database()) {
        # assuming $row[0] is abc or def or ghi
        # that is, $row[0] holds the nickname of the function
        my $code = $functions{$row[0]};

        $code->(@arguments);
    }

    So when $row[0] is 'abc', we call do_this(). Etc.
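
A self-contained version of the dispatch-table idea, for anyone who wants to run it. The formatter names and the `call_formatter` wrapper are invented for this sketch (loosely modelled on the subs quoted earlier in the thread); the lookup-then-call pattern is the part that matters.

```perl
use strict;
use warnings;

# Hypothetical formatters standing in for the real fmt* subs.
sub fmt_currency { return 'EUR' }
sub fmt_field20  { my ($service_code, $cdr) = @_; return $cdr->[20] }

# Maps the name stored in the database to a code reference.
my %FORMATTER = (
    currency => \&fmt_currency,
    field20  => \&fmt_field20,
);

# Looks up the sub by name and calls it; an unknown name dies with a
# clear message instead of silently doing nothing.
sub call_formatter {
    my ($name, @args) = @_;
    my $code = $FORMATTER{$name}
        or die "Unknown formatter '$name'\n";
    return $code->(@args);
}

print call_formatter('currency'), "\n";                 # prints EUR
print call_formatter('field20', 9510, [0 .. 25]), "\n"; # prints 20
```

Unlike string eval, nothing is compiled at run time: each call costs one hash lookup, and only the subs listed in the table can ever be invoked, whatever name the database returns.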
  • No.12 | | 372 bytes | |

    Lorenzo Caggioni:

    Please don't top-post, and cut all the text that you aren't replying to.

    > Is there another way to call a function whose name is retrieved from a
    > variable?

    If the set of functions is limited, use if:

    if ('abc' eq $func) {
        abc();
    } elsif ('def' eq $func) {
        def();
    }

    Or put them in a hash.
  • No.13 | | 2240 bytes | |

    Chris Devers wrote:
    It's sound advice. The book's next suggestion -- which I can't seem to
    find a reference to online, so you're just going to have to find a copy
    of the book itself -- is "Don't optimize data structures -- measure
    them." This is also sound advice. If you use a module like Devel::Size
    to determine how space is being allocated, you can get a better sense of
    where you might be choking on data and, in turn, have a sense of where
    you need to fix things.

    There's also a profiler (Devel::Profile) to find out where you are spending
    those 25 seconds.
  • No.14 | | 1151 bytes | |

    Lorenzo Caggioni wrote:
    > Attached you can find the code and an input file to try it with.
    >
    > I'm sorry that the code is not really commented and not very clear, but
    > I had to delete some lines because it is based on a database.
    >
    > Now the program can run without any DB.
    > I have also attached a profile of the program.

    Others have mentioned optimizations, but I noticed a few errors:

    Line 89:

    if ($InvalidReason eq undef)

    You cannot use the value undef in a comparison; that should be:

    if ( ! defined $InvalidReason )

    And line 311:

    @{$inputCDR_HASH{"0"}} = @{$xInputCDR} if $xInputCDR != undef;

    which should be:

    @{$inputCDR_HASH{"0"}} = @{$xInputCDR} if defined $xInputCDR;

    And lines 392-393:

    return $globalParameters{"GNV_INTERF_MDIFIER"}{"11"}{"NATTLG"} if
    $xServiceCode = 9510;
    return $globalParameters{"GNV_INTERF_MDIFIER"}{"10"}{"INTTLG"} if
    $xServiceCode = 9520;

    If you had warnings enabled, then perl would have warned you that you are doing
    an assignment (=) instead of a comparison (==). You should have these two lines
    at the beginning of your program:

    use warnings;
    use strict;

    John
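
To make the two fixes concrete, here is a small sketch with both checks written correctly. The subs `reason_status` and `modifier_for` are hypothetical; the service codes 9510 and 9520 are from the lines quoted above, and the returned labels are placeholders for the real hash lookups.

```perl
use strict;
use warnings;

# 'defined' separates "no reason given" from an empty-string reason;
# comparing with 'eq undef' would treat both the same (and warn).
sub reason_status {
    my ($invalid_reason) = @_;
    return defined $invalid_reason ? 'invalid' : 'valid';
}

# '==' compares numbers; a single '=' here would assign 9510 to
# $service_code and make the first branch always true.
sub modifier_for {
    my ($service_code) = @_;
    return 'NATTLG' if $service_code == 9510;
    return 'INTTLG' if $service_code == 9520;
    return;    # undef in scalar context: no modifier for other codes
}
```

With the original single `=`, every call would return the first value regardless of the service code, which is the kind of bug `use warnings` reports as "Found = in conditional, should be ==".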
