Perl

NAVIGATION
CATEGORIES
REFERRENCE
LINKS
  • Comparing two files of 8million lines/rows ...

    5 answers - 2191 bytes - related search similar search Add To My Delicious Add To My Stumble Upon Add To My Google Mark Add To My Facebook Add To My Digg Add To My Reddit

    Hi all,
    I have two database tables, one is local and one is on a WAN. They are supposed
    to be in-sync but they at the moment, they are not. There are 8million+ plus
    rows on this table.
    I tried to do SELECT EMPN FRM EMP WHERE EMPN NT IN ( SELECT EMPN FRM
    EMP@DBWAN ), leave that running for hours and all thru the night and guess
    what, am still waiting for the result to come out
    So what I decided is I want to extract the records into a flat file and then
    write a Perl script to skim thru each line and check whether it exists on the
    other file. While there will be 8million+ lines, the file will not be big
    beacuse am only extracting one-column from the table.
    Does anyone have an existing Perl code that does a similar thing like this
    already? It will be much appreciated if you can send it to me and then I will
    just modify it accordingly.
    Example logic that I have is this:
    FILE1:
    MICKEY
    MINNIE
    DNALD
    GFY
    PLUT
    FILE2:
    MICKEY
    MINNIE
    DNALD
    GFY
    PLUT
    BUGS-BUNNY
    So search FILE1 for all line entries of FILE2 then output whoever does not exist
    into FILE3. So after running the script, I should have
    FILE3:
    BUGS-BUNNY
    What I currently have is that I read all of FILE2's entries into an array? Read
    FILE1 one line at a time using m/// and if there is no m///, print that to
    FILE3.
    It seems to work fine for 1000 lines of entries, but am not particularly sure
    how long will that take for 8million+ rows, not even sure if I can create an
    array to contain 8million+ plus rows, if I can't, then am back to doing this on
    the database instead. Another option am looking at is just to read FILE1 one
    line at a time and do a grep "$string_to_search" FILE2 but I do not know how to
    do a grep-like syntax against a file on Perl especially if the search string is
    a variable.
    Why I prefer using a script is so am not putting loads into the database not to
    mention that I can put more logic into the script than on the SQL statement.
    Any advise or other options will be very much appreciated Thanks in
    advance.
  • No.1 | | 2873 bytes | |

    <benbart (AT) orcon (DOT) net.nzwrote in message
    @webmail.slingshot.co.nz
    Hi all,

    Hello,

    I have two database tables, one is local and one is on a WAN. They are
    supposed
    to be in-sync but they at the moment, they are not. There are 8million+
    plus
    rows on this table.

    I tried to do SELECT EMPN FRM EMP WHERE EMPN NT IN ( SELECT EMPN FRM
    EMP@DBWAN ), leave that running for hours and all thru the night and guess
    what, am still waiting for the result to come out

    So what I decided is I want to extract the records into a flat file and
    then
    write a Perl script to skim thru each line and check whether it exists on
    the
    other file. While there will be 8million+ lines, the file will not be big
    beacuse am only extracting one-column from the table.

    Does anyone have an existing Perl code that does a similar thing like this
    already? It will be much appreciated if you can send it to me and then I
    will
    just modify it accordingly.

    Example logic that I have is this:

    FILE1:
    MICKEY
    MINNIE
    DNALD
    GFY
    PLUT

    FILE2:
    MICKEY
    MINNIE
    DNALD
    GFY
    PLUT
    BUGS-BUNNY

    This will work, based on you given data

    use strict;
    use warnings;

    my @file2 = qw/mickey minnie donald goofy pluto /;
    my @file1 = qw /mickey minnie donald goofy pluto bunny/;
    my %hash = map { $_ =undef } @file2;

    while(<@file1>) {
    unless(exists $hash{$_}) {
    print $_, "\n";
    }
    }

    output:
    bunny

    caveats:
    This will only print out the element that were present on file1 and were
    not on file2,
    and i'm also a beginner.

    So search FILE1 for all line entries of FILE2 then output whoever does not
    exist
    into FILE3. So after running the script, I should have

    FILE3:
    BUGS-BUNNY

    What I currently have is that I read all of FILE2's entries into an array?
    Read
    FILE1 one line at a time using m/// and if there is no m///, print that to
    FILE3.

    It seems to work fine for 1000 lines of entries, but am not particularly
    sure
    how long will that take for 8million+ rows, not even sure if I can create
    an
    array to contain 8million+ plus rows, if I can't, then am back to doing
    this on
    the database instead. Another option am looking at is just to read FILE1
    one
    line at a time and do a grep "$string_to_search" FILE2 but I do not know
    how to
    do a grep-like syntax against a file on Perl especially if the search
    string is
    a variable.

    Why I prefer using a script is so am not putting loads into the database
    not to
    mention that I can put more logic into the script than on the SQL
    statement.

    Any advise or other options will be very much appreciated Thanks in
    advance.
    --
    welcome, HTH.

    /joseph
  • No.2 | | 3197 bytes | |

    08/16/2006 04:35 PM, benbart (AT) orcon (DOT) net.nz wrote:
    Hi all,

    I have two database tables, one is local and one is on a WAN. They are supposed
    to be in-sync but they at the moment, they are not. There are 8million+ plus
    rows on this table.

    I tried to do SELECT EMPN FRM EMP WHERE EMPN NT IN ( SELECT EMPN FRM
    EMP@DBWAN ), leave that running for hours and all thru the night and guess
    what, am still waiting for the result to come out

    So what I decided is I want to extract the records into a flat file and then
    write a Perl script to skim thru each line and check whether it exists on the
    other file. While there will be 8million+ lines, the file will not be big
    beacuse am only extracting one-column from the table.

    Does anyone have an existing Perl code that does a similar thing like this
    already? It will be much appreciated if you can send it to me and then I will
    just modify it accordingly.

    Example logic that I have is this:

    FILE1:
    MICKEY
    MINNIE
    DNALD
    GFY
    PLUT

    FILE2:
    MICKEY
    MINNIE
    DNALD
    GFY
    PLUT
    BUGS-BUNNY

    So search FILE1 for all line entries of FILE2 then output whoever does not exist
    into FILE3. So after running the script, I should have

    FILE3:
    BUGS-BUNNY

    What I currently have is that I read all of FILE2's entries into an array? Read
    FILE1 one line at a time using m/// and if there is no m///, print that to
    FILE3.

    It seems to work fine for 1000 lines of entries, but am not particularly sure
    how long will that take for 8million+ rows, not even sure if I can create an
    array to contain 8million+ plus rows, if I can't, then am back to doing this on
    the database instead. Another option am looking at is just to read FILE1 one
    line at a time and do a grep "$string_to_search" FILE2 but I do not know how to
    do a grep-like syntax against a file on Perl especially if the search string is
    a variable.

    Why I prefer using a script is so am not putting loads into the database not to
    mention that I can put more logic into the script than on the SQL statement.

    Any advise or other options will be very much appreciated Thanks in
    advance.

    , you want to do this as efficiently as possible
    because of the humongous size of the data. Hashes are the
    fastest structures for letting you know if some data has
    already been seen.

    Your problem is core. Do you have enough core memory to read
    all of the data one of the table columns into memory? If so,
    then the solution be almost trivial; if not, then it's
    probably not trivial, but also not hard.

    My advice is to attempt to suck the entire column for one
    table into memory; FILE1 should become hash keys (with empty
    values). Then you would open FILE2, loop through the lines
    (records) and output any record that does not appear in the hash.

    However, if you can't get the entire FILE1 into memory, then
    I'd suggest converting FILE1 into a berkeley database and
    using DB_File to tie it to a hash; from there on, the solution
    would be like the above.

    HTH
  • No.3 | | 3858 bytes | |

    benbart (AT) orcon (DOT) net.nz wrote:
    >

    I have two database tables, one is local and one is on a WAN. They are
    supposed to be in-sync but they at the moment, they are not. There are
    8million+ plus rows on this table.
    >

    I tried to do SELECT EMPN FRM EMP WHERE EMPN NT IN ( SELECT EMPN FRM
    EMP@DBWAN ), leave that running for hours and all thru the night and guess
    what, am still waiting for the result to come out
    >

    So what I decided is I want to extract the records into a flat file and then
    write a Perl script to skim thru each line and check whether it exists on the
    other file. While there will be 8million+ lines, the file will not be big
    beacuse am only extracting one-column from the table.
    >

    Does anyone have an existing Perl code that does a similar thing like this
    already? It will be much appreciated if you can send it to me and then I will
    just modify it accordingly.
    >

    Example logic that I have is this:
    >

    FILE1:
    MICKEY
    MINNIE
    DNALD
    GFY
    PLUT
    >

    FILE2:
    MICKEY
    MINNIE
    DNALD
    GFY
    PLUT
    BUGS-BUNNY
    >

    So search FILE1 for all line entries of FILE2 then output whoever does not
    exist into FILE3. So after running the script, I should have
    >

    FILE3:
    BUGS-BUNNY
    >

    What I currently have is that I read all of FILE2's entries into an array?
    Read FILE1 one line at a time using m/// and if there is no m///, print that
    to FILE3.
    >

    It seems to work fine for 1000 lines of entries, but am not particularly sure
    how long will that take for 8million+ rows, not even sure if I can create an
    array to contain 8million+ plus rows, if I can't, then am back to doing this
    on the database instead. Another option am looking at is just to read FILE1
    one line at a time and do a grep "$string_to_search" FILE2 but I do not know
    how to do a grep-like syntax against a file on Perl especially if the search
    string is a variable.
    >

    Why I prefer using a script is so am not putting loads into the database not
    to mention that I can put more logic into the script than on the SQL
    statement.
    >

    Any advise or other options will be very much appreciated Thanks in
    advance.

    No need for flat files if you are using DBI. Read the EMPN values from
    EMP@DBWAN first as it is a) slower, being on the network, and b) a shorter list,
    and put these values into a Perl hash. 8 million hash values are unlikely to
    take more than about 300MB of memory I would guess, which should be fine on any
    recent PC. Fetching one value at a time will prevent there being two copies of
    the data in memory at once (an array of retrieved values and the derived hash).

    Then just read each EMPN from EMP and print it out if it isn't in the hash.
    This code fragment may help.
    Cheers,

    Rob

    use strict;
    use warnings;

    use DBI;

    my ($dsn, $user, $pass); # Assign these as appropriate
    my $dbh = DBI->connect($dsn, $user, $pass);

    my %employee;

    my $sth = $dbh->prepare('SELECT EMPN FRM EMP@DBWAN');
    $sth->execute;
    while (my ($empno) = $sth->fetchrow_array) {
    $employee{$empno}++;
    }

    $sth = $dbh->prepare('SELECT EMPN FRM EMP');
    $sth->execute;
    while (my ($empno) = $sth->fetchrow_array) {
    print "$empno\n" unless $employee{$empno};
    }
  • No.4 | | 797 bytes | |

    Just an idea, don't know whether it's useful
    If you can get both files sorted (either by adding order to your sql
    query that generates the file, or the commandline 'sort') the problem
    becomes much more easy.
    You'd just have to traverse each file, something like this:

    read word_1 from file_1
    read word_2 from file_2

    while not at the end of the file

    if word_1 < word_2
    print: word_1 is not in file_2 !
    read next word_1 from file_1

    elseif word_2 < word_1
    print: word_2 is not in file_1 !
    read next word_2 from file_2

    else
    read next word_1 from file_1
    read next word_2 from file_2

    end while

    Well, the stopping criterium is not quite correct yet, but you'll get my
    point.
  • No.5 | | 407 bytes | |

    benbart (AT) orcon (DOT) net.nz schreef:

    SELECT EMPN FRM EMP WHERE EMPN NT IN ( SELECT EMPN
    FRM EMP@DBWAN )

    maybe use something like
    select EMPN as E1 from EMP left join EMP@DBWAN as E2 on E1.EMPN =
    E2.EMPN where E2.EMPN IS NULL
    (untested)

    So search FILE1 for all line entries of FILE2 then output whoever
    does not exist

    Look into `sort` and `diff` and `uniq -u`.

Re: Comparing two files of 8million lines/rows ...


max 4000 letters.
Your nickname that display:
In order to stop the spam: 3 + 2 =
QUESTION ON "Perl"

EMSDN.COM