Comparing two files of 8million lines/rows ...
5 answers - 2191 bytes -

Hi all,
I have two database tables, one is local and one is on a WAN. They are supposed
to be in-sync but they at the moment, they are not. There are 8million+ plus
rows on this table.
I tried to do SELECT EMPN FRM EMP WHERE EMPN NT IN ( SELECT EMPN FRM
EMP@DBWAN ), leave that running for hours and all thru the night and guess
what, am still waiting for the result to come out
So what I decided is I want to extract the records into a flat file and then
write a Perl script to skim thru each line and check whether it exists on the
other file. While there will be 8million+ lines, the file will not be big
beacuse am only extracting one-column from the table.
Does anyone have an existing Perl code that does a similar thing like this
already? It will be much appreciated if you can send it to me and then I will
just modify it accordingly.
Example logic that I have is this:
FILE1:
MICKEY
MINNIE
DNALD
GFY
PLUT
FILE2:
MICKEY
MINNIE
DNALD
GFY
PLUT
BUGS-BUNNY
So search FILE1 for all line entries of FILE2 then output whoever does not exist
into FILE3. So after running the script, I should have
FILE3:
BUGS-BUNNY
What I currently have is that I read all of FILE2's entries into an array? Read
FILE1 one line at a time using m/// and if there is no m///, print that to
FILE3.
It seems to work fine for 1000 lines of entries, but am not particularly sure
how long will that take for 8million+ rows, not even sure if I can create an
array to contain 8million+ plus rows, if I can't, then am back to doing this on
the database instead. Another option am looking at is just to read FILE1 one
line at a time and do a grep "$string_to_search" FILE2 but I do not know how to
do a grep-like syntax against a file on Perl especially if the search string is
a variable.
Why I prefer using a script is so am not putting loads into the database not to
mention that I can put more logic into the script than on the SQL statement.
Any advise or other options will be very much appreciated Thanks in
advance.
No.1 | | 2873 bytes |
| 
<benbart (AT) orcon (DOT) net.nzwrote in message
@webmail.slingshot.co.nz
Hi all,
Hello,
I have two database tables, one is local and one is on a WAN. They are
supposed
to be in-sync but they at the moment, they are not. There are 8million+
plus
rows on this table.
I tried to do SELECT EMPN FRM EMP WHERE EMPN NT IN ( SELECT EMPN FRM
EMP@DBWAN ), leave that running for hours and all thru the night and guess
what, am still waiting for the result to come out
So what I decided is I want to extract the records into a flat file and
then
write a Perl script to skim thru each line and check whether it exists on
the
other file. While there will be 8million+ lines, the file will not be big
beacuse am only extracting one-column from the table.
Does anyone have an existing Perl code that does a similar thing like this
already? It will be much appreciated if you can send it to me and then I
will
just modify it accordingly.
Example logic that I have is this:
FILE1:
MICKEY
MINNIE
DNALD
GFY
PLUT
FILE2:
MICKEY
MINNIE
DNALD
GFY
PLUT
BUGS-BUNNY
This will work, based on you given data
use strict;
use warnings;
my @file2 = qw/mickey minnie donald goofy pluto /;
my @file1 = qw /mickey minnie donald goofy pluto bunny/;
my %hash = map { $_ =undef } @file2;
while(<@file1>) {
unless(exists $hash{$_}) {
print $_, "\n";
}
}
output:
bunny
caveats:
This will only print out the element that were present on file1 and were
not on file2,
and i'm also a beginner.
So search FILE1 for all line entries of FILE2 then output whoever does not
exist
into FILE3. So after running the script, I should have
FILE3:
BUGS-BUNNY
What I currently have is that I read all of FILE2's entries into an array?
Read
FILE1 one line at a time using m/// and if there is no m///, print that to
FILE3.
It seems to work fine for 1000 lines of entries, but am not particularly
sure
how long will that take for 8million+ rows, not even sure if I can create
an
array to contain 8million+ plus rows, if I can't, then am back to doing
this on
the database instead. Another option am looking at is just to read FILE1
one
line at a time and do a grep "$string_to_search" FILE2 but I do not know
how to
do a grep-like syntax against a file on Perl especially if the search
string is
a variable.
Why I prefer using a script is so am not putting loads into the database
not to
mention that I can put more logic into the script than on the SQL
statement.
Any advise or other options will be very much appreciated Thanks in
advance.
--
welcome, HTH.
/joseph
No.2 | | 3197 bytes |
| 
08/16/2006 04:35 PM, benbart (AT) orcon (DOT) net.nz wrote:
Hi all,
I have two database tables, one is local and one is on a WAN. They are supposed
to be in-sync but they at the moment, they are not. There are 8million+ plus
rows on this table.
I tried to do SELECT EMPN FRM EMP WHERE EMPN NT IN ( SELECT EMPN FRM
EMP@DBWAN ), leave that running for hours and all thru the night and guess
what, am still waiting for the result to come out
So what I decided is I want to extract the records into a flat file and then
write a Perl script to skim thru each line and check whether it exists on the
other file. While there will be 8million+ lines, the file will not be big
beacuse am only extracting one-column from the table.
Does anyone have an existing Perl code that does a similar thing like this
already? It will be much appreciated if you can send it to me and then I will
just modify it accordingly.
Example logic that I have is this:
FILE1:
MICKEY
MINNIE
DNALD
GFY
PLUT
FILE2:
MICKEY
MINNIE
DNALD
GFY
PLUT
BUGS-BUNNY
So search FILE1 for all line entries of FILE2 then output whoever does not exist
into FILE3. So after running the script, I should have
FILE3:
BUGS-BUNNY
What I currently have is that I read all of FILE2's entries into an array? Read
FILE1 one line at a time using m/// and if there is no m///, print that to
FILE3.
It seems to work fine for 1000 lines of entries, but am not particularly sure
how long will that take for 8million+ rows, not even sure if I can create an
array to contain 8million+ plus rows, if I can't, then am back to doing this on
the database instead. Another option am looking at is just to read FILE1 one
line at a time and do a grep "$string_to_search" FILE2 but I do not know how to
do a grep-like syntax against a file on Perl especially if the search string is
a variable.
Why I prefer using a script is so am not putting loads into the database not to
mention that I can put more logic into the script than on the SQL statement.
Any advise or other options will be very much appreciated Thanks in
advance.
, you want to do this as efficiently as possible
because of the humongous size of the data. Hashes are the
fastest structures for letting you know if some data has
already been seen.
Your problem is core. Do you have enough core memory to read
all of the data one of the table columns into memory? If so,
then the solution be almost trivial; if not, then it's
probably not trivial, but also not hard.
My advice is to attempt to suck the entire column for one
table into memory; FILE1 should become hash keys (with empty
values). Then you would open FILE2, loop through the lines
(records) and output any record that does not appear in the hash.
However, if you can't get the entire FILE1 into memory, then
I'd suggest converting FILE1 into a berkeley database and
using DB_File to tie it to a hash; from there on, the solution
would be like the above.
HTH
No.3 | | 3858 bytes |
| 
benbart (AT) orcon (DOT) net.nz wrote:
>
I have two database tables, one is local and one is on a WAN. They are
supposed to be in-sync but they at the moment, they are not. There are
8million+ plus rows on this table.
>
I tried to do SELECT EMPN FRM EMP WHERE EMPN NT IN ( SELECT EMPN FRM
EMP@DBWAN ), leave that running for hours and all thru the night and guess
what, am still waiting for the result to come out
>
So what I decided is I want to extract the records into a flat file and then
write a Perl script to skim thru each line and check whether it exists on the
other file. While there will be 8million+ lines, the file will not be big
beacuse am only extracting one-column from the table.
>
Does anyone have an existing Perl code that does a similar thing like this
already? It will be much appreciated if you can send it to me and then I will
just modify it accordingly.
>
Example logic that I have is this:
>
FILE1:
MICKEY
MINNIE
DNALD
GFY
PLUT
>
FILE2:
MICKEY
MINNIE
DNALD
GFY
PLUT
BUGS-BUNNY
>
So search FILE1 for all line entries of FILE2 then output whoever does not
exist into FILE3. So after running the script, I should have
>
FILE3:
BUGS-BUNNY
>
What I currently have is that I read all of FILE2's entries into an array?
Read FILE1 one line at a time using m/// and if there is no m///, print that
to FILE3.
>
It seems to work fine for 1000 lines of entries, but am not particularly sure
how long will that take for 8million+ rows, not even sure if I can create an
array to contain 8million+ plus rows, if I can't, then am back to doing this
on the database instead. Another option am looking at is just to read FILE1
one line at a time and do a grep "$string_to_search" FILE2 but I do not know
how to do a grep-like syntax against a file on Perl especially if the search
string is a variable.
>
Why I prefer using a script is so am not putting loads into the database not
to mention that I can put more logic into the script than on the SQL
statement.
>
Any advise or other options will be very much appreciated Thanks in
advance.
No need for flat files if you are using DBI. Read the EMPN values from
EMP@DBWAN first as it is a) slower, being on the network, and b) a shorter list,
and put these values into a Perl hash. 8 million hash values are unlikely to
take more than about 300MB of memory I would guess, which should be fine on any
recent PC. Fetching one value at a time will prevent there being two copies of
the data in memory at once (an array of retrieved values and the derived hash).
Then just read each EMPN from EMP and print it out if it isn't in the hash.
This code fragment may help.
Cheers,
Rob
use strict;
use warnings;
use DBI;
my ($dsn, $user, $pass); # Assign these as appropriate
my $dbh = DBI->connect($dsn, $user, $pass);
my %employee;
my $sth = $dbh->prepare('SELECT EMPN FRM EMP@DBWAN');
$sth->execute;
while (my ($empno) = $sth->fetchrow_array) {
$employee{$empno}++;
}
$sth = $dbh->prepare('SELECT EMPN FRM EMP');
$sth->execute;
while (my ($empno) = $sth->fetchrow_array) {
print "$empno\n" unless $employee{$empno};
}
No.4 | | 797 bytes |
| 
Just an idea, don't know whether it's useful
If you can get both files sorted (either by adding order to your sql
query that generates the file, or the commandline 'sort') the problem
becomes much more easy.
You'd just have to traverse each file, something like this:
read word_1 from file_1
read word_2 from file_2
while not at the end of the file
if word_1 < word_2
print: word_1 is not in file_2 !
read next word_1 from file_1
elseif word_2 < word_1
print: word_2 is not in file_1 !
read next word_2 from file_2
else
read next word_1 from file_1
read next word_2 from file_2
end while
Well, the stopping criterium is not quite correct yet, but you'll get my
point.
No.5 | | 407 bytes |
| 
benbart (AT) orcon (DOT) net.nz schreef:
SELECT EMPN FRM EMP WHERE EMPN NT IN ( SELECT EMPN
FRM EMP@DBWAN )
maybe use something like
select EMPN as E1 from EMP left join EMP@DBWAN as E2 on E1.EMPN =
E2.EMPN where E2.EMPN IS NULL
(untested)
So search FILE1 for all line entries of FILE2 then output whoever
does not exist
Look into `sort` and `diff` and `uniq -u`.