Converting a big flat file
7 answers - 1014 bytes -

Hi,
I am new to the list and a newbie in Perl.
I have a big flat file (100G). The file was supposed to be a single line,
but it is broken into many records (it contains ^M). There are also ^@ and
tabs in between.
I want to first replace the control characters and tabs with spaces.
I tried s/[[:cntrl:]\t]/ /g. After replacing those characters with spaces,
I have to insert \n after every 1000th character.
But the program hangs after reading about 24G (about a quarter of the file).
I thought of reading the file character by character and checking whether
the character is ^M, ^@ or \t. If it is, write a space in its place;
otherwise write the character unchanged. I would keep a count of characters
so as to insert \n after every 1000th character.
Will that work, or is there another (simpler) way to do this? (Or should
I just move on to C?)
I am not sure why my first program hung (I ran it on a machine with
2G RAM).
Regds,
SK
No.1 | | 1310 bytes |
| 
Saravana Kumar wrote:
> Hi,

Hello,

> I am new to the list and a newbie in Perl.
> I have a big flat file (100G). The file was supposed to be a single line,
> but it is broken into many records (it contains ^M). There are also ^@ and
> tabs in between.
> I want to first replace the control characters and tabs with spaces.
> I tried s/[[:cntrl:]\t]/ /g.

The [:cntrl:] character class includes the "\t" character.

> After replacing those characters with spaces,
> I have to insert \n after every 1000th character.
> But the program hangs after reading about 24G (about a quarter of the file).
> I thought of reading the file character by character and checking whether
> the character is ^M, ^@ or \t. If it is, write a space in its place;
> otherwise write the character unchanged. I would keep a count of characters
> so as to insert \n after every 1000th character.
> Will that work, or is there another (simpler) way to do this? (Or should
> I just move on to C?)
> I am not sure why my first program hung (I ran it on a machine with
> 2G RAM).

You can do what you want if you set the Input Record Separator to read 1000
bytes at a time:

$/ = \1000;
while ( <FILE> ) {
    s/[[:cntrl:]]/ /g;
    print "$_\n";
}
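
For reference, here is a fuller sketch of the same fixed-size-record loop with the file handling spelled out. The sub name reflow_file and its three arguments are made up for illustration and are not part of the original answer. The sketch assumes the data can be treated as raw bytes and that a separate output file is acceptable.

```perl
use strict;
use warnings;

# Hypothetical helper: copy $in_path to $out_path, turning control
# characters into spaces and ending each $width-byte record with "\n".
sub reflow_file {
    my ($in_path, $out_path, $width) = @_;

    open my $in,  '<', $in_path  or die "Cannot read $in_path: $!";
    open my $out, '>', $out_path or die "Cannot write $out_path: $!";
    binmode $in;     # raw bytes, so ^M and ^@ arrive untouched
    binmode $out;

    # A reference to an integer makes <$in> return fixed-size records,
    # so only about $width bytes are held in memory at a time.
    local $/ = \$width;

    while ( my $record = <$in> ) {
        $record =~ s/[[:cntrl:]]/ /g;   # [:cntrl:] already covers "\t"
        print {$out} $record, "\n";
    }

    close $in;
    close $out or die "Cannot close $out_path: $!";
}
```

A call such as reflow_file('big.dat', 'big.fixed', 1000) would correspond to the loop above; both file names are placeholders. The last record may be shorter than the others if the file size is not a multiple of the width.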
John
No.2 | | 1930 bytes |
| 
John W. Krahn wrote:
> You can do what you want if you set the Input Record Separator to read
> 1000 bytes at a time:
>
> $/ = \1000;
> while ( <FILE> ) {
>     s/[[:cntrl:]]/ /g;
>     print "$_\n";
> }
>
> John
Thanks John. That did the trick. I ran the above script with my input file
and redirected the output to another file. Since it creates a new file,
I was wondering whether I can make the changes in the same file, i.e., read
1000 characters, do the replacement and write the output back to the same
file. This would reduce the disk space used (since the file I have is 100G).
TIA,
SK
No.3 | | 3151 bytes |
| 
Saravana Kumar wrote:
> Thanks John. That did the trick. I ran the above script with my input file
> and redirected the output to another file. Since it creates a new file,
> I was wondering whether I can make the changes in the same file, i.e., read
> 1000 characters, do the replacement and write the output back to the same
> file. This would reduce the disk space used (since the file I have is 100G).
That is like preparing an apple pie while it is in the oven to save on kitchen
space. You can't easily do it because each of your new records is one character
longer than the original record and you would be overwriting data you hadn't
processed yet. It is possible, in the sense that you could make sure that all
the data is read from the file and held elsewhere (in memory or in a temporary
file) before it is overwritten, but it wouldn't be a simple piece of code to get
working correctly. In any case it is a bad idea because if you have a failure of
any sort part-way through processing then your original data is then lost and
you have no second chance. If the people you are working for expect to have
files of this size and haven't allowed for storage space for several of them at
once then you need to have a word with them about storage planning. You need a
new disk drive: $100 will buy you around 300GB these days and that doesn't buy
enough of your time to write clever software to cope with the lack of disk
space.
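
The point about overwriting unprocessed data can be put in numbers. The sketch below is only an illustration of that arithmetic; the sub name inplace_overrun is invented here. It assumes 1000-byte records that each grow by one byte and a writer that follows directly behind the reader with no buffering.

```perl
use strict;
use warnings;

# Hypothetical illustration: how far an in-place writer would run ahead of
# the reader after $records records of $width bytes, when each record grows
# by one byte (the added "\n").
sub inplace_overrun {
    my ($records, $width) = @_;
    my $read_pos  = $records * $width;         # bytes consumed so far
    my $write_pos = $records * ($width + 1);   # bytes that must be written
    return $write_pos - $read_pos;             # unread bytes overwritten
}

printf "after %d record(s): %d unread byte(s) overwritten\n",
    $_, inplace_overrun($_, 1000) for 1, 1000, 1_000_000;
```

The gap grows by one byte per record, so an in-place version would have to keep a steadily growing amount of not-yet-written data somewhere else, which is the extra bookkeeping described above.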
Cheers,
Rob
No.4 | | 932 bytes |
| 
Saravana Kumar wrote:
> Thanks John. That did the trick. I ran the above script with my input file
> and redirected the output to another file. Since it creates a new file,
> I was wondering whether I can make the changes in the same file, i.e., read
> 1000 characters, do the replacement and write the output back to the same
> file. This would reduce the disk space used (since the file I have is 100G).
Because you are adding characters (the newline), about the only way to do
what you want is to read the entire file into memory, modify it, and then
write it back out to the same file.
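
As a rough sketch of that read-everything approach: the version below only makes sense for a file that fits comfortably in memory, which a 100G file on a 2G machine does not. The sub name rewrite_in_memory and its arguments are invented for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: rewrite $path in place by holding all of it in memory.
sub rewrite_in_memory {
    my ($path, $width) = @_;

    open my $in, '<', $path or die "Cannot read $path: $!";
    binmode $in;
    my $data = do { local $/; <$in> };   # undef $/ slurps the whole file
    close $in;
    $data = '' unless defined $data;

    $data =~ s/[[:cntrl:]]/ /g;          # control characters become spaces
    $data =~ s/(.{1,$width})/$1\n/gs;    # newline after every $width bytes

    open my $out, '>', $path or die "Cannot write $path: $!";
    binmode $out;
    print {$out} $data;
    close $out or die "Cannot close $path: $!";
}
```

If the program dies between truncating the file and finishing the write, the original contents are gone, so this is best tried on a copy first.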
John
No.6 | | 382 bytes |
| 
Saravana Kumar wrote:
> Since it is creating
> a new file i was wondering whether i can do the changes in the same
> file ie., read 1000 characters, do the replacement and write the
> output to the same file. This will reduce the disk space used(since
> the file i have is 100G).
Just change your specs and don't add the newline after every 1000
characters.
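
If the newline really is dropped from the spec, each control character is replaced by exactly one space, the file length never changes, and an in-place update becomes feasible. The following is a minimal sketch under that assumption. The sub name blank_controls_in_place and the chunk-size argument are hypothetical, and the byte ranges are the ASCII control characters (0x00 to 0x1F, plus 0x7F).

```perl
use strict;
use warnings;

# Hypothetical helper: blank out control characters in $path without
# changing its length, working on $chunk bytes at a time.
sub blank_controls_in_place {
    my ($path, $chunk) = @_;

    open my $fh, '+<', $path or die "Cannot open $path: $!";   # read + write
    binmode $fh;

    my $pos = 0;
    while (1) {
        seek $fh, $pos, 0 or die "seek failed: $!";
        my $got = read $fh, my $buf, $chunk;
        die "read failed: $!" unless defined $got;
        last if $got == 0;                      # end of file

        # tr/// maps byte for byte, so the buffer keeps its length;
        # it returns the number of bytes it changed.
        if ( $buf =~ tr/\x00-\x1f\x7f/ / ) {
            seek $fh, $pos, 0 or die "seek failed: $!";
            print {$fh} $buf or die "write failed: $!";
        }
        $pos += $got;
    }
    close $fh or die "Cannot close $path: $!";
}
```

This modifies the only copy of the data, so an interrupted run leaves a partly converted file; rerunning it is harmless, because bytes that are already spaces are left alone.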
No.7 | | 3653 bytes |
| 
Rob Dixon wrote:
> That is like preparing an apple pie while it is in the oven to save on
> kitchen space. You can't easily do it because each of your new records is
> one character longer than the original record and you would be overwriting
> data you hadn't processed yet. It is possible, in the sense that you could
> make sure that all the data is read from the file and held elsewhere (in
> memory or in a temporary file) before it is overwritten, but it wouldn't
> be a simple piece of code to get working correctly. In any case it is a
> bad idea because if you have a failure of any sort part-way through
> processing then your original data is then lost and you have no second
> chance. If the people you are working for expect to have files of this
> size and haven't allowed for storage space for several of them at once
> then you need to have a word with them about storage planning. You need a
> new disk drive: $100 will buy you around 300GB these days and that doesn't
> buy enough of your time to write clever software to cope with the lack of
> disk space.
>
> Cheers,
>
> Rob
I have enough space on the HDD to store more files; this "idea" came to
me just as a thought. I missed the point that adding "\n" will actually
overwrite the first character of the next record, which I haven't read
yet. I am going ahead with the same method (redirecting the output to a new
file) so as to save coding time. Not to mention that I can't lose any
data in that file.
Thanks to all who replied to my queries, and thanks for the time spent.
Regds,
SK