Perl

  • Converting a big flat file

    Hi,
    I am new to the list and a newbie in Perl.
    I have a big flat file (100G). The file was supposed to be a single line,
    but it has many records (as it has ^M). There are also ^@ and tabs in
    between.
    I want to first replace the control characters and tabs with a space.
    I tried this: s/[[:cntrl:]\t]/ /g. After replacing the above characters
    with a space, I have to insert \n after every 1000th character.
    But the program hangs after reading about 24G (roughly 1/4th of the file).
    I thought of reading the file character by character and checking whether
    the character is ^M, ^@ or \t. If so, replace it with a space and write the
    output; otherwise simply write the character. I would have to keep track of
    the character count so as to insert \n after every 1000th character.
    Will the above work, or is there any other (simple) way to do this? (Or
    should I just move on to C?)
    I am not sure why my first program hung (I ran the program on a machine with
    2G RAM).
    Regds,
    SK
  • No.1 | | 1310 bytes | |

    Saravana Kumar wrote:
    > Hi,

    Hello,

    > I want to first replace the control characters and tabs with a space.
    > I tried this: s/[[:cntrl:]\t]/ /g.

    The [:cntrl:] character class includes the "\t" character.

    > After replacing the above characters with a space, I have to insert
    > \n after every 1000th character.
    > But the program hangs after reading about 24G (roughly 1/4th of the file).
    > I am not sure why my first program hung (I ran the program on a machine
    > with 2G RAM).

    You can do what you want if you set the Input Record Separator to read 1000
    bytes at a time:

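    $/ = \1000;              # a reference to an integer makes <FILE> return 1000-byte records
    while ( <FILE> ) {
        s/[[:cntrl:]]/ /g;   # replace each control character (tab included) with a space
        print "$_\n";        # end every record with a newline
    }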

    John
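
For readers who want to try John's approach end to end, here is a minimal sketch of a complete script. It keeps his record-size trick and his regex; the helper name clean_stream, the lexical filehandles, the binmode calls and the file names taken from the command line are assumptions added for illustration and are not part of the original reply.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: copy $in to $out in fixed-size records, replacing
# every control character with a space and ending each record with "\n".
sub clean_stream {
    my ( $in, $out, $record_size ) = @_;

    binmode $in;     # treat the data as raw bytes
    binmode $out;

    # A reference to an integer makes <$in> return fixed-size records
    # (the last record may be shorter).
    local $/ = \$record_size;

    while ( my $record = <$in> ) {
        $record =~ s/[[:cntrl:]]/ /g;    # [:cntrl:] already covers "\t"
        print {$out} $record, "\n";
    }
    return;
}

# Assumed usage: perl clean.pl INPUT OUTPUT (both names are placeholders).
if ( @ARGV == 2 ) {
    my ( $src, $dst ) = @ARGV;
    open my $in,  '<', $src or die "Cannot open $src: $!";
    open my $out, '>', $dst or die "Cannot open $dst: $!";
    clean_stream( $in, $out, 1000 );
    close $out or die "Cannot close $dst: $!";
    close $in;
}
```

Reading fixed-size records keeps memory use at roughly one record at a time, whereas a default line-by-line read depends on where newlines happen to fall in the file.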
  • No.2 | | 1930 bytes | |

    John W. Krahn wrote:
    > The [:cntrl:] character class includes the "\t" character.
    >
    > You can do what you want if you set the Input Record Separator to read
    > 1000 bytes at a time:
    >
    > $/ = \1000;
    > while ( <FILE> ) {
    >     s/[[:cntrl:]]/ /g;
    >     print "$_\n";
    > }

    Thanks John. That did the trick. I ran the above script with my input file
    and redirected the output to another file. Since it creates a new file, I
    was wondering whether I can make the changes in the same file, i.e., read
    1000 characters, do the replacement and write the output back to the same
    file. This would reduce the disk space used (since the file I have is 100G).

    TIA,
    SK
  • No.3 | | 3151 bytes | |

    Saravana Kumar wrote:
    > Thanks John. That did the trick. I ran the above script with my input file
    > and redirected the output to another file. Since it is creating a new file
    > I was wondering whether I can do the changes in the same file, i.e., read
    > 1000 characters, do the replacement and write the output to the same file.
    > This will reduce the disk space used (since the file I have is 100G).

    That is like preparing an apple pie while it is in the oven to save on kitchen
    space. You can't easily do it because each of your new records is one character
    longer than the original record and you would be overwriting data you hadn't
    processed yet. It is possible, in the sense that you could make sure that all
    the data is read from the file and held elsewhere (in memory or in a temporary
    file) before it is overwritten, but it wouldn't be a simple piece of code to get
    working correctly. In any case it is a bad idea because if you have a failure of
    any sort part-way through processing then your original data is lost and
    you have no second chance. If the people you are working for expect to have
    files of this size and haven't allowed for storage space for several of them at
    once then you need to have a word with them about storage planning. You need a
    new disk drive: $100 will buy you around 300GB these days and that doesn't buy
    enough of your time to write clever software to cope with the lack of disk
    space.

    Cheers,

    Rob
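
The arithmetic behind Rob's point can be sketched as follows. The helper name overrun_after is hypothetical, and the sketch assumes the script writes each 1001-byte record back to the file as soon as it has read the 1000-byte original, with no read-ahead.

```perl
use strict;
use warnings;

# Hypothetical helper: after $k records of $size bytes have been read and
# written back in place with a trailing "\n", return how far the write
# position has run past the read position.
sub overrun_after {
    my ( $k, $size ) = @_;
    my $read_pos  = $k * $size;            # end of the data read so far
    my $write_pos = $k * ( $size + 1 );    # end of the data written so far
    return $write_pos - $read_pos;         # grows by one byte per record
}

printf "after %d records the writer is %d bytes ahead of the reader\n",
    $_, overrun_after( $_, 1000 )
    for 1, 1000, 100_000_000;
```

A 100G file split into 1000-byte records has about 100 million records, so by the end an in-place rewrite would have to hold on the order of 100 MB of not-yet-processed data somewhere else, which is the extra bookkeeping Rob describes.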
  • No.4 | | 932 bytes | |

    Saravana Kumar wrote:
    > Thanks John. That did the trick. I ran the above script with my input file
    > and redirected the output to another file. Since it is creating a new file
    > I was wondering whether I can do the changes in the same file, i.e., read
    > 1000 characters, do the replacement and write the output to the same file.
    > This will reduce the disk space used (since the file I have is 100G).

    Because you are adding characters (the newline) about the only way to do what
    you want is to read the entire file into memory, modify it, and then write it
    back out to the same file.

    John
  • No.6 | | 382 bytes | |

    Saravana Kumar wrote:
    > Since it is creating a new file I was wondering whether I can do the
    > changes in the same file, i.e., read 1000 characters, do the replacement
    > and write the output to the same file. This will reduce the disk space
    > used (since the file I have is 100G).

    Just change your specs and don't add the newline with every 1000
    characters.
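
If the newline requirement really is dropped, the replacement becomes one byte in, one byte out, so the file can be updated where it sits. The following is only a sketch under that assumption: the helper name clean_in_place and the chunk size are made up for illustration, and nothing like it was posted in the thread.

```perl
use strict;
use warnings;
use Fcntl qw(SEEK_SET);

# Hypothetical helper: replace control characters with spaces in place.
# The data never changes length, so no unread bytes are overwritten.
sub clean_in_place {
    my ( $path, $chunk_size ) = @_;

    open my $fh, '+<', $path or die "Cannot open $path: $!";
    binmode $fh;    # work on raw bytes

    my $pos = 0;    # offset of the chunk currently being processed
    my $buf;
    while ( my $got = read( $fh, $buf, $chunk_size ) ) {
        $buf =~ s/[[:cntrl:]]/ /g;    # same length before and after

        # Go back to where this chunk started and overwrite it.
        seek $fh, $pos, SEEK_SET or die "seek failed: $!";
        print {$fh} $buf or die "write failed: $!";

        # A seek is needed when switching from writing back to reading.
        $pos += $got;
        seek $fh, $pos, SEEK_SET or die "seek failed: $!";
    }
    close $fh or die "Cannot close $path: $!";
    return;
}
```

Because the substitution gives the same result when applied twice, an interrupted run could in principle be restarted from the beginning, but Rob's caution about modifying the only copy of the data still deserves weight.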
  • No.7 | | 3653 bytes | |

    Rob Dixon wrote:
    > You can't easily do it because each of your new records is one character
    > longer than the original record and you would be overwriting data you
    > hadn't processed yet.
    > [...]
    > In any case it is a bad idea because if you have a failure of any sort
    > part-way through processing then your original data is lost and you have
    > no second chance.

    I have enough space on the HDD to store more files; this "idea" came to
    me just as a thought. I missed the point that adding "\n" would actually
    overwrite the first character of the next record, which I haven't read
    yet. I am going ahead with the same method (redirecting the output to a
    new file) so as to save coding time. Not to mention that I can't lose any
    data in that file.

    Thanks to all who replied to my queries, and for the time spent.

    Regds,
    SK
