Development

NAVIGATION
CATEGORIES
REFERRENCE
LINKS
  • Frankenstring

    24 answers - 1415 bytes - related search similar search Add To My Delicious Add To My Stumble Upon Add To My Google Mark Add To My Facebook Add To My Digg Add To My Reddit

    Hi,
    I think I need an iterator over a string of characters pulling them out
    one by one, like a usual iterator over a str does. At the same time the
    thing should allow seeking and telling like a file-like object:
    f = frankenstring("0123456789")
    for c in f:
    print c
    if c == "2":
    break
    0
    1
    2
    f.tell()
    3L
    f.seek(7)
    for c in f:
    print c
    7
    8
    9
    It's definitely no help that file-like objects are iterable; I do want
    to get a character, not a complete line, at a time.
    I can think of more than one clumsy way to implement the desired
    behaviour in Python; I'd rather like to know whether there's an
    implementation somewhere that does it fast. (Yes, it's me and speed
    considerations again; this is for a tokenizer at the core of a library,
    and I'd really like it to be fast.) I don't think there's anything like
    it in the standard library, at least not anything that would be obvious
    to me.
    I don't care whether this is more of a string iterator with seeking and
    telling, or a file-like object with a single-character iterator; as long
    as it does both efficiently, I'm happy.
    I'd even consider writing such a beast in C, albeit more as a learning
    exercise than as a worthwhile measure to speed up some code.
    Thanks for any hints.
  • No.1 | | 48 bytes | |

    see StringI or cStringI in the standard library.
  • No.2 | | 1606 bytes | |

    Thomas Lotze wrote:

    >Hi,
    >
    >I think I need an iterator over a string of characters pulling them out
    >one by one, like a usual iterator over a str does. At the same time the
    >thing should allow seeking and telling like a file-like object:


    , first off, this is never going to be *fast* compared to something
    coded in C and wrapped with Python. You are dealing with every single
    character as a Python object, so let's forget fast for the moment and do
    a straightforward implementation:

    class Franken( str ):
    frankenIndex = 0
    def __iter__( self ):
    while self.frankenIndex < len(self):
    yield self[ self.frankenIndex ]
    self.frankenIndex += 1
    self.frankenIndex = 0
    def seek( self, index ):
    self.frankenIndex = index
    def tell( self, index ):
    return self.frankenIndex

    if __name__ == "__main__":
    f = Franken( 'abcdefg' )
    for c in f:
    print 'char', c
    if c == 'c':
    break
    f.seek( 5 )
    l1 = list( f )
    l2 = list( f )
    assert l1 == [ 'f','g' ]
    assert l2 == list(str(f))
    print 'first list', l1
    print 'second list', l2

    If you want to speed it up, you can optimise for various string sizes
    (eg using a slice of the string and the built-in iterator when
    appropriate), but in the end, it's not going to be anywhere near as fast
    as a C-engine tokeniser, so I'd personally spend more time on elegance
    than on speed

    Anywho, have fun,
    Mike
  • No.3 | | 2130 bytes | |

    Tue, 12 Jul 2005 22:08:55 +0200, Thomas Lotze <thomas (AT) thomas-lotze (DOT) dewrote:

    >Hi,
    >
    >I think I need an iterator over a string of characters pulling them out
    >one by one, like a usual iterator over a str does. At the same time the
    >thing should allow seeking and telling like a file-like object:
    >

    f = frankenstring("0123456789")
    for c in f:
    print c
    if c == "2":
    break

    >0
    >1
    >2

    f.tell()
    >3L

    f.seek(7)
    for c in f:
    print c

    >7
    >8
    >9


    >
    >It's definitely no help that file-like objects are iterable; I do want
    >to get a character, not a complete line, at a time.
    >
    >I can think of more than one clumsy way to implement the desired
    >behaviour in Python; I'd rather like to know whether there's an
    >implementation somewhere that does it fast. (Yes, it's me and speed
    >considerations again; this is for a tokenizer at the core of a library,
    >and I'd really like it to be fast.) I don't think there's anything like
    >it in the standard library, at least not anything that would be obvious
    >to me.
    >
    >I don't care whether this is more of a string iterator with seeking and
    >telling, or a file-like object with a single-character iterator; as long
    >as it does both efficiently, I'm happy.
    >
    >I'd even consider writing such a beast in C, albeit more as a learning
    >exercise than as a worthwhile measure to speed up some code.
    >
    >Thanks for any hints.
    >

    I'd probably subclass file to buffer in good-sized chunks and override the
    iteration to go by characters through the buffer, updating the buffer
    when you get to its end, and overriding seek and tell to do the right thing
    re the buffer and where you are in it for the character iteration via next.

    Regards,
    Bengt Richter
  • No.4 | | 184 bytes | |

    jay graves wrote:
    see StringI or cStringI in the standard library.
    Just as with files, iterating over them returns whole lines, which is
    unfortunately not what I want.
  • No.5 | | 697 bytes | |

    Thomas Lotze wrote:
    Hi,
    I think I need an iterator over a string of characters pulling them out
    one by one, like a usual iterator over a str does. At the same time the
    thing should allow seeking and telling like a file-like object:

    f = frankenstring("0123456789")
    for c in f:

    print c
    if c == "2":
    break

    K, frankenstring can be approximated by nothing:
    for c in "0123456789":
    print c

    Now if you want to do it for a file, you could do:

    for c in thefile.read():

    , if you did not want to do a full read:
    def filechars(afile):
    for line in afile:
    for char in line:
    yield char

    David Daniels
    Scott.Daniels@A
  • No.6 | | 616 bytes | |

    Scott David Daniels wrote:

    Now if you want to do it for a file, you could do:

    for c in thefile.read():

    The whole point of the exercise is that seeking on a file doesn't
    influence iteration over its content. In the loop you suggest, I can
    seek() on thefile to my heart's content and will always get its content
    iterated over exactly from beginning to end. It had been read before any
    of this started, after all. Similarly, thefile.tell() will always tell me
    thefile's size or the place I last seek()'ed to instead of the position of
    the next char I will get.
  • No.7 | | 4434 bytes | |

    Wed, 13 Jul 2005 03:49:16 +0200, Thomas Lotze <thomas (AT) thomas-lotze (DOT) dewrote:

    >Scott David Daniels wrote:
    >
    >Now if you want to do it for a file, you could do:
    >
    >for c in thefile.read():
    >
    >
    >The whole point of the exercise is that seeking on a file doesn't
    >influence iteration over its content. In the loop you suggest, I can
    >seek() on thefile to my heart's content and will always get its content
    >iterated over exactly from beginning to end. It had been read before any
    >of this started, after all. Similarly, thefile.tell() will always tell me
    >thefile's size or the place I last seek()'ed to instead of the position of
    >the next char I will get.
    >

    What I suggested in my other post (untested beyond what you see, so you
    may want to add to the test ):

    < lotzefile.py >
    class LotzeFile(file):
    BUFSIZE = 4096
    def __init__(self, path, mode='r'):
    self.f = file(path, mode)
    self.pos = self.bufbase = 0
    self.buf = ''
    def __iter__(self): return self
    def next(self):
    if not self.buf[self.pos:]:
    self.bufbase += len(self.buf)
    self.pos = 0
    self.buf = self.f.read(self.BUFSIZE)
    if not self.buf:
    self.close()
    raise StopIteration
    byte = self.buf[self.pos]
    self.pos += 1
    return byte
    def seek(self, pos, ref=0):
    self.f.seek(pos, ref)
    self.bufbase = self.f.tell()
    self.pos = 0
    self.buf = ''
    def tell(self):
    return self.bufbase + self.pos
    def close(self):
    self.f.close()

    def test():
    f = file('lotzedata.txt','w')
    for s in (' %3d'%i for i in xrange(1000)): f.write(s)
    f.close()

    it = iter(LotzeFile('lotzedata.txt'))

    hold4=[0,0,0,0]
    for i, c in enumerate(it):
    hold4[i%4] = c
    if i%4==3:
    print hold4
    assert (i-3)/4 == int(''.join(hold4))
    if i == 99: break
    print it.tell()
    it.seek(52)
    for i in xrange(8): print it.next(),
    print
    it.seek(990*4)
    for c in it: print c,

    if __name__ == '__main__':
    test()

    Result:

    [20:53] C:\pywk\clp>py24 lotze.py
    [' ', ' ', ' ', '0']
    [' ', ' ', ' ', '1']
    [' ', ' ', ' ', '2']
    [' ', ' ', ' ', '3']
    [' ', ' ', ' ', '4']
    [' ', ' ', ' ', '5']
    [' ', ' ', ' ', '6']
    [' ', ' ', ' ', '7']
    [' ', ' ', ' ', '8']
    [' ', ' ', ' ', '9']
    [' ', ' ', '1', '0']
    [' ', ' ', '1', '1']
    [' ', ' ', '1', '2']
    [' ', ' ', '1', '3']
    [' ', ' ', '1', '4']
    [' ', ' ', '1', '5']
    [' ', ' ', '1', '6']
    [' ', ' ', '1', '7']
    [' ', ' ', '1', '8']
    [' ', ' ', '1', '9']
    [' ', ' ', '2', '0']
    [' ', ' ', '2', '1']
    [' ', ' ', '2', '2']
    [' ', ' ', '2', '3']
    [' ', ' ', '2', '4']
    100
    1 3 1 4
    9 9 0 9 9 1 9 9 2 9 9 3 9 9 4 9 9 5 9 9 6 9 9 7 9 9 8 9 9 9

    I suspect you could get better performance if you made LotzeFile instances able to
    return interators over buffer chunks and get characters from them, which would
    be string iterators supplying the characters rather than the custom .next, but
    the buffer chunks would have to be of some size to make that pay. Testing is
    the only way to find out what the crossing point is, if you really have to.

    Regards,
    Bengt Richter
  • No.8 | | 503 bytes | |

    Thomas Lotze wrote:
    It's definitely no help that file-like objects are iterable; I do want
    to get a character, not a complete line, at a time.

    Hi,

    if i did understand what you mean, what about using mmap? Iterating over
    characters in a file like this:

    # coding: iso-8859-1
    import os
    import mmap

    f = open("file.txt", "r+")
    size = os.path.getsize("file.txt")
    m = mmap.mmap(f.fileno(), size)

    for x in m:
    print x

    m.close()
    f.close()
  • No.9 | | 1000 bytes | |

    Thomas Lotze wrote:

    I think I need an iterator over a string of characters pulling them out
    one by one, like a usual iterator over a str does. At the same time the
    thing should allow seeking and telling like a file-like object:

    from StringI import StringI
    class frankenstring(StringI):
    def next(self):
    c = self.read(1)
    if not c:
    raise StopIteration
    return c

    f = frankenfile("0123456789")
    for c in f:
    print c
    if c == "2":
    break

    0
    1
    2
    f.tell()
    3
    f.seek(7)
    for c in f:
    print c

    7
    8
    9

    A non-intrusive alternative:

    def chariter(instream):
    def char(): return instream.read(1)
    return iter(char, "")

    f = StringI("0123456789")
    for c in chariter(f):
    print c
    if c == "2": break

    0
    1
    2
    f.tell()
    3

    Performance is probably not so good, but if you really want to do it in C,
    with cStringI you might be /almost/ there.

    Peter
  • No.10 | | 351 bytes | |

    Roland Heiber wrote:

    if i did understand what you mean, what about using mmap?

    AIUI (and as a little experimenting seems to confirm), you can't
    reposition an iterator over an mmap'ed file by seeking. True, you have
    both iterating by characters and seeking/telling, but the two
    functionalities don't play together.
  • No.11 | | 1539 bytes | |

    Bengt Richter wrote:

    < lotzefile.py >

    Thanks.

    []
    byte = self.buf[self.pos]

    This is the place where the thing is basically a str whose items are
    accessed as sequence elements. It has some iterator behaviour and file
    management which makes it nice to use, of course, and to most this will
    be enough (and it is a lot indeed). But it loses the efficiency of

    for c in "asdf": do_something(c)

    Actually, relying on string[index] behind the scenes is one of the ways
    of implementing frankenstring I labelled "clumsy" in the original
    posting ;o)

    I suspect you could get better performance if you made LotzeFile instances
    able to return interators over buffer chunks and get characters from them,
    which would be string iterators supplying the characters rather than the
    custom .next, but the buffer chunks would have to be of some size to make
    that pay. Testing is the only way to find out what the crossing point is,
    if you really have to.

    If I understand this correctly, you'd have to switch to using a new iterator
    after seeking, which would make this impossible:

    f = LotzeFile('something')
    for c in iter(f):
    do_something(c)
    if some_condition:
    f.seek(somewhere)
    # the next iteration reads from the new position

    And it would break telling since the class can't know how many
    characters have been read from an iterator once it returned one after
    seeking or switching to another buffer chunk.
  • No.12 | | 307 bytes | |

    Peter wrote:

    class frankenstring(StringI):
    def next(self):
    c = self.read(1)
    if not c:
    raise StopIteration
    return c

    Repeated read(1) on a file-like object is one of the ways of doing it with
    existing tools I labelled "clumsy" in the original posting ;o)

    Thanks anyway.
  • No.13 | | 1057 bytes | |

    Thomas Lotze wrote:
    AIUI (and as a little experimenting seems to confirm), you can't
    reposition an iterator over an mmap'ed file by seeking. True, you have
    both iterating by characters and seeking/telling, but the two
    functionalities don't play together.

    A quick and dirty hack!? Maybe i'm missing what it is that you need

    class MmapWithSeekAndTell(object):
    def __init__(self, m, size):
    self._m = m
    self._pos = 0
    self._size = size-1

    def __iter__(self):
    return self

    def next(self):
    if self._pos < self._size-1:
    self._pos += 1
    return self._m[self._pos]
    raise StopIteration

    def seek(self, offset, whence=0):
    if whence == 0 and 0 <= offset < self._size:
    self._pos = offset
    elif whence == 1 and 0 <= (self._pos + offset) < self._size:
    self._pos += offset
    elif whence == 2 and 0<= (self._size - offset) < self._size:
    self._pos = self._size - offset

    def tell(self):
    return self._pos

    HtH, Roland
  • No.14 | | 152 bytes | |

    Roland Heiber wrote:
    class MmapWithSeekAndTell(object):
    def __init__(self, m, size):
    where m is a mmap-object and size the filesize sorry.
  • No.15 | | 1465 bytes | |

    Aloha,

    Thomas Lotze wrote:
    I think I need an iterator over a string of characters pulling them out
    one by one, like a usual iterator over a str does. At the same time the
    thing should allow seeking and telling like a file-like object:
    f = frankenstring("0123456789")
    for c in f:
    print c
    if c == "2":
    break

    0
    1
    2
    f.tell()
    3L
    f.seek(7)
    for c in f:
    print c

    7
    8
    9
    I can think of more than one clumsy way to implement the desired
    behaviour in Python; I'd rather like to know whether there's an
    implementation somewhere that does it fast. (Yes, it's me and speed
    considerations again; this is for a tokenizer at the core of a library,
    and I'd really like it to be fast.)

    You can already think my answer, because i'm doing this
    at the core of a similar library, but to give others
    the chance to discuss.

    f = "0123456789"
    p = 0
    t2 = f.find('2')+1
    for c in f[p:t2]:
    print c

    0
    1
    2
    p = 7
    for c in f[p:]:
    print c

    7
    8
    9

    A string, and a pointer on that string. If you give up the
    boundary condition to tell backwards, you can start to eat up
    the string via f = f[p:]. There was a performance difference
    with that, in fact it was faster ~4% on a python2.2.

    I dont't expect any iterator solution to be faster than
    that.

    Wishing a happy day
    LBI
  • No.16 | | 590 bytes | |

    Thomas Lotze wrote:

    Peter wrote:

    class frankenstring(StringI):
    >def next(self):
    >c = self.read(1)
    >if not c:
    >raise StopIteration
    >return c


    Repeated read(1) on a file-like object is one of the ways of doing it with
    existing tools I labelled "clumsy" in the original posting ;o)

    Not clumsy, just slow. I hope you'll let us know how much faster your final
    approach turns out to be. By the way, I'll consider anything that doesn't
    implement seek() and tell() cheating :-)

    Peter
  • No.17 | | 1478 bytes | |

    Andreas Lobinger wrote:

    t2 = f.find('2')+1

    This is indeed faster than going through a string char by char. It doesn't
    make for a nice character-based state machine, but of course it avoids
    making Python objects for every character and uses the C implementation of
    str for searching.

    However, it's only fine if you are looking for single characters. As soon
    as you're looking for classes of characters, you need the (slower) regex
    machinery (as you well know, but for the sake of discussion).

    A string, and a pointer on that string. If you give up the boundary
    condition to tell backwards, you can start to eat up the string via f =
    f[p:]. There was a performance difference with that, in fact it was faster
    ~4% on a python2.2.

    When I tried it just now, it was the other way around. Eating up the
    string was slower, which makes sense to me since it involves creating new
    string objects all the time.

    I dont't expect any iterator solution to be faster than that.

    It's not so much an issue of iterators, but handling Python objects
    for every char. Iterators would actually be quite helpful for searching: I
    wonder why there doesn't seem to be an str.iterfind or str.itersplit
    thing. And I wonder whether there shouldn't be str.findany and
    str.iterfindany, which takes a sequence as an argument and returns the
    next match on any element of it.
  • No.18 | | 1250 bytes | |

    Peter wrote:

    Not clumsy, just slow.

    As you wish ;o) I didn't mean clumsy as in "clumsy looking Python code"
    anyway, rather as in "clumsy to use the Python machinery for operations
    that are straight-forward and efficient in C, in which language str and
    cStringI are implemented already".

    I hope you'll let us know how much faster your
    final approach turns out to be.

    I'm pretty convinced that implementing an algorithmically nice state
    machine that goes through a string char by char won't get any faster than
    using s[index] all the time unless I do a frankenstring in C. Failing
    that, a more pragmatic approach is what Andreas suggests; see the other
    subthread.

    By the way, I'll consider anything that
    doesn't implement seek() and tell() cheating :-)

    An implementation of frankenstring would have to have seek and tell,
    that's the point of doing it. But for half-way simple state machines,
    hiding the index handling in a Python class that slows things down it just
    not worth it. Doing index += 1 here and there is fine if it happens only
    half a dozen times. I know it's not beautiful, that's why I started this
    thread ;o)
  • No.19 | | 1609 bytes | |

    Aloha,

    Thomas Lotze wrote:
    >>A string, and a pointer on that string. If you give up the boundary
    >>condition to tell backwards, you can start to eat up the string via f =
    >>f[p:]. There was a performance difference with that, in fact it was faster

    4% on a python2.2.

    When I tried it just now, it was the other way around. Eating up the
    string was slower, which makes sense to me since it involves creating new
    string objects all the time.

    I expected the f[p:] also to be slower, the 4% i only measured on one
    platform. Most propably the CG and memory management isn't the same.

    >>I dont't expect any iterator solution to be faster than that.


    It's not so much an issue of iterators, but handling Python objects
    for every char. Iterators would actually be quite helpful for searching: I
    wonder why there doesn't seem to be an str.iterfind or str.itersplit
    thing. And I wonder whether there shouldn't be str.findany and
    str.iterfindany, which takes a sequence as an argument and returns the
    next match on any element of it.

    There is a finditer in the re. I'm currently rewriting a few pattern
    matching things and find it quite valueable.

    import re
    pat = re.compile('[57]')
    f = "754356184756046104564"
    for a in pat.finditer(f):
    print a.start(),f[a.start()]

    0 7
    1 5
    4 5
    9 7
    10 5
    18 5

    Wishing a happy day
    LBI
  • No.20 | | 700 bytes | |

    Thomas Lotze wrote:

    And I wonder whether there shouldn't be str.findany and
    str.iterfindany, which takes a sequence as an argument and returns the
    next match on any element of it.

    second thought, that wouldn't gain much on a loop over finding each
    sequence, but add more complexity than it is worth. What would be more
    useful, especially thinking of a C implementation, is str.findanyof and
    str.findnoneof. They take a string as an argument and find the first
    occurrence of any char in that string or any char not in that string,
    resp. Especially finding any char not among a given few needs a hoop to
    jump through now, if I didn't miss anything.
  • No.21 | | 590 bytes | |

    >jay graves wrote:
    >see StringI or cStringI in the standard library.


    Just as with files, iterating over them returns whole lines, which is
    unfortunately not what I want.

    Then why not subclass it and alter the iteration scheme to do a read(1)
    or something?

    from StringI import StringI

    class FrankenString(StringI):
    lastidx = 0
    atEnd = False
    def __iter__(self):
    while not self.atEnd:
    char = self.read(1)
    idx = self.tell()
    if self.lastidx == idx:
    self.atEnd = True
    self.lastidx = idx
    yield char
  • No.22 | | 688 bytes | |

    Here is a cStringI based version:
    class FrankenString:
    def __init__(self,string=None):
    self.str = StringI(string)
    self.atEnd = False
    self.lastidx = 0
    self.seek = self.str.seek
    self.tell = self.str.tell
    def __iter__(self):
    while not self.atEnd:
    char = self.str.read(1)
    idx = self.str.tell()
    if self.lastidx == idx:
    self.atEnd = True
    self.lastidx = idx
    yield char

    a string 1024*1024 long the StringI version takes 10s to iterate
    over but do nothing, whereas this one takes 3.1 on my system. Well to
    create that string, create the instance and loop over. But since each
    variant is doing the same thing I figure it's even. ;)
  • No.23 | | 499 bytes | |

    well that appears to have been munged up
    that tell() belongs immediately after self.str. making it
    self.str.tell()

    class FrankenString:
    def __init__(self,string=None):
    self.str = StringI(string)
    self.atEnd = False
    self.lastidx = 0
    self.seek = self.str.seek
    self.tell = self.str.tell
    def __iter__(self):
    while not self.atEnd:
    char = self.str.read(1)
    idx = self.str.tell()
    if self.lastidx == idx:
    self.atEnd = True
    self.lastidx = idx
    yield char
  • No.24 | | 2718 bytes | |

    Peter wrote:

    I hope you'll let us know how much faster your
    final approach turns out to be

    K, here's a short report on the current state. Such code as there is can
    be found at <>,
    with a Python mock-up in the same directory.

    Thinking about it (Andreas, thank you for the reminder :o)), doing
    character-by-character scanning in Python is stupid, both in terms of
    speed and, given some more search capabilities than str currently has,
    elegance.

    So what I did until now (except working myself into writing extensions
    in C) is give the evolving FrankenString some search methods that enable
    searching for the first occurrence in the string of any character out of
    a set of characters given as a string, or any character not in such a
    set. This has nothing to do yet with iterators and seeking/telling.

    Just letting C do the "while data[index] not in whitespace: index += 1"
    part speeds up my PDF tokenizer by a factor between 3 and 4. I have
    never compared that directly to using regular expressions, though As
    a bonus, even with this minor addition the Python code looks a little
    cleaner already:

    c = data[cursor]

    while c in whitespace:
    # Whitespace tokens.
    cursor += 1

    if c == '%':
    # We're just inside a comment, read beyond EL.
    while data[cursor] not in "\r\n":
    cursor += 1
    cursor += 1

    c = data[cursor]

    becomes

    cursor = data.skipany(whitespace, start)
    c = data[cursor]

    while c == '%':
    # Whitespace tokens: comments till EL and whitespace.
    cursor = data.skipother("\r\n", cursor)
    cursor = data.skipany(whitespace, cursor)
    c = data[cursor]

    (removing '%' from the whitespace string, in case you wonder).

    The next thing to do is make FrankenString behave. Right now there's too
    much copying of string content going on everytime a FrankenString is
    initialized; I'd like it to share string content with other
    FrankenStrings or strs much like cStringI does. I hope it's just a
    matter of learning from cStringI To justify the "franken" part of the
    name some more, I consider mixing in yet another ingredient and making
    the thing behave like a buffer in that a FrankenString should be
    possible to make from only part of a string without copying data.

    After that, the thing about seeking and telling iterators over
    characters or search results comes in. I don't think it will make much
    difference in performance now that the stupid character searching has
    been done in C, but it'll hopefully make for more elegant Python code.

Re: Frankenstring


max 4000 letters.
Your nickname that display:
In order to stop the spam: 1 + 0 =
QUESTION ON "Development"

EMSDN.COM