Frankenstring
24 answers - 1415 bytes -

Hi,
I think I need an iterator over a string of characters pulling them out
one by one, like a usual iterator over a str does. At the same time the
thing should allow seeking and telling like a file-like object:
f = frankenstring("0123456789")
for c in f:
print c
if c == "2":
break
0
1
2
f.tell()
3L
f.seek(7)
for c in f:
print c
7
8
9
It's definitely no help that file-like objects are iterable; I do want
to get a character, not a complete line, at a time.
I can think of more than one clumsy way to implement the desired
behaviour in Python; I'd rather like to know whether there's an
implementation somewhere that does it fast. (Yes, it's me and speed
considerations again; this is for a tokenizer at the core of a library,
and I'd really like it to be fast.) I don't think there's anything like
it in the standard library, at least not anything that would be obvious
to me.
I don't care whether this is more of a string iterator with seeking and
telling, or a file-like object with a single-character iterator; as long
as it does both efficiently, I'm happy.
I'd even consider writing such a beast in C, albeit more as a learning
exercise than as a worthwhile measure to speed up some code.
Thanks for any hints.
No.1 | | 48 bytes |
| 
see StringI or cStringI in the standard library.
No.2 | | 1606 bytes |
| 
Thomas Lotze wrote:
>Hi,
>
>I think I need an iterator over a string of characters pulling them out
>one by one, like a usual iterator over a str does. At the same time the
>thing should allow seeking and telling like a file-like object:
, first off, this is never going to be *fast* compared to something
coded in C and wrapped with Python. You are dealing with every single
character as a Python object, so let's forget fast for the moment and do
a straightforward implementation:
class Franken( str ):
frankenIndex = 0
def __iter__( self ):
while self.frankenIndex < len(self):
yield self[ self.frankenIndex ]
self.frankenIndex += 1
self.frankenIndex = 0
def seek( self, index ):
self.frankenIndex = index
def tell( self, index ):
return self.frankenIndex
if __name__ == "__main__":
f = Franken( 'abcdefg' )
for c in f:
print 'char', c
if c == 'c':
break
f.seek( 5 )
l1 = list( f )
l2 = list( f )
assert l1 == [ 'f','g' ]
assert l2 == list(str(f))
print 'first list', l1
print 'second list', l2
If you want to speed it up, you can optimise for various string sizes
(eg using a slice of the string and the built-in iterator when
appropriate), but in the end, it's not going to be anywhere near as fast
as a C-engine tokeniser, so I'd personally spend more time on elegance
than on speed
Anywho, have fun,
Mike
No.3 | | 2130 bytes |
| 
Tue, 12 Jul 2005 22:08:55 +0200, Thomas Lotze <thomas (AT) thomas-lotze (DOT) dewrote:
>Hi,
>
>I think I need an iterator over a string of characters pulling them out
>one by one, like a usual iterator over a str does. At the same time the
>thing should allow seeking and telling like a file-like object:
>
f = frankenstring("0123456789")
for c in f:
print c
if c == "2":
break
>0
>1
>2
f.tell()
>3L
f.seek(7)
for c in f:
print c
>7
>8
>9
>
>It's definitely no help that file-like objects are iterable; I do want
>to get a character, not a complete line, at a time.
>
>I can think of more than one clumsy way to implement the desired
>behaviour in Python; I'd rather like to know whether there's an
>implementation somewhere that does it fast. (Yes, it's me and speed
>considerations again; this is for a tokenizer at the core of a library,
>and I'd really like it to be fast.) I don't think there's anything like
>it in the standard library, at least not anything that would be obvious
>to me.
>
>I don't care whether this is more of a string iterator with seeking and
>telling, or a file-like object with a single-character iterator; as long
>as it does both efficiently, I'm happy.
>
>I'd even consider writing such a beast in C, albeit more as a learning
>exercise than as a worthwhile measure to speed up some code.
>
>Thanks for any hints.
>
I'd probably subclass file to buffer in good-sized chunks and override the
iteration to go by characters through the buffer, updating the buffer
when you get to its end, and overriding seek and tell to do the right thing
re the buffer and where you are in it for the character iteration via next.
Regards,
Bengt Richter
No.4 | | 184 bytes |
| 
jay graves wrote:
see StringI or cStringI in the standard library.
Just as with files, iterating over them returns whole lines, which is
unfortunately not what I want.
No.5 | | 697 bytes |
| 
Thomas Lotze wrote:
Hi,
I think I need an iterator over a string of characters pulling them out
one by one, like a usual iterator over a str does. At the same time the
thing should allow seeking and telling like a file-like object:
f = frankenstring("0123456789")
for c in f:
print c
if c == "2":
break
K, frankenstring can be approximated by nothing:
for c in "0123456789":
print c
Now if you want to do it for a file, you could do:
for c in thefile.read():
, if you did not want to do a full read:
def filechars(afile):
for line in afile:
for char in line:
yield char
David Daniels
Scott.Daniels@A
No.6 | | 616 bytes |
| 
Scott David Daniels wrote:
Now if you want to do it for a file, you could do:
for c in thefile.read():
The whole point of the exercise is that seeking on a file doesn't
influence iteration over its content. In the loop you suggest, I can
seek() on thefile to my heart's content and will always get its content
iterated over exactly from beginning to end. It had been read before any
of this started, after all. Similarly, thefile.tell() will always tell me
thefile's size or the place I last seek()'ed to instead of the position of
the next char I will get.
No.7 | | 4434 bytes |
| 
Wed, 13 Jul 2005 03:49:16 +0200, Thomas Lotze <thomas (AT) thomas-lotze (DOT) dewrote:
>Scott David Daniels wrote:
>
>Now if you want to do it for a file, you could do:
>
>for c in thefile.read():
>
>
>The whole point of the exercise is that seeking on a file doesn't
>influence iteration over its content. In the loop you suggest, I can
>seek() on thefile to my heart's content and will always get its content
>iterated over exactly from beginning to end. It had been read before any
>of this started, after all. Similarly, thefile.tell() will always tell me
>thefile's size or the place I last seek()'ed to instead of the position of
>the next char I will get.
>
What I suggested in my other post (untested beyond what you see, so you
may want to add to the test ):
< lotzefile.py >
class LotzeFile(file):
BUFSIZE = 4096
def __init__(self, path, mode='r'):
self.f = file(path, mode)
self.pos = self.bufbase = 0
self.buf = ''
def __iter__(self): return self
def next(self):
if not self.buf[self.pos:]:
self.bufbase += len(self.buf)
self.pos = 0
self.buf = self.f.read(self.BUFSIZE)
if not self.buf:
self.close()
raise StopIteration
byte = self.buf[self.pos]
self.pos += 1
return byte
def seek(self, pos, ref=0):
self.f.seek(pos, ref)
self.bufbase = self.f.tell()
self.pos = 0
self.buf = ''
def tell(self):
return self.bufbase + self.pos
def close(self):
self.f.close()
def test():
f = file('lotzedata.txt','w')
for s in (' %3d'%i for i in xrange(1000)): f.write(s)
f.close()
it = iter(LotzeFile('lotzedata.txt'))
hold4=[0,0,0,0]
for i, c in enumerate(it):
hold4[i%4] = c
if i%4==3:
print hold4
assert (i-3)/4 == int(''.join(hold4))
if i == 99: break
print it.tell()
it.seek(52)
for i in xrange(8): print it.next(),
print
it.seek(990*4)
for c in it: print c,
if __name__ == '__main__':
test()
Result:
[20:53] C:\pywk\clp>py24 lotze.py
[' ', ' ', ' ', '0']
[' ', ' ', ' ', '1']
[' ', ' ', ' ', '2']
[' ', ' ', ' ', '3']
[' ', ' ', ' ', '4']
[' ', ' ', ' ', '5']
[' ', ' ', ' ', '6']
[' ', ' ', ' ', '7']
[' ', ' ', ' ', '8']
[' ', ' ', ' ', '9']
[' ', ' ', '1', '0']
[' ', ' ', '1', '1']
[' ', ' ', '1', '2']
[' ', ' ', '1', '3']
[' ', ' ', '1', '4']
[' ', ' ', '1', '5']
[' ', ' ', '1', '6']
[' ', ' ', '1', '7']
[' ', ' ', '1', '8']
[' ', ' ', '1', '9']
[' ', ' ', '2', '0']
[' ', ' ', '2', '1']
[' ', ' ', '2', '2']
[' ', ' ', '2', '3']
[' ', ' ', '2', '4']
100
1 3 1 4
9 9 0 9 9 1 9 9 2 9 9 3 9 9 4 9 9 5 9 9 6 9 9 7 9 9 8 9 9 9
I suspect you could get better performance if you made LotzeFile instances able to
return interators over buffer chunks and get characters from them, which would
be string iterators supplying the characters rather than the custom .next, but
the buffer chunks would have to be of some size to make that pay. Testing is
the only way to find out what the crossing point is, if you really have to.
Regards,
Bengt Richter
No.8 | | 503 bytes |
| 
Thomas Lotze wrote:
It's definitely no help that file-like objects are iterable; I do want
to get a character, not a complete line, at a time.
Hi,
if i did understand what you mean, what about using mmap? Iterating over
characters in a file like this:
# coding: iso-8859-1
import os
import mmap
f = open("file.txt", "r+")
size = os.path.getsize("file.txt")
m = mmap.mmap(f.fileno(), size)
for x in m:
print x
m.close()
f.close()
No.9 | | 1000 bytes |
| 
Thomas Lotze wrote:
I think I need an iterator over a string of characters pulling them out
one by one, like a usual iterator over a str does. At the same time the
thing should allow seeking and telling like a file-like object:
from StringI import StringI
class frankenstring(StringI):
def next(self):
c = self.read(1)
if not c:
raise StopIteration
return c
f = frankenfile("0123456789")
for c in f:
print c
if c == "2":
break
0
1
2
f.tell()
3
f.seek(7)
for c in f:
print c
7
8
9
A non-intrusive alternative:
def chariter(instream):
def char(): return instream.read(1)
return iter(char, "")
f = StringI("0123456789")
for c in chariter(f):
print c
if c == "2": break
0
1
2
f.tell()
3
Performance is probably not so good, but if you really want to do it in C,
with cStringI you might be /almost/ there.
Peter
No.10 | | 351 bytes |
| 
Roland Heiber wrote:
if i did understand what you mean, what about using mmap?
AIUI (and as a little experimenting seems to confirm), you can't
reposition an iterator over an mmap'ed file by seeking. True, you have
both iterating by characters and seeking/telling, but the two
functionalities don't play together.
No.11 | | 1539 bytes |
| 
Bengt Richter wrote:
< lotzefile.py >
Thanks.
[]
byte = self.buf[self.pos]
This is the place where the thing is basically a str whose items are
accessed as sequence elements. It has some iterator behaviour and file
management which makes it nice to use, of course, and to most this will
be enough (and it is a lot indeed). But it loses the efficiency of
for c in "asdf": do_something(c)
Actually, relying on string[index] behind the scenes is one of the ways
of implementing frankenstring I labelled "clumsy" in the original
posting ;o)
I suspect you could get better performance if you made LotzeFile instances
able to return interators over buffer chunks and get characters from them,
which would be string iterators supplying the characters rather than the
custom .next, but the buffer chunks would have to be of some size to make
that pay. Testing is the only way to find out what the crossing point is,
if you really have to.
If I understand this correctly, you'd have to switch to using a new iterator
after seeking, which would make this impossible:
f = LotzeFile('something')
for c in iter(f):
do_something(c)
if some_condition:
f.seek(somewhere)
# the next iteration reads from the new position
And it would break telling since the class can't know how many
characters have been read from an iterator once it returned one after
seeking or switching to another buffer chunk.
No.12 | | 307 bytes |
| 
Peter wrote:
class frankenstring(StringI):
def next(self):
c = self.read(1)
if not c:
raise StopIteration
return c
Repeated read(1) on a file-like object is one of the ways of doing it with
existing tools I labelled "clumsy" in the original posting ;o)
Thanks anyway.
No.13 | | 1057 bytes |
| 
Thomas Lotze wrote:
AIUI (and as a little experimenting seems to confirm), you can't
reposition an iterator over an mmap'ed file by seeking. True, you have
both iterating by characters and seeking/telling, but the two
functionalities don't play together.
A quick and dirty hack!? Maybe i'm missing what it is that you need
class MmapWithSeekAndTell(object):
def __init__(self, m, size):
self._m = m
self._pos = 0
self._size = size-1
def __iter__(self):
return self
def next(self):
if self._pos < self._size-1:
self._pos += 1
return self._m[self._pos]
raise StopIteration
def seek(self, offset, whence=0):
if whence == 0 and 0 <= offset < self._size:
self._pos = offset
elif whence == 1 and 0 <= (self._pos + offset) < self._size:
self._pos += offset
elif whence == 2 and 0<= (self._size - offset) < self._size:
self._pos = self._size - offset
def tell(self):
return self._pos
HtH, Roland
No.14 | | 152 bytes |
| 
Roland Heiber wrote:
class MmapWithSeekAndTell(object):
def __init__(self, m, size):
where m is a mmap-object and size the filesize sorry.
No.15 | | 1465 bytes |
| 
Aloha,
Thomas Lotze wrote:
I think I need an iterator over a string of characters pulling them out
one by one, like a usual iterator over a str does. At the same time the
thing should allow seeking and telling like a file-like object:
f = frankenstring("0123456789")
for c in f:
print c
if c == "2":
break
0
1
2
f.tell()
3L
f.seek(7)
for c in f:
print c
7
8
9
I can think of more than one clumsy way to implement the desired
behaviour in Python; I'd rather like to know whether there's an
implementation somewhere that does it fast. (Yes, it's me and speed
considerations again; this is for a tokenizer at the core of a library,
and I'd really like it to be fast.)
You can already think my answer, because i'm doing this
at the core of a similar library, but to give others
the chance to discuss.
f = "0123456789"
p = 0
t2 = f.find('2')+1
for c in f[p:t2]:
print c
0
1
2
p = 7
for c in f[p:]:
print c
7
8
9
A string, and a pointer on that string. If you give up the
boundary condition to tell backwards, you can start to eat up
the string via f = f[p:]. There was a performance difference
with that, in fact it was faster ~4% on a python2.2.
I dont't expect any iterator solution to be faster than
that.
Wishing a happy day
LBI
No.16 | | 590 bytes |
| 
Thomas Lotze wrote:
Peter wrote:
class frankenstring(StringI):
>def next(self):
>c = self.read(1)
>if not c:
>raise StopIteration
>return c
Repeated read(1) on a file-like object is one of the ways of doing it with
existing tools I labelled "clumsy" in the original posting ;o)
Not clumsy, just slow. I hope you'll let us know how much faster your final
approach turns out to be. By the way, I'll consider anything that doesn't
implement seek() and tell() cheating :-)
Peter
No.17 | | 1478 bytes |
| 
Andreas Lobinger wrote:
t2 = f.find('2')+1
This is indeed faster than going through a string char by char. It doesn't
make for a nice character-based state machine, but of course it avoids
making Python objects for every character and uses the C implementation of
str for searching.
However, it's only fine if you are looking for single characters. As soon
as you're looking for classes of characters, you need the (slower) regex
machinery (as you well know, but for the sake of discussion).
A string, and a pointer on that string. If you give up the boundary
condition to tell backwards, you can start to eat up the string via f =
f[p:]. There was a performance difference with that, in fact it was faster
~4% on a python2.2.
When I tried it just now, it was the other way around. Eating up the
string was slower, which makes sense to me since it involves creating new
string objects all the time.
I dont't expect any iterator solution to be faster than that.
It's not so much an issue of iterators, but handling Python objects
for every char. Iterators would actually be quite helpful for searching: I
wonder why there doesn't seem to be an str.iterfind or str.itersplit
thing. And I wonder whether there shouldn't be str.findany and
str.iterfindany, which takes a sequence as an argument and returns the
next match on any element of it.
No.18 | | 1250 bytes |
| 
Peter wrote:
Not clumsy, just slow.
As you wish ;o) I didn't mean clumsy as in "clumsy looking Python code"
anyway, rather as in "clumsy to use the Python machinery for operations
that are straight-forward and efficient in C, in which language str and
cStringI are implemented already".
I hope you'll let us know how much faster your
final approach turns out to be.
I'm pretty convinced that implementing an algorithmically nice state
machine that goes through a string char by char won't get any faster than
using s[index] all the time unless I do a frankenstring in C. Failing
that, a more pragmatic approach is what Andreas suggests; see the other
subthread.
By the way, I'll consider anything that
doesn't implement seek() and tell() cheating :-)
An implementation of frankenstring would have to have seek and tell,
that's the point of doing it. But for half-way simple state machines,
hiding the index handling in a Python class that slows things down it just
not worth it. Doing index += 1 here and there is fine if it happens only
half a dozen times. I know it's not beautiful, that's why I started this
thread ;o)
No.19 | | 1609 bytes |
| 
Aloha,
Thomas Lotze wrote:
>>A string, and a pointer on that string. If you give up the boundary
>>condition to tell backwards, you can start to eat up the string via f =
>>f[p:]. There was a performance difference with that, in fact it was faster
4% on a python2.2.
When I tried it just now, it was the other way around. Eating up the
string was slower, which makes sense to me since it involves creating new
string objects all the time.
I expected the f[p:] also to be slower, the 4% i only measured on one
platform. Most propably the CG and memory management isn't the same.
>>I dont't expect any iterator solution to be faster than that.
It's not so much an issue of iterators, but handling Python objects
for every char. Iterators would actually be quite helpful for searching: I
wonder why there doesn't seem to be an str.iterfind or str.itersplit
thing. And I wonder whether there shouldn't be str.findany and
str.iterfindany, which takes a sequence as an argument and returns the
next match on any element of it.
There is a finditer in the re. I'm currently rewriting a few pattern
matching things and find it quite valueable.
import re
pat = re.compile('[57]')
f = "754356184756046104564"
for a in pat.finditer(f):
print a.start(),f[a.start()]
0 7
1 5
4 5
9 7
10 5
18 5
Wishing a happy day
LBI
No.20 | | 700 bytes |
| 
Thomas Lotze wrote:
And I wonder whether there shouldn't be str.findany and
str.iterfindany, which takes a sequence as an argument and returns the
next match on any element of it.
second thought, that wouldn't gain much on a loop over finding each
sequence, but add more complexity than it is worth. What would be more
useful, especially thinking of a C implementation, is str.findanyof and
str.findnoneof. They take a string as an argument and find the first
occurrence of any char in that string or any char not in that string,
resp. Especially finding any char not among a given few needs a hoop to
jump through now, if I didn't miss anything.
No.21 | | 590 bytes |
| 
>jay graves wrote:
>see StringI or cStringI in the standard library.
Just as with files, iterating over them returns whole lines, which is
unfortunately not what I want.
Then why not subclass it and alter the iteration scheme to do a read(1)
or something?
from StringI import StringI
class FrankenString(StringI):
lastidx = 0
atEnd = False
def __iter__(self):
while not self.atEnd:
char = self.read(1)
idx = self.tell()
if self.lastidx == idx:
self.atEnd = True
self.lastidx = idx
yield char
No.22 | | 688 bytes |
| 
Here is a cStringI based version:
class FrankenString:
def __init__(self,string=None):
self.str = StringI(string)
self.atEnd = False
self.lastidx = 0
self.seek = self.str.seek
self.tell = self.str.tell
def __iter__(self):
while not self.atEnd:
char = self.str.read(1)
idx = self.str.tell()
if self.lastidx == idx:
self.atEnd = True
self.lastidx = idx
yield char
a string 1024*1024 long the StringI version takes 10s to iterate
over but do nothing, whereas this one takes 3.1 on my system. Well to
create that string, create the instance and loop over. But since each
variant is doing the same thing I figure it's even. ;)
No.23 | | 499 bytes |
| 
well that appears to have been munged up
that tell() belongs immediately after self.str. making it
self.str.tell()
class FrankenString:
def __init__(self,string=None):
self.str = StringI(string)
self.atEnd = False
self.lastidx = 0
self.seek = self.str.seek
self.tell = self.str.tell
def __iter__(self):
while not self.atEnd:
char = self.str.read(1)
idx = self.str.tell()
if self.lastidx == idx:
self.atEnd = True
self.lastidx = idx
yield char
No.24 | | 2718 bytes |
| 
Peter wrote:
I hope you'll let us know how much faster your
final approach turns out to be
K, here's a short report on the current state. Such code as there is can
be found at <>,
with a Python mock-up in the same directory.
Thinking about it (Andreas, thank you for the reminder :o)), doing
character-by-character scanning in Python is stupid, both in terms of
speed and, given some more search capabilities than str currently has,
elegance.
So what I did until now (except working myself into writing extensions
in C) is give the evolving FrankenString some search methods that enable
searching for the first occurrence in the string of any character out of
a set of characters given as a string, or any character not in such a
set. This has nothing to do yet with iterators and seeking/telling.
Just letting C do the "while data[index] not in whitespace: index += 1"
part speeds up my PDF tokenizer by a factor between 3 and 4. I have
never compared that directly to using regular expressions, though As
a bonus, even with this minor addition the Python code looks a little
cleaner already:
c = data[cursor]
while c in whitespace:
# Whitespace tokens.
cursor += 1
if c == '%':
# We're just inside a comment, read beyond EL.
while data[cursor] not in "\r\n":
cursor += 1
cursor += 1
c = data[cursor]
becomes
cursor = data.skipany(whitespace, start)
c = data[cursor]
while c == '%':
# Whitespace tokens: comments till EL and whitespace.
cursor = data.skipother("\r\n", cursor)
cursor = data.skipany(whitespace, cursor)
c = data[cursor]
(removing '%' from the whitespace string, in case you wonder).
The next thing to do is make FrankenString behave. Right now there's too
much copying of string content going on everytime a FrankenString is
initialized; I'd like it to share string content with other
FrankenStrings or strs much like cStringI does. I hope it's just a
matter of learning from cStringI To justify the "franken" part of the
name some more, I consider mixing in yet another ingredient and making
the thing behave like a buffer in that a FrankenString should be
possible to make from only part of a string without copying data.
After that, the thing about seeking and telling iterators over
characters or search results comes in. I don't think it will make much
difference in performance now that the stupid character searching has
been done in C, but it'll hopefully make for more elegant Python code.