Input:
Itertools is a great library, but some methods definitely receive more attention than others – for instance, I’d wager that
More on this is covered in this url: http://jezng.com/2012/06/inside-python-tee/
However, the documentation doesn’t really explain why you might want to use
Iterables are Python objects that define an
To actually use these iterators and iterables, we turn to Python’s classic
Perhaps you are wondering why Python makes the distinction between iterables and iterators, and why iterators don’t have
More importantly,
In sum, Python’s iterator protocol is simple to implement, and it works well for the common use case of
With this implementation detail in mind, the reasoning behind the documentation’s caveat becomes clearer – using the iterator outside of
In fact,
Notably, as a consequence of this optimization, passing in a non-
However, linked lists perform worse than arrays in terms of memory
fragmentation. Moreover, they incur lots of overhead in memory
allocations and deallocations. As a compromise, CPython uses linked
arrays – a linked list with arrays as individual elements. Moreover,
these arrays are sized such that the entire array element is 64 bits in
size, fitting nicely into cache lines.
http://code.activestate.com/recipes/305588-simple-example-to-show-off-itertoolstee/
Learn python for fun.The popular blog with questions and answers to the python.Solutions to facebookhackercup,codejam,codechef.The fun way to learn python with me.Building some cool apps.
#!usr/bin/env/python from itertools import * r=islice(count(),5) iterator_1,iterator_2=tee(r) print "iterator_1,iterator_2:",list(iterator_1),list(iterator_2) #you can create n number of iterators,but default is set to tw0 iterators. print"\nExample 2:" r_2=islice(count(),5) i1,i2=tee(r_2) print 'r_2:', for i in r_2: print i, if i>1: break print print'i1,i2:',list(i1),list(i2)Output:
iterator_1,iterator_2: [0, 1, 2, 3, 4] [0, 1, 2, 3, 4] Example 2: r_2: 0 1 2 i1,i2: [3, 4] [3, 4]
Itertools is a great library, but some methods definitely receive more attention than others – for instance, I’d wager that
chain
is a lot more well-known than tee
. Python’s documentation describes tee
as follows:
Return n independent iterators from a single iterable.It also adds this caveat:
Once tee() has made a split, the original iterable should not be used anywhere else; otherwise, the iterable could get advanced without the tee objects being informed.(Example-2 is Illustrated).
Caution:Don't read the rest....you may feel boring.Learn about Iterators,Generators and come back.
More on this is covered in this url: http://jezng.com/2012/06/inside-python-tee/
However, the documentation doesn’t really explain why you might want to use
tee
, instead of copying the iterable n
times. The short answer is that tee()
uses some heuristics and strategies to make the generation of these n
iterators more memory efficient. This article will peek inside Python’s internals and explore these strategies in more detail.Iterables, Iterators, and the Iterator Protocol
To understand the basis for the optimizations, it is essential to know how Python’s iterator protocol works. (More experienced Pythonistas might want to skim this section – it’s here for completeness.)Iterables are Python objects that define an
__iter__()
method, which, when called, returns an iterator. Iterators define the __iter__()
method as well, but implement it by simply returning the iterator itself. Iterators also implement a next()
method, which returns the next element in a sequence. To signal the end of a sequence, next()
raises a StopIteration
exception. Any object that implements both __iter__
and next
in the aforementioned manner is said to implement the iterator protocol.To actually use these iterators and iterables, we turn to Python’s classic
for item in sequence
loop. Here, sequence
can be either an iterator or an iterable – it doesn’t matter, because the loop will begin by implicitly invoking seq
’s __iter__
method, which will return an iterator. Now, on each round through the loop, the Python interpreter will invoke the iterator’s next
method and assign it to item
, and will continue to do so until next
raises StopIteration
.Perhaps you are wondering why Python makes the distinction between iterables and iterators, and why iterators don’t have
prev
or rewind
methods. These two questions are in fact related. It can be expensive
or complicated to implement these other methods – for instance, reading
sequentially forwards from a file corresponds simply to fread()
in C, but rewinding to the start is a more expensive fseek
call, and there is no efficient way to read sequentially in reverse.More importantly,
prev
and rewind
are not essential for iteration – they can be built upon the primitive next
method. For instance, if we really wish to access the earlier contents
of the file in Python, we could buffer the earlier results ourselves.
However, buffering all the earlier values is generally inefficient and
wasteful: many programs have no need to access earlier values of an
iterator, and those that do tend to limit themselves to one or two
elements before the current one. Rewinding an iterable is a more common
use case, but the same thing can be achieved by obtaining an entirely
new iterator that points to the start of the sequence. To do that, we
need a factory that produces iterators – which is exactly what an
iterable is.In sum, Python’s iterator protocol is simple to implement, and it works well for the common use case of
for .. in
loops.Sometimes Buffering is Useful
The starkness of Python’s iterator design means that more complicated use cases will need to build their own abstractions on top of it. Fortunately, its simplicity also means that it is easy to build upon.tee()
is one such abstraction, built to efficiently create n
independent iterators. Consider file I/O again: we might not want to create n
file iterations by calling fopen()
over and over, especially if n
is large. Nor would we want to copy iterators that do heavy computation each time we called next()
. In both cases buffering is a better solution, and this is what tee
does, collating in one central list the values returned by the original iterator. Its return value consists of n
tee
iterator objects,1
each of which stores a single integer index that indicates the next
element it should return from the list. The list itself is populated
lazily – whenever any of the n
iterators asks for a value at an index that is not yet in the buffer, tee()
calls next
on the underlying iterator, then caches and returns the new value.With this implementation detail in mind, the reasoning behind the documentation’s caveat becomes clearer – using the iterator outside of
tee()
would cause some (or all) values to be lost from the cache.Sometimes Buffering is Not Useful
Buffering is not always the most efficient way to createn
independent iterators. For example, next
could be a computationally cheap function, and it would be more efficient for us to just copy it n
times. We can indicate this fact by defining a __copy__
method on our iterator.2 If tee()
detects the presence of __copy__
, it will copy the iterator instead of doing buffering. Moreover, it will only do this copying n - 1
times – the last iterator will simply be the original one that was
passed in! Since the original iterator should never be used after being
passed to tee()
, this makes perfect sense; the end result will appear the same to the library consumer.In fact,
tee
objects themselves implement __copy__
! Since any tee
object is backed by a central buffer, the most efficient way to
duplicate one is to have the copy point to the same buffer. So not only
does tee
duplicate any iterator efficiently, it also ensures that future duplications are efficient.Notably, as a consequence of this optimization, passing in a non-
tee
object that implements __copy__
will return a tuple of copies of that object, not tee
objects. In practice, the type of the object should not matter as long as one sticks to using it solely as an iterator.Efficient Buffering: The gritty details
If the total number of items generated by the iterator is very large, storing all of the generated values might take up a lot of memory. If we can keep track of all the indices of thetee
objects generated by a tee()
function call, we can delete old values once all of the indices have
passed it. (Since iterators can only progress forwards, we know that
these values will never be used again.) With CPython’s reference
counting, this is actually quite simple: the buffer can be implemented
as a singly linked list, with each tee
object holding a pointer to the next element that it will return. As a natural consequence, once the last tee
iterator is done with a buffer element, the element’s reference count goes to zero and it gets garbage collected.A Python Implementation
Python’s documentation gives pure-Python implementation oftee()
,
but it simplifies things and leaves out the buffering and copy
semantics. I’ve written up a slightly more involved one that reflects
the optimizations described above. Unlike the version in the docs, this
implementation passes the standard library’s test suite. It still makes
some concessions to simplicity: for instance, buffering uses a single
list, instead of the linked arrays described above. class TeeIterator(object):
def __new__(cls, tee_data):
if isinstance(tee_data, TeeData):
self = super(TeeIterator, cls).__new__(cls)
self.tee_data = tee_data
else:
self = TeeIterator.from_iterable(tee_data)
self.index = 0
return self
def __copy__(self):
return TeeIterator(self.tee_data)
def __iter__(self):
return self
def next(self):
rv = self.tee_data[self.index]
self.index += 1
return rv
@classmethod
def from_iterable(cls, iterable):
if isinstance(iterable, cls):
return iterable.__copy__()
tee_data = TeeData(iter(iterable))
return TeeIterator(tee_data)
class TeeData(object):
def __init__(self, iterator):
self.iterator = iterator
self.buffer = []
def __getitem__(self, index):
if index == len(self.buffer):
self.buffer.append(next(self.iterator))
return self.buffer[index]
def tee(iterable, n=2):
if n < 0:
raise ValueError("n must be >= 0")
elif n == 0:
return ()
if hasattr(iterable, '__copy__'):
copyable = iterable
else:
copyable = TeeIterator.from_iterable(iterable)
result = [ copyable ]
for i in xrange(1, n):
result.append(copyable.__copy__())
return tuple(result)
Learn python for fun.The popular blog with questions and answers to the python.Solutions to facebookhackercup,codejam,codechef.The fun way to learn python with me.Building some cool apps.
No comments:
Post a Comment