## Saturday, March 16, 2013

### Python's itertools.tee() explained with examples

Input:
```
#!usr/bin/env/python
from itertools import *
r=islice(count(),5)
iterator_1,iterator_2=tee(r)
print "iterator_1,iterator_2:",list(iterator_1),list(iterator_2)
#you can create n number of iterators,but default is set to tw0 iterators.
print"\nExample 2:"
r_2=islice(count(),5)
i1,i2=tee(r_2)
print 'r_2:',
for i in r_2:
print i,
if i>1:
break
print
print'i1,i2:',list(i1),list(i2)
```
Output:
```
iterator_1,iterator_2: [0, 1, 2, 3, 4] [0, 1, 2, 3, 4]

Example 2:
r_2: 0 1 2
i1,i2: [3, 4] [3, 4]```
` `

Itertools is a great library, but some methods definitely receive more attention than others – for instance, I’d wager that `chain` is a lot more well-known than `tee`. Python’s documentation describes `tee` as follows:

Return n independent iterators from a single iterable.
Once tee() has made a split, the original iterable should not be used anywhere else; otherwise, the iterable could get advanced without the tee objects being informed.(Example-2 is Illustrated).
Caution:Don't read the rest....you may feel boring.Learn about Iterators,Generators and come back.

More on this is covered in this url: http://jezng.com/2012/06/inside-python-tee/

However, the documentation doesn’t really explain why you might want to use `tee`, instead of copying the iterable `n` times. The short answer is that `tee()` uses some heuristics and strategies to make the generation of these `n` iterators more memory efficient. This article will peek inside Python’s internals and explore these strategies in more detail.

## Iterables, Iterators, and the Iterator Protocol

To understand the basis for the optimizations, it is essential to know how Python’s iterator protocol works. (More experienced Pythonistas might want to skim this section – it’s here for completeness.)
Iterables are Python objects that define an `__iter__()` method, which, when called, returns an iterator. Iterators define the `__iter__()` method as well, but implement it by simply returning the iterator itself. Iterators also implement a `next()` method, which returns the next element in a sequence. To signal the end of a sequence, `next()` raises a `StopIteration` exception. Any object that implements both `__iter__` and `next` in the aforementioned manner is said to implement the iterator protocol.
To actually use these iterators and iterables, we turn to Python’s classic `for item in sequence` loop. Here, `sequence` can be either an iterator or an iterable – it doesn’t matter, because the loop will begin by implicitly invoking `seq`’s `__iter__` method, which will return an iterator. Now, on each round through the loop, the Python interpreter will invoke the iterator’s `next` method and assign it to `item`, and will continue to do so until `next` raises `StopIteration`.
Perhaps you are wondering why Python makes the distinction between iterables and iterators, and why iterators don’t have `prev` or `rewind` methods. These two questions are in fact related. It can be expensive or complicated to implement these other methods – for instance, reading sequentially forwards from a file corresponds simply to `fread()` in C, but rewinding to the start is a more expensive `fseek` call, and there is no efficient way to read sequentially in reverse.
More importantly, `prev` and `rewind` are not essential for iteration – they can be built upon the primitive `next` method. For instance, if we really wish to access the earlier contents of the file in Python, we could buffer the earlier results ourselves. However, buffering all the earlier values is generally inefficient and wasteful: many programs have no need to access earlier values of an iterator, and those that do tend to limit themselves to one or two elements before the current one. Rewinding an iterable is a more common use case, but the same thing can be achieved by obtaining an entirely new iterator that points to the start of the sequence. To do that, we need a factory that produces iterators – which is exactly what an iterable is.
In sum, Python’s iterator protocol is simple to implement, and it works well for the common use case of `for .. in` loops.

## Sometimes Buffering is Useful

The starkness of Python’s iterator design means that more complicated use cases will need to build their own abstractions on top of it. Fortunately, its simplicity also means that it is easy to build upon.
`tee()` is one such abstraction, built to efficiently create `n` independent iterators. Consider file I/O again: we might not want to create `n` file iterations by calling `fopen()` over and over, especially if `n` is large. Nor would we want to copy iterators that do heavy computation each time we called `next()`. In both cases buffering is a better solution, and this is what `tee` does, collating in one central list the values returned by the original iterator. Its return value consists of `n` `tee` iterator objects,1 each of which stores a single integer index that indicates the next element it should return from the list. The list itself is populated lazily – whenever any of the `n` iterators asks for a value at an index that is not yet in the buffer, `tee()` calls `next` on the underlying iterator, then caches and returns the new value.
With this implementation detail in mind, the reasoning behind the documentation’s caveat becomes clearer – using the iterator outside of `tee()` would cause some (or all) values to be lost from the cache.

## Sometimes Buffering is Not Useful

Buffering is not always the most efficient way to create `n` independent iterators. For example, `next` could be a computationally cheap function, and it would be more efficient for us to just copy it `n` times. We can indicate this fact by defining a `__copy__` method on our iterator.2 If `tee()` detects the presence of `__copy__`, it will copy the iterator instead of doing buffering. Moreover, it will only do this copying `n - 1` times – the last iterator will simply be the original one that was passed in! Since the original iterator should never be used after being passed to `tee()`, this makes perfect sense; the end result will appear the same to the library consumer.
In fact, `tee` objects themselves implement `__copy__`! Since any `tee` object is backed by a central buffer, the most efficient way to duplicate one is to have the copy point to the same buffer. So not only does `tee` duplicate any iterator efficiently, it also ensures that future duplications are efficient.
Notably, as a consequence of this optimization, passing in a non-`tee` object that implements `__copy__` will return a tuple of copies of that object, not `tee` objects. In practice, the type of the object should not matter as long as one sticks to using it solely as an iterator.

## Efficient Buffering: The gritty details

If the total number of items generated by the iterator is very large, storing all of the generated values might take up a lot of memory. If we can keep track of all the indices of the `tee` objects generated by a `tee()` function call, we can delete old values once all of the indices have passed it. (Since iterators can only progress forwards, we know that these values will never be used again.) With CPython’s reference counting, this is actually quite simple: the buffer can be implemented as a singly linked list, with each `tee` object holding a pointer to the next element that it will return. As a natural consequence, once the last `tee` iterator is done with a buffer element, the element’s reference count goes to zero and it gets garbage collected.
However, linked lists perform worse than arrays in terms of memory fragmentation. Moreover, they incur lots of overhead in memory allocations and deallocations. As a compromise, CPython uses linked arrays – a linked list with arrays as individual elements. Moreover, these arrays are sized such that the entire array element is 64 bits in size, fitting nicely into cache lines.

## A Python Implementation

Python’s documentation gives pure-Python implementation of `tee()`, but it simplifies things and leaves out the buffering and copy semantics. I’ve written up a slightly more involved one that reflects the optimizations described above. Unlike the version in the docs, this implementation passes the standard library’s test suite. It still makes some concessions to simplicity: for instance, buffering uses a single list, instead of the linked arrays described above.
``````    class TeeIterator(object):

def __new__(cls, tee_data):
if isinstance(tee_data, TeeData):
self = super(TeeIterator, cls).__new__(cls)
self.tee_data = tee_data
else:
self = TeeIterator.from_iterable(tee_data)
self.index = 0
return self

def __copy__(self):
return TeeIterator(self.tee_data)

def __iter__(self):
return self

def next(self):
rv = self.tee_data[self.index]
self.index += 1
return rv

@classmethod
def from_iterable(cls, iterable):
if isinstance(iterable, cls):
return iterable.__copy__()
tee_data = TeeData(iter(iterable))
return TeeIterator(tee_data)

class TeeData(object):

def __init__(self, iterator):
self.iterator = iterator
self.buffer = []

def __getitem__(self, index):
if index == len(self.buffer):
self.buffer.append(next(self.iterator))
return self.buffer[index]

def tee(iterable, n=2):

if n < 0:
raise ValueError("n must be >= 0")
elif n == 0:
return ()

if hasattr(iterable, '__copy__'):
copyable = iterable
else:
copyable = TeeIterator.from_iterable(iterable)

result = [ copyable ]

for i in xrange(1, n):
result.append(copyable.__copy__())

return tuple(result)
``````

1. Yes, the returned objects are called `tee` as well, but they are completely distinct from the `itertools.tee()` function.
2. `copy.copy()` looks for this method too.
http://code.activestate.com/recipes/305588-simple-example-to-show-off-itertoolstee/

Learn python for fun.The popular blog with questions and answers to the python.Solutions to facebookhackercup,codejam,codechef.The fun way to learn python with me.Building some cool apps.