This repository was archived by the owner on Apr 14, 2024. It is now read-only.

Commit fe98426

Added brief usage guide and test for the given examples

1 parent 4c6c53b commit fe98426

File tree

7 files changed: +148 -14 lines changed


channel.py

Lines changed: 1 addition & 1 deletion
@@ -16,7 +16,7 @@

 __all__ = ('Channel', 'SerialChannel', 'Writer', 'ChannelWriter', 'CallbackChannelWriter',
            'Reader', 'ChannelReader', 'CallbackChannelReader', 'mkchannel', 'ReadOnly',
-           'IteratorReader')
+           'IteratorReader', 'CallbackReaderMixin', 'CallbackWriterMixin')

 #{ Classes
 class Channel(object):
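The newly exported ``CallbackReaderMixin`` and ``CallbackWriterMixin`` names suggest a pre/post-callback pattern wrapped around channel reads. A minimal pure-Python sketch of that idea follows; the method names and semantics here are assumptions for illustration, not the library's actual API:

```python
class CallbackReaderMixin(object):
    """Sketch: wraps read() with optional pre- and post-read callbacks.

    Hypothetical names - not the actual async.channel API.
    """
    def __init__(self):
        self._pre_cb = None
        self._post_cb = None

    def set_pre_cb(self, fun):
        self._pre_cb = fun

    def set_post_cb(self, fun):
        self._post_cb = fun

    def read(self, count=0):
        if self._pre_cb:
            self._pre_cb(count)                 # just-in-time notification
        items = self._do_read(count)            # the actual channel read
        if self._post_cb:
            items = self._post_cb(items)        # may transform the items
        return items


class ListReader(CallbackReaderMixin):
    """Toy reader backed by a list, standing in for a real ChannelReader."""
    def __init__(self, items):
        super(ListReader, self).__init__()
        self._items = list(items)

    def _do_read(self, count):
        n = len(self._items) if count == 0 else count
        out, self._items = self._items[:n], self._items[n:]
        return out


r = ListReader(range(5))
r.set_post_cb(lambda items: [i * 10 for i in items])
assert r.read() == [0, 10, 20, 30, 40]
```

The mixin only decorates ``read``; the backing storage is supplied by whatever class mixes it in, which is what makes it composable with different reader types.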

doc/source/intro.rst

Lines changed: 10 additions & 8 deletions
@@ -2,20 +2,22 @@
 Overview
 ########

-The *GitDB* project implements interfaces to allow read and write access to git repositories. In its core lies the *db* package, which contains all database types necessary to read a complete git repository. These are the ``LooseObjectDB``, the ``PackedDB`` and the ``ReferenceDB`` which are combined into the ``GitDB`` to combine every aspect of the git database.
+*Async* is one more attempt to make the definition and execution of asynchronous, interdependent operations easy. To that end, you define tasks which communicate with each other through channels. Channels transfer items, much like bytes flowing through the pipes used in inter-process communication. Items are only generated on demand, that is, when you read from the respective output channel.

-For this to work, GitDB implements pack reading, as well as loose object reading and writing. Data is always encapsulated in streams, which allows huge files to be handled as well as small ones, usually only chunks of the stream are kept in memory for processing, never the whole stream at once.
+As it turned out, the GIL is far more restrictive than initially thought, which effectively means true concurrency can only be obtained during input/output to files and sockets, as well as in specially written versions of existing CPython extensions which release the GIL before lengthy operations. Many of the currently available C extensions, such as zlib, lock everything down to just one thread at a time, even though this isn't a strict technical requirement.

-Interfaces are used to describe the API, making it easy to provide alternate implementations.
+If you want to make good use of *async*, you will have to plan the operation carefully, and you might end up writing new C extensions, or altering existing ones, to do so.
+
+If you have 10 minutes, watch a more graphical presentation `on youtube <http://www.youtube.com/watch?v=wy1yB1M-dcQ>`_.

 ================
-Installing GitDB
+Installing Async
 ================
-Its easiest to install gitdb using the *easy_install* program, which is part of the `setuptools`_::
+It's easiest to install async using the *easy_install* program, which is part of the `setuptools`_::

-    $ easy_install gitdb
+    $ easy_install async

-As the command will install gitdb in your respective python distribution, you will most likely need root permissions to authorize the required changes.
+As the command will install async in your respective Python distribution, you will most likely need root permissions to authorize the required changes.

 If you have downloaded the source archive, the package can be installed by running the ``setup.py`` script::

@@ -24,6 +26,6 @@ If you have downloaded the source archive, the package can be installed by runni
 ===============
 Getting Started
 ===============
-It is advised to have a look at the :ref:`Usage Guide <tutorial-label>` for a brief introduction on the different database implementations.
+It is advised to have a look at the :ref:`Usage Guide <tutorial-label>` for a brief introduction.

 .. _setuptools: http://peak.telecommunity.com/DevCenter/setuptools

doc/source/usage.rst

Lines changed: 84 additions & 0 deletions
@@ -0,0 +1,84 @@
+.. _tutorial-label:
+
+###########
+Usage Guide
+###########
+
+******
+Design
+******
+The central instance within *async* is the **Pool**. A pool keeps a set of 0 or more workers which can run asynchronously and process **Task**\ s. Tasks are added to the pool using the ``add_task`` method. Once added, the caller receives a **ChannelReader** instance which connects to a channel. Calling ``read`` on that instance triggers the actual computation. A ChannelReader can also serve as input for another task, which, once added to the pool, indicates a dependency between these tasks. To obtain one item from task 2, one item needs to be produced by task 1 beforehand - the pool takes care of the dependency handling as well as the scheduling.
+
+Task instances allow you to define the minimum amount of items to be processed on each request, as well as the maximum amount of items per batch. This chunking behaviour gives you fine-grained control over the memory requirements as well as the actually achieved concurrency of your chain of tasks.
+
+Task chunks are the units actually being processed by the workers; the pool assures these are processed in the right order. Chunks help to bridge the gap between items that take a long time to process and those which are generated quickly. Generally, slow tasks should have small chunks, otherwise some workers might end up waiting for input while another worker slowly processes the items of a big chunk.
+
+**************
+The ThreadPool
+**************
+A thread pool is a pool implementation which uses threads as workers. ``ChannelReader``\ s are blocking channels which serve as the means of communication between tasks currently being processed.
+
+The ``set_size`` method is essential, as it determines the amount of workers in the pool. It defaults to 0 for newly created pools, which is equivalent to a fully synchronous mode of operation - all processing is effectively done by the calling thread::
+
+    from async.pool import ThreadPool
+
+    p = ThreadPool()
+    # default size is 0, synchronous mode
+    assert p.size() == 0
+
+    # now tasks would be processed asynchronously
+    p.set_size(1)
+    assert p.size() == 1
+
+Currently this is the only implementation, but it was designed with the ``multiprocessing`` package in mind, which shouldn't make it too hard to add such an implementation in future releases.
+
+*****
+Tasks
+*****
+A task encapsulates a unit of work and defines how its items should be processed. The processing is usually performed per item, calling a function with one item to receive a processed item back, which will be written into the output channel. The reader end of that channel is either held by the client of the items, or by another task which performs additional processing.
+
+In the following example, a simple task is created which takes integers and multiplies each by itself::
+
+    from async.task import IteratorThreadTask
+
+    # A task performing processing on items from an iterator
+    t = IteratorThreadTask(iter(range(10)), "power", lambda i: i*i)
+    reader = p.add_task(t)
+
+    # read all items - they were processed by worker 1
+    items = reader.read()
+    assert len(items) == 10 and items[0] == 0 and items[-1] == 81
+
+
+*****************************
+Channels, Readers and Writers
+*****************************
+Channels are the means of communication between tasks, as well as the way for clients to finally receive the processed items. A channel has one or more write ends and one or more read ends. Readers will block if there are fewer than the requested amount of items, but will wake up once the missing items were sent through the write end.
+
+A channel's major difference from a queue is its ability to be closed, which immediately wakes up all waiting readers.
+
+Reader Callbacks
+================
+The reader returned by the pool's ``add_task`` method is a specialized version of a ``CallbackChannelReader``, which allows you to set up functions to be called before and after an item is read. This allows for just-in-time notification of asynchronous events, as well as for applying item transformations.
+
+**************
+Chaining Tasks
+**************
+When using different task types, chains between tasks can be created. These are understood by the pool, which recognizes the implicit task dependency and schedules the tasks in the right order.
+
+The following example creates two tasks which combine their results. As the pool only has one worker, and as the chunk size is maximized, we can be sure that the items are returned in order in this case::
+
+    from async.task import ChannelThreadTask
+
+    t = IteratorThreadTask(iter(range(10)), "power", lambda i: i*i)
+    reader = p.add_task(t)
+
+    # chain both by linking their readers
+    tmult = ChannelThreadTask(reader, "mult", lambda i: i*2)
+    result_reader = p.add_task(tmult)
+
+    # read all
+    items = result_reader.read()
+    assert len(items) == 10 and items[0] == 0 and items[-1] == 162
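The usage guide's claim that closing a channel immediately wakes all waiting readers can be modelled with a condition variable. The following is a minimal sketch under that assumption, not the library's actual implementation; all names are illustrative:

```python
import threading
from collections import deque

class ClosableChannel(object):
    """Sketch of a queue that can be closed; close() wakes every blocked reader."""
    def __init__(self):
        self._items = deque()
        self._closed = False
        self._cond = threading.Condition()

    def write(self, item):
        with self._cond:
            if self._closed:
                raise IOError("channel closed")
            self._items.append(item)
            self._cond.notify()

    def close(self):
        with self._cond:
            self._closed = True
            self._cond.notify_all()     # wake all waiting readers at once

    def read(self, count=1):
        """Block until `count` items are available or the channel is closed."""
        with self._cond:
            while len(self._items) < count and not self._closed:
                self._cond.wait()
            out = []
            while self._items and len(out) < count:
                out.append(self._items.popleft())
            return out                  # may be short if the channel was closed


c = ClosableChannel()
c.write(1)
c.write(2)
assert c.read(2) == [1, 2]
c.close()
assert c.read(5) == []      # returns immediately instead of blocking forever
```

Without the ``close`` escape hatch, a reader asking for more items than will ever be written would block indefinitely - which is the behaviour a plain bounded queue gives you.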

pool.py

Lines changed: 5 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -106,8 +106,7 @@ def pool(self):
106106

107107
def read(self, count=0, block=True, timeout=None):
108108
"""Read an item that was processed by one of our threads
109-
:note: Triggers task dependency handling needed to provide the necessary
110-
input"""
109+
:note: Triggers task dependency handling needed to provide the necessary input"""
111110
# NOTE: we always queue the operation that would give us count items
112111
# as tracking the scheduled items or testing the channels size
113112
# is in herently unsafe depending on the design of the task network
@@ -389,13 +388,15 @@ def num_tasks(self):
389388
self._taskgraph_lock.release()
390389

391390
def remove_task(self, task, _from_destructor_ = False):
392-
"""Delete the task
391+
"""
392+
Delete the task.
393393
Additionally we will remove orphaned tasks, which can be identified if their
394394
output channel is only held by themselves, so no one will ever consume
395395
its items.
396396
397397
This method blocks until all tasks to be removed have been processed, if
398398
they are currently being processed.
399+
399400
:return: self"""
400401
self._taskgraph_lock.acquire()
401402
try:
@@ -430,6 +431,7 @@ def remove_task(self, task, _from_destructor_ = False):
430431

431432
def add_task(self, task):
432433
"""Add a new task to be processed.
434+
433435
:return: a read channel to retrieve processed items. If that handle is lost,
434436
the task will be considered orphaned and will be deleted on the next
435437
occasion."""

task.py

Lines changed: 3 additions & 1 deletion
@@ -196,8 +196,10 @@ class IteratorTaskBase(Task):
     def __init__(self, iterator, *args, **kwargs):
         Task.__init__(self, *args, **kwargs)
         self._read = IteratorReader(iterator).read
+
         # defaults to returning our items unchanged
-        self.fun = lambda item: item
+        if self.fun is None:
+            self.fun = lambda item: item


 class IteratorThreadTask(IteratorTaskBase, ThreadTaskBase):
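The change above turns the identity function into a fallback rather than an unconditional overwrite, so a ``fun`` supplied by the caller is no longer clobbered by the subclass constructor. The pattern in isolation, using toy classes rather than the actual Task hierarchy:

```python
class Task(object):
    def __init__(self, fun=None):
        # fun may be provided by the caller, or left unset
        self.fun = fun

class IteratorTask(Task):
    def __init__(self, fun=None):
        Task.__init__(self, fun)
        # fallback: only install the identity function if none was given
        if self.fun is None:
            self.fun = lambda item: item

assert IteratorTask().fun(42) == 42                   # identity fallback applies
assert IteratorTask(lambda i: i * 2).fun(21) == 42    # caller's fun is preserved
```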

test/test_example.py

Lines changed: 44 additions & 0 deletions
@@ -0,0 +1,44 @@
+"""Module containing examples from the documentation"""
+from lib import *
+
+from async.pool import *
+from async.task import *
+from async.thread import terminate_threads
+
+
+class TestExamples(TestBase):
+
+    @terminate_threads
+    def test_usage(self):
+        p = ThreadPool()
+        # default size is 0, synchronous mode
+        assert p.size() == 0
+
+        # now tasks would be processed asynchronously
+        p.set_size(1)
+        assert p.size() == 1
+
+        # A task performing processing on items from an iterator
+        t = IteratorThreadTask(iter(range(10)), "power", lambda i: i*i)
+        reader = p.add_task(t)
+
+        # read all items - they were processed by worker 1
+        items = reader.read()
+        assert len(items) == 10 and items[0] == 0 and items[-1] == 81
+
+        # chaining
+        t = IteratorThreadTask(iter(range(10)), "power", lambda i: i*i)
+        reader = p.add_task(t)
+
+        # chain both by linking their readers
+        tmult = ChannelThreadTask(reader, "mult", lambda i: i*2)
+        result_reader = p.add_task(tmult)
+
+        # read all
+        items = result_reader.read()
+        assert len(items) == 10 and items[0] == 0 and items[-1] == 162
test/test_pool.py

Lines changed: 1 addition & 1 deletion
@@ -1,4 +1,4 @@
-"""Channel testing"""
+"""Pool testing"""
 from lib import *
 from task import *
