papy.papy

This module provides classes and functions to construct and run a PaPy pipeline.

class papy.papy.Chain(iterables, stride=1)

This is a generalization of the zip and chain functions. If stride = 1 it behaves like itertools.izip; if stride = len(iterable) it behaves like itertools.chain; in any other case it zips the iterables in strides, e.g.:

a = Chain([iter([1, 2, 3]), iter([4, 5, 6])], stride=2)
list(a)
>>> [1, 2, 4, 5, 3, 6]

It is further resistant to exceptions i.e. if one of the iterables raises an exception the Chain does not end in a StopIteration, but continues with other iterables.

next()
Returns the next result from the chained iterables given stride.
class papy.papy.Consume(iterable, n=1, stride=1)

This iterator-wrapper consumes n results from the input iterator and weaves the results together in strides. If the result is an exception it is not raised.

next()
Returns the next sequence of results, given stride and n.
class papy.papy.Dagger(pipers=(), pipes=(), xtras=None)

The Dagger is a directed acyclic graph. It defines the topology of a PaPy pipeline/workflow. It is a subclass of Graph; within the Dagger the edges of the Graph, with their direction inverted, are called pipes. Edges can be regarded as dependencies, while pipes represent data-flow between Pipers, the Nodes of the Graph.

Arguments:

  • pipers(sequence) [default: ()]

    A sequence of valid add_piper inputs (see the documentation for the add_piper method).

  • pipes(sequence) [default: ()]

    A sequence of valid add_pipe inputs (see the documentation for the add_pipe method).

add_pipe(pipe, branch=None)

Adds a pipe (A, ..., N), which is an N-tuple of Pipers. Adding a pipe means adding all its Pipers and connecting them in the specified left-to-right order.

Arguments:

  • pipe(sequence)

    N-tuple of Piper instances or objects which are valid add_piper arguments. See: Dagger.add_piper and Dagger.resolve.

Note

The direction of the edges in the graph is reversed compared to the left to right data-flow in a pipe.
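
The following is a minimal, hypothetical sketch (not taken from the PaPy distribution) of building a two-Piper pipeline with add_pipe; the worker-functions double and stringify are illustrative assumptions:

from papy.papy import Dagger, Piper, Worker

def double(inbox):
    # worker-functions receive their inputs as a sequence (the inbox)
    return inbox[0] * 2

def stringify(inbox):
    return str(inbox[0])

dagger = Dagger()
p_double = Piper(Worker(double))
p_string = Piper(Worker(stringify))
# adds both Pipers (if not already present) and connects them left to right
dagger.add_pipe((p_double, p_string))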

add_piper(piper, xtra=None, create=True, branch=None)

Adds a Piper to the Dagger, only if the Piper is not already in a Node. Optionally creates a new Piper if the piper argument is valid for the Piper constructor. Returns a tuple: (new_piper_created, piper_instance) indicating whether a new Piper has been created and the instance of the added Piper.

Arguments:

  • piper(Piper instance, Worker instance or id(Piper instance))

    Piper instance or object which will be converted to a Piper instance.

  • create(bool) [default: True]

    Should a new Piper be created if the piper cannot be resolved in the Dagger?

  • xtra(dict) [default: None]

    Dictionary of Graph Node properties.

add_pipers(pipers, *args, **kwargs)

Adds a sequence of Pipers to the Dagger in specified order. Takes optional arguments for Dagger.add_piper.

Arguments:

  • pipers (sequence of valid add_piper arguments)

    Sequence of Pipers or valid Dagger.add_piper arguments to be added to the Dagger in the left to right order of the sequence.

add_pipes(pipes, *args, **kwargs)

Adds a sequence of pipes to the Dagger in the specified order. Takes optional arguments for Dagger.add_pipe.

Arguments:

  • pipes (sequence of valid add_pipe arguments)

    Sequence of pipes or valid Dagger.add_pipe arguments to be added to the Dagger in the left to right order of the sequence.

children_after_parents(piper1, piper2)
Custom compare function. Returns 1 if the first Piper is upstream of the second Piper, -1 if the first Piper is downstream of the second Piper and 0 if the two Pipers are independent.
connect(datas=None)
Given the pipeline topology connects Pipers in the order input -> output. See Piper.connect.
connect_inputs(datas)

Connects input Pipers to the input data in the correct order, determined by the Piper.ornament attribute and the Dagger._cmp function.

Note

It is assumed that the input data is in the form of an iterator and that all inputs have the same number of input items. A pipeline will deadlock otherwise.

Arguments:

  • datas (sequence of iterators)

    Ordered sequence of inputs for all input Pipers.

del_pipe(pipe, forced=False)

Deletes a pipe (A, ..., N) which is an N-tuple of Pipers. Deleting a pipe means deleting all the connections between the Pipers and deleting the Pipers themselves. If forced is False only Pipers which are not needed anymore (i.e. have no downstream Pipers) are deleted.

Arguments:

  • pipe(sequence)

    N-tuple of Piper instances or objects which can be resolved in the Dagger (see: Dagger.resolve). The Pipers are removed from right to left.

  • forced(bool) [default: False]

    The forced argument will be forwarded to the Dagger.del_piper method. If forced is False only Pipers with no outgoing pipes will be deleted.

Note

The direction of the edges in the Graph is reversed compared to the left to right data-flow in a pipe.

del_piper(piper, forced=False)

Removes a Piper from the Dagger.

Arguments:

  • piper(Piper instance, Worker instance or id(Piper instance))

    Piper instance or object from which a Piper instance can be constructed.

  • forced(bool) [default: False]

    If forced is False, Pipers with outgoing pipes (incoming edges) will not be removed; attempting to remove them raises a DaggerError.

del_pipers(pipers, *args, **kwargs)

Deletes a sequence of Pipers from the Dagger in reverse of the specified order. Takes optional arguments for Dagger.del_piper.

Arguments:

  • pipers (sequence of valid del_piper arguments)

    Sequence of Pipers or valid Dagger.del_piper arguments to be removed from the Dagger in the right to left order of the sequence.

del_pipes(pipes, *args, **kwargs)

Deletes a sequence of pipes from the Dagger in the specified order. Takes optional arguments for Dagger.del_pipe.

Arguments:

  • pipes (sequence of valid del_pipe arguments)

    Sequence of pipes or valid Dagger.del_pipe arguments to be removed from the Dagger in left to right order of the sequence.

disconnect(forced=False)

Given the pipeline topology disconnects Pipers in the order output -> input. This also disconnects inputs. See Dagger.connect, Piper.connect and Piper.disconnect.

Arguments:

  • forced(bool) [default: False]

    If set True all tasks from all IMaps will be removed.

get_inputs()
Returns Pipers which are inputs to the pipeline i.e. have no incoming pipes (outgoing dependency edges).
get_outputs()
Returns Pipers which are outputs of the pipeline i.e. have no outgoing pipes (incoming dependency edges).
resolve(piper, forgive=False)

Given a Piper instance or the id of a Piper, returns this Piper if it can be resolved in the Dagger; otherwise either raises a DaggerError or returns False, depending on the forgive argument.

Arguments:

  • piper(Piper instance, id(Piper instance))

    Piper instance or its id to be found in the Dagger.

  • forgive(bool) [default: False]

    If forgive is False a DaggerError is raised whenever a Piper cannot be resolved in the Dagger. If forgive is True, False is returned instead.

start()
Given the pipeline topology starts Pipers in the order input -> output. See Piper.start. The forced=True argument is passed to the Piper.start method, allowing Pipers to share IMaps.
stop()
Given the pipeline topology stops the Pipers.
exception papy.papy.DaggerError
Exceptions raised or related to Dagger instances.
class papy.papy.InputIterator(iterator, piper)
next()
class papy.papy.Piper(worker, parallel=False, consume=1, produce=1, spawn=1, timeout=None, branch=None, debug=False, name=None, track=False)

Creates a new Piper instance.

Note

The (produce * spawn) of the upstream Piper has to equal the (consume * spawn) of the downstream Piper for each pair of Pipers connected by a pipe. This will in the future be enforced by the Dagger.

Arguments:

  • worker(Worker instance, Piper instance or sequence of worker-functions or Worker instances)

    A Piper can be created from a Worker instance, another Piper instance, or a sequence of worker-functions or Worker instances; in every case a new Piper instance is created.

  • parallel(False or IMap instance) [default: False]

    If parallel is False the Piper will not evaluate the Worker in parallel but will use the “manager” process and the itertools.imap function. Otherwise the specified IMap instance and its threads/processes will be used.

  • consume(int) [default: 1]

    The number of input items consumed from all directly connected upstream Pipers per one Worker evaluation. Results will be passed to the worker-function as a sequence of individual results.

  • produce(int) [default: 1]

    The number of results to generate for each Worker evaluation result. Results will be elements of the sequence returned by the worker.

  • spawn(int) [default: 1]

    The number of times this Piper is implicitly added to the pipeline to consume the specified number of results.

  • timeout(int) [default: None]

    Time to wait for a result to become available. If the timeout is exceeded a PiperError is returned, not raised.

  • branch(object) [default: None]

    This affects the order of Pipers in the Dagger. Pipers are sorted according to:

    1. data-flow upstream->downstream
    2. branch attribute

    This argument sets the branch attribute of a Piper. If two Pipers have no upstream->downstream relation they will be sorted according to their branch attribute. If neither of them has a branch set, or the branches are identical, their order will be semi-random. Pipers implicitly inherit the branch of an upstream Piper, so it is only necessary to specify the branch of a Piper if it is the first one after a branch point.

    The argument can be any object which can be compared by the cmp built-in function. If necessary such objects can override the __cmp__ method.

    Note that it is possible to construct pipelines without specifying branches if Pipers which are connected to multiple upstream Pipers use Workers which act correctly regardless of the order of branch results in the inbox of the worker-function.

  • debug(bool) [default: False]

    Verbose debugging mode. Raises a PiperError on WorkerErrors.

    Warning

    This will most likely hang the Python interpreter after the error occurs. Use during development only!
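
A minimal sketch of constructing a parallel Piper that shares a process-based IMap; the worker-function parse_item and the parameter values are illustrative assumptions:

from IMap.IMap import IMap
from papy.papy import Piper, Worker

def parse_item(inbox):
    # a hypothetical worker-function: strips whitespace from the first input
    return inbox[0].strip()

pool = IMap(worker_type='process', worker_num=2)
piper = Piper(Worker(parse_item), parallel=pool, timeout=30)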

connect(inbox)
Connects the Piper to its upstream Pipers, which should be passed as a sequence. This connects the Piper.inbox with the outboxes of the upstream Pipers, respecting the consume, spawn and produce arguments.
disconnect(forced=False)

Disconnects the Piper from its upstream Pipers or input data.

Arguments:

  • forced(bool) [default: False]

    If forced is True, tries to forcefully remove all tasks (including the spawned ones) from the IMap instance.

next()

Returns the next result. If no result is available within the specified (at initialization) timeout then a PiperError-wrapped TimeoutError is returned.

If the result is a WorkerError it is wrapped in a PiperError and returned or raised if debug mode was specified at initialization. If the result is a PiperError it is propagated.

start(stages=None)

Makes the Piper ready to return results. This involves starting the provided IMap instance. If multiple Pipers share an IMap instance the order in which the Pipers are started is important. The valid order is upstream before downstream. The IMap instance has to be started only once, but this can be done in stages. This method's stages argument is a tuple which can contain any combination of the numbers 0, 1 or 2 specifying which stage of the start routine should be carried out.

stage 0 - creates the needed itertools.tee objects.

stage 1 - activates the IMap pool. A call to next will block.

stage 2 - activates IMap pool managers.

If this Piper shares an IMap with other Pipers the proper way to start them is to start them in a valid postorder with stages (0, 1) and (2,) separately.

Arguments:

  • stages(tuple) [default: (0,) if linear; (0,1,2) if parallel]

    Performs the specified stages of the start of a Piper instance. Stage 0 is necessary and sufficient to start a Piper utilizing an itertools.imap. Stages 1 and 2 are required to start a parallel Piper.

stop(forced=False, **kwargs)

Tries to cleanly stop the Piper. A Piper is “started” if its IMap instance is “started”. Non-parallel Pipers need not be started or stopped. An IMap instance can be stopped by triggering its stopping procedure and retrieving results from the IMap's end tasks. Because neither the Piper nor the IMap “knows” which tasks (Pipers) are the ends, they have to be specified:

end_task_ids = [0, 1]    # A list of IMap task ids
piper_instance.stop(ends=end_task_ids)

results in:

IMap_instance.stop(ends=[0, 1])

If the Piper did not finish the forced argument has to be specified:

piper_instance.stop(forced=True, ends=end_task_ids)

If the Piper (and consequently IMap) is part of a Dagger the Dagger.stop method should be called instead. See IMap.stop and Dagger.stop.

Arguments:

  • forced(bool) [default: False]

    The Piper will be forced to stop the IMap instance. If set True but the ends IMap argument is not specified, the IMap instance will not try to retrieve any results and will not call the IMap._stop_managers method.

exception papy.papy.PiperError
Exceptions raised or related to Piper instances.
class papy.papy.Plumber(logger_options={}, **kwargs)

The Plumber is a subclass of Dagger and Graph with added run-time methods and a high-level interface for working with PaPy pipelines.

Arguments:

  • dagger(Dagger instance) [default: None]

    An optional Dagger instance.

load(filename)

Load pipeline from source file.

Arguments:

  • filename(path)

    Location of the pipeline source code.

pause()
Pauses a running pipeline. This will stop retrieving results from the pipeline. Parallel parts of the pipeline will stop after the IMap buffer has been filled. A paused pipeline can be run or stopped.
run()
Runs a started pipeline by pulling results from its output Pipers. Pipers with the track attribute set True will have their results stored within the Dagger.stats['pipers_tracked'] dictionary. A running pipeline can be paused.
save(filename)

Save pipeline as source file.

Arguments:

  • filename(path)

    Path to save pipeline source code.

start(datas)

Starts the pipeline by connecting the input Pipers of the pipeline to the input data, connecting the pipeline and starting the IMaps.

Arguments:

  • datas(sequence)

    A sequence of external input data in the form of sequences or iterators. The order of items in the datas sequence should correspond to the order of the input Pipers defined by Dagger._cmp and Piper.ornament.

stop()
Stops a paused pipeline. This will trigger a StopIteration in the inputs of the pipeline and retrieve the buffered results. This will stop all Pipers and IMaps. Python will not terminate cleanly if a pipeline is running or paused.
wait(timeout=None)

Waits (blocks) until a running pipeline finishes.

Arguments:

  • timeout(int) [default: None]

    Specifies the timeout after which a RuntimeError will be raised. The default is to wait indefinitely for the pipeline to finish.
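
A minimal end-to-end usage sketch of the Plumber; the file name 'pipeline.py' and the input iterator are assumptions:

from papy.papy import Plumber

plumber = Plumber()
plumber.load('pipeline.py')          # a pipeline previously saved via save()
plumber.start([iter(range(100))])    # one iterator per input Piper
plumber.run()                        # start pulling results from the outputs
plumber.wait()                       # block until the pipeline finishes
plumber.stop()                       # stop all Pipers and IMaps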

exception papy.papy.PlumberError
Exceptions raised or related to Plumber instances.
class papy.papy.Produce(iterable, n=1, stride=1)

This iterator wrapper is itself an iterator, but it returns individual elements from the sequences returned by the wrapped iterator. The number of returned elements is defined by n and should not be smaller than the length of the sequences returned by the wrapped iterator.

For example if the wrapped iterator results are ((11, 12), (21, 22), (31, 32)) then n should equal 2. For stride = 1 the result will be: [11, 12, 21, 22, 31, 32]. For stride = 2: [11, 21, 12, 22, 31, 32]. Note that StopIteration is also an exception!

class papy.papy.TeePiper(piper, i, stride)

This is a wrapper around a Piper, created whenever another Piper connects. The actual call to itertools.tee happens on a call to Piper.start. A TeePiper protects the tee object with a threading lock. This lock is held for a stride, after which the next TeePiper is released. If a StopIteration exception occurs the next TeePiper is released and subsequent calls to the next method of this TeePiper will not involve acquiring a lock and calling the next method of the wrapped tee object. This guarantees that the next method of a Piper will yield a StopIteration only once. This guarantee is required because the IMap will finish a task after the first StopIteration, will not call Piper.next any more, and will automatically raise StopIteration for subsequent calls to IMap.next.

Arguments:

  • piper(Piper instance)

    The Piper object to be tee’d

  • i(int)

    The index of the itertools.tee object within Piper.tees.

  • stride(int)

    The stride of the Piper downstream of the wrapped Piper. In a pipeline they should be the same or compatible (see manual).

next()
Returns or raises the next result from the itertools.tee object for the wrapped Piper.
class papy.papy.Worker(functions, arguments=None, kwargs=None, name=None)

The Worker is an object which composes sequences of functions. When called, the functions are evaluated from left to right; the function on the right receives the return value of the function on the left. A Worker optionally takes sequences of positional and keyworded arguments for none or all of the composed functions.

Positional arguments should be given in a tuple. Each element of this tuple should be a tuple of positional arguments for the corresponding function. If a function does not take positional arguments its corresponding element in the arguments tuple should be an empty tuple i.e. (). Keyworded arguments should also be given in a tuple. Each element of this tuple should be a dictionary of arguments for the corresponding function. If a function does not take any keyworded arguments its corresponding element in the keyworded arguments tuple should be an empty dictionary i.e. {}. If none of the functions takes arguments of a given type the positional and/or keyworded arguments tuple can be omitted.

All exceptions raised by the worker-functions are caught, wrapped and returned not raised. If the Worker is called with a sequence which contains an exception no worker-function is evaluated and the exception is wrapped and returned.

The Worker can be initialized in a variety of ways:

  • with a sequence of functions and a optional sequences of positional and keyworded arguments e.g.:

    Worker((func1,         func2,    func3), 
          ((arg11, arg21), (arg21,), ()),
          ({},             {},       {'arg31':arg31}))
    
  • with another Worker instance, which results in their functional equivalence e.g.:

    Worker(worker_instance)
    
  • With multiple Worker instances, where the functions and arguments of the Workers are combined e.g.:

    Worker((worker1, worker2))
    

    this is equivalent to:

    Worker(worker1.task + worker2.task,
           worker1.args + worker2.args,
           worker1.kwargs + worker2.kwargs)
    
  • with a single function and its arguments in a tuple e.g.:

    Worker(function, (arg1, arg2, arg3))
    

    which is equivalent to:

    Worker((function,),((arg1, arg2, arg3),))
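
A hedged, self-contained sketch of constructing and calling a Worker. It assumes, as the worker-functions in papy.workers.core suggest, that each composed function receives the previous result wrapped as a one-element inbox; the functions themselves are illustrative:

from papy.papy import Worker

def add_const(inbox, const):
    return inbox[0] + const

def stringify(inbox):
    return str(inbox[0])

worker = Worker((add_const, stringify), ((10,), ()))
result = worker([1])    # a Worker is called with an inbox sequence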
    
exception papy.papy.WorkerError
Exceptions raised or related to Worker instances.
papy.papy.comp_task(inbox, args, kwargs)
Composes functions in the global sequence variable TASK and evaluates the composition given input (inbox) and arguments (args, kwargs).
papy.papy.inspect(piper)
Determines the instance (Piper, Worker, FunctionType, Iterable). It returns a tuple of boolean variables i.e: (is_piper, is_worker, is_function, is_iterable_of_pipers, is_iterable_of_workers, is_iterable_of_functions).

papy.graph

This module implements a graph data structure without explicit edges, using nested Python dictionaries.

class papy.graph.Graph(nodes=(), edges=(), xtras=None)

Dictionary based graph data structure.

This Graph implementation is a little unusual, as it does not explicitly hold a list of edges. The Graph is a dictionary where the keys are any hashable objects (node objects), while the values are Node instances. A Node instance is also a dictionary, where the keys are node objects and the values are Node instances. A Node instance (value) is basically a dictionary of outgoing edges from the node object (key). The edges are indexed by the incoming objects. So we end up with a recursively nested dictionary which defines the topology of the Graph.

Arguments:

  • nodes(sequence of nodes) [default: ()]

    A sequence of nodes to be added to the graph. See: Graph.add_nodes

  • edges(sequence of edges) [default: ()]

    A sequence of edges to be added to the graph. See: Graph.add_edges

  • xtras(sequence of dictionaries) [default: None]

    A sequence of xtra dictionaries corresponding to the added node objects. The topological nodes corresponding to the added dictionaries will have Node.xtra updated with the contents of this sequence. Either all or no xtra dictionaries have to be specified.
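
A small sketch of using the dictionary-based Graph; the node objects here are plain strings and the printed orders are only examples:

from papy.graph import Graph

graph = Graph(nodes=('a', 'b', 'c'), edges=(('a', 'b'), ('b', 'c')))
graph.add_edge(('a', 'c'))
print(graph.nodes())       # all node objects
print(graph.edges())       # tuple of (from, to) node object pairs
print(graph.postorder())   # a postorder of this directed acyclic graph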

add_edge(edge, double=False)

Adds an edge to the Graph. An edge is just a pair of node objects. If the node objects are not in the Graph they are added.

Arguments:

  • edge(sequence)

    An ordered pair of node objects. The edge is assumed to have a direction from the first to the second node object.

  • double(bool) [default: False]

    If True the reverse edge is also added.

add_edges(edges, *args, **kwargs)

Adds edges to the graph. Takes optional arguments for Graph.add_edge.

Arguments:

  • edges(sequence of edges)

    Sequence of edges to be added to the Graph

add_node(node, xtra=None, branch=None)

Adds a node object to the Graph. Returns True if a new node object has been added. If the node object is already in the Graph returns False.

Arguments:

  • node(object)

    Node to be added. Any hashable Python object.

  • xtra(dict) [default: None]

    The newly created topological Node.xtra dictionary will be updated with the contents of this dictionary.

  • branch(object) [default: None]

add_nodes(nodes, xtras=None)

Adds nodes to the graph.

Arguments:

  • nodes(sequence of objects)

    Sequence of node objects to be added to the Graph

  • xtras(sequence of dictionaries) [default: None]

    Sequence of Node.xtra dictionaries corresponding to the node objects being added. See: Graph.add_node.

clear_nodes()
Clears all nodes in the Graph. See Node.clear.
cmp_branch(node1, node2)
To sort by branch of the topological Node corresponding to the node object.
deep_nodes(node)
Returns all reachable node objects from a node object. See: Node.deep_nodes
del_edge(edge, double=False)

Removes an edge from the Graph. An edge is just a pair of node objects; the node objects themselves are not removed from the Graph.

Arguments:

  • edge(sequence)

    An ordered pair of node objects. The edge is assumed to have a direction from the first to the second node object.

  • double(bool) [default: False]

    If True the reverse edge is also removed.

del_edges(edges, *args, **kwargs)

Removes edges from the graph. Takes optional arguments for Graph.del_edge.

Arguments:

  • edges(sequence of edges)

    Sequence of edges to be removed from the Graph

del_node(node)

Removes a node object from the Graph. Returns True if a node object has been removed. If the node object was not in the Graph raises a KeyError.

Arguments:

  • node(object)

    Node to be removed. Any hashable Python object.

del_nodes(nodes)

Removes nodes from the graph.

Arguments:

  • nodes(sequence of objects)

    Sequence of node objects to be removed from the Graph. See: Graph.del_node.

dfs(node, bucket=None, order='append')

Recursive depth first search. By default (order = ‘append’) this returns the node objects in the reverse postorder. To change this into the preorder use a collections.deque bucket and order ‘appendleft’.

Arguments:

  • bucket(list or queue) [default: None]

    The user must provide the list or queue to store the nodes

  • order(str) [default: ‘append’]

    Method of the bucket which will be called with the node object which has been examined. Another valid option is 'appendleft' for a collections.deque.

edges(nodes=None)

Returns a tuple of all edges in the Graph.

Arguments:

  • nodes(objects)

    If specified the edges will be limited to those originating from one of the specified nodes.

incoming_edges(node)

Returns a tuple of incoming edges for this node object.

Arguments:

  • node(object)

    Node to be inspected for incoming edges.

iternodes()
Returns an iterator of all node objects in the Graph
node_rank()
Returns the maximum rank for each node in the graph. The rank of a node is defined as the number of edges between the node and a node which has rank 0. A node has rank 0 if it has no incoming edges.
nodes()
Returns a list of all node objects in the Graph
outgoing_edges(node)

Returns a tuple of outgoing edges for this node object.

Arguments:

  • node(object)

    Node to be inspected for outgoing edges.

postorder()

Returns some postorder of node objects of the Graph if it is a directed acyclic graph. A postorder is not random, because the order of elements in a dictionary is not random, and neither are the starting nodes of the depth-first search traversal which produces the postorder. Therefore some postorders will be discovered more frequently.

This postorder enforces additional order:

  • (TODO: describe earthworm branch order)
  • if the topological Nodes corresponding to the node objects have a ‘branch’ attribute it will be used to sort the graph from left to right.

The returned postorder is thus deterministic, but not unique.

rank_width()
Returns the width of each rank in the graph.
class papy.graph.Node(entity=None, xtra=None)

Node is the topological node of a Graph. Please note that the node object is not the same as the topological node. The node object is any hashable Python object. The topological node is defined for each node object and is a dictionary of other node objects with incoming edges from this node object.

clear()
Sets the discovered and examined attributes to False.
deep_nodes(allnodes=None)
A recursive method to return all node objects reachable from this Node.
iternodes()
Returns an iterator of node objects directly reachable from this Node.
nodes()
Returns a list of node objects directly reachable from this Node.

papy.utils

This module provides diverse utility functions.

papy.tkgui

Tkinter/Pmw GUI for PaPy. Also provides a Tkinter shell widget.

papy.workers

This module provides diverse workers.

  • core (core functionality)
  • maths (wrappers around python math and operator modules)

papy.workers.core

A collection of core workers-functions to use in Worker instances.

papy.workers.core.ipasser(inbox, i=0)

Passes the i-th input from inbox. By default passes the first input.

Arguments:

  • i(int) [default: 0]
papy.workers.core.njoiner(inbox, n=None, join='')

String joins and returns the first n inputs.

Arguments:

  • n(int) [default: None]

    All elements of the inbox with an index smaller than this number will be joined.

  • join(string) [default: ""]

    String which will join the elements of the inbox i.e. join.join().

papy.workers.core.npasser(inbox, n=None)

Passes the first n inputs from the inbox. By default passes the whole inbox.

Arguments:

  • n(int) [default: None]
papy.workers.core.nzipper(inbox, n=None)

Zips the first n inputs from the inbox. By default zips the whole inbox.

Arguments:

  • n(int) [default: None]
papy.workers.core.plugger(inbox)
Returns nothing.
papy.workers.core.sjoiner(inbox, s=None, join='')

String joins inputs with indices in s.

Arguments:

  • s(sequence) [default: None]

    Sequence (tuple or list) of indices of the elements which will be joined.

  • join(string) [default: ""]

    String which will join the elements of the inbox i.e. join.join().

papy.workers.core.spasser(inbox, s=None)

Passes inputs with indices in s. By default passes the whole inbox.

Arguments:

  • s(sequence) [default: None -> range(len(inbox))]
papy.workers.core.szipper(inbox, s=None)

Zips inputs from the inbox with indices in s. By default zips the whole inbox.

Arguments:

  • s(sequence) [default: None]

papy.workers.maths

This module contains worker-functions for common math to use in PaPy Worker instances. It wraps around functions from the operator and math Python standard library modules. Any function from math and operator is included in this library if it is related to evaluations.

If a function accepts multiple arguments two flavours of wrappers are provided:

  1. simple wrapper: math.X(arg1, arg2) should be called from papy.workers.maths.X as X([arg1], arg2)
  2. star wrapper: math.X(arg1, arg2) should be called from papy.workers.maths.Xstar as Xstar(arg)

In other words 1. should be used to construct Workers which accept arguments at construction and 2. should be used to construct Workers which accept multiple inputs.
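
The following is an illustrative sketch (not the library source) of what the two wrapper flavours look like, using math.pow as the wrapped two-argument function:

import math

def pow_simple(inbox, arg2):
    # flavour 1: the first argument comes from the inbox, the second is a
    # constant supplied to the Worker at construction time
    return math.pow(inbox[0], arg2)

def pow_star(inbox):
    # flavour 2 ("star"): all arguments come from the inbox (multiple inputs)
    return math.pow(*inbox)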

papy.workers.io

A collection of worker functions dealing with inputs/outputs of a pipeline or Pipers. In general those functions are used to connect Pipers to external inputs/outputs (these are the pipeline input/outputs i.e. streams) or to connect them to other Pipers (via items i.e. transformed elements of the input streams). Based on that distinction two types of functions are provided:

  • stream functions - load or save the input stream from or into a single file, therefore they can only be used at the beginning or end of a pipeline. Stream loaders are not worker-functions, as they are called once (e.g. with the input file name as the argument) and create the input stream in the form of a generator of input items.
  • item functions - load, save, process or display data items. These are worker-functions and should be used within Pipers.

No method of interprocess communication, besides the default (inefficient, two-pass) multiprocessing.Queue and temporary files, is supported on all platforms. Even among UNIX implementations, forking and shared-memory details can differ.

papy.workers.io.csv_dumps(inbox, **kwargs)
Dumps the first element of the inbox as a csv (comma-separated values) string. Takes optional arguments for csv.writer.
papy.workers.io.csv_loads(inbox)
Loads the first element of the inbox as a csv (comma-separated values) string.
papy.workers.io.dump_db_item(inbox, type='sqlite', table='temp', **kwargs)

Writes the first element of the inbox as a key/value pair in a database of the provided type. Currently supported: “sqlite” and “mysql”. Returns the information necessary for the load_db_item to retrieve the element.

According to the sqlite documentation: You should avoid putting SQLite database files on NFS if multiple processes might try to access the file at the same time.

Arguments:

  • type(str) [default: ‘sqlite’]

    Type of the database to use; currently supported are 'sqlite' and 'mysql'. Using MySQL requires a running MySQL server.

  • db(str) [default: ‘papydb’]

    Default name of the database; for sqlite it is the name of the database file in the current working directory. Databases can be shared among Pipers. Having multiple SQLite database files improves concurrency. A new file will be created if none exists. The MySQL database has to exist (it will not be created).

  • table(str) [default: ‘temp’]

    Name of the table to store the key/value pairs into. Tables can be shared among pipers.

  • host, user, passwd

    Authentication information. Refer to the generic dbapi2 documentation.

papy.workers.io.dump_item(inbox, type='file', prefix=None, suffix=None, dir=None, timeout=320, buffer=None)

Writes the first element of the inbox as a file of a specified type. The type can be 'file', 'fifo', 'shm', 'tcp' or 'udp', corresponding to typical files, named pipes (FIFOs), posix shared memory and sockets. FIFOs and shared memory are volatile, but shared memory can exist longer than the Python process.

Returns the semi-random name of the file written. By default creates files and fifos in the default temporary directory and shared memory in /dev/shm. To use named pipes the operating system has to support both forks and fifos (not Windows). To use shared memory the system has to be proper posix (not MacOSX) and the posix_ipc module has to be installed. Sockets should work on all operating systems.

This worker-function is useful for efficient communication between parallel Pipers without the overhead of using queues.

Arguments:

  • type(‘file’, ‘fifo’, ‘shm’, ‘tcp’, ‘udp’) [default: ‘file’]

    Type of the created file/socket.

  • prefix(str) [default: tmp_papy_%type%]

    Prefix of the file to be created. Should probably identify the Worker and Piper.

  • suffix(str) [default: '']

    Suffix of the file to be created. Should probably identify the format of the serialization protocol e.g. ‘pickle’ or deserialized data e.g. ‘numpy’.

  • dir(str) [default: tempfile.gettempdir() or /dev/shm]

    Directory to save the file to (can be changed only for types 'file' and 'fifo').

  • timeout(int) [default: 320]

    Number of seconds to keep the process at the write-end of the socket or pipe alive.

papy.workers.io.dump_manager_item(inbox, address=('localhost', 46779), authkey='papy')

Writes the first element of the inbox as a shared object. The object is stored as a value in a shared dictionary served by a Manager process. Returns the key for the object value, the address and the authentication key.

To use this worker-function a DictServer instance has to be running. Usage of this method for IPC is not recommended for performance reasons.

Arguments:

  • address(2-tuple) [default: (‘localhost’, 46779)]

    A 2-tuple identifying the server(string) and port(integer).

  • authkey(string) [default: ‘papy’]

    Authentication string to connect to the server.

papy.workers.io.dump_pickle_stream(inbox, handle)
Writes the first element of the inbox to the provided stream (data handle) as a pickle. To be used with the load_pickle_stream worker-function.
papy.workers.io.dump_stream(inbox, handle, delimiter=None)

Writes the first element of the inbox to the provided stream (file handle) delimiting the input by the optional delimiter string. Returns the name of the file being written.

Warning

Note that only a single process can have access to a file handle open for writing. Therefore this worker-function should only be used by a non-parallel Piper.

Arguments:

  • handle(stream)

    File handle open for writing.

  • delimiter(string) [default: None]

    A string which will separate the written items, e.g. “END” becomes “\nEND\n” in the output stream. The default is an empty string, which means that items will be separated by a blank line i.e.: '\n\n'

papy.workers.io.find_items(prefix='tmp', suffix='', dir=None)

Creates a file name generator from files matching the supplied arguments. Matches the same files as those created by dump_chunk for the same arguments.

Arguments:

  • prefix(string) [default: ‘tmp’]

    Mandatory first chars of the files to find.

  • suffix(string) [default: '']

    Mandatory last chars of the files to find.

  • dir(string) [default: current working directory]

    Directory where the files should be located.

papy.workers.io.json_dumps(inbox)
Serializes the first element of the input using the JSON protocol as implemented by the json Python 2.6 library.
papy.workers.io.json_loads(inbox)
Deserializes the first element of the input using the JSON protocol as implemented by the json Python 2.6 library.
papy.workers.io.load_db_item(inbox, remove=True)

Loads an item from a sqlite database. Returns the stored string.

Arguments:

  • remove(bool) [default: True]

    Remove the loaded item from the table (temporary storage).

papy.workers.io.load_item(inbox, type='string', remove=True, buffer=None)

Loads data from a file. Determines the file type automatically ('file', 'fifo', 'shm', 'tcp', 'udp') but allows the representation type to be specified: 'string' or 'mmap' for memory-mapped access to the file. Returns the loaded item as a string or mmap object. Internally creates an item from a file object.

Arguments:

  • type(‘string’ or ‘mmap’) [default: string]

    Determines the type of object the worker returns i.e. the file read as a string or a memory map. FIFOs cannot be memory mapped.

  • remove(bool) [default: True]

    Should the file be removed from the filesystem? This is mandatory for FIFOs and sockets and generally a very good idea for shared memory. Files can be used to store data persistently.

papy.workers.io.load_manager_item(inbox, remove=True)

Loads an item from a DictServer.

Arguments:

  • remove(bool) [default: True]

    Should the data be removed from the DictServer instance?

papy.workers.io.load_pickle_stream(handle)

Creates an object generator from a stream (file handle) containing data in pickles. To be used with the dump_pickle_stream worker-function.

Warning

File handles should not be read by different processes.

papy.workers.io.load_stream(handle, delimiter=None)

Creates a string generator from a stream (file handle) containing data delimited by the delimiter strings. This is a stand-alone function and should be used to feed external data into a pipeline.

Arguments:

  • delimiter(string) [default: None]

    The default means that items will be separated by two new-line characters i.e.: '\n\n'
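
A hedged round-trip sketch: dump_stream writes delimited items to a file handle and load_stream reads them back; the file name and delimiter are assumptions:

from papy.workers.io import dump_stream, load_stream

handle = open('items.txt', 'w')
dump_stream(['first item'], handle, delimiter='END')
dump_stream(['second item'], handle, delimiter='END')
handle.close()

for item in load_stream(open('items.txt'), delimiter='END'):
    print(item)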

papy.workers.io.make_items(handle, size)

Creates a generator of items from a file handle. The size argument is the approximate size of the generated chunks in bytes. The main purpose of this worker function is to allow multiple worker processes/threads to read from the same file handle.

A chunk is a 3-tuple (file descriptor, first_byte, last_byte), which defines the position of the chunk within a file. The size of a chunk, i.e. last_byte - first_byte, is approximately the size argument. The last byte in a chunk is always a '\n'. The first byte always points to the first character in a line. A chunk can also be a whole file, i.e. the first byte is 0 and the last byte is EOF.

Arguments:

  • size(int) [default: mmap.ALLOCATIONGRANULARITY]

    (64 KBytes on Windows, 4 KBytes on Linux)

    Approximate chunk size in bytes.

papy.workers.io.make_lines(handle, follow=False, wait=0.1)

Creates a line generator from a stream (file handle) containing data in lines.

Arguments:

  • follow(bool) [default: False]

    If true follows the file after it finishes like ‘tail -f’.

  • wait(float) [default: 0.1]

    Time to wait between file polls.

papy.workers.io.marshal_dumps(inbox)
Serializes the first element of the input using the marshal protocol.
papy.workers.io.marshal_loads(inbox)

Deserializes the first element of the input using the marshal protocol.

Warning

Be careful when communicating between different Python versions via this protocol, as it is version-specific.

papy.workers.io.open_shm(name)

Equivalent to the built in open function but opens a file in shared memory. A single file can be opened multiple times. Only the name of the file is necessary and not its absolute location (which is most likely /dev/shm/). The file is opened by default in read/write mode.

Arguments:

  • name(string)

    The name of the file to open e.g. “my_file” not /dev/shm/my_file

papy.workers.io.pickle_dumps(inbox)
Serializes the first element of the input using the fastest binary pickle protocol.
papy.workers.io.pickle_loads(inbox)
Deserializes the first element of the input using the pickle protocol.
papy.workers.io.print_(inbox)
Prints the first element of the inbox.

IMap.IMap

This module provides a parallel, buffered, multi-task, imap function. It evaluates results as they are needed, where the need is relative to the buffer size. It can use threads and processes.

class IMap.IMap.IMap(func=None, iterable=None, args=None, kwargs=None, worker_type=None, worker_num=None, worker_remote=None, stride=None, buffer=None, ordered=True, skip=False, name=None)

Parallel (thread- or process-based, local or remote), buffered, multi-task, itertools.imap or Pool.imap function replacement. Like imap it evaluates a function on elements of a sequence or iterator, and it does so lazily with an adjustable buffer. This is accomplished via the stride and buffer arguments. All sequences or iterators are required to be of the same length. The tasks can be interdependent, the result from one task being the input to a second task, provided the tasks are added to the IMap in the right order.

A completely lazy evaluation, i.e. submitting the first tasklet after the next method for the first task has been called, is not supported.

Arguments:

  • func, iterable, args, kwargs [default: None]

    If the IMap is constructed with these arguments they define the first and only task of the IMap; the IMap pool is started and an IMap iterator is returned. For a description of the args, kwargs and iterable inputs please see the add_task method. Either both func and iterable or neither have to be specified. Positional and keyworded arguments are optional.

  • worker_type(‘process’ or ‘thread’) [default: ‘process’]

    Defines the type of internally spawned pool workers. For multiprocessing.Process based worker choose ‘process’ for threading.Thread workers choose ‘thread’.

    Note

    This choice has fundamental impact on the performance of the function. Please understand the difference between processes and threads and refer to the manual documentation. As a general rule use ‘process’ if you have multiple CPUs or CPU-cores and your task functions are cpu-bound. Use ‘thread’ if your function is IO-bound e.g. retrieves data from the Web.

    If you specify any remote workers via worker_remote, worker_type has to be the default ‘process’. This limitation might go away in future versions.

  • worker_num(int) [default: number of CPUs, min: 1]

    The number of workers to spawn locally. Defaults to the number of available CPUs, which is a reasonable choice for process-based IMaps. For thread-based IMaps a larger number might improve performance. This number does not include workers needed to run remote processes and can be 0 for a purely remote IMap.

    Note

    Increasing the number of workers above the number of CPUs makes sense only if these are Thread-based workers and the evaluated functions are IO-bound. Some CPU-bound tasks might evaluate faster if the number of worker processes equals the number of CPUs + 1.

  • worker_remote(sequence or None) [default: None]

    A sequence of remote host identifiers, and remote worker number pairs. Specifies the number of workers per RPyC host in the form of (“host:port”, worker_num). For example:

    [['localhost', 2], ['127.0.0.1', 2]]
    

    with a custom TCP port:

    [['localhost:6666'], ['remotehost:8888', 4]]
    
  • stride(int) [default: worker_num + sum of remote worker_num, min: 1]

    Defines the number of tasklets which are submitted to the process or thread pool consecutively from a single task. See the documentation for add_task and the manual for an explanation of a task and tasklet. The stride defines the laziness of the pipeline. A long stride improves parallelism, but increases memory consumption. It should not be smaller than the size of the pool, because that would leave processes or threads idle.

  • buffer(int) [default: stride * (tasks * spawn), min: variable]

    The buffer argument limits the maximum memory consumption by limiting the number of tasklets submitted to the pool. This number is larger than the stride because a task might depend on results from multiple tasks. The minimum buffer is the maximum possible number of queued results. This number depends on the interdependencies between tasks, the produce/spawn/consume numbers and the stride. The default is conservative and will always be enough regardless of the task interdependencies. Please consult the manual before adjusting this setting.

    Note

    A tasklet is considered submitted until the result is returned by the next method. The minimum stride is 1; this means that starting the IMap will cause one tasklet (the first from the first task) to be submitted to the pool input queue. The first tasklet from the second task can enter the queue only if either the result from the first tasklet is returned or the buffer size is larger than the stride size. If the buffer is n then n tasklets can enter the pool. A stride of n requires n tasklets to enter the pool, therefore the buffer can never be smaller than the stride. If the tasks are chained, i.e. the output from one is consumed by another, then at most one i-th tasklet from each chained task is in the pool at a given moment. In those cases the minimum buffer that satisfies the worst-case number of queued results is lower than the safe default.

  • ordered(bool) [default: True]

    If True the output of all tasks will be ordered. This means that for a specific task the result from the n-th tasklet will be returned before the result from the (n+1)-th tasklet. If False the results will be returned in the order they are calculated.

  • skip(bool) [default: False]

    Should a result be skipped if trying to retrieve it raised a TimeoutError? If ordered is True and skip is True then calling the next method after a TimeoutError will skip the result which did not arrive on time and try to get the next one. If skip is False it will try to get the same result once more. If the results are not ordered then the next result calculated will be skipped. If tasks are chained a TimeoutError will collapse the IMap evaluation; do not specify timeouts in that case (this argument then becomes irrelevant).

  • name(string) [default: ‘imap_id(object)’]

    An optional name to associate with this IMap instance. Should be unique. Useful for nicer code generation.
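
A minimal sketch of a single-task, process-based IMap; the function and input range are illustrative (for process workers the function has to be defined at module level so that it can be pickled):

from IMap.IMap import IMap

def square(element):
    return element * element

# constructing with func and iterable starts the pool immediately
results = IMap(square, iter(range(10)), worker_num=2)
for result in results:
    print(result)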

add_task(func, iterable, args=None, kwargs=None, timeout=None, block=True, track=False)

Adds a task to evaluate. A task is made of a function, an iterable, and optional positional and keyworded arguments. The iterable can be the result iterator of a previously added task. A tasklet is a (func, iterable.next(), args, kwargs) tuple.

  • func(callable)

    Will be called with the elements of the iterable, args and kwargs.

  • iterable(iterable)

    The elements of the iterable will be the first arguments passed to the func.

  • args(tuple) [default: None]

    A tuple of optional constant arguments passed to the function after the argument from the iterable.

  • kwargs(dict) [default: None]

    A dictionary of constant keyworded arguments passed to the function after the variable argument from the iterable and the constant arguments in the args tuple.

  • track(bool) [default: False]

    If True the results (or exceptions) of the task are saved within self._tasks_tracked[%task_id%] as a {index: result} dictionary. This is only useful if the task function involves the creation of persistent data. The dictionary can be used to restore the correct order of the data.

Note

The order in which tasks are added to the IMap instance is important. It affects the order in which tasks are submitted to the pool and consequently the order in which results should be retrieved. If the tasks are chained then the order should be a valid topological sort (reverse topological order).
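
A sketch of two chained tasks added in topological order; the functions are illustrative and, as the warning under the next method advises, only the iterator of the last task is consumed directly:

from IMap.IMap import IMap

def double(element):
    return element * 2

def plus_one(element):
    return element + 1

pool = IMap(worker_num=2)
pool.add_task(double, iter(range(5)))
pool.add_task(plus_one, pool.get_task(task=0))   # consumes results of task 0
pool.start()
for result in pool.get_task(task=1):
    print(result)
pool.stop(ends=[1])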

get_task(task=0, timeout=None, block=True)

Returns an iterator whose results are bound to one task. The default iterator, i.e. the one which will be used by default in for loops, is the iterator for the first task (task=0). Compare:

for result_from_task_0 in imap_instance:
    pass

with:

for result_from_task_1 in imap_instance.get_task(task=1):
    pass

a typical use case is:

from itertools import izip

task_0_iterator = imap_instance.get_task(task=0)
task_1_iterator = imap_instance.get_task(task=1)

for (task_0_res, task_1_res) in izip(task_0_iterator, task_1_iterator):
    pass

next(timeout=None, task=0, block=True)

Returns the next result for the given task (default 0).

Note

the next result for task n can, regardless of the buffer size, be evaluated only if the result for task n-1 has left the buffer.

If multiple tasks are evaluated then those tasks share not only the process or thread pool but also the buffer. The minimal buffer size is one, and therefore the results from the buffer have to be removed in the same order as the tasks are submitted to the pool. The tasks are submitted in a topological order which allows them to be chained.

Warning

If multiple chained tasks are evaluated then only the next method of the last one should be called directly. Otherwise the pool might dead-lock depending on the buffer size. This is a consequence of the fact that tasks are submitted to the pool in a next-needed order. Calling the next method of an up-stream task changes this topological evaluation order.

pop_task(number)

Removes a previously added task from the IMap instance.

Arguments

  • number(int or True)

    A positive integer specifying the number of tasks to pop. If number is set True all tasks will be popped.

start(stages=(1, 2))

Starts the processes or threads in the pool (stage 1) and the threads which manage the pools input and output queues respectively (stage 2).

Arguments:

  • stages(tuple) [default: (1,2)]

    Specifies which stages of the start process to execute. After the first stage the pool worker processes/threads are started and the IMap._started event is set True. A call to the next method of the IMap instance will block. After the second stage the IMap._pool_putter and IMap._pool_getter threads will be started.

stop(ends=None, forced=False)

Stops an IMap instance. If the list of end tasks is specified via the ends argument, this method blocks the calling thread, retrieves (and discards) a maximum of 2 * stride results, stops the worker pool threads or processes and stops the threads which manage the input and output queues of the pool. If the ends argument is not specified, but the forced argument is, the method does not block and the IMap._stop_managers method has to be called after all pending results have been retrieved. Either ends or forced has to be specified.

Arguments:

  • ends(list) [default: None]

    A list of task ids which are not consumed within the IMap instance. All buffered results will be lost and up to 2 * stride of inputs consumed. If no list is given the end tasks will need to be consumed manually, otherwise the threads/processes might not terminate and the Python interpreter will not exit cleanly.

  • forced(bool) [default: False]

    If ends is not None this argument is ignored. If ends is None and forced is True the IMap instance will trigger stopping mode.

class IMap.IMap.IMapTask(iterator, task, timeout, block)

The IMapTask is an object-wrapper of IMap instances. It is an iterator which returns results only for a single task. Its next method does not take any arguments; the IMap.next arguments are defined during initialization.

Arguments:

  • iterator(IMap instance)

    IMap instance to wrap, usually initialization is done by the IMap.get_task method of the corresponding IMap instance.

  • task(integer)

    Id of the task from the IMap instance.

  • timeout

    see documentation for: IMap.next.

  • block

    see documentation for: IMap.next.

next()
Returns a result if available within the timeout, else raises a TimeoutError. See documentation for IMap.next.
class IMap.IMap.PriorityQueue(maxsize=0)
A priority queue using a heap on a list. This Queue is thread but not process safe.
class IMap.IMap.Weave(iterators, repeats=1)

Weaves a sequence of iterators, which can be stopped if the same number of results has been consumed from all iterators.

Arguments:

  • iterators(sequence of iterators)

    A sequence of objects supporting the iterator protocol.

  • repeats(int) [default: 1]

    A positive integer defining the number of results returned from an iterator in a stride, i.e. before a result from the next iterator is returned.

next()
Returns the next element.
stop()
If called the Weave will stop at repeats boundaries.
IMap.IMap.imports(modules, forgive=False)

Should be used as a decorator to attach import statements to function definitions. These imports are added to the global (i.e. Python module-level) namespace of the decorated function.

Two forms of import statements are supported (in the following examples foo, bar, oof, and rab are modules, not classes or functions):

import foo, bar              # equivalent to @imports(['foo', 'bar'])
import foo.oof as oof
import bar.rab as rab        # equivalent to @imports(['foo.oof', 'bar.rab'])

Supports alternatives:

try:
    import foo
except ImportError:
    import bar

becomes:

@imports(['foo,bar'])

and:

try:
    import foo.oof as oof
except ImportError:
    import bar.rab as oof

becomes:

@imports(['foo.oof,bar.rab'])

Note

This import is available in the body of the function as oof

Note

imports should be exhaustive for every decorated function, even if two functions have the same globals.

Arguments:

  • modules(sequence)

    A list of modules in the following forms:

    ['foo', 'bar', ..., 'baz']
    

    or:

    ['foo.oof', 'bar.rab', ..., 'baz.zab']
    
  • forgive(bool) [default: False]

    If True will not raise exception on ImportError.
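
A usage sketch of the decorator; the worker-function is illustrative and the attached module becomes available in the function's globals when it is evaluated, e.g. in a pool process:

from IMap.IMap import imports

@imports(['math'])
def log_of_first(inbox):
    # 'math' is injected into the globals of this function by the decorator
    return math.log(inbox[0])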

IMap.IMap.inject_func(func, conn)
Injects a function object into a rpyc connection object.
IMap.IMap.worker(inqueue, outqueue, host=None)
Function which is executed by processes/threads within the pool. It waits for tasks (function, data, arguments) at the input queue, evaluates them and passes the results to the output queue.

# EOF