Implementation Details¶
The section of the documentation is an overview of the internal design and implementation of MDF. It should not be necessary to read or understand this section in order to use MDF.
Building MDF¶
Use of Cython¶
Cython is used extensively throughout MDF. It is used in such a way, however, that it’s still possible to call MDF code without compiling it to allow easier debugging of the MDF internals.
All of the implementation code is in plain .py
files with additional type information
given using cython.declare()
for local variables and .pxd
files for
class and module attribute type information. The modules are compiled into separate
extensions, and type information is shared between the compilation units via the
.pxd
files.
There is significant overhead calling a Python function compared with calling a C
function. Even when Cythoned methods are Cythoned using the cpdef
keyword so
they can be called both from Python or from C there is still additional overhead
when the method is called compared with calling a plain C function. This is why
the Cythoned classed have some cdef
private methods with corresponding cpdef
public methods. Internally the cdef
methods should always be called unless
it’s expected that the method may be overridden by a Python subclass.
Whenever making changes to MDF the code must be profiled before and after as what look like small changes can drastically change the performance due to the number of times some of the internal functions and methods get called.
Compiling MDF¶
Building MDF requires Cython version 0.16 or later to be installed.
As with any Cythoned package it also requires a distutils compatible C compiler (e.g. Visual Studio 2008 for Windows or GCC for linux).
MDF can be compiled in-place to allow you to test changes and use your local check out of MDF in the usual Python way using setup.py:
python setup.py build_ext --inplace
Or to build the egg:
python setup.py bdist_egg
Current setuptools and Cython (0.16) doesn’t pick up changes to .pxd
files
correctly when determining dependencies, and so changes to .pxd
files won’t
trigger a recompilation of all affected object files. If this happens the
easiest work around is to simply delete the generated .c
files that will
be in the same folder as their corresponding .py
files.
Running/Debugging Non-Cythoned MDF¶
To be able to import the MDF modules without first compiling you need to have Cython installed. Cython is not required to import the modules after compilation.
There are various places where it has not been possible to have the pure-python version of the code identical to the Cythonized code. These cases are clearly commented, and in order to use MDF without compiling it you need to first un-comment these bits of code.
Follow the instructions in the comments regarding Cython in the following files:
- nodes.py
- context.py
- cqueue.py
- ctx_pickle.py
Although you will be able to use MDF in its uncompiled state for debugging its internals you will find it will be orders of magnitude slower than the compiled version.
Debugging the Compiled Cython Code¶
In setup.py there is a module variable cdebug
. Set this to True
to enable the
compiler and linker flags to build debug versions of the extensions. This only
sets the Visual Studio debug flags, and for GCC you should change them to whatever
flags you required (usually -g
is sufficient).
Once you have rebuilt the extensions with the debug flags set you can now attach a debugger to a running python instance in order to debug the cythoned MDF functions.
Profiling MDF¶
MDF includes its own counters and timers for profiling code run using MDF. This
should be used to identify hotspots in user code. See MDFContext.ppstats()
.
If necessary further profiling can be done using cPython. A visual profiler such as RunSnakeRun or kCacheGrind can be useful for understanding the cProfiler output. kCacheGrind provides more detail than RunSnakeRun but requires the cProfile output to be converted to a calltree using the pyprof2calltree pacakge.
Profiling MDF Internals¶
cProfile can also be used to profile the MDF internal Cythoned functions and methods.
In setup.py there is a module variable cython_profile
. Set this to True
to
enable profiling of Cythoned code.
Once you have rebuilt the extensions with the profile flag set you can now use cProfile to profile an application using MDF with the time spent in Cythoned functions included.
The addition of the profiling code to every Cythoned function adds significant overhead as many of the functions have been optimised to be very lightweight and may be inlined by the compiler. Adding the profiling code bloats these functions and distorts the running time significantly and it can therefore be very hard to determine accurate timings and hotspots using this method.
It is better to use a non-invasive statistical profiler such as Intel VTune or
Sleepy (see ‘Very Sleepy’ for Windows). To profile MDF with one of these
profilers you will need to build MDF with debug symbols using the
cdebug
setting in setup.py. For more accurate profiling you may want to
build with compiler optimisations as well as debug symbols, in which case remove
/Od
from the compiler flags.
Profiling with a statistical profiler requires a little more knowledge and intuition about how the Python code is translated to C by Cython. The Cython code is annotated with the original Python code which makes this much easier, and browsing the code around the functions of interest before looking at the profiling results will make understanding the results of the profiling simpler. This is by far the best way to get a proper feel for where time is being spent inside MDF though and it is worth persevering.
Source Code Overview¶
context.py¶
context.py contains (almost) everything to do with MDFContext
. It also includes
a class MDFNodeBase
from while MDFNode
is derived. It’s done this
way with the node base class in context.py so that the context code can call C methods
on the cythoned MDFNode objects without having to cimport nodes.pxd
as that would
result in a circular dependency.
context.py defines the following classes:
MDFContext
MDFNodeBase
ShiftSet
and the public API functions:
shift()
get_nodes()
make_shift_set()
nodes.py¶
nodes.py defines the following classes:
MDFNode
MDFVarNode
MDFEvalNode
MDFTimeNode
MDFIterator
MDFIteratorFactory
MDFIteratorFactory
MDFCallable
and the public API functions:
varnode()
vargroup()
evalnode()
now()
These classes are almost always only used internally. The API functions return instances of the node classes and so it’s almost never necessary to refer to any of these classes outside of MDF.
NodeState
is the per context state associated with a particular node.
This isn’t exposed outside of the Cythoned code and for external use. If it’s referenced
at all should be considered an opaque type (hence not appearing in the API reference).
cqueue.py¶
cqueue is an implementation of a double ended queue, like collections.deque
.
Although it’s a queue it’s specialised to represent a stack efficiently. Underlying it
is a normal python list, and as items are pushed on it grows as it runs out of space.
It keeps an index to the start and end of the queue and so popping items off either
end simply means moving these indexes.
Because the items aren’t reshuffled popping items off the left and adding to the right would result in more and more memory being allocated, but constantly pushing and popping the right is much faster than a deque, which is what this container is used for.
This could probably be improved by writing it in plain C rather than using a Python
list, but at the time of writing it was sufficiently faster than collections.deque
for what it’s being used for that it wasn’t optimised further.
nodetypes.py¶
nodetypes.py is where the nodetype()
decorator is defined, and also all the
various classes that are necessary for implementing more general node types derived
from MDFEvalNode
:
MDFCustomNode
MDFCustomNodeIterator
MDFCustomNodeIteratorFactory
MDFCustomNodeMethod
MDFCustomNodeDecorator
MDFCustomNode
is derived from MDFEvalNode
and is constructed
with an additional function, iterator or generator that converts the result of the evalnode
to whatever the specific nodetype should return.
MDFCustomNodeDecorator
is what’s returned by the nodetype()
decorator,
which is itself a decorator. It converts whatever function (or generator or iterator class)
it decorates into a node type decorator.
All the built-in node type decorators use the nodetype()
and subclasses of
MDFIterator
to achieve the various different calculations.
MDFCustomNodeMethod
is what’s used to add the methods to all MDFEvalNode
instances (see nodetype_method_syntax). It’s a callable object that when called
creates or fetches a previously created node. The returned node is the node it was called on
wrapped with a node type.
runner.py¶
runner.py is where the various functions for running an MDF graph over time and extracting values, including parallel computation of scenarios.
to_dot.py¶
to_dot.py implements the MDFContext.to_dot()
method. This uses pydot
and Graphviz
to render the graph as an image in a variety of different formats
( .dot, .svg, .png, e.t.c.).
parser.py¶
Parsing code for use by the magic ipython functions and also for parsing Python code to find the left hand side of node assignments for defaulting node names, e.g.:
a_var_node = varnode()
The parser looks at the callstack and parses the line of source code to get "a_var_node"
to use as the name for this node.
ctx_pickle.py¶
All the pickling code for contexts and nodes is separated out from the main files, and is
implemented as functions in this file. These functions are imported from node.py and
context.py and called from the various __reduce__
methods on the associated classes.
builders sub-package¶
The builders sub-package is where all the provided callable objects intended to be used
with run()
are.
io sub-package¶
The methods MDFContext.save()
and MDFContext.load()
are implemented
in this sub-package. The serialisation is done via pickling in ctx_pickle.py, but this
sub-package can also read and write compressed files.
pylab sub-package¶
The pylab sub-package is where all the various magic ipython function are. This should only be imported for interactive use not from scripts as it depends on IPython.
regression sub-package¶
Regression testing is done by evaluating nodes in two different processes started from different virtualenvs. The code in this package manages starting the processes in the correct virtualenvs and collecting the values of the nodes over time in both child processes.
The data collection is done by a specialized builder, DataFrameDiffer
,
but other differ could be written by subclassing Differ
.
The interprocess communication is done using Pyro and the proxy objects from the remote sub-package.
remote sub-package¶
remote contains code shared between the various parallel processing components of MDF for creating subprocesses and the Pryo server and objects used for interprocess communication.
It also contains custom Pyro serialisation functions that autmatically compress data larger than a certain size on the fly using bz2.
tests sub-package¶
Unit tests.
Node Evaluation¶
Nodes only have values in a context, so it makes sense that to get a nodes value it’s
evaluated by starting with the context. Indexing into the context with a node (see
MDFContext.__getitem__()
) calls MDFContext.get_value()
, which is
a public API method. This calls an internal C method MDFContext._get_node_value()
which is where the work is actually done.
Once inside a node evaluation to avoid passing the current context around to every node
call to allow the nodes to evaluate other nodes, there’s the notion of a currently
active context. The currently active context is stored in a dictionary that maps thread
id to the MDFContext (_current_contexts in context.py). See also _get_current_context()
.
Dependencies¶
Although it’s common to talk about one node being dependent on another, actually the dependencies are between a node in a context and another node in a context. The contexts need not be the same as one node can call another node in another context (when shifting, for example).
Dependency tracking is facilitated by MDFContext._get_node_value()
keeping track of
the current node being evaluated (and the context it’s being evaluated in). These are kept
in a stack (actually it’s a queue - but conceptually it’s a stack) so to find the node
that’s calling the current node being evaluated it just needs to look at the last item
in the stack. Before dropping into the node evaluation itself the current node and context
are pushed onto the stack.
All dependencies are discovered at runtime by observing which nodes call other nodes. This can either be directly, or a node may call a function that then calls other nodes. Before anything is evaluated MDF knows nothing about the dependencies between nodes.
Each context has its own node evaluation stack and so to find the node and context calling the current node and context the stack belonging to the previous context is examined. A restriction is that the same context can’t be used from different threads concurrently, but different contexts can be used in different threads because the current context is effectively a per thread variable (although it’s in a dict rather than TLS).
Once discovering what the calling node and context is (if any), the current node is pushed
onto the current context’s node evaluation stack and MDFNodeBase.get_value()
is called
to retrieve or calculate the node value in the context. After the node has been evaluated
the node is popped off the evaluation stack and a dependency is established between the previous
node and context and the current node and context by calling MDFNodeBase._add_dependency()
.
As the dependencies are context dependent the relationships are stored on the NodeState
object associated with the node and context pair.
MDFNodeBase._add_dependency()
is written such that re-calling it for the same node and
context is fast, and so it’s called everytime MDFContext._get_node_value()
is called
regardless of whether the dependency has been discovered previously or not.
Node evaluation¶
As mentioned above the entry point for evaluating a node is MDFContext.__getitem__()
or
MDFContext.get_value()
, which in turn calls the internal C method
MDFContext._get_node_value()
. This ultimately calls MDFNodeBase.get_value()
,
which each derived node class implements.
Varnodes¶
varnodes are the simplest type of node. Getting their value just involves looking to see if the
NodeState
has a value for the current context and return that. If there is no
value in the NodeState
the default value for the node is returned if there is one,
or an exception is raised.
Evalnodes¶
Eval nodes wrap a function, generator or MDFIterator
. To get their value the wrapped
function is called and the result is cached on the NodeState
for the context the node
is being evaluated in.
Once a node has been evaluated once it is marked as not needing to be re-evaluated. If the node is evaluated again the previous result is returned as long as none of the dependencies of the node have changed.
Dirty flags¶
Whether the node needs evaluating or not is determined by the dirty_flags
on the NodeState
object for the node in each context. When the node is evaluated these flags are cleared to indicate
the node isn’t dirty and the previously cached value can be used. When any node is changed the
MDFNode.set_dirty()
method is called which marks the node as dirty and then marks all the nodes
calling this node as dirty if they are not already flagged dirty. The dirty flags propagate all the
way through the graph for all nodes and context pairs that depend on the node in the context being
dirtied.
Generators and iterators¶
If an evalnode wraps a generator or iterator then it is advanced each time the context’s date
(now()
) is advanced.
The dirty flags mentioned previously are a bit field. One of these bits is reserverd for
updates to time (now()
). When the time changes all nodes that are dependent on
now()
have the DIRTY_FLAGS.TIME
bit in their dirty flags set. In addition, any generators or
iterators not dependent on now()
also have the DIRTY_FLAGS.TIME
bit in their dirty flags set
(as well as any dependent nodes, as explained in the previous section).
When the evalnode is evaluated and only the DIRTY_FLAGS.TIME
bit is set then, if the node is a
generator or iterator, instead of completely re-evaluating the node the previously instantiated
iterator is advanced. The iterator is stored on the NodeState
for the node and context.
Because generators and iterators need to be advanced on every timestep, regardless of
whether their value is actually used on any particular timestep, MDFContext.set_data()
evaluates all of them after marking them as dirty (just with the DIRTY_FLAGS.TIME
bit). Other
nodes could get the value of an iterator node once and then not look at the value for a
number of timesteps; this would cause problems as the iterator wouldn’t have been steped through
the intermediate timesteps and so would have the wrong value, and so evaluating them in the
MDFContext.set_date()
prevents that problem from occurring.
Shifted contexts¶
Shifted contexts are created by shifting any other context (i.e. shifted or non-shifted) via
the MDFContext.shift()
method.
Shift operations are commutative and associative, i.e:
# commutative
ctx.shift({a: x}).shift({b: y}) == ctx.shift({b: y}).shift({a: x})
# associative
ctx.shift({a: x, b: y}).shift({c: z}) == ctx.shift({a: x}).shift({b: y, c: z})
Shifted contexts are all stored in a flat structure on the root context. Shifted contexts
have a parent context, but that parent is always the root context. This flat structure
is what facilitates the commutative and associative properties of the shift operation.
There is a method MDFContext.is_shift_of()
to determine if one context is a shift
of another. This is determined by looking at the intersection of the two contexts’ shift sets
rather than having an explicit hierachy of contexts.
Each shifted context is keyed by its shift_set
, which is the dictionary of nodes to their
values in the shifted context. Any shift resulting in the same net shift set returns the
same shifted context.
Node evaluation in shifted contexts¶
When evaluating nodes in a shifted context the naive approach would be to just evaluate that node and all its dependencies in that shifted context. This however would cause problems where varnodes are set in the root context (or other shifted contexts that the context the node is being evaluated in is itself a shift of), because the value wouldn’t be available in the shifted context. It would also be inefficient as nodes would potentially be evaluated multiple times when they are needed in different contexts, even if the value in each case would be the same because the node doesn’t depend on some or all of the shifted nodes.
For these reasons there is the concept of the alt context. The alt context is a property of a node and a context, and is the least shifted context the node can be evaluated in that will have the same result as if it were evaluted in the orignal context.
For example:
a = varnode()
b = varnode()
@evalnode
def foo():
return a()
ctx = MDFContext()
shifted_a = ctx.shift({a : 1})
shifted_b = shifted_a.shift({b : 2})
shifted_b[foo] == shifted_a[foo]
Here evaluating foo
in a context where b
is shifted has no effect, but shifting a
does because foo
depends on a
. Therefore the alt context for shifted_b[foo]
is shifted_a
.
Even if the context shifted_a
hadn’t been explicitly created as above the alt context
would still be that context, i.e. a context with shift set {a : 1}
.
Determining the alt context¶
For varnodes getting the alt context from a shifted context is simple. If the shift set of the shifted context includes the varnode then the alt context is the parent (root) context shifted by the varnode and the shift value. All other shifts are irrelevant for the varnode and so are ignored when getting the alt context.
Evalnodes are a bit tricker as they have to be evaluated once before the dependencies
are known. The first time round they are evaluated in the context they’re called in.
Before setting the value of the node in the NodeState
however the
dependencies are analysed and the least shifted context (alt context) is determined
by checking the dependencies of all the called nodes and the shift set of the
original context. All the dependencies and state of the node in the original context
is transfered to the newly discovered alt context.
Once the alt context for a node and context has been determined it’s cached in the
NodeState
. If a dependency changes then that cached value is cleared
and it will be re-determined the next time it’s needed. If a node has conditional
dependencies (i.e. new dependencies are discovered after the initial evaluation)
that can cause the alt context to change, and this causes an an exception to be raised.
Allowing the alt context to change part way through an evaluation would be danergous as
there could be accumulated state in the original alt context that wouldn’t be consistent
in the new alt context.
Node Types¶
Other node types (e.g. queuenode()
, ffillnode()
) are built on top of
MDFEvalNode
. The basic concept is that a nodetype
does the job of a normal
evalnode but then calls a second function on the result of that to transform it in some way.
The second function that does that transformation is referred to as the node type function
and the inner function that gets evaluated to provide the input to the node type function
is the node function
.
Here’s an exampple of a very simple custom node type that will help illustrate how node types work:
@nodetype
def node_type_function(value):
return value * 2
@node_type_function
def node_function():
return x
When evaluating the node node_function
the order of execution is to first call the
python function node_function
and then to pass the result of that to the python
function node_type_function
. The result of node_type_function
becomes the value
of the node node_function
.
To understand how this works it helps to talk about what types are used and what the result of these decorators is.
nodetype
returns an instance of a MDFCustomNodeDecorator
. This instance
keeps a reference to the decorated function. It’s a callable object and its call
method works as a decorator.
So, the decorated node_type_function
is an instance of MDFCustomNodeDecorator
.
When used as a decorator as @node_type_function
on another function it
returns an instance of MDFCustomNode
. This MDFCustomNode
is a
subclass of MDFEvalNode
as is instantiated with the node function and keeps
a reference to the node type function
.
MDFCustomNode
differs from MDFEvalNode
by calling its
node type function after doing the normal evaluation of the node function. The result
of this is what gets returned as the final value of the node.
Things get a little more complicated when adding arguments to the node type, for example:
@nodetype
def node_type_function_2(value, multipler):
return value * multiplier
@node_type_function_2(multiplier=2)
def node_function_2():
return x
To pass the arguments from the node instantiation (when @node_type_function_2 is
called with node_function_2) to the node type function node_type_function_2
the args have to be stored by the MDFCustomNode
instance. This
is exactly what happens, and when the custom node is evaluated it calls the
node type function with these stored arguments.
If any of the arguments are nodes they automatically get evaluated before
being passed to the node type function. Occasionally it is necessary to
pass nodes in as arguments. MDFCustomNode
checks a class property
node_kwargs
and doesn’t automatically evaluate any args in that list.
By subclassing MDFCustomNode
this can be set for specific node types.
Generators and iterators¶
Node type functions may also be a generator or iterator (MDFIterator
).
If they are then the first time the node type function is called it will be
called with the node function results and all the arguments from when the node of
that type was created (e.g. muliplier
in the example from the previous section).
next()
is then called to get the initial value. Subsequent values are obtained
by advancing the iterator sending in the new node function result by the send()
method of the generator or MDFIterator
instance.
For this to work correctly the node function that is used to initialize the
base class MDFEvalNode
depends on whether the node type function
or the node function is a generator or not. If either of them are then
the function called by the underlying MDFEvalNode code must return an iterator
that will work in the same way as if the final node was a generator. This is
done using another class, MDFIteratorFactory
. This is another callable
that when called returns a MDFIterator
, which the underlying
MDFEvalNode
code understands and treats like a generator.
Method syntax for node types¶
When a node type is registered using the nodetype()
decorator a method
name can be specified. This automatically adds two new methods to MDFNode
(and any existing instances) - one that returns a node of the node type and one that
returns the value of a node of that node type.
Both methods work in exactly the same way. A new MDFCustomNode
instance
is constructed with the node type function and using the node the method is called
on as the node function. If the same method is called again with the same arguments
on the same node then it returns the node constructed previously.
It works by creating the new methods when the nodetype()
decorator is called.
The methods are actually instances of MDFCustomNodeMethod
which is yet
another callable class. Calling that creates the new node or fetches it if it was
created already.
The new instances of MDFCustomNodeMethod
are added to
MDFNode._additional_attrs_
, which is checked in MDFNode.__getattr__
allowing for new attributes to be added dynamically to the cythoned class.
Class nodes¶
Nodes can be declared as properties of classes as well as modules. These nodes may take a single argument, which will be the class the node is being called on. If they take no arguments they behave the same way as a normal node.
Class nodes have to be aware of the class they’re defined on and the class they’re bound to (accessed from). Consider the following:
class A(object):
@evalnode
def foo(cls):
return "Declared in A, called on %s" % cls.__name__
class B(A):
@evalnode
def foo(cls):
return "Overridden in B (%s)" % super(B, cls).foo()
In one sense this code just declares two nodes, A.foo
and B.foo
. That’s
a slight simplication of what’s actually going on though; if there were only two then
super(B, cls).foo() would have to evaluate A.foo()
which would return
Declared in A, called on A
. So, there are actually threee [1] nodes, A:A.foo
,
B:B.foo()
and B:A.foo()
.
What the code above declares are unbound nodes. That is, nodes that have no
knowledge of the classes they belong to. When they are accessed from the class
(i.e. A.foo
accesses foo from A) they are then bound to the class at that
point. Binding creates a new node that includes everything from the original
unbound node definition and information about the class the new node is bound
to.
MDFEvalNode
is a descriptor and so the process of binding nodes to a class
is done by MDFEvalNode.__get__()
. This is called whenever a node is accessed
as a property of a class. All evalnodes have a dictionary of classes to bound versions
of themselves. When accessing a node multiple times on the same class the same bound
node is returned each time. If no bound node exists for the unbound node and class then
a new node is created.
To bind the node to a class the function that it references must also be bound
(to create a staticmethod or classmethod) and the new node is created using that
bound method. As keyword arguments to the node may also reference unbound functions
or nodes those too need to be bound. MDFEvalNode._bind()
handles binding
any additional functions or nodes and may be overridden by any subclasses requiring
additional objects to also be bound. The helper method
MDFEvalNode._bind_function
is used to create bound versions of individual
functions, methods and other callable object types.
[1] | More accurately there are three bound nodes and two unbound nodes, but the unbound nodes aren’t accessible outside of the class definition. |