leonardo (leonardo_m) wrote,

Purity in D language

The topic of this article is the idea of "purity" as present in the D V.2 language. It will not cover the whole topic, because there are many details, but I will try to touch the most important things.

C has a primitive type system, and D is a C-family language, but now the D type system is not so primitive any more (Scala type system is more complex).

While D is mostly an imperative object oriented and template-heavy (generic) programming language, the introduction of purity, immutability and closures in D2 has given it the basic requirements for a functional language.

A quotation from Wikipedia helps me cover the basics:
http://en.wikipedia.org/wiki/Pure_function
In computer programming, a function may be described as pure if both these statements about the function hold:

1. The function always evaluates the same result value given the same argument value(s). The function result value cannot depend on any hidden information or state that may change as program execution proceeds or between different executions of the program, nor can it depend on any external input from I/O devices.
2. Evaluation of the result does not cause any semantically observable side effect or output, such as mutation of mutable objects or output to I/O devices.

The result value need not depend on all (or any) of the argument values. However, it must depend on nothing other than the argument values.

In D you may tag a function or a member function as pure using the "pure" attribute, an example:
pure int sqr(int x) {
    return x * x;
}
D has added pure functions because:
- Pure functions are simpler to understand and analyse for both the programmer and the compiler (or static analysis tools), because they are more self-contained, they can't read and write the global or outer state.
- Pure functions are generally less bug-prone because you know where their data comes from, this puts a limit to the wild practice of C code where you have to read a whole function to know what part of the program state it is influencing. Some practice of D programming has shown me that now most functions may be marked as pure (weakly pure, see below for the meaning of that). According to D design principles, D functions are not pure on default to keep backwards compatibility with C code.
- Pure functions may be run in any order, so it's more easy to parallelize their execution. A map() done on an array may be split into parts with no problems if the mapping function is known to be pure.
- Pure functions allow cacheability, like a memoization. If the pure function input doesn't change, its output is the same, and it doesn't change the global state. Impure functions can't be memoized safely (in Python it's not too much hard to define a memoize decorator, but Python gives no guarantees that the function is actually pure, and this may lead to problems and bugs. D language is designed to give some static guarantees, and it enforces the purity of its pure functions).
- Pure functions allow several kinds of optimizations, for example the current DMD compiler optimizes foo(x)+foo(x) with 2*foo(x) if x is an integral value (not floating point, to avoid approximations) and foo() is pure. Another possible optimization is loop hoisting (loop-invariant code motion http://en.wikipedia.org/wiki/Loop-invariant_code_motion ), this means pulling a pure function call out of a loop if the function argument is constant. Other more complex optimizations are possible, for example Haskell language (that uses pure functions) shows optimizations like filter(map()) ==> map(filter()) that may eventually be done by D compilers too.
- For the compiler it's easy to see if each object or memory zone allocated by a pure function needs to be kept when the function ends. Most of them may be deallocated safely at the function end. So the 'pure' annotation is a help for the garbage collector too, helping both performance and to reduce the total amount of memory used (this is an effect of reducing the amount of garbage produced).

But D is a practical programming language, it's a system language that among other things supports raw pointers and inline assembly too, so the abstract idea of pure functions was adapted to something more practical.

The D docs explain the basic differences:
Pure functions are functions that produce the same result for the same arguments. To that end, a pure function:
- Does not read or write any global mutable state;
- Cannot call functions that are not pure;
- Can override an impure function, but an impure function cannot override a pure one;
- Is covariant with an impure function;
- Cannot perform I/O.

As a concession to practicality, a pure function can:
- Allocate memory via a NewExpression;
- Terminate the program;
- Read and write the floating point exception flags;
- Read and write the floating point mode flags, as long as those flags are restored to their initial state upon function entry.

Those concessions have some interesting consequences. A pure D function needs just to be a pure island if seen from outside, that calls other pure functions, but inside it may perform impure computations and it may even contain impure nested functions:
pure int foo(int x) {
    int bar(int y) { // impure, uses outer variable x
        return x + y;
    }
    return x * bar(x);
}
A D pure function may allocate too:
import std.stdio: writeln;

pure int[] foo(const int n) {
    return new int[n];
}

void main() {
    enum int n = 10;
    int[] a1 = foo(n);
    int[] a2 = foo(n);
    writeln(a1.ptr);
    writeln(a2.ptr);
}
The ptr field of a1 and a2 are free to be be equal or different according to the amount of optimizations done by a D compiler. It's a side effect not covered by the 'pure' attribute. Being foo() pure, and being 'n' a constant (here a compile time constant), I think a1 and a2 are free to be the same array.

In D a pure function may also throw exceptions. If you want to avoid this possibility there is the attribute "nothrow" too:
pure nothrow int sqr(int x) {
    return x * x;
}
You may write pure functions in C too, if you have enough self-discipline (or a static analyser program that guards code for you) but generally the C compiler will not take advantage of purity (unless its analysis is extensive and it is able to tell pure functions from impure ones by itself).

Yet GNU-C (the C supported by GCC) supports a pure attribute:
http://gcc.gnu.org/onlinedocs/gcc/Function-Attributes.html#index-g_t_0040code_007bpure_007d-function-attribute-2461

A quote from the GCC docs:
Many functions have no effects except the return value and their return value depends only on the parameters and/or global variables. Such a function can be subject to common subexpression elimination and loop optimization just as an arithmetic operator would be. These functions should be declared with the attribute pure. For example,

int square (int) __attribute__ ((pure));

says that the hypothetical function square is safe to call fewer times than the program says.

Some of common examples of pure functions are strlen or memcmp. Interesting non-pure functions are functions with infinite loops or those depending on volatile memory or other system resource, that may change between two consecutive calls (such as feof in a multithreading environment).

Recently thanks to a suggestion by Bruno Medeiros in D2 the concept of pure function has being modified and extended. Now there are two main kinds of pure functions, the "weakly pure" ones and the "strongly pure" ones.

The strongly pure functions are like the original pure functions of D. The weakly pure functions are like the strongly pure, but the function is allowed to modify its input values.

To tell them apart there are no specific attributes like @stronglypure and @weaklypure. You use the normal "pure" attribute, it marks a function as weakly pure. If the function arguments are all const/immutable then the function becomes strongly pure (internally the compiler tags the functions with their precise kind of purity).

So if you want cacheability you must be careful to tag all arguments of a function as const or immutable. Strongly pure functions are memoizable (with an annotation, a memoizing function, or even by the compiler, but this is not expected in D soon), while weakly pure ones generally aren't. A possible syntax for memoization:
alias memoize!sin msin;
...
double x = msin(angle);
A strongly pure function is allowed to call a weakly pure function because the weakly pure function may only change the arguments given by the strongly pure function, so the strongly pure function is still an isle of strong purity.

At first the introduction of weakly pure functions seems to reduce the average purity of a D program, because it relaxes a constraint of the purity concept. But in practice it has the opposite effect, because now many more functions in a D program may be marked as weakly pure, and this in turn allows more functions to be strongly pure (because a pure function can't call an impure function).

This code shows how a strongly pure function createArray() may be refactored out thanks to a weakly pure function rangeSet():
pure void rangeSet(int[] data) { // weakly pure
    foreach (i, ref x; data)
        x = i;
}

pure int[] createArray(const int n) { // strongly pure
    auto data = new int[n];
    rangeSet(data);
    return data;
}

void main() {
    assert(createArray(5) == [0, 1, 2, 3, 4]);
}
To allow strongly pure functions (and to allow the compiler to enforce their strong purity) the compiler needs a concept of immutability stronger than the "const" of C++ language. D solves this introducing both "const" and "immutable" attributes, both of them are transitive (this means that a immutable object must contain only other immutable objects, etc). "immutable" is for objects and variables that can't change once created. "const" allows for a locally immutable view of a object that may othrwise be mutable elsewhere. Without that transitive property the compiler often needs to perform too much flow analysis to enforce the strong purity.

At first sight the idea and implementation of 'pure' in D look simple, but in practice its implementation is complex and there are many important details and to keep in account. Here I present some of them.

--------------------

The "pure" tag is a blunt tool, it specifies that a function can't read or write mutable variables present in outer scopes (but it may read immutable values in outer scopes). But sometimes a function has to use the mutable variables of the outer scope, for performance or other reasons.

The SPARK language (a strict subset of the Ada language, plus some extra annotations, plus a strong static analyser) is designed to reduce the frequency of bugs, and to allow a more refined specification of the outer names used in a function. It uses a syntax to state all variables from outer scopes used by a procedure, and they have "in", "out" or "in out" attributes too ("in" means they are read-only, "out" means they are just written, and "in out" means they are both read and written, and the SPARK static analyser enforces it all):
procedure DeleteLogFile ( Index : LogFileIndexT)
––# global in out AuditSystemFault;
––#        in out LogFiles;
––#        in out LogFilesStatus;
––#        in out LogFileEntries;
––# derives AuditSystemFault,
––#         LogFiles         from *,
––#                               LogFiles,
––#                               Index &
––#         LogFilesStatus,
––#         LogFileEntries   from *,
––#                               Index;
is
   OK : Boolean;
   TheFile : File.T;
begin
   ...
end DeleteLogFile;
Specifying the names used from outer scopes allows to minimize the frequency of unwanted interactions between subsystems. This decreases significantly the total complexity of the system.

--------------------

Creating immutable data structures is a common source of problems in several languages. In Clojure this problem is partially solved using "transients". In D this problem has no good solutions yet, and it's faced with std.exception.assumeUnique() or casts. A recent proposal that may improve the situation is to allow implicit conversion and assign (with no casts) of the result of a pure function to a immutable value. An example (currently the DMD 2.051 compiler doesn't accept this code):
pure int[] foo(const int n) { // strongly pure
    return new int[n];
}

void main() {
    immutable(int[]) array = foo(10);
}
--------------------

With DMD 2.051 this program returns a compilation error, but pure functions are contravariant (a "subset") with impure functions, so this needs to be correct code:
double cube1(double n) {
    return n * n * n;
}

pure double cube2(double n) {
    return n * n * n;
}

void main() {
    double function(double)[] fs = [&cube1, &cube2];
}
--------------------

Recently Don Clugston in the thread "Should pure functions be prevented from reading changeable immutable static variables?" has shown another problem with D purity and the static initializers of modules:
http://www.digitalmars.com/webnews/newsgroups.php?art_group=digitalmars.D&article_id=121129

The problem may be seen with a simple program:
immutable int unstable;

pure int buggy() {
    return unstable;
}

static this() {
     // fails even though buggy is pure
     assert(buggy() == (++unstable, buggy()));
}

void main() {}
--------------------

A higher order function like map() is pure if the mapping function is pure (and if the iterable is lazy, then it too needs to generate the data with pure methods). But currently in D there is no way to specify the conditional purity of a function (the same is true for the "nothrow" attribute and probably "@safe" too). So currently you need a pure map() and an impure one, with some code duplication (that you may reduce with an ugly string mixin). There are some ideas regarding how to solve this problem.

Currently the class/struct invariant(){} (of the Contract-Based Programming) is not pure on default, but in theory it needs to not change the object state and not modify global mutable state.

Pure functions can't print on the console or save data on disk, this means logging becomes a problem (in Haskell there are some logging systems, like: http://hackage.haskell.org/package/hslogger-1.1.0 ).

--------------------

On Wikipedia you may find other related articles:
http://en.wikipedia.org/wiki/Referential_transparency_%28computer_science%29
http://en.wikipedia.org/wiki/Side_effect_%28computer_science%29
http://en.wikipedia.org/wiki/Purely_functional
http://en.wikipedia.org/wiki/Reentrant_code

--------------------

Version 1.0, Dec 16 2010: first version.
Version 1.1, Dec 16 2010: fixed grammar and moved links to the bottom.
Version 1.2, Dec 17 2010: the idea of weakly/strongly pure function was originally by Bruno Medeiros.
Tags: d language, programming, type system
  • Post a new comment

    Error

    default userpic
  • 10 comments