Thursday, August 11, 2011

A few words on false sharing of cache lines

Spent some quality time reading the native code behind java.lang.Object, particularly checking how synchronization is implemented (really complicated code, and critically important for the whole platform, as you can imagine).

It is implemented like this (with a ton of details left out): if the object is used as a lock, 4 of its 8 header bytes are interpreted as a pointer to an ObjectMonitor, basically a queue of waiting threads. The CAS operation associated with entering or exiting a synchronized block is targeted (unless I misread it) at a word _inside_ the bytes of the Object itself, not the linked structure. This basically means that if you have a "locks" Object[8] array (initialized with "new Object()" in a loop), and you grab locks[3] from one thread and locks[7] from another, these are likely to interfere with one another (CASes on the same cache line: one succeeding may spuriously cause the other to fail and retry). ReentrantLock is less susceptible to this problem, since it weighs in at 40 bytes instead of 8, but still. All of this is fixable through padding.

This is not to imply that only CAS operations on nearby locations are a problem; even plain writes/reads of nearby memory locations from different threads will cause false cache sharing and increased cache coherency traffic. 
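A sketch of the padding fix mentioned above (the class name is made up, and the 64-byte cache line size is an assumption; actual header and field layout vary by JVM): since the contended CAS lands in the lock object's own header, inflate each lock object so that consecutively allocated locks end up on different cache lines.

```java
// Hypothetical padded lock: the seven long fields add 56 bytes on top of
// the object header, pushing each instance past a (assumed) 64-byte cache
// line, so adjacent locks allocated in a loop no longer share a line.
// (A sufficiently clever JVM could in principle strip unused fields; treat
// this as illustrative, not bulletproof.)
final class PaddedLock {
    long p1, p2, p3, p4, p5, p6, p7; // padding, never read
}

class PaddedLockDemo {
    static PaddedLock[] makeLocks(int n) {
        PaddedLock[] locks = new PaddedLock[n];
        for (int i = 0; i < n; i++) {
            locks[i] = new PaddedLock();
        }
        return locks;
    }
}
```

Usage is the same as before: `synchronized (locks[3]) { ... }` from one thread and `synchronized (locks[7]) { ... }` from another, only now the two CASes land on different cache lines.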

Tuesday, January 18, 2011

A note on API Design - call-stack as the source of a pervasive duality

I'll talk about a few simple, often overlooked, yet pervasive and important aspects of API design. They come by many names and in many forms; if you have written code of any length, you have already faced at some level what I'm going to discuss. Long story short: almost any non-trivial API can be meaningfully designed in two, in some sense dual, ways. Don't pick randomly. If your library is popular and frequently used, this matter has profound importance: chances are that users will press for an implementation of the dual, no matter which of the two you chose to offer - not always realizing that what they asked for is already implemented, just turned "inside out". If you don't recognize this duality, you risk mindlessly duplicating lots of things, increasing the entropy of your library.

So my bet is that if you are doing API design, you are probably interested. (Edit: if you want to understand actors, you might also be interested!)

This is certainly not just a theoretical issue. Some well known manifestations of the phenomenon in the real world:
  • The introduction of StAX, while SAX was there
  • The many requests (or cries) for having Collection#forEach() kind of methods, to complement the #iterator()-based approach (related post by Neal Gafter)
  • Scala collections, which offer both iteration styles (passing a callback, or getting an iterator)
  • Servlets (which are callbacks) vs the RIFE framework (recall their motto, "do scanf() over the web" - they perceived that in this way they brought the "simplicity of console application programming" to the web)
  • Callbacks, continuations, python's nifty "yield" (generators), trampolines - these all come from the same place as well.

And that place is the good old call-stack.

Some (very) basic concepts up-front, to make sure we are on the same page. By "API", we usually refer to the layer through which user code and library code communicate. In Java, that is methods and the types they are defined in. Communication between these two parties (in either direction) typically happens by invoking methods: user code invoking library code, or the inverse (after the user code hands some callbacks to the library code, of course, since the library doesn't know its users). Method invocation occurs in a thread. The caller's stackframe issues the invocation to the callee, which pushes the callee's stackframe onto the call-stack. Since it is a stack, the caller's stackframe (i.e., local variables + program counter) outlives that of the callee. A trivial fact, but with non-trivial consequences.

Now, consider any two state machines that need to work together. When you are using an iterator, you are interacting with a state machine. When you pass callbacks to a parser, you interact with a state machine. Your program is a state machine, and anything you invoke is one too - even a pure function is a state machine, just one with a single, unchanging state - but we will concern ourselves with non-trivial state machines.

Since we are talking about API design, say one state machine is provided as a "library", and the other is "user code", communicating through some API (I only need to address the case of *two* state machines; for more, you can recursively apply the same arguments). To make it concrete, let's use as an example these very simple state machines:
  • A state machine that remembers/reports the maximum element it has seen
  • A state machine that produces a set of elements
And we want to combine them together so we compute the maximum element of a set.

In a single thread's call-stack, one of these must be deeper in the stack than the other, thus there are only two different configurations to arrange this interaction. One must play the role of the caller (for lack of a better name, I'll call this role active, because it is the one that initiates the invocations to the second), the other must play the role of the callee (passive - it reacts to invocations by the active one). The messaging works as follows: the caller (active) passes a message by invoking; the callee (passive) answers by returning (and it has to return, if the computation is ever to halt).

Let's see the two arrangements of the aforementioned state machines, in very simplified code:

//library - active
static int findMax(Iterator<Integer> integers) {
  int max = Integer.MIN_VALUE; // <-- call-stack encoded state
  while (integers.hasNext()) {
    max = Math.max(max, integers.next());
  }
  return max;
}


//user code - passive
class ValuesFromAnArray implements Iterator<Integer> {
  private final int[] array;
  private int position; //heap-encoded state


  ValuesFromAnArray(int[] array) {
    this.array = array;
  }

// API layer between library and user code
  public Integer next() {
    if (!hasNext()) throw new NoSuchElementException();
    return array[position++];
  }

// API layer between library and user code
  public boolean hasNext() {
    return position < array.length;
  }

  public void remove() {
    throw new UnsupportedOperationException();
  }
}
(I don't concern myself with trivialities like how to wrap the Iterator into an Iterable).

Here, the library is active. The user code returns a value, and updates its state (which must be stored in the heap) so that next time it will return the next value. The active library has an easier job - it uses just a local var for its state, and it doesn't have to explicitly store where it was (so it can come back to it) before calling the user code, since it implicitly uses the program counter to store that information. (And if you think about it: if you have a method with 8 independent positions, like nested if's of depth 3, and you enter one of them, you effectively use the program counter to encode 3 bits of information. To turn this into a heap-based state machine, be sure to encode those three bits in the heap as well!) The passive user code is basically a simplistic iterator (and iterators are a fine example of constructs designed to be passive).
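The parenthetical above can be made concrete with a toy sketch (the class is made up for illustration): an active machine would just execute three statements in a row, relying on the program counter to know which comes next; its passive counterpart must store that position in an explicit heap field.

```java
// A passive machine that yields "a", "b", "c" in order. The field below is
// literally the program counter of the active version, moved to the heap.
class ThreeSteps {
    private int pc; // heap-encoded program counter: 0, 1, 2, then done

    String next() {
        switch (pc++) {
            case 0: return "a"; // active version: first statement
            case 1: return "b"; // active version: second statement
            case 2: return "c"; // active version: third statement
            default: throw new IllegalStateException("exhausted");
        }
    }
}
```

Every resumption point of the active code becomes one `case` label; with nested loops and branches, the number of such labels (and the bookkeeping between them) grows quickly, which is exactly the shotgun effect shown later in this post.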

The other option is this:

//user code - active
int[] array = ...;
MaxFinder m = new MaxFinder();
for (int value : array) {
  m.seeValue(value);
}
int max = m.max();


//library - passive
class MaxFinder {
  private int max = Integer.MIN_VALUE; //heap-encoded state


// API layer between user code and library
  void seeValue(int value) { 
    max = Math.max(max, value);
  }


// API layer between user code and library
  int max() {
    return max;
  }
}

In this case, the user code is active and the library is passive. Again, the passive one is forced to encode its state in the heap.

Both work. All we did was a simple mechanical transformation to go from the one to the other. The question for the library designer:
  • Which option to offer? (Or offer both??)
That's not so trivial. Let's try to compare:

1. Syntax: There is no doubt that having the option to encode your state in the call-stack is a big syntactic win (big, but only syntactic). Not so much because of local vars, but because of the program counter: all those nifty for-loops, nested if statements and whatnot look so ugly when encoded in the heap. For example, here is some active user code:

MaxFinder maxFinder = new MaxFinder();
for (int i = 0; i < 10; i++) {
  for (int j = 0; j < 5; j++) {
    maxFinder.seeValue(someComplicatedFunction(i, j)); 
  }
}

Easy. Let's see what happens if we try to transform this into passive user code:

return new AbstractIterator<Integer>() { //Guava's AbstractIterator for convenience
  int i = 0;
  int j = 0;
  protected Integer computeNext() {
    try {
      return (i < 10) ? someComplicatedFunction(i, j) : endOfData();
    } finally {
      j++;
      if (j >= 5) {
         i++;
         j = 0;
      }
    }
  }
};

Ouch - that's as if we took a shotgun and splattered the previous code into pieces. Of course, I'm biased: after having read countless for-loops, I tend to understand them more easily and think of them as simpler, compared to their individual pieces scattered here and there. "It's just syntax", in the end (mind you, the difference could be a lot worse).

2. Composability: Now this is not aesthetics any more; it's a more serious question of "works" versus "doesn't work". If you offer your library as active, you force the user code to be passive. If the user code is already written, and is active, then it simply can't work with the active library. It is precisely this user that will complain about your library and/or ask you to offer a passive counterpart in addition to the active one you already offered. One of the two parties must agree to use return to send a message back, instead of recursively invoking a method. If neither is passive, the result is infinite recursion. An active participant requires control of the call-stack for the duration of the coordination (corollary: two actives can only work together if each has its own call-stack - see "trampoline" below). But if you offer your library as passive, you don't preclude any option for the user (it can be either active or passive), and thus this is the most flexible choice.


To exemplify the last point (this is an adaptation from Gafter's post): suppose we have two lists and we want to test whether they are equal. If we can produce iterators (i.e., passive state machines), it is easy:

//PassiveList offers #iterator()
boolean areEqual(PassiveList list1, PassiveList list2) {
  if (list1.size() != list2.size()) return false;
  Iterator<?> i1 = list1.iterator();
  Iterator<?> i2 = list2.iterator();

  while (i1.hasNext()) {
    if (!i1.next().equals(i2.next())) return false;
  }
  return true;
}

If one of the lists is active (has a #forEach() operation instead of giving an iterator), it is still implementable: let the active one drive the passive one:

//ActiveList offers #forEach(ElementHandler)
boolean areEqual(ActiveList activeList, PassiveList passiveList) {
  if (activeList.size() != passiveList.size()) return false;

  final Iterator<?> iterator = passiveList.iterator();
  final boolean[] equal = { true };
  activeList.forEach(new ElementHandler() {
    public void process(Object e) {
      if (!e.equals(iterator.next())) equal[0] = false;
    } //don't nitpick about not breaking early!
  });
  return equal[0];
}

But in principle, it's not possible to implement the following method without spawning a second thread (again, see trampoline, below):

//both are active!
boolean areEqual(ActiveList list1, ActiveList list2);

3. Exceptions: Another (Java-specific) way the two options differ is checked exceptions, unfortunately. If you create an active library, you have to anticipate that your passive user code may need to throw checked exceptions - so creating the API (implemented by user code) without knowing the checked exceptions beforehand is problematic and ugly (they have to somehow be reflected in the API signatures, and generified exceptions are not for the weak-hearted). Whereas if you let the user code take the active position, it is oh-too-easy for the user code to wrap the whole thing in a try/catch or add a throws clause to the enclosing method (the enclosing method is not offered by your library and is no concern of yours - if it is not your code that throws checked exceptions, you don't have to worry about them, and this is good). Oh well. Just another reason we should all be doing scala, eventually. Let's quickly hide this under the carpet for now. Credit for noting this important caveat goes to Chris Povirk.
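To give a taste of the generified-exceptions contortion (all names here are hypothetical, not from any real library): the callback interface has to carry a type parameter for the exception, and that parameter then infects every signature in the active library that accepts the callback.

```java
// The user-implemented callback declares, via a type parameter, which
// checked exception (if any) it may throw.
interface ThrowingHandler<E, X extends Exception> {
    void process(E element) throws X;
}

class ActiveLibrary {
    // The library method must re-declare X and propagate it - and so must
    // everything above it in the library's own call chain.
    static <E, X extends Exception> void forEach(E[] elements,
                                                 ThrowingHandler<E, X> handler) throws X {
        for (E e : elements) {
            handler.process(e); // any checked X flows back to the caller
        }
    }
}
```

A handler that throws nothing can instantiate `X` as `RuntimeException`, but the noise never fully disappears; in the user-active arrangement none of this machinery is needed.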

So where do these considerations leave us? My rules of thumb (not entirely contradiction-free):

  1. Let the more complicated participant have the active position; that piece of code will benefit the most from having its stackframe available for encoding its state.
  2. If you provide a library, offer it as passive - so you leave all options open for the user.

    But what to do if you anticipate that *the library* will be the more complicated participant? Do you choose to make it active and force user code to be passive? What if you are wrong in your estimation? It's a tough call. But in the end, if you are forced to offer your library also as active, at least place that version near the passive one, so it's apparent that the two perform the same function, and users have to look for this functionality in a single place, not two.
  3. If, though, the library is supposed to offer parallelization to the user, there is no other option: you have to make it active, because parallelization boils down to controlling how user code is invoked. (See ParallelArray's active operations. It also offers a passive iterator, but it would be folly to use it.)

That was the meat of this post. I mentioned trampolines above; in the remaining part, I'll touch upon that.

Python generators are, indeed, beautiful. You write an active piece of code, and some python (alas, very inefficient) piece of magic transforms it into a passive construct (an iterator) - which, recall, can then be used with another active counterpart. Thus, you get the best of both worlds: both parties (e.g. producer and consumer, here) can be active!

//producer - active!
Iterator<Integer> iterator = magicTrampoline(new AbstractProducer() {
  public void run() {
    for (int i = 0; i < 10; i++) {
      for (int j = 0; j < 5; j++) {
        produce(complicatedFunction(i, j));
      }
    }
  }
});


//consumer - active!
while (iterator.hasNext())
  consume(iterator.next());

This is certainly the best option for code readability. The cost is that we need an extra thread (more precisely, we need an extra call-stack). "Trampoline" is the common name for the port of this to Java. I wanted to link to some implementation here, but couldn't readily find a nice one. I'll just sketch the implementation anyway: the idea is that you let the producer run on a separate thread from the consumer. Thus message-passing between producer and consumer does not happen on the same call-stack, but via another means, say a BlockingQueue (the producer puts, the consumer takes). So the producer doesn't have to destroy its stackframe(s) after producing an element, but can safely hold on to them (remember that this was not possible in the single-thread scenario, where message passing via the call-stack would lead to infinite recursion).
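The sketch above might look like the following minimal version (all names are made up, and corner cases like null elements and producer exceptions are ignored): the producer runs on a daemon thread and keeps its own call-stack; a sentinel object marks the end of the stream.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

final class MagicTrampoline {
    private static final Object END = new Object(); // end-of-stream sentinel

    interface Producer<T> {
        // The producer's active body: it puts elements into the queue,
        // free to use loops, branches, and its own call-stack.
        void produceAll(BlockingQueue<Object> out) throws InterruptedException;
    }

    static <T> Iterator<T> trampoline(final Producer<T> producer) {
        final BlockingQueue<Object> queue = new LinkedBlockingQueue<Object>();
        Thread t = new Thread(new Runnable() {
            public void run() {
                try {
                    producer.produceAll(queue);
                    queue.put(END);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        });
        t.setDaemon(true);
        t.start();
        return new Iterator<T>() {
            private Object buffered; // null means "not fetched yet"

            public boolean hasNext() {
                if (buffered == null) {
                    try {
                        buffered = queue.take(); // blocks until the producer puts
                    } catch (InterruptedException e) {
                        throw new RuntimeException(e);
                    }
                }
                return buffered != END;
            }

            @SuppressWarnings("unchecked")
            public T next() {
                if (!hasNext()) throw new NoSuchElementException();
                T result = (T) buffered;
                buffered = null;
                return result;
            }

            public void remove() {
                throw new UnsupportedOperationException();
            }
        };
    }
}
```

Both sides are now active: the producer writes plain nested loops, and the consumer drives the iterator with a plain while loop, each on its own call-stack.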

--EDIT
Sure I can find an implementation of the above - scala actors! The actors interact in an active (of course!) fashion, and this is possible because they communicate by exchanging messages, not invocations, so no party needs to play the passive role. (Of course, underneath there are the good old BlockingQueue and the trampoline mechanics mentioned above.) And yet, even in scala actors, there are the so-called reactive actors - these don't hold on to their thread (and their call-stack), and thus are not as wasteful as the regular actors, but, as per the discussion above, they are passive. This is how one can tie actors into this discussion, and understand them through a different perspective.
--/EDIT

Since the cost of this is very high (not just the second thread itself, but the frequent blocking/context switches), it is not to be recommended in Java, unless we someday get much more lightweight threads, like those in Erlang. (And I can certainly imagine the joy of Erlang programmers, who can always readily use their call-stack to encode their complex state machines, instead of jumping through hoops just to avoid introducing another thread.)

(Another way to achieve this would be via continuations - which Java doesn't support natively.)

And a parting story: 4 years ago, I played with these ideas in the context of implementing complex gesture detection for a graph editor, amongst other graph-related tasks. Use-case: implement drag and drop for nodes (a common task). This is what my experimental code looked like - quite elegant:

//active pattern detection!
Event firstEvent = awaitEvent();
if (firstEvent.isMouseDown()) {
  Event nextEvent;
  while ((nextEvent = awaitEvent()).isMouseDrag()) {
    continue;
  }
  if (nextEvent.isMouseUp()) {
    //drag and drop is complete!
  }
} else { 
  ...
}

If you have read all the way to here (phew!), you can probably imagine how "fun" (ugh) it would be to transform the pattern detection into a passive, say, MouseListener/MouseMotionListener (translating all those if's and else's and while's...). Be my guest and try (I'm kidding - just don't). This is the source of my first rule of thumb: it was via such sufficiently complicated state machines that I observed how much the code can be simplified just through the act of making it active.

To better understand the last example and how it's relevant, note that Swing itself is an active state machine - remember, it owns the event-dispatching thread, thus user code is forced into the passive role. (Which is also consistent with my first rule of thumb: swing, the most complex machinery, takes up the active, more convenient role.)