Sunday, February 5, 2012

Updated: Memory cost per Java/Guava structure (+Cache)

I just updated the ElementCostInDataStructures page, after 6 months, bringing it in line with the latest Guava version (11).

The most interesting part is just how easy it is to measure the cost per entry in a Cache (or LoadingCache, which turns out to be equivalent). This is the cheat-sheet I compiled.

To compute the cost of a single (key, value) entry:

So, consider this example from the javadoc:

   LoadingCache<Key, Graph> graphs = CacheBuilder.newBuilder()
       .maximumSize(10000)
       .expireAfterWrite(10, TimeUnit.MINUTES)
       .removalListener(MY_LISTENER)
       .build(
           new CacheLoader<Key, Graph>() {
             public Graph load(Key key) throws AnyException {
               return createExpensiveGraph(key);
             }
           });

The cost of an entry in this structure is computed as follows:
  • It's a Cache: +12 words
  • It uses maximumSize(): +4 words
  • It uses expiration: +4 words
Thus, each (key, value) entry has a footprint of 20 words (80 bytes in a 32-bit VM, or 160 bytes in a 64-bit one).
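
For quick estimates, here is a minimal sketch (my own, not part of Guava or of the measurement tool) that just adds up the per-feature word costs quoted above and converts them to bytes:

// Back-of-the-envelope helper: sums the per-feature word costs from this post.
// The constants are the numbers quoted above, not anything read out of Guava.
final class CacheEntryFootprint {
  static final int BASE_WORDS = 12;         // any Cache/LoadingCache entry
  static final int MAXIMUM_SIZE_WORDS = 4;  // maximumSize()
  static final int EXPIRATION_WORDS = 4;    // expireAfterWrite()/expireAfterAccess()

  static long bytesPerEntry(boolean maximumSize, boolean expiration, int bytesPerWord) {
    int words = BASE_WORDS
        + (maximumSize ? MAXIMUM_SIZE_WORDS : 0)
        + (expiration ? EXPIRATION_WORDS : 0);
    return (long) words * bytesPerWord;
  }

  public static void main(String[] args) {
    // The javadoc example above, on a 32-bit VM: (12 + 4 + 4) * 4 = 80 bytes
    System.out.println(bytesPerEntry(true, true, 4));
  }
}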

To estimate the overhead imposed on the garbage collector, one can count how many references (pointers) each entry introduces, which the collector has to traverse when computing object reachability. The same list again, this time counting only references:

Thus, for example, if you use all features together (expiration, maximumSize(), weakKeys(), and weak or soft values), then each entry introduces 6+2+2+3+4 = 17 references. If you have 1000 entries in this cache, that would imply 17,000 references.

A simple piece of advice to close this post: since these numbers can quickly add up, be sure to use only those features that your application needs. Don't overdo it just because these features are easily accessible - if you can practically get away with fewer features, do so, and you'll get a leaner cache.

That said, I'm positively surprised by the Cache implementation - you pay exactly for what you use, and the cost of various combinations of features is simply the sum of the costs of the individual features. This is not obvious, not a given, and not an accident, but good engineering. 

Thursday, August 11, 2011

Few words on false sharing of cache lines

I spent some quality time reading the native code behind java.lang.Object, particularly checking how synchronization is implemented (really complicated code, and critically important for the whole platform, as you can imagine).

It is implemented like this (with a ton of details left out): of the 8 bytes of the object, 4 of them, if the object is used as a lock, are interpreted as a pointer to an ObjectMonitor, basically a queue of waiting threads. The CAS operation associated with entering or exiting a synchronized block is targeted (unless I misread it) at a word _inside_ the bytes of the Object itself, not the linked structure. This basically means that if you have a "locks" Object[8] array (initialized with "new Object()" in a loop), and you grab locks[3] from one thread and locks[7] from another, these are likely to interfere with one another (CASes land on the same cache line, so one succeeding may spuriously cause the other to fail and retry). ReentrantLock is less susceptible to this problem, since it weighs in at 40 bytes instead of 8, but still. All of this is fixable through padding, as in the sketch below.
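
Here is a minimal sketch of the padding workaround (my own illustration; the 64-byte cache-line size and the resulting object layout are assumptions about the particular VM):

// Instead of filling an array with bare 8-byte Objects, use lock objects padded past
// a typical 64-byte cache line, so two locks allocated back to back are unlikely to
// end up on the same line. Purely illustrative.
final class PaddedLock {
  // ~7 * 8 = 56 bytes of filler on top of the object header
  long p0, p1, p2, p3, p4, p5, p6;
}

class Locks {
  private final Object[] locks;

  Locks(int n) {
    locks = new Object[n];
    for (int i = 0; i < n; i++) {
      locks[i] = new PaddedLock();  // each lock sits on (roughly) its own cache line
    }
  }

  Object get(int i) {
    return locks[i];
  }
}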

This is not to imply that only CAS operations on nearby locations are a problem; even plain writes/reads of nearby memory locations from different threads will cause false sharing and increased cache-coherency traffic. 

Tuesday, January 18, 2011

A note on API Design - call-stack as the source of a pervasive duality

I'll talk about a few simple, often overlooked, yet pervasive and important aspects of API design. They come by many names and in many forms; if you have written code of any length, you have already faced at some level what I'm going to discuss. Long story short: almost any non-trivial API can be meaningfully designed in two, in some sense dual, ways. Don't pick randomly. If your library is popular and frequently used, this matter has profound importance: chances are that users will press for an implementation of the dual, no matter which of the two you chose to offer - not always realizing that what they asked for is already implemented, just turned "inside out". If you don't recognize this duality, you risk mindlessly duplicating lots of things and increasing the entropy of your library.

So my bet is that if you are doing API design, you are probably interested. (Edit: if you want to understand actors, you might also be interested!)

This is certainly not just a theoretical issue. Some well known manifestations of the phenomenon in the real world:
  • The introduction of StAX, while SAX was there
  • The many requests (or cries) for having Collection#forEach() kind of methods, to complement the #iterator()-based approach (related post by Neal Gafter)
  • Scala collections, which offer both iteration styles (passing a callback, or getting an iterator)
  • Servlets (which are callbacks) vs the RIFE framework (recall their motto, "do scanf() over the web?" - they perceived that in this way they brought the "simplicity of console application programming" to the web.)
  • Callbacks, continuations, Python's nifty "yield" (generators), trampolines - all come from the same place as well.

And that place is the good old call-stack.

Some (very) basic concepts up-front, to make sure we are on the same page. By "API", we usually refer to the layer through which user code and library code communicate. In Java, that is methods and the types they are defined in. Communication between these two parties (in either direction) typically happens by invoking methods: user code invoking library code, or the inverse (after the user code hands some callbacks to the library code, of course, since the library doesn't know its users). Method invocation occurs in a thread. The caller's stackframe issues the invocation to the callee, which pushes the callee's stackframe on the call-stack. Since it is a stack, the caller's stackframe (i.e., local variables + program counter) outlives that of the callee. A trivial fact, but with non-trivial consequences.

Now, consider any two state machines that need to work together. When you are using an iterator, you are interacting with a state machine. When you pass callbacks to a parser, you interact with a state machine. Your program is a state machine, and anything you invoke is one too - even a pure function is a state machine, just one with a single, unchanging state - but we will concern ourselves with non-trivial state machines.

Since we are talking about API design, say one state machine is provided as a "library", and the other is "user code", communicating through some API (I only need to address the case of *two* state machines; for more, you can recursively apply the same arguments). To make it concrete, let's use these very simple state machines as an example:
  • A state machine that remembers/reports the maximum element it has seen
  • A state machine that produces a set of elements
And we want to combine them to compute the maximum element of a set.

In a single thread's call-stack, one of these must be deeper in the stack than the other, thus there are only two different configurations to arrange this interaction. One must play the role of the caller (for lack of a better name, I'll call this role active, because it is the one that initiates the invocations to the second), the other must play the role of the callee (passive - it reacts to invocations by the active one). The messaging works as follows: The caller (active) passes a message by invoking, the callee (passive) answers by returning (and it has to return, if the computation ever halts).

Let's see the two arrangements of the aforementioned state machines, in very simplified code:

//library - active
static int findMax(Iterator<Integer> integers) {
  int max = Integer.MIN_VALUE; // <-- call-stack encoded state
  while (integers.hasNext()) {
    max = Math.max(max, integers.next());
  }
  return max;
}


//user code - passive
class ValuesFromAnArray implements Iterator<Integer> {
  private final int[] array;
  private int position; //heap-encoded state

  ValuesFromAnArray(int[] array) {
    this.array = array;
  }

// API layer between library and user code
  public Integer next() { 
    if (!hasNext()) throw new NoSuchElementException();
    return array[position++];
  }

// API layer between library and user code
  public boolean hasNext() { 
    return position < array.length;
  }

  public void remove() {
    throw new UnsupportedOperationException();
  }
}
(I don't concern myself with trivialities like how to wrap the Iterator into an Iterable).

Here, the library is active. The user code returns a value, and updates its state (which must be stored in the heap) so next time it will return the next value. The active library has an easier job - it only uses a local var for its state, and it doesn't have to explicitly store where it was (so it can come back to it) before calling the user code, since it implicitly uses the program counter to store that information. (And if you think about it, if you have a method and you have 8 independent positions, like nested if's of depth equal to 3, and you enter one of them, you effectively use the program counter to encode 3 bits of information. To turn this into a heap-based state machine, be sure to remember to encode those three bits in the heap as well!) The passive user code is basically a simplistic iterator (and iterators are a fine example of constructs designed to be passive).

The other option is this:

//user code - active
int[] array = ...;
MaxFinder m = new MaxFinder();
for (int value : array) {
  m.seeValue(value);
}
int max = m.max();


//library - passive
class MaxFinder {
  private int max = Integer.MIN_VALUE; //heap-encoded state


// API layer between user code and library
  void seeValue(int value) { 
    max = Math.max(max, value);
  }


// API layer between user code and library
  int max() {
    return max;
  }
}

In this case, the user code is active and the library is passive. Again, the passive one is forced to encode its state in the heap.

Both work. All we did was a simple mechanical transformation to go from one to the other. The question for the library designer:
  • Which option to offer? (Or offer both??)
That's not so trivial. Let's try to compare:

1. Syntax: There is no doubt that having the option to encode your state in the call-stack is a big syntactic win (big, but only syntactic). Not so much because of local vars, but because of the program counter: All those nifty for-loops, nested if statements and what not, look so ugly when encoded in the heap. For example, here is some active user code:

MaxFinder maxFinder = new MaxFinder();
for (int i = 0; i < 10; i++) {
  for (int j = 0; j < 5; j++) {
    maxFinder.seeValue(someComplicatedFunction(i, j)); 
  }
}

Easy. Let's see what happens if we try to transform this into passive user code:

return new AbstractIterator<Integer>() { //Guava's AbstractIterator for convenience
  int i = 0;
  int j = 0;
  protected Integer computeNext() {
    try {
      return (i < 10) ? someComplicatedFunction(i, j) : endOfData();
    } finally {
      j++;
      if (j >= 5) {
        i++;
        j = 0;
      }
    }
  }
};

Ouch - that's like we took a shotgun and splattered the previous code into pieces. Of course, I'm biased: after having read countless for-loops, I tend to understand them more easily and think them simpler, compared to the individual pieces of them scattered here and there. "It's just syntax", in the end (mind you, the difference could be a lot worse though).

2. Composability: Now this is not aesthetics any more, but a more serious question of "works" versus "doesn't work". If you offer your library as active, you force the user code to be passive. If the user code is already written, and is active, then it simply can't work with the active library. It is precisely this user who will complain about your library and/or ask you to offer a passive counterpart in addition to the active one you already offered. One of the two parties must agree to use return to send a message back, instead of recursively invoking a method; if neither is passive, the result is infinite recursion. An active participant requires control of the call-stack for the duration of the coordination (corollary: two actives can only work together if each one has its own call-stack - see "trampoline" below). But if you offer your library as passive, you don't preclude any option for the user (it can be either active or passive), and thus this is the most flexible choice. 


To exemplify the last point (this is an adaptation from Gafter's post), suppose we have two lists and we want to test whether they are equal. If we can produce iterators (i.e., passive state machines), it is easy:

//PassiveList offers #iterator()
boolean areEqual(PassiveList list1, PassiveList list2) {
  if (list1.size() != list2.size()) return false;
  Iterator i1 = list1.iterator();
  Iterator i2 = list2.iterator();


  while (i1.hasNext()) {
    if (i1.next() != i2.next()) return false;
  }
  return true;
}

If one of the lists were active (had a #forEach() operation instead of giving an iterator), it would still be implementable; let the active one drive the passive one:

//ActiveList offers #forEach(ElementHandler)
boolean areEqual(ActiveList activeList, PassiveList passiveList) {
  if (activeList.size() != passiveList.size()) return false;
  
  final Iterator iterator = passiveList.iterator();
  final boolean[] equal = { true }; 
  activeList.forEach(new ElementHandler() {
    public void process(Object e) {
      if (e != iterator.next()) equal[0] = false;
    } //don't nitpick about not breaking early!
  });
  return equal[0];
}

But in principle, it's not possible to implement the following method without spawning a second thread (again, see the trampoline below):

//both are active!
boolean areEqual(ActiveList list1, ActiveList list2);

3. Exceptions: Another (Java-specific) way in which the two options differ is checked exceptions, unfortunately. If you create an active library, you have to anticipate that the passive user code may need to throw checked exceptions - so creating the API (implemented by user code) without knowing the checked exceptions beforehand is problematic and ugly (they somehow have to be reflected in the API signatures, and generified exceptions are not for the faint-hearted). Whereas if you let the user code take the active position, it is all too easy for the user code to wrap the whole thing in a try/catch or add a throws clause to the enclosing method (the enclosing method is not offered by your library and is of no concern to you - if it is not your code that throws checked exceptions, you don't have to worry about them, and this is good). Oh well. Just another reason we should all be doing Scala, eventually. Let's quickly hide this under the carpet for now. Credit for noting this important caveat goes to Chris Povirk.
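
A small sketch of the asymmetry (the names are illustrative, not from any real library). An active library invokes user code through an interface it defines, so it must anticipate user exceptions up front, e.g. with a generified throws clause; with the user code active and the library passive (the MaxFinder from above), checked exceptions never touch the API:

interface ValueSource<X extends Exception> {
  boolean hasNext() throws X;
  int next() throws X;
}

class ActiveMaxFinder {
  static <X extends Exception> int findMax(ValueSource<X> values) throws X {
    int max = Integer.MIN_VALUE;
    while (values.hasNext()) {
      max = Math.max(max, values.next());
    }
    return max;
  }
}

class UserCode {
  static int maxOfLines(java.io.BufferedReader reader) throws java.io.IOException {
    MaxFinder m = new MaxFinder();           // the passive library from above
    String line;
    while ((line = reader.readLine()) != null) {
      m.seeValue(Integer.parseInt(line));    // IOException simply propagates
    }
    return m.max();
  }
}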

So where do these considerations leave us? My rules of thumb (not entirely contradiction-free):

  1. Let the more complicated participant have the active position; that piece of code will benefit the most by having its stackframe available for encoding its state. 
  2. If you provide a library, offer it as passive - so you leave all options open for the user.

    But what to do if you anticipate *the library* to be the more complicated participant? Do you choose to make it active and force user code to be passive? What if you are wrong in your estimation? It's a tough call. But in the end, if you are forced to offer your library as active as well, at least place that version near the passive one, so it's apparent that the two perform the same function, and so users have to look for this functionality in a single place, not two.
  3. If, though, the library is supposed to offer parallelization to the user, there is no choice: you have to make it active, because parallelization boils down to controlling how the user's code is invoked. (See ParallelArray's active operations; it also offers a passive iterator, but it would be folly to use it.) A sketch of such an active, parallelizing operation follows.
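
Here is a rough sketch of such a library (the names are mine, not ParallelArray's actual API): it is active because it decides on which thread, and in what chunks, the user's passive callback gets invoked.

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class ParallelEach {
  interface Handler<E> {
    void process(E element); // passive user code
  }

  static <E> void forEach(List<E> list, final Handler<E> handler, int threads)
      throws Exception {
    ExecutorService pool = Executors.newFixedThreadPool(threads);
    try {
      int chunk = (list.size() + threads - 1) / threads;
      List<Future<?>> futures = new ArrayList<Future<?>>();
      for (int start = 0; start < list.size(); start += chunk) {
        final List<E> slice = list.subList(start, Math.min(start + chunk, list.size()));
        futures.add(pool.submit(new Runnable() {
          public void run() {
            for (E element : slice) {
              handler.process(element); // the library chooses when and where to call back
            }
          }
        }));
      }
      for (Future<?> f : futures) {
        f.get(); // propagate any failure from the worker threads
      }
    } finally {
      pool.shutdown();
    }
  }
}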

That was the meat of this post. I mentioned trampolines above, so in the remaining part, I'll touch upon that.

Python generators are, indeed, beautiful. You write an active piece of code, and some Python (alas, rather inefficient) magic transforms it into a passive construct (an iterator) - which, recall, can then be used by another active counterpart. Thus, you get the best of both worlds: both parties (e.g. producer and consumer, here) can be active!

//producer - active!
Iterator iterator = magicTrampoline(new AbstractProducer() {
  public void run() {
    for (int i = 0; i < 10; i++) {
      for (int j = 0; j < 5; j++) {
        produce(complicatedFunction(i, j));
      }
    }
  }
});


//consumer - active!
while (iterator.hasNext())
  consume(iterator.next());

This is certainly the best option for code readability. The cost is that we need an extra thread (more precisely, we need an extra call-stack). "Trampoline" is the common name of the port of this idea to Java. I wanted to link to some implementation here, but couldn't readily find a nice one. I'll just sketch the implementation anyway: the idea is that you let the producer run on a separate (from the consumer) thread. Thus, message passing between producer and consumer does not happen on the same call-stack, but via another means, say a BlockingQueue (the producer puts, the consumer takes). So the producer doesn't have to destroy its stackframe(s) after producing an element; it can safely hold on to them (remember that this was not possible in the single-thread scenario, where message passing via the call-stack would lead to infinite recursion).
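
Here is a rough, minimal version of that sketch in code (my own; a real implementation would also have to deal with producer failures and with consumers that abandon the iterator early):

import java.util.Iterator;
import java.util.NoSuchElementException;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

final class Trampoline {
  private static final Object END = new Object(); // end-of-stream sentinel

  interface Producer<E> {
    void produceAll(Sink<E> sink) throws InterruptedException;
  }

  interface Sink<E> {
    void produce(E element) throws InterruptedException;
  }

  // Runs the (active) producer on its own thread/call-stack; messages flow to the
  // (active) consumer through a BlockingQueue instead of through the call-stack.
  static <E> Iterator<E> iterate(final Producer<E> producer) {
    final BlockingQueue<Object> queue = new ArrayBlockingQueue<Object>(16);
    Thread thread = new Thread(new Runnable() {
      public void run() {
        try {
          producer.produceAll(new Sink<E>() {
            public void produce(E element) throws InterruptedException {
              queue.put(element);
            }
          });
          queue.put(END);
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();
        }
      }
    });
    thread.setDaemon(true);
    thread.start();

    return new Iterator<E>() {
      private Object next;
      private boolean done;

      public boolean hasNext() {
        if (done) return false;
        if (next != null) return true;
        try {
          next = queue.take();
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();
          throw new RuntimeException(e);
        }
        if (next == END) {
          done = true;
          next = null;
          return false;
        }
        return true;
      }

      @SuppressWarnings("unchecked")
      public E next() {
        if (!hasNext()) throw new NoSuchElementException();
        E result = (E) next;
        next = null;
        return result;
      }

      public void remove() {
        throw new UnsupportedOperationException();
      }
    };
  }
}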

--EDIT
Sure I can find an implementation of the above - Scala actors! Actors interact in an active fashion (of course!), and this is possible because they communicate by exchanging messages, not invocations, so no party needs to play the passive role. (Underneath, of course, there is the good old BlockingQueue and the trampoline mechanics mentioned above.) And yet, even in Scala actors, there are the so-called reactive actors - these don't hold on to their thread (and their call-stack), so they are not as wasteful as the regular actors, but, as per the discussion above, they are passive. This is how one can tie actors into this discussion, and understand them through a different perspective.
--/EDIT

Since the cost of this is very high (not just the second thread itself, but the frequent blocking/context switches), it is not to be recommended in Java, unless we someday get much more lightweight threads, like those in Erlang. (And I can certainly imagine the joy of Erlang programmers, who can always readily use their call-stack to encode their complex state machines, instead of jumping through hoops just to avoid introducing another thread.)

(Another way to achieve this would be via continuations - which Java doesn't support natively.)

And a parting story: 4 years ago, I played with these ideas in the context of implementing complex gesture detection for a graph editor, amongst other graph-related tasks. Use case: implement drag and drop for nodes (a common task). This is what my experimental code looked like - quite elegant:

//active pattern detection!
Event firstEvent = awaitEvent();
if (firstEvent.isMouseDown()) {
  Event nextEvent;
  while ((nextEvent = awaitEvent()).isMouseDrag()) {
    continue;
  }
  if (nextEvent.isMouseUp()) {
    //drag and drop is complete!
  }
} else { 
  ...
}

If you have read all the way to here (phew!), you can probably imagine how "fun" (ugh) it would be to transform the pattern detection into a passive, say, MouseListener/MouseMotionListener (and to translate all those if's and else's and while's...). Be my guest and try (I'm kidding - just don't). This is the source of my first rule of thumb: it was via such sufficiently complicated state machines that I observed how much the code can be simplified just through the act of making it active.

To better understand the last example and how it's relevant, note that Swing itself is an active state machine - remember, it owns the event-dispatching thread, so user code is forced into the passive role. (Which is also consistent with my first rule of thumb: Swing, the most complex machinery, takes up the active, more convenient role.)

Sunday, December 12, 2010

How long is the period of random numbers created by java.util.Random?

The answer is, pretty darn long! 


I feel a bit cheated for not knowing, even though I had seen the code several times over the years.


So it uses a good old linear congruential generator:


x = A * x + C


java.util.Random uses these two constants (A and C respectively):


    private final static long multiplier = 0x5DEECE66DL;
    private final static long addend = 0xBL;

Quite appropriate to place two "random" numbers in the Random class, right? Only these are not entirely random...

0x5DEECE66DL % 8 == 5
and 
0xBL is odd.

So what? 


It turns out that when these exact conditions are satisfied, the period is maximal, so (since x is a long) the generated period is 2^64. This piece of magic is documented here, page 9. The "why" of it is beyond me. 

If any ninja mathematician comes across this post, please leave some explanation if you can.


EDIT: oops. I overlooked that java.util.Random masks the seed down to 48 bits instead of 64, so correspondingly the period is 2^48 instead of 2^64; apparently in an attempt to increase the quality of the generated numbers. In any case, if you need a longer period, the technique described here will do the trick.
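
To make the recurrence concrete, here is a small, stand-alone sketch of that 48-bit update step (the constants are the ones quoted above; the rest paraphrases from memory what the JDK does, so treat it as an illustration rather than the JDK's exact code):

final class TinyLcg {
  private static final long MULTIPLIER = 0x5DEECE66DL; // A
  private static final long ADDEND = 0xBL;             // C
  private static final long MASK = (1L << 48) - 1;     // keep only the low 48 bits

  private long seed;

  TinyLcg(long seed) {
    this.seed = seed & MASK;
  }

  int next(int bits) {
    seed = (seed * MULTIPLIER + ADDEND) & MASK; // x = A * x + C (mod 2^48)
    return (int) (seed >>> (48 - bits));        // the high bits are the better ones
  }
}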

Juggling consistent hashing, hashtables, and good old sorting

Consistent hashing can be summarized in one sentence: it is a hashtable with variable-length buckets.

You have a line segment (that's the key space). Breaking it into N equal parts (the buckets) means it is very easy (O(1)) to compute which part contains a given point. But if you increase or decrease N, all parts change (and shrink or grow respectively). Only the first and the last part have one of their ends unaffected; the rest have both their ends moved.

Now consider breaking it into N parts, not necessarily of the same length. It is now easier to change the number of parts. Want more parts? Break one into two pieces. Want fewer? Merge two parts sitting next to each other. All other parts remain firmly in place, unaffected. But finding which part contains a given point now gets more complicated (O(logN) if you keep the parts sorted, say in a red-black tree).
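
A minimal sketch of that lookup, using a TreeMap (the JDK's red-black tree) keyed by the point where each part ends (the names and types here are illustrative):

import java.util.Map;
import java.util.TreeMap;

class VariableParts<B> {
  // maps the upper end of each part to its bucket; lookup is O(log N),
  // and splitting or merging parts touches only the affected entries
  private final TreeMap<Double, B> upperEndToBucket = new TreeMap<Double, B>();

  void addPart(double upperEnd, B bucket) {
    upperEndToBucket.put(upperEnd, bucket);
  }

  B bucketFor(double point) {
    // the part containing a point is the one with the smallest upper end >= point
    Map.Entry<Double, B> part = upperEndToBucket.ceilingEntry(point);
    if (part == null) {
      throw new IllegalArgumentException("point lies beyond the last part: " + point);
    }
    return part.getValue();
  }
}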

Hashtables themselves can be summarized in an interesting way: a hashtable is sorting... by hash (it doesn't matter if the original key space is not totally ordered - the hashes are). But if it is sorting, where is the logN factor in the complexity of hashtable operations? Well, there isn't one: hashes are small natural numbers (bounded by the hashtable size, which is O(N) in the number of elements). And what do you think of when you want to sort small natural numbers? Counting sort. (Caveat: hash functions do not just aim to map a large key space into a sequence of small natural numbers; that's only part of the story. They also aim to approximate a random permutation of the original space, which is separately useful - to load-balance the buckets when the incoming distribution is not uniform. In consistent hashing, we sacrifice the first role of hash functions, but we keep the latter one.)

Trying to compress the discussion above:

  • Hashtables are really counting sort
    • (Yes, data structures are algorithms)
    • An equal piece of the (approximately permuted) key space is mapped to each bucket
      • Easy to find the corresponding bucket for a key (++)
      • Changing the number of buckets affects the mapped spaces of all buckets (--)
  • Consistent hashing differs in the following way:
    • A variable piece of the (approximately permuted) key space is mapped to each bucket
      • Harder to find the corresponding bucket for a key (--)
      • Easy to split or merge buckets, without affecting any other buckets (++)


Bonus section: Bigtable (and the good old Bigtable paper here). In Bigtable, rows are primarily sorted by row key. The tablets (think buckets) are of variable length, so that overloaded tablets can be split without touching the huge amount of data that potentially resides in the same table. (One can help load-balance the tablets by - did you guess it? - sorting by hash first...). So, there you go. The rows are sorted, yet it can still rightfully be described as consistent hashing. Confused yet? Well, at least I tried. :)

Wednesday, October 20, 2010

Memory cost (per entry) for various Java data structures

Have you seen this? http://code.google.com/p/memory-measurer/wiki/ElementCostInDataStructures
It shows the memory cost per entry for various common and not-so-common Java data structures. For example, have you wondered, when you put() an entry into a HashMap, how much memory footprint you create apart from the space occupied by your key and your value (i.e., only the internal artifacts introduced by HashMap itself)? There you'll find the answers to such questions.

It also exemplifies the utility of my memory measuring tool, which I recently redesigned to decouple object exploration from measuring the memory usage (bytes) of objects, so I can use the same exploration to derive more qualitative statistics, like the number of objects in an object graph, the number of references, and the number of primitives (which would directly reflect memory usage if it weren't for the effects of alignment).

If you are designing data structures in Java, it would be silly not to use this tool. Seriously. (Profilers can be used to somewhat the same effect, but why fire one up and click buttons until you get where you want, then take the numbers and calculate by yourself, when you can do it more precisely with very simple code?)
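
For illustration, here is roughly what such "very simple code" can look like; measureBytes() below is a stand-in for whatever deep-size primitive your measurer of choice provides (the name is mine, not necessarily the tool's API):

import java.util.HashMap;
import java.util.Map;

class PerEntryCost {
  // Estimates HashMap's own per-entry overhead. The keys are created up front and put
  // into the same object graph as the map, so their footprint cancels out and only the
  // map's internal artifacts (Entry objects, table resizes) show up in the difference.
  static double hashMapBytesPerEntry(int entries) {
    Integer[] keys = new Integer[entries];
    for (int i = 0; i < entries; i++) {
      keys[i] = Integer.valueOf(i);
    }
    Map<Integer, Integer> map = new HashMap<Integer, Integer>();
    Object[] root = { keys, map };      // one graph containing both the keys and the map

    long before = measureBytes(root);
    for (Integer key : keys) {
      map.put(key, key);                // reuse the key object as the value
    }
    long after = measureBytes(root);

    return (after - before) / (double) entries;
  }

  // Stand-in for the measurement primitive (assumed name); a real one can be built on
  // java.lang.instrument.Instrumentation plus reflective traversal of the object graph.
  static long measureBytes(Object root) {
    throw new UnsupportedOperationException("plug in a real measurer");
  }
}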

Future work: do the same for the immutable Guava collections.