Code-o-matic

Updated: Memory cost per Java/Guava structure (+Cache)

2012-02-05T21:13:00.000-08:00

I just updated the ElementCostInDataStructures page, after 6 months, bringing it in line with the latest Guava version (11).

The most interesting part is just how easy it is to measure the cost per entry in a Cache (or LoadingCache, which turns out to be equivalent). This is the cheat-sheet I compiled.

To compute the cost of a single (key, value) entry:

If you use HashMap or ConcurrentHashMap, the cost is 8 words (32 bytes)
If you switch to Cache/LoadingCache, the cost is 12 words (48 bytes)
If you add expiration of either kind (or both), add 4 words (16 bytes)
If you use maximumSize(), add 4 words (16 bytes)
If you use weakKeys(), add 4 words (16 bytes)
If you use weakValues() or softValues(), add 4 words (16 bytes)

So, consider this example from the javadoc:

   LoadingCache<Key, Graph> graphs = CacheBuilder.newBuilder()
       .maximumSize(10000)
       .expireAfterWrite(10, TimeUnit.MINUTES)
       .removalListener(MY_LISTENER)
       .build(
           new CacheLoader<Key, Graph>() {
             public Graph load(Key key) throws AnyException {
               return createExpensiveGraph(key);
             }
           });

The cost of an entry in this structure is computed as follows:

It's a Cache: +12 words
It uses maximumSize(): +4 words
It uses expiration: +4 words

Thus, each (key, value) entry would have a footprint of 20 words (that is, 80 bytes in a 32bit VM, or 160 in a 64bit one).
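The same arithmetic can be written down as a tiny helper. This is only a sketch of the cheat-sheet above: the class and method names are made up for illustration, and the byte figures use the same simplification as the post (4 bytes per word in a 32bit VM, 8 in a 64bit one).

```java
final class CacheEntryCost {
  /** Words per entry for a Cache/LoadingCache, following the cheat-sheet above. */
  static int words(boolean expiration, boolean maximumSize,
                   boolean weakKeys, boolean weakOrSoftValues) {
    int words = 12;                      // plain Cache/LoadingCache entry
    if (expiration) words += 4;          // expireAfterWrite and/or expireAfterAccess
    if (maximumSize) words += 4;         // maximumSize()
    if (weakKeys) words += 4;            // weakKeys()
    if (weakOrSoftValues) words += 4;    // weakValues() or softValues()
    return words;
  }

  public static void main(String[] args) {
    // The javadoc example: maximumSize() plus expiration.
    int w = words(true, true, false, false);
    System.out.println(w + " words, " + (w * 4) + " bytes (32bit), " + (w * 8) + " bytes (64bit)");
  }
}
```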

To estimate the overhead imposed in the garbage collector, one could count how many references (pointers) each entry introduces, which the garbage collector would have to traverse to compute object reachability. The same list again, this time only counting references:

If you use HashMap or ConcurrentHashMap, the cost is 5 references
If you switch to Cache/LoadingCache, the cost is 6 references
If you add expiration of either kind (or both), add 2 references
If you use maximumSize(), add 2 references
If you use weakKeys(), add 3 references
If you use weakValues() or softValues(), add 4 references

Thus, for example, if you use all features together (expiration, maximumSize(), weakKeys(), and weak or soft values), then each entry introduces 6+2+2+3+4 = 17 references. If you have 1000 entries in this cache, that would imply 17,000 references.

A simple piece of advice to close this post: since these numbers can quickly add up, be sure to use only those features that your application needs. Don't overdo it just because these features are easily accessible - if you can practically get away with fewer features, do so, and you'll get a leaner cache.

That said, I'm pleasantly surprised by the Cache implementation - that you pay exactly for what you use, and that the cost of various combinations of features is simply the sum of the costs of the individual features. This is not obvious, not given, and not an accident, but good engineering.

Few words on false sharing of cache lines

2011-08-11T19:56:00.001-07:00

Spent some quality time reading the native code behind java.lang.Object, particularly checking how synchronization is implemented (really complicated code, and critically important for the whole platform as you can imagine).

It is implemented like this (ton of details left out): of the 8 bytes of the object, 4 of them, if the object is used as a lock, are interpreted as a pointer to some ObjectMonitor, basically a queue of waiting threads. The implied CAS operation associated with entering or exiting a synchronized block is targeted (unless I misread it) at a word _inside_ the bytes of the Object, not the linked structure. This basically means that if you have a "locks" Object[8] array (initialized with "new Object()" in a loop), and you grab locks[3] from one thread, and locks[7] from another, these are likely to interfere with one another (CASes on the same cache line, one succeeding may spuriously cause the other to fail and retry). ReentrantLock is less susceptible to this problem, since it weighs in at 40 bytes instead of 8, but still. All of this is fixable through padding.

This is not to imply that only CAS operations on nearby locations are a problem; even plain writes/reads of nearby memory locations from different threads will cause false cache sharing and increased cache coherency traffic.
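To make "fixable through padding" a bit more concrete, here is a rough sketch of a padded lock object. PaddedLock and newLocks are hypothetical names, and the sketch assumes 64-byte cache lines; the JVM is free to lay out fields and place objects as it sees fit, so this only makes it less likely (not impossible) for two locks to share a line.

```java
final class PaddedLock {
  // Unused fields whose only job is to make each lock object roughly a cache line long,
  // so that the headers of neighbouring locks are unlikely to share a line.
  // Assumption: 64-byte cache lines; 7 longs = 56 bytes on top of the object header.
  long p1, p2, p3, p4, p5, p6, p7;

  /** Allocates n padded locks in a loop, like the "locks" array discussed above. */
  static PaddedLock[] newLocks(int n) {
    PaddedLock[] locks = new PaddedLock[n];
    for (int i = 0; i < n; i++) {
      locks[i] = new PaddedLock();
    }
    return locks;
  }
}
```

Usage is the same as with plain Objects: synchronized (locks[3]) { ... } in one thread, synchronized (locks[7]) { ... } in another.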

James Gosling at Google!

2011-03-28T14:02:00.000-07:00

Woohoo! James just joined Google!! Finally!

A note on API Design - call-stack as the source of a pervasive duality

2011-01-18T00:13:00.000-08:00

I'll talk about a few simple, often overlooked, yet pervasive and important, aspects of API design. They come by many names and many forms; if you have written code of any length, you have already faced at some level what I'm going to discuss. Long story short: almost any non-trivial API can be meaningfully designed in two, in some sense dual ways. Don't pick randomly. If your library is popular and frequently used, this matter has profound importance: chances are that users will press for an implementation of the dual, no matter which way you chose to offer - not always realizing that what they asked for is already implemented, just turned "inside out". If you don't realize this duality, you risk mindlessly duplicating lots of things, without realizing it, increasing the entropy of your library.

So my bet is that if you are doing API design, you are probably interested. (Edit: if you want to understand actors, you might also be interested!)

This is certainly not just a theoretical issue. Some well known manifestations of the phenomenon in the real world:

The introduction of StAX, while SAX was there
The many requests (or cries) for having Collection#forEach() kind of methods, to complement the #iterator()-based approach (related post by Neal Gafter)
Scala collections, which offer both iteration styles (passing a callback, or getting an iterator)
Servlets (which are callbacks) vs RIFE framework (recall their motto, "do scanf() over the web?" - they perceived that in this way they brought the "simplicity of console applications programming to the web".)
Callbacks, continuations, python's nifty "yield" (generators), trampolines, all come from the same place as well.

And that place is the good old call-stack.

Some (very) basic concepts up-front, to make sure we are on the same page. By "API", we usually refer to the layer through which user code and library code communicate. In Java, that is methods and the types they are defined in. Communication between these two parties (in either direction) typically happens by invoking methods: user code invoking library code, or the inverse (after the user code hands some callbacks to the library code, of course, since the library doesn't know its users). Method invocation occurs in a thread. The caller's stackframe issues the invocation to the callee, which pushes the callee's stackframe on the call-stack. Since it is a stack, the caller's stackframe (i.e., local variables + program counter) outlives that of the callee. Trivial fact, but not trivial consequences.

Now, consider any two state machines that need to work together. When you are using an iterator, you are interacting with a state machine. When you pass callbacks to a parser, you interact with a state machine. Your program is a state machine, anything you invoke is also one - but we will concern ourselves with non-trivial state machines - even a pure function is a state machine, but it has only a single, unchanging state.

Since we are talking about API design, say one state machine is provided as a "library", and the other is "user code", communicating through some API (I only need to address the case of *two* state-machines; for more, you can recursively apply the same arguments). To make it concrete, let's use as an example these very simple state machines:

A state machine that remembers/reports the maximum element it has seen
A state machine that produces a set of elements

And we want to combine them together so we compute the maximum element of a set.

In a single thread's call-stack, one of these must be deeper in the stack than the other, thus there are only two different configurations to arrange this interaction. One must play the role of the caller (for lack of a better name, I'll call this role active, because it is the one that initiates the invocations to the second), the other must play the role of the callee (passive - it reacts to invocations by the active one). The messaging works as follows: The caller (active) passes a message by invoking, the callee (passive) answers by returning (and it has to return, if the computation ever halts).

Let's see the two arrangements of the aforementioned state machines, in very simplified code:

//library - active
//(needs java.util.Iterator; the user code below also needs java.util.NoSuchElementException)
static int findMax(Iterator<Integer> integers) {
  int max = Integer.MIN_VALUE; // <-- call-stack encoded state
  while (integers.hasNext()) {
   max = Math.max(max, integers.next());
  }
  return max;
}

//user code - passive
class ValuesFromAnArray implements Iterator<Integer> {
  private final int[] array;
  private int position; //heap-encoded state

  ValuesFromAnArray(int[] array) {
   this.array = array;
  }

// API layer between library and user code
  public Integer next() {
   if (!hasNext()) throw new NoSuchElementException();
   return array[position++];
  }

// API layer between library and user code
  public boolean hasNext() {
   return position < array.length;
  }
}
(I don't concern myself with trivialities like how to wrap the Iterator into an Iterable).

Here, the library is active. The user code returns a value, and updates its state (which must be stored in the heap) so next time it will return the next value. The active library has an easier job - it only uses a local var for its state, and it doesn't have to explicitly store where it was (so it can come back to it) before calling the user code, since it implicitly uses the program counter to store that information. (And if you think about it, if you have a method and you have 8 independent positions, like nested if's of depth equal to 3, and you enter one of them, you effectively use the program counter to encode 3 bits of information. To turn this into a heap-based state machine, be sure to remember to encode those three bits in the heap as well!) The passive user code is basically a simplistic iterator (and iterators are a fine example of constructs designed to be passive).

The other option is this:

//user code - active
int[] array = ...;
MaxFinder m = new MaxFinder();
for (int value : array) {
  m.seeValue(value);
}
int max = m.max();

//library - passive
class MaxFinder {
  private int max = Integer.MIN_VALUE; //heap-encoded state

// API layer between user code and library
  void seeValue(int value) {
   max = Math.max(max, value);
  }

// API layer between user code and library

  int max() {
   return max;
  }
}

In this case, the user code is active and the library is passive. Again, the passive one is forced to encode its state in the heap.

Both work. All we did is a simple mechanical transformation to go from the one to the other. The question for the library designer:

Which option to offer? (Or offer both??)

That's not so trivial. Let's try to compare:

1. Syntax: There is no doubt that having the option to encode your state in the call-stack is a big syntactic win (big, but only syntactic). Not so much because of local vars, but because of the program counter: All those nifty for-loops, nested if statements and what not, look so ugly when encoded in the heap. For example, here is some active user code:

MaxFinder maxFinder = new MaxFinder();
for (int i = 0; i < 10; i++) {
  for (int j = 0; j < 5; j++) {
    maxFinder.seeValue(someComplicatedFunction(i, j));
  }
}

Easy. Let's see what happens if we try to transform this into passive user code:

return new AbstractIterator<Integer>() { //Guava's AbstractIterator for convenience
  int i = 0;
  int j = 0;

  protected Integer computeNext() {
    try {
      // i is the outer loop variable, so its bound is 10, as in the for-loops above
      return (i < 10) ? someComplicatedFunction(i, j) : endOfData();
    } finally {
      j++;
      if (j >= 5) {
        i++;
        j = 0;
      }
    }
  }
};

Ouch - that's like we took a shotgun and splattered the previous code into pieces. Of course, I'm biased: after having read countless for-loops, I tend to understand them more easily and think them simpler, compared to the individual pieces of them scattered here and there. "It's just syntax", in the end (mind you, the difference could be a lot worse though).

2. Composability: Now this is not aesthetics any more, but it's a more serious question between "works" or "doesn't work". If you offer your library as active, you force the user code to be passive. If the user code is already written, and is active, then it simply can't work with the active library. It is precisely this user that will complain to you about your library and/or ask you to offer a passive counterpart in addition to the active you already offered. One of the two parties must agree to use return to send a message back, instead of recursively invoking a method. If neither is passive, the result will be infinite recursion. An active participant requires control of the call-stack for the duration of the coordination (corollary, two actives can only work together if each one has its own call-stack - see "trampoline" below). But if you offer your library as passive, you don't preclude any option for the user (it can be either active or passive), and thus this is the most flexible.

To exemplify the last point, (this is an adaptation from Gafter's post) suppose we have two lists and we want to test whether they are equal. If we can produce iterators (i.e., passive state machines), it is easy:

//PassiveList offers #iterator()
boolean areEqual(PassiveList list1, PassiveList list2) {
  if (list1.size() != list2.size()) return false;
  Iterator i1 = list1.iterator();
  Iterator i2 = list2.iterator();

  while (i1.hasNext()) {
   // equals(), not !=, to compare elements by value (assumes non-null elements)
   if (!i1.next().equals(i2.next())) return false;
  }
  return true;
}

If one of the lists was active (had a #forEach() operation instead of giving an iterator), it is still implementable; just let the active one control the passive one:

//ActiveList offers #forEach(ElementHandler)
boolean areEqual(ActiveList activeList, PassiveList passiveList) {
  if (activeList.size() != passiveList.size()) return false;

  final Iterator iterator = passiveList.iterator();
  final boolean[] equal = { true };
  activeList.forEach(new ElementHandler() {
   public void process(Object e) {
   // equals(), not !=, to compare elements by value (assumes non-null elements)
   if (!e.equals(iterator.next())) equal[0] = false;
   } //don't nitpick about not breaking early!
  });
  return equal[0];
}

But in principle, it's not possible to implement the following method without spawning a second thread (again, see trampoline, below):

//both are active!
boolean areEqual(ActiveList list1, ActiveList list2);

3. Exceptions: Another (Java-specific) way that the two options differ is checked exceptions, unfortunately. If you create an active library, you have to anticipate that your passive user code may need to throw checked exceptions - so creating the API (implemented by user code) without knowing the checked exceptions beforehand is problematic and ugly (this has to somehow be reflected in the API signatures, and generified exceptions are not for the faint-hearted), whereas if you let the user code take the active position, it is oh-so-easy for the user code to wrap the whole thing in a try/catch or add a throws clause in the enclosing method (the enclosing method is not offered by your library and is no concern to you - if it is not your code that throws checked exceptions, you don't have to worry about them, and this is good). Oh well. Just another reason we all should be doing scala, eventually. Let's quickly sweep this under the carpet for now. Credit for noting this important caveat goes to Chris Povirk.

So where do these considerations leave us? My rules of thumb (not entirely contradiction-free):

Let the more complicated participant have the active position; that piece of code will benefit the most by having its stackframe available for encoding its state.
If you provide a library, offer it as passive - so you leave all options open for the user.

But what to do if you anticipate *the library* to be the more complicated participant? Do you choose to make it active and force user code to be passive? What if you are wrong in your estimation? It's a tough call. But in the end, if you are forced to offer your library also as active, at least place that version near the passive one, so it's apparent that these two perform the same function, and so the users have to look for this functionality at a single place, not two.
If though the library is supposed to offer parallelization to the user, there is no option: you have to make it active, because parallelization boils down to controlling how to invoke user's code. (See ParallelArray's active operations. This also offers a passive iterator, but it would be folly to use it)

That was the meat of this post. I mentioned trampolines above, so in the remaining part, I'll touch upon that.

Python generators are, indeed, beautiful. You write an active piece of code, and some python (alas, very inefficient) piece of magic transforms it into a passive construct (an iterator), which, recall, can be used with another active counterpart. Thus, you get the best of both worlds: both parties (e.g. producer and consumer, here) can be active!

//producer - active!
Iterator iterator = magicTrampoline(new AbstractProducer() {
  public void run() {
   for (int i = 0; i < 10; i++) {
    for (int j = 0; j < 5; j++) {
     produce(complicatedFunction(i, j));
    }
   }
  }
});

//consumer - active!
while (iterator.hasNext())
  consume(iterator.next());

This is certainly the best option for code readability. The cost is, we need an extra thread (more precisely, we need an extra call-stack). "Trampoline" is the common name of the port of this to Java. I wanted to link to some implementation here, but couldn't readily find a nice one. I'll just sketch the implementation anyway: the idea is that you let the producer run on a separate (from the consumer) thread. Thus message-passing between producer and consumer do not happen in the same call-stack, but via another means, say via a BlockingQueue (the producer puts, the consumer takes). So, the producer doesn't have to destroy its stackframe(s) after producing an element, but it can safely hold on to it (remember that this was not possible in the single-thread scenario, with message passing via the call-stack, that would lead to infinite recursion).
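For what it's worth, the sketch described above might look roughly like this. It is a minimal illustration and not a polished implementation: the Trampoline class and its helper methods are made up here, it uses Java 8 lambdas instead of the AbstractProducer shown earlier, it does not support null elements, and it does nothing clever about a consumer that stops iterating early (the producer thread is simply a daemon).

```java
import java.util.Iterator;
import java.util.NoSuchElementException;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.function.Consumer;

final class Trampoline {
  private static final Object END = new Object(); // sentinel: the producer has finished

  /** Runs the active producer on its own thread; returns a passive iterator over what it emits. */
  static <T> Iterator<T> magicTrampoline(Consumer<Consumer<T>> producer) {
    BlockingQueue<Object> queue = new ArrayBlockingQueue<>(1);
    Thread thread = new Thread(() -> {
      try {
        // The producer blocks inside put(), keeping its stackframes (and loop state) alive.
        producer.accept(value -> put(queue, value));
      } finally {
        put(queue, END);
      }
    });
    thread.setDaemon(true); // don't keep the JVM alive if the consumer stops early
    thread.start();

    return new Iterator<T>() {
      private Object pending; // next element, END, or null if nothing has been fetched yet

      @Override public boolean hasNext() {
        if (pending == null) pending = take(queue);
        return pending != END;
      }

      @SuppressWarnings("unchecked")
      @Override public T next() {
        if (!hasNext()) throw new NoSuchElementException();
        T result = (T) pending;
        pending = null;
        return result;
      }
    };
  }

  private static void put(BlockingQueue<Object> queue, Object value) {
    try {
      queue.put(value);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException(e);
    }
  }

  private static Object take(BlockingQueue<Object> queue) {
    try {
      return queue.take();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException(e);
    }
  }
}
```

The producer is then written with plain nested for-loops that call emit.accept(...) for each element, while the consumer keeps its own while (iterator.hasNext()) loop; every element costs a hand-off between the two threads.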

--EDIT
Sure I can find an implementation of the above - scala actors! The actors interact in an active (of course!) fashion, and this is possible because they communicate by exchanging messages, not invocations, thus no party needs to play the passive role. (Of course, underneath there is the good old BlockingQueue and the trampoline mechanics mentioned above. ) And yet, even in scala actors, there are the so called reactive actors - these don't hold on to their thread (and their call-stack), thus are not as wasteful as the regular actors, but, as per the discussion above, they are passive. This is the way one can tie actors in this discussion, and understand these through a different perspective.
--/EDIT

Since the cost for this is very high (not just the second thread itself, but the frequent blockings/context switches), it is not to be recommended in Java, unless we someday get much more lightweight threads, like those in Erlang. (And I can certainly imagine the joy of Erlang programmers, who can always readily use their call-stack to encode their complex state machines, instead of jumping through hoops just so they do not introduce another thread.)

(Another way to achieve this would be via continuations - which Java doesn't support natively.)

And a departing story: 4 years ago, I played with these ideas in the context of implementing complex gesture detection for a graph editor, amongst other graph related tasks. Use-case: implement drag and drop for nodes (a common task). This is what my experimental code looked like - quite elegant:

//active pattern detection!
Event firstEvent = awaitEvent();
if (firstEvent.isMouseDown()) {
  Event nextEvent;
  while ((nextEvent = awaitEvent()).isMouseDrag()) {
   continue;
  }
  if (nextEvent.isMouseUp()) {
   //drag and drop is complete!
  }
} else {
  ...
}

If you have read all the way to here (phew!), you can probably imagine how "fun" (ugh) would be transforming the pattern detection to a passive, say, MouseListener/MouseMotionListener (and translating those if's and else's and while's...). Be my guest and try (I'm kidding - just don't). This is the source of my first rule of thumb: it was via such, sufficiently complicated state machines that I observed how much the code can be simplified just through the act of making it active.

To better understand the last example and how it's relevant, note that Swing itself is an active state machine - remember, it owns the event-dispatching thread, thus user code is forced into the passive role. (Which is also consistent with my first rule of thumb: swing, the most complex machinery, taking up the active, more convenient role).

How long is the period of random numbers created by java.util.Random?

2010-12-12T23:19:00.000-08:00

The answer is, pretty darn long!

I feel a bit cheated for not knowing, even though I had seen the code several times over the years.

So it uses a good old linear congruential generator:

x = A * x + C

So, java.util.Random uses these two constants (A and C respectively):

private final static long multiplier = 0x5DEECE66DL;

private final static long addend = 0xBL;

Quite appropriate to place two "random" numbers in the Random class, right? Only these are not entirely random...

0x5DEECE66DL % 8 == 5

and

0xBL is odd.

So what?

It turns out, when these exact conditions are satisfied, the defined period is maximal, so the generated period (x is long) is 2^64. This piece of magic is documented here, page 9. The "why" of it is beyond me.

If any ninja mathematician comes across this post, please leave some explanation if you can.

EDIT: oops. I overlooked that java.util.Random applies a mask that keeps only 48 bits of the state, instead of 64, so correspondingly the period is 2^48 instead of 2^64; apparently in an attempt to increase the quality of the generated numbers. In any case, if you need a longer period, the technique described here will do the trick.
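The maximal-period claim is easy to check empirically on a scaled-down generator. As far as I can tell, the general result is the Hull-Dobell theorem: for a power-of-two modulus, the period is maximal exactly when C is odd and A % 4 == 1 (which A % 8 == 5 implies). The sketch below uses the same two constants but a modulus of 2^16 instead of 2^48 so that the loop finishes instantly; LcgPeriod and period are names made up for this illustration.

```java
final class LcgPeriod {
  /**
   * Counts how many steps x = (a * x + c) mod 2^bits takes to return to 0, starting from 0.
   * Requires an odd multiplier, so that the step is a permutation and 0 is reached again.
   */
  static long period(long a, long c, int bits) {
    long mask = (1L << bits) - 1;
    long x = 0;
    long steps = 0;
    do {
      x = (a * x + c) & mask; // masking keeps the low bits, i.e. reduces mod 2^bits
      steps++;
    } while (x != 0);
    return steps;
  }

  public static void main(String[] args) {
    long multiplier = 0x5DEECE66DL; // same constants as java.util.Random
    long addend = 0xBL;
    System.out.println("java.util.Random constants, 16 bits: " + period(multiplier, addend, 16));
    // A multiplier with a % 4 == 3 violates the condition and gives a shorter cycle.
    System.out.println("multiplier 3, 16 bits: " + period(3, addend, 16));
  }
}
```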

Juggling consistent hashing, hashtables, and good old sorting

2010-12-12T18:24:00.000-08:00

Consistent hashing can be summarized in one sentence: it is a hashtable with variable-length buckets.

You have a line segment (that's the key space). Breaking it into N equal parts (the buckets) means it is very easy (O(1)) to compute which part contains a given point. But if you increase or decrease N, all parts change (and shrink or grow respectively). Only the first and the last part have one of their ends unaffected, but the rest have both their ends moved.

Now consider breaking it in N parts, but not necessarily of the same length. Now it is easier to change the number of parts. Do you want more parts? Break one in two pieces. Want fewer parts? Merge two parts sitting next to each other. All other parts remain firm in their place, unaffected. But now finding which part contains a given point gets more complicated (O(logN) if you keep the parts sorted, say in a red-black tree).
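A minimal sketch of the variable-length variant, assuming the key space is the non-negative ints and using java.util.TreeMap (a red-black tree) to keep the parts sorted by their start point. VariableBuckets and its method names are invented for this illustration.

```java
import java.util.TreeMap;

final class VariableBuckets {
  // Maps the (inclusive) start point of each part to the bucket that owns it.
  private final TreeMap<Integer, String> starts = new TreeMap<>();

  VariableBuckets(String firstBucket) {
    starts.put(0, firstBucket); // initially one bucket covers the whole key space
  }

  /** O(log N): the owning bucket is the one with the greatest start <= point (point >= 0). */
  String bucketFor(int point) {
    return starts.floorEntry(point).getValue();
  }

  /** Splits the part containing 'at' in two; no other part moves. */
  void split(int at, String newBucket) {
    starts.put(at, newBucket);
  }

  /** Merges the part starting at 'start' into its left neighbour; no other part moves. */
  void merge(int start) {
    if (start == 0) throw new IllegalArgumentException("the first part has no left neighbour");
    starts.remove(start);
  }
}
```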

Hashtables themselves can be summarized in an interesting way: it is sorting...by hash (it doesn't matter if the original key space is not totally ordered - the hashes are). But if it is sorting, where is the logN factor in the complexity of hashtable operations? Well, there isn't one: hashes are small natural numbers (bounded by the hashtable size, which is O(N) of the elements in it). And what do you think of when you want to sort small natural numbers? Counting sort. (Caveat: hash functions do not just aim to map a large key space into a sequence of natural numbers, that's only part of the story. They also aim to approximate a random permutation of the original space, which is separately useful - to try to load balance the buckets when the incoming distribution is not uniform. In consistent hashing, we sacrifice the first role of hash functions, but we keep this latter one).

Trying to compress the discussion above:

Hashtables are really counting sort

(Yes, data structures are algorithms)
An equal piece of the (approximately permuted) key space is mapped to each bucket

Easy to find the corresponding bucket for a key (++)
Changing the number of buckets affects the mapped spaces of all buckets (--)

Consistent hashing differs in the following way:

A variable piece of the (approximately permuted) key space is mapped to each bucket

Harder to find the corresponding bucket for a key (--)
Easy to split or merge buckets, without affecting any other buckets (++)

Bonus section: Bigtables (and the good old Bigtable paper here). In Bigtable, the rows are primarily sorted by row key. The tablets (think buckets) are variable length, so that overloaded tablets can be split, without touching the huge amount of data that potentially resides in the same bigtable. (One can help load balancing the tablets by -did you guess it?- sorting by hash first...). So, here you go. The rows are sorted. Yet it can still rightfully be described as consistent hashing. Confused yet? Well, at least I tried. :)

Memory cost (per entry) for various Java data structures

2010-10-20T02:47:00.000-07:00

Have you seen this? http://code.google.com/p/memory-measurer/wiki/ElementCostInDataStructures
It shows the memory cost per entry for various common and not-so-common Java data structures. For example, have you wondered, when you put() an entry into a HashMap, how much memory footprint you create, apart from the space occupied by your key and your value (i.e. only by internal artifacts introduced by HashMap itself)? There you'll find the answers to such questions.

It also exemplifies the utility of my memory measuring tool, which I recently redesigned to decouple object exploration from measuring the memory usage (bytes) of objects, so I can use the same exploration to derive more qualitative statistics, like number of objects in an object graph, number of references, and number of primitives (which would directly reflect memory usage if it weren't for effects of alignment).

If you are designing data structures in Java, it would be silly not to use this tool. Seriously. (Profilers can be used to somewhat the same effect, but why fire it up and click buttons till you get where you want and take numbers and calculate by yourself, when you can do it more precisely with very simple code).

Future work: do the same for the immutable Guava collections.

Graph reachability, transitive closures, and a nasty historical accident in the pre-google era

2010-07-29T18:42:00.000-07:00

The last few days I've been feeling a bit like an archaeologist. It's probably a great story for explaining to a newcomer just how important the internet is, coupled with a great search engine. But I digress.

What follows is a pretty long post, covering a lot of ground in the graph reachability problem and transitive closures, enumerating various naive solutions (i.e., the ones you are likely to see used), then moving to much more promising approaches.

Introduction

Let's say you have a directed graph, G = (V, E), and you want to know real fast whether you can reach some node from some other, in other words whether there is a directed path u --> v, for any nodes u and v. (Take a moment to convince yourself that this is a trivial problem if the graph is undirected). Before we continue, let's stop and think where this might be useful. Consider a (really silly) class hierarchy (we also call it an ontology):

Testing whether there is a directed path between two nodes in a class hierarchy is equivalent to asking "is this class a subclass of another?". That's basically what the instanceof Java operator does (if we only consider classes, then subtyping in Java forms a tree - single inheritance, remember? - but if we add interfaces in the mix, we get a DAG, but thankfully, never a cycle).

Let's formally denote such a query by a function existsPath: V x V --> {1, 0}. It takes an ordered pair of nodes, and returns whether there is a path from the first to the second. Now, note that such an existsPath function is, in fact, the characteristic function of the transitive closure of the graph, that is, given any edge, we can apply this function to it to test whether it belongs to the transitive closure or not, i.e., whether it is one of the following edges:

When we talk about transitive closure, though, we mean something more than merely being able to answer graph reachability queries - we also imply that we possess a function successorSet: V --> Powerset(V), i.e., that we can find all ancestors of a node, instead of checking if a particular node is an ancestor of it or not.
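To illustrate how the two functions relate, here is a deliberately naive sketch: successorSet is computed by a plain graph traversal, and existsPath is then just a membership test on its result. The NaiveClosure class and the string-keyed adjacency map are assumptions made for this example (this is not Flexigraph code), and List.of requires Java 9 or later.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class NaiveClosure {
  private final Map<String, List<String>> edges; // node -> direct successors

  NaiveClosure(Map<String, List<String>> edges) {
    this.edges = edges;
  }

  /** successorSet: every node reachable from 'node' via a path of at least one edge. */
  Set<String> successorSet(String node) {
    Set<String> seen = new LinkedHashSet<>();
    Deque<String> worklist = new ArrayDeque<>(edges.getOrDefault(node, List.of()));
    while (!worklist.isEmpty()) {
      String next = worklist.pop();
      if (seen.add(next)) { // first visit: continue the traversal from here
        worklist.addAll(edges.getOrDefault(next, List.of()));
      }
    }
    return seen;
  }

  /** existsPath falls out of successorSet for free (though not cheaply, in this naive form). */
  boolean existsPath(String from, String to) {
    return successorSet(from).contains(to);
  }
}
```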

By the way, the above explanation is exactly the reason why in Flexigraph I define the interfaces (sorry, but blogger doesn't play very nicely with source code):

public interface PathFinder {
    boolean pathExists(Node n1, Node n2);
}

public interface Closure extends PathFinder {
    SuccessorSet successorsOf(Node node);
}

"Closure extends PathFinder" is basically a theorem expressed in code, that any transitive closure representation is also a graph reachability representation. That's the precise relation between these two concepts - I'm not aware of any Java graph library that gets these right. Continuing the self-plug, if you need to work with transitive closures/reductions, you might find the whole transitivity package interesting - compare this, for example, with JGraphT support, or yWorks support: a sad state of affairs. But I digress again.

Supporting graph reachability queries

Ok, let's concentrate now on actually implementing support for graph reachability queries. These are the most important questions to ask before we start:

How much memory are we willing to spend?
Is the graph static? Or are update operations supported (and which)?

Let's explore the design space based on some typical answers to the above questions.

Bit-matrix: The fastest reachability test is offered by the bit-matrix method: make a |V| x |V| matrix, and store in each cell whether there is a path from the row-node to the column-node. This actually materializes the transitive closure, in the sense that we explicitly (rather than implicitly, to some degree) represent it (and we already argued that representing the transitive closure also implies support for graph reachability queries).

Of course, this has some obvious shortcomings. Firstly, it uses O(|V| ^ 2) memory - but this is not as bad as it looks, and it would probably be a good solution for you if your graph is static. Which leads us to the second problem - it is terrible if you expect to frequently update your graph.
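A minimal sketch of the bit-matrix method, assuming nodes are numbered 0..n-1 and the graph is given as a list of edges. It fills the matrix with Warshall's algorithm over java.util.BitSet rows; BitMatrixClosure is a name made up for this example, and the constructor recomputes everything from scratch, which is exactly why updates are painful.

```java
import java.util.BitSet;

final class BitMatrixClosure {
  private final BitSet[] rows; // rows[u].get(v) == true iff there is a path u --> v

  BitMatrixClosure(int nodeCount, int[][] edges) {
    rows = new BitSet[nodeCount];
    for (int i = 0; i < nodeCount; i++) {
      rows[i] = new BitSet(nodeCount);
    }
    for (int[] edge : edges) {
      rows[edge[0]].set(edge[1]);
    }
    // Warshall: if i reaches k, then i also reaches everything k reaches.
    for (int k = 0; k < nodeCount; k++) {
      for (int i = 0; i < nodeCount; i++) {
        if (rows[i].get(k)) {
          rows[i].or(rows[k]);
        }
      }
    }
  }

  /** O(1) reachability test: a single bit lookup. */
  boolean pathExists(int from, int to) {
    return rows[from].get(to);
  }
}
```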

Adjacency-list graph representation: another way of representing the transitive closure is to just create an adjacency-list graph with all the edges of the transitive closure in it. In the example, that would mean to create a graph to represent the second image. (This is what JGraphT and yWorks offer - it's the simplest to implement). This has numerous disadvantages. First, this is a costly representation - such a representation is good for sparse graphs, but it is very easy for a sparse graph to have a dense transitive closure (consider a single path). Worse, if you represent each edge with an actual "Edge" object, then you will quite probably end up with a representation even costlier than the naive bitmap method - compare using a single bit to encode an edge vs using at least 12 bytes (at least, but typically quite more than that), i.e. 96 bits. Also, this representation doesn't even offer a fast reachability test - testing whether two nodes are connected depends on the minimum degree of those nodes, and this degree can be large in a dense graph. All in all, this representation is the least likely to be the best for a given application.

Agrawal et al: Now, lets see the approach in a very influential paper published in 1989: Efficient Management of Transitive Relationships in Large Data and Knowledge Bases. Lets see a figure from this paper, starting from the simple case of supporting reachability queries in a tree:

(Don't be confused by the edge directions, they just happened to drew them the other way around). Each node is labelled with a pair of integers - think of each pair as an interval in the X-axis. The second number is the post-order index of that node in the tree, while the first number is equal to the minimum post-order index of any descendant of that node. Lets quickly visualize those pairs as intervals:

As you can see, the interval of each node subsumes (is a proper superset of) the intervals of exactly its descendants. Thus, we can quickly test whether "e is a descendant of b" by asking "is 3 contained in the interval [1, 4]?", in O(1) time.
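The labelling just described can be sketched in a few lines. This is only an illustration under my own assumptions: the class IntervalTree, its Node type and the tiny example tree are hypothetical and are not the tree of the paper's figure. Note that the containment test below also answers true for a node against itself.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of post-order interval labelling of a tree.
class IntervalTree {
    static final class Node {
        final String name;
        final List<Node> children = new ArrayList<>();
        int low;  // minimum post-order index within this node's subtree
        int post; // post-order index of this node

        Node(String name) { this.name = name; }

        Node add(Node child) {
            children.add(child);
            return this;
        }

        // O(1): is other's post-order index inside [low, post]?
        // True for this node itself and for exactly its descendants.
        boolean subsumes(Node other) {
            return low <= other.post && other.post <= post;
        }
    }

    // Labels the subtree rooted at node, handing out post-order indices
    // starting at next; returns the next unused index.
    static int label(Node node, int next) {
        node.low = next; // the first node finished in this subtree gets this index
        for (Node child : node.children) {
            next = label(child, next);
        }
        node.post = next;
        return next + 1;
    }

    public static void main(String[] args) {
        Node d = new Node("d");
        Node e = new Node("e");
        Node b = new Node("b").add(d).add(e);
        Node c = new Node("c");
        Node a = new Node("a").add(b).add(c);
        label(a, 1);
        System.out.println(b.name + " = [" + b.low + ", " + b.post + "]");
        System.out.println(b.subsumes(e));
        System.out.println(c.subsumes(e));
    }
}
```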

Quick check: does it easily support graph updates? No! Inserting a new leaf under node g would require changing the numbers of g and everything on its right (that is, all the nodes of the graph!). And we are still only discussing trees. Let's see how this generalizes for acyclic graphs (this can be similarly generalized to general graphs, but I won't elaborate on that here). Again, a figure from the paper:

Note that some non-tree edges have been added, particularly d --> b (the rest of the inserted edges are actually redundant, yes, they could have used a better example). Also note that now d is represented by two intervals, [6,7] and [1,4]. The same invariants are observed as above - the interval set of each node subsumes any interval of exactly its descendants. Basically, we start with a tree (forest) of the graph, label it as previously, and then process all non-tree edges, by propagating all the intervals of the descendant to all its ancestors via that edge.

Second check: how does this cope with graph updates? Much worse! If, for example, we have to change the interval of g, we now must search potentially the whole graph and track where its interval might have been propagated, fixing all those occurrences with the new interval of g. Well, this sucks big time.

Now, let's try to justify the "historical accident" that the title promises. What if I told you that this problem was already solved 7 years before Agrawal et al reintroduced it? More importantly, what if I told you that (apparently!) hardly anyone has noticed this since then? It's hard to believe. For example, here is a technical report published recently, in 2006: Preuveneers, Berbers - Prime Numbers Considered Useful. Don't bother reading all those 50 pages - here is a relevant quote (shortly after having presented the Agrawal et al method - emphasis mine):

In the previous section several encoding algorithms for efficient subsumption or subtype testing were discussed. As previously outlined, the major concern is support for efficient incremental and conflict-free encoding. While most algorithms provide a way to encode new classes without re-encoding the whole hierarchy, they often require going through the hierarchy to find and modify any conflicting codes. The only algorithm that does not suffer from this issue is the binary matrix method, but this method is the most expensive one with respect to the encoding length as no compaction of the representation is achieved.

This statement is the authors' motivation for designing a new approach (a not too practical one, dare I say, requiring expensive computations on very large integers). They didn't know the solution.

Here is another example, published in 2004: Christophides - Optimizing Taxonomic Semantic Web Queries Using Labeling Schemes. This is a great informative paper surveying in depth a number of different graph reachability schemes, with a focus on databases. Christophides, actually, is my former supervisor, Professor at the University of Crete, and a good friend. Unfortunately, even though this paper actually cites the paper that contains the key idea, they also didn't know the solution (this I know first-hand).

Yet another example, published in 1999: Bommel, Beck - Incremental Encoding of Multiple Inheritance Hierarchies. Again, they didn't know the solution, so they ended up creating a more complicated, less-than-ideal work-around (see for yourself).

I don't even dare to look into all ~250 papers that cite Agrawal's work.

This is rather embarrassing. Before continuing, this kind of regression would have been highly unlikely if there was an efficient way (e.g. google scholar :) ) to search for relevant research. Here goes the obligatory memex link.

That being said, let's finally turn our attention to the mysterious solution that was missed by seemingly everyone. Dietz - Maintaining order in a linked list. This paper was published much earlier - in 1982, a year special to me since that's when I was born. Dietz introduces the (let's call it so) order-maintenance structure, defined by this interface:

Insert(x, y)
Inserts element x right after element y.
Delete(x) (Obvious)
Order(x, y)
Returns whether x precedes y.

So this is basically similar to a linked list, with the addition that we can ask (in constant time!) whether a particular list node precedes another. The actual data structure proposed to fulfill this interface has been superseded by Dietz and Sleator in Two Algorithms for Maintaining Order in a List (1988), and then recently by Bender et al in Two Simplified Algorithms for Maintaining Order in a List. These structures offer amortized O(1) time complexity for all these operations (there are also versions offering O(1) worst case, but they are significantly more complex to implement, and likely to be slower in practice). Here is my implementation of Bender's structure.

So how can this structure solve the incremental updates problem? Let's see Dietz' solution for the case of trees. Let's draw another tree.

Okay, so this is the same tree, and I have labelled the nodes with intervals, with one catch - the intervals do not comprise integers, but symbolic names. What's the deal, you say? Here's the deal!

That's it! The colors are there so you can easily match up the respective nodes. This is a linked list, or more precisely, the order-maintenance list. Any two nodes in it define an interval. For example, the [Pet_pre, Pet_post] interval (green in the picture) subsumes both Cat's and Dog's nodes - precisely because Pet is an ancestor of both! Note that, internally, these list nodes do in fact contain integers that observe the total order (e.g., are monotonically increasing from left to right) so we can test in O(1) whether a node precedes another. This is quite a simple structure, and there are very good algorithms for inserting a node in it - if there is no space between the integers of the adjacent nodes, an appropriate region of this list is renumbered.

Update: this is probably obvious, but better to be specific. So how do we add a descendant? Let's say we want to add a graph node X that is a subclass of Pet. We simply create two list nodes, X_pre and X_post, and insert them right before Pet_post. That would result in a chain Pet_pre --> [whatever nodes were already here] --> X_pre --> X_post --> Pet_post. Or we could add these right after Pet_pre, it's equivalent really. (As a coincidence, both Dietz and I chose the first alternative, to insert child_pre and child_post right before parent_post. I had a reason for this actually, there is a tiny difference because I'm using Integer.MIN_VALUE as a sentinel value, but that would be way too much detail for this post). Of course, to insert a node, say B, between two others, say A and C, we have to find an integer between label(A) and label(C), i.e. to make sure that we have label(A) < label(B) < label(C), without affecting any existing label relation. If label(C) = label(A) + 1 (i.e., consecutive, no space between them for a new node), we have to renumber some nodes to make space at the place of insertion. Compare with Dietz' Figure 4.1, in the "Maintaining order in a linked list" paper.
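To make the mechanics tangible, here is a deliberately naive sketch of an order-maintenance list and of tree nodes that hold a pair of list items instead of a pair of integers. Everything in it is an assumption made for illustration: the names (OrderMaintenanceSketch, Item, OrderList, TreeNode), the spacing constant, and above all the renumbering strategy, which simply relabels the whole list in O(n) whenever a gap runs out. The structures of Dietz and Sleator or Bender et al renumber only a small region and achieve amortized O(1); this sketch is not a substitute for them, nor is it the code of my own implementation.

```java
// A naive order-maintenance sketch; all names here are hypothetical.
class OrderMaintenanceSketch {
    static final class Item {
        final String name;
        long label; // the integer that realizes the total order
        Item prev, next;
        Item(String name) { this.name = name; }
    }

    static final class OrderList {
        private static final long GAP = 1L << 16; // spacing used when renumbering
        final Item head = new Item("head");       // sentinel, precedes everything
        final Item tail = new Item("tail");       // sentinel, follows everything

        OrderList() {
            head.next = tail;
            tail.prev = head;
            head.label = 0;
            tail.label = GAP;
        }

        // Insert(x, y): creates x right after y (y must not be the tail sentinel).
        Item insertAfter(Item y, String name) {
            if (y.next.label - y.label < 2) {
                renumber(); // no free integer between y and its successor
            }
            Item x = new Item(name);
            x.label = y.label + (y.next.label - y.label) / 2;
            x.prev = y;
            x.next = y.next;
            y.next.prev = x;
            y.next = x;
            return x;
        }

        // Creates x right before y (y must not be the head sentinel).
        Item insertBefore(Item y, String name) {
            return insertAfter(y.prev, name);
        }

        // Order(x, y): a single integer comparison.
        boolean precedes(Item x, Item y) {
            return x.label < y.label;
        }

        // Simplification: relabels the WHOLE list, evenly spaced, in O(n).
        private void renumber() {
            long label = 0;
            for (Item i = head; i != null; i = i.next) {
                i.label = label;
                label += GAP;
            }
        }
    }

    // A tree node is just a pair of list items; it never stores raw integers.
    static final class TreeNode {
        final Item pre, post;
        private final OrderList list;

        TreeNode(OrderList list, String name, TreeNode parent) {
            this.list = list;
            // Children go right before parent_post; roots go right before the tail.
            Item anchor = (parent == null) ? list.tail : parent.post;
            pre = list.insertBefore(anchor, name + "_pre");
            post = list.insertAfter(pre, name + "_post");
        }

        // Proper ancestor test: other's interval lies strictly inside this one.
        boolean isAncestorOf(TreeNode other) {
            return list.precedes(pre, other.pre) && list.precedes(other.post, post);
        }
    }

    public static void main(String[] args) {
        OrderList list = new OrderList();
        TreeNode pet = new TreeNode(list, "Pet", null);
        TreeNode cat = new TreeNode(list, "Cat", pet);
        TreeNode dog = new TreeNode(list, "Dog", pet);
        TreeNode x = new TreeNode(list, "X", pet);
        System.out.println(pet.isAncestorOf(x));
        System.out.println(cat.isAncestorOf(dog));
    }
}
```

The point to notice is that renumbering only touches the label fields of list items. Tree nodes keep pointing at the same Item objects, so nothing on the tree (or graph) side needs to be visited.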

While this might not seem such a big win for the case of trees, it is absolutely necessary in the case of graphs. This is what Agrawal et al missed! Instead of using integers directly, they should have used nodes of an order-maintenance structure. Why? Because when an interval does not have space to accommodate a new descendant, instead of widening the interval and then scanning the whole graph to find appearances of those integers to fix them, we make only local changes in the order-maintenance structure in O(1), without even touching the graph!

You can visualize the difference in the following example.

The segmented edges are non-tree edges, so they propagate the intervals upwards, thus, the H node ends up having 4 distinct intervals. Now, if instead of nodes such as E_pre and E_post, we had numbers (say, (5,6)), what would have happened if we wanted to add a descendant under E but it had no space in its interval? We would fix the boundaries of E's interval, and we would have to search the graph to see where the old boundaries have been propagated (i.e. in nodes F and H). With the order-maintenance structure, this is not needed. We locally renumber whichever node is required, and we do it once - we only renumber E_pre and E_post once, and due to node sharing (instead of copied values/integers!), it doesn't matter where these might have been propagated, all positions are automatically fixed too! Contrast this to having to visit F, G and H, find E_pre, E_post in their intervals, and fix them in all places. Let alone if we had to renumber a series of intervals, and hunt the propagations of all of them in the graph, and so on...

Just for completeness, this is what the order-maintenance list would look like, in the last example:

A_pre --> B_pre --> C_pre --> D_pre --> E_pre --> E_post --> D_post --> F_pre --> F_post --> C_post --> G_pre --> G_post --> B_post --> H_pre --> H_post --> A_post.

In this structure we perform the renumbering, not in the graph! Particularly, in some (typically small) region that contains the two nodes inside which we want to insert something.

So, there you go. This is the trick I (re)discovered on some Friday of last March, and over the following weekend I ended up coding http://code.google.com/p/transitivity-utils, which implements all of the above and more, so you don't have to. It is hard to believe that this went unnoticed for 20 years - and even harder to believe that the solution was there all along, waiting for someone to make the connection.

Now that's something to ponder about. Oh, and thanks, Dietz, great work!

If you managed to read all the way down to here, my thanks for your patience. That was way too long a post, but hopefully, you also gained something out of it. I regard this as a very interesting story, and wish I was a better storyteller than this, but here you go. :)

Google File System - the 'bandwidth' problem

2010-07-25T12:13:00.000-07:00

I am reading the very interesting Google File System paper. It describes the situation back in 2003, not sure how much things have changed since then. Before moving to the subject of this post, let's sum up some interesting points of that paper first (only based on my certainly incomplete understanding).

GFS clearly favors appending to files over random-access writing. If all mutations of a file are appends, sequential readers (readers that scan the whole of the file - which represents the vast majority of cases) also have extra consistency guarantees: in the face of concurrent modifications, the only "bad" thing that they may see is the file ending prematurely. But everything up to that point is guaranteed to be well defined and consistent.
A GFS file can be used as an excellent producer-consumer queue! Multiple producers may append ("record append") to the file, with no need for extra synchronization/distributed locking. This is a weaker form of append - the client does not choose where the data end up, the primary replica server of that file chooses this, and the clients get a guarantee that the update will occur at least once (well, not exactly once, but one can work around this by uniquely naming the records, if they aren't already named so). This seems much simpler and better than either having to establish distributed locking protocols between producers, or letting them choose the offset in the file where their append should take place, and trying to resolve potential (and possibly very frequent) collisions.
Automatic garbage collection of orphaned chunks (file pieces) is implemented.
There are quite weak concurrency guarantees for writers that write lots of data. The writer client needs to break up the data into acceptably small pieces, and each piece is written atomically - but if there are multiple writers doing the same, the final result is undefined, it could contain random fragments of the data of each writer.
File metadata updates are atomic. This is particularly nice, consider the following case: a writer writes to a file, and after it's done writing, it atomically renames the file to a name which it can then publish to readers. This is the analog of doing a compare-and-swap (CAS) in a shared memory multiprocessor (SMP), which, importantly, is the base of most lock-free algorithms. In particular, the writer that writes to a file (unknown to others) is like a thread doing thread-local computation, and only announcing the result in the end via an atomic pointer update (atomically update a root pointer, if nobody has already changed it, or in GFS, if nobody has created the intended filename first).

Now, on to the piece that I found curious enough to make a post about. A couple of key points:

When a client wants to write, it can pick whatever replica to initiate this action (presumably the replica closer to it - and in Google's case, they can infer the "closeness" of two nodes by inspecting their IPs).
The data must be stored (temporarily) to all replicas before the client notifies the (designated) primary replica that the write should be committed.

This leads to the problem: how do we move the client's data to all replicas? Google's answer is through a chain (a Hamiltonian path) through all replicas, starting at the replica that was contacted by the client. This is preferred over e.g. a dissemination tree, so as to bound the out-bandwidth required of each replica (perhaps the best dissemination could be a star-like tree, but too much burden would be placed on the central node, and so on).

Here is a picture taken from the aforementioned paper. Note the thick arrows, they show the chain that data goes through, from the client and to all replicas.

But how do we choose the next replica in the chain? I.e., why do we send the data from Replica A to the Primary Replica and then to Replica B, rather than the other way around? Google is using a simple, greedy rule: pick the replica (that hasn't seen the data yet) that is closest to the current one.

This looks like the well (ok, perhaps not so well) known graph-theoretic problem, the Bandwidth Problem. That is, we have a graph, and we want to create some linear order of the nodes, so that the edges end up as short as possible (e.g. if we put adjacent nodes side-by-side, the edges between them would have very low distance/cost). Strictly speaking, bandwidth minimizes the length of the longest edge, while the closely related minimum linear arrangement problem minimizes the total edge length. It's NP-Hard.

To visualize the bandwidth problem, I'm stealing a picture from Skiena's excellent book, The Algorithm Design Manual, which I highly recommend. (I guess a free ad here should be worth stealing a small picture :-/ ). (By the way, here is an excellent treatment of the problem by Uriel Feige, very conveniently containing approximation algorithms and heuristics).

Well, almost, but not exactly the same problem. The subtle difference is that in the Bandwidth Problem the cost is taken over all edges of the graph, while in the case of GFS, we only need to minimize the length of a single (any) Hamiltonian path (the picture just shows an easy case, where apart from a single path there are no more edges in the graph).

Ok, so I don't know yet how to state this problem. (While reading the paper, I thought it was the same problem, thus the title of this post). Let me do something else for now - let's try to create an example that maximizes the cost for GFS' heuristic. Here is what I came up with (click to see larger size):

I think that's the worst case scenario. The client chooses the replica in the middle, then we go to the right 1 step (because the one at the left is at distance 2), then we go to the left 3 steps (because the one at the right is at distance 4), and so on, going back and forth, with a total distance travelled = 1 + 3 + 7 + 15 = (2^1 - 1) + (2^2 - 1) + (2^3 - 1) + (2^4 - 1) = (2^5 - 2) - 4 = 26, or more generally 2^N - N - 1 for N replicas (N = 5 here), i.e. just under 2^N. (Of course, don't hold your breath on ever seeing such a topology in an actual GFS cluster!) The optimal solution here would be to go immediately to a boundary replica, either farthest left or right (choose the one closest), and then linearly visit the rest. That would yield a cost of (1 + 4) + (1 + 2 + 4 + 8) = 20 (1 + 4 to go to the farthest right replica, and then going all the way to the left), which for N = 5 is exactly 2^(N - 1) + 2^(N - 3) (it's fun to work this formula out!).
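For those who would rather check the arithmetic by machine, here is a small simulation of the greedy rule on replicas placed on a line. The coordinates -10, -2, 0, 1, 5 are my own reconstruction of the example from the step distances described above (the client contacts the replica at 0); the class GreedyChain and its method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical simulation of the greedy "closest unvisited replica" rule
// for replicas placed on a line.
class GreedyChain {
    // Total distance travelled when, starting from start, we repeatedly hop
    // to the nearest replica that hasn't seen the data yet.
    static long greedyCost(long[] positions, long start) {
        List<Long> unvisited = new ArrayList<>();
        for (long p : positions) {
            if (p != start) {
                unvisited.add(p);
            }
        }
        long current = start;
        long total = 0;
        while (!unvisited.isEmpty()) {
            Long nearest = unvisited.get(0);
            for (Long p : unvisited) {
                if (Math.abs(p - current) < Math.abs(nearest - current)) {
                    nearest = p;
                }
            }
            total += Math.abs(nearest - current);
            current = nearest;
            unvisited.remove(nearest); // removes by value, not by index
        }
        return total;
    }

    // On a line, the cheapest chain walks to the nearer end first and then
    // sweeps across to the other end.
    static long optimalCost(long[] positions, long start) {
        long min = start;
        long max = start;
        for (long p : positions) {
            min = Math.min(min, p);
            max = Math.max(max, p);
        }
        return (max - min) + Math.min(start - min, max - start);
    }

    public static void main(String[] args) {
        long[] replicas = {-10, -2, 0, 1, 5}; // gaps of 8, 2, 1, 4
        System.out.println("greedy:  " + greedyCost(replicas, 0));
        System.out.println("optimal: " + optimalCost(replicas, 0));
    }
}
```

Changing the coordinates makes it easy to try other layouts, for instance longer back-and-forth sequences.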

Now, how much worse is Google's heuristic than the optimal solution?

It turns out (if we assume that I did find the worst case scenario) that Google's solution is at most a factor of 8/5 (= 1.6) of the optimal solution! (Well, sure enough. To compare 2^N to 2^(N-1) + 2^(N-3), divide everything by 2^N, so we have 1 compared to 1/2 + 1/8, or 5/8). Wow! That's extremely good for something so simple! I didn't expect to find such an exact result - one can also compute this easily via WolframAlpha, which is way too cool for this kind of problem.

Nice, nice, nice. I thought I would find something ludicrously bad (in a completely unrealistic scenario), but it turns out, that's just 1.6 times the optimal solution at most!

/sheer happy - we haz it :)

Addendum:

Ah, the art of creating good corner case examples. It turns out the above is not the worst case, there is a worse one, and quite a bit simpler too (I had thought about this earlier, but for some reason I decided that forcing the algorithm to continually go back and forth would be the most expensive).

Consider this example (again, click to see in full size):

Each adjacent replica pair is separated by unit distance, while the last two are separated by two. We can make the path arbitrarily long, easily forcing the algorithm to choose a path which is a factor of 2 - ε longer than the optimal. And this is to be further restrained: I'm only considering geometric graphs, i.e. graphs where the triangle inequality holds. I have next to no idea whether actual network topologies resemble geometric graphs, and certainly one could construct much worse examples in graphs where the notion of distance is completely arbitrary (but distance in a network topology certainly isn't an arbitrary function). It's too late in the night to worry about that, so let's stop here hoping I'm still making some sense. :)

By the way, I recently bought Vazirani's Approximation Algorithms book. Neat, refreshing read, and highly recommended.

My oldies but goldies pet projects

2010-06-21T07:02:00.000-07:00

Finally! I took the time to upload some of my old pet projects on google code. Briefly:

Flexigraph: a life-saver of a graph library. :) Packed with several unique features, but I have no time to comment on them. Especially the Traverser API scoffs at all other java graph libraries of today (and this is several years old by now).
JBenchy: (my friends know this by the imaginative name "aggregator"). This is another life-saver if you are in the business of doing benchmarks/experiments and want a systematic way to store and analyze your collected points. Leverages the sheer expressive and analytical power of SQL GROUP BY statements, yet it hides all SQL from you so you can concentrate on your benchmark itself. It can also create nice diagrams for you, if you need some help with visualization.
MemoryMeasurer: Another uncharacteristically exotic name for a project! This little gem can compute the memory footprint of an arbitrary object graph / data structure. Did you ever wonder how much space a default HashMap takes? Or an ArrayList of 5 HashMaps? Or whatever? Well, this is the tool for you. Also note that you can use it with JBenchy to create benchmarks for data structures, if this happens to be your thing. Oh, by the way, a new HashMap() takes 120 bytes, in case you did wonder.

Good! All my pets in line, with a new home. Enjoy :)

Motivating Divide and Conquer paradigm

2010-05-06T09:30:00.000-07:00

Just the other day I went to my old University, met former teachers and collaborators... fun!

I also attended a lecture on algorithms. The topic: Divide and Conquer.

The professor presented a very simple example to make the technique apparent. The problem? Find the minimum element of a list. Instead of doing the usual (in pseudocode):

result = +oo;
foreach e in list {
  result = min(e, result)
}

The divide 'n conquer approach would be:

function findMin(list) {
  if (list.size == 1) return list[0]
  else {
    k = list.size / 2
    // subList(from, to) covers indices from .. to-1
    return min(
      findMin(list.subList(0, k)),
      findMin(list.subList(k, list.size)))
  }
}

One could ask, "what's the point?". In this example, this doesn't sound important at all. A student with good practical experience with programming would readily recognize that the latter, sophisticated approach is likely to perform worse than the straightforward iteration. Indeed, a student there asked, "do we gain anything in time complexity by following the latter approach in this example?". Unfortunately, the answer was no, both versions are O(n), so the exercise "is" pointless.

Quite some years after the period when I was an undergraduate student, and reflecting on that experience, that was my main gripe with how Computer Science was being taught. Without proper motivation! The student must somehow "guess" by him/herself the importance of the covered topics. An example dialogue that could ensue between students: "-Hey, what did you learn today? -Nothing much, something called 'divide and conquer', it's just a way to create complicated solutions when simple and as good solutions already exist".

Since I was attending, I thought I should intervene and give the very, very, very important motivation for learning this algorithm design paradigm. The reason is the all-too-known fact that CPUs have stopped getting faster, and they only become more numerous. This can be easily recognized, but it also has a tremendous implication in how we design software. Even for this simplistic problem, the first method of computing the minimum element is doomed. It may not fail tomorrow, it might not fail next year or the year after, but it's doomed nevertheless. Imagine you have a billion elements of which to find the minimum. Also imagine your run-of-the-mill computer has hundreds of processing cores (by the way, we should stop calling them CPUs/Central Processing Units, how "central" is something you have hundreds of?).

Here is the expression that the first method is trying to evaluate:

I.e., you first find the minimum of the first two elements, then you find the minimum of that and the third element, then the minimum of that and the fourth element, etc, etc. The point is, the Kth call (K>1) to min() must always wait first for the result of the (K-1)th call of min() to become available. Each min() invocation takes (say) a unit of time, so the final computation cannot consume less than K time units, no matter whether you have a single core or thousands of them. The so called "critical path" (the longest path of things that must happen sequentially), as is also apparent in the diagram, has length K. Bad.

This is what happens with divide and conquer instead:

Still, the root depends on two other min() invocations, and must wait for them before evaluating itself. But those two are not dependent on each other, they may run in parallel. The critical path here has length only logK. Thus, given enough cores, divide and conquer can solve this problem in O(logN) time instead of O(N), without leaving your excess of processing cores to sit (and waste energy) idly.

As I explained to the student, this epitomizes programming of the future. Of course that's a hyperbole, but hyperboles are useful to drive a point home. And it is certainly better than presenting something dry to the student, without any hint of its importance - that's the recipe to make him/her not pay attention, and soon forget about this weird, exotic thingy.

That's all for now! Oh, and if you want to actually implement algorithms like this, and you code in Java, the fork/join framework is something you'll definitely want to learn.
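As a rough illustration of what this looks like with fork/join, here is a minimal sketch of the divide and conquer minimum as a RecursiveTask. The class name ParallelMin and the THRESHOLD value are arbitrary choices of mine, and ForkJoinPool.commonPool() assumes Java 8 or later; below the threshold the task simply falls back to the plain loop, since splitting all the way down to single elements would cost more than it gains.

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

// Hypothetical sketch: divide and conquer minimum on top of fork/join.
class ParallelMin extends RecursiveTask<Integer> {
    // Below this size we just iterate sequentially (arbitrary cut-off).
    private static final int THRESHOLD = 1 << 12;

    private final int[] data;
    private final int from; // inclusive
    private final int to;   // exclusive

    ParallelMin(int[] data, int from, int to) {
        this.data = data;
        this.from = from;
        this.to = to;
    }

    @Override
    protected Integer compute() {
        if (to - from <= THRESHOLD) {
            int result = data[from];
            for (int i = from + 1; i < to; i++) {
                result = Math.min(result, data[i]);
            }
            return result;
        }
        int mid = (from + to) >>> 1;
        ParallelMin left = new ParallelMin(data, from, mid);
        ParallelMin right = new ParallelMin(data, mid, to);
        left.fork();                        // left half may run on another core
        int rightMin = right.compute();     // right half runs in this thread
        int leftMin = left.join();          // wait for the left half
        return Math.min(leftMin, rightMin); // the two halves never waited on each other
    }

    static int min(int[] data) {
        if (data.length == 0) {
            throw new IllegalArgumentException("empty array");
        }
        return ForkJoinPool.commonPool().invoke(new ParallelMin(data, 0, data.length));
    }

    public static void main(String[] args) {
        int[] data = new int[1_000_000];
        for (int i = 0; i < data.length; i++) {
            data[i] = data.length - i;
        }
        System.out.println(min(data));
    }
}
```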

My solution to Matrix-Chain-Order problem (Chapter 15 of CLRS)

2010-04-13T17:02:00.000-07:00

For practice, I tried solving the problem "Matrix-Chain-Order" that appears in section 15.2 of Introduction to Algorithms (that's the chapter on Dynamic Programming). In short, one gets a list of (compatible) matrices, A1, A2, A3... An, and the problem is to compute the optimal order for creating their product.

I liked my solution, so I'm posting this just for the record. This should be helpful if one is reading that section and looks for a Java implementation of that problem (in order to compare with his own solution of course :)). To make it tougher, I tried writing this first on paper, then in my IDE, to see how many errors would slip into the paper version. There were two: a single off-by-one error in the inner loop (damn!), and that I forgot some components of the cost of a particular multiplication expression.

It creates optimal multiplication expressions of increasing width. Initially width == 1, and we just have the leaves, the matrixes themselves. Next, we find the optimal expressions of width == 2, but there is only one order in which to multiply 2 matrixes, so nothing special happens. After that, things get more interesting, since we get to create various possible trees for each sequence of matrixes, and retain the best.

Amazingly, it is exactly one year (to the day!) ago that I posted a similar exercise in tree building (enumerating binary trees, also via dynamic programming): http://code-o-matic.blogspot.com/2009/04/wonderful-programming-exercise.html
I think I need more diversity :)

Anyway. The main method is this:

   public static void main(String[] args) {
     Op op = matrixChainOrder(Arrays.asList(
         new Matrix(30, 35),
         new Matrix(35, 15),
         new Matrix(15, 5),
         new Matrix(5, 10),
         new Matrix(10, 20),
         new Matrix(20, 25)));
     System.out.println(op);
     System.out.println("Cost: " + op.cost());
   }

And it prints:

(([30 X 35] * ([35 X 15] * [15 X 5])) * (([5 X 10] * [10 X 20]) * [20 X 25]))

Cost: 15125

(This mimics the example and solution of the book, but it also creates the expression of the multiplication, easy to pretty-print and ready to be computed).

The full code:

import java.util.*;

public class Matrixes {
  public static void main(String[] args) {
    Op op = matrixChainOrder(Arrays.asList(
        new Matrix(30, 35),
        new Matrix(35, 15),
        new Matrix(15, 5),
        new Matrix(5, 10),
        new Matrix(10, 20),
        new Matrix(20, 25)));
    System.out.println(op);
    System.out.println("Cost: " + op.cost());
  }

  static Op matrixChainOrder(List<Matrix> matrixes) {
    // Maps each interval [i..j] of the chain to its cheapest expression.
    Map<Interval, Op> optima = new HashMap<Interval, Op>();
    // Width 0: the leaves, i.e. the matrixes themselves.
    for (int i = 0; i < matrixes.size(); i++) {
      optima.put(new Interval(i, i), new Leaf(matrixes.get(i)));
    }
    // Build optimal expressions over ever wider intervals.
    for (int width = 1; width < matrixes.size(); width++) {
      for (int offset = 0; offset < matrixes.size() - width; offset++) {
        Op best = DUMMY;
        // Try every position at which the interval can be split in two.
        for (int cut = 0; cut < width; cut++) {
          Op left = optima.get(new Interval(offset, offset + cut));
          Op right = optima.get(new Interval(offset + cut + 1, offset + width));
          Op mul = new Mul(left, right);
          if (mul.cost() < best.cost()) {
            best = mul;
          }
        }
        optima.put(new Interval(offset, offset + width), best);
      }
    }
    return optima.get(new Interval(0, matrixes.size() - 1));
  }

  // Placeholder that is costlier than any real expression.
  private static final Op DUMMY = new Op() {
    public int cost() { return Integer.MAX_VALUE; }
    public Matrix compute() { throw new AssertionError(); }
    public int rows() { throw new AssertionError(); }
    public int columns() { throw new AssertionError(); }
  };

  private static class Interval {
    final int begin;
    final int end;

    Interval(int begin, int end) { this.begin = begin; this.end = end; }

    public boolean equals(Object o) {
      if (!(o instanceof Interval)) return false;
      Interval that = (Interval)o;
      return this.begin == that.begin && this.end == that.end;
    }

    public int hashCode() { return 31 * begin + end; }

    public String toString() { return "[" + begin + ".." + end + "]"; }
  }
}

class Matrix {
  final int rows; final int columns;

  Matrix(int rows, int columns) { this.rows = rows; this.columns = columns; }

  int rows() { return rows; }

  int columns() { return columns; }

  public String toString() { return "[" + rows + " X " + columns + "]"; }
}

interface Op {
  int cost();
  Matrix compute();
  int rows();
  int columns();
}

class Leaf implements Op {
  final Matrix matrix;

  Leaf(Matrix matrix) { this.matrix = matrix; }

  public int cost() { return 0; }

  public Matrix compute() { return matrix; }

  public int rows() { return matrix.rows(); }

  public int columns() { return matrix.columns(); }

  public String toString() { return matrix.toString(); }
}

class Mul implements Op {
  final Op left;
  final Op right;

  Mul(Op left, Op right) {
    this.left = left; this.right = right;
  }

  public int rows() {
    return left.rows();
  }

  public int columns() {
    return right.columns();
  }

  // Cost of this multiplication itself, plus the cost of producing both operands.
  public int cost() {
    return left.rows() * left.columns() * right.columns() + left.cost() + right.cost();
  }

  public Matrix compute() { throw new UnsupportedOperationException("later"); }

  public String toString() { return "(" + left + " * " + right + ")"; }
}

Announcing transitivity utilities project

2010-03-15T15:34:00.000-07:00

Check out my new project: http://code.google.com/p/transitivity-utils/

This is the recycled message I sent about this at the guava mailing list:

--------

I have implemented some new data structures that may be of interest to someone. It is about maintaining transitive relations efficiently, both in terms of memory and time complexity. Internally it uses interval-encoding for elements, which for example can handle reachability queries in trees in just O(1) time and O(1) memory per element. It's not good for graphs that are close to full bipartite graphs - then memory degenerates to O(N) per element and O(logN) for testing reachability.

Historically, the motivation for this line of research has been encoding inheritance relations in knowledge bases. (For a more familiar example, this can be used to directly implement the instanceof operator of Java, but I don't suggest that!).

The central type amounts to:

interface TransitiveRelation<E> {
  void relate(E subject, E object);
  boolean areRelated(E subject, E object);
}

This should be considered for any algorithm that relies on reachability queries.

I regard the API as rather stable (except that I might need better names here and there).
----
Boy, the effort it takes.

Long story short, I believe I've found the sweetest spot in the design space for this problem. It contains magic sauce found nowhere else. The gist of the underlying solution is so (deceptively?!) simple that it's hard not to ask: "how could that be overlooked for 20 years?". The details will wait for a future post (but the source is there already).

Find the average between two ints (facing possible overflows/underflows)

2010-02-19T04:42:00.000-08:00

How do you compute the average between two ints in Java?

How about...
(x + y) / 2

Well, that doesn't work when (x + y) can overflow, and in fact this buggy implementation was lurking in the binary-search and mergesort implementations of JDK till recent years. (Obligatory Josh Bloch link: http://googleresearch.blogspot.com/2006/06/extra-extra-read-all-about-it-nearly.html)

It's an interesting puzzle for you to meditate and try out yourself - it looks simple, but it's not. I have yet to find a simple solution that works without assumptions on x, y - just any ints.

Bloch's first solution is this:

x + (y - x) / 2

I.e. starting with x and adding half the distance to y. Except that this assumes that x and y are both non-negative, otherwise y - x can itself overflow!

The second solution:

(x + y) >>> 1

Pretty nifty. If (x + y) does overflow, the high bit of the number (which determines the sign) will become one, i.e. negative, but >>> 1 will do the division and put a zero there. But what if the true sum is negative, or (x + y) underflows (x, y are big negative numbers)? Then >>> 1 still puts a zero in the sign bit: for x = y = -2 it returns 2147483646. Oops.

So neither solution works for all cases.
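To make the failures concrete, here is a small sketch that runs the three candidate formulas on inputs chosen to break each of them. The class and method names are mine, picked only for this illustration.

```java
// Hypothetical demo class: each method is one of the formulas discussed above.
final class NaiveAverages {
    static int sumThenHalve(int x, int y) {
        return (x + y) / 2;          // breaks when x + y overflows
    }

    static int halfDistance(int x, int y) {
        return x + (y - x) / 2;      // breaks when y - x overflows
    }

    static int unsignedShift(int x, int y) {
        return (x + y) >>> 1;        // breaks when the sum is negative
    }

    public static void main(String[] args) {
        int max = Integer.MAX_VALUE;
        int min = Integer.MIN_VALUE;
        // The exact average of (max, max) is max, but the sum wraps around to -2.
        System.out.println(sumThenHalve(max, max));
        // The exact average of (min, max) is -0.5, but y - x wraps around to -1.
        System.out.println(halfDistance(min, max));
        // The exact average of (-2, -2) is -2, but >>> clears the sign bit.
        System.out.println(unsignedShift(-2, -2));
    }
}
```

The three calls print -1, -2147483648 and 2147483646 respectively, none of which is anywhere near the true average.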

Still working out a solution. In my case, I only need that average(x, y) != x && average(x, y) != y, when x and y are not consecutive ints. Any helping hand appreciated :)

My current solution:

static int avg(int x, int y) {
  return diffSign(x, y) ? (x + y) / 2 : x + (y - x) / 2;
}

static boolean diffSign(int x, int y) {
  return (x ^ y) < 0;
}

diffSign returns true iff x and y do not have the same sign. In that case only, (x + y) is safe. Otherwise, (y - x) is safe, so I go for that option. This solution works for corner cases (like Integer.MIN_VALUE) too. Well, I don't think I can get it any simpler (assuming it's really correct). Can you?

-edit-
It turns out there is a much more elegant solution, which is provably correct, found here.

static int avg(int x, int y) {
  return (x & y) + (x ^ y) / 2;
}
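Why does this work? Bit by bit, x + y equals 2 * (x & y) + (x ^ y): the AND collects the carries and the XOR the carry-free sum. Halving that gives (x & y) + (x ^ y) / 2, and no intermediate value can overflow. The sketch below (class and helper names are mine) checks the formula against a reference computed in long arithmetic. Since the division truncates, the result is one of the two ints adjacent to the exact mean, not always the floor.

```java
// Hypothetical checker for the bit-trick average; names are illustrative only.
final class BitTrickAverage {
    static int avg(int x, int y) {
        return (x & y) + (x ^ y) / 2;
    }

    // True if candidate is the exact mean of x and y rounded down or rounded up.
    static boolean isRoundedMean(int x, int y, int candidate) {
        long sum = (long) x + y;                    // cannot overflow in 64 bits
        long floor = Math.floorDiv(sum, 2L);
        long ceil = floor + Math.floorMod(sum, 2L); // same as floor when the sum is even
        return candidate == floor || candidate == ceil;
    }

    // Checks every ordered pair of the given sample values.
    static boolean holdsForAllPairs(int[] samples) {
        for (int x : samples) {
            for (int y : samples) {
                if (!isRoundedMean(x, y, avg(x, y))) {
                    return false;
                }
            }
        }
        return true;
    }

    static int[] cornerCases() {
        return new int[] {
            Integer.MIN_VALUE, Integer.MIN_VALUE + 1, -3, -2, -1,
            0, 1, 2, 3, Integer.MAX_VALUE - 1, Integer.MAX_VALUE
        };
    }
}
```

Calling `BitTrickAverage.holdsForAllPairs(BitTrickAverage.cornerCases())` is a quick sanity check over the usual suspects. It is not a proof, but the identity above is.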

Wow!..

Traits defining equals, combined with case classes

2010-01-13T14:47:00.000-08:00

I just contributed a "common programming mistake for Scala developers" to the relevant thread over at Stack Overflow (which seems to be turning into a mania, given that I spent quite valuable time there myself without really thinking about it). It describes a pitfall arising from the interplay of case classes, traits, and equality defined in the latter. By no means can this really be a "common" mistake, but it is not obvious either, so some help to keep it uncommon doesn't hurt.

Thoughts on Actors

2010-01-08T14:46:00.000-08:00

This discussion (started today) on the scala mailing list relates to understanding the usefulness of Akka, and more generally, actors.

Somebody suggested comparing using actors to using locks directly. The following are my comments, initially meant as a response, but I ended up summarizing many of my current concerns/questions regarding the actor programming model.

This contrast between actors and low-level concurrent programming (e.g. locks) is misleading. It's not as if there are actors, then a huge void, and then locks, where we get to choose an extreme. There are tons of things in-between. For example, message passing in a single VM is trivial to implement on top of BlockingQueues (or, soon, TransferQueues). There already exists the executor framework, and the fork/join framework, to provide thread pools and fine-grained parallelism.

My take is that actors provide a simplified, more elegant programming model than using the underlying tools directly. At their core, typical actors are a Runnable accompanied by a BlockingQueue (mailbox), while reactors are event listeners. A strong point of actors, as the Haller/Odersky paper shows, is that they unify thread-based and event-based models - one can use and combine either under the same framework. This programming model is still young and requires exploration to find its best use cases and fully appreciate it. As much as anything, this too needs an "Effective Actors" type of book. It is easy to go wrong too, especially for beginners trying to wrap their heads around MPI-like programming. Deadlocks are still possible (actors waiting forever for messages that will not come), race conditions are still possible (an actor giving up on waiting for a reply, right before the actual reply arrives), it's not like the usual suspects of concurrent programming have magically vanished. (Edit: Probably I'm wrong to classify the last case as a race condition, it's likely just a data race, following the nomenclature of JCiP).
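To make the "Runnable plus BlockingQueue" picture concrete, here is a bare-bones sketch in plain Java. It has nothing to do with how Scala actors or Akka are actually implemented. `MiniActor`, its `STOP` sentinel and the `demo()` helper are all names invented for this illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.function.Consumer;

// Hypothetical minimal "actor": a Runnable draining its own mailbox.
final class MiniActor implements Runnable {
    // Unique sentinel, compared by identity, that tells the actor to terminate.
    private static final String STOP = new String("stop");

    private final BlockingQueue<String> mailbox = new LinkedBlockingQueue<>();
    private final Consumer<String> behavior;

    MiniActor(Consumer<String> behavior) {
        this.behavior = behavior;
    }

    // Asynchronous send: enqueue and return immediately.
    void send(String message) {
        mailbox.add(message);
    }

    void stop() {
        mailbox.add(STOP);
    }

    @Override
    public void run() {
        try {
            while (true) {
                String message = mailbox.take(); // blocks while the mailbox is empty
                if (message == STOP) {
                    return;
                }
                behavior.accept(message);        // one message at a time
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    // Starts an actor on its own thread, sends two messages, and waits for it to finish.
    static List<String> demo() {
        List<String> seen = Collections.synchronizedList(new ArrayList<>());
        MiniActor actor = new MiniActor(seen::add);
        Thread thread = new Thread(actor);
        thread.start();
        actor.send("hello");
        actor.send("world");
        actor.stop();
        try {
            thread.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return new ArrayList<>(seen);
    }
}
```

Note that the blocking `take()` pins a whole thread per actor, which is exactly the kind of cost the event-based side of the actor model tries to avoid.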

Moreover, the simplification has its cost too - it's not easy, at least for me, to reason about performance implications. For example, assuming scala actors that depend on ForkJoinScheduler (i.e. using the fork/join framework), this quotation from the javadocs of ForkJoinPool is interesting:

A ForkJoinPool may be constructed with a given parallelism level (target pool size), which it attempts to maintain by dynamically adding, suspending, or resuming threads, even if some tasks are waiting to join others. However, no such adjustments are performed in the face of blocked IO or other unmanaged synchronization.

This leads to some obvious questions which I can't answer easily at all:

What are the (performance) implications of using (blocking) IO in actors? (I haven't seen similar warnings given to actors users).
Noting that tasks are never joined, all receive() blocking calls fall under "unmanaged synchronization" as per the javadoc, so what are the implications of this fact?

So, simplification also seems to come at the cost of hiding possibly important optimizations, like letting a thread that needs to block in order to join() subtasks go and execute other tasks while waiting (via helpJoin()).

I'm not sure what the conclusion should be. Hopefully in 3-4 years collective experience will be substantial and we will better understand how these shiny new tools are best used, and when the underlying concurrency utilities should be used instead. Personally, as of now, while I am eager to experiment with actors, I feel more at home with more low-level tools, so I can more easily reason about the performance characteristics of my code. Hopefully someone will take on the task of writing a good Scala actors book - current books are OK, but Scala is new, so they are devoted mostly to Scala, and perhaps have a chapter on actors, which is too little to go anywhere beyond the very basics.

Barbara Liskov talk and vintage photo

2009-12-26T04:35:00.000-08:00

A few days ago I was happy to see this talk given by Barbara Liskov:

http://www.infoq.com/presentations/liskov-power-of-abstraction

A few days earlier I had submitted a paper to ESWC10 (found here, titled "Flexible Ranking and Matchmaking for Semantic Service Discovery"), which happened to include a reference to Liskov and her well-known substitution principle (at the bottom of the 3rd page).

It was not a big deal, just "common sense" reiterated. As you will notice from the talk above, though, in the 70s this kind of "common sense" that we take for granted today was debatable and unclear - just as debatable as whether the goto statement was evil or not!

One thing that impressed me most in that talk was a photo from the 70s that Barbara shared with the audience. I had never seen her young!

I will leave the photo uncommented - it speaks for itself. -edit: well, I commented it after all, see below :)

Beware of recursive set union building!

2009-10-20T12:30:00.001-07:00

The excellent Google Collections library spoils many of us, but that doesn't mean we can afford not to be alert when using it!

Observe how easy it is, for example, to create the (live) union of two sets: Sets.union(setA, setB).

One might be tempted to write code like the following, to make the union of all sets in "S":


Set<E> union = Collections.emptySet();
for (Set<E> someSet : S) {
  union = Sets.union(someSet, union);
}

Wow! This builds the union in just O(|S|) time! Sure enough, accessing the elements of the union is a different issue, but how slow could it be? (Note that we do pass the smaller set as the first argument, in agreement with what the javadocs suggest).

Well, it turns out, this is quite slow. Iterating over the elements of the union takes O(N|S|), which for really small sets can be up to O(N^2), where N is the number of all elements of the union. In comparison, creating a single HashSet and calling addAll() to add each set in S to it takes only O(N) time.

To understand the issue, consider the algorithm for computing the elements of the union of sets A and B:


report all elements in A
for all items x in B
 if !A.contains(x)
   report x

Now consider this union: union(A, union(B, union(C, D))).

This is how the union's iterator would report its elements:

1) Iterate and report all elements of A

2) Iterate elements of B, if they are not in A, report them

3) Iterate elements in C, if they are not in B, then if they are not in A, report them

4) Iterate elements in D, if they are not in C, then if they are not in B, then if they are not in A, report them

See the pattern there? Well, that's it. Just resist the temptation to make a recursive union, that's all. (I haven't looked into the matter deeply, but I think this shouldn't be affecting recursive intersection or recursive difference).
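To put a number on that pattern, here is a small sketch that simulates the iteration order described above and counts the contains() calls. It does not use Sets.union at all. It only models the steps listed above, and `NestedUnionCost` with its methods is a made-up name.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Set;

// Hypothetical model of iterating union(A, union(B, union(C, ...))).
final class NestedUnionCost {
    // sets.get(0) plays the role of A, sets.get(1) of B, and so on.
    static long membershipChecks(List<Set<Integer>> sets) {
        long checks = 0;
        for (int i = 0; i < sets.size(); i++) {
            for (Integer x : sets.get(i)) {
                // An element of the i-th set is tested against every enclosing
                // set, innermost first, until one of them already contains it.
                for (int j = i - 1; j >= 0; j--) {
                    checks++;
                    if (sets.get(j).contains(x)) {
                        break; // duplicate, filtered out at this level
                    }
                }
            }
        }
        return checks;
    }

    // k disjoint one-element sets: the worst case for this pattern.
    static List<Set<Integer>> disjointSingletons(int k) {
        List<Set<Integer>> sets = new ArrayList<>();
        for (int i = 0; i < k; i++) {
            sets.add(Collections.singleton(i));
        }
        return sets;
    }
}
```

For four disjoint singletons this counts 0 + 1 + 2 + 3 = 6 checks, and for 1000 of them 499500, i.e. k(k-1)/2: quadratic in the number of sets.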

So, in this case, creating a big HashSet and dumping all elements in it is the way to go. It is a bit of a pity that a HashSet is really a HashMap in disguise, i.e. horribly wasteful (compared to what a genuine HashSet implementation should be), but that's life in Java :)
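For completeness, a sketch of the eager alternative using only java.util. The class name `EagerUnion` and the `demo()` helper are mine.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper: copies every element once into a single HashSet.
final class EagerUnion {
    static <E> Set<E> unionOf(Iterable<? extends Set<? extends E>> sets) {
        Set<E> union = new HashSet<>();
        for (Set<? extends E> set : sets) {
            union.addAll(set); // O(|set|) expected time, no nested views
        }
        return union;
    }

    static Set<Integer> demo() {
        List<Set<Integer>> sets = new ArrayList<>();
        sets.add(new HashSet<>(Arrays.asList(1, 2)));
        sets.add(new HashSet<>(Arrays.asList(2, 3)));
        sets.add(new HashSet<>(Arrays.asList(3, 4)));
        return unionOf(sets);
    }
}
```

Unlike the live view, the result is a snapshot: later changes to the input sets are not reflected in it, which is the price paid for the O(N) iteration.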

Till next time,

Bye bye!

Using JConsole to monitor...JConsole

2009-10-08T19:18:00.001-07:00

I was in the mood for some recursive monitoring, so I fired up a JConsole process and ordered it to monitor itself. I managed to make it show the stack trace of the thread that had the task to show the stack trace of the thread that had the....... you get the idea :)

For what it's worth, here is the stack trace:


sun.tools.jconsole.Worker.add(Worker.java:56)
sun.tools.jconsole.Tab.workerAdd(Tab.java:73)
  - locked sun.tools.jconsole.ThreadTab@c829e3
sun.tools.jconsole.ThreadTab.valueChanged(ThreadTab.java:316)
javax.swing.JList.fireSelectionValueChanged(JList.java:1765)
javax.swing.JList$ListSelectionHandler.valueChanged(JList.java:1779)
javax.swing.DefaultListSelectionModel.fireValueChanged(DefaultListSelectionModel.java:167)
javax.swing.DefaultListSelectionModel.fireValueChanged(DefaultListSelectionModel.java:137)
javax.swing.DefaultListSelectionModel.setValueIsAdjusting(DefaultListSelectionModel.java:668)
javax.swing.JList.setValueIsAdjusting(JList.java:2110)
javax.swing.plaf.basic.BasicListUI$Handler.mouseReleased(BasicListUI.java:2788)
java.awt.AWTEventMulticaster.mouseReleased(AWTEventMulticaster.java:273)
java.awt.Component.processMouseEvent(Component.java:6263)
javax.swing.JComponent.processMouseEvent(JComponent.java:3267)
java.awt.Component.processEvent(Component.java:6028)

I'm monitoring a process that should take about 2 hours, so yes, I do have a lot of time on my hands :)

Rapidshare can compress any file to just 16 bytes!!!

2009-09-18T14:29:00.000-07:00

Or so it claims:

there is a program which calculates the MD5 checksum for each file immediately after the upload. The MD5 checksum is a 16 bytes value which is always the same when calculated for the same file. This value is stored in connection with the file. If you upload and distribute an illegal file, we will delete the file upon notification and add the MD5 checksum to a blacklist, so the same file cannot be uploaded again.

(Emphasis mine).

Now, I really, really, really want to see their blacklist implementation, since obviously it holds the key to the dark art of infinite compression, which is the Holy Grail Of Computer Science, the Universe, and Everything. Yes, the notorious Holy Grail that nobody wants to touch because apparently it's the favorite pooping place for some very special pigeons...
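Joking aside, the mechanism in the quote is easy to sketch with the JDK's own MessageDigest. The class below is purely hypothetical (`Md5Blacklist` and its methods are my names, and I have no idea what the real implementation looks like). It shows that the digest is 16 bytes no matter how large the input is, which is precisely why it can identify a file only up to collisions.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashSet;
import java.util.Set;

// Hypothetical checksum blacklist, sketched from the description in the quote.
final class Md5Blacklist {
    private final Set<String> blocked = new HashSet<>();

    static byte[] md5(byte[] content) {
        try {
            return MessageDigest.getInstance("MD5").digest(content);
        } catch (NoSuchAlgorithmException e) {
            // MD5 is one of the algorithms every Java platform must provide.
            throw new IllegalStateException(e);
        }
    }

    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }

    // Remembers only the digest of the content, never the content itself.
    void block(byte[] content) {
        blocked.add(hex(md5(content)));
    }

    // True for the blocked content, and for anything else that collides with it.
    boolean isBlocked(byte[] content) {
        return blocked.contains(hex(md5(content)));
    }
}
```

A 100 KB array and an empty array both digest to exactly 16 bytes. So much for infinite compression.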

JComboBox pure craziness

2009-09-14T12:35:00.000-07:00

Someone asked me why his ItemListener, attached to a JComboBox, was getting two events when he selected something on it. Weird.

I looked up JComboBox's javadocs, just to be sure, and... this is what I found:

Adds an ItemListener.
aListener will receive one or two ItemEvents when the selected item changes.

I understand Swing is a huge framework and it has its rough edges... but this? It certainly looks like someone couldn't fix this strange behavior, and decided to make it a documented feature instead. Way to go!

The best way to enumerate directed acyclic graphs

2009-07-29T17:35:00.000-07:00

Here is a tricky problem for the algorithmically oriented people to ponder: "enumerate all directed acyclic graphs (DAGs) of a given size n".

I'll describe the process I followed in tackling this problem (the result is rather fascinating), but if you find this challenging enough, it would be good to stop reading for a while and see if you can solve it (hopefully in a good way).

First, some background. "Are you nuts? Enumerating graphs? You'll get huge numbers of them almost immediately!". Well, true. For example, here is the integer sequence that shows just how many distinct (non-isomorphic) general graphs there are. It gets out of hand really quickly. So why bother? I'm trying to develop an algorithm that takes a DAG as input, but it's quite hard to analyze and/or compare it with existing algorithms. So I'm interested in exhausting all possible inputs up to a certain size and counting the logical steps of the algorithm for each problem instance, which can offer me useful feedback on the algorithm design, or show me which graphs produce the worst behavior of the algorithm, for which graphs the algorithm performs better than other algorithms, and so on.

Now, on to business. After some research, it seems there is no known sequence out there defining the number of unique DAGs per node count. The closest I found is the number of partially ordered sets. A partially ordered set is also a DAG, with the further restriction that its transitive reduction is itself (i.e. no redundant edges exist). Since there are many DAGs which yield the same transitively reduced DAG, it is obvious that this sequence is a strict lower bound on the number of DAGs.

My first approach was rather brave, but futile nevertheless. I started by defining the sole graph of size 1 (no self loops), which is just a node. Then I iteratively increased the size N of the generated graphs. At each iteration, the Nth node of the graphs is created and combined with all graphs of size N-1 in every possible way. How many ways are there? We are only allowed to connect the new node to the older nodes (not the other way around), so that we only generate DAGs, so there are N-1 possible edges to add. The power set of these has 2^(N-1) elements, and this is every possible way that a given graph of size N-1 can be extended with a new node.

This produced the following numbers of DAGs:

2^0 = 1 (one node)
2^1 = 2 (two nodes)
2^3 = 8 (three nodes)
2^6 = 64 (four nodes)
2^10 = 1024 (five nodes)
2^15 = 32768 (six nodes)
2^21 = 2097152 (seven nodes)
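The exponents in this list are just 0 + 1 + ... + (N-1) = N(N-1)/2, since the k-th node can be attached to the k-1 older ones in 2^(k-1) ways. A tiny sketch of that arithmetic (the class name is mine, and BigInteger is used only so the numbers don't overflow for larger N):

```java
import java.math.BigInteger;

// Hypothetical helper reproducing the counts listed above.
final class IncrementalDagCounts {
    // Number of graphs produced by the incremental construction for n nodes.
    static BigInteger count(int n) {
        BigInteger total = BigInteger.ONE; // the single one-node graph
        for (int k = 2; k <= n; k++) {
            // Node k may or may not connect to each of the k-1 older nodes.
            total = total.multiply(BigInteger.ONE.shiftLeft(k - 1));
        }
        return total;
    }

    public static void main(String[] args) {
        for (int n = 1; n <= 8; n++) {
            System.out.println(n + " nodes: " + count(n));
        }
    }
}
```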

The next milestone was 2^28 (268435456) graphs, but I ran out of memory. :) Still, storing more than two million graphs of 7 nodes and up to 21 edges each is a formidable task. I managed to pack all those graphs in about 120 MB of RAM, meaning less than 60 bytes for each graph, which for an adjacency-list implementation is pretty impressive. (But an adjacency matrix where each possible edge is represented as a single bit would be the most economical representation). This is possible because each graph of size N shares N-1 of its nodes, along with their adjacency lists, with the graph of size N-1 that it extends, so the cost of each graph is more or less the cost of the final node and its list. (This is the kind of stuff where persistent data structures really excel, but in Javaland the reusable implementations are few and far between - did I mention yet you should check out Scala?)

Then, I talked with Martin, a colleague/PhD student at the Univ. of Bath, who mostly works on NP-hard problems, and apart from a long discussion on "why on earth do you want to have a freak of nature like this one???" and other related sub-discussions, he suggested enumerating all undirected general graphs (not directed, not acyclic: this is what is offered by nauty), and then producing all DAGs from each graph. Quite a huge amount of work: 2^(n(n-1)) number of graphs, multiplied by the number of DAGs created from each (which can be up to 2^(n-1)(n-2) / 2). But most importantly, he mentioned that nauty enumerates the graphs without having to store the smaller ones, which made me challenge my approach of generating the DAGs.

From there, it only took a few seconds to stumble on the correct solution, which is very simple. See the following table:

This table is meant to be a graph's adjacency matrix. The rows, as well as the columns, represent the graph's nodes. Each cell can be 0 or 1, to denote whether there is an edge from the row-node to the column-node. Note that the nodes of the matrix are actually in topological order (which is always defined for DAGs) - a node/row is only allowed to connect to nodes after/up of it, not before/down of it. Thus, in this matrix, the gray area holds the edges that, if allowed, would violate the defined topological order. Only the white cells can contain 1, but they can also contain 0 of course. So what do we have here? For a DAG of n nodes, we have this n x n matrix, and n(n-1)/2 cells which can independently vary between 0 and 1 (in agreement with the counts above, e.g. 21 cells for seven nodes). The solution from here is easy: just arrange these cells as bits in a bit string, i.e. a binary number, start with zero, and increment it by one at each step, till the number consists of only 1's. Constant storage space, just a number of n(n-1)/2 bits! And a trivial way to enumerate the DAGs: transform the enumeration problem into the problem of... adding 1 to a binary number. All it takes is decoding the number into the respective DAG. Isn't this elegant? :)

To sum up, this procedure creates 2^(n(n-1)/2) DAGs, in an equal number of iterations. In contrast to the suggested method, it has half the exponent, i.e. for n = 10, this method would take 2^45 steps/DAGs, while using nauty would create 2^90 general graphs, where each graph would be subsequently transformed to many, many DAGs.

Addendum: The above description leaves a tricky part uncommented. We saw how to generate all DAGs with n nodes, i.e. :

int currentGraph = 1 << (n * (n - 1) / 2); //one bit per possible edge; an int only fits n <= 8

while (currentGraph > 0) {

currentGraph--; //represents a DAG!

}

But how to interpret these numbers as graphs? We have to be able to answer, for all graphs, whether there is an edge (i --> j), i.e. connecting node i to node j. Here is the implementation of this test, by Nelly Vouzoukidou, a graduate cs student at Univ. of Crete:


boolean hasEdge(int graph, int i, int j) {
  return i > j && isSet(graph, i * (i - 1) / 2 + j);
}

//just checks whether a given bit of a number is set
boolean isSet(int number, int bit) {
  return (number & (1 << bit)) != 0;
}

You can see the second picture to understand the bit layout that this formula represents. There is a nice geometric interpretation of it too: "i * (i - 1) / 2" is the area of the triangle above the selected row. To that, we add "j" to go to the desired cell, since in every row, each cell represents the bit following that of the cell to its left.
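Putting the pieces together, here is a self-contained sketch of the whole enumeration. It follows the bit layout of the formula above but uses a long instead of an int, so it would cope with n up to 11 (55 bits), not that anyone would wait for that loop to finish. The class and method names are mine.

```java
// Hypothetical end-to-end sketch: every number below 2^(n(n-1)/2) is one DAG.
final class DagEnumerator {
    // Number of cells below the diagonal, i.e. of possible edges i -> j with i > j.
    static int edgeBits(int n) {
        return n * (n - 1) / 2;
    }

    // Same layout as the formula above: rows of the triangle stored one after another.
    static boolean hasEdge(long graph, int i, int j) {
        return i > j && ((graph >>> (i * (i - 1) / 2 + j)) & 1L) != 0;
    }

    // Decodes one number into a readable edge list such as "1->0 2->1".
    static String edges(long graph, int n) {
        StringBuilder sb = new StringBuilder();
        for (int i = 1; i < n; i++) {
            for (int j = 0; j < i; j++) {
                if (hasEdge(graph, i, j)) {
                    if (sb.length() > 0) {
                        sb.append(' ');
                    }
                    sb.append(i).append("->").append(j);
                }
            }
        }
        return sb.toString();
    }

    // Visits every encoded DAG of n nodes and sums up their edge counts.
    static long totalEdges(int n) {
        long total = 0;
        long limit = 1L << edgeBits(n);
        for (long graph = 0; graph < limit; graph++) {
            for (int i = 1; i < n; i++) {
                for (int j = 0; j < i; j++) {
                    if (hasEdge(graph, i, j)) {
                        total++;
                    }
                }
            }
        }
        return total;
    }

    public static void main(String[] args) {
        int n = 3;
        for (long graph = 0; graph < (1L << edgeBits(n)); graph++) {
            System.out.println(graph + ": " + edges(graph, n));
        }
    }
}
```

For n = 3 the main method lists all 8 encodings, from the empty graph (0) to the fully connected one (7, i.e. "1->0 2->0 2->1").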

If you found this interesting, you might want to check out an older post about enumerating all binary trees, where also an amusing solution is produced.

This is how real men do garbage collection!

2009-07-06T13:56:00.001-07:00

I was quite fascinated today to see the way the famous JDK figure Martin Buchholz does garbage collection. Does anyone believe they take garbage more seriously than that?

Enjoy:


private static final Runtime rt = Runtime.getRuntime();
static void gcTillYouDrop() {
    rt.gc();
    rt.runFinalization();
    final CountDownLatch latch = new CountDownLatch(1);
    new Object() {
      protected void finalize() {
        latch.countDown();
      }
    };
    rt.gc();
    try {
      latch.await();
    }
    catch(InterruptedException ie){
      throw new Error(ie);
    }
}

"Hyper-paranoid" indeed, in the words of Kevin.

Funniest Java Interface

2009-06-18T10:26:00.001-07:00

interface sun.net.www.http.Hurryable {
 boolean hurry();
}

:-)

Funny ConcurrentModificationException while playing with Google's Multimap

2009-06-16T03:31:00.000-07:00

(Disclaimer: Google's Multimap surely works fine and as advertised - this post describes a potentially confusing code interaction which can result in unintuitive ConcurrentModificationExceptions)

First, let's create a multimap:

Multimap<Key, Value> multimap = HashMultimap.create();

Put some values into it somehow:

putSomeValues(multimap);

Now let's iterate over its key set, and potentially filter some elements of each key's collection:

for (Key key : multimap.keySet()) {
  Collection<Value> values = multimap.get(key);

  ...
  if (something) {
    Value someValue = ...;
    values.remove(someValue);
  }
}

(We could iterate its entries() as well, which is supposedly faster, but for now, I won't bother).

This seems safe. I iterate the keys, and don't perform any structural modification in the multimap, I might only make a collection inside it shorter.

Well, no. It can blow up if "values" becomes empty, since then the multimap will remove the respective entry from the map (this is a very desirable behavior, to be sure), thus structurally modifying the map, resulting in a ConcurrentModificationException when the iterator tries to fetch the next key.

This may be quite nasty if the "values" collection only rarely goes empty and triggers this behavior, so it's good to know.

My solution is this:

Iterator<Key> keyIterator = multimap.keySet().iterator();
while (keyIterator.hasNext()) {
  Key key = keyIterator.next();
  Collection<Value> values = multimap.get(key);

  ...
  if (something) {
    Value someValue = ...;
    if (values.size() == 1 && values.contains(someValue)) {
      keyIterator.remove();
      continue; //this is not really needed
    }
    values.remove(someValue);
  }
}

Much simpler would be to change this:

for (Key key : multimap.keySet()) {

To this:

for (Key key : Lists.newArrayList(multimap.keySet())) {

If you don't mind creating a copy of the entire key set up-front.
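The same trap can be reproduced without Guava, which may help to see that it is the key removal and not the value removal that bites. The sketch below is only an analogy: it emulates the "drop the key when its collection becomes empty" behavior with a plain HashMap of sets, and `EmptyBucketDemo` and its methods are names invented for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.ConcurrentModificationException;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for a multimap, built on a plain HashMap of sets.
final class EmptyBucketDemo {
    // Removes a value and, like a multimap, drops the key once no values remain.
    static void removeValue(Map<String, Set<Integer>> map, String key, Integer value) {
        Set<Integer> values = map.get(key);
        if (values != null && values.remove(value) && values.isEmpty()) {
            map.remove(key); // structural modification of the outer map
        }
    }

    static Map<String, Set<Integer>> sample() {
        Map<String, Set<Integer>> map = new HashMap<>();
        map.put("a", new HashSet<>(Arrays.asList(1)));
        map.put("b", new HashSet<>(Arrays.asList(1, 2)));
        map.put("c", new HashSet<>(Arrays.asList(1)));
        return map;
    }

    // Iterates the live key set while values (and therefore keys) disappear.
    static boolean naiveLoopThrows() {
        Map<String, Set<Integer>> map = sample();
        try {
            for (String key : map.keySet()) {
                removeValue(map, key, 1);
            }
            return false;
        } catch (ConcurrentModificationException e) {
            return true;
        }
    }

    // Same loop over a copy of the keys: safe, at the cost of the up-front copy.
    static Map<String, Set<Integer>> copyKeysFirst() {
        Map<String, Set<Integer>> map = sample();
        for (String key : new ArrayList<>(map.keySet())) {
            removeValue(map, key, 1);
        }
        return map;
    }
}
```

Here `naiveLoopThrows()` returns true, because keys "a" and "c" both lose their only value and at most one of them can be the last key visited. After `copyKeysFirst()` only "b" is left, holding the single value 2.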