## Thursday, July 29, 2010

### Graph reachability, transitive closures, and a nasty historical accident in the pre-google era

Last few days I'm feeling a bit like an archaeologist. It's probably a great story to explain to a newcomer just what's the importance of the internet, coupled with a great search engine. But I digress.

What follows is a pretty long post, covering a lot of ground in the graph reachability problem and transitive closures, enumerating various naive solutions (i.e., the ones you are likely to see used), then moving to much more promising approaches.

Introduction

Let's say you have a directed graph, G = (V, E), and you want to know real fast whether you can reach some node from some other, in other words whether there is a directed path u --> v, for any nodes u and v. (Take a moment to convince yourself that this is a trivial problem if the graph is undirected). Before we continue, lets stop and think where this might be useful. Consider a (really silly) class hierarchy (we also call it an ontology):

Testing whether there is a directed path between two nodes in a class hierarchy is equivalent to asking "is this class a subclass of another?". That's basically what the instanceof Java operator does (if we only consider classes, then subtyping in Java forms a tree - single inheritance, remember? - but if we add interfaces in the mix, we get a DAG, but thankfully, never a cycle).

Lets formally denote such a query by a function existsPath: V x V --> {1, 0}. It takes an ordered pair of nodes, and returns whether there is a path from the first to the second. Now, lets make sure that having such an existsPath function is, in fact, the characteristic function of the transitive closure of the graph, that is, given any edge, we can apply this function to it to test whether it belongs to the transitive closure or not, i.e., whether it is one of the following edges:

When we talk about transitive closure, though, we mean something more than merely being able to answer graph reachability queries - we also imply that we possess a function successorSet: V --> Powerset(V). i.e., that we can find all ancestors of a node, instead of checking if a particular node is an ancestor of it or not.

By the way, the above explanation is exactly the reason why in Flexigraph I define the interfaces (sorry, but blogger doesn't play very nicely with source code):

 public interface PathFinder { boolean pathExists(Node n1, Node n2); }

 public interface Closure extends PathFindr { SuccessorSet successorsOf(Node node); }

"Closure extends PathFinder" is basically a theorem expressed in code, that any transitive closure representation is also a graph reachability representation. That's the precise relation between these two concepts - I'm not aware of any Java graph library that gets these right. Continuing the self-plug, if you need to work with transitive closures/reductions, you might find the whole transitivity package interesting - compare this, for example, with JGraphT support, or yWorks support: a sad state of affairs. But I digress again.

Supporting graph reachability queries

Ok, lets concentrate now on actually implementing support for graph reachability queries. These are the most important questions to ask before we start:
• How much memory are we willing to spend?
• Is the graph static? Or update operation are supported (and which)?
Lets explore the design space based on some typical answers to the above questions.

Bit-matrix: The fastest reachability test is offered by the bit-matrix method: make a |V| x |V| matrix, and store in each cell whether there is a path from the row-node to the column-node. This actually materializes the transitive closure, in the sense that we explicitly (rather than implicitly, to some degree) represent it (and we already argued that representing the transitive closure also implies support for graph reachability queries).

Of course, this has some obvious shortcomings. Firstly, it uses O(|V| ^ 2) memory - but this is not as bad as it looks, and it would probably be a good solution for you if your graph is static. Which leads us to the second problem - it is terrible if you expect to frequently update your graph.

Adjacency-list graph representation: another way of representing the transitive closure is to just create an adjacency-list graph with all the edges of the transitive closure in it. In the example, that would mean to create a graph to represent the second image. (This is what JGraphT and yWorks offer - it's the simplest to implement). This has numerous disadvantages. First, this is a costly representation - such a representation is good for sparse graphs, but it is very easy for a sparse graph to have a dense transitive closure (consider a single path). Worse, if you represent each edge with an actual "Edge" object, then you are quite probably end up with a representation even costlier than the naive bitmap method - compare using a single bit to encode an edge vs using at least 12 bytes (at least, but typically quite more than that), i.e. 96bits. Also, this representation doesn't even offer fast reachability test - testing whether two nodes are connected depends on the minimum degree of those nodes, and this degree can be large in a dense graph. All in all, this representation is the lest likely to be the best for a given application.

Agrawal et al: Now, lets see the approach in a very influential paper published in 1989:  Efficient Management of Transitive Relationships in Large Data and Knowledge Bases. Lets see a figure from this paper, starting from the simple case of supporting reachability queries in a tree

(Don't be confused by the edge directions, they just happened to drew them the other way around). Each node is labelled with a pair of integers - think of each pair as an interval in the X-axis. The second number is the post-order index of that node in the tree, while the first number is equal to the minimum post-order index of any descendant of that node. Lets quickly visualize those pairs as intervals:

As you can see, the interval of each node subsumes (is a proper superset of) the intervals of exactly its descendants. Thus, we can quickly test whether "e is a descendant of b" by asking "is 3 contained in the interval [1, 4]?", in O(1) time.

Quick check: does it easily support graph updates? No! Inserting a new leaf under node g would require changing the numbers of g and everything on its right (that is, all the nodes of the graph!). And we are still only discussing about trees. Lets see how this generalizes for acyclic graphs (this can be similarly generalized to general graphs, but I won't elaborate on that here). Again, a figure from the paper:

Note that some non-tree edges have been added, particularly d --> b (the rest of the inserted are actually redundant, yes, they could have used a better example). Also note that now d is represented by two intervals, [6,7] and [1,4]. The same invariants are observed as above - the interval set of each node subsumes any interval of exactly its descendants. Basically, we start with a tree (forest) of the graph, label it as previously, and then process all non-tree edges, by propagating all the intervals of the descendant to all its ancestors via that edge.

Second check: how does this cope with graph updates? Much worse! If, for example, we have to change the interval of g, we now must search potentially the whole graph and track where its interval might have been propagated, fixing all those occurrences with the new interval of g. Well, this sucks big time.

Now, lets try to justify the "historical accident" that the title promises. What if I told you that this problem was already solved 7 years earlier than Agrawal reintroduced it? More importantly, what if I told you that (apparently!) hardly anyone noticed this since then? It's hard to believe. For example, here is a technical report published recently, in 2006: Preuveneers, Berbers - Prime Numbers Considered Useful . Don't bother reading all those 50 pages - here is a relevant quote (shortly after having presented the Agrawal et al method - emphasis mine):

In the previous section several encoding algorithms for eﬃcient subsumption or subtype testing were discussed. As previously outlined, the major concern is support for eﬃcient incremental and conﬂict-free encoding. While most algorithms provide a way to encode new classes without re-encoding the whole hierarchy, they often  require going through the hierarchy to ﬁnd and modify any conﬂicting codes. The only algorithm that does not suﬀer from this issue is the binary matrix method, but this method is the most expensive one with respect to the encoding length as no compaction of the representation is achieved.
This statement is the authors' motivation for design a new approach (a not too practical dare I say, requiring expensive computations on very large integers). They didn't know the solution.

Here is another example, published in 2004.Christophides - Optimizing Taxonomic Semantic Web Queries Using Labeling Schemes. This is a great informative paper surveying at depth a number of different graph reachability schemes, with a focus on databases. Christophides, actually, is my former supervisor, Professor at the University of Crete, and a good friend. Unfortunately, even though this paper actually cites the paper that contains the key idea, they also didn't know the solution (this I know first-hand).

Yet another example, published in 1999: Bommel, Beck - Incremental Encoding of Multiple Inheritance Hierarchies. Again, they didn't know the solution, so they ended up creating a more complicated, less-than-ideal work-around (see for yourself).

I don't even dare to look into all ~250 papers that cite Agrawal's work.

This is rather embarrassing. Before continuing, this kind of regression would have been highly unlikely if there was an efficient way (e.g. google scholar :) ) to search for relevant research. Here goes the obligatory memex link.

That being said, lets finally turn our attention to the mysterious solution that was missed by seemingly everyone. Dietz - Maintaining order in a linked list. This paper was published much earlier - in 1982, a year special to me since that's when I was born. Dietz introduces the (lets call it so) order maintainance structure, defined by this interface:
• Insert(x, y)
Inserts element x right afterwards element y.
• Delete(x)  (Obvious)
• Order(x, y)
Returns whether x precedes y.
So this is basically similar to a linked list, with the addition that we can ask (in constant time!) whether a particular list node precedes another. The actual data structure proposed to fulfill this interface has been superseded by Dietz and Sleator in Two Algorithms for Maintaining Order in a List (1988), and then recently by Bender et al in Two Simplified Algorithms for Maintaining Order in a List. These structures offer amortized O(1) time complexity for all these operations (there are also versions offering O(1) worst case, but they are significantly more complex to implement, and likely to be slower in practice). Here is my implementation of Bender's structure.

So how can this structure solve the incremental updates problem? Let's see Dietz' solution of the case of trees. Lets draw another tree.

Okay, so this is the same tree, and I have labelled the nodes with intervals, with one catch - the intervals do not comprise integers, but symbolic names. What's the deal, you say? Here's the deal!

That's it! The colors are there so you can easily match-up the respective nodes. This is a linked list, or more precisely, the order-maintainance list. Any two nodes in it define an interval. For example, [Pet_pre, Pet_post] interval (green in the picture) subsumes both Cat's and Dog's nodes - precisely because it is an ancestor of both! Note that, internally, these list nodes do in fact contain integers that observe the total order (e.g., are monotonically increasing from left to right) so we can test in O(1) whether a node precedes another. This is quite a simple structure, and there are very good algorithms for inserting a node in it - if there is no space between the integers of the adjacent nodes, an appropriate region of this list is renumbered.

Update: probably it's obvious, but better be specific. So how do we add a descendant? Let's say we want to add a graph node X that is a subclass of Pet. We simply create two list nodes, X_pre, X_post, and insert them right before Pet_post. That would result in a chain Pet_pre --> [whatever nodes were already here] --> X_pre --> X_post --> Pet_postOr we could add these right after Pet_pre, it's equivalent really. (As a coincidence, both Dietz and I chose the first alternative, to insert child_pre and child_post right before parent_post. I had a reason for this actually, there is a tiny difference because I'm using Integer.MIN_VALUE as a sentinel value, but that would be way too much details for this post). Of course, to insert a node, say B, between two others, say A and C, we have to find an integer between label(A) and label(C), i.e. to make sure that we have label(A) < label(B) < label(C), without affecting any existing label relation. If label(A) = label(C) + 1 (i.e., consecutive, no space between them for a new node), we have some nodes to make space at the place of insertion. Compare with Dietz' Figure 4.1, in the "Maintaining order in a linked list" paper.

While this might not seem such a big win for the case of trees, it is absolutely necessary in the case of graphs. This is what Agrawal et al missed! Instead of using integers directly, they should have used nodes of an order-maintainance structure. Why? Because when an interval does not have space to accommodate a new descendant, instead of widening the interval and then scanning the whole graph to find appearances of those integers to fix them, we instead make only local changes in the order maintainance structure in O(1), without even touching the graph!

You can visualize the difference in the following example.
The segmented edges are non-tree edges, so they propagate upwards the intervals, thus, the H node ends up having 4 distinct intervals. Now, if instead of nodes such as E_pre and E_post, we had numbers (say, (5,6)), what would have happened if we wanted to add a descendant under E but it had no space in its interval? We would fix the boundaries of E's interval, and we would have to search the graph to see where the old boundaries have been propagated (i.e. in nodes F and H). With the order maintainance structure, this is not needed. We locally renumber whichever node is required, and we do it once - we only renumber E_pre and E_post once, and due to node sharing (instead of copied values/integers!), it doesn't matter where these might have been propagated, all positions are automatically fixed too! Contrast this to having to visit F, G and H, find E_pre, E_post in their intervals, and fix them in all places. Let alone if we had to renumber a series of intervals, and hunt the propagations of all of them in the graph, and so on...

Just for completeness, this is what the order maintainance list would look like, in the last example:

A_pre --> B_pre --> C_pre --> D_pre --> E_pre --> E_post --> D_post --> F_pre --> F_post --> C_post --> G_pre --> G_post --> B_post --> H_pre --> H_post --> A_post.

In this structure we perform the renumbering, not in the graph! Particularly, in some (typically small) region that contains the two nodes inside which we want to insert something.

So, there you go. This is the trick I (re)discovered on some Friday of last March, and ended up coding  http://code.google.com/p/transitivity-utils in the following weekend, which implements all of the above and more, so you don't have to. It is hard to believe that this went unnoticed for 20 years - and even harder to believe that the solution was there all along, waiting for someone to make the connection.

Now that's something to ponder about. Oh, and thanks, Dietz, great work!

If you managed to read all the way down to here, my thanks for your patience. That was way too long a post, but hopefully, you also earned something out of it. I regard this a very interesting story, and wish I was a better storyteller than this, but here you go. :)

## Sunday, July 25, 2010

### Google File System - the 'bandwidth' problem

I am reading the very interesting Google File System paper. It describes the situation back in 2003, not sure how much things have changed since then. Before moving to the subject of this post, lets sum up some interesting points of that paper first (only based on my certainly incomplete understanding).

• GFS clearly favors appending to files than random access writing. If all mutations of a file are appends, sequential readers (readers that scan the whole of the file - which represents the vast majority of cases) have also extra consistency guarantees: in the face of concurrent modifications, the only "bad" thing that they may see is the file ending prematurely. But everything up to that point is guaranteed to be well defined and consistent.
• A GFS file can be used as an excellent producer-consumer queue! Multiple producers may append ("record append") to the file, with no need for extra synchronization/distributed locking. This is a weaker form of append - the client does not choose where the data end up to, the primary replica server of that file chooses this, and the clients get a guarantee that the update with occur at least once (well, not exactly once, but one can work-around this by uniquely naming the records, if they aren't already named so). This seems much simpler and better than either having to establish distributed locking protocols between producers, or letting them choose the offset in the file where their append should take place, and trying to resolve potential (and possibly very frequent) collisions.
• Automatic garbage collection of orphaned chunks (file pieces) is implemented.
• There are quite weak concurrency guarantees for writers that write lots of data. The writer client needs to break up the data into acceptably small pieces, and each piece is written atomically - but if there are multiple writers doing the same, the final result is undefined, it could contain random fragments of the data of each writer.
• File metadata updates are atomic. This is particularly nice, consider the following case: a writer writes to a file, and after  it's done writing, it atomically renames the file to a name which can then publish to readers. This is the analog of doing a compare-and-swap (CAS) in a shared memory multiprocessor (SMP), which, importantly, is the base of most lock-free algorithms. In particular, the writer that writes to a file (unknown to others) is like a thread doing thread-local computation, and only announce the result in the end via an atomic pointer update (atomically update a root pointer, if nobody has already changed it, or in GFS, if nobody has created the intended filename first).
Now, on to the piece that found curious enough to make a post about. A couple of key points:
• When a client wants to write, it can pick whatever replica to initiate this action (presumably the replica closer to it - and in Google's case, they can infer the "closeness" of two nodes by inspecting their IPs).
• The data must be stored (temporarily) to all replicas before the client notifies the (designated) primary replica that the write should be committed.

This leads to the problem: How to move client's data to be written, to all replicas? Google's answer is through a chain (hamiltonian path) through all replicas, starting at the replica that was contacted by the client. This is preferred over e.g. a dissemination tree, so as to bound the requirements of out-bandwidth of  replicas - perhaps the best dissemination could be a star-like tree, but too much burden would be placed on the central node, and so on).

Here is a picture taken from the aforementioned paper. Note the thick arrows, they show the chain that data goes through, from the client and to all replicas.
But how do we choose the next replica in the chain? I.e., why we send the data from Replica A to the Primary Replica and then to Replica B, rather than the other way around? Google is using a simple, greedy rule: pick the replica (that hasn't seen the data yet) that is closest to the current one.

This looks like the well (ok, perhaps not so well) known graph theoretic problem, the Bandwidth Problem. That is, we have a graph, and we want to create some linear order of the nodes, so that we minimize the total edge distances (e.g. if we put nodes side-by-side, the edges between them would have very low distance/cost). It's NP-Hard.

To visualize the bandwidth problem, I'm stealing a picture from the excellent Skiena's book, the Algorithm Design Manual, which I highly recommend. (I guess a free ad here should worth stealing a small picture :-/ ). (By the way, here is an excellent treatment of the problem by Uriel Feige, very conveniently containing approximation algorithms and heuristics).

Well, almost, but not exactly the same problem. The subtle difference is that in the Bandwidth Problem, we seek to minimize the distance/cost sum of all edges, while in the case of GFS, we only need to minimize the length of a single (any) Hamiltonian path (the picture just shows an easy case, where apart from a single path there are no more edges in the graph).

Ok, so I don't know yet how to state this problem. (While reading the paper, I thought it was the same problem, thus the title of this post). Let me do something else for now - lets try to be create an example that maximizes the cost for GFS' heuristic. Here is what I came up with (click to see larger size):

I think that's the worst case scenario. The client chooses the replica in the middle, then we go to the right 1 step (because the one at the left is at distance 2), then we go to the left 3 steps (because the one at the right is at distance 4), and so on, going back and forth, with a total distance travelled = 1 + 3 + 7 + 15 = (2^1 - 1) + (2^2 - 1) + (2^3 - 1) + (2^4 - 1) = (2^5 - 1) = 31, or more generally, exactly 2^N - 1! (Of course, don't hold your breath on ever seeing such a topology in an actual GFS cluster!) The optimal solution here would be to go immediately to a boundary replica, either farthest left or right (choose the one closest), and then linearly visit the rest. That would yield a cost of (1 + 4) + (1 + 2 + 4 + 8) = 20, (1 + 4 to go to the farthest right replica, and then going all the way to the left) or more generally, exactly 2^(N - 1) + 2^(N - 3) (it's fun to work this formula out!).

Now, how much worse is Google's heuristic than the optimal solution?

It turns out, (if we assume that I did find the worst case scenario)Google's solution is at most a factor of 8/5 (= 1.6) of the optimal solution! (Well, sure enough. To compare 2^N to 2^(N-1) + 2^(N-3), divide everything by 2^N, so we have 1 compared to 1/2 + 1/8, or 5/8). Wow! That's extremely good for something as simple! I didn't expect to find such an exact result - one can also compute this easily via WolframAlpha, which is way too cool in this kind of problems.

Nice, nice, nice. I thought I would be find something ludicrously bad (in a completely unrealistic scenario), but it turns out, that's just 1.6 times the optimal solution at most!

/sheer happy - we haz it :)