Sunday, July 25, 2010

Google File System - the 'bandwidth' problem

I am reading the very interesting Google File System paper. It describes the situation back in 2003, not sure how much things have changed since then. Before moving to the subject of this post, lets sum up some interesting points of that paper first (only based on my certainly incomplete understanding).

  • GFS clearly favors appending to files than random access writing. If all mutations of a file are appends, sequential readers (readers that scan the whole of the file - which represents the vast majority of cases) have also extra consistency guarantees: in the face of concurrent modifications, the only "bad" thing that they may see is the file ending prematurely. But everything up to that point is guaranteed to be well defined and consistent.
  • A GFS file can be used as an excellent producer-consumer queue! Multiple producers may append ("record append") to the file, with no need for extra synchronization/distributed locking. This is a weaker form of append - the client does not choose where the data end up to, the primary replica server of that file chooses this, and the clients get a guarantee that the update with occur at least once (well, not exactly once, but one can work-around this by uniquely naming the records, if they aren't already named so). This seems much simpler and better than either having to establish distributed locking protocols between producers, or letting them choose the offset in the file where their append should take place, and trying to resolve potential (and possibly very frequent) collisions.
  • Automatic garbage collection of orphaned chunks (file pieces) is implemented.
  • There are quite weak concurrency guarantees for writers that write lots of data. The writer client needs to break up the data into acceptably small pieces, and each piece is written atomically - but if there are multiple writers doing the same, the final result is undefined, it could contain random fragments of the data of each writer.
  • File metadata updates are atomic. This is particularly nice, consider the following case: a writer writes to a file, and after  it's done writing, it atomically renames the file to a name which can then publish to readers. This is the analog of doing a compare-and-swap (CAS) in a shared memory multiprocessor (SMP), which, importantly, is the base of most lock-free algorithms. In particular, the writer that writes to a file (unknown to others) is like a thread doing thread-local computation, and only announce the result in the end via an atomic pointer update (atomically update a root pointer, if nobody has already changed it, or in GFS, if nobody has created the intended filename first).
Now, on to the piece that found curious enough to make a post about. A couple of key points:
  • When a client wants to write, it can pick whatever replica to initiate this action (presumably the replica closer to it - and in Google's case, they can infer the "closeness" of two nodes by inspecting their IPs). 
  • The data must be stored (temporarily) to all replicas before the client notifies the (designated) primary replica that the write should be committed. 

This leads to the problem: How to move client's data to be written, to all replicas? Google's answer is through a chain (hamiltonian path) through all replicas, starting at the replica that was contacted by the client. This is preferred over e.g. a dissemination tree, so as to bound the requirements of out-bandwidth of  replicas - perhaps the best dissemination could be a star-like tree, but too much burden would be placed on the central node, and so on). 

Here is a picture taken from the aforementioned paper. Note the thick arrows, they show the chain that data goes through, from the client and to all replicas. 
But how do we choose the next replica in the chain? I.e., why we send the data from Replica A to the Primary Replica and then to Replica B, rather than the other way around? Google is using a simple, greedy rule: pick the replica (that hasn't seen the data yet) that is closest to the current one.

This looks like the well (ok, perhaps not so well) known graph theoretic problem, the Bandwidth Problem. That is, we have a graph, and we want to create some linear order of the nodes, so that we minimize the total edge distances (e.g. if we put nodes side-by-side, the edges between them would have very low distance/cost). It's NP-Hard.

To visualize the bandwidth problem, I'm stealing a picture from the excellent Skiena's book, the Algorithm Design Manual, which I highly recommend. (I guess a free ad here should worth stealing a small picture :-/ ). (By the way, here is an excellent treatment of the problem by Uriel Feige, very conveniently containing approximation algorithms and heuristics).

Well, almost, but not exactly the same problem. The subtle difference is that in the Bandwidth Problem, we seek to minimize the distance/cost sum of all edges, while in the case of GFS, we only need to minimize the length of a single (any) Hamiltonian path (the picture just shows an easy case, where apart from a single path there are no more edges in the graph).

Ok, so I don't know yet how to state this problem. (While reading the paper, I thought it was the same problem, thus the title of this post). Let me do something else for now - lets try to be create an example that maximizes the cost for GFS' heuristic. Here is what I came up with (click to see larger size):

I think that's the worst case scenario. The client chooses the replica in the middle, then we go to the right 1 step (because the one at the left is at distance 2), then we go to the left 3 steps (because the one at the right is at distance 4), and so on, going back and forth, with a total distance travelled = 1 + 3 + 7 + 15 = (2^1 - 1) + (2^2 - 1) + (2^3 - 1) + (2^4 - 1) = (2^5 - 1) = 31, or more generally, exactly 2^N - 1! (Of course, don't hold your breath on ever seeing such a topology in an actual GFS cluster!) The optimal solution here would be to go immediately to a boundary replica, either farthest left or right (choose the one closest), and then linearly visit the rest. That would yield a cost of (1 + 4) + (1 + 2 + 4 + 8) = 20, (1 + 4 to go to the farthest right replica, and then going all the way to the left) or more generally, exactly 2^(N - 1) + 2^(N - 3) (it's fun to work this formula out!). 

Now, how much worse is Google's heuristic than the optimal solution? 

It turns out, (if we assume that I did find the worst case scenario)Google's solution is at most a factor of 8/5 (= 1.6) of the optimal solution! (Well, sure enough. To compare 2^N to 2^(N-1) + 2^(N-3), divide everything by 2^N, so we have 1 compared to 1/2 + 1/8, or 5/8). Wow! That's extremely good for something as simple! I didn't expect to find such an exact result - one can also compute this easily via WolframAlpha, which is way too cool in this kind of problems.

Nice, nice, nice. I thought I would be find something ludicrously bad (in a completely unrealistic scenario), but it turns out, that's just 1.6 times the optimal solution at most!

/sheer happy - we haz it :)


Ah, the art of creating good corner case examples. It turns out the above is not the worst case, there is a worse than that, and quite simpler too (I had thought about this earlier, but for some reason I decided that forcing the algorithm to continually go back and forth would be the most expensive).

Consider this example (again, click to see in full size):

Each adjacent replica pair is separated by unit distance, while the last two are separated by two. We can make the path arbitrarily long, easily forcing the algorithm to choose a path which is a factor of 2 - ε longer than the optimal. And this is to be further restrained: I'm only considering geometric graphs, i.e. graphs where the triangle inequality holds. I have next to no idea whether actual network topologies resemble geometric graphs, and certainly one could construct much worse examples in graphs where the notion of distance is completely arbitrary (but distances in a network topology certainly isn't an arbitrary function). It's too late in the night to worry about that, so lets stop here hoping I'm still making some sense. :)

By the way, I recently bought Vazirani's Approximate Algorithms book. Neat, refreshing read, and highly recommended.