Wednesday, July 29, 2009

The best way to enumerated directed acyclic graphs



Here is a tricky problem for the algorithmically oriented people to ponder about: "enumerate all directed acyclic graphs (DAGs) of a given size n".

I'll describe the process I followed in tackling this problem (the result is rather fascinating), but if you find this challenging enough, it would be good to stop reading for a while and see if you can solve it (hopefully in a good way).

First, some background. "Are you nuts? Enumerating graphs? You'll get huge numbers of them almost immediately!". Well, true. For example, here is the arithmetic sequence that shows just how many distinct (non-isomorphic) general graphs there are. It gets out of hand really quickly. Well, some background. I'm trying to develop an algorithm that takes as input a DAG, but it's quite hard to analyze and/or compare it with existing algorithms. So I'm interested in exhausting all possible inputs up to a certain size and count the logical steps of the algorithm for each problem instance, which can offer me useful feedback on the algorithm design, or show me which graphs produce the worst behavior of the algorithm, for which graphs the algorithm performs better than other algorithms, and so on.

Now, on to business. Upon some research, it seems there is no known sequence defining the number of unique DAGs per node count out there. The closest I found is the number of partially ordered sets. A partially ordered set is also a DAG, with the further restriction that its transitive reduction is itself (i.e. no redundant edges exist). Since there are many DAGs which yield the same transitively reduced DAG, it is obvious that this sequence is a strictly lower bound on the number of DAGs.

My first approach was rather brave, but futile nevertheless. I started about defining the sole graph of size 1 (no self loops), which is just a node. Then iteratively I increased the size N of generated graphs. At the iteration, the Nth node of the graphs is created and combined with all graphs of size N-1 in every possible way. How many ways are there? We are only allowed to connect the new node to the older nodes (not the other way around), so we only generated DAGs, so there are N-1 possible edges to add. The power set of this is 2^(N-1), and this is every possible way that a given graph with N-1 size can be extended with a new node.

This produced DAGs of count 2^0 (1 node), 2^1 (2 nodes)
  • 2^0 = 1(one node)
  • 2^1 = 2 (two nodes)
  • 2^3 = 8 (three nodes)
  • 2^6 = 64 (four nodes)
  • 2^10 = 1024 (five nodes)
  • 2^15 = 32768 (six nodes)
  • 2^21 = 2097152 ( seven nodes)
The next milestone was 2^28 (268435456) graphs, but I ran out of memory. :) But storing more than two million graphs of 7 nodes and up to 21 nodes is a formidable task. I managed to pack all those graphs in about 120mb of RAM, meaning less than 60 bytes for each graph, which for a adjacency lists implementation, is pretty impressive. (But adjacency matrix where each possible edge represented as single bits would be the most economical representation). This is possible because each graph of size N shared the N-1 nodes of it, along with their adjacency lists, with the previous graph of size N-1 that it extended, so the cost of each graph is more or less the cost of the final node and its list. (This is the kind of stuff where persistent data structures really excel, but in Javaland the reusable implementations are few and far between - did I mention yet you should check out Scala?)

Then, I talked with Martin, a collegue/Phd student at the Univ. of Bath, who mostly works on NP-hard problems, and apart from a long discussion on "why one earth do you want to have a freak of nature like this one???" and other related sub-discussions, he suggested enumerating all undirected general graphs (not directed, not acyclic: this is what is offered) by nauty, and then produce all DAGs from each graph. Quite a huge amount of work: 2^(n(n-1)) number of graphs, multiplied by the number of DAGs created from each (which can be up to 2^(n-1)(n-2) / 2). But most importantly, he mentioned that nauty enumerates the graphs without having to store the smaller ones, which made me challenge my approach of generating the DAGs.

From there, it only took few seconds to bump on the correct solution, which is very simple. See the following table:



This table is meant to be a graph's adjacency matrix. The rows, as well as the columns, represent the graph's nodes. Each cell can be 0 or 1, to denote if there is a node from the row-node to the column-node, respectively. Note that the nodes of the matrix are actually in topological order (which is defined in DAGs) - a node/row is only allowed to connect to nodes after/up of it, not before/down of it - thus this matrix, the gray area are edges that if were allowed, they would violate the defined topological order. Only the white cells can contain 1, but the can also contain 0 of course. So what do we have here? For a DAG of n nodes, we have this nxn matrix, and (n-1)(n-2)/2 cells which can independently vary between 0 or 1. The solution from here is easy: Just arrange these cells as bits in a bit string, i.e. a binary number, start with zero, and increment it by one at each step, till the number consists of only 1's. Constant storage space, just a number of (n-1)(n-2)/2 bits! And a trivial way to enumerate the dags, transform the enumeration problem to the problem of...adding 1 to a binary number. All it takes is decoding the number into the respective DAG. Isn't this elegant? :)

To sum up, this procedure creates 2^((n-1)(n-2) / 2) DAGs, in equal number of iterations. Which in contrary to the suggested method, has half the exponent, i.e. for n = 10, the last method would yield 2^45 steps/DAGs, while using nauty would create 2^90 general graphs, where each graph would be subsequently transformed to many, many DAGs.

Addendum: The above description leaves a tricky part uncommented. We saw how to generate all DAGs with n nodes, i.e. :

int currentGraph = 1 << (n - 1) * (n - 2) / 2 ;
while (currentGraph >= 0) {
currentGraph--; //represents a DAG!
}

But how to interpret these numbers as graphs? We have to be able to answer, for all graphs, whether there is an edge (i --> j), i.e. connecting node i to node j. Here is the implementation of this test, by Nelly Vouzoukidou, a graduate cs student at Univ. of Crete:



boolean hasEdge(int graph, int i, int j) {
return i > j && isSet(graph, i * (i - 1) / 2 + j)
}

//just checks whether a given bit of a number is set
boolean isSet(int number, int bit) {
return number & (1 << bit) & number != 0;
}


You can see the second picture to understand the bits layout that this formula represents. There is a nice geometric interpretation of it too: "i * (i - 1) / 2" is the surface of the triangle above the selected row. To that, we add "j" to go to the desired cell, since at every row, each cell represents the next bit of its left cell.

If you found this interesting, you might want to check out an older post about enumerating all binary trees, where also an amusing solution is produced.