Tuesday, October 20, 2009

Beware of recursive set union building!

The excellent Google collections library spoils many of us, but that doesn't mean we can afford not being alert using it!

Observe how easy it is, for example, to create the (live) union of two sets: Sets.union(setA, setB).

One might be tempted to write code like the following, to make the union of all sets in "S":

Set<E> union = Collections.emptySet();
for (Set<E> someSet : S) {
union = Sets.union(someSet, union)
}
Wow! This build the union in just O(|S|) time! Sure enough, accessing the elements of the union is a different issue, but how slow could it be? (Note that we do pass the smallest union as first argument, in agreement with what javadocs suggest).

Well, it turns out, this is quite slow. Iterating over the elements of the union take O(N|S|), which for really small sets can be up to O(N^2), where N is the number of all elements of the union. In comparison, creating a single HashSet and calling addAll() to add each set in S to that, takes only O(N) time.

To understand the issue, consider the algorithm for computing the elements of the union of sets A and B:

report all elements in A
for all items x in A
if !b.contains(x)
report x

Now consider this union: union(A, union(B, union(C, D))), graphically shown below.


This is how the union's iterator would report its elements:

1) Iterate and report all elements of A
2) Iterate elements of B, if they are not in A, report them
3) Iterate elements in C, if they are not in B, then if they are not in A, report them
4) Iterate elements in D, if they are not in C, then if they are not in B, then if they are not in A, report them

See the pattern there? Well, that's it. Just resist the temptation to make a recursive union, that's all. (I haven't looked the matter deeply, but I think this shouldn't be affecting recursive intersection or recursive difference).

So, in this case, creating a big HashSet and dumping all elements in it is the way to go. It is a bit of a pity that a HashSet is really a HashMap in disguise, i.e. horribly wasteful (compared to what a genuine HashSet implementation should be), but that's life in Java :)

Till next time,
Bye bye!