Sunday, December 12, 2010

How long is the period of random numbers created by java.util.Random?

The answer is, pretty darn long! 

I feel a bit cheated for not knowning, while I had several times seen the code the years before.

So it uses a good old linear congruential generator:

x = A * x + C

So, java.util.Random uses these two constants (A and C respectively):

    private final static long multiplier = 0x5DEECE66DL;
    private final static long addend = 0xBL;

Quite appropriate to place two "random" numbers in the Random class, right? Only these are not entirely random...

0x5DEECE66DL % 8 == 5
0xBL is an odd.

So what? 

It turns out, when these exact conditions are satisfied, the defined period is maximal, so the generated period (x is long) is 2^64. This piece of magic is documented here, page 9. The "why" of it is beyond me. 

If any ninja mathematician comes across this post, please leave some explanation if you can.

EDIT: oops. I overlooked that java.util.Random uses a mask to only reduce itself to 48 bits, instead of 64, so correspondingly the period is 2^48 instead of 2^64; apparently in an attempt to increase the quality of the generated numbers. In any case, if you need longer period, the technique described here will do the trick.

Juggling consistent hashing, hashtables, and good old sorting

Consistent hashing can be summarized in one sentence: it is a hashtable with variant-length buckets.

You have a line segment (that's the key space). Breaking it to N equal parts (the buckets) means it is very easy (O(1)) to compute which part contains a given point. But if you increase or decrease N, all parts change (and shrink or grow respectively). Only the first and the last part have one of its ends unaffected, but the rest have both their ends moved.

Now consider breaking it in N parts, but not of necessarily the same length. Now it is easier to change the number of parts. Do you want more parts? Break one in two pieces. Want fewer parts? Merge two parts sitting next to each other. All other parts remain firm in their place, unaffected. But now finding which part contains a given point gets more complicated (O(logN) if you keep the parts sorted, say in a red-black tree).

Hashtables themselves can be summarized in an interesting way: it is hash (it doesn't matter if the original key space is not totally ordered - the hashes are). But if it is sorting, where the logN factor in the complexity of hashtable operations? Well, there isn't one: hashes are small natural numbers (bounded by the hashtable size, which is O(N) of the elements in it). And what do you think of when you want to sort small natural numbers? Counting sort. (Caveat: hash functions do not just aim to map a large key space into a sequence of natural numbers, that's only part of the story. They also aim to approximate a random permutation of the original space, which is separately useful - to try to load balance the buckets when the incoming distribution is not uniform. In consistent hashing, we sacrifice the first role of hash functions, but we keep this latter one).

Trying to compress the discussion above:

  • Hashtables are really counting sort
    • (Yes, data structures are algorithms)
    • An equal piece of the (approximately permuted) key space is mapped to each bucket
      • Easy to find the corresponding bucket for a key (++)
      • Changing the number of buckets affects the mapped spaces of all buckets (--)
  • Consistent hashing differs in the following way:
    •  A variable piece the (approximately permuted) key space is mapped to each bucket
      • Harder to find the corresponding bucket for a key (--)
      • Easy to split or merge buckets, without affecting any other buckets (++)

Bonus section: Bigtables (and the good old Bigtable paper here). In Bigtable, The rows are primarily sorted by row key. The tablets (think buckets) are variable length, so that overloaded tablets can be split, without touching the huge amount of data that potentially reside in the same bigtable. (One can help load balancing the tablets by -did you guess it?- sorting by hash first...). So, here you go. The rows are sorted. Yet it can still rightfully be described as consistent hashing. Confused yet? Well, at least I tried. :)