Sunday, February 5, 2012

Updated: Memory cost per Java/Guava structure (+Cache)

I just updated the ElementCostInDataStructures page, after 6 months, bringing it in line with the latest Guava version (11).

The most interesting part is just how easy it is to measure the cost per entry in a Cache (or LoadingCache, which turns out to be equivalent). This is the cheat-sheet I compiled.

To compute the cost of a single (key, value) entry:

So, consider this example from the javadoc:

   LoadingCache graphs = CacheBuilder.newBuilder()
       .maximumSize(10000)
       .expireAfterWrite(10, TimeUnit.MINUTES)
       .removalListener(MY_LISTENER)
       .build(
           new CacheLoader() {
             public Graph load(Key key) throws AnyException {
               return createExpensiveGraph(key);
             }
           });

The cost of an Entry in this structure this is computed as follows:
  • It's a Cache: +12 words
  • It uses maximumSize(): +4 words
  • It uses expiration: +4 words
Thus, each (key, value) entry would have a footprint of 20 words (thus 80 bytes in a 32bit VM, or 160 in a 64bit one). 

To estimate the overhead imposed in the garbage collector, one could count how many references (pointers) each entry introduces, which the garbage collector would have to traverse to compute object reachability. The same list again, this time only counting references:

Thus, for example, if you use all features together (expiration, maximumSize(), weakKeys(), and weak or soft values), then each entry introduces 6+2+2+3+4 = 17 references. If you have 1000 entries in this cache, that would imply 17,000 references.

A simple advice to close this post: since these numbers can quickly add up, be sure to use only those features that your application needs. Don't overdo it just because these features are easily accessible - if you can practically get away with fewer features, do so, and you'll get a leaner cache.

That said, I'm positively surprised with the Cache implementation - that you pay exactly for what you use, and that the cost various combinations of features is simply the sum of the cost of the individual features. This is not obvious, not given, and not an accident, but good engineering. 

2 comments:

  1. FYI, some of the comments on that post are out of date: HashMultiset doesn't use a plain old Integer anymore, it uses a mutable (but non-atomic) Count type.

    ReplyDelete