Code-o-matic: 2009

Saturday, December 26, 2009

Barbara Liskov talk and vintage photo

Few days earlier I was happy to see this talk given by Barbara Liskov:

http://www.infoq.com/presentations/liskov-power-of-abstraction

Few days earlier I submitted a paper to ESWC10 ( found here , titled "Flexible Ranking and Matchmaking for Semantic Service Discovery"), which happened to include a reference to Liskov and her well known substitution principle (the bottom side of the 3rd page).

It was not a big deal, just "common sense" reiterated. As you will notice from the talk above, though, at the 70ies this kind of "common sense" we take for granted today, was debatable and unclear then -- just as debatable was whether the goto statement was evil or not!

One thing that impressed me most in that talk was a photo that Barbara shared with the audience, from the 70ies. I had never saw her young!

I will leave the photo uncommented - it speaks for itself. -edit: well, I commented it after all, see below :)

Tuesday, October 20, 2009

Beware of recursive set union building!

The excellent Google collections library spoils many of us, but that doesn't mean we can afford not being alert using it!

Observe how easy it is, for example, to create the (live) union of two sets: Sets.union(setA, setB).

One might be tempted to write code like the following, to make the union of all sets in "S":


Set<E> union = Collections.emptySet();
 for (Set<E> someSet : S) {
   union = Sets.union(someSet, union)
}

Wow! This build the union in just O(|S|) time! Sure enough, accessing the elements of the union is a different issue, but how slow could it be? (Note that we do pass the smallest union as first argument, in agreement with what javadocs suggest).

Well, it turns out, this is quite slow. Iterating over the elements of the union take O(N|S|), which for really small sets can be up to O(N^2), where N is the number of all elements of the union. In comparison, creating a single HashSet and calling addAll() to add each set in S to that, takes only O(N) time.

To understand the issue, consider the algorithm for computing the elements of the union of sets A and B:


report all elements in A
for all items x in A
 if !b.contains(x)
   report x

Now consider this union: union(A, union(B, union(C, D))), graphically shown below.

This is how the union's iterator would report its elements:

1) Iterate and report all elements of A

2) Iterate elements of B, if they are not in A, report them

3) Iterate elements in C, if they are not in B, then if they are not in A, report them

4) Iterate elements in D, if they are not in C, then if they are not in B, then if they are not in A, report them

See the pattern there? Well, that's it. Just resist the temptation to make a recursive union, that's all. (I haven't looked the matter deeply, but I think this shouldn't be affecting recursive intersection or recursive difference).

So, in this case, creating a big HashSet and dumping all elements in it is the way to go. It is a bit of a pity that a HashSet is really a HashMap in disguise, i.e. horribly wasteful (compared to what a genuine HashSet implementation should be), but that's life in Java :)

Till next time,

Bye bye!

Thursday, October 8, 2009

Using JConsole to monitor...JConsole

I was in the mood for some recursive monitoring, so I fired up a JConsole process and ordered it to monitor itself. I managed to make it show the stack trace of the thread that had the task to show the stack trace of the thread that had the....... you get the idea :)

For what it worths, here is the stack trace:


sun.tools.jconsole.Worker.add(Worker.java:56)
sun.tools.jconsole.Tab.workerAdd(Tab.java:73)
  - locked sun.tools.jconsole.ThreadTab@c829e3
sun.tools.jconsole.ThreadTab.valueChanged(ThreadTab.java:316)
javax.swing.JList.fireSelectionValueChanged(JList.java:1765)
javax.swing.JList$ListSelectionHandler.valueChanged(JList.java:1779)
javax.swing.DefaultListSelectionModel.fireValueChanged(DefaultListSelectionModel.java:167)
javax.swing.DefaultListSelectionModel.fireValueChanged(DefaultListSelectionModel.java:137)
javax.swing.DefaultListSelectionModel.setValueIsAdjusting(DefaultListSelectionModel.java:668)
javax.swing.JList.setValueIsAdjusting(JList.java:2110)
javax.swing.plaf.basic.BasicListUI$Handler.mouseReleased(BasicListUI.java:2788)
java.awt.AWTEventMulticaster.mouseReleased(AWTEventMulticaster.java:273)
java.awt.Component.processMouseEvent(Component.java:6263)
javax.swing.JComponent.processMouseEvent(JComponent.java:3267)
java.awt.Component.processEvent(Component.java:6028)

I'm monitoring a process that should take about 2 hours, so yes, I do have a lot of time in my hands :)

Friday, September 18, 2009

Rapidshare can compress any file to just 16 bytes!!!

Or so it claims:

there is a program which calculates the MD5 checksum for each file immediately after the upload. The MD5 checksum is a 16 bytes value which is alsways the same when calculated for the same file. This value is stored in connection with the file. If you upload and distribute an illegal file, we will delete the file upon notification and add the MD5 checksum to a blacklist, so the same file cannot be uploaded again.

(Emphasis mine).

Now, I really, really, really want to see their blacklist implementation, since obviously it holds the key to the dark art of infinite compression, which is the Holy Grail Of Computer Science, the Universe, and Everything. Yes, the notorious Holy Grail that nobody wants to touch because apparently it's the favorite pooping place for some very special pigeons...

Monday, September 14, 2009

JComboBox pure craziness

Someone asked me why his ItemListener, attached to a JComboBox, was getting two events when he selected something on it. Weird.

I looked up JComboBox' javadocs, just to be sured, and....this is what I found:

Adds an ItemListener.
aListener will receive one or two ItemEvents when the selected item changes.

I understand Swing is a huge framework and it has its rough edges...but this? It certainly looks like someone couldn't fix this strange behavior, and desided to make it a documented feature instead. Way to go!

Wednesday, July 29, 2009

The best way to enumerated directed acyclic graphs

Here is a tricky problem for the algorithmically oriented people to ponder about: "enumerate all directed acyclic graphs (DAGs) of a given size n".

I'll describe the process I followed in tackling this problem (the result is rather fascinating), but if you find this challenging enough, it would be good to stop reading for a while and see if you can solve it (hopefully in a good way).

First, some background. "Are you nuts? Enumerating graphs? You'll get huge numbers of them almost immediately!". Well, true. For example, here is the arithmetic sequence that shows just how many distinct (non-isomorphic) general graphs there are. It gets out of hand really quickly. Well, some background. I'm trying to develop an algorithm that takes as input a DAG, but it's quite hard to analyze and/or compare it with existing algorithms. So I'm interested in exhausting all possible inputs up to a certain size and count the logical steps of the algorithm for each problem instance, which can offer me useful feedback on the algorithm design, or show me which graphs produce the worst behavior of the algorithm, for which graphs the algorithm performs better than other algorithms, and so on.

Now, on to business. Upon some research, it seems there is no known sequence defining the number of unique DAGs per node count out there. The closest I found is the number of partially ordered sets. A partially ordered set is also a DAG, with the further restriction that its transitive reduction is itself (i.e. no redundant edges exist). Since there are many DAGs which yield the same transitively reduced DAG, it is obvious that this sequence is a strictly lower bound on the number of DAGs.

My first approach was rather brave, but futile nevertheless. I started about defining the sole graph of size 1 (no self loops), which is just a node. Then iteratively I increased the size N of generated graphs. At the iteration, the Nth node of the graphs is created and combined with all graphs of size N-1 in every possible way. How many ways are there? We are only allowed to connect the new node to the older nodes (not the other way around), so we only generated DAGs, so there are N-1 possible edges to add. The power set of this is 2^(N-1), and this is every possible way that a given graph with N-1 size can be extended with a new node.

This produced DAGs of count 2^0 (1 node), 2^1 (2 nodes)

2^0 = 1(one node)
2^1 = 2 (two nodes)
2^3 = 8 (three nodes)
2^6 = 64 (four nodes)
2^10 = 1024 (five nodes)
2^15 = 32768 (six nodes)
2^21 = 2097152 ( seven nodes)

The next milestone was 2^28 (268435456) graphs, but I ran out of memory. :) But storing more than two million graphs of 7 nodes and up to 21 nodes is a formidable task. I managed to pack all those graphs in about 120mb of RAM, meaning less than 60 bytes for each graph, which for a adjacency lists implementation, is pretty impressive. (But adjacency matrix where each possible edge represented as single bits would be the most economical representation). This is possible because each graph of size N shared the N-1 nodes of it, along with their adjacency lists, with the previous graph of size N-1 that it extended, so the cost of each graph is more or less the cost of the final node and its list. (This is the kind of stuff where persistent data structures really excel, but in Javaland the reusable implementations are few and far between - did I mention yet you should check out Scala?)

Then, I talked with Martin, a collegue/Phd student at the Univ. of Bath, who mostly works on NP-hard problems, and apart from a long discussion on "why one earth do you want to have a freak of nature like this one???" and other related sub-discussions, he suggested enumerating all undirected general graphs (not directed, not acyclic: this is what is offered) by nauty, and then produce all DAGs from each graph. Quite a huge amount of work: 2^(n(n-1)) number of graphs, multiplied by the number of DAGs created from each (which can be up to 2^(n-1)(n-2) / 2). But most importantly, he mentioned that nauty enumerates the graphs without having to store the smaller ones, which made me challenge my approach of generating the DAGs.

From there, it only took few seconds to bump on the correct solution, which is very simple. See the following table:

This table is meant to be a graph's adjacency matrix. The rows, as well as the columns, represent the graph's nodes. Each cell can be 0 or 1, to denote if there is a node from the row-node to the column-node, respectively. Note that the nodes of the matrix are actually in topological order (which is defined in DAGs) - a node/row is only allowed to connect to nodes after/up of it, not before/down of it - thus this matrix, the gray area are edges that if were allowed, they would violate the defined topological order. Only the white cells can contain 1, but the can also contain 0 of course. So what do we have here? For a DAG of n nodes, we have this nxn matrix, and (n-1)(n-2)/2 cells which can independently vary between 0 or 1. The solution from here is easy: Just arrange these cells as bits in a bit string, i.e. a binary number, start with zero, and increment it by one at each step, till the number consists of only 1's. Constant storage space, just a number of (n-1)(n-2)/2 bits! And a trivial way to enumerate the dags, transform the enumeration problem to the problem of...adding 1 to a binary number. All it takes is decoding the number into the respective DAG. Isn't this elegant? :)

To sum up, this procedure creates 2^((n-1)(n-2) / 2) DAGs, in equal number of iterations. Which in contrary to the suggested method, has half the exponent, i.e. for n = 10, the last method would yield 2^45 steps/DAGs, while using nauty would create 2^90 general graphs, where each graph would be subsequently transformed to many, many DAGs.

Addendum: The above description leaves a tricky part uncommented. We saw how to generate all DAGs with n nodes, i.e. :

int currentGraph = 1 << (n - 1) * (n - 2) / 2 ;

while (currentGraph >= 0) {

currentGraph--; //represents a DAG!

}

But how to interpret these numbers as graphs? We have to be able to answer, for all graphs, whether there is an edge (i --> j), i.e. connecting node i to node j. Here is the implementation of this test, by Nelly Vouzoukidou, a graduate cs student at Univ. of Crete:


boolean hasEdge(int graph, int i, int j) {
  return i > j && isSet(graph, i * (i - 1)  / 2 + j)
}

//just checks whether a given bit of a number is set
boolean isSet(int number, int bit) {
  return number & (1 << bit) & number != 0;
}

You can see the second picture to understand the bits layout that this formula represents. There is a nice geometric interpretation of it too: "i * (i - 1) / 2" is the surface of the triangle above the selected row. To that, we add "j" to go to the desired cell, since at every row, each cell represents the next bit of its left cell.

If you found this interesting, you might want to check out an older post about enumerating all binary trees, where also an amusing solution is produced.

Monday, July 6, 2009

This is how real men do garbage collection!

I was quite fascinated today seeing the way the famous JDK figure Martin Bucholtz does garbage collection. Anyone believing he takes garbage more seriously than that?

Enjoy:


private static final Runtime rt = Runtime.getRuntime();
static void gcTillYouDrop() {
    rt.gc();
    rt.runFinalization();
    final CountDownLatch latch = new CountDownLatch(1);
    new Object() {
      protected void finalize() {
        latch.countDown();
      }
    };
    rt.gc();
    try {
      latch.await();
    }
    catch(InterruptedException ie){
      throw new Error(ie);
    }
}

"Hyper-paranoid" indeed, in the words of Kevin.

Thursday, June 18, 2009

Funniest Java Interface

interface sun.net.www.http.Hurryable {
 boolean hurry();
}

:-)

Tuesday, June 16, 2009

Funny ConcurrentModificationException while playing with Google's Multimap

(Disclaimer: Google's Multimap surely work fine and as advertised - this post describes a potentially confusing code interaction which can result in unintuitive ConcurrentModificationExceptions)

First, lets create a multimap:

Multimap multimap = HashMultimap.create();

Put somehow some values to it:

putSomeValues(multimap);

Now lets iterate its key set, and potentially filter some elements of each key's collection:

for (Key key : multimap.keySet()) {
  Collection values = multimap.get(key);

  ...
  if (something) {
    Value someValue = ...;
    values.remove(someValue);
  }
}

(We could iterate its entries() as well, which is supposedly faster, but for now, I won't bother).

This seems safe. I iterate the keys, and don't perform any structural modification in the multimap, I might only make a collection inside it shorter.

Well, no. It can blow if "values" becomes empty, since then the multimap will remove the respective entry from the map (this is a very desirable behavior, to be sure), thus structurally modifying the map, thus ConcurrentModificationException when the iterator will try to fetch the next key.

This may be quite nasty if only rarely the "values" collection goes empty and triggers this behavior, so it's good to know.

My solution is this:

Iterator keyIterator = multimap.keySet().iterator();
while (keyIterator.hasNext()) {
Key key = keyIterator.next();
Collection values = multimap.get(key);

...
if (something) {
  Value someValue = ...;
  if (values.size() == 1 && values.contains(someValue)) {
    keyIterator.remove();
    continue; //this is not really needed
  }
  values.remove(someValue);
}
}

Quite simpler would be to simply change this:

for (Key key : multimap.keySet()) {

To this:

for (Key key : Lists.newArrayList(multimap.keySet())) {

If you don't mind creating a copy of the entire key set up-front.

Tuesday, May 12, 2009

Double-checked locking idiom, sweet in Scala!

In Java, this is what you have to write when you want a lazily, thread-safe constructed value, which isn't static (otherwise you would use Initialization On Demand Holder idiom).

private volatile Object field;

public Object get() {

Object o = field;

if (o == null) {

synchronize(this) {

o = field = create();

}

return o;

}

private Object create() { ... }

Over, and over again, as many times as you need it. Or you could make a light abstraction over it, like this:

public class LazyHolder {

private volatile T field;

public T get() {

T o = field;

if (o == null) {

synchronize(this) {

o = field = create();

}

return o;

}

protected abstract T create();

}

So the user would have to do this:

private final LazyHolder holder = new LazyHolder() {

protected T create() { ... }

};

A very tiny caveat here (in both versions): if your computation happens to return null, the computation will be repeated the next time, since we used null as an indicator of missing initialization.

Enough about Java. How would you go about doing this in Scala?

Turns out, it is supported directly by the compiler, so you only have to write this:

lazy val myLazyField = create();

This is all it takes! It's thread-safe, it's computed only the first time if it's needed, and amazingly easy.

This is translated to a variant of the double-checked locking idiom. I decompiled a lazy val declaration and this is what I got:

public volatile int bitmap$0;

private Object myLazyField;

public String myLazyField() {

if((bitmap$0 & 1) == 0)

{

synchronized(this)

{

if((bitmap$0 & 1) == 0)

{

myLazyField = ...

bitmap$0 = bitmap$0 | 1;

}

return myLazyField;

}

Note the bitmap field, which can support the lazy initialization of up to 32 fields (even if some are initialized merely to null).

Nice and easy!

Tuesday, April 14, 2009

A Wonderful Programming Exercise

Yesterday, while reading the famous and excellent Introduction to Algorithms ("CLRS") book, I bumped on a seemingly innocuous problem on Appendix B (page 1092). The problem statement reads:

Show that by removing a single edge, we can partition the vertices of any n-vertex
binary tree into two sets A and B such that |A| <= 3n/4 and |B| <= 3n/4.

My first reaction was, "wow". If you are like me, you tend to consider binary trees as something very, very basic. Something a computer-science student gets to learn and love by second semester, and then move on to more fancy and complex stuff. Yet, this is the first time I came across to this very peculiar property of binary trees (even in the form of an exercise), which states that "you can always split any binary tree by removing some edge and have both components less or equal to 3/4 of the initial size, and equivalently greater or equal to 1/2 of the initial size". Why nature specially picked 3/4 and 1/4? I don't know. (You may google it, I did, and didn't find the answer/proof).

Here is a simple image of a binary tree, courtesy of Wikipedia: A binary tree picture

Anyway, on to the point. My problem was that I couldn't even find a binary tree which I could not cut more evenly than 3/4 and 1/4 (because, the exercise implies that there are some binary trees that you cannot reduce the greater split component any less than 3/4). So how am I supposed to build some basic intuition to be guided to a formal proof? I didn't fancy drawing and redrawing lots of trees, trying to find the "right" example. So, being a programmer, that's where I thought I could use some help from the computer in front of me: why not let the computer find the example and show it to me?

So this was my new problem statement: enumerate *all* binary trees of a given size N. How about trying this problem yourself and leaving a comment with your solution? (I would be quite interested in a nice solution in Scala by the way).

Assuming you now go to your favorite editor and see if you can hack this, let me describe here what happened on my side. My first attempt was ugly. Real ugly. I felt bad doing it, even bad about myself as a programmer. "A real programmer would be able to hack through it in few minutes". Well, mine was ugly, and I didn't even finish it, perhaps it couldn't work that way at all. For history, I tried building all trees in a top-down fashion (by the way, some months ago I was curious about "how many binary trees are there with a given size", and did that top-down approach to find out how many, but that was a simpler problem, and I ended up rediscovering the Catalan numbers).

I gave up, did not bother with it for another few hours, and then it hit me: why not build the trees bottom-up? In a dynamic programming fashion. Knowing all trees of size (N-1), one can pretty easily find the trees with size N.

Enough description, lets have some code:


public class Main {
public static void main(String[] args) {
    int count = 0;

    for (Node tree : TreeEnumerator.findTrees(5)) {
        System.out.println(tree);
        count++;
    }
    System.out.println(count);
}
}

class TreeEnumerator {
public static Collection<Node> findTrees(int n) {
    List<Collection<Node>> knownTrees = new ArrayList<Collection<Node>>(n);
    knownTrees.add(Collections.<Node>singleton(Nil.instance));
    knownTrees.add(Collections.<Node>singleton(new InternalNode(Nil.instance, Nil.instance)));
    for (int i = 2; i <= n; i++) {
        final int kids = i - 1;
        Collection<Node> trees = new LinkedList<Node>();
        for (int leftKids = kids; leftKids >= 0; leftKids--) {
            int rightKids = kids - leftKids;
            Collection<Node> leftTrees = knownTrees.get(leftKids);
            Collection<Node> rightTrees = knownTrees.get(rightKids);
            for (Node left : leftTrees) {
                for (Node right : rightTrees) {
                    trees.add(new InternalNode(left, right));
                }
            }
        }
        knownTrees.add(trees);
    }
    return knownTrees.get(n);
}
}

interface Node {
}

class Nil implements Node {
static final Nil instance = new Nil();

private Nil() { }

@Override
public String toString() {
    return ".";
}
}

class InternalNode implements Node {
final Node left, right;

InternalNode(Node left, Node right) {
    this.left = left;
    this.right = right;
}

@Override
public String toString() {
    return "(" + left + right + ")";
}
}

I was very, very happy with the end result and the easiness that new trees are generated, always reusing some existing lower-rank tree. I suppose one might only appreciate this if he had already tried and realized how easily a "simple" problem, thought just a little bit wrongly, may end up in something so complicated and unmanageable - then one really appreciates simplicity! This algorithm is very efficient (probably optimal), and only creates as many nodes as trees (of size N or less), no more. Also noteworthy is that no tree copying takes place anywhere, only tree sharing. (This is facilitated by the fact that nodes do not ren eference their parents).

All this is nice, but this is not the complete solution. We still have to find which tree of those cannot be broken more evenly than 1/4 and 3/4. An obvious approach is to take every tree, and break it in every place possible, and keep the most even cut of those, and check whether it is 3/4. But then, how would one actually cut the trees? Nulling references in nodes? But that would destroy all other trees sharing those nodes! Node copying? It would seem that we should after all fall back to copying every tree, then try all breaks, so the original trees are kept intact.

Since this gets quite a big post, I'm not going to show the code of my solution, just sketch it. It is just a simple neat trick really. You don't have to actually break the tree! Just define a recursive "int count()" method on Node, and for every tree, do a couple of actions:

Perform count() on the root, store it to "treeSize"
Perform count() on every other node, store it to "subTreeSize"

Then you automatically have the sizes of the two components "assuming" they were actually disconnected (while they are not; they are still the same tree). "subTreeSize" is the size of the tree component below the point we would cut it. "treeSize - subTreeSize" is the size of everything else, i.e. the second component. Voila!

Hopefully you would find this interesting enough and tried it out before reading this.

Cheers!

PS: This blog post is a kind offer from the CoCo coffee-bar (wifi-enabled), located on Cowley St., at the beautiful city of Oxford.

PS2: SyntaxHighlighter just won't work for me. :-( Sorry. I am no CSS expert, and would need something pre-integrated in the blogging software.

Tuesday, February 17, 2009

Crazy Java Puzzler

Here is a puzzle for you. As the title promises, yes, it's quite crazy. It is assumed that you already are a die-hard Java programmer that knows the language better than the skin of his very palm. Ok, maybe not that much, but anyway.

As a little history, I had posted this on the Hellenic (i.e. Greek) Java User Group, which used to have an online forum, now defunct (too bad). So if you happen to be on those few who happened to see the solution there, just restrain yourself, I don't want any stinkin' spoilers!

Enough of intro. Here is the code:

interface id {
  <id> Void read(List<id> list);
  <id> Void read(List<id> list);
  <id> Void read(List<id>... list); //*
}

The puzzle: What is the shortest edit (measured in inserted/deleted characters) that makes this program compilable? (While keeping the read methods overloaded).

(We're not talking any kind of cheat like compiling with some arbitrary compiler, just the plain Java compiler. Oh, and commenting out lines of code is the wrong answer too).

The problem, obviously, is that the two methods produce a name clash.Well, it's all yours! Hit it hard! Can you nail it?

* [10-Mar-10] As Sureth below points, a solution would be appending "[]" to a list, a 2-character ending. Indeed. Sloppy of me. That basically ruins the fun, so I have to this third overloading to avoid [] and varargs solutions. Now (hopefully) only the intended solution is there, hopefully in not too artificial a puzzle. Oh, by the way, trying this problem is not a waste of time: it reveals a deeper knowledge about the java platform and gives a new (for those unaware of it) technique that can be used in API design. It's already the "magic sauce" of the nice API of a popular testing-related framework.

Saturday, January 24, 2009

A DSL to output XML, in Java

Yesterday, I had to output some simple XML in Java. I couldn't find any decent way of doing this out there. Writing with SAX sucks, and DOM didn't look much better either. I just println'ed the thing out and went on my way.

Today I revisited the problem. I wanted to easily express the XML tree structure and evaluate it as something, a String for instance. I came up with a neat solution heavily based on generics.

Here is a usage example of it:

XmlBuilder xmlBuilder = new XmlBuilder();
String s = Xml.newDocument(new XmlBuilder())
.kid("start")
.kid("kid")
  .kid("subkid").attr("key", "value")
  .close()
.close()
.close();

System.out.println(s);

Note that calling close() sometimes returns a node (that can also close()), but the last time returns directly a String!

This is the output:


<?xml version="1.0"?>
<root>
<child>
  <sub_child key="value">
  </sub_child>
</child>
</root>

Here is the simplified crux of the idea:


public class Node<T> {
private final String name;
private final T returnValue;
private final XmlHandler handler;

Node(String name, T returnValue, XmlHandler handler) {
  this.name = name;
  this.returnValue = returnValue;
  this.handler = handler;
}
//...

Node<Node<T>> child(String label) {
  return new Node<Node<T>>(label, this, handler);
}

T close() {
  //...
  handler.endNode(name);
  return returnValue;
}
}

The magic part is methods child() and close(). Node<T> means "when close() is called, return T". Imagine we want to output an XML as a string. We want the root node to be of type Node<String>, so when that closes, we get a String back. Right. Now pay attention to method to the child() method. It returns a Node<Node<T>>, meaning "here is a node that when it closes, you return back to this node" (and we give "this" as a returnValue of the new node).

This means that the compiler can figure out the nesting level of each Node. The root Node is just Node<String>. All children of that are of type Node<Node<String>>. Children of those children are Node<Node<Node<String>>>, and so on. When a close() is invoked, a nesting goes away, like pealing an onion kind of way, till we get back to the result we want.

XmlBuilder is similar to ContentHandler, but simplified. Here it is:


public interface XmlHandler<T> {
void startNode(String name, Map attributes);
void endNode(String name);

T result();
}

The above library can be used with any implementation of XmlHandler - producing a String is only a viable use-case. One could easily create a DOM as well, simply by creating another XmlHandler implementation. XmlBuilder in the example above is XmlHandler, and this is why the root node is of type Node. We could implement an XmlHandler and get back the root node as Node.

To conclude, this is a nice DSL for expressing XML trees, combined with an abstract handler to process the tree. A point to highlight is that this modelling allows the compiler to know the nesting of each node, so it knows how many consequtive "close()" I can call for example. I must call the right amound of close() to get back to the result I want, or else a type error will occur.

Still, this kind of type-safety doesn't automatically translate to valid XML, because the user could store a node somewhere and close it twice - we would only find that at runtime.

All in all, that was a good exercise in Java generics. I would give the complete source if there was an easy way to attach it here, but I guess the margin is too narrow. :-)

Tuesday, January 20, 2009

The Subtleties of PhantomReference and finalization

A common misconception regarding PhantomReference is that it is designed to "fix" the dead object resurrection problem that finalizers have. For example, this old java.net blog says about phantom references:

PhantomReferences avoid a fundamental problem with finalization: finalize() methods can "resurrect" objects by creating new strong references to them. So what, you say? Well, the problem is that an object which overrides finalize() must now be determined to be garbage in at least two separate garbage collection cycles in order to be collected.

Well, yeah. Except for the fact that even PhantomReference can let their referred objects be resurrected. This is even how it can be done:

Reference ref = referenceQueue.remove();
//ref is our PhantomReference instance
Field f = Reference.class.getDeclaredField("referent");
f.setAccessible(true);
System.out.println("I see dead objects! --> " + f.get(ref));
//This is obviously a very bad practice

Yes, unreachable objects are really referenced from the Reference#referent field, seemingly strongly, but no, the garbage collector makes an exception for that particular field. This fact also contradicts what the previous blog declares:

PhantomReferences are enqueued only when the object is physically removed from memory.

Which is not true, as we just saw. Javadocs also say:

Phantom references are most often used for scheduling pre-mortem cleanup

So if PhantomReference was not meant to fix the resurrection problem (which indeed is serious, as pointedly proven by Lazarus, Jesus and many others), what is it useful for?

The main advantage of using a PhantomReference over finalize() is that finalize() is called by a garbage-collector thread, meaning it introduces concurrency even in a single-threaded program, with all the potential issues (like correctly synchronizing shared state). With a PhantomReference, you choose the thread that dequeues references from your queue (in a single-threaded program, that thread could periodically do this job).

What about using WeakReference? It seems to also fit the bill for pre-mortem clean-up. The difference lies in when exactly the reference is enqueued. A PhantomReference is enqueued after finalization of the object. A WeakReference is enqueued before. This doesn't matter for objects with no non-trivial finalize() method.

And how exactly are you supposed to clean-up a dead object you don't even know? (PhantomReference's get() method always returns null). Well, you should store as much state as needed to perform the clean-up. If cleaning up an object means nulling an element of a global array, then you have to keep track of the element index, for example. This can be easily done by extending the PhantomReference and adding the fields you want, and then create PhantomReference instances from that subclass.

Now lets talk about even darker corners than these.

Lets say that it is even darker relating finalization than this. If you are about to write a clean-up hook for an object (by finalize() or with a [Weak|Phantom]Reference), and you happen to call a method on it while there is it is strongly-referenced only from the thread stack (i.e. a local variable), and you happen to invoke a method to that object, bad things can happen.

This is very unfortunate. For performance reasons, the VM is permitted if it can to reuse the register that holds the object reference, thus making the object unreachable. So, during the method invocation on an object, finalization might be executed concurrently, leading to unpredictable results (finalize() could modify state that is needed by the other method execution). This should be extremely rare though. Currently, this can be fixed by:

Object method() {
 //do work here
 synchronized (this) { }
 return result;
}

public void finalize() {
synchronized (this) { }
//do work here
}

This will only affect you if you have an object referenced only from the thread stack and either one holds:

it has a non-trivial finalize() method
there is a [Weak|Soft|PhantomReference] to it, enlisted to a ReferenceQueue, and there is a different thread that dequeues references from ReferenceQueue

To conclude, the safest clean-up for objects is to have a ReferenceQueue and a PhantomReference, and you use the same thread that uses the object to do the clean-up. (If this the clean-up is performed from another thread, synchronization will probably be needed, and the above issue might be relevant).

You may check these JavaOne slides too.

Sunday, January 18, 2009

Beware of hidden contention of Math.random()

Did you know that Math.random() uses a single java.util.Random instance, which itself is thread-safe*? What this means is that if you have a benchmark with multiple threads that are supposed to run in parallel, and all of them constantly generate random numbers through Math.random(), they will all compete for its lock and can deteriorate into running serially.

So, beware of this pitfall. It is preferrable to share instead a ThreadLocal, as for example in:

static final ThreadLocal localRandom = new ThreadLocal() {

@Override protected Random initialValue() {

return new Random();

}

};

...

Random random = localRandom.get();

This is already quite a common practice when using SimpleDateFormat in web applications, where it is more efficient to have a separate instance per thread rather than have all threads contend for a single instance (which tends to be quite computationally intensive).

A better alternative is slated for inclusion in Java 7: ThreadLocalRandom. This is like a thread local Random, only it is not thread-safe, so some unnecessary overhead is avoided. There is a slight inconvenience that you can't control the initial seed - I could be wrong though (the javadoc says that setSeed throws UnsupportedOperationException, but the implementation seems to be that exactly one call of this method is allowed).

* Using volatile reads and compare-and-swaps, which makes it lock-free

Monday, January 12, 2009

Comments on "Is Java Obsolete?"

Today I read this blog: Is Java obsolete? (spoiler alert: Scala inside)

Apart from the title which used to be an attention-grabbing (along with "java is dead") but not really any more, I am in a quite similar situation as that author.

To begin with, I'm a Java programmer. I've invested (literally) thousand hours of my own time in it the past few years, eagerly learning its ins and outs, and also use it for a day job. I can't attain the same level of productiveness using anything else, without spending (much) much time learning it.

Now, on with the post. It's about someone starting http://www.scala-lang.org/. And doing it, feels (like me) that poor Java-the-language misses so much power (no closures, no FP...). (By the way here is how I picked up Scala: by reading a paper/overview on it, watching a talk by Martin Odersky, and I'm currently reading Programming in Scala).

I played a bit with various scripting languages. I quite like Ruby/Python, but when I found out what performance implications those pertain, I grew very skeptical (see for example this performance essay of Guido and amaze). What I saw was performance characteristics (i.e. "slow"), not in the fixable sense Java was in its start, but not fixable, due to language design itself. So I decided I needed something else (and thankfully, Scala was there all along).

Previously, I also explored some other Java-friendly alternatives, like Groovy. In the words of an anonymous commentator of the above blog: "you might want to have a look at the Groovy language, instead of Scala". Now that's a motivation, heh. Anyway. My experience with groovy was not so good. I looked at it more than a year ago, and the compiler pretty much sucked - I couldn't understand any of the syntax errors. I had the intention of learning it, to at least be able to use it in small-scale scripting tasks, on the console. But that proved futile: it was so easy getting wrong, that I really had to grok most of the groovy language to be able to use it, and that just wouldn't pay off for small-scale scripting tasks. For large-scale, just forget about it. This quote from groovy's main page nicely summarizes the reason:

You will not get compile errors like you would in Java for using undefined members or passing arguments of the wrong type. See Runtime vs Compile time, Static vs Dynamic.

So, I found the comment's last part "... instead of Scala" quite hilarious. Maybe "complementary", but "instead"? Anyway, I don't want to go there just yet.

I remember actually settling for BeanShell, since it was much easier to use. Less powerful, true, but not so cumbersome - easy to put to actual work.

But those are things of the past. I'm quite happy to be able to have both 1) a very powerful oo-functional language, and 2) a very powerful type systems. Types, after all, is the natural currency of Java (as nicely said by crazy bob lee). I tend to view types as language extentions, i.e. "things you can do if you reach here". It's like having the author of a library there with you, giving you advice and direction all the time, whereas in dynamic languages all you have is a string soup where you have to attain the skill of choosing the right string to do the job (and while being at it, write a test proving you picked the right string).

To sum up, I can't picture groovy, or any dynamically typed languages replacing java anytime soon. I also don't see scala replacing java anytime soon, either. But I like the ability to enjoy more powerful programming in scala, while the rest of the world can keep using plain java, and all be peaceful and well.

Enough rambling from me, on with actual work :)