Saturday, January 24, 2009

A DSL to output XML, in Java

Yesterday, I had to output some simple XML in Java. I couldn't find any decent way of doing this out there. Writing with SAX sucks, and DOM didn't look much better either. I just println'ed the thing out and went on my way.

Today I revisited the problem. I wanted to easily express the XML tree structure and evaluate it as something, a String for instance. I came up with a neat solution heavily based on generics.

Here is a usage example of it:



XmlBuilder xmlBuilder = new XmlBuilder();
String s = Xml.newDocument(new XmlBuilder())
.kid("start")
.kid("kid")
.kid("subkid").attr("key", "value")
.close()
.close()
.close();

System.out.println(s);

Note that calling close() sometimes returns a node (that can also close()), but the last time returns directly a String!

This is the output:



<?xml version="1.0"?>
<root>
<child>
<sub_child key="value">
</sub_child>
</child>
</root>


Here is the simplified crux of the idea:



public class Node<T> {
private final String name;
private final T returnValue;
private final XmlHandler handler;

Node(String name, T returnValue, XmlHandler handler) {
this.name = name;
this.returnValue = returnValue;
this.handler = handler;
}
//...

Node<Node<T>> child(String label) {
return new Node<Node<T>>(label, this, handler);
}

T close() {
//...
handler.endNode(name);
return returnValue;
}
}
The magic part is methods child() and close(). Node<T> means "when close() is called, return T". Imagine we want to output an XML as a string. We want the root node to be of type Node<String>, so when that closes, we get a String back. Right. Now pay attention to method to the child() method. It returns a Node<Node<T>>, meaning "here is a node that when it closes, you return back to this node" (and we give "this" as a returnValue of the new node).

This means that the compiler can figure out the nesting level of each Node. The root Node is just Node<String>. All children of that are of type Node<Node<String>>. Children of those children are Node<Node<Node<String>>>, and so on. When a close() is invoked, a nesting goes away, like pealing an onion kind of way, till we get back to the result we want.

XmlBuilder is similar to ContentHandler, but simplified. Here it is:


public interface XmlHandler<T> {
void startNode(String name, Map attributes);
void endNode(String name);

T result();
}


The above library can be used with any implementation of XmlHandler - producing a String is only a viable use-case. One could easily create a DOM as well, simply by creating another XmlHandler implementation. XmlBuilder in the example above is XmlHandler, and this is why the root node is of type Node. We could implement an XmlHandler and get back the root node as Node.

To conclude, this is a nice DSL for expressing XML trees, combined with an abstract handler to process the tree. A point to highlight is that this modelling allows the compiler to know the nesting of each node, so it knows how many consequtive "close()" I can call for example. I must call the right amound of close() to get back to the result I want, or else a type error will occur.

Still, this kind of type-safety doesn't automatically translate to valid XML, because the user could store a node somewhere and close it twice - we would only find that at runtime.

All in all, that was a good exercise in Java generics. I would give the complete source if there was an easy way to attach it here, but I guess the margin is too narrow. :-)



Tuesday, January 20, 2009

The Subtleties of PhantomReference and finalization

A common misconception regarding PhantomReference is that it is designed to "fix" the dead object resurrection problem that finalizers have. For example, this old java.net blog says about phantom references:

PhantomReferences avoid a fundamental problem with finalization: finalize() methods can "resurrect" objects by creating new strong references to them. So what, you say? Well, the problem is that an object which overrides finalize() must now be determined to be garbage in at least two separate garbage collection cycles in order to be collected.

Well, yeah. Except for the fact that even PhantomReference can let their referred objects be resurrected. This is even how it can be done:

Reference ref = referenceQueue.remove();
//ref is our PhantomReference instance
Field f = Reference.class.getDeclaredField("referent");
f.setAccessible(true);
System.out.println("I see dead objects! --> " + f.get(ref));
//This is obviously a very bad practice
Yes, unreachable objects are really referenced from the Reference#referent field, seemingly strongly, but no, the garbage collector makes an exception for that particular field. This fact also contradicts what the previous blog declares:
PhantomReferences are enqueued only when the object is physically removed from memory.
Which is not true, as we just saw. Javadocs also say:
Phantom references are most often used for scheduling pre-mortem cleanup
So if PhantomReference was not meant to fix the resurrection problem (which indeed is serious, as pointedly proven by Lazarus, Jesus and many others), what is it useful for?

The main advantage of using a PhantomReference over finalize() is that finalize() is called by a garbage-collector thread, meaning it introduces concurrency even in a single-threaded program, with all the potential issues (like correctly synchronizing shared state). With a PhantomReference, you choose the thread that dequeues references from your queue (in a single-threaded program, that thread could periodically do this job).

What about using WeakReference? It seems to also fit the bill for pre-mortem clean-up. The difference lies in when exactly the reference is enqueued. A PhantomReference is enqueued after finalization of the object. A WeakReference is enqueued before. This doesn't matter for objects with no non-trivial finalize() method.

And how exactly are you supposed to clean-up a dead object you don't even know? (PhantomReference's get() method always returns null). Well, you should store as much state as needed to perform the clean-up. If cleaning up an object means nulling an element of a global array, then you have to keep track of the element index, for example. This can be easily done by extending the PhantomReference and adding the fields you want, and then create PhantomReference instances from that subclass.

Now lets talk about even darker corners than these.

Lets say that it is even darker relating finalization than this. If you are about to write a clean-up hook for an object (by finalize() or with a [Weak|Phantom]Reference), and you happen to call a method on it while there is it is strongly-referenced only from the thread stack (i.e. a local variable), and you happen to invoke a method to that object, bad things can happen.

This is very unfortunate. For performance reasons, the VM is permitted if it can to reuse the register that holds the object reference, thus making the object unreachable. So, during the method invocation on an object, finalization might be executed concurrently, leading to unpredictable results (finalize() could modify state that is needed by the other method execution). This should be extremely rare though. Currently, this can be fixed by:



Object method() {
//do work here
synchronized (this) { }
return result;
}


public void finalize() {
synchronized (this) { }
//do work here
}
This will only affect you if you have an object referenced only from the thread stack and either one holds:
  • it has a non-trivial finalize() method
  • there is a [Weak|Soft|PhantomReference] to it, enlisted to a ReferenceQueue, and there is a different thread that dequeues references from ReferenceQueue
To conclude, the safest clean-up for objects is to have a ReferenceQueue and a PhantomReference, and you use the same thread that uses the object to do the clean-up. (If this the clean-up is performed from another thread, synchronization will probably be needed, and the above issue might be relevant).

You may check these JavaOne slides too.



Sunday, January 18, 2009

Beware of hidden contention of Math.random()

Did you know that Math.random() uses a single java.util.Random instance, which itself is thread-safe*? What this means is that if you have a benchmark with multiple threads that are supposed to run in parallel, and all of them constantly generate random numbers through Math.random(), they will all compete for its lock and can deteriorate into running serially.

So, beware of this pitfall. It is preferrable to share instead a ThreadLocal, as for example in:

static final ThreadLocal localRandom = new ThreadLocal() {
    @Override protected Random initialValue() {
        return new Random();
    }
};

...
Random random = localRandom.get();

This is already quite a common practice when using SimpleDateFormat in web applications, where it is more efficient to have a separate instance per thread rather than have all threads contend for a single instance (which tends to be quite computationally intensive).

A better alternative is slated for inclusion in Java 7: ThreadLocalRandom. This is like a thread local Random, only it is not thread-safe, so some unnecessary overhead is avoided. There is a slight inconvenience that you can't control the initial seed - I could be wrong though (the javadoc says that setSeed throws UnsupportedOperationException, but the implementation seems to be that exactly one call of this method is allowed).


* Using volatile reads and compare-and-swaps, which makes it lock-free