Eugene has posted 1 posts at DZone. View Full User Profile

Immutable data structures in groovy++. Why not?

02.04.2010
| 7502 views |
  • submit to reddit

Immutable data structures are a hot topic these days. Indeed, they seem to solve the fundamental problem present in all destructively updated data structures: either you have to synchronize all accesses to the state of the data structure, or you risk that one thread will see some inconsistent state while the other is performing some update. So, no shared state - no problem: just generate a new collection, and atomically set the shared variable to it using some CAS(compare-and-swap) instruction, persent in almost all modern processors. But wait, I need to copy the entire collection when adding a single element? No, thank you:)

Fortunately this problem has been solved a long time ago. The solution comes in the form of data sharing: you don't need to copy everything to produce new collection, most of the data will be shared with the old one. Sounds puzzling? Continue reading, and I'll show you how this can be coded in static groovy.

The collection I've chosen is the familiar hash map, the one everybody coding Java probably knows. The implementation is also nothing new, actually it comes from Clojure, where immutable data structures are called persistent and are used quite heavily.  But you don't need to learn Clojure to enjoy efficient immutable data structures, static Groovy to the rescue.

 So to start with, the main idea behind sharing is representing everything as a tree (Trie to be more precise, but we are not computer scientists, right?:) This allows to perform adds and removes using only a logarithm of the data structure size to generate the new one. How can hash map be viewed as a tree? Simply dividing a 32 bit hash into groups of 5 bits that correspond to tree levels. This gives us 6 levels, so the height of the tree is at most 6, while we still can keep up to 2^30 objects in the map if we store actual objects just in the leaves, which should be more than enough.

So now  to the code.

@Typed

abstract class FHashMap<K, V> implements Iterable<Map.Entry<K,V>> {

final V getAt(K key) { getAt(key, key.hashCode()) }


final FHashMap<K, V> put(K key, V value) {
update(0, key, key.hashCode(), value)
}

final FHashMap<K, V> remove(K key) {
remove(key, key.hashCode())
}

protected abstract V getAt(K key, int hash)

protected abstract FHashMap<K,V> update(int shift, K key, int hash, V value)

protected abstract FHashMap<K,V> remove(K key, int hash)

}
 
Pretty straightforward, isn't it? We provide get/add/remove operations for maps. I'll show just add operation, others folllow similar ideas.

We start with empty tree:

class EmptyNode extends FHashMap<K,V> {
FHashMap<K,V> update(int shift, K key, int hash, V value) { new LeafNode(hash, key, value) }
}

Again nothing fancy: when adding to the empty tree, we produce the tree that consists of just one leaf node. Next let's introduce a base class for all leaves in the tree, where we'll store key/value bindings:

abstract class SingleNode extends FHashMap<K,V> {
final int hash

BitmappedNode bitmap(int shift, int hash, K key, V value) {
def shift1 = (getHash() >>> shift) & 0x1f
def shift2 = (hash >>> shift) & 0x1f
def table = new FHashMap<K,V>[Math.max(shift1, shift2) + 1]
table[shift1] = this
def bits1 = 1 << shift1
def bits2 = 1 << shift2
if (shift1 == shift2) {
table[shift2] = table[shift2].update(shift + 5, key, hash, value)
} else {
table[shift2] = new LeafNode(hash, key, value)
}
return new BitmappedNode(shift, bits1 | bits2, table)
}
}

So SingleNode has a hash stored and a method to produce BitmappedNode, which represents the node with more than 1 (but less than 32) children. The parameters given are the level we are inserting at, and hash, key and value inserted respectively. The method starts by calculating a group of 5 bits both from the hash stored of the current key and the one to be added. Then it creates an actual bitmap, i.e. an array of at most 32 elements and sets the relevant element to itself. Then the interesting part: if the added element has the same 5 bits then we split the node calling update recursively with the next level, otherwise we just create a new leaf. Finally a new BitmappedNode is returned with the table created and bits set. Not very easy code, but definitely fun both to write and to read:)

Now there are 2 implementations of SingleNode, LeafNode and CollisionNode, which represents the node storing multiple bindings in case a hash collision occurred.

class LeafNode extends SingleNode {
final K key
final V value

FHashMap<K,V> update(int shift, K key, int hash, V value) {
if (this.key == key) {
if (this.value == value) return this else return new LeafNode(hash, key, value)
} else if (this.hash == hash) {
return new CollisionNode(hash, FList.emptyList + new BucketElement(this.key, this.value) + new BucketElement(key, value))
} else {
return bitmap(shift, hash, key, value)
}
}
}

class CollisionNode extends SingleNode {
final FList<BucketElement<K, V>> bucket

FHashMap<K,V> update(int shift, K key, int hash, V value) {
if (this.hash == hash) {
return new CollisionNode(hash, removeBinding(key, bucket) + [key, value])
            } else {
return bitmap(shift, hash, key, value)
}
}
}

Now, you might wonder what is BucketElement and removeBinding? I've left the definitions out, but I hope the names speak for themselves. What I would like to mention though is FList which is functional list, available in static groovy standard library. Unlike java.util.List it only supports appending a new element to the list and getting the head and tail of the list (those familiar with scala or some other functional language should feel at home).  Also notice how groovy, while still having a Java-like syntax, allows for a much cleaner definitions than Java. And since the binding is static, the performance does not suffer, so no price to pay:)

 Finally here are the definitions of the last nodes that make up immutable hash map implementation:

class BitmappedNode<K,V> extends FHashMap<K,V> {
int shift
int bits
FHashMap<K,V> [] table

FHashMap<K,V> update(int shift, K key, int hash, V value) {
def i = (hash >>> shift) & 0x1f
def mask = 1 << i
if (bits & mask) {
def node = table[i].update(shift + 5, key, hash, value)
if (node === table[i]) return this else {
def newTable = new FHashMap<K,V>[table.length]
System.arraycopy table, 0, newTable, 0, table.length
newTable[i] = node
return new BitmappedNode(shift, bits, newTable)
}
} else {
def newTable = new FHashMap<K,V>[Math.max(table.length, i + 1)]
System.arraycopy table, 0, newTable, 0, table.length
newTable[i] = new LeafNode(hash, key, value)
def newBits = bits | mask
if (newBits == ~0) {
return new FullNode(shift, newTable)
} else {
return new BitmappedNode(shift, newBits, newTable)
}
}
}
}
// Just a specialization of BitmappedNode for the case when all bits are set.
class FullNode extends FHashMap<K,V> {
int shift
FHashMap<K,V>[] table

FHashMap<K,V> update(int shift, K key, int hash, V value) {
def i = (hash >>> shift) & 0x1f
def node = table[i].update(shift + 5, key, hash, value)
if (node === table[i]) return this else {
def newTable = new FHashMap<K,V>[32]
System.arraycopy table, 0, newTable, 0, 32
newTable[i] = node
return new FullNode(shift, newTable)
}
}
}

Again I won't go into much detail here, the curious reader can easily follow what's going on. One thing worth mentioing still is that update is naturally recursive: in order to insert an element at some level we first insert it on the next and analyze the result. Also note that '===' means reference equality in static groovy, which makes the recursiive result analysis much cheaper.

One final remark after we are done with the code. As you might see we allocate a new BitmappedNode when it has at least 2 children. If these children are far from each other, we can potentially lose 30 out of 32 slots in the table, which might be seen as a memory leak. The good news is however that you are likely to put objects with 'near' hash codes into the map, and for this situation the waste would be minimal. Still you should remember that this collection is tailored towards locality sensitive hashing http://en.wikipedia.org/wiki/Locality_sensitive_hashing and the usual trick to shuffle the bits got from identity hash code calculation makes things just worse.

 So as you might see, implementing efficient immutable data structures is not rocket science, especially when you have an expressive language at hand. If you want a full version of the described class, please download static groovy distribution available at http://code.google.com/p/groovypptest/downloads/list and you'll see all the sources of standard library.

Best regards and until next time,

Eugene.

 

Published at DZone with permission of its author, Eugene Vigdorchik.

(Note: Opinions expressed in this article and its replies are the opinions of their respective authors and not those of DZone, Inc.)

Tags:

Comments

Václav Pech replied on Thu, 2010/02/04 - 12:34pm

Very inspiring read, Eugene. Looks like something an STM implementation could be built on.

Eugene Vigdorchik replied on Thu, 2010/02/04 - 1:43pm

Vaclav, yes, this is a perfect data structure to hold transaction-level state.

Comment viewing options

Select your preferred way to display the comments and click "Save settings" to activate your changes.