I was quite excited by the introduction of the hashCode and equals methods to custom classes in Winter ’13, as I was finally going to be able to get a map to behave the way that I wanted it to. I started to write such a map and a post to go with it when I started to discover a few, shall we say, nuances about what was necessary to implement these methods – in particular the hashCode one. I will follow up with my magic map post but for the meantime I want to discuss the implementation of hashCode method.
I never studied computer science, or indeed maths in enough depth to know the intricacies of hash codes but I know enough to know that good hashing methods are hard to come by and potentially are computationally expensive. Given both of these facts I came up with a cunning plan when I was designing my hashCode method, in fact it was more cunning than a… never mind. My plan was thus:
First of all I would create a string representation of my object by making use of the JSON serializer and once I had this string I would simply call it’s hashCode() method.
The beauty of this plan is that I would be leveraging the internal hashing algorithm on the platform, which should be great, well optimized, well dispersed etc and I would only be using two statements to get said hash code. (I could have made it one I know but for clarity I kept it at two) Basically I would get a great, generic, hash generator without using a million lines of Apex and hence continuing to play nicely with the governor limits.
Now the thing about plans as cunning as this is that they normally don’t work and whilst this is a gross generalisation in this case it was the unfortunate truth. It didn’t work. The problem is that the hashCode method isn’t exposed on the string object. I must admit I found this a little odd: in order to use a custom class in a set then we have to implement this public method. Given that a string can be used in a set I would have expected the same pubic method to exist. I’m sure that it does exist the problem is that it hasn’t been exposed to use mere mortals. Whilst I’m on this subject I will mention my general confusion at the implementation of this feature in general. I would have expected an interface to implement or indeed a virtual class to override. However instead we’re asked to add two methods randomly to each class. I’m not a great fan of this – there is obviously some jiggery going on as the code is compiled and I think that rather than hiding this it would have been better to expose it to us. Heading down this route flies in the face of some OO principles and as such, in my eyes, makes the code much more confusing and potentially unclear than it needs to be; but what do I know?
Anyway, putting this to one side and returning to my hunt for a good hashing method. My next thought was to have a look around the web for hashing algorithms and investigate implementing one in Apex. After a bit of hunting I came across the FNV algorithm – it is spoken well of and seems easy enough to implement. Easy enough that is in most other languages but Apex as ever presents it’s challenges. The problem comes in the fact that you need to operate on an octet of data – just one byte. Now using a string as our starting point it’s quite difficult to extract just one byte of data from it. You can split the string into individual characters but you have no way of getting back to the integer that represents that character. A shame really given that you can do this process the other way around: you can use string.fromCharArray() to convert a list of integers into a string. Oh well, this is part of the fun of this platform – find the your way around the obstacles.
So, how to navigate around this small issue then? Well we could build a static map that lists all possible characters and maps them to an integer. This could work, although with some of the extended character sets this could become rather painful to build, not to mention the fact that we need to keep our value less that 256 to stay with our 8 bits of data. We need to represent our string of data some other way – some way that has a limited character set. Base64 was designed exactly for this purpose so we could base64 encode our string – the size of the map is limited and the platform offers us a way to do this. You could also convert your string to a hex representation – the map is even smaller in this case, although the resulting string would be longer. To be honest I think this is a personal preference thing, I have no string thoughts about going with either solution. With this adaptation we can now nicely execute the FNV algorithm in Apex and it ends up looking something like this:
It’s beautiful and it works. However it has a small flaw that is very much specific to the platform. It is computationally expensive in fact, in terms of script statements it’s almost unbounded. Only really limited by the size of the string put into it, which in turn is only limited by the size of the heap! So whilst we have a way of generating a good hash code we need to do something about controlling it’s size. Ironically the hing about a hash code is that it is always the same length, no matter what input you give it and it’s this realisation that leads me to the solution to this problem. The somewhat odd solution is in fact to hash the string before putting through our FNV hashing code. The best part of this is that we can make use of standard platform functions to do this for us. The crypto class allows us to generate a HMAC from any input – ta da! This is fantastic we can now in one script statement compress any string into a reasonably unique (very very low chance of collision) string and best of all it will always be the same length, so we can now predict the effort needed by of hash code method to compute our object’s hash.
At this point I can imagine what you’re thinking – why on earth didn’t you just make use of the built in hash function right at the start. Well I thought about it but it returns a string and our hash code method needs to return an integer. So I would still need a way to convert a string to an integer – which is what the FNV piece of code will do for us.
Putting all of the pieces together we get something that looks like this:
I have kept the conversion to JSON in there as it allows me to provide a generic function for finding an objects hash code, although if you need to make use of internal state to uniquely identify your objects you would need to modify this piece. I then create a HMAC of the JSON and convert it to a hex value. This hex string is then broken down and character by character mapped to an integer which acts as our octet of data in the FNV code. Then, after all of that, we’re left with an integer which we can use as our hash code.
I have to admit this does seem like an awfully large amount of work to simply generate an integer although as I mentioned before finding a good hashing algorithms isn’t trivial stuff. And it is further complicated by needing to stick within governor limits. I feel as though what I have come up with here is a good compromise that makes use of standard platform functionality where possible and hopefully provides a decent dispersion. However, as I mentioned before I don’t really know much about the maths behind hashing and am very happy to have any flaws pointed out to me. (I’m sure I’ve made a rookie mistake somewhere in there)
As I said before it would be great if the platform exposed some way for us to do this – be it exposing their own native hash methods on a string or indeed providing an object.hash method that all objects have access to. But in the meantime I will be using this method to help me create custom sets and maps.
As a small side note: you may have noticed a couple of odd lines in the code, where I convert a string into a long. Trust me, I believe this has to be in here… I don’t want it to be but there’s what looks like a small platform bug, after I have done some more testing on this I shall report back.