Distributing jobs via hashmaths
Jun 10, 2013 | categories: python, aosa, fedmsg, fedora

As things stand, we can't load balance across multiple fedmsg-hub daemons. (We can and do balance across multiple fedmsg-gateway instances, though -- that is another story.)
For the hubs, though, here's a scheme that might work. However... is it fast enough?
```python
#!/usr/bin/env python
""" Distribute jobs via.. "hashspace"?

Actually give this a run.  See what it does.

I learned about it from the mailman3 chapter in AOSA
http://www.aosabook.org/en/mailman.html
"""

import json
import hashlib


class Consumer(object):
    def __init__(self, i, n):
        self.i = i
        self.n = n

    def should_I_process_this(self, msg):
        """ True if this message falls in "our" portion of the hash space.

        This seems great, except I bet it's pretty expensive.  Can you
        come up with an equivalent, "good enough" method that is more
        efficient?
        """
        as_a_string = json.dumps(msg)
        as_a_hash = hashlib.md5(as_a_string).hexdigest()
        as_a_long = int(as_a_hash, 16)
        return (as_a_long % self.n) == self.i


def demonstration(msg):
    """ Handy printing utility to show the results.

    (Relies on the module-level ``consumers`` list built below.)
    """
    print "* Who takes this message? %r" % msg
    for consumer in consumers:
        print "  *", consumer.i, "of", consumer.n, "hubs.",
        print " Process this one?", consumer.should_I_process_this(msg)


if __name__ == '__main__':
    # Say we have 4 moksha-hubs each running the same set of consumers.
    # For story-telling sake, we'll say we're dealing here with the
    # datanommer consumer.  Let's pretend it has to do some heavy scrubbing
    # on the message before it throws it in the DB, so we need to
    # load-balance that.

    # As things stand now with fedmsg.. we can't do that *hand waving*.
    # This is a potential scheme to make it possible.

    # We have 4 moksha-hubs, one on each of 4 machines.
    N = 4
    consumers = [Consumer(i, N) for i in range(N)]

    # Fedmsg delivers a message.  All 4 of our hubs receive it.
    # They each take the md5 sum of the message, convert that to a long,
    # and then mod that by the number of moksha-hubs.  If that remainder is
    # *their* id, then they claim the message and process it.
    demonstration(msg={'blah': 'I am a message, lol.'})
    demonstration(msg={'blah': 'I am a different message.'})
    demonstration(msg={'blah': 'me too.'})

    # Since md5 sums are "effectively" random, this should distribute
    # across all our nodes mostly-evenly.
```
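One property worth calling out: for any given message, exactly one of the N hubs claims it, since the remainder mod N matches exactly one consumer id. Here's a quick sanity check of that, meant to run in the context of the script above (the `claims` list is just illustrative):

```python
# For any message, exactly one consumer should claim it, because
# (as_a_long % n) equals exactly one of the ids 0..n-1.
msg = {'blah': 'I am a message, lol.'}
claims = [consumer.should_I_process_this(msg) for consumer in consumers]
assert claims.count(True) == 1
```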
As I was typing this post up, Toshio Kuratomi mentioned that I should look into zlib.adler32 and binascii.crc32 if I am concerned about speed (which I am).
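For the record, here's roughly what that swap might look like -- a minimal sketch, assuming the same json.dumps serialization as above (the function name here is mine, not anything from fedmsg). Both checksums are much cheaper than md5, at the cost of a weaker (but still roughly uniform) distribution over the hash space:

```python
import json
import binascii
import zlib


def should_I_process_this_cheaply(msg, i, n):
    """ Same idea as above, but with a 32-bit checksum instead of md5. """
    as_bytes = json.dumps(msg).encode('utf-8')
    # Mask to an unsigned 32-bit value; crc32 (and adler32) can return
    # negative ints on some Python versions/platforms.
    checksum = binascii.crc32(as_bytes) & 0xffffffff
    # zlib.adler32(as_bytes) & 0xffffffff would work the same way.
    return (checksum % n) == i
```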
Perhaps some benchmarking is in order?
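Something like this would do -- a rough timeit sketch rather than a rigorous benchmark (absolute numbers will vary by machine and message size):

```python
import timeit

setup = """
import json, hashlib, zlib, binascii
payload = json.dumps({'blah': 'I am a message, lol.'}).encode('utf-8')
"""

candidates = [
    "int(hashlib.md5(payload).hexdigest(), 16) % 4",
    "(binascii.crc32(payload) & 0xffffffff) % 4",
    "(zlib.adler32(payload) & 0xffffffff) % 4",
]

for stmt in candidates:
    elapsed = timeit.timeit(stmt, setup=setup, number=100000)
    print("%-50s %fs" % (stmt, elapsed))
```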