|
Peter Haidinyak
2011-03-16, 22:36
Andrey Stepachev
2011-03-16, 22:48
Ted Dunning
2011-03-17, 07:23
Pete Haidinyak
2011-03-17, 07:30
Michael Segel
2011-03-17, 15:21
Chris Tarnas
2011-03-17, 16:20
Peter Haidinyak
2011-03-17, 16:44
Ted Dunning
2011-03-17, 17:23
Ted Dunning
2011-03-17, 17:23
Chris Tarnas
2011-03-17, 18:19
Ted Dunning
2011-03-17, 18:32
Peter Haidinyak
2011-03-17, 18:43
Chris Tarnas
2011-03-17, 18:48
|
-
OT - Hash Code CreationPeter Haidinyak 2011-03-16, 22:36
Hi,
This is a little off topic but this group seems pretty swift so I thought I would ask. I am aggregating a day's worth of log data which means I have a Map of over 24 million elements. What would be a good algorithm to use for generating Hash Codes for these elements that cut down on collisions? I application starts out reading in a log (144 logs in all) in about 20 seconds and by the time I reach the last log it is taking around 120 seconds. The extra 100 seconds have to do with Hash Table Collisions. I've played around with different Hashing algorithms and cut the original time from over 300 seconds to 120 but I know I can do better. The key I am using for the Map is an alpha-numeric string that is approximately 16 character long with the last 4 or 5 character being the most unique. Any ideas? Thanks -Pete
-
Re: OT - Hash Code CreationAndrey Stepachev 2011-03-16, 22:48
Try hash table with double hashing.
Something like this http://www.java2s.com/Code/Java/Collections-Data-Structure/Hashtablewithdoublehashing.htm 2011/3/17 Peter Haidinyak <[EMAIL PROTECTED]> > Hi, > This is a little off topic but this group seems pretty swift so I > thought I would ask. I am aggregating a day's worth of log data which means > I have a Map of over 24 million elements. What would be a good algorithm to > use for generating Hash Codes for these elements that cut down on > collisions? I application starts out reading in a log (144 logs in all) in > about 20 seconds and by the time I reach the last log it is taking around > 120 seconds. The extra 100 seconds have to do with Hash Table Collisions. > I've played around with different Hashing algorithms and cut the original > time from over 300 seconds to 120 but I know I can do better. > The key I am using for the Map is an alpha-numeric string that is > approximately 16 character long with the last 4 or 5 character being the > most unique. > > Any ideas? > > Thanks > > -Pete >
-
Re: OT - Hash Code CreationTed Dunning 2011-03-17, 07:23
Double hashing is a find thing. To actually answer the question, though, I
would recommend Murmurhash or JOAAT ( http://en.wikipedia.org/wiki/Jenkins_hash_function) On Wed, Mar 16, 2011 at 3:48 PM, Andrey Stepachev <[EMAIL PROTECTED]> wrote: > Try hash table with double hashing. > Something like this > > http://www.java2s.com/Code/Java/Collections-Data-Structure/Hashtablewithdoublehashing.htm > > 2011/3/17 Peter Haidinyak <[EMAIL PROTECTED]> > > > Hi, > > This is a little off topic but this group seems pretty swift so I > > thought I would ask. I am aggregating a day's worth of log data which > means > > I have a Map of over 24 million elements. What would be a good algorithm > to > > use for generating Hash Codes for these elements that cut down on > > collisions? I application starts out reading in a log (144 logs in all) > in > > about 20 seconds and by the time I reach the last log it is taking around > > 120 seconds. The extra 100 seconds have to do with Hash Table Collisions. > > I've played around with different Hashing algorithms and cut the original > > time from over 300 seconds to 120 but I know I can do better. > > The key I am using for the Map is an alpha-numeric string that is > > approximately 16 character long with the last 4 or 5 character being the > > most unique. > > > > Any ideas? > > > > Thanks > > > > -Pete > > >
-
Re: OT - Hash Code CreationPete Haidinyak 2011-03-17, 07:30
Thanks, I'll give that a try.
-Pete On Thu, 17 Mar 2011 00:23:00 -0700, Ted Dunning <[EMAIL PROTECTED]> wrote: > Double hashing is a find thing. To actually answer the question, > though, I > would recommend Murmurhash or JOAAT ( > http://en.wikipedia.org/wiki/Jenkins_hash_function) > > On Wed, Mar 16, 2011 at 3:48 PM, Andrey Stepachev <[EMAIL PROTECTED]> > wrote: > >> Try hash table with double hashing. >> Something like this >> >> http://www.java2s.com/Code/Java/Collections-Data-Structure/Hashtablewithdoublehashing.htm >> >> 2011/3/17 Peter Haidinyak <[EMAIL PROTECTED]> >> >> > Hi, >> > This is a little off topic but this group seems pretty swift >> so I >> > thought I would ask. I am aggregating a day's worth of log data which >> means >> > I have a Map of over 24 million elements. What would be a good >> algorithm >> to >> > use for generating Hash Codes for these elements that cut down on >> > collisions? I application starts out reading in a log (144 logs in >> all) >> in >> > about 20 seconds and by the time I reach the last log it is taking >> around >> > 120 seconds. The extra 100 seconds have to do with Hash Table >> Collisions. >> > I've played around with different Hashing algorithms and cut the >> original >> > time from over 300 seconds to 120 but I know I can do better. >> > The key I am using for the Map is an alpha-numeric string that is >> > approximately 16 character long with the last 4 or 5 character being >> the >> > most unique. >> > >> > Any ideas? >> > >> > Thanks >> > >> > -Pete >> >
-
RE: OT - Hash Code CreationMichael Segel 2011-03-17, 15:21
Why not keep it simple? Use a SHA-1 hash of your key. See: http://codelog.blogial.com/2008/09/13/password-encryption-using-sha1-md5-java/ (This was just the first one I found and there are others...) So as long as your key is unique, the sha-1 hash should also be unique. The reason I suggest sha-1 is that it ships as part of Java SE (security package I think) so its always there and its unique enough. (While the chances of a collision are theoretically possible, no one has found one yet. You could be the first. :-) ) HTH -Mike > From: [EMAIL PROTECTED] > Date: Thu, 17 Mar 2011 00:23:00 -0700 > Subject: Re: OT - Hash Code Creation > To: [EMAIL PROTECTED] > CC: [EMAIL PROTECTED] > > Double hashing is a find thing. To actually answer the question, though, I > would recommend Murmurhash or JOAAT ( > http://en.wikipedia.org/wiki/Jenkins_hash_function) > > On Wed, Mar 16, 2011 at 3:48 PM, Andrey Stepachev <[EMAIL PROTECTED]> wrote: > > > Try hash table with double hashing. > > Something like this > > > > http://www.java2s.com/Code/Java/Collections-Data-Structure/Hashtablewithdoublehashing.htm > > > > 2011/3/17 Peter Haidinyak <[EMAIL PROTECTED]> > > > > > Hi, > > > This is a little off topic but this group seems pretty swift so I > > > thought I would ask. I am aggregating a day's worth of log data which > > means > > > I have a Map of over 24 million elements. What would be a good algorithm > > to > > > use for generating Hash Codes for these elements that cut down on > > > collisions? I application starts out reading in a log (144 logs in all) > > in > > > about 20 seconds and by the time I reach the last log it is taking around > > > 120 seconds. The extra 100 seconds have to do with Hash Table Collisions. > > > I've played around with different Hashing algorithms and cut the original > > > time from over 300 seconds to 120 but I know I can do better. > > > The key I am using for the Map is an alpha-numeric string that is > > > approximately 16 character long with the last 4 or 5 character being the > > > most unique. > > > > > > Any ideas? > > > > > > Thanks > > > > > > -Pete > > > > >
-
Re: OT - Hash Code CreationChris Tarnas 2011-03-17, 16:20
With 24 million elements you'd probably want a 64bit hash to minimize the risk of collision, the rule of thumb is with 64bit hash key expect a collision when you reach about 2^32 elements in your set. I half of a 128bit MD5 sum (a cryptographic hash so you can only use parts of it if you want) as that is readily available in our systems and so far has not been a bottleneck. I believe there is now a 64bit murmurhash, that would be faster to compute and ideal for what you want. I've been using base-64 encoding when I use my hashes as rowkeys - makes them printable while still being fairly dense, IIRC a 64bit key should be only 11 characters.
-chris On Mar 17, 2011, at 12:30 AM, Pete Haidinyak wrote: > Thanks, I'll give that a try. > > -Pete > > On Thu, 17 Mar 2011 00:23:00 -0700, Ted Dunning <[EMAIL PROTECTED]> wrote: > >> Double hashing is a find thing. To actually answer the question, though, I >> would recommend Murmurhash or JOAAT ( >> http://en.wikipedia.org/wiki/Jenkins_hash_function) >> >> On Wed, Mar 16, 2011 at 3:48 PM, Andrey Stepachev <[EMAIL PROTECTED]> wrote: >> >>> Try hash table with double hashing. >>> Something like this >>> >>> http://www.java2s.com/Code/Java/Collections-Data-Structure/Hashtablewithdoublehashing.htm >>> >>> 2011/3/17 Peter Haidinyak <[EMAIL PROTECTED]> >>> >>> > Hi, >>> > This is a little off topic but this group seems pretty swift so I >>> > thought I would ask. I am aggregating a day's worth of log data which >>> means >>> > I have a Map of over 24 million elements. What would be a good algorithm >>> to >>> > use for generating Hash Codes for these elements that cut down on >>> > collisions? I application starts out reading in a log (144 logs in all) >>> in >>> > about 20 seconds and by the time I reach the last log it is taking around >>> > 120 seconds. The extra 100 seconds have to do with Hash Table Collisions. >>> > I've played around with different Hashing algorithms and cut the original >>> > time from over 300 seconds to 120 but I know I can do better. >>> > The key I am using for the Map is an alpha-numeric string that is >>> > approximately 16 character long with the last 4 or 5 character being the >>> > most unique. >>> > >>> > Any ideas? >>> > >>> > Thanks >>> > >>> > -Pete >>> > >
-
RE: OT - Hash Code CreationPeter Haidinyak 2011-03-17, 16:44
Hash Code in Object is limited to an int and a quick look at HashMap and Trove's HashMap looks like they are only using 31 bits of that. I am now trying a modified version of what Ted pointed at and it seems to be working very well. I modified the original since only the last few bytes in the key are usually unique so I start at both ends when creating the Hash Code. So far I am half way through an import and the times went down from 24 seconds to 11 seconds on the first few files and have been holding around 13 seconds at half way vs 45 seconds with the old method.
Thanks -Pete -----Original Message----- From: Christopher Tarnas [mailto:[EMAIL PROTECTED]] On Behalf Of Chris Tarnas Sent: Thursday, March 17, 2011 9:21 AM To: [EMAIL PROTECTED] Subject: Re: OT - Hash Code Creation With 24 million elements you'd probably want a 64bit hash to minimize the risk of collision, the rule of thumb is with 64bit hash key expect a collision when you reach about 2^32 elements in your set. I half of a 128bit MD5 sum (a cryptographic hash so you can only use parts of it if you want) as that is readily available in our systems and so far has not been a bottleneck. I believe there is now a 64bit murmurhash, that would be faster to compute and ideal for what you want. I've been using base-64 encoding when I use my hashes as rowkeys - makes them printable while still being fairly dense, IIRC a 64bit key should be only 11 characters. -chris On Mar 17, 2011, at 12:30 AM, Pete Haidinyak wrote: > Thanks, I'll give that a try. > > -Pete > > On Thu, 17 Mar 2011 00:23:00 -0700, Ted Dunning <[EMAIL PROTECTED]> wrote: > >> Double hashing is a find thing. To actually answer the question, though, I >> would recommend Murmurhash or JOAAT ( >> http://en.wikipedia.org/wiki/Jenkins_hash_function) >> >> On Wed, Mar 16, 2011 at 3:48 PM, Andrey Stepachev <[EMAIL PROTECTED]> wrote: >> >>> Try hash table with double hashing. >>> Something like this >>> >>> http://www.java2s.com/Code/Java/Collections-Data-Structure/Hashtablewithdoublehashing.htm >>> >>> 2011/3/17 Peter Haidinyak <[EMAIL PROTECTED]> >>> >>> > Hi, >>> > This is a little off topic but this group seems pretty swift so I >>> > thought I would ask. I am aggregating a day's worth of log data which >>> means >>> > I have a Map of over 24 million elements. What would be a good algorithm >>> to >>> > use for generating Hash Codes for these elements that cut down on >>> > collisions? I application starts out reading in a log (144 logs in all) >>> in >>> > about 20 seconds and by the time I reach the last log it is taking around >>> > 120 seconds. The extra 100 seconds have to do with Hash Table Collisions. >>> > I've played around with different Hashing algorithms and cut the original >>> > time from over 300 seconds to 120 but I know I can do better. >>> > The key I am using for the Map is an alpha-numeric string that is >>> > approximately 16 character long with the last 4 or 5 character being the >>> > most unique. >>> > >>> > Any ideas? >>> > >>> > Thanks >>> > >>> > -Pete >>> > >
-
Re: OT - Hash Code CreationTed Dunning 2011-03-17, 17:23
There can be some odd effects with this because the keys are not uniformly
distributed. Beware if you are using pre-split tables because the region traffic can be pretty unbalanced if you do a naive split. On Thu, Mar 17, 2011 at 9:20 AM, Chris Tarnas <[EMAIL PROTECTED]> wrote: > I've been using base-64 encoding when I use my hashes as rowkeys - makes > them printable while still being fairly dense, IIRC a 64bit key should be > only 11 characters.
-
Re: OT - Hash Code CreationTed Dunning 2011-03-17, 17:23
On Thu, Mar 17, 2011 at 8:21 AM, Michael Segel <[EMAIL PROTECTED]>wrote:
> > Why not keep it simple? > > Use a SHA-1 hash of your key. See: > http://codelog.blogial.com/2008/09/13/password-encryption-using-sha1-md5-java/ > (This was just the first one I found and there are others...) > Sha-1 is kind of slow. > > So as long as your key is unique, the sha-1 hash should also be unique. > Pretty much. But you can get comparable performance with simpler hashes. These simpler hashes are very widely available even if not part of java.
-
Re: OT - Hash Code CreationChris Tarnas 2011-03-17, 18:19
I'm not sure I am clear, are you saying 64 bit chunks of a MD5 keys are not uniformly distributed? Or that a base-64 encoding is not evenly distributed?
thanks, -chris On Mar 17, 2011, at 10:23 AM, Ted Dunning wrote: > > There can be some odd effects with this because the keys are not uniformly distributed. Beware if you are using pre-split tables because the region traffic can be pretty unbalanced if you do a naive split. > > On Thu, Mar 17, 2011 at 9:20 AM, Chris Tarnas <[EMAIL PROTECTED]> wrote: > I've been using base-64 encoding when I use my hashes as rowkeys - makes them printable while still being fairly dense, IIRC a 64bit key should be only 11 characters. >
-
Re: OT - Hash Code CreationTed Dunning 2011-03-17, 18:32
Just that base-64 is not uniformly distributed relative to a binary
representation. This is simply because it is all printable characters. If you do a 256 way pre-split based on a binary interpretation of the key, 64 regions will get traffic and 192 will get none. Among other things, this can seriously mess up benchmarking. The situation is even worse with decimal integer representations. On Thu, Mar 17, 2011 at 11:19 AM, Chris Tarnas <[EMAIL PROTECTED]> wrote: > I'm not sure I am clear, are you saying 64 bit chunks of a MD5 keys are not > uniformly distributed? Or that a base-64 encoding is not evenly distributed? > > thanks, > -chris > > On Mar 17, 2011, at 10:23 AM, Ted Dunning wrote: > > > There can be some odd effects with this because the keys are not uniformly > distributed. Beware if you are using pre-split tables because the region > traffic can be pretty unbalanced if you do a naive split. > > On Thu, Mar 17, 2011 at 9:20 AM, Chris Tarnas <[EMAIL PROTECTED]> wrote: > >> I've been using base-64 encoding when I use my hashes as rowkeys - makes >> them printable while still being fairly dense, IIRC a 64bit key should be >> only 11 characters. > > > >
-
RE: OT - Hash Code CreationPeter Haidinyak 2011-03-17, 18:43
Final tally on the import of a full days worth of search logs. The process started out at 12 seconds per log and ended at 15 seconds per log. Previously, the process started out at 24 seconds per log and ended at 154 seconds per log. I think I'll stay with my current Hash Code Generation Algorithm. Thanks Ted.
-Pete -----Original Message----- From: Peter Haidinyak [mailto:[EMAIL PROTECTED]] Sent: Thursday, March 17, 2011 9:44 AM To: [EMAIL PROTECTED] Subject: RE: OT - Hash Code Creation Hash Code in Object is limited to an int and a quick look at HashMap and Trove's HashMap looks like they are only using 31 bits of that. I am now trying a modified version of what Ted pointed at and it seems to be working very well. I modified the original since only the last few bytes in the key are usually unique so I start at both ends when creating the Hash Code. So far I am half way through an import and the times went down from 24 seconds to 11 seconds on the first few files and have been holding around 13 seconds at half way vs 45 seconds with the old method. Thanks -Pete -----Original Message----- From: Christopher Tarnas [mailto:[EMAIL PROTECTED]] On Behalf Of Chris Tarnas Sent: Thursday, March 17, 2011 9:21 AM To: [EMAIL PROTECTED] Subject: Re: OT - Hash Code Creation With 24 million elements you'd probably want a 64bit hash to minimize the risk of collision, the rule of thumb is with 64bit hash key expect a collision when you reach about 2^32 elements in your set. I half of a 128bit MD5 sum (a cryptographic hash so you can only use parts of it if you want) as that is readily available in our systems and so far has not been a bottleneck. I believe there is now a 64bit murmurhash, that would be faster to compute and ideal for what you want. I've been using base-64 encoding when I use my hashes as rowkeys - makes them printable while still being fairly dense, IIRC a 64bit key should be only 11 characters. -chris On Mar 17, 2011, at 12:30 AM, Pete Haidinyak wrote: > Thanks, I'll give that a try. > > -Pete > > On Thu, 17 Mar 2011 00:23:00 -0700, Ted Dunning <[EMAIL PROTECTED]> wrote: > >> Double hashing is a find thing. To actually answer the question, though, I >> would recommend Murmurhash or JOAAT ( >> http://en.wikipedia.org/wiki/Jenkins_hash_function) >> >> On Wed, Mar 16, 2011 at 3:48 PM, Andrey Stepachev <[EMAIL PROTECTED]> wrote: >> >>> Try hash table with double hashing. >>> Something like this >>> >>> http://www.java2s.com/Code/Java/Collections-Data-Structure/Hashtablewithdoublehashing.htm >>> >>> 2011/3/17 Peter Haidinyak <[EMAIL PROTECTED]> >>> >>> > Hi, >>> > This is a little off topic but this group seems pretty swift so I >>> > thought I would ask. I am aggregating a day's worth of log data which >>> means >>> > I have a Map of over 24 million elements. What would be a good algorithm >>> to >>> > use for generating Hash Codes for these elements that cut down on >>> > collisions? I application starts out reading in a log (144 logs in all) >>> in >>> > about 20 seconds and by the time I reach the last log it is taking around >>> > 120 seconds. The extra 100 seconds have to do with Hash Table Collisions. >>> > I've played around with different Hashing algorithms and cut the original >>> > time from over 300 seconds to 120 but I know I can do better. >>> > The key I am using for the Map is an alpha-numeric string that is >>> > approximately 16 character long with the last 4 or 5 character being the >>> > most unique. >>> > >>> > Any ideas? >>> > >>> > Thanks >>> > >>> > -Pete >>> > >
-
Re: OT - Hash Code CreationChris Tarnas 2011-03-17, 18:48
Ok - now I understand - doing pre-splits using the full binary space does not make sense when using a limited range. I do all my splits in the base-64 character space or let hbase do them organically.
thanks for the explanation. -chris On Mar 17, 2011, at 11:32 AM, Ted Dunning wrote: > Just that base-64 is not uniformly distributed relative to a binary representation. This is simply because it is all printable characters. If you do a 256 way pre-split based on a binary interpretation of the key, 64 regions will get traffic and 192 will get none. Among other things, this can seriously mess up benchmarking. The situation is even worse with decimal integer representations. > > On Thu, Mar 17, 2011 at 11:19 AM, Chris Tarnas <[EMAIL PROTECTED]> wrote: > I'm not sure I am clear, are you saying 64 bit chunks of a MD5 keys are not uniformly distributed? Or that a base-64 encoding is not evenly distributed? > > thanks, > -chris > > On Mar 17, 2011, at 10:23 AM, Ted Dunning wrote: > >> >> There can be some odd effects with this because the keys are not uniformly distributed. Beware if you are using pre-split tables because the region traffic can be pretty unbalanced if you do a naive split. >> >> On Thu, Mar 17, 2011 at 9:20 AM, Chris Tarnas <[EMAIL PROTECTED]> wrote: >> I've been using base-64 encoding when I use my hashes as rowkeys - makes them printable while still being fairly dense, IIRC a 64bit key should be only 11 characters. >> > > |