|
Simon Kelly
2012-06-12, 08:17
Michael Segel
2012-06-12, 08:23
Simon Kelly
2012-06-12, 09:41
Oliver Meyn
2012-06-12, 11:29
Michael Segel
2012-06-12, 11:37
Simon Kelly
2012-06-12, 11:55
Michael Segel
2012-06-12, 13:48
Simon Kelly
2012-06-12, 14:37
Simon Kelly
2012-06-12, 15:05
Michael Segel
2012-06-12, 16:09
Simon Kelly
2012-06-13, 07:01
|
-
Pre-split table using shell Simon Kelly 2012-06-12, 08:17
Hi

I'm getting some unexpected results with a pre-split table: some of the regions are not getting any data.

The table keys are UUIDs (generated using Java's UUID.randomUUID()), which I'm storing as a byte[16]:

key[0-7] = uuid most significant bits
key[8-15] = uuid least significant bits

The table is created via the shell as follows:

create 'table', {NAME => 'cf'}, {SPLITS_FILE => 'splits.txt'}

The splits.txt is generated using the code here: http://pastebin.com/DAExXMDz which generates 32 regions split between x00 and xFF. I have also tried with 16-byte region keys (x00x00... to xFFxFF...).

As far as I understand, this should distribute the rows evenly across the regions, but I'm getting a bunch of regions with no rows. I'm also confused as to the ordering of the regions, since the start and end keys don't seem to match up correctly. You can see the regions and the requests they are getting here: http://pastebin.com/B4771g5X

Thanks in advance for the help.
Simon
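For readers following along, the key layout described above can be sketched in plain Java. This is an illustration only, not the code from the pastebin link; the class name UuidRowKey and its method names are made up for the example.

```java
import java.nio.ByteBuffer;
import java.util.UUID;

// Hypothetical helper: packs a UUID into the 16-byte layout described in the post.
final class UuidRowKey {
    // Bytes 0-7 hold the most significant bits, bytes 8-15 the least
    // significant bits. ByteBuffer writes big-endian by default.
    static byte[] toBytes(UUID uuid) {
        return ByteBuffer.allocate(16)
                .putLong(uuid.getMostSignificantBits())
                .putLong(uuid.getLeastSignificantBits())
                .array();
    }

    // Inverse of toBytes: rebuilds the UUID from a 16-byte key.
    static UUID fromBytes(byte[] key) {
        ByteBuffer buf = ByteBuffer.wrap(key);
        long msb = buf.getLong();
        long lsb = buf.getLong();
        return new UUID(msb, lsb);
    }

    public static void main(String[] args) {
        UUID id = UUID.randomUUID();
        byte[] key = toBytes(id);
        System.out.println(id + " -> " + key.length + " bytes, round trip ok: "
                + fromBytes(key).equals(id));
    }
}
```

Because the encoding is big-endian, byte-wise lexicographic ordering of the keys follows the unsigned numeric ordering of the UUID, which is what split points on the leading byte rely on.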
-
Re: Pre-split table using shell Michael Segel 2012-06-12, 08:23
UUIDs are unique but not necessarily random, and even in random samplings you may not see an even distribution except over time.

Sent from my iPhone
-
Re: Pre-split table using shell Simon Kelly 2012-06-12, 09:41
Yes, I'm aware that UUIDs are designed to be unique and not evenly distributed, but I wouldn't expect a big gap in their distribution either.

The other thing that is really confusing me is that the region splits aren't lexicographically sorted. Perhaps there is a problem with the way I'm specifying the splits in the splits file. I haven't been able to find any docs on what format the split keys should be in, so I've used what's produced by Bytes.toStringBinary. Is that correct?

Simon
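One way to check the "no big gap" expectation is to bucket a batch of random UUIDs by the leading byte of the key and count how many land in each of the 32 ranges. The sketch below is an assumption-laden illustration: the class name UuidSpread and the sample size are invented, and it models regions as equal-width ranges of the first byte only.

```java
import java.util.UUID;

// Hypothetical check: how evenly do random UUIDs spread over first-byte ranges?
final class UuidSpread {
    // Counts how many of n random UUIDs fall into each of `regions`
    // equal-width buckets of the first key byte. `regions` is assumed
    // to divide 256 evenly (for example 32).
    static int[] countByFirstByte(int n, int regions) {
        int width = 256 / regions;
        int[] counts = new int[regions];
        for (int i = 0; i < n; i++) {
            long msb = UUID.randomUUID().getMostSignificantBits();
            int firstByte = (int) (msb >>> 56); // unsigned value 0..255
            counts[firstByte / width]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        int[] counts = countByFirstByte(320_000, 32);
        for (int r = 0; r < counts.length; r++) {
            System.out.printf("region %2d: %d%n", r, counts[r]);
        }
    }
}
```

For version 4 UUIDs the fixed version and variant bits sit in bytes 6 and 8 of the key, so the leading byte is fully random and empty buckets would be very surprising at this sample size.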
-
Re: Pre-split table using shell Oliver Meyn 2012-06-12, 11:29
Hi Simon,

I might be wrong, but I'm pretty sure the splits file you specify is assumed to be full of strings. So even though they look like bytes, they're being interpreted as the string value (like '\x00') instead of the actual byte \x00. The only way I could get the byte representation of ints (in my case) to be used for pre-splitting was to do it programmatically.

Hope that helps,
Oliver

--
Oliver Meyn
Software Developer
Global Biodiversity Information Facility (GBIF)
+45 35 32 15 12
http://www.gbif.org
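To make the difference concrete, here is a small plain-Java sketch (no HBase classes involved) comparing the bytes of the escaped text `\x10` with the single byte 0x10. The class and method names are hypothetical, and the reading of how the shell treats the file is Oliver's suggestion above, not something verified here.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical illustration of "escaped text" versus "raw byte" split keys.
final class EscapedVersusRaw {
    // Bytes obtained when a line of the splits file is taken literally as text.
    static byte[] literalBytes(String line) {
        return line.getBytes(StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        // The four characters backslash, 'x', '1', '0' as they appear in the file.
        byte[] asText = literalBytes("\\x10");
        // The single byte that was intended as the split point.
        byte[] asByte = { (byte) 0x10 };

        System.out.println(Arrays.toString(asText)); // [92, 120, 49, 48]
        System.out.println(Arrays.toString(asByte)); // [16]
    }
}
```

If the lines are indeed taken literally, every split key starts with byte 92 (0x5C, the backslash), so all the boundaries sit inside one narrow slice of the key space. Keys with a smaller first byte would fall into the first region and keys with a larger one into the last, which would match both the empty regions and the odd-looking ordering.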
-
Re: Pre-split table using shell Michael Segel 2012-06-12, 11:37
Ok,
Now that I'm awake and drinking my first cup of joe...

If you just generate UUIDs, you are not going to have an even distribution. Nor are they going to be truly random, due to how the machines generate their random numbers. But this is not important in solving your problem...

There is a set of UUIDs which are hashed and then truncated back down to a 128-bit string. You can generate the UUID, take a hash (SHA-1 or MD5) and then truncate it to 128 bits. This would generate a more random distribution across your splits.

I'm also a bit curious about why you're pre-splitting in the first place. I mean, I understand why people do it, but it's a short-term fix and I wonder how much pain you feel.

Of course, YMMV based on your use case.

Hash your key and you'll be ok.
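The hash-then-truncate idea can be sketched with the JDK's MessageDigest. This is a minimal illustration under assumptions: the class name HashedRowKey and the choice to hash the 16 raw UUID bytes (rather than, say, the string form) are not from the thread.

```java
import java.nio.ByteBuffer;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;
import java.util.UUID;

// Hypothetical helper: derives a 16-byte row key by hashing a UUID.
final class HashedRowKey {
    // algorithm is a JDK digest name such as "MD5" or "SHA-1".
    static byte[] hashedKey(UUID uuid, String algorithm) {
        byte[] raw = ByteBuffer.allocate(16)
                .putLong(uuid.getMostSignificantBits())
                .putLong(uuid.getLeastSignificantBits())
                .array();
        try {
            byte[] digest = MessageDigest.getInstance(algorithm).digest(raw);
            // MD5 already yields 16 bytes; SHA-1 yields 20 and is cut to 16.
            return Arrays.copyOf(digest, 16);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("digest not available: " + algorithm, e);
        }
    }

    public static void main(String[] args) {
        UUID id = UUID.randomUUID();
        System.out.println(hashedKey(id, "MD5").length);   // 16
        System.out.println(hashedKey(id, "SHA-1").length); // 16
    }
}
```

One trade-off to keep in mind: the hashed key cannot be turned back into the UUID, so the original value would have to be stored in a column if it is needed on reads.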
-
Re: Pre-split table using shell Simon Kelly 2012-06-12, 11:55
Thanks Michael

I'm 100% sure it's not the UUID distribution that's causing the problem. I'm going to try using the API to create the table and see if that changes things.

The reason I want to pre-split the table is that HBase doesn't handle the initial load to a single regionserver, and I can't start the system off slowly and allow a few splits to happen before fully loading it. It's 100% or nothing. I'm also stuck with only 8GB of RAM per server and only 5 servers, so I need to try and get as much as I can from the get-go.

Simon
-
Re: Pre-split table using shell Michael Segel 2012-06-12, 13:48
Ok...

Please tell me that this isn't a production system.

Is this on EC2?
-
Re: Pre-split table using shell Simon Kelly 2012-06-12, 14:37
No, this isn't on EC2 and yes, it's (supposed to be) production. Please elaborate on your inferred sigh of despair...
-
Re: Pre-split table using shell Simon Kelly 2012-06-12, 15:05
Using the API to create the splits worked. The data is now evenly spread across all the regions. However, every time I tried to create a table the HBase master crashed. I used the class listed here http://pastebin.com/i1yFVEwj as follows:

./hbase CreateTable

The table gets created but the HBase master crashes. The full master log is here: http://pastebin.com/JE1rLC0C

This is HBase 0.92.1 with Hadoop 1.0.1

Simon
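For context, the programmatic route boils down to building the split points as real byte arrays. The sketch below is not the class from the pastebin link; SplitKeys and evenSplits are hypothetical names, and it only computes the keys. In HBase 0.92 the resulting array would, as far as I know, be passed to HBaseAdmin.createTable(HTableDescriptor, byte[][]), which is left out here to keep the example free of HBase dependencies.

```java
// Hypothetical helper: evenly spaced split points over the first key byte.
final class SplitKeys {
    // Returns regions - 1 split keys of keyLength bytes each. Only the first
    // byte varies; the remaining bytes stay 0x00. Assumes 2 <= regions <= 256.
    static byte[][] evenSplits(int regions, int keyLength) {
        byte[][] splits = new byte[regions - 1][];
        for (int i = 1; i < regions; i++) {
            byte[] key = new byte[keyLength];
            key[0] = (byte) (i * 256 / regions); // 0x08, 0x10, ... for 32 regions
            splits[i - 1] = key;
        }
        return splits;
    }

    public static void main(String[] args) {
        for (byte[] key : evenSplits(32, 16)) {
            StringBuilder hex = new StringBuilder();
            for (byte b : key) {
                hex.append(String.format("%02X", b));
            }
            System.out.println(hex);
        }
    }
}
```

With 32 regions this yields 31 boundaries; the first region has no start key and the last has no end key, so no 0x00 or 0xFF boundary is needed.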
-
Re: Pre-split table using shell Michael Segel 2012-06-12, 16:09
"Inferred sigh of despair"? Was it that obvious? :-)

I'm not sure what hardware you're running on, so it's hard to say.

Here's the problem... On each DN, you're running a DN and an RS. Assuming that you're not going to run a TT or do any M/R to push/pull data in and out of HBase, you don't have a lot of memory to play with.

I guess you could make the heap size a max of 4GB...

It's really tight.

In general, I'd recommend at least 4GB per physical core. Some are looking at 8GB.

The problem is that with too little memory, if you hit swap, you can cause a cascading failure taking down your entire instance.

Sorry, I tend to be a bit paranoid and try to make the servers as robust as budgets allow.

Getting back to your initial problem...
Hash the keys and I think then you'll be ok.

HTH

-Mike
-
Re: Pre-split table using shell Simon Kelly 2012-06-13, 07:01
Thanks Mike, that's pretty much the same reaction I had before. We should be getting another 8GB shortly, but that's the limit for those servers, and while that's still not a lot I think we'll manage for now. Unfortunately I'm not the decision maker when it comes to these things, so I'm just doing my best with what I've got.

On the point of hashing: considering that we have a 4% std deviation in the regions, do you still think we need to hash the UUID? Or is your concern that we might get hot-spotting if a set of UUIDs happen to be close together?

Thanks again for the help.
Simon
|