HBase user mailing list: 'split' start/stop key range of large table regions for more map tasks

Lu, Wei 2013-03-25, 10:02
Jean-Marc Spaggiari 2013-03-25, 11:20
Re: ‘split’ start/stop key range of large table regions for more map tasks
I think the problem is that Wei has been reading some stuff in blogs and that's why he has such a large region size to start with.

So if he manually splits the regions, drops the region size to something more appropriate...

Or if he unloads the table, drops the table, recreates the table with a smaller, more reasonable region size... reloads... he'd be better off.
On Mar 25, 2013, at 6:20 AM, Jean-Marc Spaggiari <[EMAIL PROTECTED]> wrote:

> Hi Wei,
>
> Have you looked at MAX_FILESIZE? If your table is 1TB in size, and you
> have 10 RS and want 12 regions per server, you can set this to
> 1TB/(10x12) and you will get at least that many regions (and even a
> bit more). (A sketch of this is below, after the quoted thread.)
>
> JM
>
> 2013/3/25 Lu, Wei <[EMAIL PROTECTED]>:
>> We are facing large region sizes but a small number of regions per table: 10 region servers, each with only one region of over 10G, while each task tracker has 12 map slots. We are planning to 'split' the start/stop key ranges of the large table regions into more map tasks, so that we can make better use of MapReduce resources (currently only one of the 12 map slots is used). I have some ideas below on how to split; please give me comments or advice.
>> We are considering implementing a TableInputFormat that overrides the method:
>> @Override
>> public List<InputSplit> getSplits(JobContext context) throws IOException
>> Following is an idea:
>>
>> 1) Split the start/stop key range based on a threshold or the avg. region size
>> Set a threshold t1; collect each region's size, and if a region's size is larger than t1, then 'split' the range [startkey, stopkey) of the region into N = {region size} / t1 sub-ranges: [startkey, stopkey1), [stopkey1, stopkey2), ..., [stopkeyN-1, stopkey);
>> As for t1, we could set it as we like, or leave it as the average of all region sizes. We will set it to a small value when the regions are very large, so that the 'split' will happen (see the getSplits() sketch after the quoted thread);
>>
>> 2) Get split keys by sampling HFile block keys
>> As for stopkey1, ..., stopkeyN-1, HBase doesn't supply APIs to get them; only Pair<byte[][],byte[][]> getStartEndKeys() is given, to get the start/stop keys of the regions. 1) We could calculate them roughly, or 2) we can directly get each store file's block keys through HFile.Reader, merge-sort them, and then do sampling (see the sampling sketch after the quoted thread).
>> Does this method make sense?
>>
>> Thanks,
>> Wei
>>
>
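
A minimal sketch of the MAX_FILESIZE change JM suggests, using the 0.94-era Java admin API (the table name "big_table" is a placeholder; 1TB/(10x12) works out to roughly 8.5GB per region):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.util.Bytes;

public class LowerMaxFileSize {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HBaseAdmin admin = new HBaseAdmin(conf);
    byte[] tableName = Bytes.toBytes("big_table"); // placeholder table name

    // 1TB / (10 RS x 12 regions per RS) ~= 8.5GB per region
    HTableDescriptor desc = admin.getTableDescriptor(tableName);
    desc.setMaxFileSize(1024L * 1024 * 1024 * 1024 / (10 * 12));

    // disable before altering the schema (the safe path on older versions)
    admin.disableTable(tableName);
    admin.modifyTable(tableName, desc);
    admin.enableTable(tableName);
    admin.close();
  }
}

The same change can be made from the shell with alter. Either way, existing oversized regions are only re-checked against MAX_FILESIZE on flushes and compactions, so a major compaction may be needed before the splits actually happen.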
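A rough sketch of the getSplits() override from idea 1), against the 0.94-era mapreduce classes. For brevity it cuts every region into a fixed number of sub-splits with Bytes.split(); the size-based N = {region size} / t1 from the mail would additionally need per-region size collection on top of this:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.hbase.mapreduce.TableInputFormat;
import org.apache.hadoop.hbase.mapreduce.TableSplit;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;

public class SubSplittingTableInputFormat extends TableInputFormat {

  // placeholder: e.g. match the 12 map slots per task tracker
  private static final int SUBSPLITS_PER_REGION = 12;

  @Override
  public List<InputSplit> getSplits(JobContext context) throws IOException {
    List<InputSplit> regionSplits = super.getSplits(context); // one per region
    List<InputSplit> subSplits = new ArrayList<InputSplit>();
    for (InputSplit s : regionSplits) {
      TableSplit ts = (TableSplit) s;
      byte[] start = ts.getStartRow();
      byte[] end = ts.getEndRow();
      // Bytes.split() interpolates boundary keys evenly between two keys.
      // It cannot handle the empty start/end keys of the first and last
      // regions, so fall back to the unsplit region there.
      byte[][] bounds = null;
      if (start.length != 0 && end.length != 0) {
        bounds = Bytes.split(start, end, SUBSPLITS_PER_REGION - 1);
      }
      if (bounds == null) {
        subSplits.add(ts);
        continue;
      }
      for (int i = 0; i < bounds.length - 1; i++) {
        subSplits.add(new TableSplit(ts.getTableName(), bounds[i],
            bounds[i + 1], ts.getRegionLocation()));
      }
    }
    return subSplits;
  }
}

Because the sub-ranges come from byte interpolation rather than from the data, they are only balanced if row keys are roughly uniform over the key space; that is what the sampling in idea 2) tries to improve on.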
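And a rough sketch of the HFile key sampling from idea 2), also against 0.94-era classes. It walks one store file with HFileScanner and keeps every k-th row key; the file path and interval are left to the caller, and a real version would enumerate all store files of a region and merge-sort the samples, as Wei describes:

import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.io.hfile.CacheConfig;
import org.apache.hadoop.hbase.io.hfile.HFile;
import org.apache.hadoop.hbase.io.hfile.HFileScanner;

public class HFileKeySampler {
  public static List<byte[]> sampleRowKeys(Path hfilePath, int everyNth)
      throws Exception {
    Configuration conf = HBaseConfiguration.create();
    FileSystem fs = FileSystem.get(conf);
    HFile.Reader reader = HFile.createReader(fs, hfilePath, new CacheConfig(conf));
    List<byte[]> samples = new ArrayList<byte[]>();
    try {
      // no block caching, no positional read: we stream the whole file once
      HFileScanner scanner = reader.getScanner(false, false);
      if (scanner.seekTo()) { // false means the file is empty
        long i = 0;
        do {
          if (i++ % everyNth == 0) {
            // HFile keys are KeyValue keys; extract just the row part
            KeyValue kv = KeyValue.createKeyValueFromKey(scanner.getKey());
            samples.add(kv.getRow());
          }
        } while (scanner.next());
      }
    } finally {
      reader.close();
    }
    return samples;
  }
}

This reads every key, so it is only practical as an offline step; if a single middle key per file is enough, reader.midkey() gives one without a full scan.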
Lu, Wei 2013-03-26, 02:33
Ted Yu 2013-03-26, 02:37