Re: Bulkload into empty table with configureIncrementalLoad()
Dolan Antenucci 2013-09-20, 02:27
To follow up on my previous question about how best to do the pre-splits, I
ended up using the following when creating my table:
admin.createTable(desc, Bytes.toBytes(0), Bytes.toBytes(2147483647),
This was somewhat of a stab in the dark, but I based it
on RegionSplitter.MD5StringSplit's documentation, which says that rows are long
values in the range "00000000" => "7FFFFFFF". (Reminder: I'm using strings,
probably not uniformly distributed, as my row IDs.)
It looks like about 80 of the regions received very few keys (many
received 0), and the other 20 received between 35 million and 70 million
each. Glancing at the nodes responsible for the 20 popular regions, it looks
like a fairly even distribution across my cluster, so overall I'm optimistic
about the result (performance at first glance seems fine too).
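To illustrate why so few regions fill up, here is a rough sketch (not HBase code). It assumes 100 equal-width regions over the 4-byte range 0 to 0x7FFFFFFF and keys that start with a lowercase ASCII letter. Both are assumptions for illustration only, the class and method names are made up, and HBase's real region boundaries will differ slightly.

```java
public class KeyRangeSketch {
    // Size of the assumed key space: 0x00000000 .. 0x7FFFFFFF.
    static final long RANGE = 1L << 31;

    /**
     * Bucket index for a key when [0, 2^31) is cut into n equal-width buckets.
     * Only the first four bytes matter; shorter keys are padded with zeros.
     */
    public static int bucketFor(byte[] key, int n) {
        long v = 0;
        for (int i = 0; i < 4; i++) {
            int b = i < key.length ? (key[i] & 0xFF) : 0;
            v = (v << 8) | b;
        }
        if (v >= RANGE) {
            return n - 1; // sorts past the end key: falls in the last bucket
        }
        return (int) (v * n / RANGE);
    }

    /** Number of buckets reachable by keys whose first byte is in [first, last]. */
    public static int bucketsTouched(char first, char last, int n) {
        byte[] smallest = {(byte) first};
        byte[] largest = {(byte) last, (byte) 0xFF, (byte) 0xFF, (byte) 0xFF};
        return bucketFor(largest, n) - bucketFor(smallest, n) + 1;
    }

    public static void main(String[] args) {
        System.out.println(bucketsTouched('a', 'z', 100));
    }
}
```

Under this model, keys starting with 'a' through 'z' can only ever reach 22 of the 100 buckets, because a string's leading bytes decide where it sorts within the numeric range.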
Question: is there something I can do to achieve an even better
distribution across my regions? As mentioned before, I have a table that I
populated via puts, so maybe this can be used to guide my pre-splits? I
did try passing the result of this table's HTable.getStartKeys() (as well
as getEndKeys()) in as the splits, but got an error along the lines of "key
cannot be empty".
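The "key cannot be empty" error is probably because the first region's start key (and the last region's end key) is an empty byte array, which is not a valid split point. Below is a minimal sketch of filtering those out before reusing the keys as splits. SplitKeyFilter and dropEmptyKeys are hypothetical names, not part of HBase.

```java
import java.util.ArrayList;
import java.util.List;

public class SplitKeyFilter {
    /** Returns the input keys with null or zero-length entries removed, order preserved. */
    public static byte[][] dropEmptyKeys(byte[][] keys) {
        List<byte[]> kept = new ArrayList<>();
        for (byte[] k : keys) {
            if (k != null && k.length > 0) {
                kept.add(k);
            }
        }
        return kept.toArray(new byte[0][]);
    }
}
```

The result could then be passed as the splitKeys argument of admin.createTable(desc, splitKeys), assuming the existing table's region boundaries are a good fit for the bulk-loaded data.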
Thanks again for your help.
On Thu, Sep 19, 2013 at 2:53 PM, Dolan Antenucci <[EMAIL PROTECTED]> wrote:
> Thanks J-D. Any recommendations on how to determine what splits to use?
> For the keys I'm using strings, so wasn't sure what to put for my startKey
> and endKey. For number of regions, I have a table pre-populated with the
> same data (not using bulk load), so I can see that it has 68 regions.
> On Thu, Sep 19, 2013 at 12:55 PM, Jean-Daniel Cryans <[EMAIL PROTECTED]> wrote:
>> You need to create the table with pre-splits, see
>> On Thu, Sep 19, 2013 at 9:52 AM, Dolan Antenucci <[EMAIL PROTECTED]
>> > I have about 1 billion values I am trying to load into a new HBase table
>> > (with just one column and column family), but am running into some
>> > issues. Currently I am trying to use MapReduce to import these by first
>> > writing them to HFiles and then using LoadIncrementalHFiles.doBulkLoad(). I
>> > use HFileOutputFormat.configureIncrementalLoad() as part of my MR job. My
>> > code is essentially the same as this example:
>> > The problem I'm running into is that only 1 reducer is created
>> > by configureIncrementalLoad(), and there is not enough space on this node
>> > to handle all this data. configureIncrementalLoad() should start one
>> > reducer for every region the table has, so apparently the table only has 1
>> > region -- maybe because it is empty and brand new (my understanding of how
>> > regions work is not crystal clear)? The cluster has 5 region servers, so
>> > I'd at least like that many reducers to handle this loading.
>> > On a side note, I also tried the command line tool, completebulkload, but
>> > am running into other issues with this (timeouts, possible heap issues),
>> > probably due to only one server being assigned the task of inserting all
>> > the records (i.e. I look at the region servers' logs, and only one of the
>> > servers has log entries; the rest are idle).
>> > Any help is appreciated
>> > -Dolan Antenucci