Re: hbase puts in map tasks don't seem to run in parallel
This is probably more of an [EMAIL PROTECTED] topic than common-user.

To answer your question, you will want to pre-split the table like so: http://hbase.apache.org/book/perf.writing.html



Sent from my iPhone
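To make the pre-splitting advice concrete: it means creating the table with an explicit list of split keys, so it starts life with many regions instead of one. A minimal sketch of generating evenly spaced single-byte split boundaries; the `HBaseAdmin` call in the comment is an assumption based on the 0.9x-era client API and is not compiled here:

```java
public class PreSplit {

    // Generate numRegions - 1 evenly spaced single-byte split keys over the
    // unsigned byte range, so a freshly created table starts with numRegions
    // regions rather than a single one. Coarse, but enough to illustrate.
    public static byte[][] evenSplits(int numRegions) {
        byte[][] splits = new byte[numRegions - 1][];
        for (int i = 1; i < numRegions; i++) {
            splits[i - 1] = new byte[] { (byte) (i * 256 / numRegions) };
        }
        return splits;
    }

    public static void main(String[] args) {
        byte[][] splits = evenSplits(10); // 9 boundaries -> 10 initial regions
        // Hypothetical usage against the old client API (requires HBase jars):
        //   HBaseAdmin admin = new HBaseAdmin(conf);
        //   admin.createTable(tableDescriptor, splits);
        System.out.println(splits.length); // prints 9
    }
}
```

Note that pre-splitting only helps if the row keys are actually distributed across the split ranges; otherwise all writes still land in a single region.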

On Jun 3, 2012, at 3:45 PM, Jonathan Bishop <[EMAIL PROTECTED]> wrote:

> Thanks Joep,
> My table is empty when I start and will consist of 18M rows when completed.
> So I guess I need to understand how to pick row keys such that the regions
> will be on that mapper's node. Any advice would be appreciated.
> BTW, I do notice that the region servers of other nodes become busy, but
> only after a large number of rows have been processed - say 10%. It would
> be better if I could deliberately control which regions/regionservers were
> going to be used, though, to prevent the network traffic of sending rows to
> regionservers on other nodes.
> Jon
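One common way to get the deliberate control Jon asks for - a general HBase pattern rather than anything prescribed in this thread - is to prefix each row key with a bucket id derived from a stable hash, so rows fan out across all pre-split regions from the very first write. A sketch with hypothetical names:

```java
public class BucketedKeys {

    // Prefix the natural row key with a two-digit bucket id computed from a
    // stable hash. If the table is pre-split on the prefixes "00".."09",
    // writes spread across all regions immediately instead of filling one.
    public static String bucketedKey(String rowKey, int numBuckets) {
        int bucket = Math.floorMod(rowKey.hashCode(), numBuckets);
        return String.format("%02d-%s", bucket, rowKey);
    }

    public static void main(String[] args) {
        // The same key always maps to the same bucket, so a reader can
        // recompute the prefix when looking a row back up.
        System.out.println(bucketedKey("node-12345", 10));
    }
}
```

The trade-off is that a scan over the natural key order now requires one scan per bucket.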
> On Sun, Jun 3, 2012 at 12:02 PM, Joep Rottinghuis <[EMAIL PROTECTED]> wrote:
>> How large is your table?
>> If it is newly created and still almost empty then it will probably
>> consist of only one region, which will be hosted on one region server.
>> Even as the table grows and gets split into multiple regions, you will
>> have to split your mappers in such a way that each writes to the key ranges
>> corresponding to the regions hosted locally on the corresponding region
>> server.
>> Cheers,
>> Joep
>> Sent from my iPhone
>> On Jun 2, 2012, at 6:25 PM, Jonathan Bishop <[EMAIL PROTECTED]> wrote:
>>> Hi,
>>> I am new to hadoop and hbase, but have spent the last few weeks learning
>>> as much as I can...
>>> I am attempting to create an hbase table during a hadoop job by simply
>>> doing puts to a table from each map task. I am hoping that each map task
>>> will use the regionserver on its node so that all 10 of my nodes are
>>> putting values into the table at the same time.
>>> Here is my map class below. The Node class is a simple data structure
>>> which knows how to parse a line of input and create a Put for hbase.
>>> When I run this I see that only one region server is active for the
>>> table I am creating. I know that my input file is split among all 10 of
>>> my data nodes, and I know that if I do not do puts to the hbase table
>>> everything runs in parallel on all 10 machines. It is only when I start
>>> doing hbase puts that the run times go way up.
>>> Thanks,
>>> Jon
>>> public static class MapClass extends Mapper<Object, Text, IntWritable, Node> {
>>>
>>>     HTableInterface table = null;
>>>
>>>     @Override
>>>     protected void setup(Context context) throws IOException, InterruptedException {
>>>         String tableName = context.getConfiguration().get(TABLE);
>>>         // Pass the job configuration so the table client picks up the
>>>         // cluster settings (the no-arg HTable(String) constructor is deprecated).
>>>         table = new HTable(context.getConfiguration(), tableName);
>>>     }
>>>
>>>     @Override
>>>     public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
>>>         Node node;
>>>         try {
>>>             node = Node.parseNode(value.toString());
>>>         } catch (ParseException e) {
>>>             // Preserve the cause instead of throwing an empty IOException.
>>>             throw new IOException(e);
>>>         }
>>>         table.put(node.getPut());
>>>     }
>>>
>>>     @Override
>>>     protected void cleanup(Context context) throws IOException, InterruptedException {
>>>         table.close();
>>>     }
>>> }
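Independent of region locality, one detail of the map class above compounds the slowdown: with autoflush enabled (the default in the HBase client of that era), every `table.put` is a separate round trip to a region server. The usual remedy was `table.setAutoFlush(false)` plus a larger `setWriteBufferSize` - an assumption about the 0.9x client API, not verified against the poster's version. The batching idea itself, as a self-contained sketch with stand-in types:

```java
import java.util.ArrayList;
import java.util.List;

public class BufferedPuts {
    private final List<String> buffer = new ArrayList<>();
    private final int flushSize;
    private int flushes = 0;

    public BufferedPuts(int flushSize) {
        this.flushSize = flushSize;
    }

    // Stand-in for table.put(put): buffer rows, flush when the batch fills.
    public void put(String row) {
        buffer.add(row);
        if (buffer.size() >= flushSize) {
            flush();
        }
    }

    // Stand-in for one batched RPC to the region server.
    public void flush() {
        if (!buffer.isEmpty()) {
            flushes++;
            buffer.clear();
        }
    }

    public int flushCount() {
        return flushes;
    }

    public static void main(String[] args) {
        BufferedPuts t = new BufferedPuts(1000);
        for (int i = 0; i < 2500; i++) {
            t.put("row-" + i);
        }
        t.flush(); // final flush in cleanup(), like table.close()
        System.out.println(t.flushCount()); // prints 3 (3 batches, not 2500 RPCs)
    }
}
```

In the real client the buffering happened inside `HTable`, so the map code stayed unchanged apart from the two setup calls.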