HBase, mail # user - HBase Table Row Count Optimization - A Solicitation For Help


Re: HBase Table Row Count Optimization - A Solicitation For Help
lars hofhansl 2013-09-21, 05:06
Hey, we all start somewhere. I did the "LocalJobRunner" thing many times and wondered why it was so slow, until I realized I hadn't set up my client correctly.
The LocalJobRunner runs the M/R job on the client machine. It is really just for testing and is terribly slow.

From later emails in this thread I gather you managed to run this as an actual M/R job on the cluster? (By the way, you do not need to start the job on a machine in the cluster; just configure your client correctly to ship the job to the M/R cluster.)
Was that still too slow? I would love to get my hands on some numbers. If you have trillions of rows and can run this job with a few mappers per machine, those would be good numbers to publish here.
In any case, let us know how it goes.
-- Lars
btw. my calculations were assuming that network IO is the bottleneck. For
larger jobs (such as yours) it's typically either that or disk IO.
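
(A minimal sketch of what "configure your client correctly" can look like, assuming Hadoop 1.x property names and placeholder host names; on Hadoop 2 you would set mapreduce.framework.name to "yarn" instead. Putting the cluster's core-site.xml and mapred-site.xml on the client classpath achieves the same thing.)

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class ClientConfig {
    // Hypothetical helper: point the client at the real cluster so jobs
    // are shipped there instead of running in the LocalJobRunner.
    public static Configuration clusterConf() {
        Configuration conf = HBaseConfiguration.create();
        conf.set("fs.default.name", "hdfs://namenode.example.com:8020");  // placeholder
        conf.set("mapred.job.tracker", "jobtracker.example.com:8021");    // placeholder
        return conf;
    }
}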
________________________________

From: James Birchfield <[EMAIL PROTECTED]>
To: [EMAIL PROTECTED]; lars hofhansl <[EMAIL PROTECTED]>
Sent: Friday, September 20, 2013 6:21 PM
Subject: Re: HBase Table Row Count Optimization - A Solicitation For Help
Thanks Lars.  I like your time calculations much better than mine.

So this is where my inexperience is probably going to come glaring through, and maybe the root of all this: I am not running the MapReduce job on a node in the cluster. It is running on a development server that connects remotely to the cluster. Furthermore, I am not executing the MapReduce job from the command line using the CLI as seen in many of the examples. I am executing it in-process from a stand-alone Java program I have written. It is simple in nature: it creates an HBaseAdmin connection, lists the tables and looks up the column families, closes the admin connection, then loops over the table list and runs the following code:

import org.apache.hadoop.hbase.mapreduce.RowCounter;
import org.apache.hadoop.mapreduce.Counter;
import org.apache.hadoop.mapreduce.Counters;
import org.apache.hadoop.mapreduce.Job;

public class RowCounterRunner {

    public static long countRows(String tableName) throws Exception {
        // Build the stock RowCounter M/R job for this table.
        // (ConfigManager is our own configuration wrapper.)
        Job job = RowCounter.createSubmittableJob(
                ConfigManager.getConfiguration(), new String[]{tableName});
        // Block until the job finishes.
        job.waitForCompletion(true);
        // Read back the aggregate ROWS counter; the string-based lookup
        // avoids referencing RowCounter's package-private mapper enum.
        Counters counters = job.getCounters();
        Counter rows = counters.findCounter(
                "org.apache.hadoop.hbase.mapreduce.RowCounter$RowCounterMapper$Counters",
                "ROWS");
        return rows.getValue();
    }
}
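
(For context, a rough sketch of the driver loop described above; the class name CountAllTables and the use of HBaseConfiguration.create() in place of the ConfigManager wrapper are assumptions.)

import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;

public class CountAllTables {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        // List the tables up front, then release the admin connection
        // before kicking off the long-running counting jobs.
        HBaseAdmin admin = new HBaseAdmin(conf);
        List<String> tables = new ArrayList<String>();
        for (HTableDescriptor desc : admin.listTables()) {
            tables.add(desc.getNameAsString());
        }
        admin.close();
        for (String table : tables) {
            System.out.println(table + "=" + RowCounterRunner.countRows(table));
        }
    }
}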

I sort of stumbled onto this implementation as a fairly easy way to automate the process.  So based on your comments, and the fact that I see this in my log:

2013-09-20 23:41:05,556 INFO  [LocalJobRunner Map Task Executor #0] LocalJobRunner                 : map

makes me think I am not taking advantage of the cluster effectively, if at all.  I do not mind at all running the MapReduce job using the hbase/hadoop CLI; I can script that as well.  I just thought this would work decently enough.
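
(For reference, the stock command-line invocation of the same job is "hbase org.apache.hadoop.hbase.mapreduce.RowCounter <tablename>", run from a client whose configuration points at the cluster.)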

It does seem like it will be possible to use the Aggregation coprocessor as suggested a little earlier in this thread.  It may speed things up as well.  But either way, I need to understand whether I am losing significant performance running in the manner I am, which at this point it sounds like I probably am.
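
(A minimal sketch of the coprocessor approach mentioned above, using the 0.94-era AggregationClient. It assumes org.apache.hadoop.hbase.coprocessor.AggregateImplementation has already been loaded on the region servers, e.g. via hbase.coprocessor.region.classes, and the family name is a placeholder.)

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.coprocessor.AggregationClient;
import org.apache.hadoop.hbase.client.coprocessor.LongColumnInterpreter;
import org.apache.hadoop.hbase.util.Bytes;

public class CoprocessorRowCount {
    public static long countRows(String table, String family) throws Throwable {
        Configuration conf = HBaseConfiguration.create();
        AggregationClient client = new AggregationClient(conf);
        Scan scan = new Scan();
        scan.addFamily(Bytes.toBytes(family));  // rowCount allows at most one family
        // Each region counts its rows server-side; only per-region totals
        // travel back to the client, so no M/R job is involved.
        return client.rowCount(Bytes.toBytes(table),
                new LongColumnInterpreter(), scan);
    }
}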

Birch
On Sep 20, 2013, at 6:09 PM, lars hofhansl <[EMAIL PROTECTED]> wrote:

> From your numbers below you have about 26k regions, thus each region is about 545tb/26k = 20gb. Good.
>
> How many mappers are you running?
> And just to rule out the obvious, the M/R is running on the cluster and not locally, right? (it will default to a local runner when it cannot use the M/R cluster).
>
> Some back-of-the-envelope calculations tell me that, assuming 1GbE network cards, the best you can expect for 110 machines to map through this data is about 11h (so way faster than what you see):
> 545 TB / (110 × 1/8 GB/s) ≈ 40,000 s ≈ 11 h
>
>
> We should really add a rowcounting coprocessor to HBase and allow using it via M/R.
>
> -- Lars
>
>
>
> ________________________________
> From: James Birchfield <[EMAIL PROTECTED]>