HBase >> mail # user >> Re: MR job for creating splits


Re: MR job for creating splits
Why do you need to know this?  Were you trying to put a percentage of the
rows in each region?  Otherwise just keep a member variable in your reducer
class and increment it on each call to reduce().  I think you'll be better
off finding a way to do it without a percentage, if possible.  Perhaps try
tallying the size of the data instead.  You should have that available,
since you are trying to bulk load anyway (which requires Put or KeyValue
values, both of which you can get the size from).
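
To make the size-based idea concrete, here is a minimal sketch in plain Java
with no Hadoop or HBase dependencies.  The class name, method names, and the
byte limit are all made up for illustration.  In a real job the logic would
sit in a member variable of the single reducer, and the per-row size would
come from something like summing KeyValue.getLength() over the row's
KeyValues.  It assumes rows arrive in sorted key order.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: collects region split keys by accumulated data size.
// Assumes accept() is called once per row, in sorted row-key order.
class SplitPointCollector {
    private final long maxBytesPerRegion;   // illustrative limit, not an HBase setting
    private long bytesInCurrentRegion = 0;  // running tally for the current region
    private final List<String> splitKeys = new ArrayList<>();

    SplitPointCollector(long maxBytesPerRegion) {
        this.maxBytesPerRegion = maxBytesPerRegion;
    }

    // rowBytes stands in for the summed size of the row's KeyValues.
    void accept(String rowKey, long rowBytes) {
        // Start a new region at this row if adding it would exceed the limit.
        // The first row never becomes a split key, because the first region
        // is left with an open (empty) start key.
        if (bytesInCurrentRegion > 0
                && bytesInCurrentRegion + rowBytes > maxBytesPerRegion) {
            splitKeys.add(rowKey);
            bytesInCurrentRegion = 0;
        }
        bytesInCurrentRegion += rowBytes;
    }

    List<String> getSplitKeys() {
        return splitKeys;
    }
}
```

Because the tally is an instance field rather than a static shared with the
mappers, it does not depend on mappers and the reducer running in the same
JVM, and it needs no total row count up front.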

On Sun, May 13, 2012 at 2:11 AM, Something Something <
[EMAIL PROTECTED]> wrote:

> Is there no way to find out inside a single reducer how many records were
> created by all the Mappers?  I tried several ways but nothing works.  For
> example, I tried this:
>
> reporter.getCounter(Task.Counter.REDUCE_INPUT_RECORDS).getValue();
>
> It's not working for me.  Should this have worked?  Am I just doing
> something dumb?  I would rather not create another MR job just to count #
> of lines.
>
>
> On Sat, May 12, 2012 at 7:07 PM, Bryan Beaudreault <
> [EMAIL PROTECTED]> wrote:
>
> > I took a very similar approach and it worked fine for me.  Just spot
> > check the regions afterwards to make sure they look lexicographically
> > sorted.  I used ImmutableBytesWritable as my key, and the default Hadoop
> > sorting for that turned out to sort lexicographically as required.  Our
> > HBase rows varied in size, so instead of counting the number of rows, we
> > tallied up KeyValue.getLength() for each KeyValue in a row until the
> > size reached a certain limit.
> >
> > On Sat, May 12, 2012 at 7:21 PM, Something Something <
> > [EMAIL PROTECTED]> wrote:
> >
> > > Hello,
> > >
> > > This is really a MapReduce question, but the output from this will
> > > be used to create regions for an HBase table.  Here's what I want to
> > > do:
> > >
> > > Take an input file that contains data about users.
> > > Sort this file by a key (which consists of a few fields from the row)
> > > After every x # of rows write the key.
> > >
> > >
> > > Here's how I was going to structure my MapReduce:
> > >
> > > public Splitter {
> > >
> > >    static int counter;
> > >
> > >    private Mapper {
> > >        map() {
> > >            Build key by concatenating fields
> > >            Write key
> > >            increment counter;
> > >        }
> > >    }
> > >
> > >    //  # of reducers will be set to 1.  My understanding is that this
> > >    //  will send the lines to reducer in sorted order one at a time -
> > >    //  is this a correct assumption?
> > >    private Reducer {
> > >         static long i;
> > >         reduce() {
> > >             static long splitSize = counter / 300;  //  300 is region size
> > >             if (i == 0 || i == splitSize) {
> > >                 Write key;  // this will be used as a 'startkey'.
> > >                  i = 0;
> > >             }
> > >             i++;
> > >         }
> > >    }
> > > }
> > >
> > > To summarize, there are 2 questions:
> > >
> > > 1)  I am passing # of rows processed by Mapper to Reducer via a static
> > > counter.  Would this work?  Is there a better way?
> > > 2)  If I set # of reducers to 1, would the lines be sent to reducer in
> > > sorted order one at a time?
> > >
> > > Thanks in advance for the help.
> > >
> >
>