-
Re: MR job for creating splits
Bryan Beaudreault 2012-05-13, 02:07
I took a very similar approach and it worked fine for me. Just spot-check the regions afterward to make sure they look lexicographically sorted. I used ImmutableBytesWritable as my key, and the default Hadoop sorting for that turned out to sort lexicographically, as required. Our HBase rows varied in size, so instead of counting rows, we tallied up KeyValue.getLength() for each KeyValue in a row until the running size reached a certain limit.
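A minimal sketch of the size-based idea described above, written as plain Java so it runs without a cluster. The class SizeBasedSplitter, the RowSize holder, and the maxRegionBytes parameter are hypothetical names for illustration, not anything from HBase or Hadoop. It assumes the rows are already in sorted key order (as they would be when they reach a single reducer) and that each row's byte count was obtained by summing KeyValue.getLength() over its cells.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of HBase or Hadoop.
final class SizeBasedSplitter {

    /** One row key plus the summed byte length of that row's cells. */
    static final class RowSize {
        final String rowKey;
        final long bytes;

        RowSize(String rowKey, long bytes) {
            this.rowKey = rowKey;
            this.bytes = bytes;
        }
    }

    /**
     * Walks rows in sorted key order and records the key of the first row
     * seen after the running byte total has reached maxRegionBytes.
     * Each recorded key is a candidate region start key.
     */
    static List<String> splitKeys(List<RowSize> sortedRows, long maxRegionBytes) {
        List<String> splits = new ArrayList<>();
        long accumulated = 0;
        for (RowSize row : sortedRows) {
            if (accumulated >= maxRegionBytes) {
                splits.add(row.rowKey); // this row starts a new region
                accumulated = 0;        // start tallying the new region
            }
            accumulated += row.bytes;
        }
        return splits;
    }
}
```

In a real job the same loop would presumably live in the reducer, with the running total kept in a member variable and the split key written to the output instead of collected in a list.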
On Sat, May 12, 2012 at 7:21 PM, Something Something < [EMAIL PROTECTED]> wrote:
> Hello,
>
> This is really a MapReduce question, but the output from this will be used
> to create regions for an HBase table. Here's what I want to do:
>
> Take an input file that contains data about users.
> Sort this file by a key (which consists of a few fields from the row)
> After every x # of rows write the key.
>
> Here's how I was going to structure my MapReduce:
>
> public Splitter {
>
>     static int counter;
>
>     private Mapper {
>         map() {
>             Build key by concatenating fields
>             Write key
>             increment counter;
>         }
>     }
>
>     // # of reducers will be set to 1. My understanding is that this will
>     // send the lines to reducer in sorted order one at a time - is this a
>     // correct assumption?
>     private Reducer {
>         static long i;
>         reduce() {
>             static long splitSize = counter / 300; // 300 is region size
>             if (i == 0 || i == splitSize) {
>                 Write key; // this will be used as a 'startkey'.
>                 i = 0;
>             }
>             i++;
>         }
>     }
> }
>
> To summarize, there are 2 questions:
>
> 1) I am passing # of rows processed by Mapper to Reducer via a static
> counter. Would this work? Is there a better way?
> 2) If I set # of reducers to 1, would the lines be sent to reducer in
> sorted order one at a time?
>
> Thanks in advance for the help.
-
Re: MR job for creating splits
Bryan Beaudreault 2012-05-13, 16:35
Why do you need to know this? Were you trying to put a percentage of the rows in each region? Otherwise, just keep a member variable in your reducer class and increment it on each call to reduce(). I think you'll be better off finding a way to do it without percentages if possible. Perhaps try calculating the size of the data instead. You should have that available since you are trying to bulk load anyway (which requires Put or KeyValue values, and you can get the size from both).
On Sun, May 13, 2012 at 2:11 AM, Something Something < [EMAIL PROTECTED]> wrote:
> Is there no way to find out inside a single reducer how many records were
> created by all the Mappers? I tried several ways but nothing works. For
> example, I tried this:
>
> reporter.getCounter(Task.Counter.REDUCE_INPUT_RECORDS).getValue();
>
> It's not working for me. Should this have worked? Am I just doing
> something dumb? I would rather not create another MR job just to count #
> of lines.
-
Re: MR job for creating splits
Dave Revell 2012-05-14, 16:45
Re: your question #1, you won't be able to pass information from mappers to reducers by using static variables. Since map tasks run in different JVM instances than reduce tasks, the value of the static variable will never be sent from the mapper JVM to the reducer JVM.
It might work in standalone mode, but that's probably not the case for your production environment.
Re: your question #2, google for "hadoop secondary sort."
Some vague advice on your algorithm to determine the best splits for your data: if you don't need the splits to be optimal, you might try randomly sampling your keys instead of processing all of them. This might not even require mapreduce.
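One way the sampling suggestion might look, as a rough sketch rather than a tested recipe: keep a uniform random sample of the keys with reservoir sampling, sort the sample, and take evenly spaced entries as approximate split points. The class SampledSplitPoints and its method names are made up for illustration. It also assumes the keys are plain ASCII strings, so that Java's String ordering matches the byte ordering HBase uses; binary keys would need an unsigned byte-wise comparator instead.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Hypothetical helper, not part of HBase or Hadoop.
final class SampledSplitPoints {

    /**
     * Reservoir sampling (Algorithm R): keeps a uniform random sample of at
     * most sampleSize keys from a stream of unknown length, in one pass.
     */
    static List<String> sample(Iterable<String> keys, int sampleSize, Random rng) {
        List<String> reservoir = new ArrayList<>(sampleSize);
        long seen = 0;
        for (String key : keys) {
            seen++;
            if (reservoir.size() < sampleSize) {
                reservoir.add(key); // fill the reservoir first
            } else {
                // Pick a slot in [0, seen); the key is kept only if the slot
                // falls inside the reservoir, i.e. with odds sampleSize / seen.
                long slot = (long) (rng.nextDouble() * seen);
                if (slot < sampleSize) {
                    reservoir.set((int) slot, key);
                }
            }
        }
        return reservoir;
    }

    /**
     * Sorts the sample and returns up to numRegions - 1 evenly spaced keys
     * to use as approximate region boundaries.
     */
    static List<String> splitPoints(List<String> sample, int numRegions) {
        List<String> sorted = new ArrayList<>(sample);
        Collections.sort(sorted);
        List<String> splits = new ArrayList<>();
        for (int i = 1; i < numRegions; i++) {
            int idx = (int) ((long) i * sorted.size() / numRegions);
            if (idx < sorted.size()) {
                splits.add(sorted.get(idx));
            }
        }
        return splits;
    }
}
```

The splits are only as good as the sample, and a small sample can yield duplicate split points, so it is still worth spot-checking the resulting regions.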
Best, Dave