|
T Vinod Gupta
2012-02-28, 14:34
Tim Robertson
2012-02-28, 14:44
Ben Snively
2012-02-28, 14:45
T Vinod Gupta
2012-02-28, 14:50
T Vinod Gupta
2012-02-28, 14:51
Tim Robertson
2012-02-28, 15:02
T Vinod Gupta
2012-02-28, 15:06
Ben Snively
2012-02-28, 15:22
T Vinod Gupta
2012-02-28, 15:25
Michel Segel
2012-02-28, 15:44
T Vinod Gupta
2012-02-28, 16:14
Jacques
2012-02-28, 16:15
Michael Segel
2012-02-28, 16:20
Jacques
2012-02-28, 16:21
Ben Snively
2012-02-28, 17:40
Jacques
2012-02-29, 05:16
Michel Segel
2012-02-29, 13:04
Michel Segel
2012-02-29, 13:18
Ben Snively
2012-02-29, 13:21
Jacques
2012-03-01, 17:28
|
-
multiple puts in reducer? T Vinod Gupta 2012-02-28, 14:34
While doing MapReduce on HBase tables, is it possible to do multiple puts in the reducer? What I want is a way to write multiple rows. If it's not possible, what are the alternatives? For example, creating a wider table.

Thanks
-
Re: multiple puts in reducer? Tim Robertson 2012-02-28, 14:44
Hi,

Assuming you use TableOutputFormat [1], you can emit as many Puts as you want from a reducer. You will need to handle the row key as you create each Put to emit.

HTH,
Tim

[1] http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/TableOutputFormat.html
-
Re: multiple puts in reducer? Ben Snively 2012-02-28, 14:45
I think the short answer is yes, but the part I would be worried about is this: how do you manage speculative execution on the reducer (or is that only for map tasks)?

I've always ended up creating import files and bringing them into HBase.

Thanks,
Ben
-
Re: multiple puts in reducer? T Vinod Gupta 2012-02-28, 14:50
I was looking at this page: http://hbase.apache.org/book/mapreduce.example.html, specifically section 7.2.4. So if I understand you correctly, I can pass a list of Puts to context.write()? I haven't tried it yet. Is that the way to go?

Thanks
-
Re: multiple puts in reducer? T Vinod Gupta 2012-02-28, 14:51
Ben,
I didn't quite understand your concern. What speculative execution are you referring to?

Thanks,
Vinod
-
Re: multiple puts in reducer? Tim Robertson 2012-02-28, 15:02
Hi,

You can call context.write() multiple times in reduce() to emit more than one row.

If you are creating the Puts in the map function, then you need to call setMapSpeculativeExecution(false) on the job conf, or else Hadoop *might* spawn more than one attempt for a given task, meaning you'll get duplicate data.

HTH,
Tim
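
A minimal sketch of the "write several times per key" pattern described above, using only the Java standard library so it runs without a cluster. `RowSink` and `MultiRowReducer` are hypothetical stand-ins for Hadoop's reducer `Context` and an HBase reducer class; in a real job each `sink.write(...)` would presumably be a `context.write(...)` carrying a Put.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for the reducer's Context: receives each emitted row.
interface RowSink {
    void write(String rowKey, String value);
}

class MultiRowReducer {
    // Emits two rows for one input key: one with the sum, one with the count.
    // Row keys are derived from the input key, so a rerun rewrites the same rows.
    static void reduce(String key, List<Integer> values, RowSink sink) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        sink.write(key + ":sum", Integer.toString(sum));             // first emitted row
        sink.write(key + ":count", Integer.toString(values.size())); // second emitted row
    }

    // Helper for the usage example: runs reduce() and returns the emitted rows in order.
    static List<String> run(String key, List<Integer> values) {
        List<String> rows = new ArrayList<>();
        reduce(key, values, (rowKey, value) -> rows.add(rowKey + "=" + value));
        return rows;
    }

    public static void main(String[] args) {
        System.out.println(run("user42", Arrays.asList(3, 4, 5)));
    }
}
```

The point of the sketch is only that nothing limits a reduce call to one output record; how row keys are built is up to the job.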
-
Re: multiple puts in reducer? T Vinod Gupta 2012-02-28, 15:06
Thanks, that helps!
-
Re: multiple puts in reducer? Ben Snively 2012-02-28, 15:22
I think you just need to turn speculative execution off for that job?

The speculative execution I am referring to is when the job tracker executes multiple instances of the same task across the cluster. It does this when the cluster isn't busy and particular tasks are taking too long, to see whether it can get the task completed quicker on another node in the cluster.

My fear was this: in a running MapReduce job, speculative execution could cause two instances of the same reduce task to execute, to see which one finishes first. The impact of that depends on the use case and on how the timestamp for the data being ingested into HBase is generated.

Is this an issue, or is it just me pretending to know more than I do?

Thanks,
Ben
-
Re: multiple puts in reducer? T Vinod Gupta 2012-02-28, 15:25
Thanks, I didn't know about this! It's always useful to know. I'll keep this in mind when implementing.
-
Re: multiple puts in reducer? Michel Segel 2012-02-28, 15:44
Yes, you can do it.
But why do you have a reducer when running an M/R job against HBase?

The trick in writing multiple rows: you do it independently of the output from the map() method.

Mike Segel
-
Re: multiple puts in reducer? T Vinod Gupta 2012-02-28, 16:14
Mike,
I didn't understand. Why would I not need a reducer in HBase M/R? There can be cases where one is needed, right?
My use case is very similar to Sujee's blog post on frequency counting: http://sujee.net/tech/articles/hadoop/hbase-map-reduce-freq-counter/
So in the reducer, I can do all the aggregations. Is there a better way? I can think of another way: use increments in the map job itself. I have to figure out whether that's possible, though.

Thanks
-
Re: multiple puts in reducer? Jacques 2012-02-28, 16:15
The key is that there are two output commit strategies for a MapReduce job: those that follow the MapReduce paradigm and those that work outside of it.

Option 1: Rely on MapReduce for committing your output. If you work only within an existing FileOutputFormat and its associated FileOutputCommitter, you don't have to worry about your outputs being created twice. Speculative execution is dealt with automatically at the MapReduce layer. (Output is pushed to the next stage only if a phase is successful.)

Option 2: Rely on your own semantics. For example, create your own HTable and start running puts and deletes. In this case, you had better make sure that your actions are idempotent. Speculative execution means the same action may run multiple times. Even if you disable speculative execution, a task may fail due to other problems and get restarted (for example, if a tasktracker node is overcommitted on memory), so the first part of your job may still run multiple times. The only way to make this work correctly is to ensure that your job actions are idempotent.
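
To make "idempotent" concrete, here is a small stdlib-only sketch. It assumes a plain in-memory `Map` as a stand-in for an HBase table, and `ReplayDemo`, `putTask`, and `incrementTask` are hypothetical names. It replays the same task as a retried or speculative attempt would and compares an absolute write with an increment.

```java
import java.util.HashMap;
import java.util.Map;

class ReplayDemo {
    // Idempotent: writes an absolute value to a fixed key.
    static void putTask(Map<String, Long> table) {
        table.put("word:hbase", 7L);
    }

    // Not idempotent: every run adds to whatever is already stored.
    static void incrementTask(Map<String, Long> table) {
        table.merge("word:hbase", 7L, Long::sum);
    }

    // Runs the same task `attempts` times (attempts >= 1) against a fresh table
    // and returns the stored value afterwards.
    static long valueAfter(int attempts, boolean useIncrement) {
        Map<String, Long> table = new HashMap<>();
        for (int i = 0; i < attempts; i++) {
            if (useIncrement) {
                incrementTask(table);
            } else {
                putTask(table);
            }
        }
        return table.get("word:hbase");
    }

    public static void main(String[] args) {
        System.out.println("put, 2 attempts: " + valueAfter(2, false));      // 7
        System.out.println("increment, 2 attempts: " + valueAfter(2, true)); // 14
    }
}
```

The put ends in the same state however many times the task runs; the increment double-counts on the second attempt.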
-
Re: multiple puts in reducer? Michael Segel 2012-02-28, 16:20
The better question is: why would you need a reducer?

That's a bit cryptic, I understand, but you have to ask yourself when you need to use a reducer when you are writing to a database... ;-)
-
Re: multiple puts in reducer? Jacques 2012-02-28, 16:21
Let me append this.

Having just looked at the code for TableOutputFormat, I must correct myself. TableOutputFormat does a direct commit, so it falls under option 2. So the only way to ensure that your output from a job is safe using TableOutputFormat is to make sure the actions you're doing are idempotent. To avoid this problem, you would need to use an output that correctly supports commit.

Jacques
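
As an illustration of what "an output that correctly supports commit" means, the following stdlib-only sketch stages each attempt's records separately and promotes exactly one attempt. It is a toy model under stated assumptions, not how FileOutputCommitter is actually implemented; `StagedOutput` and the attempt ids are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class StagedOutput {
    private final Map<String, List<String>> staging = new HashMap<>(); // per-attempt scratch space
    private final List<String> committed = new ArrayList<>();          // output visible to readers
    private boolean taskCommitted = false;

    // An attempt writes only into its own staging area, never into visible output.
    void write(String attemptId, String record) {
        staging.computeIfAbsent(attemptId, id -> new ArrayList<>()).add(record);
    }

    // The first attempt to commit is promoted; any later attempt is discarded.
    boolean commit(String attemptId) {
        List<String> records = staging.remove(attemptId);
        if (taskCommitted || records == null) {
            return false;
        }
        committed.addAll(records);
        taskCommitted = true;
        return true;
    }

    List<String> visible() {
        return committed;
    }

    // Two speculative attempts of the same task write identical records.
    static List<String> demo() {
        StagedOutput out = new StagedOutput();
        for (String attempt : Arrays.asList("attempt_0", "attempt_1")) {
            out.write(attempt, "rowA");
            out.write(attempt, "rowB");
        }
        out.commit("attempt_1"); // finishes first, so it is promoted
        out.commit("attempt_0"); // loses the race, so its output is dropped
        return out.visible();
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

With a direct-commit output there is no staging step, so both attempts' writes would land in the table, which is why idempotence matters there.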
-
Re: multiple puts in reducer? Ben Snively 2012-02-28, 17:40
Is the assertion that you would never need to run a reducer when writing to the DB?

It seems that there are cases where you would not need one, but the general statement doesn't apply to all use cases.

If you were processing data where a map task (or set of map tasks) may output the same key more than once, you could need to reduce the data for that key before inserting the result into HBase.

Am I missing something? To me, that would be the deciding factor: whether the key/values output by the map task are the exact values that need to be inserted into HBase, or whether multiple values must be aggregated together and the result put into the HBase entry.

Thanks,
Ben
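
The aggregation case can be sketched with the frequency-counting example mentioned earlier in the thread. This is a stdlib-only simulation: `FreqCount.aggregate` is a hypothetical stand-in for the shuffle plus reduce step, and the resulting map entries stand for the one row per key that would be written to HBase.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class FreqCount {
    // Groups map outputs by key and sums them, so each key yields a single final value
    // instead of one write per occurrence.
    static Map<String, Integer> aggregate(List<String> mapOutputKeys) {
        Map<String, Integer> totals = new TreeMap<>(); // sorted, like reducer input keys
        for (String key : mapOutputKeys) {
            totals.merge(key, 1, Integer::sum);
        }
        return totals;
    }

    public static void main(String[] args) {
        // The key "a" is emitted three times by the (simulated) map tasks.
        System.out.println(aggregate(Arrays.asList("a", "b", "a", "c", "a")));
    }
}
```

Five map outputs collapse into three final values here, and each final value is an absolute count that can be written with a plain put.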
-
Re: multiple puts in reducer? Jacques 2012-02-29, 05:16
I see nothing wrong with writing the output of the reducer into HBase. You just need to make sure duplicated operations wouldn't cause problems. If using TableOutputFormat, don't use randomly seeded keys. If working straight against HTable, don't use increment. We do this in some situations and either don't care about overwrites or use checkAndPut with a skip option in the application code.
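
A rough stdlib-only analogy for the two pieces of advice above: deterministic row keys, and "write only if absent, otherwise skip". `Map.putIfAbsent` merely mimics the skip behaviour here and is not atomic the way a server-side check would be; `SkipIfPresent` and its methods are hypothetical names.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

class SkipIfPresent {
    // Deterministic row key: derived only from the record, so a rerun targets the same row.
    static String deterministicKey(String user, String day) {
        return user + "|" + day;
    }

    // Writes only when the row does not exist yet; returns true if it wrote, false if it skipped.
    static boolean putIfNew(Map<String, String> table, String rowKey, String value) {
        return table.putIfAbsent(rowKey, value) == null;
    }

    // Replays the same logical write twice and reports how many rows exist afterwards.
    static int rowsAfterReplay(boolean randomKeys) {
        Map<String, String> table = new HashMap<>();
        for (int attempt = 0; attempt < 2; attempt++) {
            String key = randomKeys
                    ? UUID.randomUUID().toString()                // new key on every attempt
                    : deterministicKey("user42", "2012-02-28");   // same key on every attempt
            putIfNew(table, key, "clicks=5");
        }
        return table.size();
    }

    public static void main(String[] args) {
        System.out.println("deterministic keys: " + rowsAfterReplay(false) + " row(s)");
        System.out.println("random keys: " + rowsAfterReplay(true) + " row(s)");
    }
}
```

With deterministic keys the second attempt is skipped and one row remains; with random keys every attempt creates a new row, which is the duplication the advice is meant to avoid.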
-
Re: multiple puts in reducer? Michel Segel 2012-02-29, 13:04
The assertion is that for most cases you shouldn't need one. The rule of thumb is that you should have to defend your use of one.

Reducers are expensive. Running multiple mappers in a job can be cheaper. All I am saying is that you need to rethink your solution if you insist on using a reducer.

Mike Segel
-
Re: multiple puts in reducer? Michel Segel 2012-02-29, 13:18
There is nothing wrong in writing the output from a reducer to HBase.
The question you have to ask yourself is why you are using a reducer in the first place. ;-)

Look, you have a database. Why do you need a reducer?

It's a simple question... Right? ;-)

Look, I apologize for being cryptic. This is one of those philosophical design questions where you, the developer/architect, have to figure out the answer for yourself. Maybe I should submit this as an HBaseCon topic for a presentation?

Sort of like how to do an efficient table join in HBase....

HTH

Sent from a remote device. Please excuse any typos...

Mike Segel
-
Re: multiple puts in reducer? Ben Snively 2012-02-29, 13:21
I would enjoy seeing this:

"Maybe I should submit this as an HBaseCon topic for a presentation?"

Thanks,
Ben
-
Re: multiple puts in reducer? Jacques 2012-03-01, 17:28
The data flow is what matters. The reduce phase is about sorting output.
If you push puts to HBase, the input to HBase doesn't have to be sorted, since HBase does a sort no matter what. So using a reducer to sort the output is overkill if you're simply putting those same objects into HBase. On the flip side, if your reducer is doing real work that can't be done in your mapper and that the HBase client can't do, then go ahead and use the reducer.

> Reducers are expensive. Running multiple mappers in a job can be cheaper.

Expounding: all reducers by definition have to wait until all mappers are done before they can actually start running the reduce method. (The shuffle can start before this.) If you don't pick your partitioner correctly, a small number of reducers may end up doing most of the work, basically making your job less parallel than you imagined.

A simple rule one could use: don't use a reducer unless you must do a parallel sort of a large amount of data. (Smaller sorts, e.g. a map-side join, work if one of the join sides fits into memory.)
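To make the partitioner point above concrete, here is a minimal sketch in plain Python, not Hadoop or HBase code. It assumes a hash-modulo partitioner in the general style MapReduce frameworks use by default; the `partition` and `reducer_loads` helpers, the stand-in hash, and the key counts are all hypothetical and chosen only for illustration.

```python
def partition(key: str, num_reducers: int) -> int:
    """Pick a reducer for a key: non-negative hash modulo reducer count.

    The hash here (sum of UTF-8 bytes) is a deterministic stand-in,
    not the hash any real framework uses.
    """
    return sum(key.encode("utf-8")) % num_reducers


def reducer_loads(keys, num_reducers):
    """Count how many map-output records each reducer would receive."""
    loads = [0] * num_reducers
    for key in keys:
        loads[partition(key, num_reducers)] += 1
    return loads


# Hypothetical map output: one hot key plus a few rare ones.
map_output_keys = (
    ["the"] * 9000 + ["hbase"] * 400 + ["put"] * 350 + ["row"] * 250
)

loads = reducer_loads(map_output_keys, num_reducers=4)
busiest_share = max(loads) / sum(loads)
print("records per reducer:", loads)
print("share handled by busiest reducer:", busiest_share)
```

Whatever hash is used, every record with the same key goes to the same reducer, so the reducer that receives the hot key handles at least 90% of the records here while the others sit mostly idle. Adding reducers does not help in that situation; only a different key design or partitioning scheme would.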