|
Steinmaurer Thomas
2011-09-16, 05:25
Sonal Goyal
2011-09-16, 07:22
Michel Segel
2011-09-16, 09:05
Sonal Goyal
2011-09-16, 16:06
Sonal Goyal
2011-09-16, 16:11
Michael Segel
2011-09-16, 17:05
Sonal Goyal
2011-09-16, 17:30
Michael Segel
2011-09-16, 18:43
Chris Tarnas
2011-09-16, 18:58
Doug Meil
2011-09-16, 19:41
Michael Segel
2011-09-16, 20:11
Michael Segel
2011-09-16, 20:24
Chris Tarnas
2011-09-16, 21:54
Chris Tarnas
2011-09-16, 22:34
Sam Seigal
2011-09-17, 00:16
Doug Meil
2011-09-17, 00:22
Doug Meil
2011-09-17, 00:24
Sam Seigal
2011-09-17, 01:00
Doug Meil
2011-09-17, 01:14
Sam Seigal
2011-09-17, 01:39
Sam Seigal
2011-09-17, 01:44
Doug Meil
2011-09-17, 01:47
Michel Segel
2011-09-17, 13:12
Steinmaurer Thomas
2011-09-19, 05:35
Steinmaurer Thomas
2011-09-19, 05:41
Doug Meil
2011-09-19, 13:35
Steinmaurer Thomas
2011-09-19, 13:44
|
-
Writing MR-Job: Something like OracleReducer, JDBCReducer ...Steinmaurer Thomas 2011-09-16, 05:25
Hello,
I am writing an MR job to process HBase data and store the aggregated data in Oracle. How would you do that in an MR job?

Currently, for test purposes, we write the result into an HBase table again by using a TableReducer. Is there something like an OracleReducer, RelationalReducer, JDBCReducer or whatever? Or should one simply use plain JDBC code in the reduce step?

Thanks!
Thomas
-
Re: Writing MR-Job: Something like OracleReducer, JDBCReducer ...Sonal Goyal 2011-09-16, 07:22
There is a DBOutputFormat class in the org.apache.hadoop.mapreduce.lib.db package; you could use that. Or you could write to HDFS and then use something like HIHO[1] to export to the database. I have been working extensively in this area; you can write to me directly if you need any help.

1. https://github.com/sonalgoyal/hiho

Best Regards,
Sonal
Crux: Reporting for HBase <https://github.com/sonalgoyal/crux>
Nube Technologies <http://www.nubetech.co>
<http://in.linkedin.com/in/sonalgoyal>
-
Re: Writing MR-Job: Something like OracleReducer, JDBCReducer ...Michel Segel 2011-09-16, 09:05
I think you need to get a little bit more information.
Reducers are expensive. When Thomas says that he is aggregating data, what exactly does he mean? When dealing with HBase, you really don't want to use a reducer.

You may want to run two map jobs, and it could be that just dumping the output via JDBC makes the most sense.

We are starting to see a lot of questions where the OP isn't providing enough information, so the recommendation could be wrong...

Sent from a remote device. Please excuse any typos...

Mike Segel
-
Re: Writing MR-Job: Something like OracleReducer, JDBCReducer ...Sonal Goyal 2011-09-16, 16:06
Hi Thomas,
I just assumed that you are already using reducers. From what I understood (please correct me if I am mistaken), you have data in HBase and you are running an MR job to aggregate the data. You have the map as well as the reduce phase, and as part of the final output you want to send the data to Oracle. Is that correct?

Is there any information you would like to share regarding your flow and data? How big is your data, how often do you need to aggregate, and what do your mappers emit? Are you already using reducers for aggregations?

Best Regards,
Sonal
-
Re: Writing MR-Job: Something like OracleReducer, JDBCReducer ...Sonal Goyal 2011-09-16, 16:11
Michel,
Sorry, can you please help me understand what you mean when you say that when dealing with HBase, you really don't want to use a reducer? Here, HBase is being used as the input to the MR job.

Thanks,
Sonal
-
RE: Writing MR-Job: Something like OracleReducer, JDBCReducer ...Michael Segel 2011-09-16, 17:05
Sonal,

Just because you have an m/r job doesn't mean that you need to reduce anything. You can have a job that contains only a mapper. Or your job runner can run a series of map jobs serially.

Most, if not all, of the map/reduce jobs where we pull data from HBase don't require a reducer.

To give you a simple example... if I want to determine the table schema where I am storing some sort of structured data... I just write an m/r job which opens a table and scans it, counting the occurrence of each column name via dynamic counters.

There is no need for a reducer.

Does that help?
-
Re: Writing MR-Job: Something like OracleReducer, JDBCReducer ...Sonal Goyal 2011-09-16, 17:30
Hi Michael,
Yes, thanks, I understand that reducers can be expensive with all the shuffling and sorting, and that you may not always need them. At the same time, there are many cases where reducers are useful, like secondary sorting. In many cases, one can have multiple map phases and no reduce phase at all. Again, there will be many cases where one may want a reducer, say when trying to count the occurrence of words in a particular column.

With this chain of thought, I do not feel ready to say that when dealing with HBase, I really don't want to use a reducer. Please correct me if I am wrong.

Thanks again.

Best Regards,
Sonal
-
RE: Writing MR-Job: Something like OracleReducer, JDBCReducer ...Michael Segel 2011-09-16, 18:43
Sonal,

You do realize that HBase is a "database", right? ;-)

So again, why do you need a reducer? ;-)

Using your example...
"Again, there will be many cases where one may want a reducer, say trying to count the occurrence of words in a particular column."

You can do this one of two ways...
1) Dynamic Counters in Hadoop.
2) Use a temp table and auto increment the value in a column which contains the word count. (Fat row where rowkey is doc_id and column is word, or rowkey is doc_id|word)

I'm sorry, but if you go through all of your examples of why you would want to use a reducer, you end up finding out that writing to an HBase table would be faster than a reduce job.
(Again, we haven't done an exhaustive search, but in all of the HBase jobs we've run... no reducers were necessary.)

The point I'm trying to make is that you want to avoid using a reducer whenever possible, and if you think about your problem... you can probably come up with a solution that avoids the reducer...

HTH

-Mike

PS. I haven't looked at *all* of the potential use cases of HBase, which is why I don't want to say you'll never need a reducer. I will say that based on what we've done at my client's site, we try very hard to avoid reducers.
[Note, I'm sure I'm going to get hammered on this when I head to NY in Nov. :-) ]
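Editorial illustration (not code from the thread): option 2 above, with the rowkey laid out as doc_id|word, might look roughly like the sketch below. It is a minimal in-memory stand-in, assuming a sorted map in place of the HBase temp table; the class and method names (TempTableWordCount, rowKey, mapRecord) are hypothetical. In a real map-only job, the `increment` call would presumably be replaced by an atomic HBase increment on the temp table.

```java
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical stand-in for a map-only word count that increments counters in a temp table. */
final class TempTableWordCount {
    // Stand-in for the temp table: row key -> counter value, kept sorted by key.
    private final Map<String, Long> rows = new TreeMap<>();

    /** Composite row key "doc_id|word", as suggested in the message above. */
    static String rowKey(String docId, String word) {
        return docId + "|" + word;
    }

    /** Adds 1 to the counter for (docId, word); assumed to map to an atomic increment in HBase. */
    void increment(String docId, String word) {
        rows.merge(rowKey(docId, word), 1L, Long::sum);
    }

    /** What a mapper would do for one record: split the text and bump one counter per word. */
    void mapRecord(String docId, String text) {
        for (String w : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!w.isEmpty()) {
                increment(docId, w);
            }
        }
    }

    /** Reads a counter back; returns 0 for a key that was never incremented. */
    long count(String docId, String word) {
        return rows.getOrDefault(rowKey(docId, word), 0L);
    }
}
```

Because every mapper only issues increments, no shuffle or sort is needed; the trade-off is one write per word occurrence against the table, which the thread notes has not been benchmarked against a reducer.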
-
Re: Writing MR-Job: Something like OracleReducer, JDBCReducer ...Chris Tarnas 2011-09-16, 18:58
If only I could make NY in Nov :)
We extract large numbers of DNA sequence reads from HBase, run them through M/R pipelines to analyze and aggregate, and then we load the results back in. Definitely specialized usage, but I could see other perfectly valid uses for reducers with HBase.

-chris
-
Re: Writing MR-Job: Something like OracleReducer, JDBCReducer ...Doug Meil 2011-09-16, 19:41
Chris, agreed... There are times when reducers aren't required, and situations where they are useful. We have both kinds of jobs.

For others following the thread, I updated the book recently with more MR examples (read-only, read-write, read-summary):

http://hbase.apache.org/book.html#mapreduce.example

As to the question that started this thread...

re: "Store aggregated data in Oracle."

To me, that sounds like the "read-summary" example with JDBC-Oracle in the reduce step.
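Editorial illustration (not code from the thread or from the HBase book): the "plain JDBC in the reduce step" idea could be sketched as a small batching writer like the one below. It uses only the standard java.sql API; the class name JdbcSummaryWriter, the helper buildInsertSql, and the table and column names in the usage note are hypothetical. The assumption is that a reducer would create the writer once, call write() for each aggregated row, and close it when the task finishes.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.StringJoiner;

/** Hypothetical helper: buffers aggregated rows and writes them to a relational table in batches. */
final class JdbcSummaryWriter implements AutoCloseable {
    private final PreparedStatement stmt;
    private final int batchSize;
    private int pending = 0;

    JdbcSummaryWriter(Connection conn, String table, int batchSize, String... columns)
            throws SQLException {
        this.stmt = conn.prepareStatement(buildInsertSql(table, columns));
        this.batchSize = batchSize;
    }

    /** Builds "INSERT INTO t (a, b) VALUES (?, ?)" for the given table and columns. */
    static String buildInsertSql(String table, String... columns) {
        StringJoiner cols = new StringJoiner(", ", "(", ")");
        StringJoiner marks = new StringJoiner(", ", "(", ")");
        for (String c : columns) {
            cols.add(c);
            marks.add("?");
        }
        return "INSERT INTO " + table + " " + cols + " VALUES " + marks;
    }

    /** Queues one row; sends the batch to the database once batchSize rows are pending. */
    void write(Object... values) throws SQLException {
        for (int i = 0; i < values.length; i++) {
            stmt.setObject(i + 1, values[i]); // JDBC parameters are 1-based
        }
        stmt.addBatch();
        if (++pending >= batchSize) {
            flush();
        }
    }

    /** Sends any queued rows. */
    void flush() throws SQLException {
        if (pending > 0) {
            stmt.executeBatch();
            pending = 0;
        }
    }

    /** Flushes remaining rows, then releases the statement; the caller owns the Connection. */
    @Override
    public void close() throws SQLException {
        try {
            flush();
        } finally {
            stmt.close();
        }
    }
}
```

In a Hadoop reducer this would plausibly be wired up as: open the Connection and the writer in setup(), call something like writer.write(rowKey, total) in reduce(), and close both in cleanup(). Batching matters because each reducer otherwise pays one database round trip per aggregated row; failed or re-executed task attempts can also insert the same rows twice, so the target table may need a key that makes retries safe.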
-
RE: Writing MR-Job: Something like OracleReducer, JDBCReducer ...Michael Segel 2011-09-16, 20:11
Chris,

I don't know what sort of aggregation you are doing, but again, why not write to a temp table instead of using a reducer?
-
RE: Writing MR-Job: Something like OracleReducer, JDBCReducer ...Michael Segel 2011-09-16, 20:24
Doug and company...

Look, I'm not saying that there aren't m/r jobs where you might need reducers when working with HBase. What I am saying is that if we look at what you're attempting to do, you may end up getting better performance if you create a temp table in HBase and let HBase do some of the heavy lifting where you are currently using a reducer. From the jobs that we run, when we looked at what we were doing, there wasn't any need for a reducer. I suspect that it's true of other jobs.

Remember that HBase is much more than just an HFile format to persist stuff.

Even looking at Sonal's example... you have other ways of doing the record counts, like dynamic counters or using a temp table in HBase, which I believe will give you better performance numbers, although I haven't benchmarked either against a reducer.

Does that make sense?

-Mike
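To make the temp-table option concrete, here is a minimal sketch of a map-only word count that increments one counter per "doc_id|word" row key, the second row-key layout suggested earlier in the thread. It uses only the JDK: the in-memory map is an assumed stand-in for the HBase temp table, and the class and method names (TempTableWordCount, map, count) are hypothetical. In a real job each merge call would presumably become an atomic increment against HBase.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Map-only word count: each map call bumps a counter row instead of emitting to a reducer. */
final class TempTableWordCount {

    // Assumed stand-in for the HBase temp table: row key "doc_id|word" -> count.
    private final Map<String, Long> tempTable = new ConcurrentHashMap<>();

    /** Processes one source row: a document id and the text of the column being counted. */
    void map(String docId, String text) {
        for (String word : text.toLowerCase().split("\\s+")) {
            if (word.isEmpty()) {
                continue; // split() yields an empty token for leading whitespace
            }
            // One atomic increment per occurrence; no shuffle or sort is involved.
            tempTable.merge(docId + "|" + word, 1L, Long::sum);
        }
    }

    /** Reads back the counter for one document/word pair (0 if never seen). */
    long count(String docId, String word) {
        return tempTable.getOrDefault(docId + "|" + word, 0L);
    }

    public static void main(String[] args) {
        TempTableWordCount job = new TempTableWordCount();
        job.map("doc1", "to be or not to be");
        System.out.println(job.count("doc1", "to"));
    }
}
```

The trade-off this illustrates is one table operation per word occurrence in exchange for skipping the shuffle and sort. Whether that is faster than a reducer is exactly what the thread says has not been benchmarked.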
-
Re: Writing MR-Job: Something like OracleReducer, JDBCReducer ...Chris Tarnas 2011-09-16, 21:54
Hi Mike,
It's analysis* and aggregation, not just aggregation, so it's a bit more complex. Each row in the input generates at least one new row of data when we are done.

For our data sizes (~1 billion 2-3 KB rows per job now and growing) we originally did normal inserts, but then we switched to bulk imports - it was much faster and a lot less stress on the regionservers. Bulk importing uses a reducer, so even if we went through and changed our M/R pipelines to use a temp table for organized intermediate data, the most efficient way to populate the temp table would be via the bulk loader - using a reducer anyway.

-chris

* Sorry to be broad, but for business reasons I can't talk too much about the analysis details.
-
Re: Writing MR-Job: Something like OracleReducer, JDBCReducer ...Chris Tarnas 2011-09-16, 22:34
But - if I may follow up on myself - I'll definitely keep my eyes more open for times when we really don't need a reducer. I can see what you are saying, and that people should think a bit more laterally and use HBase for different and potentially more efficient workflows.

-chris
>>>> 2) Use a temp table and auto increment the value in a column which contains the word count. (Fat row where rowkey is doc_id and column is word or rowkey is doc_id|word) >>>> >>>> I'm sorry but if you go through all of your examples of why you would want to use a reducer, you end up finding out that writing to an HBase table would be faster than a reduce job. >>>> (Again we haven't done an exhaustive search, but in all of the HBase jobs we've run... no reducers were necessary.) >>>> >>>> The point I'm trying to make is that you want to avoid using a reducer whenever possible and if you think about your problem... you can probably come up with a solution that avoids the reducer... >>>> >>>> >>>> HTH >>>> >>>> -Mike >>>> PS. I haven't looked at *all* of the potential use cases of HBase which is why I don't want to say you'll never need a reducer. I will say that based on what we've done at my client's site, we try very hard to avoid reducers. >>>> [Note, I'm sure I'm going to get hammered on this when I head to NY in Nov. :-) ] >>>> >>>> >>>>> Date: Fri, 16 Sep 2011 23:00:49 +0530 >>>>> Subject: Re: Writing MR-Job: Something like OracleReducer, JDBCReducer ... >>>>> From: [EMAIL PROTECTED] >>>>> To: [EMAIL PROTECTED] >>>>> >>>>> Hi Michael, >>>>> >>>>> Yes, thanks, I understand the fact that reducers can be expensive with all >>>>> the shuffling and the sorting, and you may not need them always. At the same >>>>> time, there are many cases where reducers are useful, like secondary >>>>> sorting. In many cases, one can have multiple map phases and not have a >>>>> reduce phase at all. Again, there will be many cases where one may want a >>>>> reducer, say trying to count the occurrence of words in a particular column.
-
Re: Writing MR-Job: Something like OracleReducer, JDBCReducer ...Sam Seigal 2011-09-17, 00:16
I am trying to do something similar with HBase Map/Reduce.
I have event ids and amounts stored in HBase in the following format: prefix-event_id_type-timestamp-event_id as the row key and amount as the value.

I want to be able to aggregate the amounts based on the event id type, and for this I am using a reducer. I basically reduce on the event_id_type from the incoming row in the map phase, and perform the aggregation in the reducer on the amounts for the event types. Then I write back the results into HBase.

I hadn't thought about writing values directly into a temp HBase table in the map phase, as suggested by Mike.

For this case, each mapper can declare its own mapperId_event_type row with totalAmount and, for each row it receives, do a get, add the current amount, and then a put. We are basically then doing a get/add/put for every row that a mapper receives. Is this any more efficient when compared to the overhead of sorting/partitioning for a reducer?

At the end of the mapping phase, aggregating the output of all the mappers should be trivial.
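As a rough illustration of the aggregation described in this message, the sketch below pulls event_id_type out of a row key shaped like prefix-event_id_type-timestamp-event_id and sums the amounts per type. It is plain JDK code, not Hadoop or HBase API. The class name EventTypeTotals, the assumption that no field contains a '-' itself, and the use of long amounts are all illustrative choices that the thread does not confirm.

```java
import java.util.Map;
import java.util.TreeMap;

/** Sums amounts per event type, i.e. the result the map + reduce pair computes. */
final class EventTypeTotals {

    /** Extracts event_id_type from the row key (assumes no '-' inside any field). */
    static String eventType(String rowKey) {
        String[] parts = rowKey.split("-");
        if (parts.length != 4) {
            throw new IllegalArgumentException("unexpected row key: " + rowKey);
        }
        return parts[1];
    }

    /** rows: row key -> amount. Returns event type -> total amount. */
    static Map<String, Long> totals(Map<String, Long> rows) {
        Map<String, Long> totals = new TreeMap<>();
        for (Map.Entry<String, Long> row : rows.entrySet()) {
            // Map step: derive the grouping key. Reduce step: add into the running sum.
            totals.merge(eventType(row.getKey()), row.getValue(), Long::sum);
        }
        return totals;
    }

    public static void main(String[] args) {
        Map<String, Long> rows = new TreeMap<>();
        rows.put("p-click-1316131200-e1", 10L);
        rows.put("p-view-1316131300-e3", 7L);
        System.out.println(totals(rows));
    }
}
```

With a reducer, the grouping happens in the shuffle. With the per-mapper get/add/put variant, the same sum is maintained by two table operations per input row, which is the cost the question above is weighing.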
-
Re: Writing MR-Job: Something like OracleReducer, JDBCReducer ...Doug Meil 2011-09-17, 00:22
I was in the middle of responding to Mike's email when yours arrived, so I'll respond to both.

I think the temp-table idea is interesting. The caution is that a temp table created with default settings will be hosted on a single RegionServer (RS) and thus be a bottleneck for aggregation. So I would imagine that you would need to tune the temp table for the job and pre-create regions.

Doug
-
Re: Writing MR-Job: Something like OracleReducer, JDBCReducer ...Doug Meil 2011-09-17, 00:24
I'll add this to the book in the MR section.
-
Re: Writing MR-Job: Something like OracleReducer, JDBCReducer ...Sam Seigal 2011-09-17, 01:00
I see what you are saying about the temp table being hosted on a single region server - especially for a limited set of rows that only hold the aggregations but receive a lot of traffic. I wonder if this will also be the case if I were to use the source table to maintain these temporary records, and not create a temp table on the fly...
-
Re: Writing MR-Job: Something like OracleReducer, JDBCReducer ...Doug Meil 2011-09-17, 01:14
However, if the aggregations in the mapper were kept in a HashMap (key being the aggregate, value being the count), and the mapper made a single pass over this map during the cleanup method and did the checkAndPuts then, the writes would happen only once per map task rather than on a per-row basis (which would be really expensive).

A single region on a single RS could handle that no problem.
-
Re: Writing MR-Job: Something like OracleReducer, JDBCReducer ...Sam Seigal 2011-09-17, 01:39
Aren't there memory considerations with this approach? I would assume the HashMap can get pretty big if it retains in memory every record that passes through. (Apologies if I am being ignorant, given my limited knowledge of Hadoop's internal workings.)
-
Re: Writing MR-Job: Something like OracleReducer, JDBCReducer ...Sam Seigal 2011-09-17, 01:44
If an input split is too large and memory is a concern, we can surely address this in TableInputFormat.getSplits() and limit the size.
-
Re: Writing MR-Job: Something like OracleReducer, JDBCReducer ...Doug Meil 2011-09-17, 01:47
Map-task heap size would definitely be a concern, but since the HashMap would only contain aggregations, ostensibly it would hold far fewer entries than the number of rows passed into the mapper.

At least that's how I'd use it.
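The in-mapper aggregation discussed in this part of the thread can be sketched in plain Java. This is only a minimal illustration, not the code from the HBase book: `InMapperAggregator`, its `map` and `cleanup` methods, and the `writer` callback are hypothetical stand-ins for a Hadoop mapper and for the HBase write (checkAndPut or similar) that would happen once per map task.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.BiConsumer;

/**
 * Sketch of in-mapper aggregation: keep one running total per aggregate key
 * and write the totals once, at the end of the task.
 * Plain Java only; the Hadoop mapper and the HBase write are stand-ins.
 */
class InMapperAggregator {
    // One entry per distinct aggregate key, NOT one entry per input row.
    private final Map<String, Long> totals = new HashMap<>();

    /** Plays the role of map(): called once per input row. */
    void map(String aggregateKey, long amount) {
        totals.merge(aggregateKey, amount, Long::sum);
    }

    /** Plays the role of cleanup(): one write per aggregate key, then reset. */
    void cleanup(BiConsumer<String, Long> writer) {
        totals.forEach(writer);
        totals.clear();
    }

    /** Number of distinct aggregate keys currently held in memory. */
    int size() {
        return totals.size();
    }
}
```

The heap cost grows with the number of distinct aggregate keys (event types, in Sam's case), not with the number of rows in the split, which is the point made above.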
-
Re: Writing MR-Job: Something like OracleReducer, JDBCReducer ...Michel Segel 2011-09-17, 13:12
Guys,

Ok... You're putting a lot of thought into this, which is a good thing. I really haven't looked at the bulk load, so I have some homework :-)

In response to your discussion...

1) How fast is fast enough? Sure, if you create a temp table on the fly, you could end up with a single region becoming a hot spot. Is it more than just a bottleneck, or can it hurt your RS and HBase? If it's only a bottleneck, remember that this is only a temp table. You have control over setting the max file size and pre-splitting.

2) KISS. The first step is realizing that you have a database, so why not take advantage of it? :-) Your first iteration may not be the most efficient solution, but it should be faster than using a reducer and/or combiner/reducer. Sure, there's no free lunch, but using the HBase tables should be more efficient. I'm not suggesting that this is always going to be faster or better, but for the problem sets we have worked with, it made more sense. (OK, I'm an old database guy, so my opinion is skewed...)

3) Keeping data until the end of the task may work for some jobs. In the cleanup() method you could write out the data, provided you have enough memory. I'm sure there are pros and cons to it, but it's a good design idea to think about.

It's really cool that people are now thinking about this...

Sent from a remote device. Please excuse any typos...

Mike Segel
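Pre-splitting a temp table means choosing the region boundaries up front. As a rough sketch only: the code below assumes, purely for illustration, that row keys start with a two-digit hex prefix (00 to ff), which nobody in the thread specified. `SplitKeySketch` and `splitKeys` are hypothetical names. The resulting strings would still have to be converted to bytes and supplied as split keys when the table is created.

```java
import java.util.ArrayList;
import java.util.List;

/** Computes evenly spaced split points for keys that begin with a hex prefix 00..ff (an assumption). */
class SplitKeySketch {
    /** Returns (regions - 1) split points, so that the table starts with `regions` regions. */
    static List<String> splitKeys(int regions) {
        if (regions < 1 || regions > 256) {
            throw new IllegalArgumentException("regions must be between 1 and 256");
        }
        List<String> keys = new ArrayList<>();
        for (int i = 1; i < regions; i++) {
            // Boundary i sits at i/regions of the 256 possible prefixes.
            keys.add(String.format("%02x", i * 256 / regions));
        }
        return keys;
    }
}
```

With a real key design the boundaries would have to follow the actual key distribution; evenly spaced prefixes only help if the writes are spread evenly over those prefixes.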
-
RE: Writing MR-Job: Something like OracleReducer, JDBCReducer ...Steinmaurer Thomas 2011-09-19, 05:35
Your assumption is correct. As final output, we want to have aggregated data in an Oracle database. We are using both the map and the reduce phase.

The row key looks like this:

<datasource-id>-<device-id>-<timestamp>

We basically want daily aggregated data, i.e. measured values per datasource-id/device-id. We already have a proof-of-concept implementation that does exactly that, but as final output the aggregated data is written into an HBase table again, by extending TableReducer as our reducer implementation.

See also my thread "MR-Job: Exception in DBOutputFormat".

Thanks again!

Thomas

-----Original Message-----
From: Sonal Goyal [mailto:[EMAIL PROTECTED]]
Sent: Friday, 16 September 2011 18:07
To: [EMAIL PROTECTED]
Subject: Re: Writing MR-Job: Something like OracleReducer, JDBCReducer ...

Hi Thomas,

I just assumed that you are already using reducers. From what I understood, please correct me if I am mistaken, You have data in HBase and you are running a MR job to aggregate the data. You have the map as well as reduce phase and as part of the final output, you want to send the data to Oracle. is that correct?

Is there any information you would like to share regarding your flow and data? How big is your data, how often do you need to aggregate, what do your mappers emit? Are you already using reducers for aggregations?

Best Regards,
Sonal
Crux: Reporting for HBase <https://github.com/sonalgoyal/crux>
Nube Technologies <http://www.nubetech.co>
<http://in.linkedin.com/in/sonalgoyal>

On Fri, Sep 16, 2011 at 2:35 PM, Michel Segel <[EMAIL PROTECTED]> wrote:
> I think you need to get a little bit more information.
> Reducers are expensive.
> When Thomas says that he is aggregating data, what exactly does he mean?
> When dealing w HBase, you really don't want to use a reducer.
>
> You may want to run two map jobs and it could be that just dumping the
> output via jdbc makes the most sense.
>
> We are starting to see a lot of questions where the OP isn't providing
> enough information so that the recommendation could be wrong...
>
> Sent from a remote device. Please excuse any typos...
>
> Mike Segel
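For the row key layout described in this message, the map step needs to turn each row key into a per-day grouping key. The sketch below shows one way to do that, under assumptions the thread does not confirm: the timestamp is taken to be epoch milliseconds, days are cut in UTC, and neither id contains a hyphen. `DailyKeySketch` and `dailyKey` are hypothetical names.

```java
import java.time.Instant;
import java.time.LocalDate;
import java.time.ZoneOffset;

/** Derives a daily grouping key from "<datasource-id>-<device-id>-<timestamp>". */
class DailyKeySketch {
    /** Returns "<datasource-id>-<device-id>-<yyyy-MM-dd>"; timestamp assumed to be epoch millis (UTC). */
    static String dailyKey(String rowKey) {
        String[] parts = rowKey.split("-");
        if (parts.length != 3) {
            throw new IllegalArgumentException("unexpected row key: " + rowKey);
        }
        long millis = Long.parseLong(parts[2]);
        LocalDate day = Instant.ofEpochMilli(millis).atZone(ZoneOffset.UTC).toLocalDate();
        // LocalDate.toString() yields ISO-8601, e.g. 2011-09-16.
        return parts[0] + "-" + parts[1] + "-" + day;
    }
}
```

All rows of one device that fall on the same day then share one key, so either a reducer or an in-mapper map can sum the measured values per key.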
-
RE: Writing MR-Job: Something like OracleReducer, JDBCReducer ...Steinmaurer Thomas 2011-09-19, 05:41
Hi Doug,

I looked at your example, and it looks pretty much like what we have done in our proof-of-concept implementation, which writes back to another HBase table by using a TableReducer. This works fine. We want to change that so that the final result is written to Oracle.

When doing that, we end up with the following exception in the reduce step (see also my post "MR-Job: Exception in DBOutputFormat"):

java.io.IOException
    at org.apache.hadoop.mapreduce.lib.db.DBOutputFormat.getRecordWriter(DBOutputFormat.java:180)
    at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:559)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:414)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:270)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1127)
    at org.apache.hadoop.mapred.Child.main(Child.java:264)

Your examples are very welcome, because they are based on the mapreduce package, right? Pretty much all examples out there are based on mapred, which is AFAIK the "old" way to write MR-Jobs.

Regards,
Thomas

-----Original Message-----
From: Doug Meil [mailto:[EMAIL PROTECTED]]
Sent: Friday, 16 September 2011 21:42
To: [EMAIL PROTECTED]
Subject: Re: Writing MR-Job: Something like OracleReducer, JDBCReducer ...

Chris, agreed... There are sometimes that reducers aren't required, and then situations where they are useful. We have both kinds of jobs.

For others following the thread, I updated the book recently with more MR examples (read-only, read-write, read-summary)

http://hbase.apache.org/book.html#mapreduce.example

As to the question that started this thread...

re: "Store aggregated data in Oracle. "

To me, that sounds a like the "read-summary" example with JDBC-Oracle in the reduce step.
On 9/16/11 2:58 PM, "Chris Tarnas" <[EMAIL PROTECTED]> wrote: >If only I could make NY in Nov :) > >We extract out large numbers of DNA sequence reads from HBase, run them >through M/R pipelines to analyze and aggregate and then we load the >results back in. Definitely specialized usage, but I could see other >perfectly valid uses for reducers with HBase. > >-chris > >On Sep 16, 2011, at 11:43 AM, Michael Segel wrote: > >> >> Sonal, >> >> You do realize that HBase is a "database", right? ;-) >> >> So again, why do you need a reducer? ;-) >> >> Using your example... >> "Again, there will be many cases where one may want a reducer, say >>trying to count the occurrence of words in a particular column." >> >> You can do this one of two ways... >> 1) Dynamic Counters in Hadoop. >> 2) Use a temp table and auto increment the value in a column which >>contains the word count. (Fat row where rowkey is doc_id and column >>is word or rowkey is doc_id|word) >> >> I'm sorry but if you go through all of your examples of why you would >>want to use a reducer, you end up finding out that writing to an HBase >>table would be faster than a reduce job. >> (Again we haven't done an exhaustive search, but in all of the HBase >>jobs we've run... no reducers were necessary.) >> >> The point I'm trying to make is that you want to avoid using a >>reducer whenever possible and if you think about your problem... you >>can probably come up with a solution that avoids the reducer... >> >> >> HTH >> >> -Mike >> PS. I haven't looked at *all* of the potential use cases of HBase >>which is why I don't want to say you'll never need a reducer. I will >>say that based on what we've done at my client's site, we try very >>hard to avoid reducers. >> [Note, I'm sure I'm going to get hammered on this when I head to NY in >>Nov. :-) ] >> >> >>> Date: Fri, 16 Sep 2011 23:00:49 +0530 >>> Subject: Re: Writing MR-Job: Something like OracleReducer, >>>JDBCReducer ... >>> From: [EMAIL PROTECTED]
-
Re: Writing MR-Job: Something like OracleReducer, JDBCReducer ...Doug Meil 2011-09-19, 13:35
Those were all from 'mapreduce', not 'mapred' packages.

This seems like it's an issue with DBOutputFormat...

org.apache.hadoop.mapreduce.lib.db.DBOutputFormat.getRecordWriter(DBOutputFormat.java:180)
-
RE: Writing MR-Job: Something like OracleReducer, JDBCReducer ...Steinmaurer Thomas 2011-09-19, 13:44
Hi Doug,

I know. The re-raised generic IOException is a bit unfortunate, because the cause could be that the JDBC driver class can't be found or that preparing the statement failed.

I took pretty much the same code as in DBOutputFormat.getRecordWriter and tried it in my implemented ToolRunner.run method. Loading the JDBC driver class and preparing the statement generated from the table and field names set by DBOutputFormat.setOutput(...) worked fine there, so I guess the IOException isn't caused by a missing JDBC library etc.

Any further ideas? Btw: I'm using the Cloudera distribution available as VMware.

Thanks!

Thomas
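Why a re-raised generic IOException hides the real problem can be shown in a few lines of plain Java. This is a sketch under an assumption: it supposes the wrapping code copies only the message of the original exception, which fits the bare "java.io.IOException" in the stack trace but is not confirmed against the Hadoop source here. `WrapSketch`, `lossy`, and `chained` are hypothetical names.

```java
import java.io.IOException;

/** Contrasts two ways of turning an arbitrary exception into an IOException. */
class WrapSketch {
    /** Lossy: only the message text survives; the type and stack trace of the cause are gone. */
    static IOException lossy(Exception cause) {
        return new IOException(cause.getMessage());
    }

    /** Chained: the original exception stays reachable via getCause(). */
    static IOException chained(Exception cause) {
        return new IOException(cause.getClass().getName() + ": " + cause.getMessage(), cause);
    }
}
```

If the original exception has no message of its own, the lossy variant yields an IOException with neither a message nor a cause. One way to see the real failure is the approach Thomas describes, running the same JDBC steps outside the output format; doing so inside a reduce task rather than in the driver would exercise the task-side classpath and configuration as well.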