Question about MapReduce
Jean-Marc Spaggiari 2012-10-27, 20:30
Hi,
I'm thinking about my first MapReduce class and I have some questions.
The goal of it will be to move some rows from one table to another one based on the timestamp only.
Since this is pretty new for me, I'm starting from the RowCounter class to have a baseline.
There are a few things I will have to update. First, the createSubmittableJob method, to take a timestamp range instead of a key range, and "play" with the parameters. This part is fine.
Next, I need to update the map method, and this is where I have some questions.
I'm able to find the timestamp of all the cf:c from the context.getCurrentValue() method, that's fine. Now, my concern is how to get access to the table to store this field in, and to the table to delete it from. Should I instantiate an HTable for the source table and execute a delete on it, then do an insert on another HTable instance? Should I use an HTablePool? Also, since I’m already on the row, can’t I just mark it as deleted instead of calling a new HTable?
Also, instead of calling the delete and put one by one, I would like to put them on a list and execute it only when it’s over 10 members. How can I make sure that at the end of the job, this is flushed? Else, I will lose some operations. Is there a kind of “dispose” method called on the region when the job is done?
Thanks,
JM
Re: Question about MapReduce
Jean-Marc Spaggiari 2012-10-29, 15:11
I'm replying to myself ;)
I found "cleanup" and "setup" methods from the TableMapper class. So I think those are the methods I was looking for. I will init the HTablePool there. Please let me know if I'm wrong.
Now, I still have a few other questions.
1) context.getCurrentValue() can throw an InterruptedException, but when can this occur? Is there a timeout on the Mapper side? Or is it if the region is going down while the job is running?
2) How can I pass parameters to the map method? Can I use job.getConfiguration().put to add some properties there, and get them back with context.getConfiguration().get?
3) What's the best way to log results/exceptions/traces from the map method?
I will search on my side, but some help will be welcome because it seems there is not much documentation when we start to dig a bit :(
JM
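A minimal sketch of the buffering idea raised in this thread: collect operations in a list, send them out once the list reaches 10 members, and flush the remainder from the mapper's cleanup hook so nothing is lost at the end of the task. It is plain Java with no Hadoop or HBase dependency. The class name `BatchBuffer` and the `sink` callback are hypothetical; in a real mapper the sink would presumably be whatever applies the puts or deletes to the table, `add` would be called from map(), and `close` from cleanup().

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Hypothetical helper: buffers operations and hands them to a sink in batches. */
public class BatchBuffer<T> implements AutoCloseable {
    private final int limit;
    private final Consumer<List<T>> sink; // stand-in for the code that writes to the table
    private final List<T> pending = new ArrayList<>();

    public BatchBuffer(int limit, Consumer<List<T>> sink) {
        if (limit < 1) throw new IllegalArgumentException("limit must be >= 1");
        this.limit = limit;
        this.sink = sink;
    }

    /** Would be called once per row from map(); flushes when the buffer is full. */
    public void add(T op) {
        pending.add(op);
        if (pending.size() >= limit) flush();
    }

    /** Sends whatever is pending to the sink and empties the buffer. */
    public void flush() {
        if (pending.isEmpty()) return;
        sink.accept(new ArrayList<>(pending));
        pending.clear();
    }

    /** Would be called from cleanup(); flushes the last partial batch. */
    @Override
    public void close() {
        flush();
    }

    public static void main(String[] args) {
        List<Integer> batchSizes = new ArrayList<>();
        try (BatchBuffer<String> buffer =
                 new BatchBuffer<>(10, batch -> batchSizes.add(batch.size()))) {
            for (int i = 0; i < 25; i++) buffer.add("row-" + i);
        }
        System.out.println(batchSizes); // prints [10, 10, 5]
    }
}
```

Without the final flush in `close`, the last 5 operations in this run would never reach the sink, which is the loss the original question worries about.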
Re: Question about MapReduce
Shrijeet Paliwal 2012-10-29, 17:03
In line.
On Mon, Oct 29, 2012 at 8:11 AM, Jean-Marc Spaggiari < [EMAIL PROTECTED]> wrote:
> 1) context.getCurrentValue() can throw an InterruptedException, but when can this occur? Is there a timeout on the Mapper side? Or is it if the region is going down while the job is running?

You do not need to call context.getCurrentValue(). The 'value' argument to the map method[1] has the information you are looking for.

> 2) How can I pass parameters to the map method? Can I use job.getConfiguration().put to add some properties there, and get them back with context.getConfiguration().get?

Yes, that's how it is done.

> 3) What's the best way to log results/exceptions/traces from the map method?
In most cases, you'll have the mapper and reducer classes as nested static classes within some enclosing class. You can get a handle to the Logger from the enclosing class and do your usual LOG.info, LOG.warn, yada yada.
Hope it helps.
[1] map(KEYIN key, *VALUEIN value*, Context context)
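A minimal sketch of the two answers above, in plain Java with no Hadoop dependency: a parameter is set on a configuration object by the driver and read back once in the mapper's setup step, and the nested static mapper class logs through a logger owned by the enclosing class. Several things here are assumptions made for illustration: `java.util.Properties` stands in for Hadoop's `Configuration` (whose accessors are `set` and `get` rather than `put`), `java.util.logging` stands in for whatever logging library the job uses, and the names `MoveRowsJob`, `MoveMapper`, and `moverows.max.timestamp` are hypothetical.

```java
import java.util.Properties;
import java.util.logging.Logger;

/** Hypothetical job class; Properties stands in for Hadoop's Configuration. */
public class MoveRowsJob {
    // One logger on the enclosing class, shared by the nested classes.
    private static final Logger LOG = Logger.getLogger(MoveRowsJob.class.getName());

    // Hypothetical property name used to hand a parameter to the mapper.
    static final String MAX_TS_KEY = "moverows.max.timestamp";

    /** Stand-in for a nested static mapper class. */
    static class MoveMapper {
        private long maxTimestamp;

        /** Mirrors setup(Context): read the parameter once per task. */
        void setup(Properties conf) {
            maxTimestamp = Long.parseLong(conf.getProperty(MAX_TS_KEY, "0"));
            LOG.info("maxTimestamp=" + maxTimestamp);
        }

        /** Mirrors the per-row decision in map(): true when the cell is old enough to move. */
        boolean shouldMove(long cellTimestamp) {
            boolean move = cellTimestamp <= maxTimestamp;
            if (!move) LOG.fine("skipping cell with timestamp " + cellTimestamp);
            return move;
        }
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        conf.setProperty(MAX_TS_KEY, "1351382400000"); // driver side: set before submitting
        MoveMapper mapper = new MoveMapper();
        mapper.setup(conf);                             // task side: read it back
        System.out.println(mapper.shouldMove(1351000000000L)); // prints true
        System.out.println(mapper.shouldMove(1352000000000L)); // prints false
    }
}
```

Reading the value in the setup step rather than on every row keeps the parsing cost out of the per-row path.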
Re: Question about MapReduce
Jean-Marc Spaggiari 2012-11-02, 18:56
Hi Shrijeet,
Helped a lot! Thanks!
Now, the only thing I need is to know the best place to put my JAR on the server. Should I put it in the Hadoop lib directory? Or somewhere in the HBase structure?
Thanks,
JM
Re: Question about MapReduce
Shrijeet Paliwal 2012-11-02, 19:06
JM,
I personally would choose to put it in neither the Hadoop libs nor the HBase libs. Have it go to your application's own install directory.
Then you could set the HADOOP_CLASSPATH variable to include your jar (also include the HBase jars, the HBase dependencies, and any dependencies your program needs), and fire the 'hadoop jar' command to execute it.
An example[1]:
Set the classpath:
    export HADOOP_CLASSPATH=`hbase classpath`:mycool.jar:mycooldependency.jar
Fire the following to launch your job:
    hadoop jar mycool.jar hbase.experiments.MyCoolProgram -Dmapred.running.map.limit=50 -Dmapred.map.tasks.speculative.execution=false aCommandLineArg
Did I get your question right?
[1] In the example I gave, `hbase classpath` gets you set with all the HBase jars.
Re: Question about MapReduce
Jean-Marc Spaggiari 2012-11-02, 19:31
Yep, you perfectly got my question.
I just tried and it's working perfectly!
Thanks a lot! I now have a lot to play with.
JM
Re: Question about MapReduce
Jean-Marc Spaggiari 2012-11-02, 19:47
Sorry, one last question.
In the map method, I have access to the row through the value parameter. Now, based on the value content, I might want to delete it. Do I have access to the table directly from one of the parameters? Or should I call the delete using an HTableInterface from my pool?
Thanks,
JM
Re: Question about MapReduce
Shrijeet Paliwal 2012-11-02, 19:51
Not sure what exactly is happening in your job. But in one of the delete jobs I wrote, I was creating an instance of HTable in the setup method of my mapper:
delTab = new HTable(conf, conf.get(TABLE_NAME));
And performing the delete in the map() call using delTab. So no, you do not have access to the table directly *usually*.
-Shrijeet
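A minimal sketch of the lifecycle described in the reply above: the table handle is obtained once in the setup step, reused for every row in the map step, and released once in the cleanup step. It is plain Java with no HBase dependency. The `RowTable` interface and `RecordingTable` fake are hypothetical stand-ins for the HTable instance in the post, and the timestamp cutoff rule is an assumption chosen to match the thread's goal of acting on rows by timestamp.

```java
import java.util.ArrayList;
import java.util.List;

public class DeleteLifecycleSketch {
    /** Hypothetical stand-in for the table handle (HTable in the post above). */
    interface RowTable extends AutoCloseable {
        void delete(String rowKey);
        @Override void close();
    }

    /** In-memory fake that records what it was asked to do. */
    static class RecordingTable implements RowTable {
        final List<String> deleted = new ArrayList<>();
        boolean closed = false;
        @Override public void delete(String rowKey) { deleted.add(rowKey); }
        @Override public void close() { closed = true; }
    }

    /** Stand-in for the mapper: one table handle per task, not one per row. */
    static class DeleteMapper {
        private final long cutoff;
        private RowTable table;

        DeleteMapper(long cutoff) { this.cutoff = cutoff; }

        /** Mirrors setup(): keep the handle that was opened once for the task. */
        void setup(RowTable opened) { this.table = opened; }

        /** Mirrors map(): delete the row only when its timestamp is before the cutoff. */
        void map(String rowKey, long timestamp) {
            if (timestamp < cutoff) table.delete(rowKey);
        }

        /** Mirrors cleanup(): release the handle once at the end of the task. */
        void cleanup() { table.close(); }
    }

    public static void main(String[] args) {
        RecordingTable table = new RecordingTable();
        DeleteMapper mapper = new DeleteMapper(100L);
        mapper.setup(table);
        mapper.map("row-a", 50L);
        mapper.map("row-b", 150L);
        mapper.map("row-c", 99L);
        mapper.cleanup();
        System.out.println(table.deleted + " closed=" + table.closed);
        // prints [row-a, row-c] closed=true
    }
}
```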
Re: Question about MapReduce
Jean-Marc Spaggiari 2012-11-02, 20:01
That was my initial plan too, but I was wondering if there was any other best practice about the delete. So I will go that way.
Thanks,
JM
Re: Question about MapReduce
Jean-Marc Spaggiari 2012-11-14, 02:41
One more question about MapReduce.
One of my servers is slower than the others. I don't have any time constraint for the job to finish.
But I'm getting this message:
"Task attempt_201211122318_0014_m_000021_0 failed to report status for 601 seconds. Killing!"
Where can I change this timeout to something like 1800 seconds? Is it in the mapred-site.xml file? If so, which property should I insert?
Thanks,
JM