HBase >> mail # user >> Question about MapReduce


Re: Question about MapReduce
I'm replying to myself ;)

I found the "cleanup" and "setup" methods in the TableMapper class. So
I think those are the methods I was looking for. I will init the
HTablePool there. Please let me know if I'm wrong.
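For what it's worth, here is a minimal sketch of what I have in mind. This assumes the 0.92/0.94-era HTablePool API; the class name, pool size, and the "target" table name are placeholders I made up, and the sketch obviously needs a running cluster:

```java
import java.io.IOException;

import org.apache.hadoop.hbase.client.HTableInterface;
import org.apache.hadoop.hbase.client.HTablePool;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapper;

public class MoveRowsMapper extends TableMapper<ImmutableBytesWritable, Result> {

  private HTablePool pool;
  private HTableInterface targetTable;

  @Override
  protected void setup(Context context) throws IOException {
    // Called once per map task, before the first call to map().
    pool = new HTablePool(context.getConfiguration(), 10);
    targetTable = pool.getTable("target"); // placeholder table name
  }

  @Override
  public void map(ImmutableBytesWritable row, Result value, Context context)
      throws IOException, InterruptedException {
    // per-row work goes here
  }

  @Override
  protected void cleanup(Context context) throws IOException {
    // Called once per map task, after the last call to map().
    targetTable.close(); // returns the table to the pool
    pool.close();
  }
}
```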

Now, I still have few other questions.

1) context.getCurrentValue() can throw an InterruptedException, but
when can this occur? Is there a timeout on the Mapper side? Or is it
when the region is going down while the job is running?
2) How can I pass parameters to the map method? Can I use
job.getConfiguration().set() to add some properties there, and get
them back with context.getConfiguration().get()?
3) What's the best way to log results/exceptions/traces from the map method?
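For question 2), this is the pattern I am planning to try, if I understand the API correctly. The property key "move.min.timestamp" and the value are made up for illustration; this is a fragment, not a complete driver:

```java
// Driver side: stash a property in the job configuration before submit().
// "move.min.timestamp" is an arbitrary key I picked for this example.
Job job = new Job(conf, "moveRows");
job.getConfiguration().set("move.min.timestamp", "1351000000000");

// Mapper side: read it back, e.g. in setup() or map().
long minTs = Long.parseLong(
    context.getConfiguration().get("move.min.timestamp"));
```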

I will search on my side, but some help will be welcome because it
seems there is not much documentation once you start to dig a bit :(

JM

2012/10/27, Jean-Marc Spaggiari <[EMAIL PROTECTED]>:
> Hi,
>
> I'm thinking about my first MapReduce class and I have some questions.
>
> The goal of it will be to move some rows from one table to another one
> based on the timestamp only.
>
> Since this is pretty new for me, I'm starting from the RowCounter
> class to have a baseline.
>
> There are a few things I will have to update. First, the
> createSubmittableJob method to get a timestamp range instead of a key
> range, and "play" with the parameters. This part is fine.
>
> Next, I need to update the map method, and this is where I have some
> questions.
>
> I'm able to find the timestamp of all the cf:c from the
> context.getCurrentValue() method, that's fine. Now, my concern is on
> the way to get access to the table to store this field, and the table
> to delete it from. Should I instantiate an HTable for the source table,
> execute a delete on it, then do an insert on another HTable
> instance? Should I use an HTablePool? Also, since I'm already on the
> row, can't I just mark it as deleted instead of calling a new HTable?
>
> Also, instead of calling the delete and put one by one, I would like
> to put them in a list and execute it only when it has over 10 members.
> How can I make sure that at the end of the job, this is flushed?
> Otherwise I will lose some operations. Is there a kind of "dispose"
> method called on the region when the job is done?
>
> Thanks,
>
> JM
>
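Regarding the batching question in the mail quoted above: the flush-on-threshold pattern itself can be sketched in plain Java, independent of the HBase API. In the real mapper, flush() would call HTable.delete(List) / HTable.put(List), and cleanup() would do one final flush() so the last partial batch is not lost. All names here are my own:

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Buffers pending operations and flushes them every BATCH_SIZE entries.
 * In a mapper, add() would be called from map(), and flush() once more
 * from cleanup() to drain the last partial batch. (Sketch only.)
 */
public class BatchBuffer {
  static final int BATCH_SIZE = 10;

  private final List<String> pending = new ArrayList<String>();
  private int flushes = 0; // number of batches actually written out

  public void add(String op) {
    pending.add(op);
    if (pending.size() >= BATCH_SIZE) {
      flush();
    }
  }

  public void flush() {
    if (pending.isEmpty()) {
      return; // nothing buffered, e.g. cleanup() right after a full flush
    }
    // Real code would do: table.delete(pendingDeletes); table.put(pendingPuts);
    flushes++;
    pending.clear();
  }

  public int getFlushes() {
    return flushes;
  }
}
```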