HBase >> mail # user >> running MR job and puts on the same table


Rohit Kelkar 2013-06-22, 16:42
Jean-Marc Spaggiari 2013-06-22, 17:16
Re: running MR job and puts on the same table
Thanks JM, I am not so concerned about holding those rows in memory because
they are mostly ordered integers and I would be using a bitset. So I have
some leeway in that sense. My dilemma was between:
1. updating instantly within the map
2. bulk updating at the end of the map
Yes, I do understand the drawback of 2 if the map crashes. I am ready to
incur that penalty if it avoids any inconsistent behaviour on HBase.

- R
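The bitset point above can be sketched with `java.util.BitSet`: for mostly ordered integer row keys, tracking processed rows costs roughly one bit per key, so holding millions of them in memory until the end of the map is cheap. The class and method names below are illustrative, not part of the HBase API.

```java
import java.util.BitSet;

// Sketch, assuming row keys are (mostly ordered) non-negative integers:
// a BitSet tracks which rows were processed at ~1 bit per row id, far
// smaller than a List<Long> of the same keys.
public class ProcessedTracker {
    private final BitSet processed = new BitSet();

    // Record that a row was processed (called once per row in map()).
    void markProcessed(int rowId) {
        processed.set(rowId);
    }

    // Check membership, e.g. when building the batch of puts in cleanup().
    boolean isProcessed(int rowId) {
        return processed.get(rowId);
    }

    // Rough in-memory footprint in bytes: allocated bits / 8.
    long approxBytes() {
        return processed.size() / 8;
    }
}
```

A million row ids fits in well under a megabyte this way, which is the "leeway" the bitset buys over a plain list.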
On Sat, Jun 22, 2013 at 12:16 PM, Jean-Marc Spaggiari <
[EMAIL PROTECTED]> wrote:

> Hi Rohit,
>
> The list is a bad idea. When you have millions of lines per
> region, are you going to put millions of them in memory in your list?
>
> Your MR will scan the entire table, row by row. If you modify the
> current row, the scanner will not look at it again when it searches
> for the next one. So there is no real issue with that.
>
> Also, instead of doing puts one by one, I would recommend you buffer
> them (say, 100 at a time) and put them as a batch. Don't forget to
> push the remaining ones at the end of the job. The drawback is that
> if the MR job crashes, you will have some rows already processed but
> not marked as processed...
>
> JM
>
> 2013/6/22 Rohit Kelkar <[EMAIL PROTECTED]>:
> > I have a usecase where I push data into my HTable in waves, followed
> > by Mapper-only processing. Currently, once a row is processed in map
> > I immediately mark it as processed=true. For this, inside the map I
> > execute a table.put(isprocessed=true). I am not sure if modifying the
> > table like this is a good idea. I am also concerned that I am
> > modifying the same table that I am running the MR job on.
> > So I am thinking of another approach where I accumulate the processed
> > rows in a list (or a better compact data structure) and use the
> > cleanup method of the MR job to execute all the
> > table.put(isprocessed=true) at once.
> > What is the suggested best practice?
> >
> > - R
>
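JM's buffering advice maps onto the mapper lifecycle: add each put to a buffer inside map(), flush as a batch every 100, and push the remainder from cleanup(). The sketch below mocks the actual flush (in real code it would be HTable's list-based put) so the batching logic stands alone; `BufferedMarker` and `FLUSH_SIZE` are illustrative names, not HBase API.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the batching pattern: buffer updates and flush 100 at a
// time, with a final flush for the remainder at the end of the job.
public class BufferedMarker {
    static final int FLUSH_SIZE = 100;

    private final List<Long> buffer = new ArrayList<>();
    private int batchesFlushed = 0;
    private int rowsFlushed = 0;

    // Called once per row inside map(): the equivalent of building a
    // Put with isprocessed=true and adding it to the buffer.
    void bufferedMark(long rowKey) {
        buffer.add(rowKey);
        if (buffer.size() >= FLUSH_SIZE) {
            flushBuffer();
        }
    }

    // Called from cleanup(): push the remaining buffered updates.
    void close() {
        if (!buffer.isEmpty()) {
            flushBuffer();
        }
    }

    // Stand-in for the real batch write (table.put(List<Put>)).
    private void flushBuffer() {
        batchesFlushed++;
        rowsFlushed += buffer.size();
        buffer.clear();
    }

    int batchesFlushed() { return batchesFlushed; }
    int rowsFlushed() { return rowsFlushed; }
}
```

With 250 rows this flushes two full batches from map() and the trailing 50 from close(), which is exactly the "push the remaining at the end of the job" step; a crash before close() is the partial-progress drawback JM describes.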
Jean-Marc Spaggiari 2013-06-22, 19:40