Thanks JM, I am not so concerned about holding those rows in memory because
they are mostly ordered integers and I would be using a bitset. So I have
some leeway in that sense. My dilemma was between:
1. updating instantly within the map
2. bulk updating at the end of the map
Yes, I do understand the drawback with 2 if the map crashes. I am ready to
incur that penalty if it avoids any inconsistent behaviour on HBase.
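For reference, holding mostly-ordered integer row ids in a `java.util.BitSet` is indeed compact (roughly one bit per id in the occupied range). A minimal sketch of that bookkeeping — the class and variable names here are illustrative, not from the thread:

```java
import java.util.BitSet;

public class ProcessedRows {
    public static void main(String[] args) {
        // One bit per integer row id; mostly-ordered, dense ids
        // keep the underlying long[] small.
        BitSet processed = new BitSet();

        processed.set(42);        // mark row id 42 as processed
        processed.set(100, 200);  // mark a contiguous run of ids [100, 200)

        System.out.println(processed.get(42));       // true
        System.out.println(processed.get(41));       // false
        System.out.println(processed.cardinality()); // 101 bits set
    }
}
```

At the end of the job the set bits can be streamed out to build the bulk update.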
On Sat, Jun 22, 2013 at 12:16 PM, Jean-Marc Spaggiari <
[EMAIL PROTECTED]> wrote:
> Hi Rohit,
> The list is a bad idea. When you have millions of lines per
> region, are you going to put millions of them in memory in your list?
> Your MR will scan the entire table, row by row. If you modify the
> current row, when the scanner searches for the next one, it will
> not look at the current one again. So there is no real issue with that.
> Also, instead of doing puts one by one, I would recommend you buffer
> them (let's say, 100 by 100) and put them as a batch. Don't forget to
> push the remaining at the end of the job. The drawback is that if the
> MR crashes you will have some rows already processed and not marked as
> processed.
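The buffering Jean-Marc describes can be sketched in plain Java as below. This is a stand-in for the real thing: in an actual mapper, `add()` would take an HBase `Put` and `flush()` would call `table.put(List<Put>)`; the counter here only exists so the sketch is runnable on its own.

```java
import java.util.ArrayList;
import java.util.List;

public class PutBuffer {
    private static final int BATCH_SIZE = 100;
    private final List<String> buffer = new ArrayList<>();
    int flushes = 0; // stand-in for observing batch writes

    // In a real mapper, this would accept a Put built in map().
    void add(String rowKey) {
        buffer.add(rowKey);
        if (buffer.size() >= BATCH_SIZE) {
            flush();
        }
    }

    // Call this from the mapper's cleanup() so the tail batch
    // ("the remaining at the end of the job") is not lost.
    void flush() {
        if (buffer.isEmpty()) return;
        // Real code: table.put(new ArrayList<>(buffer));
        flushes++;
        buffer.clear();
    }

    public static void main(String[] args) {
        PutBuffer pb = new PutBuffer();
        for (int i = 0; i < 250; i++) pb.add("row-" + i);
        pb.flush(); // flush the remaining 50 rows
        System.out.println(pb.flushes); // 3 batches total
    }
}
```

The key point is the final `flush()` in `cleanup()`: without it, up to 99 buffered updates silently disappear when the mapper finishes.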
> 2013/6/22 Rohit Kelkar <[EMAIL PROTECTED]>:
> > I have a usecase where I push data into my HTable in waves, followed by
> > Mapper-only processing. Currently, once a row is processed in map() I
> > immediately mark it as processed=true. For this, inside the map I execute
> > table.put(isprocessed=true). I am not sure if modifying the table like
> > this is a good idea. I am also concerned that I am modifying the same table
> > I am running the MR job on.
> > So I am thinking of another approach where I accumulate the processed rows
> > in a list (or a better compact data structure) and use the cleanup method
> > of the MR job to execute all the table.put(isprocessed=true) calls at once.
> > What is the suggested best practice?
> > - R