Re: running MR job and puts on the same table
Jean-Marc Spaggiari 2013-06-22, 19:40
It will always be consistent. I don't see why there would be any
inconsistency with the scenario you described below.
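
For illustration, here is a minimal sketch of the buffering idea from my
earlier mail quoted below: collect the processed rows and push them in
batches, then flush whatever is left at the end. It is plain Java and does
not touch HBase. The names BatchedMarker, mark, flush and the sink callback
are hypothetical. In a real mapper the sink would presumably wrap a batch
call such as table.put(List<Put>), and flush() would be called from the
mapper's cleanup() method.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Hypothetical helper: buffers row keys and hands them to a sink in fixed-size batches. */
class BatchedMarker {
    private final int batchSize;
    // Stand-in for the real write, e.g. a wrapper around table.put(List<Put>) (assumption).
    private final Consumer<List<String>> sink;
    private final List<String> buffer = new ArrayList<>();

    BatchedMarker(int batchSize, Consumer<List<String>> sink) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        this.batchSize = batchSize;
        this.sink = sink;
    }

    /** Call once per processed row, e.g. from map(). Flushes when the buffer is full. */
    void mark(String rowKey) {
        buffer.add(rowKey);
        if (buffer.size() >= batchSize) {
            flush();
        }
    }

    /** Call at the end, e.g. from cleanup(), to push the remaining rows. */
    void flush() {
        if (buffer.isEmpty()) {
            return;
        }
        sink.accept(new ArrayList<>(buffer)); // copy so the sink owns its batch
        buffer.clear();
    }

    public static void main(String[] args) {
        List<Integer> batchSizes = new ArrayList<>();
        BatchedMarker marker = new BatchedMarker(100, batch -> batchSizes.add(batch.size()));
        for (int i = 0; i < 250; i++) {
            marker.mark("row-" + i);
        }
        marker.flush(); // without this, the last 50 rows would never be written
        System.out.println(batchSizes); // prints [100, 100, 50]
    }
}
```

Memory stays bounded by the batch size, and if the task crashes at most one
unflushed batch of rows is processed but not yet marked.
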
2013/6/22 Rohit Kelkar <[EMAIL PROTECTED]>:
> Thanks JM, I am not so concerned about holding those rows in memory because
> they are mostly ordered integers and I would be using a bitset. So I have
> some leeway in that sense. My dilemma was
> 1. updating instantly within the map
> 2. bulk updating at the end of the map
> Yes, I do understand the drawback with 2 if the map crashes. I am ready to incur
> that penalty if that avoids any inconsistent behaviour on HBase.
> - R
> On Sat, Jun 22, 2013 at 12:16 PM, Jean-Marc Spaggiari <
> [EMAIL PROTECTED]> wrote:
>> Hi Rohit,
>> The list is a bad idea. When you have millions of rows per
>> region, are you going to put millions of them in memory in your list?
>> Your MR will scan the entire table, row by row. If you modify the
>> current row, the scanner will not look at it again when it moves on
>> to the next one. So there is no real issue with that.
>> Also, instead of doing puts one by one, I would recommend buffering
>> them (let's say, 100 at a time) and sending them as a batch. Don't forget to
>> push the remaining ones at the end of the job. The drawback is that if the
>> MR crashes you will have some rows already processed and not marked as
>> processed.
>> 2013/6/22 Rohit Kelkar <[EMAIL PROTECTED]>:
>> > I have a use case where I push data in my HTable in waves followed by
>> > Mapper-only processing. Currently once a row is processed in map I
>> > immediately mark it as processed=true. For this inside the map I execute
>> > table.put(isprocessed=true). I am not sure if modifying the table like
>> > this is a good idea. I am also concerned that I am modifying the same table
>> > I am running the MR job on.
>> > So I am thinking of another approach where I accumulate the processed
>> > rows in a list (or a more compact data structure) and use the cleanup method
>> > of the MR job to execute all the table.put(isprocessed=true) at once.
>> > What is the suggested best practice?
>> > - R