HBase >> mail # user >> running MR job and puts on the same table


Re: running MR job and puts on the same table
Hi Rohit,

It will always be consistent. I don't see why there would be any
inconsistency with the scenario you described below.

JM
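As an aside, the buffered-put pattern JM recommends in the quoted thread below can be sketched generically. This is a minimal sketch, not HBase API: `BatchBuffer` and its flush callback are illustrative names; a real mapper would pass something like `table::put` as the flusher and call `flush()` from `cleanup()`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Accumulates items and flushes them in fixed-size batches, pushing
// any remainder via an explicit final flush() -- the pattern JM describes.
class BatchBuffer<T> {
    private final int batchSize;
    private final Consumer<List<T>> flusher;   // e.g. a List<Put> batch put in HBase
    private final List<T> pending = new ArrayList<>();

    BatchBuffer(int batchSize, Consumer<List<T>> flusher) {
        this.batchSize = batchSize;
        this.flusher = flusher;
    }

    // Call from map(); flushes automatically when the buffer fills.
    void add(T item) {
        pending.add(item);
        if (pending.size() >= batchSize) {
            flush();
        }
    }

    // Call from cleanup() so the last partial batch is not lost.
    void flush() {
        if (!pending.isEmpty()) {
            flusher.accept(new ArrayList<>(pending));
            pending.clear();
        }
    }
}
```

The explicit final `flush()` is what "push the remaining at the end of the job" refers to: without it, up to batchSize-1 rows would be processed but never marked.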

2013/6/22 Rohit Kelkar <[EMAIL PROTECTED]>:
> Thanks JM, I am not so concerned about holding those rows in memory because
> they are mostly ordered integers and I would be using a bitset. So I have
> some leeway in that sense. My dilemma was
> 1. updating instantly within the map
> 2. bulk updating at the end of the map
> Yes I do understand the drawback with 2 if map crashes. I am ready to incur
> that penalty if that avoids any inconsistent behaviour on hbase.
>
> - R
>
>
> On Sat, Jun 22, 2013 at 12:16 PM, Jean-Marc Spaggiari <
> [EMAIL PROTECTED]> wrote:
>
>> Hi Rohit,
>>
>> The list is a bad idea. When you have millions of rows per region,
>> are you going to put millions of them in memory in your list?
>>
>> Your MR job will scan the entire table, row by row. If you modify the
>> current row, the scanner will not look at it again when it searches
>> for the next one. So there is no real issue with that.
>>
>> Also, instead of doing puts one by one, I would recommend you buffer
>> them (say, 100 at a time) and put them as a batch. Don't forget to
>> push the remaining puts at the end of the job. The drawback is that if
>> the MR job crashes, you will have some rows already processed but not
>> marked as processed...
>>
>> JM
>>
>> 2013/6/22 Rohit Kelkar <[EMAIL PROTECTED]>:
>> > I have a use case where I push data into my HTable in waves, followed
>> > by Mapper-only processing. Currently, once a row is processed in map I
>> > immediately mark it as processed=true. For this, inside the map I
>> > execute a table.put(isprocessed=true). I am not sure if modifying the
>> > table like this is a good idea. I am also concerned that I am modifying
>> > the same table that I am running the MR job on.
>> > So I am thinking of another approach where I accumulate the processed
>> > rows in a list (or a better, more compact data structure) and use the
>> > cleanup method of the MR job to execute all the
>> > table.put(isprocessed=true) puts at once.
>> > What is the suggested best practice?
>> >
>> > - R
>>
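For reference, the bitset Rohit mentions as a compact alternative to a list could look like the sketch below, assuming row keys are (mostly ordered) small integers. `ProcessedTracker` is an illustrative name, not part of HBase; it spends one bit per row id rather than one object per row.

```java
import java.util.BitSet;

// Tracks which integer row ids have been processed, using one bit per id.
// A java.util.BitSet grows as needed and is far more compact than a
// List<Integer> when ids are dense and mostly ordered.
class ProcessedTracker {
    private final BitSet processed = new BitSet();

    // Record that this row has been processed (e.g. called from map()).
    void markProcessed(int rowId) {
        processed.set(rowId);
    }

    boolean isProcessed(int rowId) {
        return processed.get(rowId);
    }

    // Number of rows marked so far, e.g. to size the final batch of puts.
    int processedCount() {
        return processed.cardinality();
    }
}
```

In the cleanup-time approach from the thread, the tracker's set bits would drive the final batch of `table.put(isprocessed=true)` calls.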