HBase >> mail # user >> Remove the row in MR job?


Re: Remove the row in MR job?

Just throwing an idea out there, but if you rotate tables you could
probably do what you want.

1) Table1 is being written throughout the day
2) It's time to kick off the MR job, but before the job is submitted
Table2 is now configured to be the 'write' table
3) MR job processes all the data in Table1.  Table1 is dropped/truncated
when finished.
4) Table2 continues to get writes
5) Now it's time to run the MR job again, Table1 is now configured to be
the 'write' table and Table2 is processed by the MR job.
6) Continue rotating between the tables

Something like this is probably going to be a lot easier to manage than
doing deletes of what you've read.
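
The rotation in steps 1-6 can be sketched roughly as below. This is only
an illustration of the bookkeeping: the class and method names are
hypothetical, it does not touch HBase at all, and how the writer process
learns about the switch is left open because the thread does not say.

```java
import java.util.concurrent.atomic.AtomicInteger;

/**
 * Sketch of the two-table rotation. TableRotation, writeTable() and
 * rotate() are made-up names for illustration; only "Table1" and
 * "Table2" come from the thread.
 */
final class TableRotation {
    private final String[] tables;
    // Index of the table that currently receives writes.
    private final AtomicInteger writeIndex = new AtomicInteger(0);

    TableRotation(String first, String second) {
        this.tables = new String[] {first, second};
    }

    /** Table that writers should use right now. */
    String writeTable() {
        return tables[writeIndex.get()];
    }

    /**
     * Flip the write table and return the table that was being written
     * until now. That one is handed to the MR job and dropped/truncated
     * once the job has finished.
     */
    String rotate() {
        int previous = writeIndex.getAndUpdate(i -> 1 - i);
        return tables[previous];
    }
}
```

Calling rotate() just before submitting the job returns the table to
process, and writeTable() then names the other one. Writers have to pick
up the new write table before the job starts scanning, otherwise late
writes to the old table could be lost when it is truncated.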

On 10/12/12 3:47 PM, "Jean-Marc Spaggiari" <[EMAIL PROTECTED]> wrote:

>Hi Doug,
>
>Thanks for the suggestion. I like the idea of simply deleting the
>table, however, I'm not sure if I can implement it.
>
>Basically, I have one process which is constantly feeding the table,
>and, once a day, I want to run a MR job to process this table (which
>will empty it).
>
>While I'm processing it, I still want the other process to have the
>ability to store data.
>
>Since I can't rename the table because this functionality doesn't
>exist, I need to have the 2 working on the same table.
>
>Maybe what I can do is work on the column name... Like I store in a
>different column every day based on the day number and I just run MR
>on all the columns except today's. After that, I can delete all the
>columns except the one for the current day. The issue is if the MR job
>takes more than 24h...
>
>Also, is deleting a column fast?
>
>JM
>
>2012/10/12 Doug Meil <[EMAIL PROTECTED]>:
>>
>> I'm not entirely sure of the use-case, but here are some thoughts on
>> this...
>>
>> re:  "should I take the table from the pool, and simply call the delete
>> method?"
>>
>> Yep, you can construct an HTable instance within a MR job.  But use the
>> delete that takes a list because the single-delete will invoke an RPC
>> for each one (not great over an MR job).
>>
>> Construct the HTable instance at the Mapper level (not map-method level)
>> and keep a buffer of deletes in a List.  At the end of the job, send any
>> un-processed deletes in the cleanup method.
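
The buffering described here can be sketched as a small helper. It is
kept independent of HBase so it stands alone: DeleteBuffer, its batch
size and the sink callback are hypothetical names and choices. In a real
mapper one would presumably create it in setup(), call add() from map(),
call flush() from cleanup(), and have the sink call
HTable.delete(List<Delete>).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/**
 * Collects items and hands them to a sink in batches. The batch-size
 * threshold is an assumption; the thread only says to keep a buffer
 * and send the leftovers in cleanup.
 */
final class DeleteBuffer<T> {
    private final int batchSize;
    private final Consumer<List<T>> sink;
    private final List<T> pending = new ArrayList<>();

    DeleteBuffer(int batchSize, Consumer<List<T>> sink) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        this.batchSize = batchSize;
        this.sink = sink;
    }

    /** Queue one delete; send a batch once the buffer is full. */
    void add(T delete) {
        pending.add(delete);
        if (pending.size() >= batchSize) {
            flush();
        }
    }

    /** Send whatever is still buffered; a no-op when nothing is pending. */
    void flush() {
        if (pending.isEmpty()) {
            return;
        }
        // Hand the sink its own copy so it may keep or modify the list.
        sink.accept(new ArrayList<>(pending));
        pending.clear();
    }
}
```

With a batch size of, say, 1000, each mapper sends one batched delete
call per 1000 rows instead of one RPC per row, and the final flush() in
cleanup covers the remainder.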
>>
>>
>> I'm not entirely sure why you'd want to delete every row in a table (as
>> opposed to processing all the rows in Table1, generating an entirely
>> new Table2, and then dropping Table1 when you're done with it).  That
>> seems like it would be less hassle than deleting every row (since the
>> table ends up empty anyway).
>>
>> On 10/12/12 1:20 PM, "Jean-Marc Spaggiari" <[EMAIL PROTECTED]> wrote:
>>
>>>Hi,
>>>
>>>I have a table which I want to parse over a MR job.
>>>
>>>Today, I'm using a scan to parse all the rows. Each row is retrieved,
>>>removed, and then parsed (feeding 2 other tables).
>>>
>>>The goal is to parse all the content while some process might still be
>>>adding some more.
>>>
>>>In the map method of the MR job, can I delete the row I'm working
>>>with? If so, how should I do it? Should I take the table from the pool,
>>>and simply call the delete method? The issue is, doing a delete for
>>>each line will take a while. I would prefer to batch them, but I don't
>>>know when the last line will come, so it's difficult to know when to
>>>send the batch.  Is there a way to tell the MR job to delete this
>>>line? Also, what's the impact on the MR job if I delete the row it's
>>>working on?
>>>
>>>Or is the MR job not the best way to do that?
>>>
>>>Thanks,
>>>
>>>JM
>>>
>>
>>
>