Re: Remove the row in MR job?
Jean-Marc Spaggiari 2012-10-12, 19:47
Thanks for the suggestion. I like the idea of simply deleting the
table; however, I'm not sure I can implement it.
Basically, I have one process which is constantly feeding the table,
and, once a day, I want to run a MR job to process this table (which
will empty it).
While I'm processing it, I still want the other process to be able
to store data.
Since I can't rename the table because this functionality doesn't
exist, I need to have the two working on the same table.
Maybe what I can do is work with the column name... For example, I store
to a different column every day based on the day number, and I just run
the MR job on all the columns except today's. After that, I can delete
all the columns except the one for the current day. The issue is if the
MR job takes more than 24h...
Also, is deleting a column fast?
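To make the column-per-day idea concrete, here is a minimal sketch in plain Java with no HBase dependency. The class name DayColumns, the "d" + day-of-year naming, and both helper methods are hypothetical and only illustrate the selection logic: pick every existing column except the one for the current day.

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of HBase: models the "one column per day" scheme.
class DayColumns {
    // Assumption: the column qualifier is "d" followed by the day of the year.
    static String qualifierFor(LocalDate day) {
        return "d" + day.getDayOfYear();
    }

    // Returns every existing qualifier except the one the writer uses today.
    static List<String> toProcess(List<String> existing, LocalDate today) {
        String current = qualifierFor(today);
        List<String> selected = new ArrayList<>();
        for (String qualifier : existing) {
            if (!qualifier.equals(current)) {
                selected.add(qualifier);
            }
        }
        return selected;
    }
}
```

Computing this list once when the job starts, and deleting only the columns in that list afterwards, might keep the selection stable if the job runs past midnight, but that is an assumption and not something I have tested.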
2012/10/12 Doug Meil <[EMAIL PROTECTED]>:
> I'm not entirely sure of the use-case, but here are some thoughts on this...
> re: "should I take the table from the pool, and simply call the delete
> Yep, you can construct an HTable instance within a MR job. But use the
> delete that takes a list because the single-delete will invoke an RPC for
> each one (not great over an MR job).
> Construct the HTable instance at the Mapper level (not map-method level)
> and keep a buffer of deletes in a List. At the end of the job, send any
> un-processed deletes in the cleanup method.
> I'm not entirely sure why you'd want to delete every row in a table, as
> opposed to processing all the rows in Table1, generating an entirely
> new Table2, and then dropping Table1 when you're done with it. That seems
> like it would be less hassle than deleting every row (since the table
> ends up empty anyway).
> On 10/12/12 1:20 PM, "Jean-Marc Spaggiari" <[EMAIL PROTECTED]> wrote:
>>I have a table which I want to parse over a MR job.
>>Today, I'm using a scan to parse all the rows. Each row is retrieved,
>>removed, and then parsed (feeding 2 other tables).
>>The goal is to parse all the content while some process might still be
>>adding some more.
>>In the map method of the MR job, can I delete the row I'm working
>>with? If so, how should I do it? Should I take the table from the pool
>>and simply call the delete method? The issue is that doing a delete for
>>each line will take a while. I would prefer to batch them, but I don't
>>know when the last line will come, so it's difficult to know when to
>>send the batch. Is there a way to tell the MR job to delete this
>>line? Also, what's the impact on the MR job if I delete the row it's
>>Or is the MR job not the best way to do that?
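Doug's suggestion answers the "when do I send the batch" question: keep the buffer at the Mapper level, flush it whenever it fills up, and send whatever is left in the cleanup method. Below is a minimal sketch of that pattern in plain Java, so it runs without HBase on the classpath. BufferedDeleter, markForDelete and the Consumer sink are hypothetical names; in a real job the sink would be a call to HTable.delete(List<Delete>) and the row keys would be Delete objects.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical stand-in for the buffering a Mapper would do; not an HBase class.
class BufferedDeleter {
    private final int batchSize;
    // Assumption: stands in for HTable.delete(List<Delete>), one RPC per batch.
    private final Consumer<List<String>> sink;
    private final List<String> buffer = new ArrayList<>();

    BufferedDeleter(int batchSize, Consumer<List<String>> sink) {
        this.batchSize = batchSize;
        this.sink = sink;
    }

    // Called once per row, as map() would be.
    void markForDelete(String rowKey) {
        buffer.add(rowKey);
        if (buffer.size() >= batchSize) {
            flush();
        }
    }

    // Called once at the end, as the Mapper's cleanup() would be.
    void cleanup() {
        if (!buffer.isEmpty()) {
            flush();
        }
    }

    // Hands a copy of the buffered keys to the sink, then empties the buffer.
    private void flush() {
        sink.accept(new ArrayList<>(buffer));
        buffer.clear();
    }
}
```

With a batch size of 3 and 7 rows, the sink would be called for batches of 3, 3 and 1, the last one coming from cleanup(). The mapper never needs to know which row is the last.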