HBase, mail # user - Remove the row in MR job?


Re: Remove the row in MR job?
Jean-Marc Spaggiari 2012-10-12, 19:47
Hi Doug,

Thanks for the suggestion. I like the idea of simply deleting the
table; however, I'm not sure I can implement it.

Basically, I have one process which is constantly feeding the table,
and, once a day, I want to run an MR job to process this table (which
will empty it).

While I'm processing it, I still want the other process to be able to
store data.

Since I can't rename the table because this functionality doesn't
exist, I need to have both working on the same table.

Maybe what I can do is work with the column name: I store into a
different column every day, based on the day number, and I run the MR
job on all the columns except today's. After that, I can delete all the
columns except the one for the current day. The issue is what happens
if the MR job takes more than 24h...
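
A minimal sketch of the day-based column naming idea, in plain Java with
no HBase calls. The class name, the "d" prefix, and the use of the day of
the year as "day number" are all assumptions made for illustration; the
message above does not specify them.

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: none of these names come from HBase or from the thread.
final class DailyQualifiers {

    // Column qualifier the writer would use on a given date (assumed format).
    static String qualifierFor(LocalDate date) {
        return "d" + date.getDayOfYear();
    }

    // Qualifiers for the daysBack days before today, oldest first.
    // Today's qualifier is excluded so the writer can keep using it.
    static List<String> qualifiersToProcess(LocalDate today, int daysBack) {
        List<String> result = new ArrayList<>();
        for (int i = daysBack; i >= 1; i--) {
            result.add(qualifierFor(today.minusDays(i)));
        }
        return result;
    }

    public static void main(String[] args) {
        // Columns the daily job would read when run on 2012-10-12.
        System.out.println(qualifiersToProcess(LocalDate.of(2012, 10, 12), 2));
    }
}
```

Looking back more than one day is one possible way to pick up columns
left behind by a run that took longer than 24h, but that is only a guess
at how the concern could be handled.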

Also, is deleting a column fast?

JM

2012/10/12 Doug Meil <[EMAIL PROTECTED]>:
>
> I'm not entirely sure of the use-case, but here are some thoughts on this...
>
> re:  "should I take the table from the pool, and simply call the delete
> method?"
>
> Yep, you can construct an HTable instance within a MR job.  But use the
> delete that takes a list because the single-delete will invoke an RPC for
> each one (not great over an MR job).
>
> Construct the HTable instance at the Mapper level (not map-method level)
> and keep a buffer of deletes in a List.  At the end of the job, send any
> un-processed deletes in the cleanup method.
>
>
> I'm not entirely sure why you'd want to delete every row in a table (as
> opposed to processing all the rows in Table1 and generating an entirely
> new Table2).  And then drop Table1 when you're done with it.  That seems
> like it would be less hassle than deleting every row (since the table is
> empty anyway).
>
>
>
>
>
>
> On 10/12/12 1:20 PM, "Jean-Marc Spaggiari" <[EMAIL PROTECTED]> wrote:
>
>>Hi,
>>
>>I have a table which I want to parse with an MR job.
>>
>>Today, I'm using a scan to parse all the rows. Each row is retrieved,
>>removed, and then parsed (feeding 2 other tables).
>>
>>The goal is to parse all the content while some process might still be
>>adding some more.
>>
>>In the map method of the MR job, can I delete the row I'm working
>>with? If so, how should I do it? Should I take the table from the pool,
>>and simply call the delete method? The issue is that doing a delete for
>>each line will take a while. I would prefer to batch them, but I don't
>>know when the last line will come, so it's difficult to know when to
>>send the batch. Is there a way to tell the MR job to delete this
>>line? Also, what's the impact on the MR job if I delete the row it's
>>working on?
>>
>>Or is the MR job not the best way to do that?
>>
>>Thanks,
>>
>>JM
>>
>
>
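
The buffered-delete pattern Doug describes above (collect deletes in a
List, send them in batches, and send the remainder in the cleanup method)
could be sketched roughly as follows. To keep the example runnable
without a cluster, a Consumer stands in for the call to
HTable.delete(List<Delete>) and row keys are plain Strings; the class
name BufferedDeleter and the batchSize parameter are assumptions, not
part of HBase.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical helper: in a real Mapper, one instance would be created in
// setup(), add() called from map(), and flush() called from cleanup().
final class BufferedDeleter {
    // Stand-in for sending a batch, e.g. via HTable.delete(List<Delete>).
    private final Consumer<List<String>> sink;
    private final int batchSize;
    private final List<String> buffer = new ArrayList<>();

    BufferedDeleter(Consumer<List<String>> sink, int batchSize) {
        this.sink = sink;
        this.batchSize = batchSize;
    }

    // Queue one row key; send the batch once the buffer is full.
    void add(String rowKey) {
        buffer.add(rowKey);
        if (buffer.size() >= batchSize) {
            flush();
        }
    }

    // Send whatever is still buffered; safe to call when empty.
    void flush() {
        if (buffer.isEmpty()) {
            return;
        }
        sink.accept(new ArrayList<>(buffer)); // hand over a copy, then reuse the buffer
        buffer.clear();
    }

    public static void main(String[] args) {
        List<List<String>> sent = new ArrayList<>();
        BufferedDeleter deleter = new BufferedDeleter(batch -> sent.add(batch), 2);
        for (String key : new String[] {"r1", "r2", "r3", "r4", "r5"}) {
            deleter.add(key);
        }
        deleter.flush(); // what cleanup() would do
        System.out.println(sent);
    }
}
```

With this shape the mapper never needs to know which row is the last
one: the final partial batch goes out when cleanup runs.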