HBase, mail # user - Remove the row in MR job?

Re: Remove the row in MR job?
Jean-Marc Spaggiari 2012-10-13, 23:22
I think I get the idea. I can't do it like that because Job1 might try
to access the table at the same time as Job2 is trying to rename it,
or hit other issues of the same kind, but I will work on something similar.


I will start another thread for another delete question I have ;)


2012/10/12 Doug Meil <[EMAIL PROTECTED]>:
> Just throwing an idea out there, but if you rotate tables you could
> probably do what you want.
> 1)      Table1 is being written throughout the day
> 2)      It's time to kick off the MR job, but before the job is submitted
> Table2 is now configured to be the 'write' table
> 3)      MR job processes all the data in Table1.  Table1 is dropped/truncated
> when finished.
> 4)      Table2 continues to get writes
> 5)      Now it's time to run the MR job again, Table1 is now configured to be
> the 'write' table and Table2 is processed by the MR job.
> 6)      Continue rotating between the tables
> Something like this is probably going to be a lot easier to manage than
> doing deletes of what you've read.
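The rotation in the numbered steps above could be sketched roughly as follows. This is only an illustration under assumptions: `TableRotation`, `currentWriteTable` and `rotate` are hypothetical names, no HBase calls are made, and the switch happens inside a single JVM. How a separate writer process learns about the switch, and how in-flight writes to the old table are drained before the MR job starts, is left open.

```java
import java.util.concurrent.atomic.AtomicReference;

/** Hypothetical helper: tracks which of two tables currently receives writes. */
class TableRotation {
    private final String first;
    private final String second;
    private final AtomicReference<String> writeTable;

    TableRotation(String first, String second) {
        this.first = first;
        this.second = second;
        this.writeTable = new AtomicReference<>(first); // writes start on the first table
    }

    /** The table the feeding process should write to right now. */
    String currentWriteTable() {
        return writeTable.get();
    }

    /**
     * Switches writes to the other table and returns the table that was
     * being written until now, which is the one the MR job should process
     * (and then drop or truncate).
     */
    String rotate() {
        return writeTable.getAndUpdate(t -> t.equals(first) ? second : first);
    }
}
```

Before each run, the scheduler would call `rotate()` and submit the MR job against the returned table name, while the writer keeps asking `currentWriteTable()` for its target.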
> On 10/12/12 3:47 PM, "Jean-Marc Spaggiari" <[EMAIL PROTECTED]> wrote:
>>Hi Doug,
>>Thanks for the suggestion. I like the idea of simply deleting the
>>table; however, I'm not sure if I can implement it.
>>Basically, I have one process which is constantly feeding the table,
>>and, once a day, I want to run an MR job to process this table (which
>>will empty it).
>>While I'm processing it, I still want the other process to have the
>>ability to store data.
>>Since I can't rename the table because this functionality doesn't
>>exist, I need to have the 2 working on the same table.
>>Maybe what I can do is work on the column name... Like I store in a
>>different column every day based on the day number and I just run MR
>>on all the columns except today's. After that, I can delete all the
>>columns except the one for the current day. The issue is if the MR job
>>takes more than 24h...
>>Also, is it fast to delete a column?
>>2012/10/12 Doug Meil <[EMAIL PROTECTED]>:
>>> I'm not entirely sure of the use-case, but here are some thoughts on
>>> re:  "should I take the table from the pool, and simply call the delete
>>> method?"
>>> Yep, you can construct an HTable instance within an MR job.  But use the
>>> delete that takes a list because the single-delete will invoke an RPC
>>> for each one (not great over an MR job).
>>> Construct the HTable instance at the Mapper level (not map-method level)
>>> and keep a buffer of deletes in a List.  At the end of the job, send any
>>> un-processed deletes in the cleanup method.
>>> I'm not entirely sure why you'd want to delete every row in a table (as
>>> opposed to processing all the rows in Table1, generating an entirely
>>> new Table2, and then dropping Table1 when you're done with it).  That
>>> seems like it would be less hassle than deleting every row (since the
>>> table ends up empty anyway).
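The buffering advice above (one instance per Mapper, a List of pending deletes, a final flush in cleanup) might look roughly like the sketch below. It is an illustration, not code from the thread: `DeleteSink` and `BufferedDeleter` are hypothetical stand-ins so the example runs without HBase on the classpath, and row keys are plain strings. In a real job the sink would presumably wrap the list-based `HTable.delete(List<Delete>)` call, and the batch size is an assumed tuning parameter.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for the table client (e.g. a wrapper around HTable). */
interface DeleteSink {
    void deleteAll(List<String> rowKeys);
}

/** Collects row keys and sends them in batches instead of one RPC per row. */
class BufferedDeleter {
    private final DeleteSink sink;
    private final int batchSize;
    private final List<String> buffer = new ArrayList<>();

    BufferedDeleter(DeleteSink sink, int batchSize) {
        if (batchSize < 1) {
            throw new IllegalArgumentException("batchSize must be >= 1");
        }
        this.sink = sink;
        this.batchSize = batchSize;
    }

    /** Call once per processed row, i.e. from the map method. */
    void delete(String rowKey) {
        buffer.add(rowKey);
        if (buffer.size() >= batchSize) {
            flush();
        }
    }

    /** Call from cleanup so the last, partial batch is also sent. */
    void flush() {
        if (buffer.isEmpty()) {
            return;
        }
        sink.deleteAll(new ArrayList<>(buffer)); // hand the sink its own copy
        buffer.clear();
    }
}
```

The Mapper would create one `BufferedDeleter` in its setup method, call `delete` from `map`, and call `flush` from `cleanup`, which answers the "I don't know when the last line will be" problem without the map method needing to know.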
>>> On 10/12/12 1:20 PM, "Jean-Marc Spaggiari" <[EMAIL PROTECTED]> wrote:
>>>>I have a table which I want to parse with an MR job.
>>>>Today, I'm using a scan to parse all the rows. Each row is retrieved,
>>>>removed, and then parsed (feeding 2 other tables).
>>>>The goal is to parse all the content while some process might still be
>>>>adding more.
>>>>In the map method of the MR job, can I delete the row I'm working
>>>>with? If so, how should I do it? Should I take the table from the pool,
>>>>and simply call the delete method? The issue is, doing a delete for
>>>>each line will take a while. I would prefer to batch them, but I don't
>>>>know when the last line will be, so it's difficult to know when to
>>>>send the batch.  Is there a way to tell the MR job to delete this
>>>>line? Also, what's the impact on the MR job if I delete the row it's
>>>>working on?
>>>>Or is the MR job not the best way to do that?