HBase >> mail # user >> Remove the row in MR job?


Re: Remove the row in MR job?
Hi Doug,

Thanks for the suggestion. I like the idea of simply deleting the
table; however, I'm not sure I can implement it.

Basically, I have one process which is constantly feeding the table,
and, once a day, I want to run a MR job to process this table (which
will empty it).

While I'm processing it, I still want the other process to be able
to store data.

Since I can't rename the table because this functionality doesn't
exist, I need to have the two working on the same table.

Maybe what I can do is work on the column name... For example, I store
to a different column every day based on the day number, and I just run
the MR job on all the columns except today's. After that, I can delete
all the columns except the one for the current day. The issue is if the
MR job takes more than 24h...
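
The column-per-day idea above could be sketched roughly as follows. This is only an illustration in plain Java, not HBase code: the class `DayColumns`, its methods, and the `d<dayOfYear>` naming scheme are all assumptions, since the message only says "based on the day number".

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: maps dates to per-day column qualifiers. */
class DayColumns {
    /** Qualifier for a date, using the day of the year (assumed scheme). */
    static String qualifierFor(LocalDate date) {
        return "d" + date.getDayOfYear();
    }

    /** Qualifiers for the daysBack days before today, excluding today itself. */
    static List<String> closedColumns(LocalDate today, int daysBack) {
        List<String> result = new ArrayList<>();
        for (int i = 1; i <= daysBack; i++) {
            result.add(qualifierFor(today.minusDays(i)));
        }
        return result;
    }
}
```

With this scheme, `DayColumns.qualifierFor(LocalDate.of(2012, 10, 12))` gives `d286`. A job that fixes its list of closed columns when it starts would keep working on that same list even if it runs past midnight, though whether that is enough for the 24h concern depends on the workload.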

Also, is it fast to delete a column?

JM

2012/10/12 Doug Meil <[EMAIL PROTECTED]>:
>
> I'm not entirely sure of the use-case, but here are some thoughts on this...
>
> re:  "should I take the table from the pool, and simply call the delete
> method?"
>
> Yep, you can construct an HTable instance within a MR job.  But use the
> delete that takes a list because the single-delete will invoke an RPC for
> each one (not great over an MR job).
>
> Construct the HTable instance at the Mapper level (not map-method level)
> and keep a buffer of deletes in a List.  At the end of the job, send any
> un-processed deletes in the cleanup method.
>
>
> I'm not entirely sure why you'd want to delete every row in a table (as
> opposed to processing all the rows in Table1 and generating an entirely
> new Table2).  And then drop Table1 when you're done with it.  That seems
> like it would be less hassle than deleting every row (since the table is
> empty anyway).
>
>
>
>
>
>
> On 10/12/12 1:20 PM, "Jean-Marc Spaggiari" <[EMAIL PROTECTED]> wrote:
>
>>Hi,
>>
>>I have a table which I want to parse with a MR job.
>>
>>Today, I'm using a scan to parse all the rows. Each row is retrieved,
>>removed, and then parsed (feeding 2 other tables).
>>
>>The goal is to parse all the content while some process might still be
>>adding some more.
>>
>>In the map method of the MR job, can I delete the row I'm working
>>with? If so, how should I do it? Should I take the table from the pool,
>>and simply call the delete method? The issue is, doing a delete for
>>each line will take a while. I would prefer to batch them, but I don't
>>know when the last line will be, so it's difficult to know when to
>>send the batch. Is there a way to tell the MR job to delete this
>>line? Also, what's the impact on the MR job if I delete the row it's
>>working on?
>>
>>Or is the MR job not the best way to do that?
>>
>>Thanks,
>>
>>JM
>>
>
>
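
The buffering pattern Doug describes (collect deletes in a List, send them in batches, flush the remainder in cleanup) could look roughly like the sketch below. It is plain Java with no HBase dependency so it can run on its own. `DeleteBuffer` and the `sink` callback are hypothetical names; in a real Mapper the sink would presumably be the list-taking delete on the table instance, with `add` called from map() and `flush` from cleanup().

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Hypothetical helper: collects row keys and hands them to a sink in batches. */
class DeleteBuffer {
    private final int batchSize;
    private final Consumer<List<String>> sink; // stand-in for a batched table delete
    private final List<String> pending = new ArrayList<>();

    DeleteBuffer(int batchSize, Consumer<List<String>> sink) {
        if (batchSize < 1) {
            throw new IllegalArgumentException("batchSize must be >= 1");
        }
        this.batchSize = batchSize;
        this.sink = sink;
    }

    /** Call once per processed row; sends a batch when the buffer is full. */
    void add(String rowKey) {
        pending.add(rowKey);
        if (pending.size() >= batchSize) {
            flush();
        }
    }

    /** Call at the end of the task to send whatever is still buffered. */
    void flush() {
        if (pending.isEmpty()) {
            return;
        }
        sink.accept(new ArrayList<>(pending)); // copy so the sink owns its batch
        pending.clear();
    }
}
```

This also covers the "I don't know when the last line will be" concern from the original question: the mapper does not need to know, because the final partial batch goes out in the end-of-task flush.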