Hadoop, mail # user - Hbase + mapreduce -- operational design question


Re: Hbase + mapreduce -- operational design question
Eugene Kirpichov 2011-09-10, 09:23
I believe HBase supports a TTL (time-to-live expiry) for records,
set per column family, so it can clean up old records on its own.
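
To illustrate the idea, here is a rough sketch in plain Python. It is only
a simplified model of TTL expiry, not the HBase API: the names Event,
is_expired and visible_events are made up for this example, and HBase's
exact boundary handling may differ. The point is that a record stops being
visible once it is older than the TTL, so the TTL would need to be longer
than the interval between job runs, or events could expire unprocessed.

```python
from dataclasses import dataclass

DAY_MS = 24 * 60 * 60 * 1000  # one day in milliseconds


@dataclass(frozen=True)
class Event:
    """Hypothetical stand-in for a stored record and its write timestamp."""
    row_key: str
    timestamp_ms: int


def is_expired(event, now_ms, ttl_seconds):
    """Simplified TTL rule (an assumption): older than the TTL means gone."""
    return now_ms - event.timestamp_ms > ttl_seconds * 1000


def visible_events(events, now_ms, ttl_seconds):
    """Return the events a reader would still see at time now_ms."""
    return [e for e in events if not is_expired(e, now_ms, ttl_seconds)]


# Three events written on day 0, day 1 and day 2.
events = [Event("a", 0), Event("b", DAY_MS), Event("c", 2 * DAY_MS)]

# With a 2-day TTL, evaluated on day 3, only the day-0 event has aged out.
remaining = visible_events(events, now_ms=3 * DAY_MS, ttl_seconds=2 * 24 * 60 * 60)
```

Note that in this model events younger than the TTL are still visible on
the next run, so a TTL by itself only bounds how much old data the job sees.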

On Sat, Sep 10, 2011 at 1:54 AM, Dhodapkar, Chinmay
<[EMAIL PROTECTED]> wrote:
> Hello,
> I have a setup where a bunch of clients store 'events' in an HBase table. Also, periodically (once a day), I run a mapreduce job that goes over the table and computes some reports.
>
> Now my issue is that on the next run I don't want the mapreduce job to process the 'events' that it has already processed. I know that I can mark processed events in the HBase table and the mapper can filter them out during the next run. But what I would really like is for previously processed events not to even hit the mapper.
>
> One solution I can think of is to back up the HBase table after running the job and then clear the table. But this has a lot of problems:
> 1) Clients may have inserted events while the job was running.
> 2) I could disable and drop the table and then create it again, but then the clients would complain about the short window of unavailability.
>
>
> What do people using HBase (live) + mapreduce typically do?
>
> Thanks!
> Chinmay
>
>

--
Eugene Kirpichov
Principal Engineer, Mirantis Inc. http://www.mirantis.com/
Editor, http://fprog.ru/