Re: Locks in M/R framework
Hi David,

You are probably aware, but you can specify the location of the data in
Hive, so if you can keep things simple and manage the directories yourself,
you could rewrite the Hive metastore at the same time (e.g. redefine the
Hive tables, or go directly to the underlying metastore DB and change the
SDS.location entry, but beware of race conditions).
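
For illustration only, here is a rough sketch of the table-redefinition
route; it assumes a HiveServer reachable over JDBC, and the driver class,
URL, table name and location below are all placeholders, not something from
this thread:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class RepointHiveTable {
  public static void main(String[] args) throws Exception {
    // HiveServer1-era JDBC driver; driver class and URL are assumptions.
    Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");
    Connection conn = DriverManager.getConnection(
        "jdbc:hive://localhost:10000/default", "", "");
    try {
      Statement stmt = conn.createStatement();
      // Re-point the (hypothetical) table at the freshly written directory
      // instead of editing SDS.location in the metastore DB by hand.
      stmt.execute(
          "ALTER TABLE events SET LOCATION 'hdfs:///data/events/2012-08-13'");
    } finally {
      conn.close();
    }
  }
}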

If your scenario goes beyond simple you might run into issues (deadlocks,
race conditions, herding, etc.).  If that happens, I'd still recommend
sectioning off that problem into the likes of ZK.  It's not particularly
difficult to use and, other than being another running service, is probably
as easy to code against as a directory-managing solution would be; you might
try https://github.com/Netflix/curator, which comes recommended by a
colleague of mine, although I have no experience with it myself.

Don't get me wrong, I am all for keeping it simple with fewer moving parts
if it works - just wanted to suggest something that other systems commonly
use to overcome this.  Your mv(...) example is classic ZK stuff.
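
Roughly what that mv(source, target, timeout) idea would look like with
Curator's InterProcessMutex recipe - an untested sketch only; the connect
string, lock path and directory names are invented, and the package names
assume the Netflix-era Curator:

import java.util.concurrent.TimeUnit;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import com.netflix.curator.framework.CuratorFramework;
import com.netflix.curator.framework.CuratorFrameworkFactory;
import com.netflix.curator.framework.recipes.locks.InterProcessMutex;
import com.netflix.curator.retry.ExponentialBackoffRetry;

public class LockedMove {
  public static void main(String[] args) throws Exception {
    CuratorFramework zk = CuratorFrameworkFactory.newClient(
        "zkhost:2181", new ExponentialBackoffRetry(1000, 3));
    zk.start();

    // Producer and consumers must agree on this lock node (name made up).
    InterProcessMutex lock = new InterProcessMutex(zk, "/locks/events-folder");

    // Block until the folder is free or the timeout is reached.
    if (lock.acquire(60, TimeUnit.SECONDS)) {
      try {
        FileSystem fs = FileSystem.get(new Configuration());
        fs.rename(new Path("/data/events.new"), new Path("/data/events"));
      } finally {
        lock.release();
      }
    } else {
      System.err.println("Timed out waiting for readers to finish");
    }
    zk.close();
  }
}

The reading jobs would take the same mutex around their runs, which is the
part that is hard to enforce when the readers are not under your control.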

Cheers,
Tim
On Mon, Aug 13, 2012 at 2:22 PM, David Ginzburg <[EMAIL PROTECTED]> wrote:

> Hi,
>
> My problem is that some of the jobs that read the folder are not under my
> control, e.g. a client submits a Hive job.
>
> I was thinking of something like an mv(source, target, long timeout) which
> would block until the folder is no longer in use or the timeout is reached.
>
> Is it possible that this problem is not a common one?
>
> > From: [EMAIL PROTECTED]
> > Date: Mon, 13 Aug 2012 17:33:02 +0530
> > Subject: Re: Locks in M/R framework
> > To: [EMAIL PROTECTED]
>
> >
> > David,
> >
> > While ZK can solve this, locking may only make you slower. Let's try to
> > keep it simple?
> >
> > Have you considered keeping two directories? One where the older data
> > is moved to (by the first job, instead of replacing files), for
> > consumption by the second job, which is triggered by watching this
> > directory?
> >
> > That is,
> > MR Job #1 (the producer) moves existing data to /path/b/timestamp,
> > and writes new data to /path/a.
> > MR Job #2 (the consumer) uses the latest /path/b/timestamp (or the whole
> > set of available timestamps under /path/b at that point) for its
> > input, and deletes it afterwards. Hence #2 can monitor this
> > directory to trigger itself.
> >
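A minimal, untested sketch of that handoff against the Hadoop FileSystem
API; /path/a and /path/b come from the mail above, everything else (class
names, the millisecond timestamps) is assumed:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class DirectoryHandoff {

  // Producer (job #1): retire the current data under a timestamped name,
  // then write the new output to /path/a.
  static Path retireCurrent(FileSystem fs) throws Exception {
    Path retired = new Path("/path/b/" + System.currentTimeMillis());
    fs.mkdirs(new Path("/path/b"));
    fs.rename(new Path("/path/a"), retired);
    return retired;
  }

  // Consumer (job #2): pick the latest timestamp under /path/b, use it as
  // the job's input, and delete it once the job has finished.
  static Path latestRetired(FileSystem fs) throws Exception {
    FileStatus[] batches = fs.listStatus(new Path("/path/b"));
    Path latest = null;
    for (FileStatus s : batches) {
      if (latest == null
          || s.getPath().getName().compareTo(latest.getName()) > 0) {
        latest = s.getPath();
      }
    }
    return latest; // null means nothing to consume yet
  }

  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    Path input = latestRetired(fs);
    if (input != null) {
      // ... run the consuming job with `input` as its input path ...
      fs.delete(input, true);
    }
  }
}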
> > On Mon, Aug 13, 2012 at 4:22 PM, David Ginzburg <[EMAIL PROTECTED]>
> wrote:
> > > Hi,
> > >
> > > I have an HDFS folder and an M/R job that periodically updates it by
> > > replacing the data with newly generated data.
> > >
> > > I have a different M/R job that periodically, or ad hoc, processes the
> > > data in the folder.
> > >
> > > The second job, naturally, sometimes fails when the data is replaced by
> > > newly generated data after the job plan, including the input paths, has
> > > already been submitted.
> > >
> > > Is there an elegant solution?
> > >
> > > My current thought is to query the jobtracker for running jobs and go
> > > over all the input files in each job's XML, so that the swap blocks
> > > until the input path is no longer an input path of any currently
> > > running job.
> > >
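A sketch of that check, again untested and assumption-heavy: it uses the old
mapred JobClient API and reads the classic mapred.input.dir property out of
each running job's submitted job XML; the class and property choices below
are not from this thread:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.JobStatus;
import org.apache.hadoop.mapred.RunningJob;

public class IsFolderInUse {
  // Returns true if any running job lists `folder` among its input paths.
  static boolean inUse(Configuration conf, String folder) throws Exception {
    JobClient jobClient = new JobClient(new JobConf(conf));
    FileSystem fs = FileSystem.get(conf);
    for (JobStatus status : jobClient.jobsToComplete()) {
      RunningJob job = jobClient.getJob(status.getJobID());
      if (job == null) {
        continue;
      }
      // job.getJobFile() points at the submitted job.xml in the staging dir.
      Configuration jobXml = new Configuration(false);
      jobXml.addResource(fs.open(new Path(job.getJobFile())));
      String inputs = jobXml.get("mapred.input.dir", "");
      if (inputs.contains(folder)) {
        return true;
      }
    }
    return false;
  }
}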
> > >
> > >
> > >
> >
> >
> >
> > --
> > Harsh J
>