Re: Best practice for automating jobs
If you know make and bash, have a look at Stampede for scheduling work:

https://github.com/ThinkBigAnalytics/stampede

(Full disclosure: I wrote it.)

On Thu, Jan 10, 2013 at 4:11 PM, Sean McNamara <[EMAIL PROTECTED]> wrote:

> > I want to know if there are any accepted patterns or best practices for
> > this?
>
> http://oozie.apache.org/
>
>
>
With both Stampede and Oozie, you can tell them to watch for certain data
to show up, e.g., a _SUCCESS file marker in a directory getting new data
files, and then start a Hive query, etc. You can also add your partition
creation commands in the workflow, e.g., as soon as the data is present (or
even before; Hive won't care if the data doesn't exist yet).
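
As a rough illustration of the "watch for a marker, then start work" pattern, here is a minimal Python sketch. It is not how Stampede or Oozie implement it. The function name wait_for_marker and its parameters are hypothetical, and it polls a local directory as a stand-in for HDFS (on a real cluster the check would be something like `hadoop fs -test -e <dir>/_SUCCESS`).

```python
import time
from pathlib import Path


def wait_for_marker(directory, marker="_SUCCESS", timeout=60.0, poll_interval=5.0):
    """Poll until `directory/marker` exists.

    Returns True as soon as the marker file is seen, or False if
    `timeout` seconds elapse first. Assumption: `directory` is a local
    path standing in for an HDFS directory.
    """
    deadline = time.monotonic() + timeout
    target = Path(directory) / marker
    while True:
        if target.exists():
            return True
        if time.monotonic() >= deadline:
            return False
        # Sleep for one poll interval, but never past the deadline.
        time.sleep(min(poll_interval, max(0.0, deadline - time.monotonic())))
```

A scheduler script would call this first and only launch the Hive query when it returns True.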
> > New partitions will be added regularly
>
When you add a partition, that metadata goes into the metastore, so every
Hive instance sharing that metastore will see it. Of course, you should
avoid scenarios where multiple processes attempt to create the same
partition, although if they are using exactly the same command, then adding
an IF NOT EXISTS clause will avoid error messages. Still, I wouldn't want
to torture test the metastore...
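
To make the IF NOT EXISTS suggestion concrete, here is a small Python sketch that builds the statement and the command line for a fresh `hive -e` invocation. The helper names, the table name logs, and the partition column dt are made up for illustration, and the value quoting is naive (no escaping), so treat it as a starting point only.

```python
def add_partition_sql(table, spec, location=None):
    """Build an idempotent ALTER TABLE ... ADD PARTITION statement.

    `spec` maps partition column names to string values. Values are
    wrapped in single quotes without escaping (an assumption that the
    inputs are trusted).
    """
    parts = ", ".join(f"{col}='{val}'" for col, val in spec.items())
    sql = f"ALTER TABLE {table} ADD IF NOT EXISTS PARTITION ({parts})"
    if location:
        sql += f" LOCATION '{location}'"
    return sql + ";"


def hive_command(sql):
    """Return the argument list for running `sql` in a new Hive CLI process."""
    return ["hive", "-e", sql]


if __name__ == "__main__":
    # Hypothetical table and partition column.
    print(hive_command(add_partition_sql("logs", {"dt": "2013-01-10"})))
```

Because the statement is the same every time for a given partition, overlapping jobs can each issue it without failing on "already exists".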
> What type of partitions are you adding? Why frequently?
>
>
>
>
> Sean
>
>
> On 1/10/13 3:03 PM, "Tom Brown" <[EMAIL PROTECTED]> wrote:
>
> >All,
> >
> >I want to automate jobs against Hive (using an external table with
> >ever-growing partitions), and I'm running into a few challenges:
> >
> >Concurrency - If I run Hive as a thrift server, I can only safely run
> >one job at a time. As such, it seems like my best bet will be to run
> >it from the command line and set up a brand new instance for each job.
> >That's quite a bit of a hassle to solve a seemingly common problem, so
> >I want to know if there are any accepted patterns or best practices
> >for this?
> >
> >Partition management - New partitions will be added regularly. If I
> >have to set up multiple instances of Hive for each (potentially)
> >overlapping job, it will be difficult to keep track of the partitions
> >that have been added. In the context of the preceding question, what
> >is the best way to add metadata about new partitions?
> >
> >Thanks in advance!
> >
> >--Tom
>
>
--
*Dean Wampler, Ph.D.*
thinkbiganalytics.com
+1-312-339-1330