Re: how to run jobs every 30 minutes?
The first recommendation (gluing all my command line apps together) is what I
am currently using.
The other ones you mentioned are just out of my league right now, since I am
quite new to the Java world, not to mention JRuby, Groovy, Jython, etc.
But when I get comfortable with the environment and start to look for more
options I'll refer back to your message. Thanks for the advanced info :-)

2010/12/15 Chris K Wensel <[EMAIL PROTECTED]>

>
> I see it this way.
>
> You can glue a bunch of discrete command line apps, which may or may not
> have dependencies between one another, together in a new syntax. Which is
> darn nice if you already have a bunch of discrete, ready-to-run command
> line apps sitting around that need to be strung together and can't be used
> as libraries and instantiated through their APIs.
>
> Or, you can string all your work together through the APIs with a
> Turing-complete language and run it all from a single command line
> interface (and hand that to cron, or some other tool).
>
> In this case you can use Java, or easier languages like JRuby, Groovy,
> Jython, Clojure, etc., which were designed for this purpose. (They don't
> run on the cluster; they only run on the Hadoop client side.)
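>
> As a minimal sketch of that approach (all class, path, and jar names here
> are made up):
>
>   public class Pipeline {
>     public static void main(String[] args) throws Exception {
>       // each step is driven through its API rather than a shell command;
>       // a step starts only after the previous one completes
>       new CrawlStep().run();   // hypothetical wrapper around a Hadoop job
>       new IndexStep().run();
>     }
>   }
>
> A crontab entry like "*/30 * * * * hadoop jar pipeline.jar Pipeline" then
> takes care of the every-30-minutes part.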
>
> Think Ant vs. Gradle (or any other build tool that uses a scripting
> language and not a configuration file) if you want a concrete example.
>
> Cascading itself is a query API (and query planner). But it also exposes
> to the user the ability to run discrete 'processes' in dependency order
> for you: either Cascading (Hadoop) Flows or Riffle-annotated process
> objects. They can all be intermingled and managed from the same dependency
> scheduler; Cascading has one, and Riffle has one.
>
> So you can run Flow -> Mahout -> Pig -> Mahout -> Flow -> shell ->
> whattheheckever from the same application.
>
> Cascading also has the ability to only run 'stale' processes; think
> 'make' file. When re-running a job where only one file of many has
> changed, this is a big win.
>
> I personally like parameterizing my applications via the command line and
> letting my CLI options drive the workflows. For example, my testing,
> integration, and production environments are very different, so it's very
> easy to drive specific runs of the jobs by changing a CLI arg. (args4j
> makes this darn simple.)
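>
> For example, a bare-bones args4j bean (the option names are invented):
>
>   import org.kohsuke.args4j.CmdLineParser;
>   import org.kohsuke.args4j.Option;
>
>   public class JobRunner {
>     @Option(name = "--env", usage = "test | integration | production")
>     String env = "test";
>
>     @Option(name = "--input", usage = "input path or JDBC url")
>     String input;
>
>     public static void main(String[] args) throws Exception {
>       JobRunner runner = new JobRunner();
>       new CmdLineParser(runner).parseArgument(args);
>       // choose taps/paths and kick off the workflow based on runner.env
>     }
>   }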
>
> If I am chaining multiple CLI apps into a bigger production app, I
> suspect parameterizing that will be error prone, especially if the
> input/output data points (JDBC vs. file) are different in different
> contexts.
>
> You can find Riffle here: https://github.com/cwensel/riffle (it's Apache
> licensed; contributions welcome).
>
> ckw
>
> On Dec 14, 2010, at 1:30 AM, Alejandro Abdelnur wrote:
>
> > Ed,
> >
> > Actually Oozie is quite different from Cascading.
> >
> > * Cascading allows you to write 'queries' using a Java API, and they
> > get translated into MR jobs.
> > * Oozie allows you to compose sequences of MR/Pig/Hive/Java/SSH jobs in
> > a DAG (workflow jobs) and has timer+data dependency triggers
> > (coordinator jobs).
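> >
> > For the Cascading side, a minimal (untested) sketch against the 1.x API,
> > with made-up paths:
> >
> >   Tap source = new Hfs(new TextLine(), "input/path");
> >   Tap sink = new Hfs(new TextLine(), "output/path");
> >   Pipe pipe = new Pipe("copy");
> >   Flow flow = new FlowConnector().connect(source, sink, pipe);
> >   flow.complete(); // plans and runs the underlying MR jobs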
> >
> > Regards.
> >
> > Alejandro
> >
> > On Tue, Dec 14, 2010 at 1:26 PM, edward choi <[EMAIL PROTECTED]> wrote:
> >
> >> Thanks for the tip. I took a look at it.
> >> Looks similar to Cascading I guess...?
> >> Anyway thanks for the info!!
> >>
> >> Ed
> >>
> >> 2010/12/8 Alejandro Abdelnur <[EMAIL PROTECTED]>
> >>
> >>> Or, if you want to do it in a reliable way, you could use an Oozie
> >>> coordinator job.
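> >>>
> >>> Roughly, a coordinator app is an XML file along these lines (an
> >>> untested sketch; the workflow path is made up):
> >>>
> >>>   <coordinator-app name="crawl-every-30-min" frequency="30"
> >>>                    start="2010-12-08T00:00Z" end="2011-12-08T00:00Z"
> >>>                    timezone="UTC" xmlns="uri:oozie:coordinator:0.1">
> >>>     <action>
> >>>       <workflow>
> >>>         <app-path>hdfs://namenode/user/ed/crawl-wf</app-path>
> >>>       </workflow>
> >>>     </action>
> >>>   </coordinator-app>
> >>>
> >>> where frequency is in minutes, so the workflow is materialized every
> >>> 30 minutes.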
> >>>
> >>> On Wed, Dec 8, 2010 at 1:53 PM, edward choi <[EMAIL PROTECTED]> wrote:
> >>>> My mistake. Come to think of it, you are right; I can just make an
> >>>> infinite loop inside the Hadoop application.
> >>>> Thanks for the reply.
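> >>>>
> >>>> In sketch form (untested; the actual job setup is elided):
> >>>>
> >>>>   Configuration conf = new Configuration();
> >>>>   while (true) {
> >>>>     Job job = new Job(conf, "crawl");
> >>>>     // set jar, mapper, and a fresh output path for each run ...
> >>>>     job.waitForCompletion(true);
> >>>>     Thread.sleep(TimeUnit.MINUTES.toMillis(30));
> >>>>   }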
> >>>>
> >>>> 2010/12/7 Harsh J <[EMAIL PROTECTED]>
> >>>>
> >>>>> Hi,
> >>>>>
> >>>>> On Tue, Dec 7, 2010 at 2:25 PM, edward choi <[EMAIL PROTECTED]> wrote:
> >>>>>> Hi,
> >>>>>>
> >>>>>> I'm planning to crawl a certain web site every 30 minutes.
> >>>>>> How would I get it done in Hadoop?
> >>>>>>