Re: Programming Multiple rounds of mapreduce
Thanks Matt,

Arko, if you plan to use Oozie, you can have a simple coordinator job that
does this. For example, the following schedules a WF every 5 mins that
consumes the output produced by the previous run; you just have to have the
initial data.

Thxs.

Alejandro

----
<coordinator-app name="coord-1" frequency="${coord:minutes(5)}"
                 start="${start}" end="${end}" timezone="UTC"
                 xmlns="uri:oozie:coordinator:0.1">
  <controls>
    <concurrency>1</concurrency>
  </controls>

  <datasets>
    <dataset name="data" frequency="${coord:minutes(5)}"
             initial-instance="${start}" timezone="UTC">
      <uri-template>${nameNode}/user/${coord:user()}/examples/${dataRoot}/${YEAR}-${MONTH}-${DAY}-${HOUR}-${MINUTE}</uri-template>
    </dataset>
  </datasets>

  <input-events>
    <data-in name="input" dataset="data">
      <instance>${coord:current(0)}</instance>
    </data-in>
  </input-events>

  <output-events>
    <data-out name="output" dataset="data">
      <instance>${coord:current(1)}</instance>
    </data-out>
  </output-events>

  <action>
    <workflow>
      <app-path>${nameNode}/user/${coord:user()}/examples/apps/subwf-1</app-path>
      <configuration>
        <property>
          <name>jobTracker</name>
          <value>${jobTracker}</value>
        </property>
        <property>
          <name>nameNode</name>
          <value>${nameNode}</value>
        </property>
        <property>
          <name>queueName</name>
          <value>${queueName}</value>
        </property>
        <property>
          <name>examplesRoot</name>
          <value>${examplesRoot}</value>
        </property>
        <property>
          <name>inputDir</name>
          <value>${coord:dataIn('input')}</value>
        </property>
        <property>
          <name>outputDir</name>
          <value>${coord:dataOut('output')}</value>
        </property>
      </configuration>
    </workflow>
  </action>
</coordinator-app>
------
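
The workflow that the coordinator launches from the app-path above is not shown in this message. As a rough sketch only, it might look like the following, assuming a single map-reduce action that reads the inputDir and writes the outputDir passed in by the coordinator. The workflow name, node names, and the mapper and reducer class names are hypothetical placeholders, and the property keys assume the old mapred.* API.

```xml
<!-- Sketch of a workflow.xml for the subwf-1 app-path; not the original author's file. -->
<workflow-app name="subwf-1" xmlns="uri:oozie:workflow:0.1">
  <start to="mr-round"/>

  <action name="mr-round">
    <map-reduce>
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <configuration>
        <property>
          <name>mapred.job.queue.name</name>
          <value>${queueName}</value>
        </property>
        <!-- Hypothetical class names: replace with your own mapper and reducer. -->
        <property>
          <name>mapred.mapper.class</name>
          <value>org.example.RoundMapper</value>
        </property>
        <property>
          <name>mapred.reducer.class</name>
          <value>org.example.RoundReducer</value>
        </property>
        <!-- inputDir and outputDir are the values the coordinator resolves
             from coord:dataIn('input') and coord:dataOut('output'). -->
        <property>
          <name>mapred.input.dir</name>
          <value>${inputDir}</value>
        </property>
        <property>
          <name>mapred.output.dir</name>
          <value>${outputDir}</value>
        </property>
      </configuration>
    </map-reduce>
    <ok to="end"/>
    <error to="fail"/>
  </action>

  <kill name="fail">
    <message>Map/Reduce failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
  </kill>
  <end name="end"/>
</workflow-app>
```

Because the coordinator's output instance is current(1) and its input instance is current(0) on the same dataset, the directory written by one run is the directory read by the run five minutes later.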

On Mon, Jun 13, 2011 at 3:01 PM, GOEKE, MATTHEW (AG/1000) <
[EMAIL PROTECTED]> wrote:

> If you know for certain that it needs to be split into multiple work units
> I would suggest looking into Oozie. Easy to install, lightweight, low
> learning curve... for my purposes it's been very helpful so far. I am also
> fairly certain you can chain multiple job confs into the same run, but I have
> not actually tried that, so I can't promise it is easy or possible.
>
> http://www.cloudera.com/blog/2010/07/whats-new-in-cdh3-b2-oozie/
>
> If you are not running CDH3u0 then you can also get the tarball and
> documentation directly here:
> https://ccp.cloudera.com/display/SUPPORT/CDH3+Downloadable+Tarballs
>
> Matt
>
> -----Original Message-----
> From: Marcos Ortiz [mailto:[EMAIL PROTECTED]]
> Sent: Monday, June 13, 2011 4:57 PM
> To: [EMAIL PROTECTED]
> Cc: Arko Provo Mukherjee
> Subject: Re: Programming Multiple rounds of mapreduce
>
> Well, you can define a job for each round and then define the
> workflow that chains those jobs, based on your implementation.
>
> On 6/13/2011 5:46 PM, Arko Provo Mukherjee wrote:
> > Hello,
> >
> > I am trying to write a program where I need to write multiple rounds
> > of map and reduce.
> >
> > The output of the last round of map-reduce must be fed into the input
> > of the next round.
> >
> > Can anyone please guide me to any link / material that can teach me as
> > to how I can achieve this.
> >
> > Thanks a lot in advance!
> >
> > Thanks & regards
> > Arko
>
> --
> Marcos Luís Ortíz Valmaseda
>  Software Engineer (UCI)
>  http://marcosluis2186.posterous.com
>  http://twitter.com/marcosluis2186