Hadoop, mail # general - Hadoop Data Sharing

Re: Hadoop Data Sharing
Aaron Kimball 2010-05-06, 01:04

In general, if you need to perform a multi-pass MapReduce workflow, each pass
materializes its output to files. The subsequent pass then reads those same
files back in as input. This allows the workflow to start at the last
"checkpoint" if it gets interrupted. There is no persistent in-memory
distributed storage feature in Hadoop that would allow a MapReduce job to
post results to memory for consumption by a subsequent job.

So you would just read your initial data from /input, and write your interim
results to /iteration0. Then the next pass reads from /iteration0 and writes
to /iteration1, etc.
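
To make the chaining concrete, here is a rough sketch of such a driver loop. It uses only the Java standard library on a local directory, so the `IterativeDriver` class, the `Pass` interface, and the `runAll` method are all hypothetical names of mine, not Hadoop APIs; in a real driver the body of `Pass.run` would be where you configure and submit the MapReduce job. The `_SUCCESS` marker is an assumption modeled on the marker file Hadoop's file output committer can write when a job finishes.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Hypothetical driver: chains passes through directories and resumes at the last checkpoint. */
class IterativeDriver {

    /** One pass of the workflow; a stand-in for configuring and running a real MapReduce job. */
    interface Pass {
        void run(Path inputDir, Path outputDir) throws IOException;
    }

    // Assumed completion marker, modeled on the file Hadoop's output committer can write.
    static final String DONE_MARKER = "_SUCCESS";

    /** Output directory for a given iteration, e.g. base/iteration0. */
    static Path outputFor(Path base, int iteration) {
        return base.resolve("iteration" + iteration);
    }

    /**
     * Runs passes 0..iterations-1. Pass 0 reads base/input; pass i reads the
     * output of pass i-1. Passes whose output is already marked complete are
     * skipped. Returns the iteration numbers that were actually executed.
     */
    static List<Integer> runAll(Path base, int iterations, Pass pass) throws IOException {
        List<Integer> executed = new ArrayList<>();
        Path input = base.resolve("input");
        for (int i = 0; i < iterations; i++) {
            Path output = outputFor(base, i);
            if (!Files.exists(output.resolve(DONE_MARKER))) {
                deleteRecursively(output);       // discard partial output of an interrupted pass
                pass.run(input, output);
                Files.createDirectories(output); // no-op if the pass already created it
                Files.createFile(output.resolve(DONE_MARKER));
                executed.add(i);
            }
            input = output;                      // the next pass reads what this one wrote
        }
        return executed;
    }

    /** Deletes a directory tree if it exists (children before parents). */
    static void deleteRecursively(Path dir) throws IOException {
        if (!Files.exists(dir)) {
            return;
        }
        try (Stream<Path> paths = Files.walk(dir)) {
            List<Path> ordered = paths.sorted(Comparator.reverseOrder()).collect(Collectors.toList());
            for (Path p : ordered) {
                Files.delete(p);
            }
        }
    }
}
```

If the driver is interrupted and then started again with the same base directory, the finished iterations are skipped and work resumes at the first iteration without a marker.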

If your data is reasonably small and you think it could fit in memory
somewhere, then you could experiment with using other distributed key-value
stores (memcached, HBase, Cassandra, etc.) to hold intermediate results.
But this will require some integration work on your part.
- Aaron

On Wed, May 5, 2010 at 8:29 AM, Renato Marroquín Mogrovejo <

> Hi everyone, I have recently started to play around with Hadoop, but I am
> running into some "design" problems.
> I need a loop that executes the same job several times, and in each
> iteration gets the processed values (not using a file, because I would need
> to read it). I was using a static vector in my main class (the one that
> iterates and executes the job in each iteration) to retrieve those values,
> and it did work while I was using standalone mode. Now I have tried to test
> it in pseudo-distributed mode and obviously it is not working.
> Any suggestions, please?
> Thanks in advance,
> Renato M.