Renato Marroquín Mogrovej... 2010-05-05, 15:29
Hi everyone, I have recently started to play around with Hadoop, but I am running into some "design" problems. I need a loop that executes the same job several times and, in each iteration, gets the processed values back (without using a file, because I would then need to read it). I was using a static vector in my main class (the one that iterates and executes the job in each iteration) to retrieve those values, and it worked while I was in standalone mode. Now I have tried to test it in pseudo-distributed mode and, obviously, it is not working. Any suggestions, please?
Thanks in advance, Renato M.
-
Re: Hadoop Data Sharing
Aaron Kimball 2010-05-06, 01:04
Renato,
In general if you need to perform a multi-pass MapReduce workflow, each pass materializes its output to files. The subsequent pass then reads those same files back in as input. This allows the workflow to start at the last "checkpoint" if it gets interrupted. There is no persistent in-memory distributed storage feature in Hadoop that would allow a MapReduce job to post results to memory for consumption by a subsequent job.
So you would just read your initial data from /input, and write your interim results to /iteration0. Then the next pass reads from /iteration0 and writes to /iteration1, etc.
If your data is reasonably small and you think it could fit in memory somewhere, then you could experiment with using other distributed key-value stores (memcached[b], hbase, cassandra, etc.) to hold intermediate results. But this will require some integration work on your part. - Aaron
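The /input, /iteration0, /iteration1 chaining described above can be sketched as a small driver loop. This is only an illustration under stated assumptions: local directories stand in for DFS paths, `runPass` is a hypothetical stand-in for configuring and submitting a real job, and the `_SUCCESS` marker file is a convention chosen for this sketch to detect a completed pass.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Sketch of the checkpointed multi-pass pattern: each pass reads the previous
// pass's output directory and writes a new one. Nothing here calls Hadoop.
class IterativeDriver {

    // Hypothetical stand-in for one MapReduce pass: reads each number in
    // in/part-00000, doubles it, and writes the result to out/part-00000.
    static void runPass(Path in, Path out) throws IOException {
        Files.createDirectories(out);
        List<String> result = new ArrayList<>();
        for (String line : Files.readAllLines(in.resolve("part-00000"), StandardCharsets.UTF_8)) {
            result.add(Long.toString(Long.parseLong(line.trim()) * 2));
        }
        Files.write(out.resolve("part-00000"), result, StandardCharsets.UTF_8);
    }

    // Chains input -> iteration0 -> iteration1 -> ... under root.
    // A pass whose marker file already exists is skipped, so a restarted
    // driver resumes from the last completed "checkpoint".
    static Path runAll(Path root, int passes) throws IOException {
        Path input = root.resolve("input");
        for (int i = 0; i < passes; i++) {
            Path output = root.resolve("iteration" + i);
            Path marker = output.resolve("_SUCCESS"); // assumed marker name
            if (!Files.exists(marker)) {
                runPass(input, output);
                Files.createFile(marker);
            }
            input = output; // the next pass reads what this pass wrote
        }
        return input;
    }

    public static void main(String[] args) throws IOException {
        Path root = Files.createTempDirectory("iter-demo");
        Files.createDirectories(root.resolve("input"));
        Files.write(root.resolve("input").resolve("part-00000"),
                Arrays.asList("1", "2", "3"), StandardCharsets.UTF_8);
        Path last = runAll(root, 3);
        System.out.println(last.getFileName() + ": "
                + Files.readAllLines(last.resolve("part-00000"), StandardCharsets.UTF_8));
    }
}
```

In a real workflow the body of `runPass` would set the job's input and output paths to `in` and `out` and wait for the job to finish; the loop structure stays the same.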
-
Re: Hadoop Data Sharing
Renato Marroquín Mogrovej... 2010-05-11, 13:38
Thanks Aaron! I was thinking the same after doing some reading. What about serializing the objects? Do you think that is a good idea? Thanks again.
Renato M.
-
Re: Hadoop Data Sharing
Aaron Kimball 2010-05-11, 17:31
What objects are you referring to? I'm not sure I understand your question. - Aaron
-
Re: Hadoop Data Sharing
Aaron Kimball 2010-05-11, 17:34
Perhaps this is guidance in the area you were hoping for: If your data is in objects that implement the interface 'Writable', then you can use the SequenceFileOutputFormat and SequenceFileInputFormat to store your intermediate data in binary form in disk-backed files called SequenceFiles. The serialization will happen through the write() and readFields() methods of your objects, which will automatically be called by the OutputFormat/InputFormat as they move through the system. So your subsequent MR pass will receive objects back in the same form as they were emitted. This is a considerably better idea (from both a throughput and a sanity perspective) in a chained MapReduce job.
- Aaron
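As a rough illustration of the write()/readFields() pair mentioned above, here is a minimal sketch for a vector of user-defined elements. The `Writable` interface below is a local stand-in that declares the same two methods as Hadoop's `org.apache.hadoop.io.Writable`, so the example compiles without Hadoop on the classpath; `Node`, `NodeList`, and `WritableDemo` are hypothetical names, and the byte-array round trip only imitates what an OutputFormat/InputFormat pair would do with real files.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Local stand-in mirroring the two methods of Hadoop's Writable interface.
interface Writable {
    void write(DataOutput out) throws IOException;
    void readFields(DataInput in) throws IOException;
}

// Hypothetical user-defined element stored in the vector.
class Node implements Writable {
    int id;
    double weight;

    Node() {} // no-arg constructor so an empty instance can be filled by readFields()

    Node(int id, double weight) {
        this.id = id;
        this.weight = weight;
    }

    public void write(DataOutput out) throws IOException {
        out.writeInt(id);
        out.writeDouble(weight);
    }

    public void readFields(DataInput in) throws IOException {
        // Fields must be read in exactly the order they were written.
        id = in.readInt();
        weight = in.readDouble();
    }
}

// Hypothetical wrapper that serializes the whole vector: a count, then each element.
class NodeList implements Writable {
    final List<Node> nodes = new ArrayList<>();

    public void write(DataOutput out) throws IOException {
        out.writeInt(nodes.size());
        for (Node n : nodes) {
            n.write(out);
        }
    }

    public void readFields(DataInput in) throws IOException {
        nodes.clear(); // instances may be reused, so drop any previous contents
        int count = in.readInt();
        for (int i = 0; i < count; i++) {
            Node n = new Node();
            n.readFields(in);
            nodes.add(n);
        }
    }
}

class WritableDemo {
    static byte[] toBytes(Writable w) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            w.write(out);
        }
        return buffer.toByteArray();
    }

    static void fromBytes(Writable w, byte[] bytes) throws IOException {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes))) {
            w.readFields(in);
        }
    }

    public static void main(String[] args) throws IOException {
        NodeList original = new NodeList();
        original.nodes.add(new Node(1, 0.5));
        original.nodes.add(new Node(2, 1.5));

        NodeList copy = new NodeList();
        fromBytes(copy, toBytes(original));
        System.out.println("restored " + copy.nodes.size() + " nodes");
    }
}
```

With the real Hadoop interface in place of the stand-in, a class shaped like `NodeList` could be used as the value type of a job that writes SequenceFiles, and the next pass would receive it back in the same form.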
-
Re: Hadoop Data Sharing
Renato Marroquín Mogrovej... 2010-05-11, 20:19
Hi Aaron,
The thing is that I have a data structure that is saved into a vector, and this vector needs to be available to my MapReduce jobs while iterating. So do you think serializing these objects would be a good and easy way to do it? It is a vector in which each node contains another user-defined data structure. Maybe I will try to do it first just using files, and see how the throughput goes. Also, do you know where I can find some examples of serializing objects for Hadoop to save them into SequenceFiles? Thanks in advance.
Renato M.
-
Re: Hadoop Data Sharing
Jay Booth 2010-05-11, 20:34
Probably the most direct route to get your desired result is to save the objects to either a SequenceFile or plain text file on DFS. Then in the configure() section of your mapreduce jobs, you open the file on DFS, stream contents into a local variable and refer to it as you need to. Either way, you'll need some sort of serialization via Writable or plain text.
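The "load it once in configure(), then refer to it" idea above might look roughly like the following sketch. It makes several assumptions: a local tab-separated text file stands in for the file on DFS, `SideDataMapper` and its fields are hypothetical, and the `configure` and `map` methods here only imitate the shape of a mapper's lifecycle rather than implementing Hadoop's interfaces.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Sketch of the side-data pattern: read a shared file once per task,
// keep it in a local variable, and consult it for every record.
class SideDataMapper {
    // Local copy of the shared data, filled once before any record is processed.
    private final Map<String, Double> weights = new HashMap<>();

    // Stand-in for the mapper's configure() step; expects "key<TAB>weight" lines.
    void configure(Path sideFile) throws IOException {
        weights.clear();
        try (BufferedReader reader = Files.newBufferedReader(sideFile, StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (line.isEmpty()) {
                    continue; // tolerate blank lines
                }
                String[] parts = line.split("\t", 2);
                weights.put(parts[0], Double.parseDouble(parts[1]));
            }
        }
    }

    // Stand-in for map(): uses the in-memory data instead of re-reading the file.
    // Keys missing from the side file fall back to a weight of 1.0.
    String map(String key, double value) {
        double weight = weights.getOrDefault(key, 1.0);
        return key + "\t" + (value * weight);
    }

    public static void main(String[] args) throws IOException {
        Path sideFile = Files.createTempFile("weights", ".tsv");
        Files.write(sideFile, Arrays.asList("a\t2.0", "b\t0.5"), StandardCharsets.UTF_8);

        SideDataMapper mapper = new SideDataMapper();
        mapper.configure(sideFile);
        System.out.println(mapper.map("a", 3.0));
        System.out.println(mapper.map("z", 3.0));
    }
}
```

In an actual job the file would be opened through Hadoop's FileSystem API instead of java.nio, and each iteration's driver would rewrite the side file before launching the next pass.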
-
Re: Hadoop Data Sharing
Renato Marroquín Mogrovej... 2010-05-16, 02:05
Thanks for your replies. Yeah I have had to restructure a part of my code but it is all good now. Thanks again for your suggestions.
Renato M.