Peng, Wei
20101221, 01:49
Edward J. Yoon
20101221, 02:22
Peng, Wei
20101221, 02:30
Ricky Ho
20101221, 03:00
Peng, Wei
20101221, 04:16
Ted Dunning
20101221, 05:02
Peng, Wei
20101221, 05:43
Ted Dunning
20101221, 06:18
Edward J. Yoon
20101221, 06:27
Peng, Wei
20101221, 09:59
Ted Dunning
20101221, 15:28
Ricky Ho
20101221, 16:53
Peng, Wei
20101221, 18:04
Ted Dunning
20101221, 18:09
Peng, Wei
20101222, 02:07
Edward J. Yoon
20101222, 06:54
Peng, Wei
20101222, 16:57
Ricky Ho
20101222, 18:46
Ted Dunning
20101222, 19:00
Peng, Wei
20101222, 19:51
Ted Yu
20101222, 20:35


breadthfirst searchPeng, Wei 20101221, 01:49
I implemented an algorithm to run hadoop on a 25GB graph data to calculate its average separation length. The input format is V1(tab)V2 (where V2 is the friend of V1). My purpose is to first randomly select some seed nodes, and then for each node, calculate the shortest paths from this node to all other nodes on the graph. To do this, I first run a simple python code in a single machine to get some random seed nodes. Then I run a hadoop job to generate adjacent list for each node as the input for the second job. The second job takes the adjacent list input and output the first level breadthfirst search result. The nodes which are the friends of the seed node have distance 1. Then this output is the input for the next hadoop job so on so forth, until all the nodes are reached. I generated a simulated graph for testing. This data has only 100 nodes. Normal python code can find the separation length within 1 second (100 seed nodes). However, the hadoop took almost 3 hours to do that (pseudodistributed mode on one machine)!! I wonder if there is a more efficient way to do breadthfirst search in hadoop? It is very inefficient to output so many intermediate results. Totally there would be seedNodeNumber*levelNumber+1 jobs, seedNodeNumber*levelNumber intermediate files. Why is hadoop so slow? Please help. Thanks! Wei 
Re: breadthfirst searchEdward J. Yoon 20101221, 02:22
Check this slide out 
http://people.apache.org/~edwardyoon/papers/Apache_HAMA_BSP.pdf On Tue, Dec 21, 2010 at 10:49 AM, Peng, Wei <[EMAIL PROTECTED]> wrote: > > I implemented an algorithm to run hadoop on a 25GB graph data to > calculate its average separation length. > The input format is V1(tab)V2 (where V2 is the friend of V1). > My purpose is to first randomly select some seed nodes, and then for > each node, calculate the shortest paths from this node to all other > nodes on the graph. > > To do this, I first run a simple python code in a single machine to get > some random seed nodes. > Then I run a hadoop job to generate adjacent list for each node as the > input for the second job. > > The second job takes the adjacent list input and output the first level > breadthfirst search result. The nodes which are the friends of the seed > node have distance 1. Then this output is the input for the next hadoop > job so on so forth, until all the nodes are reached. > > I generated a simulated graph for testing. This data has only 100 nodes. > Normal python code can find the separation length within 1 second (100 > seed nodes). However, the hadoop took almost 3 hours to do that > (pseudodistributed mode on one machine)!! > > I wonder if there is a more efficient way to do breadthfirst search in > hadoop? It is very inefficient to output so many intermediate results. > Totally there would be seedNodeNumber*levelNumber+1 jobs, > seedNodeNumber*levelNumber intermediate files. Why is hadoop so slow? > > Please help. Thanks! > > Wei >  Best Regards, Edward J. Yoon [EMAIL PROTECTED] http://blog.udanax.org 
RE: breadthfirst searchPeng, Wei 20101221, 02:30
Yoon,
Can I use HAMA now, or it is still in development? Thanks Wei Original Message From: Edward J. Yoon [mailto:[EMAIL PROTECTED]] Sent: Monday, December 20, 2010 6:23 PM To: [EMAIL PROTECTED] Subject: Re: breadthfirst search Check this slide out  http://people.apache.org/~edwardyoon/papers/Apache_HAMA_BSP.pdf On Tue, Dec 21, 2010 at 10:49 AM, Peng, Wei <[EMAIL PROTECTED]> wrote: > > I implemented an algorithm to run hadoop on a 25GB graph data to > calculate its average separation length. > The input format is V1(tab)V2 (where V2 is the friend of V1). > My purpose is to first randomly select some seed nodes, and then for > each node, calculate the shortest paths from this node to all other > nodes on the graph. > > To do this, I first run a simple python code in a single machine to get > some random seed nodes. > Then I run a hadoop job to generate adjacent list for each node as the > input for the second job. > > The second job takes the adjacent list input and output the first level > breadthfirst search result. The nodes which are the friends of the seed > node have distance 1. Then this output is the input for the next hadoop > job so on so forth, until all the nodes are reached. > > I generated a simulated graph for testing. This data has only 100 nodes. > Normal python code can find the separation length within 1 second (100 > seed nodes). However, the hadoop took almost 3 hours to do that > (pseudodistributed mode on one machine)!! > > I wonder if there is a more efficient way to do breadthfirst search in > hadoop? It is very inefficient to output so many intermediate results. > Totally there would be seedNodeNumber*levelNumber+1 jobs, > seedNodeNumber*levelNumber intermediate files. Why is hadoop so slow? > > Please help. Thanks! > > Wei >  Best Regards, Edward J. Yoon [EMAIL PROTECTED] http://blog.udanax.org 
RE: breadthfirst searchRicky Ho 20101221, 03:00
I also blog about how to to Single Source Shortest Path here at
http://horicky.blogspot.com/2010/02/nosqlgraphdb.html One MR algorithm is based on Dijkstra and the other is based on BFS. I think the first one is more efficient than the second one. Rgds, Ricky Original Message From: Peng, Wei [mailto:[EMAIL PROTECTED]] Sent: Monday, December 20, 2010 5:50 PM To: [EMAIL PROTECTED] Subject: breadthfirst search I implemented an algorithm to run hadoop on a 25GB graph data to calculate its average separation length. The input format is V1(tab)V2 (where V2 is the friend of V1). My purpose is to first randomly select some seed nodes, and then for each node, calculate the shortest paths from this node to all other nodes on the graph. To do this, I first run a simple python code in a single machine to get some random seed nodes. Then I run a hadoop job to generate adjacent list for each node as the input for the second job. The second job takes the adjacent list input and output the first level breadthfirst search result. The nodes which are the friends of the seed node have distance 1. Then this output is the input for the next hadoop job so on so forth, until all the nodes are reached. I generated a simulated graph for testing. This data has only 100 nodes. Normal python code can find the separation length within 1 second (100 seed nodes). However, the hadoop took almost 3 hours to do that (pseudodistributed mode on one machine)!! I wonder if there is a more efficient way to do breadthfirst search in hadoop? It is very inefficient to output so many intermediate results. Totally there would be seedNodeNumber*levelNumber+1 jobs, seedNodeNumber*levelNumber intermediate files. Why is hadoop so slow? Please help. Thanks! Wei 
RE: breadthfirst searchPeng, Wei 20101221, 04:16
Thanks Ricky.
I do not think the Dijkstra is more efficient than BFS (costly when looking for the node with minimum distance, and we do not need to do that when there is no edge weight). BFS should be the special version of Dijkstra when the weight of edges are all equal. In your algorithm, you are saving everything in memory. That is unrealistic for a large graph. My question is really about what is the efficient way for graph computation, matrix computation, algorithms that need many iterations to converge (with intermediate results). HAMA looks like a very good solution, but can we use it now and how to use it? Wei Original Message From: Ricky Ho [mailto:[EMAIL PROTECTED]] Sent: Monday, December 20, 2010 7:01 PM To: [EMAIL PROTECTED] Subject: RE: breadthfirst search I also blog about how to to Single Source Shortest Path here at http://horicky.blogspot.com/2010/02/nosqlgraphdb.html One MR algorithm is based on Dijkstra and the other is based on BFS. I think the first one is more efficient than the second one. Rgds, Ricky Original Message From: Peng, Wei [mailto:[EMAIL PROTECTED]] Sent: Monday, December 20, 2010 5:50 PM To: [EMAIL PROTECTED] Subject: breadthfirst search I implemented an algorithm to run hadoop on a 25GB graph data to calculate its average separation length. The input format is V1(tab)V2 (where V2 is the friend of V1). My purpose is to first randomly select some seed nodes, and then for each node, calculate the shortest paths from this node to all other nodes on the graph. To do this, I first run a simple python code in a single machine to get some random seed nodes. Then I run a hadoop job to generate adjacent list for each node as the input for the second job. The second job takes the adjacent list input and output the first level breadthfirst search result. The nodes which are the friends of the seed node have distance 1. Then this output is the input for the next hadoop job so on so forth, until all the nodes are reached. I generated a simulated graph for testing. This data has only 100 nodes. Normal python code can find the separation length within 1 second (100 seed nodes). However, the hadoop took almost 3 hours to do that (pseudodistributed mode on one machine)!! I wonder if there is a more efficient way to do breadthfirst search in hadoop? It is very inefficient to output so many intermediate results. Totally there would be seedNodeNumber*levelNumber+1 jobs, seedNodeNumber*levelNumber intermediate files. Why is hadoop so slow? Please help. Thanks! Wei 
Re: breadthfirst searchTed Dunning 20101221, 05:02
On Mon, Dec 20, 2010 at 8:16 PM, Peng, Wei <[EMAIL PROTECTED]> wrote:
> ... My question is really about what is the efficient way for graph > computation, matrix computation, algorithms that need many iterations to > converge (with intermediate results). > Large graph computations usually assume a sparse graph for historical reasons. A key property of scalable algorithms is that the time and space are linear in the input size. Most all path algorithms are not linear because the result is n x n and is dense. Some graph path computations can be done indirectly by spectral methods. With good random projection algorithms for sparse matrix decomposition, approximate versions of some of these algorithms can be phrased in a scalable fashion. It isn't an easy task, however. > HAMA looks like a very good solution, but can we use it now and how to > use it? > > I don't think that Hama has produced any usable software yet. 
RE: breadthfirst searchPeng, Wei 20101221, 05:43
Dunning,
Currently, most of the matrix data (graph matrix, documentword matrix) that we are dealing with are sparse. The matrix decomposition often needs many iterations to converge, then intermediate results have to be saved to serve as the input for the next iteration. This is super inefficient. As I mentioned, the BFS algorithm written in python took 1 second to run, however, hadoop took almost 3 hours. I would expect hadoop to be slower, but not this slow. All the hadoop applications that I saw are all very simple calculations, I wonder how it can be applied to machine learning/data mining algorithms. Is HAMA the only way to solve it? If it is not ready to use yet, then can I assume hadoop is not a good solution for multiple iteration algorithms now? Wei Original Message From: Ted Dunning [mailto:[EMAIL PROTECTED]] Sent: Monday, December 20, 2010 9:03 PM To: [EMAIL PROTECTED] Subject: Re: breadthfirst search On Mon, Dec 20, 2010 at 8:16 PM, Peng, Wei <[EMAIL PROTECTED]> wrote: > ... My question is really about what is the efficient way for graph > computation, matrix computation, algorithms that need many iterations to > converge (with intermediate results). > Large graph computations usually assume a sparse graph for historical reasons. A key property of scalable algorithms is that the time and space are linear in the input size. Most all path algorithms are not linear because the result is n x n and is dense. Some graph path computations can be done indirectly by spectral methods. With good random projection algorithms for sparse matrix decomposition, approximate versions of some of these algorithms can be phrased in a scalable fashion. It isn't an easy task, however. > HAMA looks like a very good solution, but can we use it now and how to > use it? > > I don't think that Hama has produced any usable software yet. 
Re: breadthfirst searchTed Dunning 20101221, 06:18
On Mon, Dec 20, 2010 at 9:43 PM, Peng, Wei <[EMAIL PROTECTED]> wrote:
> ... > Currently, most of the matrix data (graph matrix, documentword matrix) > that we are dealing with are sparse. > Good. > The matrix decomposition often needs many iterations to converge, then > intermediate results have to be saved to serve as the input for the next > iteration. > I think you are thinking of the wrong algorithms. The ones that I am talking about converge in a fixed and small number of steps. See https://issues.apache.org/jira/browse/MAHOUT376 for the work in progress on this. > This is super inefficient. As I mentioned, the BFS algorithm written in > python took 1 second to run, however, hadoop took almost 3 hours. I > would expect hadoop to be slower, but not this slow. > I think you have a combination of factors here. But, even accounting for having too many iterations in your BFS algorithm, iterations with stock Hadoop take 10s of seconds even if they do nothing. If you arrange your computation to need many iterations, it will be slow. All the hadoop applications that I saw are all very simple calculations, > I wonder how it can be applied to machine learning/data mining > algorithms. > Check out Mahout. There is a lot of machine learning going on there both on Hadoop and using other scalable methods. > Is HAMA the only way to solve it? If it is not ready to use yet, then > can I assume hadoop is not a good solution for multiple iteration > algorithms now? > I don't see much evidence that HAMA will ever solve anything, so I wouldn't recommend pinning your hopes on that. For fast, iterative mapreduce, you really need to keep your mappers and reducers live between iterations. Check out twister for that: http://www.iterativemapreduce.org/ 
Re: breadthfirst searchEdward J. Yoon 20101221, 06:27
There's no release yet.
But, I had tested the BFS using hama and, hbase. Sent from my iPhone On 2010. 12. 21., at 오전 11:30, "Peng, Wei" <[EMAIL PROTECTED]> wrote: > Yoon, > > Can I use HAMA now, or it is still in development? > > Thanks > > Wei > > Original Message > From: Edward J. Yoon [mailto:[EMAIL PROTECTED]] > Sent: Monday, December 20, 2010 6:23 PM > To: [EMAIL PROTECTED] > Subject: Re: breadthfirst search > > Check this slide out  > http://people.apache.org/~edwardyoon/papers/Apache_HAMA_BSP.pdf > > On Tue, Dec 21, 2010 at 10:49 AM, Peng, Wei <[EMAIL PROTECTED]> wrote: >> >> I implemented an algorithm to run hadoop on a 25GB graph data to >> calculate its average separation length. >> The input format is V1(tab)V2 (where V2 is the friend of V1). >> My purpose is to first randomly select some seed nodes, and then for >> each node, calculate the shortest paths from this node to all other >> nodes on the graph. >> >> To do this, I first run a simple python code in a single machine to get >> some random seed nodes. >> Then I run a hadoop job to generate adjacent list for each node as the >> input for the second job. >> >> The second job takes the adjacent list input and output the first level >> breadthfirst search result. The nodes which are the friends of the seed >> node have distance 1. Then this output is the input for the next hadoop >> job so on so forth, until all the nodes are reached. >> >> I generated a simulated graph for testing. This data has only 100 nodes. >> Normal python code can find the separation length within 1 second (100 >> seed nodes). However, the hadoop took almost 3 hours to do that >> (pseudodistributed mode on one machine)!! >> >> I wonder if there is a more efficient way to do breadthfirst search in >> hadoop? It is very inefficient to output so many intermediate results. >> Totally there would be seedNodeNumber*levelNumber+1 jobs, >> seedNodeNumber*levelNumber intermediate files. Why is hadoop so slow? >> >> Please help. Thanks! >> >> Wei >> > > > >  > Best Regards, Edward J. Yoon > [EMAIL PROTECTED] > http://blog.udanax.org 
RE: breadthfirst searchPeng, Wei 20101221, 09:59
The graph that my BFS algorithm is running on only needs 4 levels to reach all nodes. The reason I say "many iterations" is that there are 100 sources nodes, so totally 400 iterations. The algorithm should be right, and I cannot think of anything to reduce the number of iterations.
Ted, I will check out the links that you sent to me. I really appreciate your help. Wei Original Message From: Edward J. Yoon [mailto:[EMAIL PROTECTED]] Sent: Tuesday, December 21, 2010 1:27 AM To: [EMAIL PROTECTED] Subject: Re: breadthfirst search There's no release yet. But, I had tested the BFS using hama and, hbase. Sent from my iPhone On 2010. 12. 21., at 오전 11:30, "Peng, Wei" <[EMAIL PROTECTED]> wrote: > Yoon, > > Can I use HAMA now, or it is still in development? > > Thanks > > Wei > > Original Message > From: Edward J. Yoon [mailto:[EMAIL PROTECTED]] > Sent: Monday, December 20, 2010 6:23 PM > To: [EMAIL PROTECTED] > Subject: Re: breadthfirst search > > Check this slide out  > http://people.apache.org/~edwardyoon/papers/Apache_HAMA_BSP.pdf > > On Tue, Dec 21, 2010 at 10:49 AM, Peng, Wei <[EMAIL PROTECTED]> wrote: >> >> I implemented an algorithm to run hadoop on a 25GB graph data to >> calculate its average separation length. >> The input format is V1(tab)V2 (where V2 is the friend of V1). >> My purpose is to first randomly select some seed nodes, and then for >> each node, calculate the shortest paths from this node to all other >> nodes on the graph. >> >> To do this, I first run a simple python code in a single machine to get >> some random seed nodes. >> Then I run a hadoop job to generate adjacent list for each node as the >> input for the second job. >> >> The second job takes the adjacent list input and output the first level >> breadthfirst search result. The nodes which are the friends of the seed >> node have distance 1. Then this output is the input for the next hadoop >> job so on so forth, until all the nodes are reached. >> >> I generated a simulated graph for testing. This data has only 100 nodes. >> Normal python code can find the separation length within 1 second (100 >> seed nodes). However, the hadoop took almost 3 hours to do that >> (pseudodistributed mode on one machine)!! >> >> I wonder if there is a more efficient way to do breadthfirst search in >> hadoop? It is very inefficient to output so many intermediate results. >> Totally there would be seedNodeNumber*levelNumber+1 jobs, >> seedNodeNumber*levelNumber intermediate files. Why is hadoop so slow? >> >> Please help. Thanks! >> >> Wei >> > > > >  > Best Regards, Edward J. Yoon > [EMAIL PROTECTED] > http://blog.udanax.org 
Re: breadthfirst searchTed Dunning 20101221, 15:28
Ahh... I see what you mean.
This algorithm can be implemented with all of the iterations for all points proceeding in parallel. You should only need 4 mapreduce steps, not 400. This will still take several minutes on Hadoop, but as your problem increases in size, this overhead becomes less important. 2010/12/21 Peng, Wei <[EMAIL PROTECTED]> > The graph that my BFS algorithm is running on only needs 4 levels to reach > all nodes. The reason I say "many iterations" is that there are 100 sources > nodes, so totally 400 iterations. The algorithm should be right, and I > cannot think of anything to reduce the number of iterations. > > Ted, I will check out the links that you sent to me. > I really appreciate your help. > > Wei > Original Message > From: Edward J. Yoon [mailto:[EMAIL PROTECTED]] > Sent: Tuesday, December 21, 2010 1:27 AM > To: [EMAIL PROTECTED] > Subject: Re: breadthfirst search > > There's no release yet. > > But, I had tested the BFS using hama and, hbase. > > Sent from my iPhone > > On 2010. 12. 21., at 오전 11:30, "Peng, Wei" <[EMAIL PROTECTED]> wrote: > > > Yoon, > > > > Can I use HAMA now, or it is still in development? > > > > Thanks > > > > Wei > > > > Original Message > > From: Edward J. Yoon [mailto:[EMAIL PROTECTED]] > > Sent: Monday, December 20, 2010 6:23 PM > > To: [EMAIL PROTECTED] > > Subject: Re: breadthfirst search > > > > Check this slide out  > > http://people.apache.org/~edwardyoon/papers/Apache_HAMA_BSP.pdf > > > > On Tue, Dec 21, 2010 at 10:49 AM, Peng, Wei <[EMAIL PROTECTED]> wrote: > >> > >> I implemented an algorithm to run hadoop on a 25GB graph data to > >> calculate its average separation length. > >> The input format is V1(tab)V2 (where V2 is the friend of V1). > >> My purpose is to first randomly select some seed nodes, and then for > >> each node, calculate the shortest paths from this node to all other > >> nodes on the graph. > >> > >> To do this, I first run a simple python code in a single machine to get > >> some random seed nodes. > >> Then I run a hadoop job to generate adjacent list for each node as the > >> input for the second job. > >> > >> The second job takes the adjacent list input and output the first level > >> breadthfirst search result. The nodes which are the friends of the seed > >> node have distance 1. Then this output is the input for the next hadoop > >> job so on so forth, until all the nodes are reached. > >> > >> I generated a simulated graph for testing. This data has only 100 nodes. > >> Normal python code can find the separation length within 1 second (100 > >> seed nodes). However, the hadoop took almost 3 hours to do that > >> (pseudodistributed mode on one machine)!! > >> > >> I wonder if there is a more efficient way to do breadthfirst search in > >> hadoop? It is very inefficient to output so many intermediate results. > >> Totally there would be seedNodeNumber*levelNumber+1 jobs, > >> seedNodeNumber*levelNumber intermediate files. Why is hadoop so slow? > >> > >> Please help. Thanks! > >> > >> Wei > >> > > > > > > > >  > > Best Regards, Edward J. Yoon > > [EMAIL PROTECTED] > > http://blog.udanax.org > 
RE: breadthfirst searchRicky Ho 20101221, 16:53
Exactly, it should be done in 4 iterations instead of 4 * 100 iterations.
But in general, Graph algorithms is harder to do in Map/Reduce and the statelessness nature typically requires shuffling of the whole graph in each iteration. Jimmy Lin has described a technique called Skimmy to avoid that, which I described here at http://horicky.blogspot.com/2010/07/graphprocessinginmapreduce.html And of course, the BSP style is another good way to achieve this which is where Google's Pregel model based on, I also blog about this here at http://horicky.blogspot.com/2010/07/googlepregelgraphprocessing.html Notice that "scalability" and "speed" are different animals. Hadoop is about scalability but not speed. If your data can be squeezed into a single machine, then Hadoop is not for you. Otherwise, Hadoop is one great tool to parallelize your processing. It has a pretty constant (but significant) overhead that is justifiable when you have large amount of data. For example, Hadoop is not for processing realtime streams. http://horicky.blogspot.com/2010/11/mapreduceandstreamprocessing.html Notice that ... Hadoop is just one of the tool that you can use, and you decide whether you don't have better ones. Rgds, Ricky Original Message From: Ted Dunning [mailto:[EMAIL PROTECTED]] Sent: Tuesday, December 21, 2010 7:29 AM To: [EMAIL PROTECTED] Subject: Re: breadthfirst search Ahh... I see what you mean. This algorithm can be implemented with all of the iterations for all points proceeding in parallel. You should only need 4 mapreduce steps, not 400. This will still take several minutes on Hadoop, but as your problem increases in size, this overhead becomes less important. 2010/12/21 Peng, Wei <[EMAIL PROTECTED]> > The graph that my BFS algorithm is running on only needs 4 levels to reach > all nodes. The reason I say "many iterations" is that there are 100 sources > nodes, so totally 400 iterations. The algorithm should be right, and I > cannot think of anything to reduce the number of iterations. > > Ted, I will check out the links that you sent to me. > I really appreciate your help. > > Wei > Original Message > From: Edward J. Yoon [mailto:[EMAIL PROTECTED]] > Sent: Tuesday, December 21, 2010 1:27 AM > To: [EMAIL PROTECTED] > Subject: Re: breadthfirst search > > There's no release yet. > > But, I had tested the BFS using hama and, hbase. > > Sent from my iPhone > > On 2010. 12. 21., at 오전 11:30, "Peng, Wei" <[EMAIL PROTECTED]> wrote: > > > Yoon, > > > > Can I use HAMA now, or it is still in development? > > > > Thanks > > > > Wei > > > > Original Message > > From: Edward J. Yoon [mailto:[EMAIL PROTECTED]] > > Sent: Monday, December 20, 2010 6:23 PM > > To: [EMAIL PROTECTED] > > Subject: Re: breadthfirst search > > > > Check this slide out  > > http://people.apache.org/~edwardyoon/papers/Apache_HAMA_BSP.pdf > > > > On Tue, Dec 21, 2010 at 10:49 AM, Peng, Wei <[EMAIL PROTECTED]> wrote: > >> > >> I implemented an algorithm to run hadoop on a 25GB graph data to > >> calculate its average separation length. > >> The input format is V1(tab)V2 (where V2 is the friend of V1). > >> My purpose is to first randomly select some seed nodes, and then for > >> each node, calculate the shortest paths from this node to all other > >> nodes on the graph. > >> > >> To do this, I first run a simple python code in a single machine to get > >> some random seed nodes. > >> Then I run a hadoop job to generate adjacent list for each node as the > >> input for the second job. > >> > >> The second job takes the adjacent list input and output the first level > >> breadthfirst search result. The nodes which are the friends of the seed > >> node have distance 1. Then this output is the input for the next hadoop > >> job so on so forth, until all the nodes are reached. > >> > >> I generated a simulated graph for testing. This data has only 100 nodes. 
RE: breadthfirst searchPeng, Wei 20101221, 18:04
Thank you all for all the replies.
I know I can run 100 source nodes in parallel. It is just surprising to me that the python code and hadoop code have such a big efficiency difference (they are both running in sequential on 100 source nodes). Hadoop is useful when the data is huge and cannot fit into memory, but it does not seem to be a realtime solution. There are so many to learn. Thank you again for all your kind help. Wei Original Message From: Ricky Ho [mailto:[EMAIL PROTECTED]] Sent: Tuesday, December 21, 2010 11:53 AM To: [EMAIL PROTECTED] Subject: RE: breadthfirst search Exactly, it should be done in 4 iterations instead of 4 * 100 iterations. But in general, Graph algorithms is harder to do in Map/Reduce and the statelessness nature typically requires shuffling of the whole graph in each iteration. Jimmy Lin has described a technique called Skimmy to avoid that, which I described here at http://horicky.blogspot.com/2010/07/graphprocessinginmapreduce.html And of course, the BSP style is another good way to achieve this which is where Google's Pregel model based on, I also blog about this here at http://horicky.blogspot.com/2010/07/googlepregelgraphprocessing.html Notice that "scalability" and "speed" are different animals. Hadoop is about scalability but not speed. If your data can be squeezed into a single machine, then Hadoop is not for you. Otherwise, Hadoop is one great tool to parallelize your processing. It has a pretty constant (but significant) overhead that is justifiable when you have large amount of data. For example, Hadoop is not for processing realtime streams. http://horicky.blogspot.com/2010/11/mapreduceandstreamprocessing.html Notice that ... Hadoop is just one of the tool that you can use, and you decide whether you don't have better ones. Rgds, Ricky Original Message From: Ted Dunning [mailto:[EMAIL PROTECTED]] Sent: Tuesday, December 21, 2010 7:29 AM To: [EMAIL PROTECTED] Subject: Re: breadthfirst search Ahh... I see what you mean. This algorithm can be implemented with all of the iterations for all points proceeding in parallel. You should only need 4 mapreduce steps, not 400. This will still take several minutes on Hadoop, but as your problem increases in size, this overhead becomes less important. 2010/12/21 Peng, Wei <[EMAIL PROTECTED]> > The graph that my BFS algorithm is running on only needs 4 levels to reach > all nodes. The reason I say "many iterations" is that there are 100 sources > nodes, so totally 400 iterations. The algorithm should be right, and I > cannot think of anything to reduce the number of iterations. > > Ted, I will check out the links that you sent to me. > I really appreciate your help. > > Wei > Original Message > From: Edward J. Yoon [mailto:[EMAIL PROTECTED]] > Sent: Tuesday, December 21, 2010 1:27 AM > To: [EMAIL PROTECTED] > Subject: Re: breadthfirst search > > There's no release yet. > > But, I had tested the BFS using hama and, hbase. > > Sent from my iPhone > > On 2010. 12. 21., at 오전 11:30, "Peng, Wei" <[EMAIL PROTECTED]> wrote: > > > Yoon, > > > > Can I use HAMA now, or it is still in development? > > > > Thanks > > > > Wei > > > > Original Message > > From: Edward J. Yoon [mailto:[EMAIL PROTECTED]] > > Sent: Monday, December 20, 2010 6:23 PM > > To: [EMAIL PROTECTED] > > Subject: Re: breadthfirst search > > > > Check this slide out  > > http://people.apache.org/~edwardyoon/papers/Apache_HAMA_BSP.pdf > > > > On Tue, Dec 21, 2010 at 10:49 AM, Peng, Wei <[EMAIL PROTECTED]> wrote: > >> > >> I implemented an algorithm to run hadoop on a 25GB graph data to > >> calculate its average separation length. > >> The input format is V1(tab)V2 (where V2 is the friend of V1). > >> My purpose is to first randomly select some seed nodes, and then for > >> each node, calculate the shortest paths from this node to all other 
Re: breadthfirst searchTed Dunning 20101221, 18:09
Absolutely true. Nobody should pretend otherwise.
On Tue, Dec 21, 2010 at 10:04 AM, Peng, Wei <[EMAIL PROTECTED]> wrote: > Hadoop is useful when the data is huge and cannot fit into memory, but it > does not seem to be a realtime solution. 
RE: breadthfirst searchPeng, Wei 20101222, 02:07
I was just trying to run 100 source nodes in multiple threads, but the
mapreduce tasks still look like to run in sequential. Do I need to configure hadoop somehow for multiple threads? Assign more task slots? How? Thanks Original Message From: Ted Dunning [mailto:[EMAIL PROTECTED]] Sent: Tuesday, December 21, 2010 1:10 PM To: [EMAIL PROTECTED] Subject: Re: breadthfirst search Absolutely true. Nobody should pretend otherwise. On Tue, Dec 21, 2010 at 10:04 AM, Peng, Wei <[EMAIL PROTECTED]> wrote: > Hadoop is useful when the data is huge and cannot fit into memory, but it > does not seem to be a realtime solution. 
Re: breadthfirst searchEdward J. Yoon 20101222, 06:54
> I don't see much evidence that HAMA will ever solve anything, so I wouldn't
> recommend pinning your hopes on that. Hello ted. What is your opinion based on? Have you tried to run the HAMA TRUNK? If not, you don't know about our progress in the last year. :) On Tue, Dec 21, 2010 at 3:18 PM, Ted Dunning <[EMAIL PROTECTED]> wrote: > On Mon, Dec 20, 2010 at 9:43 PM, Peng, Wei <[EMAIL PROTECTED]> wrote: > >> ... >> Currently, most of the matrix data (graph matrix, documentword matrix) >> that we are dealing with are sparse. >> > > Good. > > >> The matrix decomposition often needs many iterations to converge, then >> intermediate results have to be saved to serve as the input for the next >> iteration. >> > > I think you are thinking of the wrong algorithms. The ones that I am > talking about converge > in a fixed and small number of steps. See > https://issues.apache.org/jira/browse/MAHOUT376 for the work > in progress on this. > > >> This is super inefficient. As I mentioned, the BFS algorithm written in >> python took 1 second to run, however, hadoop took almost 3 hours. I >> would expect hadoop to be slower, but not this slow. >> > > I think you have a combination of factors here. But, even accounting for > having too many > iterations in your BFS algorithm, iterations with stock Hadoop take 10s of > seconds even if > they do nothing. If you arrange your computation to need many iterations, > it will be slow. > > > All the hadoop applications that I saw are all very simple calculations, >> I wonder how it can be applied to machine learning/data mining >> algorithms. >> > > Check out Mahout. There is a lot of machine learning going on there both on > Hadoop and using other scalable methods. > > >> Is HAMA the only way to solve it? If it is not ready to use yet, then >> can I assume hadoop is not a good solution for multiple iteration >> algorithms now? >> > > I don't see much evidence that HAMA will ever solve anything, so I wouldn't > recommend pinning your hopes on that. > > For fast, iterative mapreduce, you really need to keep your mappers and > reducers live between iterations. Check out > twister for that: http://www.iterativemapreduce.org/ >  Best Regards, Edward J. Yoon [EMAIL PROTECTED] http://blog.udanax.org 
RE: breadthfirst searchPeng, Wei 20101222, 16:57
Can someone tell me whether we can run multiple threads in hadoop?
Thanks Wei Original Message From: Peng, Wei [mailto:[EMAIL PROTECTED]] Sent: Tuesday, December 21, 2010 9:07 PM To: [EMAIL PROTECTED] Subject: RE: breadthfirst search I was just trying to run 100 source nodes in multiple threads, but the mapreduce tasks still look like to run in sequential. Do I need to configure hadoop somehow for multiple threads? Assign more task slots? How? Thanks Original Message From: Ted Dunning [mailto:[EMAIL PROTECTED]] Sent: Tuesday, December 21, 2010 1:10 PM To: [EMAIL PROTECTED] Subject: Re: breadthfirst search Absolutely true. Nobody should pretend otherwise. On Tue, Dec 21, 2010 at 10:04 AM, Peng, Wei <[EMAIL PROTECTED]> wrote: > Hadoop is useful when the data is huge and cannot fit into memory, but it > does not seem to be a realtime solution. 
RE: breadthfirst searchRicky Ho 20101222, 18:46
You can do whatever your want (including spawning threads) in the Mapper process
(which is fork/exec by the TaskTracker). But this doesn't help I think you need to understand the fundamental difference between the 2 parallel processing models 1) Multithreading Small scale parallelism limited to number of cores within a single machine. Multiple execution threads with a shared memory model. And a lot of synchronization primitives to coordinate access to share data 2) Map/Reduce Large scale parallelism involving large number of machines (hundreds to thousands) Data is shuffled through 2 layers of machines via a special topology (output records of the same key from layer1 will land on the same place at layer 2) The first layer is conducting a perrecord transformation (map) and the second layer is conducting a consolidation (reduce) These 2 models has a very different notions of synchronization, one is using fine grain locking and the other is share nothing, you won't be consider them to be alternatives to each other. They are intended to solve very very different problems. Can they be used together ? Absolutely yes. But you need to design how you want to partition your problem ... For example, you can consider partitioning your graph into subgraphs so each Mapper/Reducer is dealing with a bigger subgraph rather than individual nodes. Of course you need to think about how to combine the subgraph results, and whether you need absolutely accurate answer or an approximation is good enough. I bet you are the later one and so should be more easy. Ted, Can you point me to Matrix algorithms that is tuned for sparse graph ? What I mean is from O(v^3) to O(v*e) where v = number of vertex and e = number of edges. Rgds, Ricky Original Message From: Peng, Wei [mailto:[EMAIL PROTECTED]] Sent: Wednesday, December 22, 2010 8:58 AM To: [EMAIL PROTECTED] Subject: RE: breadthfirst search Can someone tell me whether we can run multiple threads in hadoop? Thanks Wei Original Message From: Peng, Wei [mailto:[EMAIL PROTECTED]] Sent: Tuesday, December 21, 2010 9:07 PM To: [EMAIL PROTECTED] Subject: RE: breadthfirst search I was just trying to run 100 source nodes in multiple threads, but the mapreduce tasks still look like to run in sequential. Do I need to configure hadoop somehow for multiple threads? Assign more task slots? How? Thanks 
Re: breadthfirst searchTed Dunning 20101222, 19:00
The Mahout math package has a number of basic algorithms that use
algorithmic efficiencies when given sparse graphs. A number of other algorithms use only the product of a sparse matrix on another matrix or a vector. Since these algorithms never change the original sparse matrix, they are safe against fillin problems. The random projection technique avoids O(v^3) algorithms for computing SVD or related matrix decompositions. See http://arxiv.org/abs/0909.4061 and https://issues.apache.org/jira/browse/MAHOUT376 None of these these algorithms are specific to graph theory, but all deal with methods that are useful with sparse graphs. On Wed, Dec 22, 2010 at 10:46 AM, Ricky Ho <[EMAIL PROTECTED]> wrote: > Can you point me to Matrix algorithms that is tuned for sparse graph ? > What I > mean is from O(v^3) to O(v*e) where v = number of vertex and e = number of > edges. > 
RE: breadthfirst searchPeng, Wei 20101222, 19:51
Thanks for quick response.
Partitioning graphs into subgraphs and later combining the results is too complicated to do. I prefer a simple method. Currently, I do not want to divide the breadthfirst search from a single source. I just want to run 100 breadthfirst search from 100 source nodes with 100 threads running in parallel. The problem is that these 100 threads do not seem to run parallel, however, they seem to run in sequential. I have searched online. Some people mention that all tasks are put into queues waiting for free mapreduce slots. It is might be due to not enough slots. How to deal with this problem? Wei Original Message From: Ted Dunning [mailto:[EMAIL PROTECTED]] Sent: Wednesday, December 22, 2010 2:01 PM To: [EMAIL PROTECTED] Subject: Re: breadthfirst search The Mahout math package has a number of basic algorithms that use algorithmic efficiencies when given sparse graphs. A number of other algorithms use only the product of a sparse matrix on another matrix or a vector. Since these algorithms never change the original sparse matrix, they are safe against fillin problems. The random projection technique avoids O(v^3) algorithms for computing SVD or related matrix decompositions. See http://arxiv.org/abs/0909.4061 and https://issues.apache.org/jira/browse/MAHOUT376 None of these these algorithms are specific to graph theory, but all deal with methods that are useful with sparse graphs. On Wed, Dec 22, 2010 at 10:46 AM, Ricky Ho <[EMAIL PROTECTED]> wrote: > Can you point me to Matrix algorithms that is tuned for sparse graph ? > What I > mean is from O(v^3) to O(v*e) where v = number of vertex and e number of > edges. > 
Re: breadthfirst searchTed Yu 20101222, 20:35
Modify the following parameters:
mapred.tasktracker.map.tasks.maximum mapred.tasktracker.reduce.tasks.maximum mapred.map.tasks mapred.reduce.tasks FYI you need to adjust the Xmx for your mapper/reducer after increasing the values for above parameters On Wed, Dec 22, 2010 at 11:51 AM, Peng, Wei <[EMAIL PROTECTED]> wrote: > Thanks for quick response. > > Partitioning graphs into subgraphs and later combining the results is > too complicated to do. I prefer a simple method. > > Currently, I do not want to divide the breadthfirst search from a > single source. I just want to run 100 breadthfirst search from 100 > source nodes with 100 threads running in parallel. > The problem is that these 100 threads do not seem to run parallel, > however, they seem to run in sequential. I have searched online. Some > people mention that all tasks are put into queues waiting for free > mapreduce slots. It is might be due to not enough slots. > How to deal with this problem? > > Wei > > > Original Message > From: Ted Dunning [mailto:[EMAIL PROTECTED]] > Sent: Wednesday, December 22, 2010 2:01 PM > To: [EMAIL PROTECTED] > Subject: Re: breadthfirst search > > The Mahout math package has a number of basic algorithms that use > algorithmic efficiencies when given sparse graphs. > > A number of other algorithms use only the product of a sparse matrix on > another matrix or a vector. Since these algorithms never change the > original sparse matrix, they are safe against fillin problems. > > The random projection technique avoids O(v^3) algorithms for computing > SVD > or related matrix decompositions. See http://arxiv.org/abs/0909.4061 > and > https://issues.apache.org/jira/browse/MAHOUT376 > > None of these these algorithms are specific to graph theory, but all > deal > with methods that are useful with sparse graphs. > > On Wed, Dec 22, 2010 at 10:46 AM, Ricky Ho <[EMAIL PROTECTED]> > wrote: > > > Can you point me to Matrix algorithms that is tuned for sparse graph ? > > What I > > mean is from O(v^3) to O(v*e) where v = number of vertex and e > number of > > edges. > > > 