|
Peng, Wei
2010-12-21, 01:49
Edward J. Yoon
2010-12-21, 02:22
Peng, Wei
2010-12-21, 02:30
Ricky Ho
2010-12-21, 03:00
Peng, Wei
2010-12-21, 04:16
Ted Dunning
2010-12-21, 05:02
Peng, Wei
2010-12-21, 05:43
Ted Dunning
2010-12-21, 06:18
Edward J. Yoon
2010-12-21, 06:27
Peng, Wei
2010-12-21, 09:59
Ted Dunning
2010-12-21, 15:28
Ricky Ho
2010-12-21, 16:53
Peng, Wei
2010-12-21, 18:04
Ted Dunning
2010-12-21, 18:09
Peng, Wei
2010-12-22, 02:07
Edward J. Yoon
2010-12-22, 06:54
Peng, Wei
2010-12-22, 16:57
Ricky Ho
2010-12-22, 18:46
Ted Dunning
2010-12-22, 19:00
Peng, Wei
2010-12-22, 19:51
Ted Yu
2010-12-22, 20:35
|
-
breadth-first searchPeng, Wei 2010-12-21, 01:49
I implemented an algorithm to run hadoop on a 25GB graph data to calculate its average separation length. The input format is V1(tab)V2 (where V2 is the friend of V1). My purpose is to first randomly select some seed nodes, and then for each node, calculate the shortest paths from this node to all other nodes on the graph. To do this, I first run a simple python code in a single machine to get some random seed nodes. Then I run a hadoop job to generate adjacent list for each node as the input for the second job. The second job takes the adjacent list input and output the first level breadth-first search result. The nodes which are the friends of the seed node have distance 1. Then this output is the input for the next hadoop job so on so forth, until all the nodes are reached. I generated a simulated graph for testing. This data has only 100 nodes. Normal python code can find the separation length within 1 second (100 seed nodes). However, the hadoop took almost 3 hours to do that (pseudo-distributed mode on one machine)!! I wonder if there is a more efficient way to do breadth-first search in hadoop? It is very inefficient to output so many intermediate results. Totally there would be seedNodeNumber*levelNumber+1 jobs, seedNodeNumber*levelNumber intermediate files. Why is hadoop so slow? Please help. Thanks! Wei
-
Re: breadth-first searchEdward J. Yoon 2010-12-21, 02:22
Check this slide out -
http://people.apache.org/~edwardyoon/papers/Apache_HAMA_BSP.pdf On Tue, Dec 21, 2010 at 10:49 AM, Peng, Wei <[EMAIL PROTECTED]> wrote: > > I implemented an algorithm to run hadoop on a 25GB graph data to > calculate its average separation length. > The input format is V1(tab)V2 (where V2 is the friend of V1). > My purpose is to first randomly select some seed nodes, and then for > each node, calculate the shortest paths from this node to all other > nodes on the graph. > > To do this, I first run a simple python code in a single machine to get > some random seed nodes. > Then I run a hadoop job to generate adjacent list for each node as the > input for the second job. > > The second job takes the adjacent list input and output the first level > breadth-first search result. The nodes which are the friends of the seed > node have distance 1. Then this output is the input for the next hadoop > job so on so forth, until all the nodes are reached. > > I generated a simulated graph for testing. This data has only 100 nodes. > Normal python code can find the separation length within 1 second (100 > seed nodes). However, the hadoop took almost 3 hours to do that > (pseudo-distributed mode on one machine)!! > > I wonder if there is a more efficient way to do breadth-first search in > hadoop? It is very inefficient to output so many intermediate results. > Totally there would be seedNodeNumber*levelNumber+1 jobs, > seedNodeNumber*levelNumber intermediate files. Why is hadoop so slow? > > Please help. Thanks! > > Wei > -- Best Regards, Edward J. Yoon [EMAIL PROTECTED] http://blog.udanax.org
-
RE: breadth-first searchPeng, Wei 2010-12-21, 02:30
Yoon,
Can I use HAMA now, or it is still in development? Thanks Wei -----Original Message----- From: Edward J. Yoon [mailto:[EMAIL PROTECTED]] Sent: Monday, December 20, 2010 6:23 PM To: [EMAIL PROTECTED] Subject: Re: breadth-first search Check this slide out - http://people.apache.org/~edwardyoon/papers/Apache_HAMA_BSP.pdf On Tue, Dec 21, 2010 at 10:49 AM, Peng, Wei <[EMAIL PROTECTED]> wrote: > > I implemented an algorithm to run hadoop on a 25GB graph data to > calculate its average separation length. > The input format is V1(tab)V2 (where V2 is the friend of V1). > My purpose is to first randomly select some seed nodes, and then for > each node, calculate the shortest paths from this node to all other > nodes on the graph. > > To do this, I first run a simple python code in a single machine to get > some random seed nodes. > Then I run a hadoop job to generate adjacent list for each node as the > input for the second job. > > The second job takes the adjacent list input and output the first level > breadth-first search result. The nodes which are the friends of the seed > node have distance 1. Then this output is the input for the next hadoop > job so on so forth, until all the nodes are reached. > > I generated a simulated graph for testing. This data has only 100 nodes. > Normal python code can find the separation length within 1 second (100 > seed nodes). However, the hadoop took almost 3 hours to do that > (pseudo-distributed mode on one machine)!! > > I wonder if there is a more efficient way to do breadth-first search in > hadoop? It is very inefficient to output so many intermediate results. > Totally there would be seedNodeNumber*levelNumber+1 jobs, > seedNodeNumber*levelNumber intermediate files. Why is hadoop so slow? > > Please help. Thanks! > > Wei > -- Best Regards, Edward J. Yoon [EMAIL PROTECTED] http://blog.udanax.org
-
RE: breadth-first searchRicky Ho 2010-12-21, 03:00
I also blog about how to to Single Source Shortest Path here at
http://horicky.blogspot.com/2010/02/nosql-graphdb.html One MR algorithm is based on Dijkstra and the other is based on BFS. I think the first one is more efficient than the second one. Rgds, Ricky -----Original Message----- From: Peng, Wei [mailto:[EMAIL PROTECTED]] Sent: Monday, December 20, 2010 5:50 PM To: [EMAIL PROTECTED] Subject: breadth-first search I implemented an algorithm to run hadoop on a 25GB graph data to calculate its average separation length. The input format is V1(tab)V2 (where V2 is the friend of V1). My purpose is to first randomly select some seed nodes, and then for each node, calculate the shortest paths from this node to all other nodes on the graph. To do this, I first run a simple python code in a single machine to get some random seed nodes. Then I run a hadoop job to generate adjacent list for each node as the input for the second job. The second job takes the adjacent list input and output the first level breadth-first search result. The nodes which are the friends of the seed node have distance 1. Then this output is the input for the next hadoop job so on so forth, until all the nodes are reached. I generated a simulated graph for testing. This data has only 100 nodes. Normal python code can find the separation length within 1 second (100 seed nodes). However, the hadoop took almost 3 hours to do that (pseudo-distributed mode on one machine)!! I wonder if there is a more efficient way to do breadth-first search in hadoop? It is very inefficient to output so many intermediate results. Totally there would be seedNodeNumber*levelNumber+1 jobs, seedNodeNumber*levelNumber intermediate files. Why is hadoop so slow? Please help. Thanks! Wei
-
RE: breadth-first searchPeng, Wei 2010-12-21, 04:16
Thanks Ricky.
I do not think the Dijkstra is more efficient than BFS (costly when looking for the node with minimum distance, and we do not need to do that when there is no edge weight). BFS should be the special version of Dijkstra when the weight of edges are all equal. In your algorithm, you are saving everything in memory. That is unrealistic for a large graph. My question is really about what is the efficient way for graph computation, matrix computation, algorithms that need many iterations to converge (with intermediate results). HAMA looks like a very good solution, but can we use it now and how to use it? Wei -----Original Message----- From: Ricky Ho [mailto:[EMAIL PROTECTED]] Sent: Monday, December 20, 2010 7:01 PM To: [EMAIL PROTECTED] Subject: RE: breadth-first search I also blog about how to to Single Source Shortest Path here at http://horicky.blogspot.com/2010/02/nosql-graphdb.html One MR algorithm is based on Dijkstra and the other is based on BFS. I think the first one is more efficient than the second one. Rgds, Ricky -----Original Message----- From: Peng, Wei [mailto:[EMAIL PROTECTED]] Sent: Monday, December 20, 2010 5:50 PM To: [EMAIL PROTECTED] Subject: breadth-first search I implemented an algorithm to run hadoop on a 25GB graph data to calculate its average separation length. The input format is V1(tab)V2 (where V2 is the friend of V1). My purpose is to first randomly select some seed nodes, and then for each node, calculate the shortest paths from this node to all other nodes on the graph. To do this, I first run a simple python code in a single machine to get some random seed nodes. Then I run a hadoop job to generate adjacent list for each node as the input for the second job. The second job takes the adjacent list input and output the first level breadth-first search result. The nodes which are the friends of the seed node have distance 1. Then this output is the input for the next hadoop job so on so forth, until all the nodes are reached. I generated a simulated graph for testing. This data has only 100 nodes. Normal python code can find the separation length within 1 second (100 seed nodes). However, the hadoop took almost 3 hours to do that (pseudo-distributed mode on one machine)!! I wonder if there is a more efficient way to do breadth-first search in hadoop? It is very inefficient to output so many intermediate results. Totally there would be seedNodeNumber*levelNumber+1 jobs, seedNodeNumber*levelNumber intermediate files. Why is hadoop so slow? Please help. Thanks! Wei
-
Re: breadth-first searchTed Dunning 2010-12-21, 05:02
On Mon, Dec 20, 2010 at 8:16 PM, Peng, Wei <[EMAIL PROTECTED]> wrote:
> ... My question is really about what is the efficient way for graph > computation, matrix computation, algorithms that need many iterations to > converge (with intermediate results). > Large graph computations usually assume a sparse graph for historical reasons. A key property of scalable algorithms is that the time and space are linear in the input size. Most all path algorithms are not linear because the result is n x n and is dense. Some graph path computations can be done indirectly by spectral methods. With good random projection algorithms for sparse matrix decomposition, approximate versions of some of these algorithms can be phrased in a scalable fashion. It isn't an easy task, however. > HAMA looks like a very good solution, but can we use it now and how to > use it? > > I don't think that Hama has produced any usable software yet.
-
RE: breadth-first searchPeng, Wei 2010-12-21, 05:43
Dunning,
Currently, most of the matrix data (graph matrix, document-word matrix) that we are dealing with are sparse. The matrix decomposition often needs many iterations to converge, then intermediate results have to be saved to serve as the input for the next iteration. This is super inefficient. As I mentioned, the BFS algorithm written in python took 1 second to run, however, hadoop took almost 3 hours. I would expect hadoop to be slower, but not this slow. All the hadoop applications that I saw are all very simple calculations, I wonder how it can be applied to machine learning/data mining algorithms. Is HAMA the only way to solve it? If it is not ready to use yet, then can I assume hadoop is not a good solution for multiple iteration algorithms now? Wei -----Original Message----- From: Ted Dunning [mailto:[EMAIL PROTECTED]] Sent: Monday, December 20, 2010 9:03 PM To: [EMAIL PROTECTED] Subject: Re: breadth-first search On Mon, Dec 20, 2010 at 8:16 PM, Peng, Wei <[EMAIL PROTECTED]> wrote: > ... My question is really about what is the efficient way for graph > computation, matrix computation, algorithms that need many iterations to > converge (with intermediate results). > Large graph computations usually assume a sparse graph for historical reasons. A key property of scalable algorithms is that the time and space are linear in the input size. Most all path algorithms are not linear because the result is n x n and is dense. Some graph path computations can be done indirectly by spectral methods. With good random projection algorithms for sparse matrix decomposition, approximate versions of some of these algorithms can be phrased in a scalable fashion. It isn't an easy task, however. > HAMA looks like a very good solution, but can we use it now and how to > use it? > > I don't think that Hama has produced any usable software yet.
-
Re: breadth-first searchTed Dunning 2010-12-21, 06:18
On Mon, Dec 20, 2010 at 9:43 PM, Peng, Wei <[EMAIL PROTECTED]> wrote:
> ... > Currently, most of the matrix data (graph matrix, document-word matrix) > that we are dealing with are sparse. > Good. > The matrix decomposition often needs many iterations to converge, then > intermediate results have to be saved to serve as the input for the next > iteration. > I think you are thinking of the wrong algorithms. The ones that I am talking about converge in a fixed and small number of steps. See https://issues.apache.org/jira/browse/MAHOUT-376 for the work in progress on this. > This is super inefficient. As I mentioned, the BFS algorithm written in > python took 1 second to run, however, hadoop took almost 3 hours. I > would expect hadoop to be slower, but not this slow. > I think you have a combination of factors here. But, even accounting for having too many iterations in your BFS algorithm, iterations with stock Hadoop take 10s of seconds even if they do nothing. If you arrange your computation to need many iterations, it will be slow. All the hadoop applications that I saw are all very simple calculations, > I wonder how it can be applied to machine learning/data mining > algorithms. > Check out Mahout. There is a lot of machine learning going on there both on Hadoop and using other scalable methods. > Is HAMA the only way to solve it? If it is not ready to use yet, then > can I assume hadoop is not a good solution for multiple iteration > algorithms now? > I don't see much evidence that HAMA will ever solve anything, so I wouldn't recommend pinning your hopes on that. For fast, iterative map-reduce, you really need to keep your mappers and reducers live between iterations. Check out twister for that: http://www.iterativemapreduce.org/
-
Re: breadth-first searchEdward J. Yoon 2010-12-21, 06:27
There's no release yet.
But, I had tested the BFS using hama and, hbase. Sent from my iPhone On 2010. 12. 21., at 오전 11:30, "Peng, Wei" <[EMAIL PROTECTED]> wrote: > Yoon, > > Can I use HAMA now, or it is still in development? > > Thanks > > Wei > > -----Original Message----- > From: Edward J. Yoon [mailto:[EMAIL PROTECTED]] > Sent: Monday, December 20, 2010 6:23 PM > To: [EMAIL PROTECTED] > Subject: Re: breadth-first search > > Check this slide out - > http://people.apache.org/~edwardyoon/papers/Apache_HAMA_BSP.pdf > > On Tue, Dec 21, 2010 at 10:49 AM, Peng, Wei <[EMAIL PROTECTED]> wrote: >> >> I implemented an algorithm to run hadoop on a 25GB graph data to >> calculate its average separation length. >> The input format is V1(tab)V2 (where V2 is the friend of V1). >> My purpose is to first randomly select some seed nodes, and then for >> each node, calculate the shortest paths from this node to all other >> nodes on the graph. >> >> To do this, I first run a simple python code in a single machine to get >> some random seed nodes. >> Then I run a hadoop job to generate adjacent list for each node as the >> input for the second job. >> >> The second job takes the adjacent list input and output the first level >> breadth-first search result. The nodes which are the friends of the seed >> node have distance 1. Then this output is the input for the next hadoop >> job so on so forth, until all the nodes are reached. >> >> I generated a simulated graph for testing. This data has only 100 nodes. >> Normal python code can find the separation length within 1 second (100 >> seed nodes). However, the hadoop took almost 3 hours to do that >> (pseudo-distributed mode on one machine)!! >> >> I wonder if there is a more efficient way to do breadth-first search in >> hadoop? It is very inefficient to output so many intermediate results. >> Totally there would be seedNodeNumber*levelNumber+1 jobs, >> seedNodeNumber*levelNumber intermediate files. Why is hadoop so slow? >> >> Please help. Thanks! >> >> Wei >> > > > > -- > Best Regards, Edward J. Yoon > [EMAIL PROTECTED] > http://blog.udanax.org
-
RE: breadth-first searchPeng, Wei 2010-12-21, 09:59
The graph that my BFS algorithm is running on only needs 4 levels to reach all nodes. The reason I say "many iterations" is that there are 100 sources nodes, so totally 400 iterations. The algorithm should be right, and I cannot think of anything to reduce the number of iterations.
Ted, I will check out the links that you sent to me. I really appreciate your help. Wei -----Original Message----- From: Edward J. Yoon [mailto:[EMAIL PROTECTED]] Sent: Tuesday, December 21, 2010 1:27 AM To: [EMAIL PROTECTED] Subject: Re: breadth-first search There's no release yet. But, I had tested the BFS using hama and, hbase. Sent from my iPhone On 2010. 12. 21., at 오전 11:30, "Peng, Wei" <[EMAIL PROTECTED]> wrote: > Yoon, > > Can I use HAMA now, or it is still in development? > > Thanks > > Wei > > -----Original Message----- > From: Edward J. Yoon [mailto:[EMAIL PROTECTED]] > Sent: Monday, December 20, 2010 6:23 PM > To: [EMAIL PROTECTED] > Subject: Re: breadth-first search > > Check this slide out - > http://people.apache.org/~edwardyoon/papers/Apache_HAMA_BSP.pdf > > On Tue, Dec 21, 2010 at 10:49 AM, Peng, Wei <[EMAIL PROTECTED]> wrote: >> >> I implemented an algorithm to run hadoop on a 25GB graph data to >> calculate its average separation length. >> The input format is V1(tab)V2 (where V2 is the friend of V1). >> My purpose is to first randomly select some seed nodes, and then for >> each node, calculate the shortest paths from this node to all other >> nodes on the graph. >> >> To do this, I first run a simple python code in a single machine to get >> some random seed nodes. >> Then I run a hadoop job to generate adjacent list for each node as the >> input for the second job. >> >> The second job takes the adjacent list input and output the first level >> breadth-first search result. The nodes which are the friends of the seed >> node have distance 1. Then this output is the input for the next hadoop >> job so on so forth, until all the nodes are reached. >> >> I generated a simulated graph for testing. This data has only 100 nodes. >> Normal python code can find the separation length within 1 second (100 >> seed nodes). However, the hadoop took almost 3 hours to do that >> (pseudo-distributed mode on one machine)!! >> >> I wonder if there is a more efficient way to do breadth-first search in >> hadoop? It is very inefficient to output so many intermediate results. >> Totally there would be seedNodeNumber*levelNumber+1 jobs, >> seedNodeNumber*levelNumber intermediate files. Why is hadoop so slow? >> >> Please help. Thanks! >> >> Wei >> > > > > -- > Best Regards, Edward J. Yoon > [EMAIL PROTECTED] > http://blog.udanax.org
-
Re: breadth-first searchTed Dunning 2010-12-21, 15:28
Ahh... I see what you mean.
This algorithm can be implemented with all of the iterations for all points proceeding in parallel. You should only need 4 map-reduce steps, not 400. This will still take several minutes on Hadoop, but as your problem increases in size, this overhead becomes less important. 2010/12/21 Peng, Wei <[EMAIL PROTECTED]> > The graph that my BFS algorithm is running on only needs 4 levels to reach > all nodes. The reason I say "many iterations" is that there are 100 sources > nodes, so totally 400 iterations. The algorithm should be right, and I > cannot think of anything to reduce the number of iterations. > > Ted, I will check out the links that you sent to me. > I really appreciate your help. > > Wei > -----Original Message----- > From: Edward J. Yoon [mailto:[EMAIL PROTECTED]] > Sent: Tuesday, December 21, 2010 1:27 AM > To: [EMAIL PROTECTED] > Subject: Re: breadth-first search > > There's no release yet. > > But, I had tested the BFS using hama and, hbase. > > Sent from my iPhone > > On 2010. 12. 21., at 오전 11:30, "Peng, Wei" <[EMAIL PROTECTED]> wrote: > > > Yoon, > > > > Can I use HAMA now, or it is still in development? > > > > Thanks > > > > Wei > > > > -----Original Message----- > > From: Edward J. Yoon [mailto:[EMAIL PROTECTED]] > > Sent: Monday, December 20, 2010 6:23 PM > > To: [EMAIL PROTECTED] > > Subject: Re: breadth-first search > > > > Check this slide out - > > http://people.apache.org/~edwardyoon/papers/Apache_HAMA_BSP.pdf > > > > On Tue, Dec 21, 2010 at 10:49 AM, Peng, Wei <[EMAIL PROTECTED]> wrote: > >> > >> I implemented an algorithm to run hadoop on a 25GB graph data to > >> calculate its average separation length. > >> The input format is V1(tab)V2 (where V2 is the friend of V1). > >> My purpose is to first randomly select some seed nodes, and then for > >> each node, calculate the shortest paths from this node to all other > >> nodes on the graph. > >> > >> To do this, I first run a simple python code in a single machine to get > >> some random seed nodes. > >> Then I run a hadoop job to generate adjacent list for each node as the > >> input for the second job. > >> > >> The second job takes the adjacent list input and output the first level > >> breadth-first search result. The nodes which are the friends of the seed > >> node have distance 1. Then this output is the input for the next hadoop > >> job so on so forth, until all the nodes are reached. > >> > >> I generated a simulated graph for testing. This data has only 100 nodes. > >> Normal python code can find the separation length within 1 second (100 > >> seed nodes). However, the hadoop took almost 3 hours to do that > >> (pseudo-distributed mode on one machine)!! > >> > >> I wonder if there is a more efficient way to do breadth-first search in > >> hadoop? It is very inefficient to output so many intermediate results. > >> Totally there would be seedNodeNumber*levelNumber+1 jobs, > >> seedNodeNumber*levelNumber intermediate files. Why is hadoop so slow? > >> > >> Please help. Thanks! > >> > >> Wei > >> > > > > > > > > -- > > Best Regards, Edward J. Yoon > > [EMAIL PROTECTED] > > http://blog.udanax.org >
-
RE: breadth-first searchRicky Ho 2010-12-21, 16:53
Exactly, it should be done in 4 iterations instead of 4 * 100 iterations.
But in general, Graph algorithms is harder to do in Map/Reduce and the statelessness nature typically requires shuffling of the whole graph in each iteration. Jimmy Lin has described a technique called Skimmy to avoid that, which I described here at http://horicky.blogspot.com/2010/07/graph-processing-in-map-reduce.html And of course, the BSP style is another good way to achieve this which is where Google's Pregel model based on, I also blog about this here at http://horicky.blogspot.com/2010/07/google-pregel-graph-processing.html Notice that "scalability" and "speed" are different animals. Hadoop is about scalability but not speed. If your data can be squeezed into a single machine, then Hadoop is not for you. Otherwise, Hadoop is one great tool to parallelize your processing. It has a pretty constant (but significant) overhead that is justifiable when you have large amount of data. For example, Hadoop is not for processing real-time streams. http://horicky.blogspot.com/2010/11/map-reduce-and-stream-processing.html Notice that ... Hadoop is just one of the tool that you can use, and you decide whether you don't have better ones. Rgds, Ricky -----Original Message----- From: Ted Dunning [mailto:[EMAIL PROTECTED]] Sent: Tuesday, December 21, 2010 7:29 AM To: [EMAIL PROTECTED] Subject: Re: breadth-first search Ahh... I see what you mean. This algorithm can be implemented with all of the iterations for all points proceeding in parallel. You should only need 4 map-reduce steps, not 400. This will still take several minutes on Hadoop, but as your problem increases in size, this overhead becomes less important. 2010/12/21 Peng, Wei <[EMAIL PROTECTED]> > The graph that my BFS algorithm is running on only needs 4 levels to reach > all nodes. The reason I say "many iterations" is that there are 100 sources > nodes, so totally 400 iterations. The algorithm should be right, and I > cannot think of anything to reduce the number of iterations. > > Ted, I will check out the links that you sent to me. > I really appreciate your help. > > Wei > -----Original Message----- > From: Edward J. Yoon [mailto:[EMAIL PROTECTED]] > Sent: Tuesday, December 21, 2010 1:27 AM > To: [EMAIL PROTECTED] > Subject: Re: breadth-first search > > There's no release yet. > > But, I had tested the BFS using hama and, hbase. > > Sent from my iPhone > > On 2010. 12. 21., at 오전 11:30, "Peng, Wei" <[EMAIL PROTECTED]> wrote: > > > Yoon, > > > > Can I use HAMA now, or it is still in development? > > > > Thanks > > > > Wei > > > > -----Original Message----- > > From: Edward J. Yoon [mailto:[EMAIL PROTECTED]] > > Sent: Monday, December 20, 2010 6:23 PM > > To: [EMAIL PROTECTED] > > Subject: Re: breadth-first search > > > > Check this slide out - > > http://people.apache.org/~edwardyoon/papers/Apache_HAMA_BSP.pdf > > > > On Tue, Dec 21, 2010 at 10:49 AM, Peng, Wei <[EMAIL PROTECTED]> wrote: > >> > >> I implemented an algorithm to run hadoop on a 25GB graph data to > >> calculate its average separation length. > >> The input format is V1(tab)V2 (where V2 is the friend of V1). > >> My purpose is to first randomly select some seed nodes, and then for > >> each node, calculate the shortest paths from this node to all other > >> nodes on the graph. > >> > >> To do this, I first run a simple python code in a single machine to get > >> some random seed nodes. > >> Then I run a hadoop job to generate adjacent list for each node as the > >> input for the second job. > >> > >> The second job takes the adjacent list input and output the first level > >> breadth-first search result. The nodes which are the friends of the seed > >> node have distance 1. Then this output is the input for the next hadoop > >> job so on so forth, until all the nodes are reached. > >> > >> I generated a simulated graph for testing. This data has only 100 nodes.
-
RE: breadth-first searchPeng, Wei 2010-12-21, 18:04
Thank you all for all the replies.
I know I can run 100 source nodes in parallel. It is just surprising to me that the python code and hadoop code have such a big efficiency difference (they are both running in sequential on 100 source nodes). Hadoop is useful when the data is huge and cannot fit into memory, but it does not seem to be a real-time solution. There are so many to learn. Thank you again for all your kind help. Wei -----Original Message----- From: Ricky Ho [mailto:[EMAIL PROTECTED]] Sent: Tuesday, December 21, 2010 11:53 AM To: [EMAIL PROTECTED] Subject: RE: breadth-first search Exactly, it should be done in 4 iterations instead of 4 * 100 iterations. But in general, Graph algorithms is harder to do in Map/Reduce and the statelessness nature typically requires shuffling of the whole graph in each iteration. Jimmy Lin has described a technique called Skimmy to avoid that, which I described here at http://horicky.blogspot.com/2010/07/graph-processing-in-map-reduce.html And of course, the BSP style is another good way to achieve this which is where Google's Pregel model based on, I also blog about this here at http://horicky.blogspot.com/2010/07/google-pregel-graph-processing.html Notice that "scalability" and "speed" are different animals. Hadoop is about scalability but not speed. If your data can be squeezed into a single machine, then Hadoop is not for you. Otherwise, Hadoop is one great tool to parallelize your processing. It has a pretty constant (but significant) overhead that is justifiable when you have large amount of data. For example, Hadoop is not for processing real-time streams. http://horicky.blogspot.com/2010/11/map-reduce-and-stream-processing.html Notice that ... Hadoop is just one of the tool that you can use, and you decide whether you don't have better ones. Rgds, Ricky -----Original Message----- From: Ted Dunning [mailto:[EMAIL PROTECTED]] Sent: Tuesday, December 21, 2010 7:29 AM To: [EMAIL PROTECTED] Subject: Re: breadth-first search Ahh... I see what you mean. This algorithm can be implemented with all of the iterations for all points proceeding in parallel. You should only need 4 map-reduce steps, not 400. This will still take several minutes on Hadoop, but as your problem increases in size, this overhead becomes less important. 2010/12/21 Peng, Wei <[EMAIL PROTECTED]> > The graph that my BFS algorithm is running on only needs 4 levels to reach > all nodes. The reason I say "many iterations" is that there are 100 sources > nodes, so totally 400 iterations. The algorithm should be right, and I > cannot think of anything to reduce the number of iterations. > > Ted, I will check out the links that you sent to me. > I really appreciate your help. > > Wei > -----Original Message----- > From: Edward J. Yoon [mailto:[EMAIL PROTECTED]] > Sent: Tuesday, December 21, 2010 1:27 AM > To: [EMAIL PROTECTED] > Subject: Re: breadth-first search > > There's no release yet. > > But, I had tested the BFS using hama and, hbase. > > Sent from my iPhone > > On 2010. 12. 21., at 오전 11:30, "Peng, Wei" <[EMAIL PROTECTED]> wrote: > > > Yoon, > > > > Can I use HAMA now, or it is still in development? > > > > Thanks > > > > Wei > > > > -----Original Message----- > > From: Edward J. Yoon [mailto:[EMAIL PROTECTED]] > > Sent: Monday, December 20, 2010 6:23 PM > > To: [EMAIL PROTECTED] > > Subject: Re: breadth-first search > > > > Check this slide out - > > http://people.apache.org/~edwardyoon/papers/Apache_HAMA_BSP.pdf > > > > On Tue, Dec 21, 2010 at 10:49 AM, Peng, Wei <[EMAIL PROTECTED]> wrote: > >> > >> I implemented an algorithm to run hadoop on a 25GB graph data to > >> calculate its average separation length. > >> The input format is V1(tab)V2 (where V2 is the friend of V1). > >> My purpose is to first randomly select some seed nodes, and then for > >> each node, calculate the shortest paths from this node to all other
-
Re: breadth-first searchTed Dunning 2010-12-21, 18:09
Absolutely true. Nobody should pretend otherwise.
On Tue, Dec 21, 2010 at 10:04 AM, Peng, Wei <[EMAIL PROTECTED]> wrote: > Hadoop is useful when the data is huge and cannot fit into memory, but it > does not seem to be a real-time solution.
-
RE: breadth-first searchPeng, Wei 2010-12-22, 02:07
I was just trying to run 100 source nodes in multiple threads, but the
mapreduce tasks still look like to run in sequential. Do I need to configure hadoop somehow for multiple threads? Assign more task slots? How? Thanks -----Original Message----- From: Ted Dunning [mailto:[EMAIL PROTECTED]] Sent: Tuesday, December 21, 2010 1:10 PM To: [EMAIL PROTECTED] Subject: Re: breadth-first search Absolutely true. Nobody should pretend otherwise. On Tue, Dec 21, 2010 at 10:04 AM, Peng, Wei <[EMAIL PROTECTED]> wrote: > Hadoop is useful when the data is huge and cannot fit into memory, but it > does not seem to be a real-time solution.
-
Re: breadth-first searchEdward J. Yoon 2010-12-22, 06:54
> I don't see much evidence that HAMA will ever solve anything, so I wouldn't
> recommend pinning your hopes on that. Hello ted. What is your opinion based on? Have you tried to run the HAMA TRUNK? If not, you don't know about our progress in the last year. :-) On Tue, Dec 21, 2010 at 3:18 PM, Ted Dunning <[EMAIL PROTECTED]> wrote: > On Mon, Dec 20, 2010 at 9:43 PM, Peng, Wei <[EMAIL PROTECTED]> wrote: > >> ... >> Currently, most of the matrix data (graph matrix, document-word matrix) >> that we are dealing with are sparse. >> > > Good. > > >> The matrix decomposition often needs many iterations to converge, then >> intermediate results have to be saved to serve as the input for the next >> iteration. >> > > I think you are thinking of the wrong algorithms. The ones that I am > talking about converge > in a fixed and small number of steps. See > https://issues.apache.org/jira/browse/MAHOUT-376 for the work > in progress on this. > > >> This is super inefficient. As I mentioned, the BFS algorithm written in >> python took 1 second to run, however, hadoop took almost 3 hours. I >> would expect hadoop to be slower, but not this slow. >> > > I think you have a combination of factors here. But, even accounting for > having too many > iterations in your BFS algorithm, iterations with stock Hadoop take 10s of > seconds even if > they do nothing. If you arrange your computation to need many iterations, > it will be slow. > > > All the hadoop applications that I saw are all very simple calculations, >> I wonder how it can be applied to machine learning/data mining >> algorithms. >> > > Check out Mahout. There is a lot of machine learning going on there both on > Hadoop and using other scalable methods. > > >> Is HAMA the only way to solve it? If it is not ready to use yet, then >> can I assume hadoop is not a good solution for multiple iteration >> algorithms now? >> > > I don't see much evidence that HAMA will ever solve anything, so I wouldn't > recommend pinning your hopes on that. > > For fast, iterative map-reduce, you really need to keep your mappers and > reducers live between iterations. Check out > twister for that: http://www.iterativemapreduce.org/ > -- Best Regards, Edward J. Yoon [EMAIL PROTECTED] http://blog.udanax.org
-
RE: breadth-first searchPeng, Wei 2010-12-22, 16:57
Can someone tell me whether we can run multiple threads in hadoop?
Thanks Wei -----Original Message----- From: Peng, Wei [mailto:[EMAIL PROTECTED]] Sent: Tuesday, December 21, 2010 9:07 PM To: [EMAIL PROTECTED] Subject: RE: breadth-first search I was just trying to run 100 source nodes in multiple threads, but the mapreduce tasks still look like to run in sequential. Do I need to configure hadoop somehow for multiple threads? Assign more task slots? How? Thanks -----Original Message----- From: Ted Dunning [mailto:[EMAIL PROTECTED]] Sent: Tuesday, December 21, 2010 1:10 PM To: [EMAIL PROTECTED] Subject: Re: breadth-first search Absolutely true. Nobody should pretend otherwise. On Tue, Dec 21, 2010 at 10:04 AM, Peng, Wei <[EMAIL PROTECTED]> wrote: > Hadoop is useful when the data is huge and cannot fit into memory, but it > does not seem to be a real-time solution.
-
RE: breadth-first searchRicky Ho 2010-12-22, 18:46
You can do whatever your want (including spawning threads) in the Mapper process
(which is fork/exec by the TaskTracker). But this doesn't help I think you need to understand the fundamental difference between the 2 parallel processing models 1) Multi-threading Small scale parallelism limited to number of cores within a single machine. Multiple execution threads with a shared memory model. And a lot of synchronization primitives to coordinate access to share data 2) Map/Reduce Large scale parallelism involving large number of machines (hundreds to thousands) Data is shuffled through 2 layers of machines via a special topology (output records of the same key from layer1 will land on the same place at layer 2) The first layer is conducting a per-record transformation (map) and the second layer is conducting a consolidation (reduce) These 2 models has a very different notions of synchronization, one is using fine grain locking and the other is share nothing, you won't be consider them to be alternatives to each other. They are intended to solve very very different problems. Can they be used together ? Absolutely yes. But you need to design how you want to partition your problem ... For example, you can consider partitioning your graph into sub-graphs so each Mapper/Reducer is dealing with a bigger sub-graph rather than individual nodes. Of course you need to think about how to combine the subgraph results, and whether you need absolutely accurate answer or an approximation is good enough. I bet you are the later one and so should be more easy. Ted, Can you point me to Matrix algorithms that is tuned for sparse graph ? What I mean is from O(v^3) to O(v*e) where v = number of vertex and e = number of edges. Rgds, Ricky -----Original Message----- From: Peng, Wei [mailto:[EMAIL PROTECTED]] Sent: Wednesday, December 22, 2010 8:58 AM To: [EMAIL PROTECTED] Subject: RE: breadth-first search Can someone tell me whether we can run multiple threads in hadoop? Thanks Wei -----Original Message----- From: Peng, Wei [mailto:[EMAIL PROTECTED]] Sent: Tuesday, December 21, 2010 9:07 PM To: [EMAIL PROTECTED] Subject: RE: breadth-first search I was just trying to run 100 source nodes in multiple threads, but the mapreduce tasks still look like to run in sequential. Do I need to configure hadoop somehow for multiple threads? Assign more task slots? How? Thanks
-
Re: breadth-first searchTed Dunning 2010-12-22, 19:00
The Mahout math package has a number of basic algorithms that use
algorithmic efficiencies when given sparse graphs. A number of other algorithms use only the product of a sparse matrix on another matrix or a vector. Since these algorithms never change the original sparse matrix, they are safe against fill-in problems. The random projection technique avoids O(v^3) algorithms for computing SVD or related matrix decompositions. See http://arxiv.org/abs/0909.4061 and https://issues.apache.org/jira/browse/MAHOUT-376 None of these these algorithms are specific to graph theory, but all deal with methods that are useful with sparse graphs. On Wed, Dec 22, 2010 at 10:46 AM, Ricky Ho <[EMAIL PROTECTED]> wrote: > Can you point me to Matrix algorithms that is tuned for sparse graph ? > What I > mean is from O(v^3) to O(v*e) where v = number of vertex and e = number of > edges. >
-
RE: breadth-first searchPeng, Wei 2010-12-22, 19:51
Thanks for quick response.
Partitioning graphs into subgraphs and later combining the results is too complicated to do. I prefer a simple method. Currently, I do not want to divide the breadth-first search from a single source. I just want to run 100 breadth-first search from 100 source nodes with 100 threads running in parallel. The problem is that these 100 threads do not seem to run parallel, however, they seem to run in sequential. I have searched on-line. Some people mention that all tasks are put into queues waiting for free mapreduce slots. It is might be due to not enough slots. How to deal with this problem? Wei -----Original Message----- From: Ted Dunning [mailto:[EMAIL PROTECTED]] Sent: Wednesday, December 22, 2010 2:01 PM To: [EMAIL PROTECTED] Subject: Re: breadth-first search The Mahout math package has a number of basic algorithms that use algorithmic efficiencies when given sparse graphs. A number of other algorithms use only the product of a sparse matrix on another matrix or a vector. Since these algorithms never change the original sparse matrix, they are safe against fill-in problems. The random projection technique avoids O(v^3) algorithms for computing SVD or related matrix decompositions. See http://arxiv.org/abs/0909.4061 and https://issues.apache.org/jira/browse/MAHOUT-376 None of these these algorithms are specific to graph theory, but all deal with methods that are useful with sparse graphs. On Wed, Dec 22, 2010 at 10:46 AM, Ricky Ho <[EMAIL PROTECTED]> wrote: > Can you point me to Matrix algorithms that is tuned for sparse graph ? > What I > mean is from O(v^3) to O(v*e) where v = number of vertex and e number of > edges. >
-
Re: breadth-first searchTed Yu 2010-12-22, 20:35
Modify the following parameters:
mapred.tasktracker.map.tasks.maximum mapred.tasktracker.reduce.tasks.maximum mapred.map.tasks mapred.reduce.tasks FYI you need to adjust the -Xmx for your mapper/reducer after increasing the values for above parameters On Wed, Dec 22, 2010 at 11:51 AM, Peng, Wei <[EMAIL PROTECTED]> wrote: > Thanks for quick response. > > Partitioning graphs into subgraphs and later combining the results is > too complicated to do. I prefer a simple method. > > Currently, I do not want to divide the breadth-first search from a > single source. I just want to run 100 breadth-first search from 100 > source nodes with 100 threads running in parallel. > The problem is that these 100 threads do not seem to run parallel, > however, they seem to run in sequential. I have searched on-line. Some > people mention that all tasks are put into queues waiting for free > mapreduce slots. It is might be due to not enough slots. > How to deal with this problem? > > Wei > > > -----Original Message----- > From: Ted Dunning [mailto:[EMAIL PROTECTED]] > Sent: Wednesday, December 22, 2010 2:01 PM > To: [EMAIL PROTECTED] > Subject: Re: breadth-first search > > The Mahout math package has a number of basic algorithms that use > algorithmic efficiencies when given sparse graphs. > > A number of other algorithms use only the product of a sparse matrix on > another matrix or a vector. Since these algorithms never change the > original sparse matrix, they are safe against fill-in problems. > > The random projection technique avoids O(v^3) algorithms for computing > SVD > or related matrix decompositions. See http://arxiv.org/abs/0909.4061 > and > https://issues.apache.org/jira/browse/MAHOUT-376 > > None of these these algorithms are specific to graph theory, but all > deal > with methods that are useful with sparse graphs. > > On Wed, Dec 22, 2010 at 10:46 AM, Ricky Ho <[EMAIL PROTECTED]> > wrote: > > > Can you point me to Matrix algorithms that is tuned for sparse graph ? > > What I > > mean is from O(v^3) to O(v*e) where v = number of vertex and e > number of > > edges. > > > |