Pig >> mail # user >> GSoC 2013

Re: GSoC 2013
I'm using only the WTF graph representation so that the graph fits in memory. By the way, I haven't seen any explanation of WTF or the graph models on the Pig 0.11 release page.
I don't want to use Cassovary. I believe it can be done with Pig. I'll implement a graph representation in Pig based on the WTF paper, and then I'll use it to implement a random walk algorithm. To do that, maybe I need to improve some features such as joins (fuzzy join) or implement a new operator. I can implement it using either existing operators or new operators; that's up to us, and it doesn't really matter. If there is already an implementation of a random walk algorithm, please feel free to tell me, because I haven't found one.
> Are you proposing to create an open-source implementation of those
> algorithms?
Yes, I'm proposing to implement a random walk algorithm and a new data model representing the graph. After that, people can use it in their Pig code.

> Do you suggest they should be Pig scripts added to the Pig project, or do
> you want to create some new operators?
Maybe it can be a UDF or a new operator.

I made a quick example. It may not be completely accurate; I've just tried to explain the idea.
Suppose you have a graph file like this:
user_id follower
1 2
1 3
1 10
2 3
3 4
3 5

The vertex list is an array containing the sorted vertex ids.
The node list is a matrix containing each vertex id and its starting position.
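Roughly, the layout described above could look like the following sketch (a Python illustration, not part of the proposal; the names `vertex_list`, `node_list`, and `followers` are mine, built from the example edge file):

```python
# Edge list from the example graph file (user_id, follower).
edges = [(1, 2), (1, 3), (1, 10), (2, 3), (3, 4), (3, 5)]

# Vertex list: sorted array of distinct vertex ids.
vertex_list = sorted({v for e in edges for v in e})

# Node list: each vertex id paired with its starting position into a
# flat follower array, so a vertex's followers form a contiguous slice.
followers = []
node_list = []  # (vertex_id, start_position) pairs
for v in vertex_list:
    node_list.append((v, len(followers)))
    followers.extend(f for (u, f) in edges if u == v)

# e.g. vertex 1 starts at position 0 and its followers are 2, 3, 10.
```

This is the same compressed adjacency idea the vertex/node lists describe: one sorted id array plus offsets into a shared follower array, which keeps the whole graph compact enough to cache in memory.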
graph = LOAD 'graph' USING PigStorage() AS (vertex:int, follower:int); -- load the graph file
vertex = COGROUP graph BY vertex;
list = FOREACH vertex GENERATE org.apache.pig.generateVertex(vertex) AS vertexList; -- load all the vertexes from HDFS into memory
list = FOREACH graph GENERATE org.apache.pig.generateNode(list) AS nodeList; -- build the node list
randomWalk = FOREACH vertex GENERATE
FLATTEN(org.apache.pig.RandomWalk(list, endVertex)) AS score; -- using the node list, traverse the graph to the finishing vertex and generate a score
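The walk step behind the proposed `RandomWalk` UDF could be sketched like this (a hedged Python illustration of the general technique, not the actual UDF; the adjacency dict, step count, and restart-on-dead-end policy are my assumptions):

```python
import random

# Adjacency built from the example graph file above.
adj = {1: [2, 3, 10], 2: [3], 3: [4, 5]}

def random_walk(adj, start, steps, seed=None):
    """Walk `steps` hops from `start`, counting visits per vertex as a score."""
    rng = random.Random(seed)
    visits = {}
    v = start
    for _ in range(steps):
        visits[v] = visits.get(v, 0) + 1
        nbrs = adj.get(v)
        if not nbrs:
            v = start       # dead end: restart the walk at the source (assumed policy)
        else:
            v = rng.choice(nbrs)
    return visits

scores = random_walk(adj, start=1, steps=1000, seed=42)
```

Visit counts like these are the usual basis for random-walk scores such as personalized PageRank; in the Pig version the traversal would run over the cached node list instead of a Python dict.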
Best Regards...
On Mon, Apr 1, 2013 at 7:20 PM, Dmitriy Ryaboy <[EMAIL PROTECTED]> wrote:

> I'm somewhat familiar with WTF code (my day job is managing the analytics
> infrastructure team at Twitter). WTF is implemented using Pig 0.11 (in fact
> some of the Pig 11 features/improvements are directly due to this
> project...), and mostly has to do with clever algorithms implemented in Pig
> (an earlier version of WTF loaded the graph into main memory on large-mem
> machines -- that system is open sourced, too, under
> github.com/twitter/cassovary). Are you proposing to create an open-source
> implementation of those algorithms? Do you suggest they should be Pig
> scripts added to the Pig project, or do you want to create some new
> operators? I'm not totally sure where you are going here.
> GSoC proposals for Pig are usually made by students who want to work on
> issues labeled as GSoC candidates on the apache jira. The students spend
> some time to understand the problem stated in the jira, familiarize
> themselves with the existing codebase, and put a basic technical
> implementation plan and schedule into their proposal. Since in this case
> you are proposing something we haven't scoped or defined well for
> ourselves, we need you to be very clear and specific about what you are
> trying to do, and how you plan to go about it. I think that Graph
> processing in Pig (or other Hadoop-based systems) is a really interesting
> topic and there is a lot of work to be done, but we really need you to be
> far more detailed to be able to give you good guidance with regards to
> GSoC.
> Best,
> Dmitriy
> On Sat, Mar 30, 2013 at 10:12 AM, burakkk <[EMAIL PROTECTED]> wrote:
> > Sure. We can implement a graph model using the "WTF: The Who to Follow
> > Service at Twitter" article. The article says that, this way, the graph
> > can be stored in one machine's memory, so every node will read it from
> > HDFS and cache the graph in memory. Every node is responsible for its
BURAK ISIKLI | http://burakisikli.wordpress.com