Rob Stewart
2009-10-07, 13:18
Dmitriy Ryaboy
2009-10-07, 15:04
Santhosh Srinivasan
2009-10-07, 16:44
Rob Stewart
2009-10-07, 16:46
Dmitriy Ryaboy
2009-10-07, 17:22
Dmitriy Ryaboy
2009-10-07, 17:23
Rob Stewart
2009-10-07, 22:09
Jeff Hammerbacher
2009-10-08, 06:48
-
Using Pig for a comparative Study
Rob Stewart, 2009-10-07, 13:18
Hello Pig user group!

OK, here are three things about me:
1. I'm new to Pig and Hadoop.
2. I'm studying for a Masters in Software Engineering in the UK.
3. I'm looking to do a comparative study of probably two distributed systems over a cluster network.

I have investigated Hadoop, and have deployed it across various virtual Linux systems on this PC I'm using (which was fun!), and my university has given me permission to use its cluster to deploy Hadoop, which I'm excited about. (They may even use it for future research, or better still, production processing!)

Anyway... I have had a look at Pig and worked through the various tutorials, which are very well written. I have these tutorials working on my virtual Hadoop cluster here on this PC, and I assume the same would be the case on the university cluster.

I need another system, as similar as possible to Pig in function and use. My supervisor has pointed me in the direction of CouchDB (written in Erlang) as another tool that could potentially be used for comparison in my studies. From reading a little about it, however, there seems to be no formal process for distributing a CouchDB job across a cluster of nodes for parallel processing. I have contacted the CouchDB mailing list for clarification about this.

So, I write to you guys for four reasons:
1. To touch base and say: "Hey, I'm hoping to use Pig for a comparative study for my Masters dissertation. Thanks!!"
2. To ask whether there is any other solution out there that can be closely compared to the functionality and use of Pig.
3. To ask whether CouchDB has been benchmarked against Pig before now and, if so, where I can find the results or who can help me with this.
4. Am I off the mark with these questions? If so, please speak now!

thanks,
Rob Stewart
-
Re: Using Pig for a comparative Study
Dmitriy Ryaboy, 2009-10-07, 15:04
Hi Rob,
CouchDB is a totally different project with very different goals. Preemptively -- so are Cassandra, Project Voldemort, Tokyo Tyrant, and HBase. They are also different from each other, but that's a long conversation.

In what way do you intend to compare the systems -- speed, architecture, parallelization policy?

In the Hadoop world, Hive is a system with similar goals to Pig, although it has a somewhat different philosophy. You may also want to check out JAQL.

Microsoft has been letting academics get access to its Dryad system, so you may want to look at their DryadLINQ and SCOPE stuff. I am not sure of the extent to which MS actually lets you play with their stack, but they seem to have become more student-researcher-friendly in recent years.

-D
-
RE: Using Pig for a comparative Study
Santhosh Srinivasan, 2009-10-07, 16:44
Rob,

>> 2. To ask, if there is any other solution out there that can be closely compared to the functionality and use of Pig.

Hive (http://hadoop.apache.org/hive/), which provides a SQL-like interface on top of Hadoop, and JAQL (http://www.jaql.org/), another query language that also works on Hadoop, are two good candidates.

>> 4. Am I off the mark with these questions? If so, please speak now!

Not at all. It would be great if you could share the parameters that form the basis for the comparison.

Thanks,
Santhosh
-
Re: Using Pig for a comparative Study
Rob Stewart, 2009-10-07, 16:46
Hi Dmitriy, excellent response, thanks.

I was predominantly looking at CouchDB simply because it's written in Erlang, which is a scalable, distributable language. I do realise that if I were to compare CouchDB with Pig/Hadoop, it would be difficult to argue that I was indeed comparing like for like.

RE: "in what way do you intend to compare the systems". That is *the* question.

Speed - Yes, it would be nice to implement the same execution procedure in two different systems/languages, run it on the same cluster (at different times!) and compare the time it takes to execute. The variable here would be the size of the data to process.

Architecture - A good one to discuss. Is the required infrastructure identical on both systems? (I know, for instance, that Dryad and Hadoop both have the "one master and many slaves" architecture, albeit for different roles.)

Parallelization policy - Indeed: at what point does the execution switch from sequential to parallel, which nodes execute in parallel, etc.?

Fault tolerance - This is one I'd be keen to explore. The obvious advantage of using Pig for my research is that I get fault tolerance for free from Hadoop. Great! But I want to be able to control failures in order to analyse the performance of recovery. I would need to investigate exactly how to create a fault, other than by killing the DataNode service using the Linux kill command. Answers on the back of a postcard, thanks.

I've just had a quick look at JAQL. Wow, good suggestion. The core of the language offers filter, transform, group, join, sort and expand. A few of these are matched in Pig, and JAQL can also read from delimited files, as Pig does. I will certainly spend time looking into this, and see whether I can create an input file and process it using both JAQL and Pig without any alterations to the input, whilst generating an identical output file. If so, I'm in business... This would also eliminate the distributed nature of the systems as a variable (they both use Hadoop).

I had been pointed in the direction of Dryad, and whilst I am, at this stage, open to suggestions for my study, I do have a few concerns about using DryadLINQ. Firstly, it requires Windows Server 2008 SP1 servers for the master and the slaves, and they need to have the Dryad software installed. I'm not sure how accessible that is to me. Also, I wonder whether my support would dry up once I've started (and I wouldn't have a community to rely on like this one!).

RE: Avram - "that closely relates to Pig". I *think* I meant both in terms of an underlying architecture (in Pig's case, Hadoop), syntax that is above the level of data allocation to DataNodes, and also the sort of functionality Pig provides (basic data processing/manipulation using filter/join, although I realise that you can write user-defined functions to fill the gap).

I will indeed have a look at Hive. It will be interesting to see the differences between Hive and Pig, bearing in mind they have both been merged into the Apache Hadoop software stack, and to see how much crossover exists between the two.

Finally, Cascading also looks interesting. I shall try to get an example working and take it from there. Is it anticipated that Cascading will get merged into the Hadoop software stack?

thanks guys, no doubt I will have a ton of problems/questions that need solving when I've tried these out.

Rob Stewart
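
The "identical output" condition Rob describes can be checked mechanically. The following is only a minimal sketch in Python, not anything provided by Pig or JAQL: it assumes both systems have written plain-text results into local directories as files named part-*, and the function names load_records and same_output are hypothetical. It compares the two result sets as multisets of lines, so differences in record order or in how records are split across part files do not count as mismatches.

```python
from collections import Counter
from pathlib import Path


def load_records(output_dir):
    """Count every non-empty line across all part-* files in output_dir.

    Assumption: results were copied to a local directory as text files
    whose names start with "part-".
    """
    records = Counter()
    for part in sorted(Path(output_dir).glob("part-*")):
        with part.open(encoding="utf-8") as handle:
            for line in handle:
                line = line.rstrip("\n")
                if line:
                    records[line] += 1
    return records


def same_output(dir_a, dir_b):
    """True when both directories hold the same records with the same
    multiplicities, ignoring record order and the split into part files."""
    return load_records(dir_a) == load_records(dir_b)
```

A check like this still treats lines as opaque strings, so the two jobs would need to agree on field delimiter and number formatting before it can report a match.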
-
Re: Using Pig for a comparative Study
Dmitriy Ryaboy, 2009-10-07, 17:22
Rob,
It's highly unlikely that Cascading would be "merged" into Hadoop, due to license issues (it's GPL, while Hadoop is Apache). But it's open source, and the author seems to be pretty available on the mailing lists; I am not sure how much it matters for your purposes which source code repository the code comes from (as long as you aren't distributing the software).

There is a Hive vs Pig benchmark on the Hive jira, which reproduces queries from the Hadoop vs RDBMS paper by Pavlo et al. The queries are a bit biased towards the kind of stuff RDBMSes are good at, but it's a good place to start. Pig also has its own benchmark, called PigMix, which you can translate into Hive / JAQL / Cascading queries. Note that the Pig version in the trunk just got a whole lot faster, so it may be worth rerunning both of those benchmarks.

For inducing failures, you can kill data nodes / task trackers, or you can induce various loads on individual machines -- the CMU group that did performance monitoring had a few common things they would do, like a "disk hog", a "cpu hog", and a "network hog", to simulate various problems that might arise. I suspect that since the underlying fault-tolerance model is Hadoop's for all the systems, you will wind up with the same results for Pig, Hive, JAQL, and Cascading.

It might be interesting to look at how many map-reduce steps are generated by the different frameworks to achieve the same task (keeping in mind that not all steps are created equal -- for example, Pig often generates indexing MR jobs that are very fast, and whose "cost" is much lower than that of an MR job that requires processing all the input data).

Take a look at the "Distributed Aggregation for Data-Parallel Computing" paper from MSR's Yu et al (SIGMOD 2009 I think? Might have the conference wrong). It's got an interesting analysis of different models for computing distributed aggregations, and some criticisms of how Pig, specifically, does it. Maybe there's some follow-up work in that?

You may also want to experiment with how the various systems deal with odd distributions and skewed data, especially skewed data that models the real world -- graphs of social connections or web links (with in- and out-degrees of nodes following a power law), etc.

I think CouchDB is a red herring as far as comparing things to Pig is concerned. But if you want to use Erlang to write a Pig clone, no one would stop you :-).

-D
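
To make the skewed-data suggestion concrete, here is a rough sketch of how a test input with a power-law key distribution could be generated in Python. It is an illustration under stated assumptions, not part of PigMix or any of the benchmarks mentioned: the names zipf_keys and write_edges, the tab-separated edge format, and the default exponent are all hypothetical choices. Only the target column is skewed, so the resulting graph has power-law-like in-degrees while out-degrees stay roughly uniform.

```python
import random


def zipf_keys(num_records, num_keys, exponent=1.0, seed=0):
    """Return num_records keys drawn from 1..num_keys, where key k is
    chosen with probability proportional to 1 / k**exponent."""
    rng = random.Random(seed)
    keys = list(range(1, num_keys + 1))
    weights = [1.0 / (k ** exponent) for k in keys]
    return rng.choices(keys, weights=weights, k=num_records)


def write_edges(path, num_edges, num_nodes, exponent=1.0, seed=0):
    """Write tab-separated (source, target) pairs, one edge per line.

    Targets follow the skewed distribution above; sources are uniform.
    The file layout is an assumption made for this sketch.
    """
    rng = random.Random(seed + 1)
    targets = zipf_keys(num_edges, num_nodes, exponent, seed)
    with open(path, "w", encoding="utf-8") as out:
        for target in targets:
            source = rng.randint(1, num_nodes)
            out.write(f"{source}\t{target}\n")
```

Fixing the seed keeps the generated file reproducible, so the same input can be fed to each system being compared; raising the exponent concentrates more records on the few most popular keys.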
-
Re: Using Pig for a comparative Study
Dmitriy Ryaboy, 2009-10-07, 17:23
Oh, check out HadoopDB also.
-
Re: Using Pig for a comparative Study
Rob Stewart, 2009-10-07, 22:09
@ Santhosh - I will indeed keep this mailing list abreast of my study. You'll probably figure out what I'm up to from the questions that'll appear on the mailing list :-)

@ Dmitriy - What you've pointed me towards is already a massive help. I appreciate the time you've taken to respond to my plea for help! :-) And I've had a look around the community project documentation; your name appears quite a lot!

Ok, I'm cramming the references you've pointed out into my BibTeX database, so they don't get lost.

I will, over the next few days, have a good look at PigMix and the Pig vs Hive benchmark.

So for now, it's a lot of playing about with Hive and Pig, and I will have a look at how HadoopDB functions. The creators of that project explain that they use elements of Hive, so I need to clarify the differences between the two before deciding which to use for my study. A HadoopDB vs Hive vs Pig vs JAQL evaluation is not out of the question at this moment in time.

Thanks Dmitriy, I will most likely touch base down the line.

Rob
-
Re: Using Pig for a comparative Study
Jeff Hammerbacher, 2009-10-08, 06:48
Hey Rob,
There's been some fairly extensive benchmarking of Pig and Hive over at https://issues.apache.org/jira/browse/HIVE-396 and https://issues.apache.org/jira/browse/HIVE-600 that may help you get started.

Regards,
Jeff