|
Dmitriy Ryaboy
2010-08-27, 21:13
Corbin Hoenes
2010-08-28, 01:26
Milind A Bhandarkar
2010-08-28, 18:39
Dmitriy Ryaboy
2010-08-29, 21:11
Alan Gates
2010-08-30, 17:10
Corbin Hoenes
2010-08-31, 14:39
Russell Jurney
2010-08-31, 19:01
|
-
Request for Comments: Piggybank future Dmitriy Ryaboy 2010-08-27, 21:13
Hi folks, at the last Pig contributor meeting, the piggybank question was discussed -- namely, how to make it easier to contribute to. (By the way, the contributor meetings are generally open to all comers -- sign up for the pig-dev list if you are interested in that type of thing.)

Here's a section of the notes I sent to pig-dev that documents the results of the piggybank discussion. How do you, as users, feel about this plan?

Piggybank.
Kevin Weil led a discussion of the piggybank. There are a few problems with it -- it's released on the Pig schedule, and has quite a few barriers to submission that are, anecdotally at least, preventing people from contributing. Several options were discussed, with the group finally settling on starting a community-curated GitHub project for piggybank. It will have a number of committers from different companies, and will aim to make it easy for folks to contribute (all contribs will still have to have tests, and be Apache 2.0-licensed). More details will be forthcoming as we figure them out. Initially this project will be seeded with the current Piggybank functions some time after 0.8 is branched. The initial list of committers is Kevin Weil (Twitter), Dmitriy Ryaboy (Twitter), Carl Steinbach (Cloudera), and Russell Jurney (LinkedIn). Yahoo will also nominate someone. Please send us any thoughts you might have on this subject.

It was suggested that a lot of common code might be shared with Hive UDFs, which have the same problems as Piggybank does, and that perhaps the project can be another collaboration point between the projects. It is not clear how that would work; Carl will talk to other Hive people.
-
Re: Request for Comments: Piggybank future Corbin Hoenes 2010-08-28, 01:26
I really like this idea. I'd like to see more sharing of udfs out in the open. What barriers to submission are removed by this move? How does a udf make it into piggybank now vs. before?
-
RE: Request for Comments: Piggybank future Milind A Bhandarkar 2010-08-28, 18:39
+1 on the direction.
A few questions:

1. With Pig marching towards becoming a TLP at Apache, can Piggybank become a full-fledged subproject (with its own releases and all)?
2. Or since the ultimate goal is to have a common UDF repository for both Pig and Hive, it would make sense to make it into an incubator project, with a name that does not indicate pig dependency?
3. I see parallels between Howl and the proposed Piggybank, since they aspire to become common components in both Hive and Pig distributions. What are the long-term plans for Howl as far as hosting is concerned?

- Milind
-
Re: Request for Comments: Piggybank future Dmitriy Ryaboy 2010-08-29, 21:11
Hi folks,
I'll try to address both Corbin's and Milind's questions. This is just my opinion; I'm open to criticism/suggestions/corrections.

There are several barriers that are being removed.

First, piggybank will no longer be bound to the Pig release schedule. At the moment, I am not sure there will be "releases" of piggybank, as such -- we might just tag snapshots with their own git branches and move on. This allows the code to develop at a much faster pace, while possibly sacrificing some of the stability and permanence of Apache-style releases. I feel that this is ok, as piggybank was always subject to less stringent testing, and the attitude towards it has long been "it might work, and you might have to tweak it if it doesn't".

Second, moving to GitHub makes it easy for people to cook their own versions of piggybank if they want to -- they just have to fork the main master, and apply changes as needed. The committers can pull in all, or some, of the changes, if they are desirable. This puts such mutations in the public view, as opposed to what's happening now, where they either don't happen, or happen on people's unseen svn exports.

Third, this allows contributions to piggybank for older versions of Pig. At the moment, for example, there isn't really a way to contribute a Pig 0.6 loader -- the current svn trunk is on the new API, so such contributions won't compile. Something could be contributed for a 0.6 branch, but that won't see the light of day unless the Pig team decides to do a 0.6.1 release, which is highly unlikely and kind of a maintenance nightmare. This is why, for example, my HBase loader changes wound up in Elephant-Bird instead of Pig proper -- I didn't have a good way of getting them out there otherwise. On GitHub, we will be able to just keep a 0.6 branch that folks using that version can keep moving.

Bottom line is that we are sacrificing the benefits of a stately, strict Apache workflow in order to gain agility and decrease barriers to contribution. I personally feel that this is ok because piggybank is not so much a software project as a collection of individual, distinct libraries. It's kind of the CPAN of Pig, and no one versions all modules of CPAN in one go -- the whole thing would get bogged down if that were to happen. Granted, CPAN lets you pull down specific versions of individual modules, and this doesn't, but let's take it one step at a time.

I think the bit about Hive interoperation might be a bit overstated. The observation was just that Hive has the same problem with user-defined functions, and some common code might be reused since the two projects are often used to achieve similar goals. So if the Hive people wanted to collaborate on the common bits, and put their udfs into /hive while we put ours into /pig, we agreed that would be a good thing. There is no intent, at the moment, to build some generic udf interface that would allow one to write udfs for both Hive and Pig at once. Though that would be cool.

-Dmitriy
-
Re: Request for Comments: Piggybank future Alan Gates 2010-08-30, 17:10
On Aug 28, 2010, at 11:39 AM, Milind A Bhandarkar wrote:

> +1 on the direction.
>
> A few questions:
>
> 1. With Pig marching towards becoming a TLP at Apache, can Piggybank become a full-fledged subproject (with its own releases and all)?
> 2. Or since the ultimate goal is to have a common UDF repository for both Pig and Hive, it would make sense to make it into an incubator project, with a name that does not indicate pig dependency?

I agree with Dmitriy that this is not necessarily the ultimate goal.

> 3. I see parallels between Howl and proposed Piggybank, since they aspire to become common components in both Hive and Pig distributions. What are long term plans for Howl as far as hosting is concerned?

The stated plan with Howl has been to put it in the Incubator.

Alan.
-
Re: Request for Comments: Piggybank future Corbin Hoenes 2010-08-31, 14:39
All sounds reasonable; thanks for explaining the thought process.
-
Re: Request for Comments: Piggybank future Russell Jurney 2010-08-31, 19:01
I'm pretty excited about this. This removes all the pain of contributing UDFs.

Russ
|