|
Ricky Ho
2009-05-06, 05:17
asif md
2009-05-06, 07:48
Sharad Agarwal
2009-05-06, 11:55
Jeff Hammerbacher
2009-05-06, 20:38
Ricky Ho
2009-05-06, 21:16
Amr Awadallah
2009-05-06, 22:13
Ricky Ho
2009-05-06, 22:56
Ashish Thusoo
2009-05-06, 23:21
Ricky Ho
2009-05-07, 03:56
Olga Natkovich
2009-05-07, 00:32
Scott Carey
2009-05-07, 02:47
Ricky Ho
2009-05-07, 04:11
Luc Hunt
2009-05-07, 05:46
Amr Awadallah
2009-05-07, 06:20
Alan Gates
2009-05-07, 14:14
Namit Jain
2009-05-07, 17:12
Scott Carey
2009-05-07, 18:08
Ashish Thusoo
2009-05-07, 18:18
Ashish Thusoo
2009-05-07, 18:10
Ricky Ho
2009-05-08, 14:35
|
-
PIG and Hive
Ricky Ho 2009-05-06, 05:17
Are Pig and Hive competing technologies, each providing a higher-level language for Map/Reduce programming?
Or are they complementary? Is there any comparison between them?

Rgds,
Ricky
-
Re: PIG and Hive
asif md 2009-05-06, 07:48
http://www.cloudera.com/hadoop-training-hive-introduction
http://www.cloudera.com/hadoop-training-pig-introduction
-
Re: PIG and Hive
Sharad Agarwal 2009-05-06, 11:55
See the core-user mail thread with subject "HBase, Hive, Pig and other Hadoop based technologies".

- Sharad
-
Re: PIG and Hive
Jeff Hammerbacher 2009-05-06, 20:38
Here's a permalink for the thread on MarkMail:
http://markmail.org/thread/ee4hpcji74higqvk
-
RE: PIG and Hive
Ricky Ho 2009-05-06, 21:16
Jeff,

Thanks for the pointer. It is pretty clear that Hive and PIG are the same kind of tool and HBase is a different kind. The difference between PIG and Hive seems to be pretty insignificant. Layering a tool on top of them could completely hide their differences.

I am viewing your PIG and Hive tutorials and hopefully can extract some technical details there.

Rgds,
Ricky
-
Re: PIG and Hive
Amr Awadallah 2009-05-06, 22:13
> The difference between PIG and Hive seems to be pretty insignificant.

The difference between Pig and Hive is significant, specifically:

(1) Pig doesn't require any underlying structure in the data, whereas Hive does imply structure via a metastore. This has its pros and cons. It makes Pig more suitable for ETL-type tasks, where the input data is still a mish-mash and you want to convert it into something structured. On the other hand, Hive's metastore provides a dictionary that lets you easily see which columns exist in which tables, which can be very handy.

(2) Pig is a new language, easy to learn if you know languages similar to Perl. Hive is a subset of SQL with very simple variations to enable map-reduce-like computation. So, if you come from a SQL background you will find Hive QL extremely easy to pick up (many of your SQL queries will run as is), while if you come from a procedural programming background (without SQL knowledge) then Pig will be much more suitable for you. Furthermore, Hive is a bit easier to integrate with other systems and tools since it speaks the language they already speak (i.e. SQL).

You're right that HBase is a completely different game. HBase is not about being a high-level language that compiles to map-reduce; it is about allowing Hadoop to support lookups/transactions on key/value pairs. HBase allows you to (1) do quick random lookups, versus scanning all of the data sequentially, and (2) insert/update/delete in the middle, not just add/append.

-- amr
-
RE: PIG and Hive
Ricky Ho 2009-05-06, 22:56
Thanks Amr,

Without knowing the details of Hive, one constraint of the SQL model is that you can never generate more than one record from a single record. I don't know how this is done in Hive. Another question is whether a Hive script can take in user-defined functions?

Take the following word count as an example. Can you show me what the Pig script and the Hive script look like?

Map:
  Input: a line (a collection of words)
  Output: multiple [word, 1]

Reduce:
  Input: [word, [1, 1, 1, ...]]
  Output: [word, count]

Rgds,
Ricky
-
RE: PIG and Hive
Ashish Thusoo 2009-05-06, 23:21
Ricky,

For your particular example, Hive allows you to plug in a user-defined map and reduce script (in the language of your choice) within Hive QL (there are some minor extensions to SQL to support such a use case). So for your case you could do the following:

  FROM (
    FROM lines
    MAP line USING 'map_script' AS word, cnt
    DISTRIBUTE BY word
  ) a
  REDUCE a.word, a.cnt USING 'reduce_script';

The map_script and reduce_script contain the map and reduce logic (these can be simple shell scripts, Python scripts, PHP, Java, you name it). And they CAN generate multiple records for each input record. In the RDBMS world there is a concept of table functions that achieves the same effect, except that those are plugged into the FROM clause of a usual SQL statement.

Also, SQL does actually have a workaround that you can use to generate more than one record from a single record, provided the explosion factor is fixed. Suppose you want to generate x records for each input record: you can do a cartesian join with a dummy table that has x rows.

Ashish
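For illustration, the logic of the two scripts named in the query above might look like the following Python sketch. It is not from the thread: the function names map_lines and reduce_rows are hypothetical, and it assumes that rows are exchanged as tab-separated text, one row per line. The reduce side keeps its totals in a dictionary, so it does not assume that rows arrive sorted by word.

```python
from collections import defaultdict


def map_lines(lines):
    """Hypothetical 'map_script' logic: emit one 'word<TAB>1' row per word."""
    for line in lines:
        # A single input line can produce any number of output rows.
        for word in line.split():
            yield f"{word}\t1"


def reduce_rows(rows):
    """Hypothetical 'reduce_script' logic: sum the counts for each word.

    Totals are accumulated in a dict, so the result does not depend on
    the rows for a given word being adjacent in the input.
    """
    totals = defaultdict(int)
    for row in rows:
        word, cnt = row.rstrip("\n").split("\t")
        totals[word] += int(cnt)
    for word in sorted(totals):
        yield f"{word}\t{totals[word]}"
```

To use these as real scripts, each function would be wrapped in a small program that feeds it sys.stdin and prints every row it yields.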
-
RE: PIG and Hive
Ricky Ho 2009-05-07, 03:56
Ashish,

Thanks for your code. So the map_script is kind of like a subquery. Why do I need a customized reduce_script in the word count example? Can I just use "count(*) group by word"?

We cannot assume a fixed explosion factor; a line is a variable-length word array. Supporting the "collection" type in PIG seems to make the solution cleaner.

Rgds,
Ricky
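The difference between a fixed explosion factor and a variable one can be sketched in Python. This is only an illustration, not Hive or Pig code: explode_fixed and explode_variable are hypothetical names, and range(x) stands in for a dummy table with x rows.

```python
from itertools import product


def explode_fixed(records, x):
    """Cross join with a dummy table of x rows: exactly x copies per record."""
    dummy = range(x)  # stand-in for a dummy table that has x rows
    return [(rec, i) for rec, i in product(records, dummy)]


def explode_variable(lines):
    """Flatten each line into its words: the output size depends on the data."""
    return [(line, word) for line in lines for word in line.split()]
```

With the cross join, every input record yields the same number of output rows. With flattening, a three-word line yields three rows and a one-word line yields one, which is the case word count needs.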
-
RE: PIG and Hive
Olga Natkovich 2009-05-07, 00:32
Hi Ricky,

This is how the code will look in Pig.

A = load 'textdoc' using TextLoader() as (sentence: chararray);
B = foreach A generate flatten(TOKENIZE(sentence)) as word;
C = group B by word;
D = foreach C generate group, COUNT(B);
store D into 'wordcount';

Pig training (http://www.cloudera.com/hadoop-training-pig-tutorial) explains how the example above works.

Let me know if you have further questions.

Olga
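As a rough reading aid, the relations A to D in the Pig script can be mirrored step by step in Python. This is a sketch and not how Pig executes the script: word_count is a hypothetical name, the input is an in-memory list instead of a file, and str.split() is only a stand-in for TOKENIZE, so the two may split some inputs differently.

```python
from collections import defaultdict


def word_count(sentences):
    # A: one record per input line (what the load step yields)
    a = list(sentences)
    # B: tokenize each sentence and flatten, giving one record per word
    b = [word for sentence in a for word in sentence.split()]
    # C: group B by word; each group holds a bag of the matching records
    c = defaultdict(list)
    for word in b:
        c[word].append(word)
    # D: for each group, emit the group key and the size of its bag
    return {group: len(bag) for group, bag in c.items()}
```

The flatten step is where one input record becomes many output records, and the bag in C is the collection type that the counting step operates on.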
-
Re: PIG and Hive
Scott Carey 2009-05-07, 02:47
Pig currently also compiles similar operations (like Olga's example) into many fewer map reduce passes and is several times faster in general.

This will change as the optimizers and available optimizations converge, and in the future they won't differ much. But for now, Pig optimizes much better.

I ran a test that boiled down to SQL like this:

SELECT count(a.z), count(b.z), x, y from a, b where a.x = b.x and a.y = b.y group by x, y.

(and equivalent, but more verbose Pig)

Pig did it in one map reduce pass in about 2 minutes and Hive did it in 5 map reduce passes in 10 minutes.

There is nothing keeping Hive from applying the optimizations necessary to make that one pass, but those sorts of performance optimizations aren't there yet. That is expected; it is a younger project.

It would be useful if more of these higher-level tools shared work on the various optimizations. Pig and Hive (and perhaps CloudBase and Cascading?) could benefit from a shared map-reduce compiler.
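To make the test query concrete, here is a small Python sketch of what it computes. It says nothing about how Pig or Hive plan the query: join_and_count is a hypothetical name, rows are assumed to be dicts with keys 'x', 'y' and 'z', and None stands in for SQL NULL.

```python
from collections import defaultdict


def join_and_count(a_rows, b_rows):
    """Join a and b on (x, y), then count non-null a.z and b.z per (x, y)."""
    # Index b by the join key so each lookup is a hash probe.
    b_by_key = defaultdict(list)
    for b in b_rows:
        b_by_key[(b["x"], b["y"])].append(b)

    counts = defaultdict(lambda: [0, 0])  # (x, y) -> [count(a.z), count(b.z)]
    for a in a_rows:
        key = (a["x"], a["y"])
        for b in b_by_key.get(key, []):
            # count(col) counts only the joined rows where col is not null
            counts[key][0] += a["z"] is not None
            counts[key][1] += b["z"] is not None
    return {key: tuple(c) for key, c in counts.items()}
```

Note that the join key and the grouping key are the same pair (x, y). That is presumably what allows the join and the aggregation to share a single map reduce pass.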
-
RE: PIG and Hive
Ricky Ho 2009-05-07, 04:11
Thanks for Olga's example and Scott's comment.

My goal is to pick a higher-level parallel programming language (as an algorithm design / prototyping tool) to express my parallel algorithms in a concise way. The deeper I look into these, the stronger my feeling that PIG and HIVE are competitors rather than complements. I think a large set of problems can be done either way, without much difference in terms of skill set requirements.

At this moment, I am focused on the richness of the language model rather than on implementation optimizations. Supporting "collection" as well as the flatten operation in the language model seems to make PIG more powerful. Yes, you can achieve the same thing in Hive, but then it starts to look odd. Am I missing something, Hive folks?

Rgds,
Ricky
-
Re: PIG and HiveLuc Hunt 2009-05-07, 05:46
Ricky,
One thing to mention is, SQL support is on the Pig roadmap this year. --Yiping On Wed, May 6, 2009 at 9:11 PM, Ricky Ho <[EMAIL PROTECTED]> wrote: > Thanks for Olga example and Scott's comment. > > My goal is to pick a higher level parallel programming language (as a > algorithm design / prototyping tool) to express my parallel algorithms in a > concise way. The deeper I look into these, I have a stronger feeling that > PIG and HIVE are competitors rather than complementing each other. I think > a large set of problems can be done in either way, without much difference > in terms of skillset requirements. > > At this moment, I am focus in the richness of the language model rather > than the implementation optimization. Supporting "collection" as well as > the flatten operation in the language model seems to make PIG more powerful. > Yes, you can achieve the same thing in Hive but then it starts to look odd. > Am I missing something Hive folks ? > > Rgds, > Ricky > > -----Original Message----- > From: Scott Carey [mailto:[EMAIL PROTECTED]] > Sent: Wednesday, May 06, 2009 7:48 PM > To: [EMAIL PROTECTED] > Subject: Re: PIG and Hive > > Pig currently also compiles similar operations (like the below) into many > fewer map reduce passes and is several times faster in general. > > This will change as the optimizer and available optimizations converge and > in the future they won't differ much. But for now, Pig optimizes much > better. > > I ran a test that boiled down to SQL like this: > > SELECT count(a.z), count(b.z), x, y from a, b where a.x = b.x and a.y = b.y > group by x, y. > > (and equivalent, but more verbose Pig) > > Pig did it in one map reduce pass in about 2 minutes and Hive did it in 5 > map reduce passes in 10 minutes. > > There is nothing keeping Hive from applying the optimizations necessary to > make that one pass, but those sort of performance optimizations aren't > there > yet. That is expected, it is a younger project. 
> > It would be useful if more of these higher level tools shared work on the > various optimizations. Pig and Hive (and perhaps CloudBase and Cascading?) > could benefit from a shared map-reduce compiler. > > > On 5/6/09 5:32 PM, "Olga Natkovich" <[EMAIL PROTECTED]> wrote: > > > Hi Ricky, > > > > This is how the code will look in Pig. > > > > A = load 'textdoc' using TextLoader() as (sentence: chararray); > > B = foreach A generate flatten(TOKENIZE(sentence)) as word; > > C = group B by word; > > D = foreach C generate group, COUNT(B); > > store D into 'wordcount'; > > > > Pig training (http://www.cloudera.com/hadoop-training-pig-tutorial) > > explains how the example above works. > > > > Let me know if you have further questions. > > > > Olga > > > > > >> -----Original Message----- > >> From: Ricky Ho [mailto:[EMAIL PROTECTED]] > >> Sent: Wednesday, May 06, 2009 3:56 PM > >> To: [EMAIL PROTECTED] > >> Subject: RE: PIG and Hive > >> > >> Thanks Amr, > >> > >> Without knowing the details of Hive, one constraint of SQL > >> model is you can never generate more than one records from a > >> single record. I don't know how this is done in Hive. > >> Another question is whether the Hive script can take in > >> user-defined functions ? > >> > >> Using the following word count as an example. Can you show > >> me how the Pig script and Hive script looks like ? > >> > >> Map: > >> Input: a line (a collection of words) > >> Output: multiple [word, 1] > >> > >> Reduce: > >> Input: [word, [1, 1, 1, ...]] > >> Output: [word, count] > >> > >> Rgds, > >> Ricky > >> > >> -----Original Message----- > >> From: Amr Awadallah [mailto:[EMAIL PROTECTED]] > >> Sent: Wednesday, May 06, 2009 3:14 PM > >> To: [EMAIL PROTECTED] > >> Subject: Re: PIG and Hive > >> > >>> The difference between PIG and Hive seems to be pretty > >> insignificant. > >> > >> Difference between Pig and Hive is significant, specifically: > >> > >> (1) Pig doesn't require underlying structure to the data, +
-
Re: PIG and HiveAmr Awadallah 2009-05-07, 06:20
Yiping,
(1) Any ETA for when that will become available?
(2) Where can we read more about the SQL functionality it will support?
(3) Where is the JIRA for this?

Thanks,
-- amr
-
Re: PIG and HiveAlan Gates 2009-05-07, 14:14
SQL has been on Pig's roadmap for some time, see http://wiki.apache.org/pig/ProposedRoadMap
We would like to add SQL support to Pig sometime this year. We don't have an ETA or a JIRA for it yet.

Alan.
-
RE: PIG and HiveNamit Jain 2009-05-07, 17:12
SELECT count(a.z), count(b.z), x, y from a, b where a.x = b.x and a.y = b.y
group by x, y.

If you run EXPLAIN on the above query, you will see that you are performing a Cartesian product followed by the filter. It would be better to rewrite the query as:

SELECT count(a.z), count(b.z), a.x, a.y from a JOIN b ON (a.x = b.x and a.y = b.y) group by a.x, a.y;

The EXPLAIN output should show 2 map-reduce jobs and a fetch task (which is not a map-reduce job).

Can you send me the exact Hive query that you are trying, along with the schema of tables 'a' and 'b'?

In order to see the plan, you can do:

EXPLAIN <QUERY>

Thanks,
-namit
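Namit's point about a Cartesian product followed by a filter, versus an explicit join on (x, y), can be illustrated with a small in-memory sketch. This is only an analogy in Python, not how Hive plans or executes queries; the toy tables and the function names (`cross_then_filter`, `hash_join`, `counts_by_key`) are hypothetical and chosen for illustration.

```python
from collections import defaultdict
from itertools import product

# Hypothetical toy tables; each row is an (x, y, z) tuple.
a = [(1, 1, "a1"), (1, 1, "a2"), (2, 5, "a3")]
b = [(1, 1, "b1"), (2, 5, "b2"), (3, 3, "b3")]


def cross_then_filter(a, b):
    """Pair every row of a with every row of b, then keep the matches."""
    examined = 0
    out = []
    for ra, rb in product(a, b):
        examined += 1
        if ra[0] == rb[0] and ra[1] == rb[1]:
            out.append((ra, rb))
    return out, examined


def hash_join(a, b):
    """Bucket b by the join key (x, y); pair each row of a only with its bucket."""
    buckets = defaultdict(list)
    for rb in b:
        buckets[(rb[0], rb[1])].append(rb)
    examined = 0
    out = []
    for ra in a:
        for rb in buckets.get((ra[0], ra[1]), []):
            examined += 1
            out.append((ra, rb))
    return out, examined


def counts_by_key(pairs):
    """GROUP BY x, y with count(a.z), count(b.z) over the joined pairs."""
    result = defaultdict(lambda: [0, 0])
    for ra, rb in pairs:
        key = (ra[0], ra[1])
        result[key][0] += 1  # count(a.z)
        result[key][1] += 1  # count(b.z)
    return {k: tuple(v) for k, v in result.items()}


cross_pairs, cross_examined = cross_then_filter(a, b)
join_pairs, join_examined = hash_join(a, b)
```

Both functions return the same joined pairs, so the grouped counts agree; the difference is the number of row pairs each one has to examine, which grows with the product of the table sizes in the first case.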
-
Re: PIG and HiveScott Carey 2009-05-07, 18:08
The work was done 3 months ago, and the exact query I used may not have been exactly the one quoted, but it was functionally the same: two sources, with arithmetic aggregation on each, inner-joined on a small set of values. We wrote a hand-coded map-reduce job, a Pig script, and a Hive query against the same data and performance-tested them.
At that time, even "SELECT count(a.z) FROM a group by a.z" took 3 phases (not sure how many were fetch versus M/R). Since then, we abandoned Hive for reassessment at a later date. All releases of Hive since then (http://hadoop.apache.org/hive/docs/r0.3.0/changes.html) don't have anything under "optimizations", and few of the enhancements listed suggest that there has been much change on the performance front (yet).

Can Hive not yet detect an implicit inner join in a WHERE clause?

Our use case would have less optimization-savvy people querying data ad hoc, so being able to detect implicit joins, collapse subselects, etc. is a requirement. I'm not going to go sitting over the shoulder of everyone who wants to do some ad hoc data analysis and tell them how to rewrite their queries to perform better.

That is a big weakness of SQL that affects everything that uses it: there are so many equivalent or near-equivalent forms of expression, and they often lead to implementation-specific performance preferences. I'm sure Hive will get over that hump, but it takes time.

I'm certainly interested in it and will have a deeper look again in the second half of this year.
-
RE: PIG and HiveAshish Thusoo 2009-05-07, 18:18
OK, that explains a lot. When we started Hive, our immediate use case was to do group-bys on data with a lot of skew on the grouping keys. In that scenario it is better to use 2 map/reduce jobs: the first randomly distributes the data and generates partial sums, and the second computes the complete sums. This was originally the default plan in Hive. Since then we have moved the default to a single map/reduce job, with

hive.exec.skeweddata = true

as a parameter to trigger the older behavior.

We already collapse subselects. We already do predicate pushdown and column pruning. We don't yet do subexpression elimination, but that will happen soon. Implicit detection of an inner join is possible, though we never had a JIRA asking for it. Will open one soon...

I am sure you will not be disappointed by the capabilities of the system when you try it again. Feel free to mail [EMAIL PROTECTED] for any clarifications/help/optimization questions.

Cheers,
Ashish
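The two-job plan for skewed group-bys that Ashish describes (random distribution with partial sums, then a second pass for the complete sums) can be sketched roughly as follows. This is a minimal in-memory illustration in Python, not Hive's implementation; the function names `stage_one` and `stage_two`, the reducer count, and the sample data are all assumptions made up for the example.

```python
import random
from collections import Counter


def stage_one(records, num_reducers, seed=0):
    """First job: scatter rows at random, then pre-aggregate per reducer."""
    rng = random.Random(seed)
    partials = [Counter() for _ in range(num_reducers)]
    for key in records:
        # Random placement spreads a hot key across all reducers
        # instead of sending every occurrence to a single one.
        partials[rng.randrange(num_reducers)][key] += 1
    return partials


def stage_two(partials):
    """Second job: combine the partial counts for each key into final counts."""
    totals = Counter()
    for partial in partials:
        totals.update(partial)
    return totals


# Hypothetical skewed input: one hot key dominates the data.
records = ["hot"] * 1000 + ["cold_a"] * 3 + ["cold_b"] * 2
partials = stage_one(records, num_reducers=4)
totals = stage_two(partials)
```

The second stage only sees at most one partial count per key per first-stage reducer, so the hot key no longer overloads a single reducer; the cost is an extra job, which is presumably why the single-job plan makes sense as the default when the keys are not heavily skewed.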
-
RE: PIG and HiveAshish Thusoo 2009-05-07, 18:10
Scott,
Namit is actually correct. If you run EXPLAIN on the query that he sent out, you get only 2 map/reduce jobs with Hive, not 5. We have verified that, and it is consistent with what we would expect in this case. We would be very interested to know the exact query that you used, as 5 map/reduce jobs is somewhat of a surprise to us.

Ricky,

Without SQL (at least PIG does not have that now), it is really not usable for people like data analysts at this time: people who have been brought up on SQL and do not necessarily have the skill set to learn another imperative programming language. PIG appeals more to engineering users. Our approach has been different even in this respect, though: we have followed a philosophy of allowing even engineering users to write their custom code in an imperative programming language of their choice and plug that customized logic into different parts of the data flow. Again, this idea may appeal to some and not to others, and from the language perspective it is really a subjective call for engineering users.

Regarding collect and flatten, these have been on the Hive roadmap for quite some time (just as SQL has been on the Pig roadmap :)) and we will put them into the language in some future release.

Ashish
-
RE: PIG and HiveRicky Ho 2009-05-08, 14:35
Great! Glad to see things are merging... At that point, PIG and Hive will be even more competitive with each other.
Rgds,
Ricky
|