Pig >> mail # user >> Can't JOIN self?


Russell Jurney 2012-07-20, 02:34
Russell Jurney 2012-07-20, 02:46
Robert Yerex 2012-07-20, 03:00
Russell Jurney 2012-07-20, 03:39
Bill Graham 2012-07-20, 04:49
Russell Jurney 2012-07-20, 05:10
Bill Graham 2012-07-20, 05:34
Dmitriy Ryaboy 2012-07-20, 07:53
Alan Gates 2012-07-20, 16:01
Sean Timm 2012-07-23, 21:36
Russell Jurney 2012-07-23, 21:48
Re: Can't JOIN self?
I'm realizing that I need to do this constantly, otherwise I can't make
much of anything. I used to do this, I think, maybe Pig let it slide.

On Mon, Jul 23, 2012 at 2:48 PM, Russell Jurney <[EMAIL PROTECTED]> wrote:

> Thanks, that was my thinking. If I make an alias and self-JOIN to it, it
> should work. Self-joins this way are really powerful.
>
>
> On Mon, Jul 23, 2012 at 2:36 PM, Sean Timm <[EMAIL PROTECTED]> wrote:
>
>> It seems the self join should work in Pig 0.10 if using an alias, but alas
>> it doesn't.  See Jira PIG-2630: https://issues.apache.org/jira/browse/PIG-2630
>>
>> -Sean
>>
>>
>> On 7/20/2012 12:01 PM, Alan Gates wrote:
>>
>>> It isn't a bug that you need to declare the join twice in your script.
>>>  That is necessary for clarity and semantic correctness.  That is, if we
>>> allowed:
>>>
>>> A = load 'bla';
>>> B = join A by user, A by user;
>>>
>>> then you'd have two user fields in B with no way to disambiguate.
>>>  What's a bug (or missed optimization opportunity) is that we actually
>>> double read and shuffle the data.  We could optimize here and only read
>>> and shuffle one copy and then do the join in the reduce.
>>>
>>> Alan.
>>>
>>> On Jul 20, 2012, at 12:53 AM, Dmitriy Ryaboy wrote:
>>>
>>>> It's kind of a waste of I/O and mappers. If not a bug, it's an
>>>> optimization opportunity.
>>>>
>>>> On Jul 19, 2012, at 10:34 PM, Bill Graham <[EMAIL PROTECTED]> wrote:
>>>>
>>>>> No, it isn't a bug as I see it. You need to load the two relations
>>>>> separately because a join is across two separate data sources.
>>>>>
>>>>>
>>>>> On Thu, Jul 19, 2012 at 10:10 PM, Russell Jurney <[EMAIL PROTECTED]> wrote:
>>>>>
>>>>>> So it is a bug? Because Pig will not let me self JOIN. I have to LOAD the
>>>>>> data twice.
>>>>>>
>>>>>> On Thu, Jul 19, 2012 at 9:49 PM, Bill Graham <[EMAIL PROTECTED]>
>>>>>> wrote:
>>>>>>
>>>>>>> No, to Pig a self join is just like a regular join across two different
>>>>>>> relations. It just happens to be to the same input data.
>>>>>>>
>>>>>>> On Thu, Jul 19, 2012 at 8:39 PM, Russell Jurney <[EMAIL PROTECTED]> wrote:
>>>>>>>
>>>>>>>> Is this a bug?
>>>>>>>>
>>>>>>>> On Thu, Jul 19, 2012 at 8:00 PM, Robert Yerex <[EMAIL PROTECTED]> wrote:
>>>>>>>>
>>>>>>>>> The only way to get it to work is to load a second copy.
>>>>>>>>>
>>>>>>>>> On Thu, Jul 19, 2012 at 7:46 PM, Russell Jurney <[EMAIL PROTECTED]> wrote:
>>>>>>>>>
>>>>>>>>>> Note: this works if I LOAD a new, 2nd relation and do the join.
>>>>>>>>>>
>>>>>>>>>> On Thu, Jul 19, 2012 at 7:34 PM, Russell Jurney <[EMAIL PROTECTED]> wrote:
>>>>>>>>>>
>>>>>>>>>>> I have a problem where I can't join a relation to itself on a different field.
>>>>>>>>>>>
>>>>>>>>>>> describe pairs
>>>>>>>>>>> pairs: {from: chararray,to: chararray,message_id: chararray,in_reply_to: chararray}
>>>>>>>>>>>
>>>>>>>>>>> pairs2 = pairs;
>>>>>>>>>>>
>>>>>>>>>>> with_reply = join pairs by in_reply_to, pairs2 by message_id;
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> I get this error:
>>>>>>>>>>>
>>>>>>>>>>> 2012-07-19 19:31:16,927 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1200: Pig script failed to parse:
>>>>>>>>>>> <line 20, column 6> pig script failed to validate: org.apache.pig.impl.logicalLayer.FrontendException: ERROR 2225: Projection with nothing to reference!
>>>>>>>>>>> 2012-07-19 19:31:16,928 [main] ERROR org.apache.pig.tools.grunt.Grunt - Failed to parse: Pig script failed to parse:
>>>>>>>>>>> <line 20, column 6> pig script failed to validate:
Russell Jurney twitter.com/rjurney [EMAIL PROTECTED] datasyndrome.com
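
For reference, the workaround the thread converges on (load the input a second time under its own alias, then join) might look roughly like the sketch below. It reuses the placeholder path 'bla' and the user field from Alan Gates's example; the age field, the A2 alias, and the output names are hypothetical and only there to show how the :: prefix tells the two sides apart after the join.

```pig
-- Sketch only: 'bla' and user come from the example in the thread;
-- age, A2, and the output field names are illustrative assumptions.

-- Load the same input twice so each side of the join has its own alias.
A  = LOAD 'bla' AS (user:chararray, age:int);
A2 = LOAD 'bla' AS (user:chararray, age:int);

-- Join the two relations as if they were different data sources.
B = JOIN A BY user, A2 BY user;

-- After the join, alias::field picks a field from one specific side.
C = FOREACH B GENERATE A::user AS user, A::age AS age, A2::age AS other_age;

DUMP C;
```

As noted in the thread, this reads and shuffles the input twice, so it costs extra I/O and mappers on large inputs.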