Pig, mail # user - Can't JOIN self?


Re: Can't JOIN self?
Russell Jurney 2012-07-23, 21:48
Thanks, that was my thinking. If I make an alias and self-JOIN to it, it
should work. Self-joins this way are really powerful.

On Mon, Jul 23, 2012 at 2:36 PM, Sean Timm <[EMAIL PROTECTED]> wrote:

> It seems the self join should work in Pig 0.10 if using an alias, but alas
> it doesn't.  See Jira PIG-2630:
> https://issues.apache.org/jira/browse/PIG-2630
>
> -Sean
>
>
> On 7/20/2012 12:01 PM, Alan Gates wrote:
>
>> It isn't a bug that you need to declare the join twice in your script.
>>  That is necessary for clarity and semantic correctness.  That is, if we
>> allowed:
>>
>> A = load 'bla';
>> B = join A by user, A by user;
>>
>> then you'd have two user fields in B with no way to disambiguate.
>> What's a bug (or missed optimization opportunity) is that we actually
>> double read and shuffle the data.  We could optimize here and only read
>> and shuffle one copy and then do the join in the reduce.
>>
>> Alan.
>>
>> On Jul 20, 2012, at 12:53 AM, Dmitriy Ryaboy wrote:
>>
>>> It's kind of a waste of IO and mappers. If not a bug, it's an
>>> optimization opportunity.
>>>
>>> On Jul 19, 2012, at 10:34 PM, Bill Graham <[EMAIL PROTECTED]> wrote:
>>>
>>>> No, it isn't a bug as I see it. You need to load the two relations
>>>> separately because a join is across two separate data sources.
>>>>
>>>>
>>>> On Thu, Jul 19, 2012 at 10:10 PM, Russell Jurney
>>>> <[EMAIL PROTECTED]> wrote:
>>>>
>>>>> So it is a bug? Because Pig will not let me self JOIN. I have to LOAD
>>>>> the data twice.
>>>>>
>>>>> On Thu, Jul 19, 2012 at 9:49 PM, Bill Graham <[EMAIL PROTECTED]>
>>>>> wrote:
>>>>>
>>>>>> No, to Pig a self join is just like a regular join across two
>>>>>> different relations. It just happens to be to the same input data.
>>>>>>
>>>>>> On Thu, Jul 19, 2012 at 8:39 PM, Russell Jurney <[EMAIL PROTECTED]> wrote:
>>>>>>
>>>>>>> Is this a bug?
>>>>>>>
>>>>>>> On Thu, Jul 19, 2012 at 8:00 PM, Robert Yerex <[EMAIL PROTECTED]> wrote:
>>>>>>>
>>>>>>>> The only way to get it to work is to load a second copy.
>>>>>>>>
>>>>>>>> On Thu, Jul 19, 2012 at 7:46 PM, Russell Jurney <[EMAIL PROTECTED]> wrote:
>>>>>>>>
>>>>>>>>> Note: this works if I LOAD a new, 2nd relation and do the join.
>>>>>>>>>
>>>>>>>>> On Thu, Jul 19, 2012 at 7:34 PM, Russell Jurney <[EMAIL PROTECTED]> wrote:
>>>>>>>>>
>>>>>>>>>> I have a problem where I can't join a relation to itself on a
>>>>>>>>>> different field.
>>>>>>>>>>
>>>>>>>>>> describe pairs
>>>>>>>>>> pairs: {from: chararray,to: chararray,message_id: chararray,in_reply_to: chararray}
>>>>>>>>>>
>>>>>>>>>> pairs2 = pairs;
>>>>>>>>>>
>>>>>>>>>> with_reply = join pairs by in_reply_to, pairs2 by message_id;
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> I get this error:
>>>>>>>>>>
>>>>>>>>>> 2012-07-19 19:31:16,927 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1200: Pig script failed to parse:
>>>>>>>>>> <line 20, column 6> pig script failed to validate:
>>>>>>>>>> org.apache.pig.impl.logicalLayer.FrontendException: ERROR 2225: Projection with nothing to reference!
>>>>>>>>>> 2012-07-19 19:31:16,928 [main] ERROR org.apache.pig.tools.grunt.Grunt - Failed to parse: Pig script failed to parse:
>>>>>>>>>> <line 20, column 6> pig script failed to validate:
>>>>>>>>>> org.apache.pig.impl.logicalLayer.FrontendException: ERROR 2225: Projection with nothing to reference!
>>>>>>>>>> at org.apache.pig.parser.QueryParserDriver.parse(QueryParserDriver.java:182)
>>>>>>>>>> at org.apache.pig.PigServer$Graph.validateQuery(PigServer.
Russell Jurney twitter.com/rjurney [EMAIL PROTECTED] datasyndrome.com
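
For reference, a minimal sketch of the workaround the thread converges on: LOAD the same input twice under two names, then JOIN the two relations. The path 'pairs.tsv' and the output aliases reply_id and original_id are hypothetical; the field names are taken from the `describe pairs` output quoted above. This is an illustration under those assumptions, not a tested script from any of the posters.

```pig
-- Hypothetical input path; schema copied from the `describe pairs` output in the thread.
pairs  = LOAD 'pairs.tsv' AS (from:chararray, to:chararray,
                              message_id:chararray, in_reply_to:chararray);

-- Second LOAD of the same data, so the JOIN sees two distinct relations.
pairs2 = LOAD 'pairs.tsv' AS (from:chararray, to:chararray,
                              message_id:chararray, in_reply_to:chararray);

-- Match each message to the message it replies to.
with_reply = JOIN pairs BY in_reply_to, pairs2 BY message_id;

-- After the JOIN, fields are disambiguated with the relation:: prefix.
-- reply_id and original_id are hypothetical output names.
replies = FOREACH with_reply GENERATE
              pairs::message_id  AS reply_id,
              pairs2::message_id AS original_id;
```

As Alan and Dmitriy note above, this reads and shuffles the input twice, which is the cost of the workaround.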