Pig, mail # user - Question about immediately projecting on a strsplit() return tuple...


Re: Question about immediately projecting on a strsplit() return tuple...
Thejas M Nair 2011-05-17, 20:41

On 5/17/11 12:20 PM, "Daniel Eklund" <[EMAIL PROTECTED]> wrote:

> I can absolutely open a ticket...  Can you confirm though that the expression
> I am using
>   STRSPLIT(timestamp, ' ', 1).$0
> is a valid means of projecting/indexing the first element of the tuple?
>

Yes, that is correct.
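
One caveat on the third argument, assuming STRSPLIT hands its regex and limit
straight to java.lang.String.split(regex, limit) (an assumption about the
UDF's implementation that this thread does not confirm): under that method's
documented semantics a limit of 1 yields at most one element, so the string is
returned unsplit and .$0 would be the full timestamp rather than the date. The
class and helper names below are hypothetical, and the sketch only illustrates
the Java behavior, not Pig itself.

```java
import java.util.Arrays;

// Illustrates java.lang.String.split(regex, limit) semantics only.
// Assumption (not confirmed in this thread): Pig's STRSPLIT forwards its
// arguments to this method unchanged.
final class SplitLimitDemo {

    // Hypothetical helper: split `source` on `regex` with `limit` and return
    // element 0, mirroring what STRSPLIT(source, regex, limit).$0 would project
    // under the assumption above.
    static String firstField(String source, String regex, int limit) {
        return source.split(regex, limit)[0];
    }

    public static void main(String[] args) {
        String timestamp = "2011/3/2 12:32";

        // limit = 1: the result has at most one element, so nothing is split.
        System.out.println(Arrays.toString(timestamp.split(" ", 1)));

        // limit = 2: the pattern is applied once, giving date and time.
        System.out.println(Arrays.toString(timestamp.split(" ", 2)));

        // First element with limit = 2 is the date portion.
        System.out.println(firstField(timestamp, " ", 2));
    }
}
```

If that assumption holds, a limit of 2 (or leaving the limit out) would be
needed for .$0 to be just the date portion that the 'day' column contains.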

> I am using the Cloudera release CDH3U0 of Pig, which is (I believe) 0.8... I
> will investigate 0.8.1.
If you are using CDH, I think you would need to use the Pig jar built without
Hadoop in it, and add the CDH Hadoop jar to the classpath.

To build the 0.8.1 jar without Hadoop:
  svn co http://svn.apache.org/repos/asf/pig/branches/branch-0.8
  cd branch-0.8
  ant jar-withouthadoop

-Thejas

>
> thanks,
> daniel
>
> On Tue, May 17, 2011 at 1:39 PM, Thejas M Nair <[EMAIL PROTECTED]> wrote:
>> Are you using the 0.8.1 release? It has several bug fixes.
>> The new logical plan was introduced in 0.8 to make it easier to write
>> optimization rules. The error seems to be caused by a bug in the code
>> related to the new logical plan, which is why disabling the new logical
>> plan gets it working.
>>
>> Can you try 0.8.1 and, if it fails, send the entire stack trace from the
>> Pig log file? It would be even better if you could open a Pig JIRA ticket.
>>
>> Thanks
>> Thejas
>>
>>
>>
>> On 5/17/11 8:32 AM, "Daniel Eklund" <[EMAIL PROTECTED]> wrote:
>>
>>> Hey all,
>>>
>>> I have one file A with a 'day' column like "2011/3/2"  and another B with a
>>> column 'timestamp' like "2011/3/2 12:32"  ...  I want to join on these two
>>> fields in these records.
>>> I do something like this:
>>>
>>> A_and_B = JOIN A by (tracking_id, day) LEFT OUTER,
>>>                B by (tracking_id,  STRSPLIT(timestamp, ' ', 1).$0);
>>>
>>> where you can see I am projecting out the first element of the tuple
>>> returned by strsplit...
>>>
>>> When I run this I get an error of the form:
>>>     org.apache.pig.tools.pigstats.ScriptState - Pig features used in the script: HASH_JOIN
>>>     ERROR 2042: Error in new logical plan. Try -Dpig.usenewlogicalplan=false.
>>> Putting that -D property before the "-x local", I see that the join
>>> appears to be working. Yay.
>>>
>>> I am happy that things seem to be working, though I would appreciate some
>>> feedback from those in the know as to why setting the property fixes
>>> this and whether there is a more canonical way of doing this join.
>>>
>>> thanks,
>>> daniel
>>>
>>