Pig, mail # user - Regex operand - chararray only

RE: Regex operand - chararray only
Santhosh Srinivasan 2009-03-26, 16:38

x2 = FILTER x1 BY ( IsEmpty(p3) AND (IsEmpty(rdt1) OR (rdt1.to matches
'.*com')) );

Here, projecting the column 'to' from the bag 'rdt1' gives you a bag of
chararray rather than a single chararray, and 'matches' accepts only a
chararray operand.

You could write a UDF that takes this bag, iterates over its contents, and
does a regex match on each item.
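
As a rough sketch of the matching logic such a UDF would need, here is the
loop in plain Java with no Pig classes, so it compiles on its own. The names
BagRegex and anyMatches are hypothetical. In a real UDF this logic would sit
inside a class extending EvalFunc<Boolean>, with exec(Tuple) pulling each
string out of the DataBag. It also assumes that a whole-string match is the
behaviour you want from the regex.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Pig: the per-item regex check a
// bag-aware UDF would perform on the contents of rdt1.to.
class BagRegex {
    // Returns true if at least one non-null item matches the regex in full.
    // A null or empty collection yields false.
    static boolean anyMatches(Iterable<String> items, String regex) {
        if (items == null) {
            return false;
        }
        Pattern pattern = Pattern.compile(regex); // compile once, reuse per item
        for (String item : items) {
            if (item != null && pattern.matcher(item).matches()) {
                return true;
            }
        }
        return false;
    }
}
```

An empty bag returns false here, so the IsEmpty(rdt1) check in the FILTER
would still be needed alongside the UDF call.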


-----Original Message-----
From: Tamir Kamara [mailto:[EMAIL PROTECTED]]
Sent: Thursday, March 26, 2009 12:12 AM
Subject: Regex operand - chararray only


Following a COGROUP, I would like to filter the results by one of the fields,
but I'm getting an error: Operand of Regex can be CharArray only. The
lines in my script are:
x1 = COGROUP p3 BY domain, rdt1 BY from, f4 BY target;
x2 = FILTER x1 BY ( IsEmpty(p3) AND (IsEmpty(rdt1) OR (rdt1.to matches
'.*com')) );
x3 = FOREACH x2 GENERATE flatten(f4);

describe of x1
x1: {group: chararray,p3: {domain: chararray},rdt1: {from: chararray,to:
chararray},f4: {source: chararray,target: chararray}}

I'm not sure why the error occurs. Is it because rdt1 inside x1 is a bag,
so multiple rdt1 tuples can exist in the same group?

I can get around this with this script:
x1 = COGROUP p3 BY domain, rdt1 BY from, f4 BY target parallel 32;
x2 = FOREACH x1 GENERATE flatten(f4), COUNT(p3) as p3_count, COUNT(rdt1) as
rdt1_count, flatten(rdt1.to);
x3 = FILTER x2 BY ( p3_count==0 AND (rdt1_count==0 OR (to matches
x4 = FOREACH x3 GENERATE source, target;

but it seems too complicated to me. Is there a way to make my first
script work?

Thanks in advance,