Pig, mail # user - Regex operand - chararray only


RE: Regex operand - chararray only
Santhosh Srinivasan 2009-03-26, 16:38
Tamir,

x2 = FILTER x1 BY ( IsEmpty(p3) AND (IsEmpty(rdt1) OR (rdt1.to matches '.*com')) );

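Here, projecting the column 'to' from the bag 'rdt1' gives you a bag of
chararray, not a single chararray. The regex operand must be a chararray,
which is why 'rdt1.to matches ...' is rejected.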

You could write a UDF that takes this bag, iterates over its contents, and
does a regex match on each item.
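
A minimal sketch of the per-item matching such a UDF would perform, written
as plain Java so it stands on its own. The class and method names
(AnyItemMatches, anyMatches) are hypothetical and not from this thread. It
also assumes that "at least one item matches" is the intended semantics, and
that the regex must match the whole string, as Pig's matches operator does.
In an actual UDF this loop would most likely sit inside the exec(Tuple)
method of a class extending org.apache.pig.FilterFunc, reading each
chararray out of the tuples of the input bag.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Pig: the core loop a bag-matching UDF
// could run over the chararray values taken from the projected bag.
final class AnyItemMatches {
    private AnyItemMatches() {}

    // Returns true if at least one non-null item fully matches the regex.
    // An empty collection yields false.
    static boolean anyMatches(Iterable<String> items, String regex) {
        Pattern pattern = Pattern.compile(regex);
        for (String item : items) {
            // matches() requires the entire string to match the pattern.
            if (item != null && pattern.matcher(item).matches()) {
                return true;
            }
        }
        return false;
    }
}
```

Returning false for an empty bag keeps the empty case explicit, so the
script can still decide it separately with IsEmpty(rdt1).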

Thanks,
Santhosh

-----Original Message-----
From: Tamir Kamara [mailto:[EMAIL PROTECTED]]
Sent: Thursday, March 26, 2009 12:12 AM
To: [EMAIL PROTECTED]
Subject: Regex operand - chararray only

Hi,

Following a COGROUP I would like to filter results by one of the fields,
but I'm getting an error: Operand of Regex can be CharArray only. The
relevant lines in my script are:
x1 = COGROUP p3 BY domain, rdt1 BY from, f4 BY target;
x2 = FILTER x1 BY ( IsEmpty(p3) AND (IsEmpty(rdt1) OR (rdt1.to matches '.*com')) );
x3 = FOREACH x2 GENERATE flatten(f4);

describe of x1
x1: {group: chararray,p3: {domain: chararray},rdt1: {from: chararray,to: chararray},f4: {source: chararray,target: chararray}}

I'm not sure why the error occurs. Is it because rdt1 inside x1 is a bag
(multiple rdt1 can exist in the same group)?

I can get around this with this script:
x1 = COGROUP p3 BY domain, rdt1 BY from, f4 BY target parallel 32;
x2 = FOREACH x1 GENERATE flatten(f4), COUNT(p3) as p3_count, COUNT(rdt1) as rdt1_count, flatten(rdt1.to);
x3 = FILTER x2 BY ( p3_count==0 AND (rdt1_count==0 OR (to matches '.*com')) );
x4 = FOREACH x3 GENERATE source, target;

but it seems too complicated to me. Is there a way to make my first
version work?

Thanks in advance,
Tamir