Pig >> mail # user >> Regex operand - chararray only


RE: Regex operand - chararray only
Tamir,

x2 = FILTER x1 BY ( IsEmpty(p3) AND (IsEmpty(rdt1) OR (rdt1.to matches '.*com')) );

Here, projecting the column 'to' from the bag 'rdt1' gives you a bag of
chararrays rather than a single chararray, which is why the regex operand
check fails.

You could write a UDF that takes this bag, iterates over its contents, and
does a regex match on each item.
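
As a rough illustration of the per-item matching such a UDF would need, here is a minimal sketch in plain Java. The class name BagRegexMatch and the method anyMatches are hypothetical and not part of Pig; the bag is modelled as an Iterable of strings so the code runs without the Pig jars. It assumes that "at least one item matches" is the intended meaning, and it matches against the whole string, as Pig's matches operator does. In a real UDF this logic would sit inside the exec method of a class extending Pig's FilterFunc, which is not shown here.

```java
import java.util.regex.Pattern;

// Hypothetical helper (not part of Pig): the matching logic a bag-aware
// regex UDF could delegate to.
final class BagRegexMatch {
    private BagRegexMatch() {}

    // Returns true if at least one non-null item matches the whole regex.
    // An empty collection yields false, so the IsEmpty(...) check in the
    // FILTER expression is still needed to keep groups with no rdt1 rows.
    static boolean anyMatches(Iterable<String> items, String regex) {
        Pattern pattern = Pattern.compile(regex); // compile once per call
        for (String item : items) {
            if (item != null && pattern.matcher(item).matches()) {
                return true;
            }
        }
        return false;
    }
}
```

Because the match is against the whole string, a pattern such as '.com' only accepts four-character values; '.*com' is needed to accept anything ending in "com".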

Thanks,
Santhosh

-----Original Message-----
From: Tamir Kamara [mailto:[EMAIL PROTECTED]]
Sent: Thursday, March 26, 2009 12:12 AM
To: [EMAIL PROTECTED]
Subject: Regex operand - chararray only

Hi,

Following a COGROUP I would like to filter results by one of the fields, but
I'm getting an error: Operand of Regex can be CharArray only. The relevant
lines in my script are:
x1 = COGROUP p3 BY domain, rdt1 BY from, f4 BY target;
x2 = FILTER x1 BY ( IsEmpty(p3) AND (IsEmpty(rdt1) OR (rdt1.to matches '.*com')) );
x3 = FOREACH x2 GENERATE flatten(f4);

The output of describe x1 is:
x1: {group: chararray,p3: {domain: chararray},rdt1: {from: chararray,to: chararray},f4: {source: chararray,target: chararray}}

I'm not sure why the error occurs. Is it because rdt1 inside x1 is a bag,
so multiple rdt1 tuples can exist in the same group?

I can get around this with this script:
x1 = COGROUP p3 BY domain, rdt1 BY from, f4 BY target parallel 32;
x2 = FOREACH x1 GENERATE flatten(f4), COUNT(p3) as p3_count, COUNT(rdt1) as rdt1_count, flatten(rdt1.to);
x3 = FILTER x2 BY ( p3_count==0 AND (rdt1_count==0 OR (to matches '.*com')) );
x4 = FOREACH x3 GENERATE source, target;

but it seems too complicated to me. Is there a way to make my first version
work?

Thanks in advance,
Tamir