Pig >> mail # user >> Filter bag with multiple output


Re: Filter bag with multiple output
You can do the SPLIT outside the nested FOREACH. I'm assuming you have a UDF
defined for VALID.

So, your script can be written as:

rawRecords = LOAD '/data' as ...;
grouped = GROUP rawRecords BY msisdn;
validAndNotValidRecords = FOREACH grouped {
             ordered = ORDER rawRecords BY ts;
             GENERATE group as group_key, ordered as data;
};
SPLIT validAndNotValidRecords INTO validRecords IF VALID(data),
      invalidRecords OTHERWISE;
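
The thread never shows what VALID does, so here is a rough plain-Python sketch of the kind of check it might perform on one group's bag. Everything in it is an assumption: the names `valid` and `is_valid_coordinate` are hypothetical, the field order (msisdn, longitude, latitude, ts) is taken from the sample input in the original question, and "valid" is read as "neither null nor 0.0". Wiring it into Pig as a real UDF (registration, output schema) is not shown.

```python
# Hypothetical stand-in for the VALID UDF; not the author's implementation.
# Assumed field order per record: (msisdn, longitude, latitude, ts).

def is_valid_coordinate(value):
    """A coordinate counts as valid when it is neither null (None) nor 0.0."""
    return value is not None and value != 0.0


def valid(data):
    """Return True only if every record in the bag has usable coordinates."""
    return all(
        is_valid_coordinate(longitude) and is_valid_coordinate(latitude)
        for (_msisdn, longitude, latitude, _ts) in data
    )


# One bag per msisdn, shaped like the `data` field generated above.
bag_1 = [(1, 20.30, 40.50, 123), (1, 0.0, None, 456)]  # from the sample input
bag_3 = [(3, 10.0, 11.0, 100)]  # made-up group containing only valid records

print(valid(bag_1))
print(valid(bag_3))
```

Note that SPLIT evaluates its condition once per tuple of validAndNotValidRecords, and each tuple there is a whole group. With a check like this one, a group that contains a single bad record would go to invalidRecords in its entirety.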

On Tue, Jul 23, 2013 at 8:58 AM, Serega Sheypak <[EMAIL PROTECTED]> wrote:

> Omg, thanks, it's exactly the thing I need.
>
> I can't do it before GROUP. I need to group by key, then sort by the
> timestamp field inside each group.
> Only after the sort is done can I determine the non-valid records.
> I've provided a simplified case.
>
> The only problem is that SPLIT is not allowed in a nested FOREACH statement.
>
>
> 2013/7/23 Pradeep Gollakota <[EMAIL PROTECTED]>
>
> > You can use the SPLIT operator to split a relation into two (or more)
> > relations. http://pig.apache.org/docs/r0.11.1/basic.html#SPLIT
> >
> > Also, you should probably do this before GROUP. As a best practice (and
> > general Pig optimization strategy), you should filter (and project) early
> > and often.
> >
> >
> > On Tue, Jul 23, 2013 at 4:27 AM, Serega Sheypak <[EMAIL PROTECTED]> wrote:
> >
> > > Hi, I have a rather simple problem and I can't come up with a nice solution.
> > > Here is my input:
> > > msisdn longitude latitude ts
> > > 1 20.30 40.50 123
> > > 1 0.0 null 456
> > > 2 60.70 34.67 678
> > > 2 null null 978
> > >
> > > I need:
> > > group by msisdn
> > > order by ts inside each group
> > > filter records in each group:
> > > 1. put all records where longitude and latitude are valid on one side
> > > 2. put all records where longitude/latitude = 0.0/null on the other side
> > >
> > > Here is pig pseudo-code:
> > > rawRecords = LOAD '/data' as ...;
> > > grouped = GROUP rawRecords BY msisdn;
> > > validAndNotValidRecords = FOREACH grouped{
> > >              ordered = ORDER rawRecords BY ts;
> > >              -- do something here to filter valid and not valid records...
> > > }
> > > STORE notValidRecords INTO '/not_valid_data';
> > >
> > > someOtherProjection = GROUP validRecords BY msisdn;
> > > --continue to work with filtered valid records...
> > >
> > > Can I do it in a single Pig script, or do I need to create two scripts:
> > > the first one would filter out the not valid records and store them,
> > > the second one would continue to process the filtered set of records?
> > >
> >
>
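
For reference, the per-record result the original question asks for can be worked out on its four sample rows in plain Python. This is only a sketch of the expected outcome, not Pig code: the names `has_valid_position` and `split_by_validity` are hypothetical, and a position is assumed to be valid when neither coordinate is null or 0.0.

```python
from itertools import groupby
from operator import itemgetter

# Sample input from the question: (msisdn, longitude, latitude, ts).
# "null" in the original data is represented as None.
raw_records = [
    (1, 20.30, 40.50, 123),
    (1, 0.0, None, 456),
    (2, 60.70, 34.67, 678),
    (2, None, None, 978),
]


def has_valid_position(record):
    """Assumed rule: both coordinates must be non-null and non-zero."""
    _msisdn, longitude, latitude, _ts = record
    return all(v is not None and v != 0.0 for v in (longitude, latitude))


def split_by_validity(records):
    """Group by msisdn, order by ts within each group, then split per record."""
    valid, not_valid = {}, {}
    # Sorting by (msisdn, ts) gives both the grouping order and the ts order.
    ordered = sorted(records, key=itemgetter(0, 3))
    for msisdn, group in groupby(ordered, key=itemgetter(0)):
        for record in group:
            target = valid if has_valid_position(record) else not_valid
            target.setdefault(msisdn, []).append(record)
    return valid, not_valid


valid_records, not_valid_records = split_by_validity(raw_records)
print(valid_records)
print(not_valid_records)
```

On the sample input each msisdn ends up with one record on each side, which is the outcome a single-script solution would need to reproduce.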