Re: Filter bag with multiple output
You can do the SPLIT outside the nested FOREACH. I'm assuming you have a UDF
defined for VALID.

So, your script can be written as:

rawRecords = LOAD '/data' as ...;
grouped = GROUP rawRecords BY msisdn;
validAndNotValidRecords = FOREACH grouped {
             ordered = ORDER rawRecords BY ts;
             GENERATE group as group_key, ordered as data;
};
SPLIT validAndNotValidRecords INTO validRecords IF VALID(data),
      invalidRecords OTHERWISE;
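
The script above relies on a VALID UDF that the thread never shows. Below is a minimal sketch of what it might look like as a Python (Jython) UDF. It rests on several assumptions that are not from the original post: the bag arrives as a list of tuples, the fields are in the order (msisdn, longitude, latitude, ts) as in the sample input, and a group counts as valid only when every record has a non-null, non-zero longitude and latitude. The helper name coordinate_ok is hypothetical, and the fallback decorator only exists so the file also runs outside Pig.

```python
# Hypothetical VALID UDF (an assumption; the original UDF is not shown in the thread).
# Field positions assume the order (msisdn, longitude, latitude, ts).

try:
    outputSchema  # supplied by Pig when the file is registered as a Jython UDF
except NameError:
    def outputSchema(schema):
        # No-op stand-in so the file can also be run under plain Python.
        def decorator(func):
            return func
        return decorator


def coordinate_ok(value):
    """Return True when a coordinate is present and is not the 0.0 placeholder."""
    return value is not None and float(value) != 0.0


@outputSchema("valid:boolean")
def VALID(data):
    """Return True when every record in the bag has a usable longitude and latitude."""
    if data is None:
        return False
    for record in data:
        if not (coordinate_ok(record[1]) and coordinate_ok(record[2])):
            return False
    return True
```

Because validAndNotValidRecords holds one row per msisdn, a SPLIT on VALID(data) routes whole groups rather than individual records. If the file were saved as, say, valid_udf.py, it could presumably be loaded with REGISTER 'valid_udf.py' USING jython AS udfs; and called as udfs.VALID(data), though I have not checked how a boolean-returning Jython UDF behaves in a SPLIT condition.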

On Tue, Jul 23, 2013 at 8:58 AM, Serega Sheypak <[EMAIL PROTECTED]> wrote:

> Omg, thanks it's exactly the thing I need.
>
> I can't do it before GROUP. I need to group by key, then sort by the
> timestamp field inside each group.
> Only after the sort is done can I determine the invalid records.
> I've provided a simplified case.
>
> The only problem is that SPLIT is not allowed in a nested FOREACH statement.
>
>
> 2013/7/23 Pradeep Gollakota <[EMAIL PROTECTED]>
>
> > You can use the SPLIT operator to split a relation into two (or more)
> > relations. http://pig.apache.org/docs/r0.11.1/basic.html#SPLIT
> >
> > Also, you should probably do this before GROUP. As a best practice (and
> > general Pig optimization strategy), you should filter (and project) early
> > and often.
> >
> >
> > On Tue, Jul 23, 2013 at 4:27 AM, Serega Sheypak <[EMAIL PROTECTED]> wrote:
> >
> > > Hi, I have a rather simple problem and I can't come up with a nice solution.
> > > Here is my input:
> > > msisdn longitude latitude ts
> > > 1 20.30 40.50 123
> > > 1 0.0 null 456
> > > 2 60.70 34.67 678
> > > 2 null null 978
> > >
> > > I need:
> > > group by msisdn
> > > order by ts inside each group
> > > filter records in each group:
> > > 1. put all records where longitude and latitude are valid on one side
> > > 2. put all records where longitude/latitude = 0.0/null on the other side
> > >
> > > Here is pig pseudo-code:
> > > rawRecords = LOAD '/data' as ...;
> > > grouped = GROUP rawRecords BY msisdn;
> > > validAndNotValidRecords = FOREACH grouped{
> > >              ordered = ORDER rawRecords BY ts;
> > >              -- do something here to filter valid and not valid records....
> > > };
> > > STORE notValidRecords INTO '/not_valid_data';
> > >
> > > someOtherProjection = GROUP validRecords By msisdn;
> > > --continue to work with filtered valid records...
> > >
> > > Can I do it in a single Pig script, or do I need to create two scripts:
> > > the first one would filter out the invalid records and store them,
> > > the second one would continue to process the filtered set of records?
> > >
> >
>