Pig, mail # user - Filter bag with multiple output


Re: Filter bag with multiple output
Serega Sheypak 2013-07-23, 12:58
Omg, thanks, that's exactly what I need.

I can't do it before the GROUP. I need to group by key, then sort by the
timestamp field inside each group.
Only after the sort is done can I determine which records are not valid.
I've provided a simplified case.

The only problem is that SPLIT is not allowed inside a nested FOREACH statement.
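
One possible workaround, as a minimal sketch: ORDER and FILTER are both allowed inside a nested FOREACH, so two complementary FILTERs can stand in for SPLIT and produce two bags per group. The schema, the field names validBag/notValidBag, and the exact validity rule (a record is not valid if either coordinate is null or 0.0) are assumptions, since the original post elides them.

```pig
-- Assumed schema; the original LOAD statement does not show it.
rawRecords = LOAD '/data' AS (msisdn:chararray, longitude:double, latitude:double, ts:long);
grouped = GROUP rawRecords BY msisdn;

-- Two complementary FILTERs replace SPLIT inside the nested block.
partitioned = FOREACH grouped {
    ordered  = ORDER rawRecords BY ts;
    valid    = FILTER ordered BY longitude IS NOT NULL AND latitude IS NOT NULL
                             AND longitude != 0.0 AND latitude != 0.0;
    notValid = FILTER ordered BY longitude IS NULL OR latitude IS NULL
                             OR longitude == 0.0 OR latitude == 0.0;
    GENERATE group AS msisdn, valid AS validBag, notValid AS notValidBag;
};

-- Unnest the not valid records and store them.
notValidRecords = FOREACH partitioned GENERATE FLATTEN(notValidBag);
STORE notValidRecords INTO '/not_valid_data';

-- Keep working with the valid records, still grouped by msisdn.
validRecords = FOREACH partitioned GENERATE msisdn, validBag;
```

Everything stays in a single script, and Pig can typically serve both outputs from the same GROUP. Note that the ordering by ts applies inside each bag; it should not be relied on after FLATTEN and STORE.
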
2013/7/23 Pradeep Gollakota <[EMAIL PROTECTED]>

> You can use the SPLIT operator to split a relation into two (or more)
> relations. http://pig.apache.org/docs/r0.11.1/basic.html#SPLIT
>
> Also, you should probably do this before GROUP. As a best practice (and
> general pig optimization strategy), you should filter (and project) early
> and often.
>
>
> On Tue, Jul 23, 2013 at 4:27 AM, Serega Sheypak <[EMAIL PROTECTED]> wrote:
>
> > Hi, I have a rather simple problem and I can't come up with a nice solution.
> > Here is my input:
> > msisdn longitude latitude ts
> > 1 20.30 40.50 123
> > 1 0.0 null 456
> > 2 60.70 34.67 678
> > 2 null null 978
> >
> > I need:
> > group by msisdn
> > order by ts inside each group
> > filter records in each group:
> > 1. put all records where longitude, latitude are valid on one side
> > 2. put all records where longitude/latitude = 0.0/null on the other side
> >
> > Here is pig pseudo-code:
> > rawRecords = LOAD '/data' as ...;
> > grouped = GROUP rawRecords BY msisdn;
> > validAndNotValidRecords = FOREACH grouped {
> >              ordered = ORDER rawRecords BY ts;
> >              -- do something here to filter valid and not valid records....
> > };
> > STORE notValidRecords INTO '/not_valid_data';
> >
> > someOtherProjection = GROUP validRecords BY msisdn;
> > --continue to work with filtered valid records...
> >
> > Can I do it in a single pig script, or do I need to create two scripts:
> > the first one would filter out the not valid records and store them, and
> > the second one would continue to process the filtered set of records?
> >
>