Re: Filter bag with multiple output
Great, thanks! It helped.
2013/7/23 Pradeep Gollakota <[EMAIL PROTECTED]>

> You can do the SPLIT outside the nested FOREACH. I'm assuming you have a UDF
> defined for VALID.
>
> So, your script can be written as:
>
> rawRecords = LOAD '/data' as ...;
> grouped = GROUP rawRecords BY msisdn;
> validAndNotValidRecords = FOREACH grouped {
>              ordered = ORDER rawRecords BY ts;
>              GENERATE group as group_key, ordered as data;
> };
> SPLIT validAndNotValidRecords INTO validRecords IF VALID(data),
>       invalidRecords OTHERWISE;
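>
> If validity can be decided per record, as in the simplified sample data, a
> minimal sketch without a UDF is to FILTER inside the nested FOREACH, which
> Pig does allow. The schema, the validity rule (both coordinates non-null
> and non-zero) and the alias names below are assumptions, not taken from
> the real job:
>
> ```pig
> -- Assumed schema; the LOAD clause is elided in the thread.
> rawRecords = LOAD '/data' AS (msisdn:chararray, longitude:double,
>                               latitude:double, ts:long);
> grouped = GROUP rawRecords BY msisdn;
> separated = FOREACH grouped {
>     -- Assumed rule: valid means both coordinates are present and non-zero.
>     valid = FILTER rawRecords BY longitude IS NOT NULL AND latitude IS NOT NULL
>                              AND longitude != 0.0 AND latitude != 0.0;
>     notValid = FILTER rawRecords BY longitude IS NULL OR latitude IS NULL
>                                 OR longitude == 0.0 OR latitude == 0.0;
>     -- Sort each side by timestamp inside the group.
>     validOrdered = ORDER valid BY ts;
>     notValidOrdered = ORDER notValid BY ts;
>     GENERATE group AS msisdn, validOrdered AS validData,
>              notValidOrdered AS notValidData;
> };
> -- One script: store the rejected side, keep working with the valid side.
> notValidRecords = FOREACH separated GENERATE msisdn, notValidData;
> STORE notValidRecords INTO '/not_valid_data';
> validRecords = FOREACH separated GENERATE msisdn, validData;
> ```
>
> If validity depends on neighbouring records in the sorted bag, a UDF over
> the ordered bag is probably still needed.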
>
>
>
>
> On Tue, Jul 23, 2013 at 8:58 AM, Serega Sheypak <[EMAIL PROTECTED]> wrote:
>
> > OMG, thanks, it's exactly what I need.
> >
> > I can't do it before the GROUP. I need to group by key, then sort by the
> > timestamp field inside each group.
> > Only after the sort is done can I determine the non-valid records.
> > I've provided a simplified case.
> >
> > The only problem is that SPLIT is not allowed in a nested FOREACH statement.
> >
> >
> > 2013/7/23 Pradeep Gollakota <[EMAIL PROTECTED]>
> >
> > > You can use the SPLIT operator to split a relation into two (or more)
> > > relations. http://pig.apache.org/docs/r0.11.1/basic.html#SPLIT
> > >
> > > Also, you should probably do this before the GROUP. As a best practice (and
> > > general Pig optimization strategy), you should filter (and project) early
> > > and often.
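> > >
> > > When the validity rule needs only the fields of a single record, filtering
> > > early could look like the sketch below. The schema and the rule (both
> > > coordinates non-null and non-zero) are assumptions based on the sample
> > > input, since the LOAD clause is elided:
> > >
> > > ```pig
> > > -- Assumed schema for the four sample columns.
> > > rawRecords = LOAD '/data' AS (msisdn:chararray, longitude:double,
> > >                               latitude:double, ts:long);
> > > -- The IS NOT NULL checks come first so the condition is never null.
> > > SPLIT rawRecords INTO
> > >     validRecords IF (longitude IS NOT NULL AND latitude IS NOT NULL
> > >                      AND longitude != 0.0 AND latitude != 0.0),
> > >     notValidRecords OTHERWISE;
> > > STORE notValidRecords INTO '/not_valid_data';
> > > -- Only valid records reach the (more expensive) GROUP.
> > > grouped = GROUP validRecords BY msisdn;
> > > ```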
> > >
> > >
> > > On Tue, Jul 23, 2013 at 4:27 AM, Serega Sheypak <[EMAIL PROTECTED]> wrote:
> > >
> > > > Hi, I have a rather simple problem and I can't come up with a nice solution.
> > > > Here is my input:
> > > > msisdn longitude latitude ts
> > > > 1 20.30 40.50 123
> > > > 1 0.0 null 456
> > > > 2 60.70 34.67 678
> > > > 2 null null 978
> > > >
> > > > I need:
> > > > group by msisdn
> > > > order by ts inside each group
> > > > filter records in each group:
> > > > 1. put all records where longitude and latitude are valid on one side
> > > > 2. put all records where longitude/latitude = 0.0/null on the other side
> > > >
> > > > Here is pig pseudo-code:
> > > > rawRecords = LOAD '/data' as ...;
> > > > grouped = GROUP rawRecords BY msisdn;
> > > > validAndNotValidRecords = FOREACH grouped {
> > > >              ordered = ORDER rawRecords BY ts;
> > > >              -- do something here to filter valid and not valid records...
> > > > };
> > > > STORE notValidRecords INTO '/not_valid_data';
> > > >
> > > > someOtherProjection = GROUP validRecords BY msisdn;
> > > > --continue to work with filtered valid records...
> > > >
> > > > Can I do it in a single Pig script, or do I need to create two scripts:
> > > > the first one would filter the non-valid records and store them,
> > > > and the second one would continue to process the filtered set of records?
> > > >
> > >
> >
>