Re: Filter bag with multiple output
You can use the SPLIT operator to split a relation into two (or more)
relations. http://pig.apache.org/docs/r0.11.1/basic.html#SPLIT

Also, you should probably do this before the GROUP. As a best practice (and
a general Pig optimization strategy), you should filter (and project) early
and often.
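
As a rough sketch of what that could look like in a single script (the
schema, the tab delimiter, and the '/valid_data' output path are my
assumptions, since your LOAD clause is elided; this also assumes the literal
text "null" in the input fails the cast to double and so becomes a Pig null):

```pig
-- Assumed schema and tab-delimited input; adjust to the real LOAD clause.
rawRecords = LOAD '/data' USING PigStorage('\t')
             AS (msisdn:chararray, longitude:double, latitude:double, ts:long);

-- Split before grouping. A record is valid only if both coordinates are
-- non-null and non-zero; the second condition is the explicit complement.
SPLIT rawRecords INTO
    validRecords IF (longitude IS NOT NULL AND longitude != 0.0
                     AND latitude IS NOT NULL AND latitude != 0.0),
    notValidRecords IF (longitude IS NULL OR longitude == 0.0
                        OR latitude IS NULL OR latitude == 0.0);

STORE notValidRecords INTO '/not_valid_data';

-- Continue with the valid records only, ordering by ts inside each group.
grouped = GROUP validRecords BY msisdn;
orderedByTs = FOREACH grouped {
    ordered = ORDER validRecords BY ts;
    GENERATE group AS msisdn, ordered AS records;
};

-- Hypothetical output path for the downstream result.
STORE orderedByTs INTO '/valid_data';
```

Both STORE statements can live in the same script, so two separate scripts
should not be necessary.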
On Tue, Jul 23, 2013 at 4:27 AM, Serega Sheypak <[EMAIL PROTECTED]> wrote:

> Hi, I have a rather simple problem and I can't come up with a nice solution.
> Here is my input:
> msisdn longitude latitude ts
> 1 20.30 40.50 123
> 1 0.0 null 456
> 2 60.70 34.67 678
> 2 null null 978
>
> I need:
> group by msisdn
> order by ts inside each group
> filter records in each group:
> 1. put all records where longitude and latitude are valid on one side
> 2. put all records where longitude/latitude = 0.0/null on the other side
>
> Here is pig pseudo-code:
> rawRecords = LOAD '/data' as ...;
> grouped = GROUP rawRecords BY msisdn;
> validAndNotValidRecords = FOREACH grouped{
>              ordered = ORDER rawRecords BY ts;
>              -- do something here to filter valid and not valid records...
> }
> STORE notValidRecords INTO '/not_valid_data';
>
> someOtherProjection = GROUP validRecords BY msisdn;
> --continue to work with filtered valid records...
>
> Can I do it in a single Pig script, or do I need to create two scripts:
> the first one would filter out the not valid records and store them,
> the second one would continue to process the filtered set of records?
>