Pig, mail # user - Filter bag with multiple output


Filter bag with multiple output
Serega Sheypak 2013-07-23, 08:27
Hi, I have a rather simple problem, but I can't come up with a nice solution.
Here is my input:
msisdn longitude latitude ts
1 20.30 40.50 123
1 0.0 null 456
2 60.70 34.67 678
2 null null 978

I need to:
group by msisdn
order by ts inside each group
filter the records in each group:
1. put all records where longitude and latitude are valid on one side
2. put all records where longitude/latitude = 0.0/null on the other side

Here is the Pig pseudo-code:
rawRecords = LOAD '/data' as ...;
grouped = GROUP rawRecords BY msisdn;
validAndNotValidRecords = FOREACH grouped{
             ordered = ORDER rawRecords BY ts;
             -- do something here to separate valid and not valid records....
}
STORE notValidRecords INTO '/not_valid_data';

someOtherProjection = GROUP validRecords BY msisdn;
-- continue to work with the filtered valid records...

Can I do this in a single Pig script, or do I need to create two scripts:
the first would filter out the not valid records and store them,
and the second would continue to process the filtered set of records?
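
For illustration, here is a rough sketch of what a single-script version might look like, using Pig's SPLIT operator before grouping. The schema in the LOAD statement, the isValid flag and the orderedValid alias are assumptions made for this sketch and are not part of my real script. It also assumes a record counts as not valid when either coordinate is 0.0 or null. The nested bincond is used so that the flag never depends on how null is treated inside AND/OR expressions.

```pig
-- Hypothetical schema: the real LOAD clause is elided above.
-- A literal "null" in the input is expected to fail the cast to double and load as null.
rawRecords = LOAD '/data' AS (msisdn:chararray, longitude:double, latitude:double, ts:long);

-- Flag each record: 1 = both coordinates present and non-zero, 0 = anything else.
flagged = FOREACH rawRecords GENERATE msisdn, longitude, latitude, ts,
    (longitude IS NULL ? 0 :
        (latitude IS NULL ? 0 :
            (longitude == 0.0 ? 0 :
                (latitude == 0.0 ? 0 : 1)))) AS isValid;

-- Route every record to exactly one of the two relations.
SPLIT flagged INTO validRecords IF isValid == 1, notValidRecords IF isValid == 0;

-- One side is stored as is.
STORE notValidRecords INTO '/not_valid_data';

-- The other side continues through the pipeline: group, then order inside each group.
grouped = GROUP validRecords BY msisdn;
orderedValid = FOREACH grouped {
    ordered = ORDER validRecords BY ts;
    GENERATE group AS msisdn, ordered;
};
-- continue to work with orderedValid (it needs its own STORE or DUMP to be computed)...
```

Since both branches come from the same LOAD, Pig's multi-query execution should be able to run them within one script, so a second script may not be needed.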