Pig >> mail # user >> Filter bag with multiple output


Filter bag with multiple output
Hi, I have a rather simple problem, but I can't come up with a nice solution.
Here is my input:
msisdn longitude latitude ts
1 20.30 40.50 123
1 0.0 null 456
2 60.70 34.67 678
2 null null 978

I need:
group by msisdn
order by ts inside each group
filter records in each group:
1. put all records where longitude and latitude are valid on one side
2. put all records where longitude or latitude is 0.0 or null on the other side

Here is the Pig pseudo-code:
rawRecords = LOAD '/data' as ...;
grouped = GROUP rawRecords BY msisdn;
validAndNotValidRecords = FOREACH grouped {
             ordered = ORDER rawRecords BY ts;
             -- do something here to separate valid and not valid records....
};
STORE notValidRecords INTO '/not_valid_data';

someOtherProjection = GROUP validRecords BY msisdn;
-- continue to work with filtered valid records...

Can I do this in a single Pig script, or do I need to create two scripts:
the first one would filter out the not valid records and store them,
and the second one would continue to process the filtered set of records?
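
For reference, here is a minimal sketch of how the single-script variant might look. It separates the records with SPLIT before grouping instead of inside the nested FOREACH. The field types in the LOAD schema and the aliases orderedValid and records are assumptions made for illustration, not part of the original pseudo-code. It also assumes that a literal "null" in the input fails the cast to double and is therefore loaded as a Pig null.

```pig
-- Assumed schema: the types are a guess based on the sample input.
rawRecords = LOAD '/data'
    AS (msisdn:chararray, longitude:double, latitude:double, ts:long);

-- Route each record to exactly one side. The two conditions are written
-- out explicitly so that nulls never fall through both branches.
SPLIT rawRecords INTO
    validRecords IF (longitude IS NOT NULL AND latitude IS NOT NULL
                     AND longitude != 0.0 AND latitude != 0.0),
    notValidRecords IF (longitude IS NULL OR latitude IS NULL
                        OR longitude == 0.0 OR latitude == 0.0);

STORE notValidRecords INTO '/not_valid_data';

-- Continue with the valid records only: group by msisdn, order by ts
-- inside each group.
grouped = GROUP validRecords BY msisdn;
orderedValid = FOREACH grouped {
    ordered = ORDER validRecords BY ts;
    GENERATE group AS msisdn, ordered AS records;
};
```

If the not valid records also have to be ordered by ts per msisdn before they are stored, the same GROUP and nested ORDER could be applied to notValidRecords as well. A script may contain several STORE statements, so both outputs can be produced in one run.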