Pig >> mail # user >> Filter bag with multiple output


Serega Sheypak 2013-07-23, 08:27
Re: Filter bag with multiple output
You can use the SPLIT operator to split a relation into two (or more)
relations. http://pig.apache.org/docs/r0.11.1/basic.html#SPLIT

Also, you should probably do this before the GROUP. As a best practice (and
a general Pig optimization strategy), you should filter (and project) early
and often, so that less data reaches the expensive grouping step.
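
A rough sketch of what that could look like for the data in the question.
The schema types, the tab delimiter, and the rule that a record is not valid
when either coordinate is null or 0.0 are assumptions made for illustration;
adjust them to the real data. Both SPLIT conditions are spelled out
explicitly, with null checks, so that a record with a null coordinate always
lands in exactly one of the two relations.

```pig
-- Assumed schema and delimiter; the original question elides the AS clause.
rawRecords = LOAD '/data' USING PigStorage('\t')
             AS (msisdn:chararray, longitude:double, latitude:double, ts:long);

-- Split before grouping: one relation per side.
SPLIT rawRecords INTO
    validRecords IF (longitude IS NOT NULL AND latitude IS NOT NULL
                     AND longitude != 0.0 AND latitude != 0.0),
    notValidRecords IF (longitude IS NULL OR latitude IS NULL
                        OR longitude == 0.0 OR latitude == 0.0);

-- Store the rejected records and keep working in the same script.
STORE notValidRecords INTO '/not_valid_data';

-- Group only the valid records and order each group by ts.
grouped = GROUP validRecords BY msisdn;
orderedPerMsisdn = FOREACH grouped {
    ordered = ORDER validRecords BY ts;
    GENERATE group AS msisdn, ordered AS records;
};
-- continue to work with orderedPerMsisdn...
```

Since a single script can contain several STORE statements, this should not
need to be broken into two scripts.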
On Tue, Jul 23, 2013 at 4:27 AM, Serega Sheypak <[EMAIL PROTECTED]> wrote:

> Hi, I have a rather simple problem and I can't come up with a nice solution.
> Here is my input:
> msisdn longitude latitude ts
> 1 20.30 40.50 123
> 1 0.0 null 456
> 2 60.70 34.67 678
> 2 null null 978
>
> I need:
> group by msisdn
> order by ts inside each group
> filter records in each group:
> 1. put all records where longitude and latitude are valid on one side
> 2. put all records where longitude/latitude = 0.0/null on the other side
>
> Here is pig pseudo-code:
> rawRecords = LOAD '/data' as ...;
> grouped = GROUP rawRecords BY msisdn;
> validAndNotValidRecords = FOREACH grouped{
>              ordered = ORDER rawRecords BY ts;
>              --do something here to filter valid and not valid records....
> }
> STORE notValidRecords INTO '/not_valid_data';
>
> someOtherProjection = GROUP validRecords By msisdn;
> --continue to work with filtered valid records...
>
> Can I do it in a single Pig script, or do I need to create two scripts:
> the first one would filter not valid records and store them
> the second one will continue to process filtered set of records?
>
Serega Sheypak 2013-07-23, 12:58
Pradeep Gollakota 2013-07-23, 14:19
Serega Sheypak 2013-07-23, 14:21