Pig >> mail # user >> simple pig logic


Re: simple pig logic
If I understood your question correctly, given the following input:

main_data.txt
{"id": "foo", "some_field": 12354, "score": 0}
{"id": "foobar", "some_field": 12354, "score": 0}
{"id": "baz", "some_field": 12345, "score": 0}

score_data.txt
{"id": "foo", "score": 1}
{"id": "foobar", "score": 20}

you want the following output:

{"id": "foo", "some_field": 12354, "score": 1}
{"id": "foobar", "some_field": 12354, "score": 20}
{"id": "baz", "some_field": 12345, "score": 0}

If that is correct, you can do a LEFT OUTER join on the two relations.

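-- Assumption: each file holds one JSON object per line, so read it with the built-in JsonLoader.
main = LOAD 'main_data.txt' USING JsonLoader('id: chararray, some_field: int, score: int');
scores = LOAD 'score_data.txt' USING JsonLoader('id: chararray, score: int');
-- Keep every row of main; fields from scores are null where no id matches.
both = JOIN main BY id LEFT OUTER, scores BY id;
-- Test for null with IS NULL; a comparison such as "== null" does not yield true in Pig.
final = FOREACH both GENERATE main::id AS id, main::some_field AS some_field,
    (scores::score IS NULL ? main::score : scores::score) AS score;
DUMP final;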

After the join, scores::score is null for every row of main that has no
matching id in scores. The FOREACH checks for that case: if scores::score is
null it keeps the default main::score; otherwise it takes scores::score.
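
If it is unclear why the FOREACH needs the main:: and scores:: prefixes, it can help to look at the joined relation before projecting it. The following is only a sketch: it assumes the same two input files, assumes they are read with Pig's built-in JsonLoader (which loader applies depends on how your data is actually stored), and introduces a hypothetical alias, unmatched, purely for illustration.

```pig
-- Assumption: one JSON object per line, read with the built-in JsonLoader.
main = LOAD 'main_data.txt' USING JsonLoader('id: chararray, some_field: int, score: int');
scores = LOAD 'score_data.txt' USING JsonLoader('id: chararray, score: int');
both = JOIN main BY id LEFT OUTER, scores BY id;

-- Print the schema of the join; fields that exist on both sides (id, score)
-- are disambiguated with the name of the relation they came from.
DESCRIBE both;

-- Hypothetical check: rows of main that found no match in scores,
-- i.e. the rows for which the default main::score will be kept.
unmatched = FILTER both BY scores::id IS NULL;
DUMP unmatched;
```

With the sample input above, only the baz row has no entry in score_data.txt, so it should be the only row in unmatched.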

Hope this helps!