I'm pretty sure the answer to my question is no, but I have to ask. Is it
possible within Pig to store different groups of data into different output
files where the grouping is dynamic (i.e. not known ahead of time)? Here's
what I'm trying to do...
I've got a script that reads log files of URLs and generates counts for a
given time period. The URLs might have a 'tag' querystring param though, and
in that case I want to get the most popular URLs for each tag output to its
own file.
My data looks like this and is ordered by tag asc, count desc:
[tag] [timeinterval] [url] [count]
I need to do something like so:
for each tag group found
    store all records in file foo_[tag].txt
I ultimately need these files on local disk and I'm looking for a better way
to do so than generating a file of N unique tags in HDFS, reading it from
Java, submitting N jobs with the tag name substituted into a script file,
followed by N copyToLocal calls.
At least two possible solutions come to mind, but I'm curious whether there's
another that I'm overlooking:
1. In Java, submit Pig commands dynamically to an instance of PigServer. I'd
still need the file of unique tags for this case.
2. Maybe with a custom store function?
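To make option 1 a bit more concrete, here is a rough sketch of what I have in mind. It only builds the per-tag Pig Latin statements and output names. The relation alias "counts", the alias "tag_rows", the field name "tag", and the sample tags are all placeholders I made up, and it assumes tags contain no quote characters. The PigServer calls are left in comments so the sketch compiles without Pig on the classpath.

```java
import java.util.Arrays;
import java.util.List;

class PerTagQueries {
    // Builds the Pig Latin statement that keeps only the rows for one tag.
    // Assumes a relation "counts" with a chararray field "tag" (placeholder names)
    // and that the tag contains no single quotes or backslashes.
    public static String filterStatement(String tag) {
        return "tag_rows = FILTER counts BY tag == '" + tag + "';";
    }

    // Output name per tag, following the foo_[tag].txt pattern.
    public static String outputPath(String tag) {
        return "foo_" + tag + ".txt";
    }

    public static void main(String[] args) {
        // In practice these would be read from the file of unique tags.
        List<String> tags = Arrays.asList("news", "sports");
        for (String tag : tags) {
            System.out.println(filterStatement(tag));
            System.out.println("-- would store tag_rows into " + outputPath(tag));
            // With Pig available, roughly (pigServer being a PigServer instance
            // that already has "counts" registered):
            //   pigServer.registerQuery(filterStatement(tag));
            //   pigServer.store("tag_rows", outputPath(tag));
        }
    }
}
```

This still needs one store per tag plus the unique tag list up front, which is the part I'd like to avoid.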