Re: MultiStorage for many key values
I have the same question as Thomas Kappler.
I would appreciate it if someone could explain in more detail the
'custom partitioner' that Daniel Dai mentioned.
The documentation at 'piglatin_ref2.html#partitionby' seems too brief.
2011/6/17 Daniel Dai <[EMAIL PROTECTED]>

> Try custom partitioner: http://pig.apache.org/docs/r0.8.1/piglatin_ref2.html#partitionby
> Daniel
> On 06/16/2011 12:38 AM, Thomas Kappler wrote:
>> Hi all,
>> piggybank.storage.MultiStorage allows storing the Pig output into
>> different directories, taken from a given field in a relation, so that
>> the output is partitioned by the unique values of that field.
>> This is just what I need for my use-case. However, I have about 50,000
>> unique values in the partitioning field. It seems that MultiStorage
>> will run one reducer per unique value, i.e., per output directory.
>> Obviously, this takes a long time.
>> Is there a better way of doing it?
>> I could group by the partitioning field and write a post-processing
>> script to go through the Pig output and write each line to a different
>> file. It would be simple, but I'd prefer to do it all in Pig for
>> consistency.
>> Thanks,
>> Thomas
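
For readers wondering what a 'custom partitioner' would actually compute, here is a minimal sketch of the core idea only, not a drop-in Pig class. It uses plain Java with no Hadoop dependency so it can be run on its own. In Pig, this logic would live in the getPartition method of a class extending org.apache.hadoop.mapreduce.Partitioner, named in the PARTITION BY clause of a GROUP together with PARALLEL. The class name KeyPartitionSketch, the "key-N" values, and the reducer count of 20 are all made up for illustration. Whether this removes the one-reducer-per-value behaviour described above is not confirmed in this thread.

```java
public class KeyPartitionSketch {

    // Same contract as Hadoop's Partitioner.getPartition:
    // return an index in the range [0, numPartitions).
    public static int partitionFor(String key, int numPartitions) {
        // Mask the sign bit so a negative hashCode cannot yield a negative index.
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    public static void main(String[] args) {
        int numPartitions = 20;   // hypothetical reducer count, e.g. PARALLEL 20
        int uniqueKeys = 50000;   // roughly the number of unique values in the question

        // Count how many distinct keys land on each partition.
        int[] counts = new int[numPartitions];
        for (int i = 0; i < uniqueKeys; i++) {
            counts[partitionFor("key-" + i, numPartitions)]++;
        }

        for (int p = 0; p < numPartitions; p++) {
            System.out.println("partition " + p + ": " + counts[p] + " keys");
        }
    }
}
```

The point of the sketch is that the number of partitions is chosen independently of the number of unique key values, so each reducer handles many keys rather than one.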