Pig >> mail # user >> MultiStorage for many key values


Re: MultiStorage for many key values
I have the same question as Thomas Kappler.
Could someone explain in more detail the 'custom partitioner' that
Daniel Dai suggested?
The documentation at 'piglatin_ref2.html#partitionby' seems too brief.
2011/6/17 Daniel Dai <[EMAIL PROTECTED]>

> Try custom partitioner:
> http://pig.apache.org/docs/r0.8.1/piglatin_ref2.html#partitionby
>
> Daniel
>
>
> On 06/16/2011 12:38 AM, Thomas Kappler wrote:
>
>> Hi all,
>>
>> piggybank.storage.MultiStorage allows storing the Pig output into
>> different directories, taken from a given field in a relation, so that
>> the output is partitioned by the unique values of that field.
>>
>> This is just what I need for my use-case. However, I have about 50,000
>> unique values in the partitioning field. It seems that MultiStorage
>> will run one reducer per unique value, i.e., per output directory.
>> Obviously, this takes a long time.
>>
>> Is there a better way of doing it?
>>
>> I could group by the partitioning field and write a post-processing
>> script to go through the Pig output and write each line to a different
>> file. It would be simple, but I'd prefer to do it all in Pig for
>> consistency.
>>
>> Thanks,
>> Thomas
>>
>
>
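
For readers with the same question about the 'custom partitioner': the sketch below shows only the core idea, namely mapping an arbitrary number of distinct keys (about 50,000 in Thomas's case) onto a fixed number of reducers. It is a plain-Java illustration under stated assumptions, not code from Pig or piggybank. The class name KeyHashPartitioner and the "key-N" strings are hypothetical. According to the documentation linked above, a real Pig partitioner would extend org.apache.hadoop.mapreduce.Partitioner<PigNullableWritable, Writable> and implement getPartition(key, value, numPartitions); the value argument is left out here because the sketch does not use it.

```java
// Hypothetical class name; not part of Pig or piggybank.
class KeyHashPartitioner {

    // Mirrors the shape of Hadoop's Partitioner.getPartition(key, value, numPartitions),
    // minus the value argument, which this sketch does not need.
    static int getPartition(Object key, int numPartitions) {
        // floorMod keeps the result in [0, numPartitions) even when hashCode() is negative.
        return Math.floorMod(key.hashCode(), numPartitions);
    }

    public static void main(String[] args) {
        int numPartitions = 20;                  // stands in for the PARALLEL setting
        int[] keysPerReducer = new int[numPartitions];

        // Simulate 50,000 distinct key values, as in the original question.
        for (int i = 0; i < 50_000; i++) {
            keysPerReducer[getPartition("key-" + i, numPartitions)]++;
        }

        // Each reducer ends up handling many keys rather than one.
        for (int p = 0; p < numPartitions; p++) {
            System.out.println("reducer " + p + ": " + keysPerReducer[p] + " keys");
        }
    }
}
```

In a Pig script, the compiled class would be named in a PARTITION BY clause together with PARALLEL, for example on a GROUP statement, as described in the linked section. Whether this removes the per-value cost observed with MultiStorage is not established in this thread.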