My requirement is that I want to load N samples based on the input file size and perform a naive cube computation to determine the large groups that will not fit in a reducer's memory. I need to know the exact number of samples in order to calculate the partition factor for large groups.
Currently I am using RandomSampleLoader to load 1000 tuples from each mapper. Without knowing the number of mappers, I will not be able to find the exact number of samples loaded. Also, RandomSampleLoader doesn't attach a special marker tuple (as PoissonSampleLoader does) that records the number of samples loaded.
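To illustrate the marker-tuple idea: each mapper appends one marker record carrying its local sample count, so the consumer can recover the exact total without knowing the number of mappers up front. This is a minimal plain-Java sketch of the pattern, not Pig's actual PoissonSampleLoader implementation; the names (emitSamples, MARKER) are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

public class MarkerCountSketch {
    // Hypothetical sentinel key marking a "count" record.
    static final String MARKER = "__SAMPLE_COUNT__";

    // One mapper's output: its sample tuples plus a trailing marker
    // tuple that records how many samples this mapper emitted.
    static List<String[]> emitSamples(List<String> samples) {
        List<String[]> out = new ArrayList<>();
        for (String s : samples) {
            out.add(new String[] {"DATA", s});
        }
        out.add(new String[] {MARKER, String.valueOf(samples.size())});
        return out;
    }

    // Consumer side: sum the marker tuples to get the exact total
    // number of samples across all mappers.
    static long totalSamples(List<List<String[]>> mapperOutputs) {
        long total = 0;
        for (List<String[]> one : mapperOutputs) {
            for (String[] rec : one) {
                if (MARKER.equals(rec[0])) {
                    total += Long.parseLong(rec[1]);
                }
            }
        }
        return total;
    }

    public static void main(String[] args) {
        List<List<String[]>> outputs = new ArrayList<>();
        outputs.add(emitSamples(List.of("a", "b", "c")));
        outputs.add(emitSamples(List.of("d", "e")));
        System.out.println(totalSamples(outputs)); // prints 5
    }
}
```

A Hadoop counter incremented per sample would achieve the same thing, but counters are only available after job completion, whereas a marker tuple travels with the data itself.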
Is there any other way to know the exact number of samples loaded?
By analyzing the MR plans of order-by and skewed join, it seems that the entire dataset is copied to a temp file and the SampleLoaders then load samples from that temp file. Is there any specific reason for this redundant copy? Is it because SampleLoaders can only use Pig's internal I/O format?
On Jul 25, 2012, at 6:49 PM, Alan Gates <[EMAIL PROTECTED]> wrote:
> No. The number of mappers is determined by the InputFormat used by your load function (TextInputFormat if you're using the default PigStorage loader) when the Hadoop job is submitted. Pig doesn't have access to that information until it has handed the jobs off to MapReduce.
> On Jul 25, 2012, at 3:47 PM, Prasanth J wrote:
>> Hello everyone
>> I would like know if there is a way to know the number of mappers while compiling physical plan to MR-plan.
>> -- Prasanth