Number of mappers in MRCompiler

Prasanth J 2012-07-25, 22:47
Alan Gates 2012-07-25, 22:49
Prasanth J 2012-07-25, 23:28
Alan Gates 2012-08-23, 22:20
Prasanth J 2012-08-24, 02:55
Dmitriy Ryaboy 2012-08-24, 03:20
Re: Number of mappers in MRCompiler
Oh yeah. This question is not related to the cube sampling work that we discussed. I just wanted to know the reason behind it out of curiosity :)
-- Prasanth

On Aug 23, 2012, at 11:20 PM, Dmitriy Ryaboy <[EMAIL PROTECTED]> wrote:

> I think we decided to instead stub in a special loader that reads a
> few records from each underlying split, in a single mapper (by using a
> single wrapping split), right?
> On Thu, Aug 23, 2012 at 7:55 PM, Prasanth J <[EMAIL PROTECTED]> wrote:
>> I see. Thanks Alan for your reply.
>> One more question, which I posted earlier:
>> I used RandomSampleLoader and specified a sample size of 100. The number of map tasks executed is 110, so I expect the total number of samples received on the reducer to be 110 * 100 = 11000, but it is always more than that. The actual number of tuples received is between 14000 and 15000. I am not sure whether it's a bug or I am missing something. Is this expected behavior?
>> Thanks
>> -- Prasanth
>> On Aug 23, 2012, at 6:20 PM, Alan Gates <[EMAIL PROTECTED]> wrote:
>>> Sorry for the very slow response, but here it is, hopefully better late than never.
>>> On Jul 25, 2012, at 4:28 PM, Prasanth J wrote:
>>>> Thanks Alan.
>>>> My requirement is to load N samples, with N based on the input file size, and perform a naive cube computation to determine the large groups that will not fit in a reducer's memory. I need to know the exact number of samples to calculate the partition factor for large groups.
>>>> Currently I am using RandomSampleLoader to load 1000 tuples from each mapper. Without knowing the number of mappers I cannot find the exact number of samples loaded. Also, RandomSampleLoader doesn't attach a special marker tuple (as PoissonSampleLoader does) that tells the number of samples loaded.
>>>> Is there any other way to know the exact number of samples loaded?
>>> Not that I know of.
>>>> From analyzing the MR plans of order-by and skewed-join, it seems that the entire dataset is copied to a temp file and the SampleLoaders then load samples from that temp file. Is there any specific reason for this redundant copy? Is it because SampleLoaders can only use Pig's internal I/O format?
>>> Partly, but also because it allows any operators that need to run before the sample (like project or filter) to be placed in the pipeline.
>>> Alan.
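
The single-mapper approach Dmitriy describes (read a few records from each underlying split through one wrapping split) can be sketched roughly as below. This is a plain Python illustration, not Pig code: `sample_from_splits`, `per_split`, and the `range` objects standing in for input splits are all hypothetical names chosen for the example. The point it shows is that when one reader walks every split, the exact number of samples is known at the end, even if some splits hold fewer records than requested.

```python
from itertools import islice


def sample_from_splits(splits, per_split):
    """Take up to `per_split` records from each split in a single pass.

    `splits` is any iterable of record iterables (hypothetical stand-ins
    for the underlying input splits). Returns the collected samples and
    their exact count.
    """
    samples = []
    for split in splits:
        # islice stops early when a split has fewer than per_split records,
        # so short splits contribute fewer samples.
        samples.extend(islice(split, per_split))
    return samples, len(samples)


# Three hypothetical splits of different sizes; the second is very small.
splits = [range(0, 1000), range(1000, 1003), range(2000, 2500)]
samples, n = sample_from_splits(splits, per_split=5)
print(n)  # 13, not 3 * 5 = 15, because the second split has only 3 records
```

Note that this sketch takes the first records of each split rather than a random sample. It only illustrates why the count becomes exact when a single reader sees every split.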