Rahul Bhattacharjee 2013-04-16, 08:49
Re: Hadoop sampler related query!
Mahesh Balija 2013-04-24, 05:53
The limitation of InputSampler is that K and OK (I mean the
Map INKEY and OUTKEY) must both be of the same type.
Technically, this is because the writePartitionFile method collects the
samples (i.e., an ArrayList of keys) using the INKEY type, and then writes
the partition file using the Mapper OUTKEY type.
Logically, this is also the expected behavior of sampling: the only source
of samples is the input splits (INKEY), whereas the partition file has to
be generated based on the Mapper OUTKEY type.
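To make the type constraint concrete, here is a minimal sketch in plain Java
(no Hadoop classes). It is not the actual InputSampler code: the class name
SamplerSketch and the methods splitPoints and partitionFor are hypothetical,
and the even-spacing rule for picking split points is a simplification I am
assuming for illustration. The point is only that the sampled input keys and
the comparator for map output keys share one type parameter K.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

// Hypothetical illustration, not Hadoop's InputSampler or TotalOrderPartitioner.
final class SamplerSketch {

    // The samples are read from the input splits, so they have the mapper's
    // INPUT key type. The comparator is the one that sorts map OUTPUT keys.
    // Both use the same type parameter K, so the two key types must match.
    static <K> List<K> splitPoints(List<K> sampledInputKeys,
                                   Comparator<? super K> outputKeyComparator,
                                   int numPartitions) {
        if (numPartitions < 1 || sampledInputKeys.size() < numPartitions) {
            throw new IllegalArgumentException("need at least one sample per partition");
        }
        List<K> sorted = new ArrayList<>(sampledInputKeys);
        sorted.sort(outputKeyComparator);
        List<K> splits = new ArrayList<>();
        // Assumed rule: pick numPartitions - 1 evenly spaced keys from the sorted samples.
        for (int i = 1; i < numPartitions; i++) {
            int k = (int) ((long) sorted.size() * i / numPartitions);
            splits.add(sorted.get(k));
        }
        return splits;
    }

    // Routes a map OUTPUT key to a partition by binary search over the split points.
    // Keys below the first split point go to partition 0, and so on.
    static <K> int partitionFor(K mapOutputKey, List<K> splits,
                                Comparator<? super K> outputKeyComparator) {
        int pos = Collections.binarySearch(splits, mapOutputKey, outputKeyComparator);
        return pos >= 0 ? pos + 1 : -pos - 1;
    }

    public static void main(String[] args) {
        List<String> samples = Arrays.asList("pear", "apple", "fig", "kiwi", "plum", "date");
        Comparator<String> cmp = Comparator.naturalOrder();
        List<String> splits = splitPoints(samples, cmp, 3);
        System.out.println("split points: " + splits);
        for (String key : Arrays.asList("banana", "grape", "zebra")) {
            System.out.println(key + " -> partition " + partitionFor(key, splits, cmp));
        }
    }
}
```

If the mapper emitted, say, an integer key while the input keys were strings,
partitionFor could not even be called with these split points, which mirrors
the limitation described above.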
On Tue, Apr 23, 2013 at 4:12 PM, Rahul Bhattacharjee <
[EMAIL PROTECTED]> wrote:
> + mapred dev
> On Tue, Apr 16, 2013 at 2:19 PM, Rahul Bhattacharjee <
> [EMAIL PROTECTED]> wrote:
>> I have a question related to Hadoop's input sampler, which is used for
>> investigating the data set beforehand using random selection, sampling,
>> etc. It is mainly used for total sort, and is used in Pig's skewed join
>> implementation as well.
>> The question here is:
>> K and V are the input key and value of the mapper, essentially coming in
>> from the input format. OK and OV are the output key and value emitted from
>> the mapper.
>> Looking at the input sampler's code, it looks like it is creating the
>> partitions based on the input key of the mapper.
>> I think the partitions should be created based on the output key (OK),
>> and the output key sort comparator should be used for sorting the samples.
>> If partitioning is done based on the input key and the mapper emits a
>> different key, then the total sort would not hold.
>> Is there any condition that the input sampler is to be used only for
Rahul Bhattacharjee 2013-04-24, 07:07
Mahesh Balija 2013-04-24, 07:58