I have a question about Hadoop's InputSampler, which is used to
investigate the data set beforehand using random selection, sampling,
etc. It is mainly used for total sort, and in Pig's skewed join implementation.
The question is:
K and V are the input key and value of the mapper, essentially coming in from
the input format. OK and OV are the output key and value emitted from the
mapper.
Looking at the InputSampler code, it looks like it creates the
partitions based on the input key of the mapper.
I think the partitions should be created from the output key (OK),
and the output key sort comparator should be used for sorting the samples.
If partitioning is done on the input key and the mapper emits a different
key, then the total sort would not hold.
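
To make the concern concrete, here is a small plain-Java sketch. It does not use Hadoop's InputSampler or TotalOrderPartitioner; the class name, the helper methods, and the mapper rule OK = K * 10 are all hypothetical and chosen only for illustration. It derives range split points once from sampled input keys and once from sampled output keys, then counts how many output keys fall into each partition.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Illustration only: a toy range partitioner, not Hadoop code.
public class SamplerKeyMismatch {

    // Pick numPartitions - 1 evenly spaced split points from the sorted samples.
    static int[] splitPoints(List<Integer> samples, int numPartitions) {
        List<Integer> sorted = new ArrayList<>(samples);
        Collections.sort(sorted);
        int[] splits = new int[numPartitions - 1];
        for (int i = 1; i < numPartitions; i++) {
            splits[i - 1] = sorted.get(i * sorted.size() / numPartitions);
        }
        return splits;
    }

    // Range partitioning: count how many split points are <= key.
    static int partitionFor(int key, int[] splits) {
        int p = 0;
        while (p < splits.length && key >= splits[p]) {
            p++;
        }
        return p;
    }

    // Number of keys that land in each partition.
    static int[] partitionSizes(List<Integer> keys, int[] splits) {
        int[] sizes = new int[splits.length + 1];
        for (int k : keys) {
            sizes[partitionFor(k, splits)]++;
        }
        return sizes;
    }

    public static void main(String[] args) {
        List<Integer> inputKeys = new ArrayList<>();
        List<Integer> outputKeys = new ArrayList<>();
        for (int k = 0; k < 100; k++) {
            inputKeys.add(k);        // K: key handed to the mapper
            outputKeys.add(k * 10);  // OK: hypothetical mapper emits K * 10
        }

        int[] fromInput = splitPoints(inputKeys, 4);    // [25, 50, 75]
        int[] fromOutput = splitPoints(outputKeys, 4);  // [250, 500, 750]

        // Output keys partitioned with split points sampled from input keys.
        System.out.println(Arrays.toString(partitionSizes(outputKeys, fromInput)));  // [3, 2, 3, 92]
        // Output keys partitioned with split points sampled from output keys.
        System.out.println(Arrays.toString(partitionSizes(outputKeys, fromOutput))); // [25, 25, 25, 25]
    }
}
```

In this toy case the partitions are still ordered by range, but with split points taken from the input keys almost all output keys end up in the last partition, so the balance that sampling is meant to provide is lost. If K and OK were of different types, the split points could not be compared against the output keys at all.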
Is there any condition that the input sampler is only to be used for
Mahesh Balija 2013-04-24, 05:53
Rahul Bhattacharjee 2013-04-24, 07:07
Mahesh Balija 2013-04-24, 07:58