Re: Real-life experience of forcing smaller input splits?
Edward Capriolo 2013-01-25, 07:46
Not all files are splittable. Sequence Files are; raw gzip files are not.
On Fri, Jan 25, 2013 at 1:47 AM, Nitin Pawar <[EMAIL PROTECTED]>wrote:
> set mapred.min.split.size=1024000;
> set mapred.max.split.size=4096000;
> set hive.merge.mapfiles=false;
> I had set the above values, and lowering the max split size did
> increase my number of maps. My block size was 128MB.
> The only caveat was that my files on HDFS were not heavily compressed,
> and I was using RCFileFormat.
> If you have heavily compressed files, I would suggest you check what the
> size will be after decompression and allocate more memory to the mappers.
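A minimal session sketch of the settings described above (values are illustrative; on newer Hadoop releases the equivalent keys are mapreduce.input.fileinputformat.split.minsize and .maxsize):

```sql
-- Hive session: ask for smaller input splits (illustrative values)
set mapred.min.split.size=1024000;   -- lower bound per split, ~1 MB
set mapred.max.split.size=4096000;   -- upper bound per split, ~4 MB
set hive.merge.mapfiles=false;       -- don't merge map-only output files
-- With a 128 MB block and a splittable format such as RCFile, lowering
-- mapred.max.split.size increases the number of map tasks.
```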
> On Fri, Jan 25, 2013 at 11:46 AM, David Morel <[EMAIL PROTECTED]> wrote:
>> I have seen many posts on various sites and MLs, but didn't find a firm
>> answer anywhere: is it possible yes or no to force a smaller split size
>> than a block on the mappers, from the client side? I'm not after
>> pointers to the docs (unless you're very very sure :-) but after
>> real-life experience along the lines of 'yes, it works this way, I've
>> done it like this...'
>> All the parameters that I could find (especially specifying a max input
>> split size) seem to have no effect, and the files that I have are so
>> heavily compressed that they completely saturate the mappers' memory
>> when processed.
>> A solution I could imagine for this specific issue is reducing the block
>> size, but for now I simply went with disabling in-file compression for
>> those. And changing the block size on a per-file basis is something I'd
>> like to avoid if at all possible.
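For reference, a per-file block size can be set at write time via a generic `-D` option; the path and file name here are hypothetical:

```shell
# Upload a file with a smaller block size (64 MB) just for this file.
# Older Hadoop reads dfs.block.size; newer releases use dfs.blocksize.
hadoop fs -D dfs.block.size=67108864 -put data.rcfile /user/hive/warehouse/mytable/
```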
>> All the hive settings that we tried only got me as far as raising the
>> number of mappers from 5 to 6 (yay!) where I would have needed at least
>> ten times more.
> Nitin Pawar
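The mapper counts discussed in this thread follow from FileInputFormat's split formula, splitSize = max(minSize, min(maxSize, blockSize)), with the caveat Edward raises: a non-splittable file (raw gzip) always becomes a single split. A quick sketch of the arithmetic (file size and settings hypothetical):

```python
import math

def split_size(min_size, max_size, block_size):
    # FileInputFormat: splitSize = max(minSize, min(maxSize, blockSize))
    return max(min_size, min(max_size, block_size))

def num_splits(file_size, min_size, max_size, block_size, splittable=True):
    # A non-splittable file (e.g. raw gzip) yields exactly one split,
    # regardless of any min/max split settings.
    if not splittable:
        return 1
    return math.ceil(file_size / split_size(min_size, max_size, block_size))

MB = 1024 * 1024
block = 128 * MB            # 128 MB block size, as in the thread
file = 640 * MB             # hypothetical 640 MB input file

# Default: one split per block -> 5 mappers
print(num_splits(file, 1, block, block))           # 5
# Max split size lowered to 64 MB -> 10 mappers, if the format splits
print(num_splits(file, 1, 64 * MB, block))         # 10
# Raw gzip: one mapper no matter what the settings say
print(num_splits(file, 1, 64 * MB, block, splittable=False))  # 1
```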