Re: Real-life experience of forcing smaller input splits?
David Morel 2013-01-25, 12:28
On 25 Jan 2013, at 10:37, Bertrand Dechoux wrote:
> It seems to me the question has not been answered :
> "is it possible yes or no to force a smaller split size
> than a block on the mappers"
> Not that I know of (but you could implement something to do it), but why
> would you do it?
> By default if the split is set under the size of a block, it will be a
> block.
> One of the reasons is data locality. The second is that a block is
> written to a single hard drive (leaving replicas aside), so if n mappers
> were reading n parts from the same block, they would share the
> bandwidth... So it is not a clear win.
> You can change the block size of the file you want to read but using
> smaller block size is really an anti-pattern. Most people increase the
> block size.
> (Note: the block size of a file is fixed when the file is written, and
> it can differ between two files.)
That will be my approach for now, or disabling compression altogether for
these files. The only problem I have is that compression is so efficient
that any operation in the mapper (so on the uncompressed data) just makes
the mapper throw an OOM exception, no matter how much memory I give it.
What partly works, though, is setting a low mapred.max.split.size. In a
directory containing 34 files, I get 33 mappers (???). When setting
hive.merge.mapfiles to false (and leaving mapred.max.split.size at its
blocksize default), it doesn't seem to have any effect and I get 20
mappers.
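
For reference, a minimal sketch of the split-size arithmetic, assuming the plain FileInputFormat rule (split size = max(min size, min(max size, block size))) applies. Hive's default combining input format groups blocks differently, so real mapper counts may not match. The helper names are hypothetical, and the estimate ignores the small tolerance Hadoop allows on the last split.

```python
import math

MB = 1024 * 1024


def compute_split_size(block_size, min_size, max_size):
    # Clamp used by Hadoop's FileInputFormat:
    # the max split size caps the block size, the min split size floors it.
    return max(min_size, min(max_size, block_size))


def estimate_splits(file_size, block_size, min_size, max_size, splittable=True):
    # Hypothetical helper for a single non-empty file.
    # A non-splittable file (for example raw gzip) is always read as one split,
    # whatever the configured split sizes are.
    if not splittable:
        return 1
    split_size = compute_split_size(block_size, min_size, max_size)
    return math.ceil(file_size / split_size)


if __name__ == "__main__":
    # Values quoted in this thread: 128 MB blocks, min 1024000, max 4096000.
    print(compute_split_size(128 * MB, 1024000, 4096000))
    print(estimate_splits(128 * MB, 128 * MB, 1024000, 4096000))
    print(estimate_splits(128 * MB, 128 * MB, 1024000, 4096000, splittable=False))
```

Under these assumptions a max split size below the block size does yield splits smaller than a block, but only for files in a splittable format.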
> Are you trying to handle data which are too small?
> If Hive supports multi-threading for mappers, it might be a solution.
> But I don't know the configuration for that.
> PS : the question is quite general and not really hive related
I realized that after re-reading the whole thread :-)
Thanks for all the answers, everyone!
> On Fri, Jan 25, 2013 at 8:46 AM, Edward Capriolo
> <[EMAIL PROTECTED]> wrote:
>> Not all files are splittable. Sequence files are. Raw gzip files are
>> not.
>> On Fri, Jan 25, 2013 at 1:47 AM, Nitin Pawar
>> <[EMAIL PROTECTED]> wrote:
>>> set mapred.min.split.size=1024000;
>>> set mapred.max.split.size=4096000;
>>> set hive.merge.mapfiles=false;
>>> I had set the above values, and setting the max split size to a lower
>>> value increased my number of maps. My block size was 128MB.
>>> The only thing was that my files on HDFS were not heavily compressed and
>>> I was using RCFileFormat.
>>> I would suggest that if you have heavily compressed files, you may
>>> want to check what the size will be after decompression and allocate
>>> more memory to the mappers.
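
One way to check the size after decompression, sketched under the assumption that the data is a plain gzip stream. Files compressed inside a container format such as RCFile or SequenceFile would need that format's own reader instead. The function names are hypothetical.

```python
import gzip
import io


def uncompressed_size(fileobj, chunk_size=1 << 20):
    """Stream a gzip-compressed file object and count the decompressed bytes."""
    total = 0
    with gzip.GzipFile(fileobj=fileobj, mode="rb") as gz:
        while True:
            chunk = gz.read(chunk_size)  # read in chunks to keep memory flat
            if not chunk:
                break
            total += len(chunk)
    return total


def expansion_ratio(compressed_bytes):
    # How many bytes of raw data each compressed byte expands to.
    return uncompressed_size(io.BytesIO(compressed_bytes)) / len(compressed_bytes)


if __name__ == "__main__":
    # Highly repetitive sample data compresses extremely well.
    sample = gzip.compress(b"0" * 10_000_000)
    print(len(sample), uncompressed_size(io.BytesIO(sample)))
    print(round(expansion_ratio(sample)))
```

Multiplying the split size by a ratio measured this way gives a rough idea of how much raw data a single mapper has to work through.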
>>> On Fri, Jan 25, 2013 at 11:46 AM, David Morel <[EMAIL PROTECTED]>
>>>> I have seen many posts on various sites and MLs, but didn't find an
>>>> answer anywhere: is it possible, yes or no, to force a smaller split
>>>> size than a block on the mappers, from the client side? I'm not after
>>>> pointers to the docs (unless you're very very sure :-) but after
>>>> real-life experience along the lines of 'yes, it works this way,
>>>> I've done it like this...'
>>>> All the parameters that I could find (especially specifying a max
>>>> split size) seem to have no effect, and the files that I have are so
>>>> heavily compressed that they completely saturate the mappers' memory
>>>> when processed.
>>>> A solution I could imagine for this specific issue is reducing the
>>>> size, but for now I simply went with disabling in-file compression
>>>> for those. And changing the block size on a per-file basis is
>>>> something I'd like to avoid if at all possible.
>>>> All the Hive settings that we tried only got me as far as raising the
>>>> number of mappers from 5 to 6 (yay!) where I would have needed at
>>>> least ten times more.