Re: Real-life experience of forcing smaller input splits?
On 25 Jan 2013, at 10:37, Bertrand Dechoux wrote:

> It seems to me the question has not been answered:
> "is it possible yes or no to force a smaller split size
> than a block on the mappers"
> Not that I know of (though you could implement something to do it), but
> why would you do it?
> By default, if the split size is set under the size of a block, it will
> be a block.
> One of the reasons is data locality. The second is that a block is
> written to a single hard drive (leaving replicas aside), so if n
> mappers were reading n parts of the same block, they would share that
> drive's bandwidth... So it is not a clear win.
> You can change the block size of the file you want to read, but using
> a smaller block size is really an anti-pattern. Most people increase
> the block size.
> (Note: the block size of a file is fixed when the file is written, and
> it can differ between two files.)
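
For context, a minimal sketch of how the split size is derived in
Hadoop's newer mapreduce API, simplified from
org.apache.hadoop.mapreduce.lib.input.FileInputFormat (the older mapred
API and Hive's combine input format compute splits differently, so this
is illustrative only; the class name is made up for the demo):

    public class SplitSizeDemo {
        // The effective split size is clamped between the configured min
        // and max split sizes, defaulting to the block size when neither
        // constraint binds.
        static long computeSplitSize(long blockSize, long minSize, long maxSize) {
            return Math.max(minSize, Math.min(maxSize, blockSize));
        }

        public static void main(String[] args) {
            long blockSize = 128L << 20; // 128 MB block
            // Defaults (minSize = 1, maxSize = Long.MAX_VALUE):
            // one block per split.
            System.out.println(computeSplitSize(blockSize, 1L, Long.MAX_VALUE));
            // A max split size below the block size yields sub-block
            // splits for splittable inputs, at the cost of data locality.
            System.out.println(computeSplitSize(blockSize, 1L, 4L << 20));
        }
    }

Under that formula, a max split size below the block size does produce
sub-block splits for splittable inputs, which may be why
mapred.max.split.size "partly works" as reported below.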

That will be my approach for now, or disabling compression altogether for
these files. The only problem I have is that compression is so efficient
that any operation in the mapper (so on the uncompressed data) just makes
the mapper throw an OOM exception, no matter how much memory I give it.

What partly works, though, is setting a low mapred.max.split.size. In a
directory containing 34 files, I get 33 mappers (???). When setting
hive.merge.mapfiles to false (and leaving mapred.max.split.size at its
blocksize default), it doesn't seem to have any effect and I get 20
mappers.
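
For a plain MapReduce job, the client-side equivalents of those knobs
would look roughly like the sketch below (assuming the Hadoop 2
mapreduce API; the class name and the 4 MB cap are arbitrary
illustrations, and Hive routes these settings through its own input
format, so results may differ):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

    public class SmallSplitsJob {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "small-splits");
            // Ask for splits no larger than 4 MB; only honoured for
            // splittable inputs (SequenceFile, RCFile, bzip2), not raw gzip.
            FileInputFormat.setMaxInputSplitSize(job, 4L << 20);
            FileInputFormat.setMinInputSplitSize(job, 1L << 20);
            // ... set mapper class and input/output paths, then submit.
        }
    }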

> Are you trying to handle data which is too small?
> If Hive supports multi-threading for mappers, it might be a solution,
> but I don't know the configuration for that.
> Regards
> Bertrand
> PS: the question is quite general and not really Hive-related

I realized that after re-reading the whole thread :-)

Thanks for all the answers, everyone!


> On Fri, Jan 25, 2013 at 8:46 AM, Edward Capriolo wrote:
>> Not all files are splittable. Sequence Files are; raw gzip files are
>> not.
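
As an illustration of that point, input formats typically decide
splittability per file by checking the compression codec. A sketch
modeled on Hadoop's TextInputFormat.isSplitable() (the subclass name is
hypothetical):

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.compress.CompressionCodec;
    import org.apache.hadoop.io.compress.CompressionCodecFactory;
    import org.apache.hadoop.io.compress.SplittableCompressionCodec;
    import org.apache.hadoop.mapreduce.JobContext;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

    public class CodecAwareTextInputFormat extends TextInputFormat {
        @Override
        protected boolean isSplitable(JobContext context, Path file) {
            // Uncompressed files can always be split; compressed files
            // only if the codec supports it (bzip2 does, gzip does not).
            CompressionCodec codec = new CompressionCodecFactory(
                    context.getConfiguration()).getCodec(file);
            return codec == null
                    || codec instanceof SplittableCompressionCodec;
        }
    }
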
>> On Fri, Jan 25, 2013 at 1:47 AM, Nitin Pawar
>> <[EMAIL PROTECTED]> wrote:
>>> set mapred.min.split.size=1024000;
>>> set mapred.max.split.size=4096000;
>>> set hive.merge.mapfiles=false;
>>> I had set the above values, and setting the max split size to a
>>> lower value did increase my number of maps. My block size was 128MB.
>>> The only thing was, my files on HDFS were not heavily compressed,
>>> and I was using RCFileFormat.
>>> I would suggest that if you have heavily compressed files, you check
>>> what the size will be after decompression and allocate more memory
>>> to the maps.
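
A sketch of what "allocate more memory to maps" could look like with
the Hadoop-1-era properties used above (the 2 GB heap value and the
class name are arbitrary illustrations, not from the thread):

    import org.apache.hadoop.conf.Configuration;

    public class MapMemoryConfig {
        public static void main(String[] args) {
            Configuration conf = new Configuration();
            // Cap the split size as in the settings above.
            conf.setLong("mapred.min.split.size", 1024000L);
            conf.setLong("mapred.max.split.size", 4096000L);
            // Give each task JVM more heap for decompressed data.
            conf.set("mapred.child.java.opts", "-Xmx2048m");
            System.out.println(conf.get("mapred.child.java.opts"));
        }
    }
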
>>> On Fri, Jan 25, 2013 at 11:46 AM, David Morel <[EMAIL PROTECTED]>
>>> wrote:
>>>> Hello,
>>>> I have seen many posts on various sites and MLs, but didn't find a
>>>> firm answer anywhere: is it possible yes or no to force a smaller
>>>> split size than a block on the mappers, from the client side? I'm
>>>> not after pointers to the docs (unless you're very very sure :-)
>>>> but after real-life experience along the lines of 'yes, it works
>>>> this way, I've done it like this...'
>>>> All the parameters that I could find (especially specifying a max
>>>> input split size) seem to have no effect, and the files that I have
>>>> are so heavily compressed that they completely saturate the
>>>> mappers' memory when processed.
>>>> A solution I could imagine for this specific issue is reducing the
>>>> block size, but for now I simply went with disabling in-file
>>>> compression for those. And changing the block size on a per-file
>>>> basis is something I'd like to avoid if at all possible.
>>>> All the hive settings that we tried only got me as far as raising
>>>> the number of mappers from 5 to 6 (yay!) where I would have needed
>>>> at least ten times more.
>>>> Thanks!
>>>> D.Morel