Hive, mail # user - Real-life experience of forcing smaller input splits?

David Morel 2013-01-25, 06:16
Mathieu Despriee 2013-01-25, 07:44
Nitin Pawar 2013-01-25, 06:47
Edward Capriolo 2013-01-25, 07:46
Bertrand Dechoux 2013-01-25, 09:37
David Morel 2013-01-25, 09:53
David Morel 2013-01-25, 12:28
Re: Real-life experience of forcing smaller input splits?
Dean Wampler 2013-01-25, 13:39

> That will be my approach for now, or disabling compression altogether for
> these files. The only problem I have is that compression is so efficient
> that any operation in the mapper (so on the uncompressed data) just makes
> the mapper throw an OOM exception, no matter how much memory I give it.
> What partly works though, is setting a low mapred.max.split.size. In a
> directory containing 34 files, I get 33 mappers (???). When setting
> hive.merge.mapfiles to false (and leaving mapred.max.split.size at its fs
> blocksize default), it doesn't seem to have any effect and I get 20 mappers
> only.
You can still use compression if you use a splittable format, like bzip2,
which compresses data in independent blocks. Plain gzip files aren't splittable.
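
As a rough sketch of what that could look like in a Hive session: the
settings below ask Hive to write bzip2-compressed output, and the INSERT
rewrites the existing data through that codec. The property names are the
Hadoop 1.x-era ones used elsewhere in this thread, and the table names
(logs_gz, logs_bz2) are hypothetical. Whether bzip2 input is actually split
depends on the Hadoop version in use, so treat this as something to test
rather than a guaranteed fix.

```sql
-- Assumption: Hadoop 1.x-style property names; newer releases use mapreduce.* equivalents.
SET hive.exec.compress.output=true;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.BZip2Codec;

-- Hypothetical tables: logs_gz holds the gzip files, logs_bz2 has the same schema.
INSERT OVERWRITE TABLE logs_bz2
SELECT * FROM logs_gz;
```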

If you're running out of memory, you could also increase the heap size for
the task (child) JVMs. See the "Real-World Cluster Configurations" section of this


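For illustration, the task heap can usually be raised per session along
these lines. This assumes the Hadoop 1.x property mapred.child.java.opts,
and the 2048m value is only a placeholder. If the cluster marks the property
as final in mapred-site.xml, a session-level override may be ignored.

```sql
-- Assumption: Hadoop 1.x property name; the heap value is illustrative, not a recommendation.
SET mapred.child.java.opts=-Xmx2048m;
```
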
By the way, you could also experiment with turning on intermediate
compression (compression of the data sent between the mapper and reducer
tasks), compression of the output, etc., as discussed here:

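A minimal sketch of those compression switches, again assuming Hadoop
1.x-era property names. DefaultCodec is used only because it needs no native
libraries; another installed codec could be substituted.

```sql
-- Compress the intermediate files Hive writes between the stages of a multi-job query.
SET hive.exec.compress.intermediate=true;

-- Compress map output before it is shuffled to the reducers.
SET mapred.compress.map.output=true;
SET mapred.map.output.compression.codec=org.apache.hadoop.io.compress.DefaultCodec;

-- Compress the final query output.
SET hive.exec.compress.output=true;
```
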