Hive >> mail # user >> Real-life experience of forcing smaller input splits?


Re: Real-life experience of forcing smaller input splits?
...

>
> That will be my approach for now, or disabling compression altogether for
> these files. The only problem I have is that compression is so efficient
> that any operation in the mapper (so on the uncompressed data) just makes
> the mapper throw an OOM exception, no matter how much memory I give it.
>
> What partly works though, is setting a low mapred.max.split.size. In a
> directory containing 34 files, I get 33 mappers (???). When setting
> hive.merge.mapfiles to false (and leaving mapred.max.split.size at its fs
> blocksize default), it doesn't seem to have any effect and I get 20 mappers
> only.
>
>
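For reference, the split and merge settings discussed above are normally set per session from the Hive CLI; a minimal sketch, with illustrative values (not recommendations):

```sql
-- Cap the input split size so more mappers are created
-- (value is in bytes; 32 MB here is purely illustrative).
SET mapred.max.split.size=33554432;

-- Keep Hive from merging small map-only output files back together.
SET hive.merge.mapfiles=false;
```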
You can still use compression if you use a splittable format, like bzip2
with block compression. Gzip isn't splittable.
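A sketch of writing bzip2-compressed (and therefore splittable) output from a Hive query; these are the Hadoop 1.x property names, and the exact names may differ on other versions:

```sql
-- Compress final job output with the splittable bzip2 codec.
SET hive.exec.compress.output=true;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.BZip2Codec;
```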

If you're running out of memory, you could also increase the heap size for
the child task JVMs (the per-task mapper/reducer processes). See the
"Real-World Cluster Configurations" section of this page:

http://hadoop.apache.org/docs/r1.0.3/cluster_setup.html
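That section sets the per-task heap via mapred.child.java.opts in mapred-site.xml; a minimal sketch (the 1 GB value is illustrative):

```xml
<!-- mapred-site.xml: heap for each map/reduce child JVM (illustrative value) -->
<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx1024m</value>
</property>
```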

By the way, you could also experiment with turning on intermediate
compression: compressing the data sent between the mapper and reducer
tasks, compressing the final output, and so on, as discussed here:

https://cwiki.apache.org/Hive/adminmanual-configuration.html
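As a sketch, the intermediate- and map-output-compression knobs mentioned above look like this in a Hive session (Hadoop 1.x property names; SnappyCodec is one common choice of codec, assuming Snappy is installed on the cluster):

```sql
-- Compress data passed between the stages of a multi-job Hive query.
SET hive.exec.compress.intermediate=true;

-- Compress map output on its way to the reducers.
SET mapred.compress.map.output=true;
SET mapred.map.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
```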