Hive, mail # user - Real-life experience of forcing smaller input splits?


Re: Real-life experience of forcing smaller input splits?
Dean Wampler 2013-01-25, 13:39
...

>
> That will be my approach for now, or disabling compression altogether for
> these files. The only problem I have is that compression is so efficient
> that any operation in the mapper (so on the uncompressed data) just makes
> the mapper throw an OOM exception, no matter how much memory I give it.
>
> What partly works though, is setting a low mapred.max.split.size. In a
> directory containing 34 files, I get 33 mappers (???). When setting
> hive.merge.mapfiles to false (and leaving mapred.max.split.size at its fs
> blocksize default), it doesn't seem to have any effect and I get 20 mappers
> only.
>
>
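For reference, the split-size approach quoted above might look like this in a Hive session. This is only a sketch: the 32 MB value is an arbitrary illustration, and it assumes CombineHiveInputFormat is the input format in use. If the files are gzip-compressed, each file still goes to a single mapper whatever the limit is, so lowering it can only stop small files from being combined into one split.

```sql
-- Assumption: CombineHiveInputFormat is in effect for this session.
SET hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;

-- Illustrative value only (32 MB, in bytes); tune for your data.
SET mapred.max.split.size=33554432;
```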
You can still use compression if you use a splittable format, like bzip2
with block compression. Gzip isn't splittable.
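As a rough sketch of what switching codecs could look like, the statements below rewrite a gzip-backed table into a bzip2-compressed copy. The table names events_gz and events_bz2 are hypothetical, and events_bz2 is assumed to already exist as a text table with the same schema. Whether bzip2 input is actually split at read time depends on your Hadoop version, so check that before relying on it.

```sql
-- Compress the files this query writes, using the bzip2 codec.
SET hive.exec.compress.output=true;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.BZip2Codec;

-- Hypothetical tables: copy the gzip-backed data into the bzip2-backed one.
INSERT OVERWRITE TABLE events_bz2
SELECT * FROM events_gz;
```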

If you're running out of memory, you could also increase the heap size for
the task (child) JVMs. See the "Real-World Cluster Configurations" section
of this page:

http://hadoop.apache.org/docs/r1.0.3/cluster_setup.html
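On Hadoop 1.x the heap for map and reduce task JVMs is normally controlled by mapred.child.java.opts, which can be overridden per Hive session. A minimal sketch, where the 2 GB figure is purely an example and must fit within what each node can spare per task slot:

```sql
-- Illustrative heap size for the task JVMs launched by this session's jobs.
SET mapred.child.java.opts=-Xmx2048m;
```

If the cluster administrator has marked this property as final, the session-level override will not take effect.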

By the way, you could also experiment with turning on intermediate
compression (compression of the data sent between the mapper and reducer
tasks), compression of the final output, etc., as discussed here:

https://cwiki.apache.org/Hive/adminmanual-configuration.html
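A sketch of those switches as session settings, using the property names as I recall them from the Hive and Hadoop 1.x configuration docs. Please verify them against your versions before relying on this.

```sql
-- Compress data written between the stages of a multi-job query.
SET hive.exec.compress.intermediate=true;

-- Compress map output before it is shuffled to the reducers.
SET mapred.compress.map.output=true;

-- Compress the final output files of the query.
SET hive.exec.compress.output=true;
```

Keep in mind that compressing the final output with a non-splittable codec brings back the same split problem for whatever job reads those files next.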
>  ...
>>
>