Hive >> mail # user >> Real-life experience of forcing smaller input splits?


David Morel 2013-01-25, 06:16
Mathieu Despriee 2013-01-25, 07:44
Nitin Pawar 2013-01-25, 06:47
Edward Capriolo 2013-01-25, 07:46
Bertrand Dechoux 2013-01-25, 09:37
David Morel 2013-01-25, 09:53
David Morel 2013-01-25, 12:28
Re: Real-life experience of forcing smaller input splits?
...

>
> That will be my approach for now, or disabling compression altogether for
> these files. The only problem I have is that compression is so efficient
> that any operation in the mapper (so on the uncompressed data) just makes
> the mapper throw an OOM exception, no matter how much memory I give it.
>
> What partly works though, is setting a low mapred.max.split.size. In a
> directory containing 34 files, I get 33 mappers (???). When setting
> hive.merge.mapfiles to false (and leaving mapred.max.split.size at its fs
> blocksize default), it doesn't seem to have any effect and I get 20 mappers
> only.
>
>
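For reference, here is a sketch of how the split-size settings mentioned above are usually set per session. The 64 MB value is only an illustration, and I'm assuming the job runs with CombineHiveInputFormat, which packs several files into one split up to mapred.max.split.size. As far as I know, hive.merge.mapfiles only controls merging of small output files after a map-only job, so it would not be expected to change the number of mappers.

```sql
-- Assumption: the combining input format is in use
-- (check the current value with: SET hive.input.format;)
SET hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
-- Illustrative value: cap each combined split at 64 MB (value is in bytes)
SET mapred.max.split.size=67108864;
```

Keep in mind that a gzip file can't be split, so each .gz file should still end up in a single mapper whatever the cap is; the cap only limits how many files get packed together.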
You can still use compression if you use a splittable format, such as bzip2
or SequenceFiles with block compression. Gzip-compressed text files aren't
splittable.
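A rough sketch of what that could look like when rewriting the data from Hive. The table names (events_raw, events_seq) are made up for illustration, and the codec is an assumption; use whichever codec is installed on your cluster.

```sql
-- Write query output compressed
SET hive.exec.compress.output=true;
-- BLOCK compresses batches of records inside the SequenceFile,
-- so the resulting file can still be split across mappers
SET mapred.output.compression.type=BLOCK;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.BZip2Codec;

-- Hypothetical table names: copy the existing data into a SequenceFile table
CREATE TABLE events_seq STORED AS SEQUENCEFILE
AS SELECT * FROM events_raw;
```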

If you're running out of memory, you could also increase the heap size for
the child JVMs that run the map tasks. See the "Real-World Cluster
Configurations" section of this page:

http://hadoop.apache.org/docs/r1.0.3/cluster_setup.html
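From a Hive session, a minimal sketch of raising the task heap would be the following. The 2 GB figure is purely illustrative, and the setting only takes effect if your administrators haven't marked the property final in the cluster configuration.

```sql
-- Illustrative value only: JVM options passed to each map/reduce child task
SET mapred.child.java.opts=-Xmx2048m;
```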

By the way, you could also experiment with turning on intermediate
compression (compression of the data sent between the mapper and reducer
tasks), compression of the output, and so on, as discussed here:

https://cwiki.apache.org/Hive/adminmanual-configuration.html
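As a sketch, the intermediate part can be switched on per session like this. The codec below is an assumption, picked because it ships with Hadoop; a faster codec may be a better fit if it is installed on your cluster.

```sql
-- Compress files passed between the MapReduce stages of a single Hive query
SET hive.exec.compress.intermediate=true;
-- Compress map output before it is shuffled to the reducers
SET mapred.compress.map.output=true;
-- Assumed codec; replace with one available on your cluster
SET mapred.map.output.compression.codec=org.apache.hadoop.io.compress.DefaultCodec;
```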
>  ...
>>
>
Edward Capriolo 2013-01-25, 07:44