Hive >> mail # user >> single output file per partition?


single output file per partition?
What's the best way to enforce a single output file per partition?

INSERT OVERWRITE TABLE <table>
PARTITION (x,y,z)
SELECT ...
FROM ...
WHERE ...

I tried adding CLUSTER BY x,y,z at the end, thinking that sorting would
force a single reducer per partition, but that didn't work. I still got
multiple files per partition.
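
A minimal sketch of the attempt described above, for concreteness. The table and column names (`events_out`, `events_raw`, `payload`) are hypothetical placeholders and the two `SET` lines are an assumption about how the dynamic-partition insert was enabled; only the overall shape (dynamic partitions on x, y, z plus a trailing CLUSTER BY) comes from the question. CLUSTER BY is shorthand for DISTRIBUTE BY plus SORT BY on the same columns.

```sql
-- Assumed settings for a fully dynamic partition insert.
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- Hypothetical tables; partition columns must come last in the SELECT list.
INSERT OVERWRITE TABLE events_out
PARTITION (x, y, z)
SELECT payload, x, y, z
FROM events_raw
WHERE payload IS NOT NULL
-- Equivalent to DISTRIBUTE BY x, y, z SORT BY x, y, z
CLUSTER BY x, y, z;
```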

Do I have to use a single reduce task? With a few TB of data that's
probably not a good idea.

My current idea is to create a temp table with the same partitioning
structure, insert into that table first, and then select * from that table
into the output table. With combineinputformat=true that should work, right?
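
The two-step idea could be sketched roughly as follows. This is an untested outline, not a confirmed solution: `events_tmp`, `events_out`, `events_raw`, and `payload` are hypothetical names, and reading "combineinputformat=true" as setting `hive.input.format` to `CombineHiveInputFormat` is an assumption. Whether the second step really yields exactly one file per partition depends on how many combined splits (and therefore mappers) each partition ends up with.

```sql
-- Staging table with the same columns and partitioning as the target.
CREATE TABLE events_tmp LIKE events_out;

SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- Step 1: write the filtered data into the staging table.
INSERT OVERWRITE TABLE events_tmp
PARTITION (x, y, z)
SELECT payload, x, y, z
FROM events_raw
WHERE payload IS NOT NULL;

-- Step 2: copy into the output table, letting Hive combine small input files per mapper.
SET hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;

INSERT OVERWRITE TABLE events_out
PARTITION (x, y, z)
SELECT * FROM events_tmp;
```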

Or should I make Hive merge output files instead? (using hive.merge.mapfiles)
Will that work with a partitioned table?
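
For reference, the merge alternative would look something like the fragment below. Only `hive.merge.mapfiles` is named in the question; the other three properties are related Hive merge settings added here for illustration, and the byte values are arbitrary examples. Whether the merge step applies to a dynamic-partition insert on a given Hive version is exactly the open question.

```sql
-- Merge small files produced by map-only jobs.
SET hive.merge.mapfiles=true;
-- Merge small files produced by map-reduce jobs.
SET hive.merge.mapredfiles=true;
-- Target size of each merged file, in bytes (illustrative value).
SET hive.merge.size.per.task=256000000;
-- Trigger a merge when the average output file size is below this, in bytes (illustrative value).
SET hive.merge.smallfiles.avgsize=16000000;
```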

Thanks!
igor
Stephen Sprague 2013-08-21, 15:51
Igor Tatarinov 2013-08-21, 18:12
Stephen Sprague 2013-08-21, 19:07
Sanjay Subramanian 2013-08-21, 19:15
Igor Tatarinov 2013-08-21, 20:19