single output file per partition?
What's the best way to enforce a single output file per partition?

INSERT OVERWRITE TABLE <table>
PARTITION (x,y,z)
SELECT ...
FROM ...
WHERE ...

I tried adding CLUSTER BY x,y,z at the end, thinking that sorting would
force a single reducer per partition, but that didn't work. I still got
multiple files per partition.

Do I have to use a single reduce task? With a few TB of data that's
probably not a good idea.

My current idea is to create a temp table with the same partitioning
structure, insert into that table first, and then select * from it
into the output table. With combineinputformat=true, that should work, right?
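
Roughly, the two-step idea would look like the sketch below. The table name
tmp_table is a placeholder I made up, and I am assuming that
"combineinputformat" means setting hive.input.format to
CombineHiveInputFormat. Even then the number of files depends on the split
size, so this is untested and may not give exactly one file per partition.

```sql
-- Sketch only; tmp_table is a hypothetical name, <table> is the real target.
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- Step 1: staging table with the same schema and partitioning as the target.
CREATE TABLE tmp_table LIKE <table>;

INSERT OVERWRITE TABLE tmp_table
PARTITION (x,y,z)
SELECT ...
FROM ...
WHERE ...;

-- Step 2: copy into the target, letting each mapper read many small files.
-- Assumption: this is the setting meant by "combineinputformat=true".
SET hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;

INSERT OVERWRITE TABLE <table>
PARTITION (x,y,z)
SELECT * FROM tmp_table;
```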

Or should I make Hive merge output files instead? (using hive.merge.mapfiles)
Will that work with a partitioned table?
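
For reference, the merge variant I have in mind is sketched below. The
property names are standard Hive settings, but the values are guesses I
have not tested, and as far as I understand merging only combines small
files up to a target size, so it may still leave more than one file in a
large partition.

```sql
-- Sketch only; the size values are illustrative, not tuned.
SET hive.merge.mapfiles=true;                 -- merge after map-only jobs
SET hive.merge.mapredfiles=true;              -- merge after map-reduce jobs
SET hive.merge.size.per.task=256000000;       -- target size of merged files (bytes)
SET hive.merge.smallfiles.avgsize=128000000;  -- merge if avg output file is smaller

-- ...followed by the same INSERT OVERWRITE ... PARTITION (x,y,z) statement as above.
```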

Thanks!
igor