Hive >> mail # user >> single output file per partition?


Re: single output file per partition?
Hi Igor,
Lots of ideas there! I can't speak to all of them, but first let me confirm:
did "cluster by X into 1 bucket" really not work? I would have expected that
to do it.
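
For reference, one way to read "cluster by X into 1 bucket" is as a bucketed target table. The following is only a sketch under that assumption: the table names (target_table, source_table) and the columns id and payload are hypothetical, and only the partition columns x, y, z come from the thread.

```sql
-- Hypothetical names: target_table, source_table, id, payload.
-- Only the partition columns x, y, z appear in the original question.
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
-- Required on older Hive releases so inserts honor the bucket definition;
-- the property was removed in Hive 2.0, where bucketing is always enforced.
SET hive.enforce.bucketing=true;

CREATE TABLE target_table (
  id      BIGINT,
  payload STRING
)
PARTITIONED BY (x STRING, y STRING, z STRING)
CLUSTERED BY (id) INTO 1 BUCKETS;

-- Dynamic-partition insert; the partition columns go last in the SELECT list.
INSERT OVERWRITE TABLE target_table
PARTITION (x, y, z)
SELECT id, payload, x, y, z
FROM source_table;
```

How many reducers Hive uses for an insert like this depends on the version and settings. On some releases the reducer count may follow the bucket count, which would bring back the single-reducer concern raised in the question, so it is worth checking on a small sample first.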
On Tue, Aug 20, 2013 at 2:29 PM, Igor Tatarinov <[EMAIL PROTECTED]> wrote:

> What's the best way to enforce a single output file per partition?
>
> INSERT OVERWRITE TABLE <table>
> PARTITION (x,y,z)
> SELECT ...
> FROM ...
> WHERE ...
>
> I tried adding CLUSTER BY x,y,z at the end, thinking that sorting would
> force a single reducer per partition, but that didn't work. I still got
> multiple files per partition.
>
> Do I have to use a single reduce task? With a few TB of data that's
> probably not a good idea.
>
> My current idea is to create a temp table with the same partitioning
> structure. Insert into that table first and then select * from that table
> into the output table. With combineinputformat=true that should work, right?
>
> Or should I make Hive merge output files instead? (using hive.merge.mapfiles)
> Will that work with a partitioned table?
>
> Thanks!
> igor
>
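
For the merge option mentioned in the question, a minimal sketch of the relevant session settings follows. The property names are standard Hive configuration properties, but the size values, the table names (target_table, source_table), and the columns id and payload are illustrative assumptions, not taken from the thread.

```sql
-- Merge small output files at the end of the job.
SET hive.merge.mapfiles=true;      -- applies to map-only jobs
SET hive.merge.mapredfiles=true;   -- applies to map-reduce jobs

-- Illustrative values (bytes), not recommendations:
SET hive.merge.size.per.task=256000000;       -- target size of a merged file
SET hive.merge.smallfiles.avgsize=128000000;  -- merge runs if the average output file is smaller than this

SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- Hypothetical table and column names; x, y, z are the partition columns.
-- DISTRIBUTE BY routes all rows with the same (x, y, z) to the same reducer
-- without the additional sort that CLUSTER BY implies.
INSERT OVERWRITE TABLE target_table
PARTITION (x, y, z)
SELECT id, payload, x, y, z
FROM source_table
DISTRIBUTE BY x, y, z;
```

Merging reduces the number of files but does not by itself guarantee exactly one file per partition: a partition larger than hive.merge.size.per.task can still end up with several files.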