Hive >> mail # user >> single output file per partition?


Re: single output file per partition?
Using a single bucket per partition seems to create a single reducer which
is too slow.
I've tried enforcing small files merge but that didn't work. I still got
multiple output files.
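
For reference, a sketch of the merge-related settings this usually refers to. The message above does not say which properties were actually set, so both the selection and the values here are assumptions for illustration only.

```sql
-- Assumed settings; the original message does not list what was tried.
SET hive.merge.mapfiles=true;       -- merge small files after a map-only job
SET hive.merge.mapredfiles=true;    -- merge small files after a map-reduce job
-- Illustrative values: target size of a merged file, and the average
-- output file size below which Hive schedules the extra merge step.
SET hive.merge.size.per.task=256000000;
SET hive.merge.smallfiles.avgsize=128000000;
```

Note that the merge step aims at a target file size rather than a file count, so a partition larger than hive.merge.size.per.task may still end up with more than one file.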

Creating a temp table and then "combining" the multiple files into one
using a simple select * is the only option that seems to work. It's odd
that I have to create the temp table but I don't see a workaround.
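
A minimal sketch of the two-step workaround described above. The names staging_table, final_table, source_table, col_a and col_b are hypothetical, and the split size is an illustrative value, not something taken from this thread.

```sql
-- Hypothetical table and column names throughout.
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- Step 1: load the staging table; each partition may get many files.
INSERT OVERWRITE TABLE staging_table PARTITION (x, y, z)
SELECT col_a, col_b, x, y, z
FROM source_table;

-- Step 2: map-only copy. CombineHiveInputFormat lets one mapper read
-- several small input files, so fewer output files are written.
SET hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
SET mapred.max.split.size=1073741824;  -- assumed value (1 GB)
INSERT OVERWRITE TABLE final_table PARTITION (x, y, z)
SELECT * FROM staging_table;
```

Whether step 2 leaves exactly one file per partition probably depends on the split size relative to the partition sizes, so it is worth checking the output directories after the run.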
On Wed, Aug 21, 2013 at 8:51 AM, Stephen Sprague <[EMAIL PROTECTED]> wrote:

> hi igor,
> lots of ideas there!  I can't speak for them all but let me confirm first
> that "cluster by X into 1 bucket" didn't work?  I would have thought that
> would have done it.
>
>
>
>
> On Tue, Aug 20, 2013 at 2:29 PM, Igor Tatarinov <[EMAIL PROTECTED]> wrote:
>
>> What's the best way to enforce a single output file per partition?
>>
>> INSERT OVERWRITE TABLE <table>
>> PARTITION (x,y,z)
>> SELECT ...
>> FROM ...
>> WHERE ...
>>
>> I tried adding CLUSTER BY x,y,z at the end thinking that sorting will
>> force a single reducer per partition but that didn't work. I still got
>> multiple files per partition.
>>
>> Do I have to use a single reduce task? With a few TB of data that's
>> probably not a good idea.
>>
>> My current idea is to create a temp table with the same partitioning
>> structure. Insert into that table first and then select * from that table
>> into the output table. With combineinputformat=true that should work, right?
>>
>> Or should I make Hive merge output files instead? (using hive.merge.mapfiles)
>> Will that work with a partitioned table?
>>
>> Thanks!
>> igor
>>
>
>