Hive user mailing list: single output file per partition?


Earlier messages in this thread: Igor Tatarinov (2013-08-20, 21:29), Stephen Sprague (2013-08-21, 15:51), Igor Tatarinov (2013-08-21, 18:12).
Re: single output file per partition?
I see.  I'll have to punt then.  However, there is an after-the-fact file
crusher Ed Capriolo wrote a while back, here:
https://github.com/edwardcapriolo/filecrush  YMMV
On Wed, Aug 21, 2013 at 11:12 AM, Igor Tatarinov <[EMAIL PROTECTED]> wrote:

> Using a single bucket per partition seems to create a single reducer which
> is too slow.
>  I've tried enforcing small files merge but that didn't work. I still got
> multiple output files.
>
> Creating a temp table and then "combining" the multiple files into one
> using a simple select * is the only option that seems to work. It's odd
> that I have to create the temp table but I don't see a workaround.
>
>
> On Wed, Aug 21, 2013 at 8:51 AM, Stephen Sprague <[EMAIL PROTECTED]> wrote:
>
>> hi igor,
>> lots of ideas there!  I can't speak for them all but let me confirm first
>> that "cluster by X into 1 bucket" didn't work?  I would have thought that
>> would have done it.
>>
>>
>>
>>
>> On Tue, Aug 20, 2013 at 2:29 PM, Igor Tatarinov <[EMAIL PROTECTED]> wrote:
>>
>>> What's the best way to enforce a single output file per partition?
>>>
>>> INSERT OVERWRITE TABLE <table>
>>> PARTITION (x,y,z)
>>> SELECT ...
>>> FROM ...
>>> WHERE ...
>>>
>>> I tried adding CLUSTER BY x,y,z at the end, thinking that sorting would
>>> force a single reducer per partition, but that didn't work. I still got
>>> multiple files per partition.
>>>
>>> Do I have to use a single reduce task? With a few TB of data that's
>>> probably not a good idea.
>>>
>>> My current idea is to create a temp table with the same partitioning
>>> structure. Insert into that table first and then select * from that table
>>> into the output table. With combineinputformat=true, that should work, right?
>>>
>>> Or should I make Hive merge output files instead? (using hive.merge.mapfiles)
>>> Will that work with a partitioned table?
>>>
>>> Thanks!
>>> igor
>>>
>>
>>
>
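
For reference, a minimal HiveQL sketch of the three approaches discussed in this
thread: bucketing into a single bucket, the small-file merge settings, and the
staging-table pass Igor reports as the only thing that worked. Table and column
names here (src, dst, dst_bucketed, dst_staging, id, val) are placeholders, the
threshold values are only illustrative, and while the SET parameters are standard
Hive settings, whether they actually collapse each partition to one file depends
on the file sizes the job produces.

-- Placeholder tables; names and columns are illustrative only.
CREATE TABLE src (id BIGINT, val STRING, x STRING, y STRING, z STRING);

CREATE TABLE dst (id BIGINT, val STRING)
PARTITIONED BY (x STRING, y STRING, z STRING);

-- Dynamic-partition inserts (as in the original question) need these:
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- (1) Stephen's "cluster by X into 1 bucket" suggestion: a table bucketed
--     INTO 1 BUCKETS, so each partition is written as a single bucket file.
--     As Igor notes, this funnels the write through one reducer.
SET hive.enforce.bucketing=true;

CREATE TABLE dst_bucketed (id BIGINT, val STRING)
PARTITIONED BY (x STRING, y STRING, z STRING)
CLUSTERED BY (id) INTO 1 BUCKETS;

INSERT OVERWRITE TABLE dst_bucketed
PARTITION (x, y, z)
SELECT id, val, x, y, z FROM src;

-- (2) The small-file merge settings Igor asks about: request an extra merge
--     step when the average output file falls below the threshold. (Igor
--     reports this did not give him one file per partition.)
SET hive.merge.mapfiles=true;                 -- merge outputs of map-only jobs
SET hive.merge.mapredfiles=true;              -- merge outputs of map-reduce jobs
SET hive.merge.smallfiles.avgsize=256000000;  -- merge when average file size is below this (bytes)
SET hive.merge.size.per.task=256000000;       -- target size of the merged files (bytes)

INSERT OVERWRITE TABLE dst
PARTITION (x, y, z)
SELECT id, val, x, y, z FROM src;

-- (3) The staging-table pass: write once with full parallelism, then rewrite
--     each partition with a plain SELECT * so the second, lighter job can
--     combine the small files per partition.
CREATE TABLE dst_staging LIKE dst;

INSERT OVERWRITE TABLE dst_staging
PARTITION (x, y, z)
SELECT id, val, x, y, z FROM src;

INSERT OVERWRITE TABLE dst
PARTITION (x, y, z)
SELECT * FROM dst_staging;

The trade-off in (3) is an extra full read and write of the data; the filecrush
tool linked above gets a similar result by merging the small files directly
rather than going back through a Hive query.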
Later messages in this thread: Sanjay Subramanian (2013-08-21, 19:15), Igor Tatarinov (2013-08-21, 20:19).