Hive >> mail # user >> single output file per partition?

Re: single output file per partition?

I tried file crusher with LZO, but it does not work. I have LZO correctly configured in production, and my jobs run daily with LZO compression.

I like Crusher, so I will look into why it's not working. Thanks to Edward, the code is there to tweak and test locally :-)
From: Stephen Sprague <[EMAIL PROTECTED]>
Date: Wednesday, August 21, 2013 12:07 PM
Subject: Re: single output file per partition?

I see. I'll have to punt then. However, there is an after-the-fact file crusher that Ed Capriolo wrote a while back: https://github.com/edwardcapriolo/filecrush YMMV.
On Wed, Aug 21, 2013 at 11:12 AM, Igor Tatarinov <[EMAIL PROTECTED]> wrote:
Using a single bucket per partition seems to create a single reducer which is too slow.
I've tried enforcing small files merge but that didn't work. I still got multiple output files.

Creating a temp table and then "combining" the multiple files into one using a simple select * is the only option that seems to work. It's odd that I have to create the temp table but I don't see a workaround.
On Wed, Aug 21, 2013 at 8:51 AM, Stephen Sprague <[EMAIL PROTECTED]> wrote:
hi igor,
lots of ideas there!  I can't speak for them all but let me confirm first that "cluster by X into 1 bucket" didn't work?  I would have thought that would have done it.
On Tue, Aug 20, 2013 at 2:29 PM, Igor Tatarinov <[EMAIL PROTECTED]> wrote:
What's the best way to enforce a single output file per partition?

FROM ...

I tried adding CLUSTER BY x,y,z at the end, thinking that sorting would force a single reducer per partition, but that didn't work. I still got multiple files per partition.
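
For context on why this may happen: CLUSTER BY x,y,z is shorthand for DISTRIBUTE BY x,y,z SORT BY x,y,z, so rows are hashed on all three columns and one partition's rows can land on many reducers. A possible alternative, not something proposed in this thread, is to distribute by the partition column only. The sketch below assumes hypothetical names (src, out_table, partition column dt, data columns a and b) and that x,y,z include non-partition columns.

```sql
-- Hypothetical tables and columns; adjust to the real schema.
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

FROM src
INSERT OVERWRITE TABLE out_table PARTITION (dt)
SELECT a, b, dt
-- All rows with the same dt go to the same reducer,
-- so each partition should be written by a single reducer.
DISTRIBUTE BY dt;
```

The trade-off is skew: a very large partition is still handled by one reducer, which may be slow with a few TB of data.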

Do I have to use a single reduce task? With a few TB of data that's probably not a good idea.

My current idea is to create a temp table with the same partitioning structure. Insert into that table first and then select * from that table into the output table. With combineinputformat=true that should work right?
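
A minimal sketch of that two-step idea, assuming hypothetical table names (staging_table and out_table, with the same columns and a partition column dt) and an illustrative split size. Whether it yields exactly one file per partition likely depends on how the partition size compares with the maximum split size.

```sql
-- Hypothetical tables; staging_table was filled by the first insert.
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
SET hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
-- Illustrative value (1 GB): a larger maximum split lets one map task read more small files.
SET mapred.max.split.size=1073741824;

-- SELECT * returns the partition column last, as dynamic partitioning expects.
INSERT OVERWRITE TABLE out_table PARTITION (dt)
SELECT * FROM staging_table;
```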

Or should I make Hive merge output files instead? (using hive.merge.mapfiles) Will that work with a partitioned table?
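
For reference, the merge-related settings could be set as below. The threshold values are illustrative assumptions, not recommendations, and merging does not by itself guarantee a single file per partition.

```sql
-- Merge small files produced by map-only jobs.
SET hive.merge.mapfiles=true;
-- Merge small files produced by map-reduce jobs.
SET hive.merge.mapredfiles=true;
-- Trigger a merge when the average output file size is below this many bytes (illustrative).
SET hive.merge.smallfiles.avgsize=256000000;
-- Target size in bytes of each merged file (illustrative).
SET hive.merge.size.per.task=1073741824;
```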
