
Hive >> mail # user >> single output file per partition?


Re: single output file per partition?
Actually, using a temp table doesn't work either. Apparently, a single
mapper can read from multiple partitions (and output multiple files). There
is no way to force a single mapper per partition.
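[Editor's note: a commonly suggested workaround for this problem, not verified in this thread, is to route rows through reducers with DISTRIBUTE BY on the partition columns. Since all rows with the same distribute key go to a single reducer, each partition should be written by exactly one task. The table and column names below (x, y, z) are placeholders taken from the original query; the SELECT/FROM/WHERE bodies are elided as in the original.]

```sql
-- Sketch: send each partition's rows to one reducer so that each
-- partition ends up with a single output file. Assumes dynamic
-- partitioning is enabled.
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

INSERT OVERWRITE TABLE <table>
PARTITION (x, y, z)
SELECT ...
FROM ...
WHERE ...
DISTRIBUTE BY x, y, z;
```

Note that this serializes each partition through one reducer, so very large partitions will be slow to write; it trades parallelism within a partition for the one-file-per-partition guarantee.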
On Wed, Aug 21, 2013 at 11:12 AM, Igor Tatarinov <[EMAIL PROTECTED]> wrote:

> Using a single bucket per partition seems to create a single reducer which
> is too slow.
> I've tried enforcing the small-files merge, but that didn't work. I still
> got multiple output files.
>
> Creating a temp table and then "combining" the multiple files into one
> using a simple select * is the only option that seems to work. It's odd
> that I have to create the temp table but I don't see a workaround.
>
>
> On Wed, Aug 21, 2013 at 8:51 AM, Stephen Sprague <[EMAIL PROTECTED]>wrote:
>
>> hi igor,
>> lots of ideas there!  I can't speak for them all but let me confirm first
>> that "cluster by X into 1 bucket" didn't work?  I would have thought that
>> would have done it.
>>
>>
>>
>>
>> On Tue, Aug 20, 2013 at 2:29 PM, Igor Tatarinov <[EMAIL PROTECTED]> wrote:
>>
>>> What's the best way to enforce a single output file per partition?
>>>
>>> INSERT OVERWRITE TABLE <table>
>>> PARTITION (x,y,z)
>>> SELECT ...
>>> FROM ...
>>> WHERE ...
>>>
>>> I tried adding CLUSTER BY x,y,z at the end, thinking that sorting would
>>> force a single reducer per partition, but that didn't work. I still got
>>> multiple files per partition.
>>>
>>> Do I have to use a single reduce task? With a few TB of data that's
>>> probably not a good idea.
>>>
>>> My current idea is to create a temp table with the same partitioning
>>> structure, insert into that table first, and then select * from that table
>>> into the output table. With combineinputformat=true, that should work, right?
>>>
>>> Or should I make Hive merge output files instead? (using hive.merge.mapfiles)
>>> Will that work with a partitioned table?
>>>
>>> Thanks!
>>> igor
>>>
>>
>>
>
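
[Editor's note: the merge settings raised at the end of the thread can be sketched as below. These are real Hive configuration properties, but the size values shown are illustrative, not the defaults, and behavior varies by Hive version; whether the merge step respects dynamic partitions was exactly what the thread left unresolved.]

```sql
-- Sketch of the small-files merge knobs mentioned above.
SET hive.merge.mapfiles=true;        -- merge outputs of map-only jobs
SET hive.merge.mapredfiles=true;     -- merge outputs of map-reduce jobs
SET hive.merge.size.per.task=256000000;       -- target size of merged files (bytes)
SET hive.merge.smallfiles.avgsize=16000000;   -- trigger merge when avg file size is below this
```

When the average output file size falls below `hive.merge.smallfiles.avgsize`, Hive launches an extra merge job after the insert, concatenating the small files up to roughly `hive.merge.size.per.task` each.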