Yes, I very much agree with you on those lines. Using the basic approach can run into memory issues with large datasets. I resolved some of those by using the DISTRIBUTE BY clause and similar tweaks. In short, a small workaround in your Hive queries can help you out in some cases.
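To illustrate the DISTRIBUTE BY idea, here is a minimal sketch, not necessarily the exact query I used. It assumes the sales_temp and sales tables and the period_key partition column discussed further down this thread; item_id and amount are made-up column names.

```sql
-- Enable dynamic partitioning for this session.
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- item_id and amount are hypothetical columns; period_key must come last
-- in the SELECT because it feeds the dynamic partition.
INSERT OVERWRITE TABLE sales PARTITION (period_key)
SELECT item_id, amount, period_key
FROM sales_temp
DISTRIBUTE BY period_key;  -- rows for one partition go to the same reducer
```

With DISTRIBUTE BY, all rows for a given period_key are sent to the same reducer, so each task writes only a few partitions instead of holding a writer open for every partition it happens to see.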
Bejoy K S
From: hadoopman <[EMAIL PROTECTED]>
Date: Sun, 14 Aug 2011 08:57:12
To: <[EMAIL PROTECTED]>
Reply-To: [EMAIL PROTECTED]
Subject: Re: how to load data to partitioned table
Something else I've noticed when loading LOTS of historical data: if
you can, load one month of data at a time, and only that month. I've
been able to load several years of data (depending on the data) in a
single load. However, there have been times when loading a large
dataset ran into memory issues during the reduce phase (usually during
shuffle/sort), anything from out-of-memory to stack overflow messages
(I've compiled a list of the more
Then I noticed that loading data from, say, a single month went
quickly and without the memory headaches during the reduce.
Something to keep in mind and it works great!
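As a rough sketch of what a one-month load can look like, assuming the sales_temp staging table and sales target table from the quoted thread below, and assuming period_key holds dates as 'YYYY-MM-DD' strings (the thread doesn't say what format it uses, and item_id and amount are made-up columns):

```sql
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- Load only July 2011; repeat with a different date range for each month.
-- The string comparison works only because of the assumed 'YYYY-MM-DD' format.
INSERT OVERWRITE TABLE sales PARTITION (period_key)
SELECT item_id, amount, period_key
FROM sales_temp
WHERE period_key >= '2011-07-01'
  AND period_key <  '2011-08-01';
```

Each run then creates at most a month's worth of partitions, which keeps the shuffle/sort small.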
On 08/12/2011 07:58 AM, [EMAIL PROTECTED] wrote:
> Hi Daniel
> Looking at your requirement, the most hassle-free approach to loading
> data into a partitioned Hive table from any input file would be:
> 1. Load the data into a non-partitioned table that has a similar
> structure to the target table.
> 2. Populate the target table with the data from the non-partitioned one
> using Hive dynamic partitions.
> With dynamic partitions you don't need to manually identify the data
> partitions and distribute the data accordingly.
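> The two steps above can be sketched roughly as follows. This is only an
> illustration: the file path and the item_id and amount columns are
> made up, and only sales, sales_temp and period_key come from this thread.
>
> ```sql
> -- Step 1: staging table with period_key as an ordinary column.
> CREATE TABLE sales_temp (
>   item_id    STRING,   -- hypothetical column
>   amount     DOUBLE,   -- hypothetical column
>   period_key STRING
> )
> ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
> STORED AS TEXTFILE;
>
> -- Hypothetical path to the CSV file.
> LOAD DATA LOCAL INPATH '/tmp/sales.csv' INTO TABLE sales_temp;
>
> -- Target table: period_key moves into the PARTITIONED BY clause.
> CREATE TABLE sales (
>   item_id STRING,
>   amount  DOUBLE
> )
> PARTITIONED BY (period_key STRING);
>
> -- Step 2: let Hive derive the partitions from the data.
> SET hive.exec.dynamic.partition=true;
> SET hive.exec.dynamic.partition.mode=nonstrict;
>
> INSERT OVERWRITE TABLE sales PARTITION (period_key)
> SELECT item_id, amount, period_key   -- partition column goes last
> FROM sales_temp;
> ```
>
> The nonstrict mode is needed here because no static partition value is
> given in the PARTITION clause.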
> A similar implementation is described in the blog post
> Hope it helps
> Bejoy K S
> *From: * Vikas Srivastava <[EMAIL PROTECTED]>
> *Date: *Fri, 12 Aug 2011 17:31:28 +0530
> *To: *<[EMAIL PROTECTED]>
> *ReplyTo: * [EMAIL PROTECTED]
> *Subject: *Re: how to load data to partitioned table
> Hey,
> Simply run a query like this:
> FROM sales_temp INSERT OVERWRITE TABLE sales partition(period_key)
> SELECT *
> Vikas Srivastava
> 2011/8/12 Daniel,Wu <[EMAIL PROTECTED]>
> Suppose the table is partitioned by period_key, and the csv file
> also has a column named period_key. The csv file contains
> multiple days of data. How can we load it into the table?
> I thought of a workaround: first load the data into a
> non-partitioned table, and then insert the data from the
> non-partitioned table into the partitioned table.
> hive> INSERT OVERWRITE TABLE sales SELECT * FROM sales_temp;
> FAILED: Error in semantic analysis: need to specify partition
> columns because the destination table is partitioned.
> However, that doesn't work either. Please help.