Hive, mail # user - Please HELP: HIVE alter table add new partition to schema...


Re: Please HELP: HIVE alter table add new partition to schema...
Mark Grover 2012-06-19, 15:45
I agree.
It would be best if you wrote a script that iterates through each leaf-level partition of your existing table (logdate='A', source='B', datacenter='C', hostname='D')
and populates the corresponding leaf-level partitions in the new table (logdate='A', source='B', datacenter='C', hostname='D', loghour).

By leaf-level partition, I am referring to the partition column that appears last in the list of partition columns.
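The iteration described above could be sketched roughly like this. This is a hypothetical helper, not code from the thread: in practice the partition tuples would come from `SHOW PARTITIONS test_table` rather than a hard-coded list, and the generated statements would be fed to the Hive CLI.

```python
# Hypothetical sketch of the per-partition copy script described above.
# Partition values would really come from `SHOW PARTITIONS test_table`;
# the hard-coded tuples below are only for illustration.
def build_insert(logdate, source, datacenter, hostname):
    """Build one INSERT that copies a single leaf-level partition of
    test_table into dest, letting Hive fill loghour dynamically."""
    return (
        "FROM test_table src "
        "INSERT OVERWRITE TABLE dest "
        f"PARTITION(logdate='{logdate}', source='{source}', "
        f"datacenter='{datacenter}', hostname='{hostname}', loghour) "
        "SELECT ts, exec_time, domain_id, domain_name, loghour "
        f"WHERE logdate='{logdate}' AND source='{source}' "
        f"AND datacenter='{datacenter}' AND hostname='{hostname}';"
    )

# Illustrative partition list (made up):
partitions = [("2012-06-18", "app", "dc1", "host01"),
              ("2012-06-18", "app", "dc1", "host02")]

for p in partitions:
    print(build_insert(*p))  # e.g. pipe the output into `hive -f`
```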

Technically, you can do dynamic partitioning in 1 query with something like:
FROM test_table src
INSERT OVERWRITE TABLE dest PARTITION(logdate, source, datacenter, hostname, loghour)
   SELECT ts, exec_time, domain_id, domain_name, logdate, source, datacenter, hostname, loghour;

However, as far as I understand, you wouldn't be able to take advantage of the existing partitioning in your source table (test_table). If you would like to take advantage of existing partitioning, you would have to issue a series of queries like this:
FROM test_table src
INSERT OVERWRITE TABLE dest PARTITION(logdate='A', source='B', datacenter='C', hostname='D', loghour)
   SELECT ts, exec_time, domain_id, domain_name, loghour WHERE logdate='A' AND source='B' AND datacenter='C' AND hostname='D';
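As a side note, dynamic partition inserts like the two queries above typically need dynamic partitioning enabled first. The settings below are the standard Hive properties; exact defaults and limits depend on your Hive version, so check yours before a large backfill:

```sql
-- Enable dynamic partitioning; "nonstrict" allows all partition
-- columns to be dynamic, as in the fully dynamic query above.
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
-- Safety limits worth reviewing before creating many partitions:
SET hive.exec.max.dynamic.partitions=10000;
SET hive.exec.max.dynamic.partitions.pernode=1000;
```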

Just as a side note, whenever you think about your table partitioning, keep in mind not to overdo it. Creating too many partitions can lead to a lot of small files on HDFS, which reduces the performance of your Hadoop cluster. A few people have written about this small-files problem:
http://arunxjacob.blogspot.ca/2011/04/hdfs-file-size-vs-allocation-other.html
http://www.cloudera.com/blog/2009/02/the-small-files-problem/
http://blog.rapleaf.com/dev/2008/11/20/give-me-liberty-or-give-me-death-but-dont-give-me-small-files/

In general, you would like your file sizes to be at least the HDFS block size, most likely a small multiple of the block size. If you do find that you are running into the small-files problem, there are ways to get around it, such as bucketing.
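The block-size rule of thumb above can be checked mechanically. Here is a rough sketch, assuming a 64 MB block size (check `dfs.block.size` on your cluster; yours may differ), that flags partitions whose average file size falls well under it. The sizes would really come from `hadoop fs -du` on each partition directory; the numbers below are made up:

```python
# Flag partitions whose files average under the HDFS block size.
# 64 MB is an assumed default; check dfs.block.size on your cluster.
BLOCK_SIZE = 64 * 1024 * 1024

def small_file_partitions(partition_sizes, threshold=BLOCK_SIZE):
    """partition_sizes: dict of partition path -> list of file sizes (bytes).
    Returns the partition paths whose average file size is below threshold."""
    flagged = []
    for path, sizes in partition_sizes.items():
        if sizes and sum(sizes) / len(sizes) < threshold:
            flagged.append(path)
    return flagged

# Illustrative (made-up) numbers: 50 files of ~2 MB vs. one 200 MB file.
stats = {
    "logdate=A/source=B/datacenter=C/hostname=D/loghour=00": [2_000_000] * 50,
    "logdate=A/source=B/datacenter=C/hostname=D/loghour=01": [200_000_000],
}
print(small_file_partitions(stats))
```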

Good luck!
Mark

----- Original Message -----
From: "Edward Capriolo" <[EMAIL PROTECTED]>
To: [EMAIL PROTECTED]
Sent: Tuesday, June 19, 2012 11:12:48 AM
Subject: Re: Please HELP: HIVE alter table add new partition to schema...

You cannot change the partition columns. I would use a dynamic
partition insert to select all the data from the original table into
the new table.

On 6/19/12, Tim Havens <[EMAIL PROTECTED]> wrote:
> So...I have a table that has thousands of files, and billions of rows
> related to it.
>
> Lets make this a simple table:
>
> CREATE TABLE test_table (
>     ts BIGINT,
>     exec_time DOUBLE,
>     domain_id BIGINT,
>     domain_name STRING
> )
> PARTITIONED BY (logdate STRING, source STRING, datacenter STRING,
> hostname STRING)
> ROW FORMAT DELIMITED
> FIELDS TERMINATED BY '\t'
> COLLECTION ITEMS TERMINATED BY '\001'
> MAP KEYS TERMINATED BY '\002'
> LINES TERMINATED BY '\n'
> STORED AS TextFile;
>
> So...what I need to do is ADD a partition to the PARTITIONED BY spec
> above....
>
> The partitioned by column I want to add is 'loghour STRING'.
>
> I can't seem to find any way to accomplish actually adding a NEW
> PARTITION COLUMN in the 'PARTITIONED BY' spec, without completely
> recreating and reloading the table.
>
> What's the correct way of adding a new column like 'loghour STRING'
> to the partition schema?
>
> I'm not trying to add an entry into the table DATA, I'm trying to add
> a completely new PARTITIONED BY Column...
>