Re: Please HELP: HIVE alter table add new partition to schema...
Mark Grover 2012-06-19, 15:45
It would be best if you wrote a script that iterates through each leaf-level partition of your existing table (logdate='A', source='B', datacenter='C', hostname='D')
and populates the corresponding leaf-level partitions in the new table (logdate='A', source='B', datacenter='C', hostname='D', loghour).
By leaf-level partition, I am referring to the partition column that appears last in the list of partition columns.
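To enumerate the existing leaf-level partitions that such a script would iterate over, you can use Hive's SHOW PARTITIONS statement (a sketch; the table name matches the example further down):

```sql
-- Lists every existing partition of the source table; each output line
-- looks like: logdate=A/source=B/datacenter=C/hostname=D
SHOW PARTITIONS test_table;
```

A driver script can parse each output line into the static partition spec used by the per-partition queries below.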
Technically, you can do dynamic partitioning in one query with something like:
FROM test_table src
INSERT OVERWRITE TABLE dest PARTITION(logdate, source, datacenter, hostname, loghour)
SELECT ts, exec_time, domain_id, domain_name, logdate, source, datacenter, hostname, loghour;
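For the dynamic-partition insert above to run, dynamic partitioning has to be enabled first, and the new table needs loghour as its last partition column. Here is a minimal sketch; the dest DDL and the hour(from_unixtime(ts)) derivation are assumptions on my part (they presume ts is a Unix timestamp in seconds), so adjust them to your actual data:

```sql
-- Enable dynamic partitioning; nonstrict mode is required because every
-- partition column in the INSERT below is dynamic.
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;

-- New table: same data columns, plus loghour as the last partition column.
CREATE TABLE dest (
  ts BIGINT,
  exec_time DOUBLE,
  domain_id BIGINT,
  domain_name STRING
)
PARTITIONED BY (logdate STRING, source STRING, datacenter STRING,
                hostname STRING, loghour STRING)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;

-- loghour is not a column of test_table, so it must be derived; here it
-- is computed from ts (assumed to be seconds since the epoch).
FROM test_table src
INSERT OVERWRITE TABLE dest
  PARTITION (logdate, source, datacenter, hostname, loghour)
SELECT src.ts, src.exec_time, src.domain_id, src.domain_name,
       src.logdate, src.source, src.datacenter, src.hostname,
       CAST(hour(from_unixtime(src.ts)) AS STRING) AS loghour;
```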
However, as far as I understand, you wouldn't be able to take advantage of the existing partitioning in your source table (test_table). If you would like to take advantage of existing partitioning, you would have to issue a series of queries like this:
FROM test_table src
INSERT OVERWRITE TABLE dest PARTITION(logdate='A', source='B', datacenter='C', hostname='D', loghour)
SELECT ts, exec_time, domain_id, domain_name, loghour
WHERE logdate='A' AND source='B' AND datacenter='C' AND hostname='D';
(Note that partition columns given static values in the PARTITION clause are left out of the SELECT list; only the dynamic loghour column remains.)
Just as a side note: whenever you think about your table partitioning, keep in mind not to overdo it. Creating too many partitions can lead to a lot of small files on HDFS, which reduces the performance of your Hadoop cluster. A couple of people have written about this small files problem.
In general, you would like your files to be at least the HDFS block size, most likely a small multiple of the block size. If you do find that you are running into the small files problem, there are other ways around it, such as bucketing.
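For example, bucketing spreads rows across a fixed number of files per partition instead of creating ever more partitions. A sketch (the table name and bucket count are made up for illustration):

```sql
-- Instead of partitioning by hostname (high cardinality), bucket by it:
-- each partition then holds a fixed number of files no matter how many
-- distinct hosts there are.
CREATE TABLE test_table_bucketed (
  ts BIGINT,
  exec_time DOUBLE,
  domain_id BIGINT,
  domain_name STRING,
  hostname STRING
)
PARTITIONED BY (logdate STRING, source STRING, datacenter STRING)
CLUSTERED BY (hostname) INTO 32 BUCKETS
STORED AS TEXTFILE;

-- Needed so that inserts actually produce one file per bucket.
SET hive.enforce.bucketing = true;
```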
----- Original Message -----
From: "Edward Capriolo" <[EMAIL PROTECTED]>
To: [EMAIL PROTECTED]
Sent: Tuesday, June 19, 2012 11:12:48 AM
Subject: Re: Please HELP: HIVE alter table add new partition to schema...
You cannot change the partition columns. I would use a dynamic
partition insert to select all the data from the original table into
the new table.
On 6/19/12, Tim Havens <[EMAIL PROTECTED]> wrote:
> So...I have a table that has thousands of files and billions of rows
> related to it.
> Let's make this a simple table:
> CREATE TABLE test_table (
>   ts BIGINT,
>   exec_time DOUBLE,
>   domain_id BIGINT,
>   domain_name STRING
> )
> PARTITIONED BY (logdate STRING, source STRING, datacenter STRING,
>   hostname STRING)
> ROW FORMAT DELIMITED
>   FIELDS TERMINATED BY '\t'
>   COLLECTION ITEMS TERMINATED BY '\001'
>   MAP KEYS TERMINATED BY '\002'
>   LINES TERMINATED BY '\n'
> STORED AS TEXTFILE;
> So...what I need to do is ADD a partition column to the PARTITIONED BY spec.
> The column I want to add is 'loghour STRING'.
> I can't seem to find any way to actually add a NEW
> PARTITION COLUMN to the 'PARTITIONED BY' spec without completely
> recreating and reloading the table.
> What's the correct way of adding a new column like 'loghour STRING'
> to the partition schema?
> I'm not trying to add an entry to the table DATA; I'm trying to add
> a completely new PARTITIONED BY column...