Home | About | Sematext search-lucene.com search-hadoop.com
 Search Hadoop and all its subprojects:

Switch to Plain View
Hive >> mail # user >> Automating the partition creation process


+
Sadananda Hegde 2013-01-29, 04:05
+
Mark Grover 2013-01-29, 04:47
+
Dean Wampler 2013-01-29, 16:37
Copy link to this message
-
Re: Automating the partition creation process
Thanks Dean,

But the M/R job creates the sub-folder based on the data and hence could
create the sub-folder for any of the previous days (and for multiple days
too). So I canot limit to today's folder only. I need to get the
year/month/day values from the HDFS location instead of from the system
date and loop through HDFS partitions and add missing partitions to the
metastore (probably it's similar to what MSCK does). Any suggestions?

Thanks,
Sadu
On Tue, Jan 29, 2013 at 10:37 AM, Dean Wampler <
[EMAIL PROTECTED]> wrote:

> This is very easy to do with a shell script and you can have cron run it
> nightly, say just after midnight so the year, month, and day calculation
> below return the correct values:
>
> #!/bin/bash
>
> year=$(year +%Y)
> month=$(year +%m)
> day=$(year +%d)
>
> hive -e "use mydb; ALTER TABLE foo ADD IF NOT EXISTS PARTITION
> (year='$year', month='$month', day='$day') location
> '/path/to/mydb.db/foo/year=$year/month=$month/day=$day';"
>
> (The last line wrapped)
>
> If the year, month and day are integers, then omit the single quotes,
> i.e., (year=$year, month=$month, day=$day)
>
> Including IF NOT EXISTS let's you run the script multiple times per day,
> if you're paranoid... ;)
>
> By the way, I used "use mydb; ALTER TABLE foo..." instead of "ALTER TABLE
> mydb.foo..." because hive (v0.10.0) didn't like the latter.
>
> dean
>
> On Mon, Jan 28, 2013 at 10:47 PM, Mark Grover <[EMAIL PROTECTED]
> > wrote:
>
>> Sadananda,
>> See if this helps:
>> https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-Recoverpartitions
>>
>>
>> On Mon, Jan 28, 2013 at 8:05 PM, Sadananda Hegde <[EMAIL PROTECTED]>wrote:
>>
>>> Hello,
>>>
>>> My hive table is partitioned by year, month and day. I have defined it
>>> as external table. The M/R job correctly loads the files into the daily
>>> subfolders. The hdfs files will be loaded to
>>> <hivetable>/year=yyyy/month=mm/day=dd/ folders by the scheduled M/R jobs.
>>> The M/R job has some business logic in determining the values for year,
>>> month and day; so one run might create / load files into multiple sub
>>> -folders (multiple days). I am able to query the tables after adding
>>> partitions using ALTER TABLE ADD PARTITION statement. But how do I automate
>>> the partition creation step? Basically this script needs to identify the
>>> subfolders created by the M/R job and create corresponding ALTER TABLE ADD
>>> PARTITION statements.
>>>
>>> For example, say the M/R job loads files into the following 3 sub-folders
>>>
>>> /user/hive/warehouse/sales/year=2013/month=1/day=21
>>> /user/hive/warehouse/sales/year=2013/month=1/day=22
>>> /user/hive/warehouse/sales/year=2013/month=1/day=23
>>>
>>> Then it should create 3 alter table statements
>>>
>>> ALTER TABLE sales ADD PARTITION (year=2013, month=1, day=21);
>>>  ALTER TABLE sales ADD PARTITION (year=2013, month=1, day=22);
>>> ALTER TABLE sales ADD PARTITION (year=2013, month=1, day=23);
>>>
>>> I thought of changing M/R jobs to load all files into same folder,
>>> then first load the files into non-partitioned table and then to load the
>>> partitioned table from non-partitioned table (using dynamic partition); but
>>> would prefer to avoid that extra step if possible (esp. since data is
>>> already in the correct sub-folders).
>>>
>>> Any help would greately be appreciated.
>>>
>>> Regards,
>>> Sadu
>>>
>>>
>>>
>>
>>
>
>
> --
> *Dean Wampler, Ph.D.*
> thinkbiganalytics.com
> +1-312-339-1330
>
>
+
Sadananda Hegde 2013-01-30, 01:09
+
Mark Grover 2013-01-30, 01:17
+
Edward Capriolo 2013-01-30, 01:21
+
Sadananda Hegde 2013-01-30, 01:49
+
Dean Wampler 2013-01-30, 02:05
+
abhishek 2013-01-29, 04:47