|
Sadananda Hegde
2013-01-29, 04:05
Mark Grover
2013-01-29, 04:47
abhishek
2013-01-29, 04:47
Dean Wampler
2013-01-29, 16:37
Sadananda Hegde
2013-01-30, 01:09
Mark Grover
2013-01-30, 01:17
Edward Capriolo
2013-01-30, 01:21
Sadananda Hegde
2013-01-30, 01:44
Sadananda Hegde
2013-01-30, 01:49
Dean Wampler
2013-01-30, 02:05
|
-
Automating the partition creation processSadananda Hegde 2013-01-29, 04:05
Hello,
My hive table is partitioned by year, month and day. I have defined it as external table. The M/R job correctly loads the files into the daily subfolders. The hdfs files will be loaded to <hivetable>/year=yyyy/month=mm/day=dd/ folders by the scheduled M/R jobs. The M/R job has some business logic in determining the values for year, month and day; so one run might create / load files into multiple sub -folders (multiple days). I am able to query the tables after adding partitions using ALTER TABLE ADD PARTITION statement. But how do I automate the partition creation step? Basically this script needs to identify the subfolders created by the M/R job and create corresponding ALTER TABLE ADD PARTITION statements. For example, say the M/R job loads files into the following 3 sub-folders /user/hive/warehouse/sales/year=2013/month=1/day=21 /user/hive/warehouse/sales/year=2013/month=1/day=22 /user/hive/warehouse/sales/year=2013/month=1/day=23 Then it should create 3 alter table statements ALTER TABLE sales ADD PARTITION (year=2013, month=1, day=21); ALTER TABLE sales ADD PARTITION (year=2013, month=1, day=22); ALTER TABLE sales ADD PARTITION (year=2013, month=1, day=23); I thought of changing M/R jobs to load all files into same folder, then first load the files into non-partitioned table and then to load the partitioned table from non-partitioned table (using dynamic partition); but would prefer to avoid that extra step if possible (esp. since data is already in the correct sub-folders). Any help would greately be appreciated. Regards, Sadu
-
Re: Automating the partition creation processMark Grover 2013-01-29, 04:47
Sadananda,
See if this helps: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-Recoverpartitions On Mon, Jan 28, 2013 at 8:05 PM, Sadananda Hegde <[EMAIL PROTECTED]>wrote: > Hello, > > My hive table is partitioned by year, month and day. I have defined it as > external table. The M/R job correctly loads the files into the daily > subfolders. The hdfs files will be loaded to > <hivetable>/year=yyyy/month=mm/day=dd/ folders by the scheduled M/R jobs. > The M/R job has some business logic in determining the values for year, > month and day; so one run might create / load files into multiple sub > -folders (multiple days). I am able to query the tables after adding > partitions using ALTER TABLE ADD PARTITION statement. But how do I automate > the partition creation step? Basically this script needs to identify the > subfolders created by the M/R job and create corresponding ALTER TABLE ADD > PARTITION statements. > > For example, say the M/R job loads files into the following 3 sub-folders > > /user/hive/warehouse/sales/year=2013/month=1/day=21 > /user/hive/warehouse/sales/year=2013/month=1/day=22 > /user/hive/warehouse/sales/year=2013/month=1/day=23 > > Then it should create 3 alter table statements > > ALTER TABLE sales ADD PARTITION (year=2013, month=1, day=21); > ALTER TABLE sales ADD PARTITION (year=2013, month=1, day=22); > ALTER TABLE sales ADD PARTITION (year=2013, month=1, day=23); > > I thought of changing M/R jobs to load all files into same folder, > then first load the files into non-partitioned table and then to load the > partitioned table from non-partitioned table (using dynamic partition); but > would prefer to avoid that extra step if possible (esp. since data is > already in the correct sub-folders). > > Any help would greately be appreciated. > > Regards, > Sadu > > >
-
Re: Automating the partition creation processabhishek 2013-01-29, 04:47
Sadananda,
Look at Oozie workflow. Regards Abhishek On Jan 28, 2013, at 11:05 PM, Sadananda Hegde <[EMAIL PROTECTED]> wrote: > Hello, > > My hive table is partitioned by year, month and day. I have defined it as external table. The M/R job correctly loads the files into the daily subfolders. The hdfs files will be loaded to <hivetable>/year=yyyy/month=mm/day=dd/ folders by the scheduled M/R jobs. The M/R job has some business logic in determining the values for year, month and day; so one run might create / load files into multiple sub -folders (multiple days). I am able to query the tables after adding partitions using ALTER TABLE ADD PARTITION statement. But how do I automate the partition creation step? Basically this script needs to identify the subfolders created by the M/R job and create corresponding ALTER TABLE ADD PARTITION statements. > > For example, say the M/R job loads files into the following 3 sub-folders > > /user/hive/warehouse/sales/year=2013/month=1/day=21 > /user/hive/warehouse/sales/year=2013/month=1/day=22 > /user/hive/warehouse/sales/year=2013/month=1/day=23 > > Then it should create 3 alter table statements > > ALTER TABLE sales ADD PARTITION (year=2013, month=1, day=21); > ALTER TABLE sales ADD PARTITION (year=2013, month=1, day=22); > ALTER TABLE sales ADD PARTITION (year=2013, month=1, day=23); > > I thought of changing M/R jobs to load all files into same folder, then first load the files into non-partitioned table and then to load the partitioned table from non-partitioned table (using dynamic partition); but would prefer to avoid that extra step if possible (esp. since data is already in the correct sub-folders). > > Any help would greately be appreciated. > > Regards, > Sadu > >
-
Re: Automating the partition creation processDean Wampler 2013-01-29, 16:37
This is very easy to do with a shell script and you can have cron run it
nightly, say just after midnight so the year, month, and day calculation below return the correct values: #!/bin/bash year=$(year +%Y) month=$(year +%m) day=$(year +%d) hive -e "use mydb; ALTER TABLE foo ADD IF NOT EXISTS PARTITION (year='$year', month='$month', day='$day') location '/path/to/mydb.db/foo/year=$year/month=$month/day=$day';" (The last line wrapped) If the year, month and day are integers, then omit the single quotes, i.e., (year=$year, month=$month, day=$day) Including IF NOT EXISTS let's you run the script multiple times per day, if you're paranoid... ;) By the way, I used "use mydb; ALTER TABLE foo..." instead of "ALTER TABLE mydb.foo..." because hive (v0.10.0) didn't like the latter. dean On Mon, Jan 28, 2013 at 10:47 PM, Mark Grover <[EMAIL PROTECTED]>wrote: > Sadananda, > See if this helps: > https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-Recoverpartitions > > > On Mon, Jan 28, 2013 at 8:05 PM, Sadananda Hegde <[EMAIL PROTECTED]>wrote: > >> Hello, >> >> My hive table is partitioned by year, month and day. I have defined it as >> external table. The M/R job correctly loads the files into the daily >> subfolders. The hdfs files will be loaded to >> <hivetable>/year=yyyy/month=mm/day=dd/ folders by the scheduled M/R jobs. >> The M/R job has some business logic in determining the values for year, >> month and day; so one run might create / load files into multiple sub >> -folders (multiple days). I am able to query the tables after adding >> partitions using ALTER TABLE ADD PARTITION statement. But how do I automate >> the partition creation step? Basically this script needs to identify the >> subfolders created by the M/R job and create corresponding ALTER TABLE ADD >> PARTITION statements. >> >> For example, say the M/R job loads files into the following 3 sub-folders >> >> /user/hive/warehouse/sales/year=2013/month=1/day=21 >> /user/hive/warehouse/sales/year=2013/month=1/day=22 >> /user/hive/warehouse/sales/year=2013/month=1/day=23 >> >> Then it should create 3 alter table statements >> >> ALTER TABLE sales ADD PARTITION (year=2013, month=1, day=21); >> ALTER TABLE sales ADD PARTITION (year=2013, month=1, day=22); >> ALTER TABLE sales ADD PARTITION (year=2013, month=1, day=23); >> >> I thought of changing M/R jobs to load all files into same folder, >> then first load the files into non-partitioned table and then to load the >> partitioned table from non-partitioned table (using dynamic partition); but >> would prefer to avoid that extra step if possible (esp. since data is >> already in the correct sub-folders). >> >> Any help would greately be appreciated. >> >> Regards, >> Sadu >> >> >> > > -- *Dean Wampler, Ph.D.* thinkbiganalytics.com +1-312-339-1330
-
Re: Automating the partition creation processSadananda Hegde 2013-01-30, 01:09
Thanks Mark,
Recover partition feature will satisfy my needs; but MSCK Repair Partition < tablename> option is not working for me. It does not give any error; but does not add any partitions either. It looks like it adds partitions only when the sub-folder is empty; but not when the sub-folder has the data files. I see a fix to this issue here. https://issues.apache.org/jira/browse/HIVE-3231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel But probably it's not commited yet, since the final result says 'ABORTED". Thanks, Sadu On Mon, Jan 28, 2013 at 10:47 PM, Mark Grover <[EMAIL PROTECTED]>wrote: > Sadananda, > See if this helps: > https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-Recoverpartitions > > > On Mon, Jan 28, 2013 at 8:05 PM, Sadananda Hegde <[EMAIL PROTECTED]>wrote: > >> Hello, >> >> My hive table is partitioned by year, month and day. I have defined it as >> external table. The M/R job correctly loads the files into the daily >> subfolders. The hdfs files will be loaded to >> <hivetable>/year=yyyy/month=mm/day=dd/ folders by the scheduled M/R jobs. >> The M/R job has some business logic in determining the values for year, >> month and day; so one run might create / load files into multiple sub >> -folders (multiple days). I am able to query the tables after adding >> partitions using ALTER TABLE ADD PARTITION statement. But how do I automate >> the partition creation step? Basically this script needs to identify the >> subfolders created by the M/R job and create corresponding ALTER TABLE ADD >> PARTITION statements. >> >> For example, say the M/R job loads files into the following 3 sub-folders >> >> /user/hive/warehouse/sales/year=2013/month=1/day=21 >> /user/hive/warehouse/sales/year=2013/month=1/day=22 >> /user/hive/warehouse/sales/year=2013/month=1/day=23 >> >> Then it should create 3 alter table statements >> >> ALTER TABLE sales ADD PARTITION (year=2013, month=1, day=21); >> ALTER TABLE sales ADD PARTITION (year=2013, month=1, day=22); >> ALTER TABLE sales ADD PARTITION (year=2013, month=1, day=23); >> >> I thought of changing M/R jobs to load all files into same folder, >> then first load the files into non-partitioned table and then to load the >> partitioned table from non-partitioned table (using dynamic partition); but >> would prefer to avoid that extra step if possible (esp. since data is >> already in the correct sub-folders). >> >> Any help would greately be appreciated. >> >> Regards, >> Sadu >> >> >> > >
-
Re: Automating the partition creation processMark Grover 2013-01-30, 01:17
Hi Sadananda,
Sorry to hear that. It got committed, don't worry about the "ABORTED". Here is the commit on the trunk: https://github.com/apache/hive/commit/523f47c3b6e7cb7b6b7b7801c66406e116af6dbc However, there is no Apache Hive release with that patch in it. You have two options: 1. Download the patch, rebuild hive and use it 2. Find a hacky way to recover your partitions when they are empty and populate them later. Sorry for the inconvenience. Mark On Tue, Jan 29, 2013 at 5:09 PM, Sadananda Hegde <[EMAIL PROTECTED]>wrote: > Thanks Mark, > > Recover partition feature will satisfy my needs; but MSCK Repair Partition > < tablename> option is not working for me. It does not give any error; but > does not add any partitions either. It looks like it adds partitions only > when the sub-folder is empty; but not when the sub-folder has the data > files. I see a fix to this issue here. > > https://issues.apache.org/jira/browse/HIVE-3231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel > > But probably it's not commited yet, since the final result says 'ABORTED". > > Thanks, > Sadu > > On Mon, Jan 28, 2013 at 10:47 PM, Mark Grover <[EMAIL PROTECTED] > > wrote: > >> Sadananda, >> See if this helps: >> https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-Recoverpartitions >> >> >> On Mon, Jan 28, 2013 at 8:05 PM, Sadananda Hegde <[EMAIL PROTECTED]>wrote: >> >>> Hello, >>> >>> My hive table is partitioned by year, month and day. I have defined it >>> as external table. The M/R job correctly loads the files into the daily >>> subfolders. The hdfs files will be loaded to >>> <hivetable>/year=yyyy/month=mm/day=dd/ folders by the scheduled M/R jobs. >>> The M/R job has some business logic in determining the values for year, >>> month and day; so one run might create / load files into multiple sub >>> -folders (multiple days). I am able to query the tables after adding >>> partitions using ALTER TABLE ADD PARTITION statement. But how do I automate >>> the partition creation step? Basically this script needs to identify the >>> subfolders created by the M/R job and create corresponding ALTER TABLE ADD >>> PARTITION statements. >>> >>> For example, say the M/R job loads files into the following 3 sub-folders >>> >>> /user/hive/warehouse/sales/year=2013/month=1/day=21 >>> /user/hive/warehouse/sales/year=2013/month=1/day=22 >>> /user/hive/warehouse/sales/year=2013/month=1/day=23 >>> >>> Then it should create 3 alter table statements >>> >>> ALTER TABLE sales ADD PARTITION (year=2013, month=1, day=21); >>> ALTER TABLE sales ADD PARTITION (year=2013, month=1, day=22); >>> ALTER TABLE sales ADD PARTITION (year=2013, month=1, day=23); >>> >>> I thought of changing M/R jobs to load all files into same folder, >>> then first load the files into non-partitioned table and then to load the >>> partitioned table from non-partitioned table (using dynamic partition); but >>> would prefer to avoid that extra step if possible (esp. since data is >>> already in the correct sub-folders). >>> >>> Any help would greately be appreciated. >>> >>> Regards, >>> Sadu >>> >>> >>> >> >> >
-
Re: Automating the partition creation processEdward Capriolo 2013-01-30, 01:21
You can also just create all your partitions ahead of time. They will not
do any harm if empty. (unless you have an older version and hit this... http://issues.apache.org/jira/browse/HIVE-1007 ) On Tue, Jan 29, 2013 at 8:17 PM, Mark Grover <[EMAIL PROTECTED]>wrote: > Hi Sadananda, > Sorry to hear that. > > It got committed, don't worry about the "ABORTED". Here is the commit on > the trunk: > > https://github.com/apache/hive/commit/523f47c3b6e7cb7b6b7b7801c66406e116af6dbc > > However, there is no Apache Hive release with that patch in it. > > You have two options: > 1. Download the patch, rebuild hive and use it > 2. Find a hacky way to recover your partitions when they are empty and > populate them later. > > Sorry for the inconvenience. > > Mark > > On Tue, Jan 29, 2013 at 5:09 PM, Sadananda Hegde <[EMAIL PROTECTED]>wrote: > >> Thanks Mark, >> >> Recover partition feature will satisfy my needs; but MSCK Repair >> Partition < tablename> option is not working for me. It does not give any >> error; but does not add any partitions either. It looks like it adds >> partitions only when the sub-folder is empty; but not >> when the sub-folder has the data files. I see a fix to this issue here. >> >> https://issues.apache.org/jira/browse/HIVE-3231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel >> >> But probably it's not commited yet, since the final result says >> 'ABORTED". >> >> Thanks, >> Sadu >> >> On Mon, Jan 28, 2013 at 10:47 PM, Mark Grover < >> [EMAIL PROTECTED]> wrote: >> >>> Sadananda, >>> See if this helps: >>> https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-Recoverpartitions >>> >>> >>> On Mon, Jan 28, 2013 at 8:05 PM, Sadananda Hegde <[EMAIL PROTECTED]>wrote: >>> >>>> Hello, >>>> >>>> My hive table is partitioned by year, month and day. I have defined it >>>> as external table. The M/R job correctly loads the files into the daily >>>> subfolders. The hdfs files will be loaded to >>>> <hivetable>/year=yyyy/month=mm/day=dd/ folders by the scheduled M/R jobs. >>>> The M/R job has some business logic in determining the values for year, >>>> month and day; so one run might create / load files into multiple sub >>>> -folders (multiple days). I am able to query the tables after adding >>>> partitions using ALTER TABLE ADD PARTITION statement. But how do I automate >>>> the partition creation step? Basically this script needs to identify the >>>> subfolders created by the M/R job and create corresponding ALTER TABLE ADD >>>> PARTITION statements. >>>> >>>> For example, say the M/R job loads files into the following 3 >>>> sub-folders >>>> >>>> /user/hive/warehouse/sales/year=2013/month=1/day=21 >>>> /user/hive/warehouse/sales/year=2013/month=1/day=22 >>>> /user/hive/warehouse/sales/year=2013/month=1/day=23 >>>> >>>> Then it should create 3 alter table statements >>>> >>>> ALTER TABLE sales ADD PARTITION (year=2013, month=1, day=21); >>>> ALTER TABLE sales ADD PARTITION (year=2013, month=1, day=22); >>>> ALTER TABLE sales ADD PARTITION (year=2013, month=1, day=23); >>>> >>>> I thought of changing M/R jobs to load all files into same folder, >>>> then first load the files into non-partitioned table and then to load the >>>> partitioned table from non-partitioned table (using dynamic partition); but >>>> would prefer to avoid that extra step if possible (esp. since data is >>>> already in the correct sub-folders). >>>> >>>> Any help would greately be appreciated. >>>> >>>> Regards, >>>> Sadu >>>> >>>> >>>> >>> >>> >> >
-
Re: Automating the partition creation processSadananda Hegde 2013-01-30, 01:44
Thanks Dean,
But the M/R job creates the sub-folder based on the data and hence could create the sub-folder for any of the previous days (and for multiple days too). So I canot limit to today's folder only. I need to get the year/month/day values from the HDFS location instead of from the system date and loop through HDFS partitions and add missing partitions to the metastore (probably it's similar to what MSCK does). Any suggestions? Thanks, Sadu On Tue, Jan 29, 2013 at 10:37 AM, Dean Wampler < [EMAIL PROTECTED]> wrote: > This is very easy to do with a shell script and you can have cron run it > nightly, say just after midnight so the year, month, and day calculation > below return the correct values: > > #!/bin/bash > > year=$(year +%Y) > month=$(year +%m) > day=$(year +%d) > > hive -e "use mydb; ALTER TABLE foo ADD IF NOT EXISTS PARTITION > (year='$year', month='$month', day='$day') location > '/path/to/mydb.db/foo/year=$year/month=$month/day=$day';" > > (The last line wrapped) > > If the year, month and day are integers, then omit the single quotes, > i.e., (year=$year, month=$month, day=$day) > > Including IF NOT EXISTS let's you run the script multiple times per day, > if you're paranoid... ;) > > By the way, I used "use mydb; ALTER TABLE foo..." instead of "ALTER TABLE > mydb.foo..." because hive (v0.10.0) didn't like the latter. > > dean > > On Mon, Jan 28, 2013 at 10:47 PM, Mark Grover <[EMAIL PROTECTED] > > wrote: > >> Sadananda, >> See if this helps: >> https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-Recoverpartitions >> >> >> On Mon, Jan 28, 2013 at 8:05 PM, Sadananda Hegde <[EMAIL PROTECTED]>wrote: >> >>> Hello, >>> >>> My hive table is partitioned by year, month and day. I have defined it >>> as external table. The M/R job correctly loads the files into the daily >>> subfolders. The hdfs files will be loaded to >>> <hivetable>/year=yyyy/month=mm/day=dd/ folders by the scheduled M/R jobs. >>> The M/R job has some business logic in determining the values for year, >>> month and day; so one run might create / load files into multiple sub >>> -folders (multiple days). I am able to query the tables after adding >>> partitions using ALTER TABLE ADD PARTITION statement. But how do I automate >>> the partition creation step? Basically this script needs to identify the >>> subfolders created by the M/R job and create corresponding ALTER TABLE ADD >>> PARTITION statements. >>> >>> For example, say the M/R job loads files into the following 3 sub-folders >>> >>> /user/hive/warehouse/sales/year=2013/month=1/day=21 >>> /user/hive/warehouse/sales/year=2013/month=1/day=22 >>> /user/hive/warehouse/sales/year=2013/month=1/day=23 >>> >>> Then it should create 3 alter table statements >>> >>> ALTER TABLE sales ADD PARTITION (year=2013, month=1, day=21); >>> ALTER TABLE sales ADD PARTITION (year=2013, month=1, day=22); >>> ALTER TABLE sales ADD PARTITION (year=2013, month=1, day=23); >>> >>> I thought of changing M/R jobs to load all files into same folder, >>> then first load the files into non-partitioned table and then to load the >>> partitioned table from non-partitioned table (using dynamic partition); but >>> would prefer to avoid that extra step if possible (esp. since data is >>> already in the correct sub-folders). >>> >>> Any help would greately be appreciated. >>> >>> Regards, >>> Sadu >>> >>> >>> >> >> > > > -- > *Dean Wampler, Ph.D.* > thinkbiganalytics.com > +1-312-339-1330 > >
-
Re: Automating the partition creation processSadananda Hegde 2013-01-30, 01:49
Thanks, Edward. I can probably create all previous days partitions ahead
of time and then use Dean's logic to create new partitions on a daily basis. I will probably end up having few empty partitions; need to make sure it does not cause any confusions. Thanks, Sadu On Tue, Jan 29, 2013 at 7:21 PM, Edward Capriolo <[EMAIL PROTECTED]>wrote: > You can also just create all your partitions ahead of time. They will not > do any harm if empty. (unless you have an older version and hit this... > http://issues.apache.org/jira/browse/HIVE-1007 ) > > > On Tue, Jan 29, 2013 at 8:17 PM, Mark Grover <[EMAIL PROTECTED]>wrote: > >> Hi Sadananda, >> Sorry to hear that. >> >> It got committed, don't worry about the "ABORTED". Here is the commit on >> the trunk: >> >> https://github.com/apache/hive/commit/523f47c3b6e7cb7b6b7b7801c66406e116af6dbc >> >> However, there is no Apache Hive release with that patch in it. >> >> You have two options: >> 1. Download the patch, rebuild hive and use it >> 2. Find a hacky way to recover your partitions when they are empty and >> populate them later. >> >> Sorry for the inconvenience. >> >> Mark >> >> On Tue, Jan 29, 2013 at 5:09 PM, Sadananda Hegde <[EMAIL PROTECTED]>wrote: >> >>> Thanks Mark, >>> >>> Recover partition feature will satisfy my needs; but MSCK Repair >>> Partition < tablename> option is not working for me. It does not give any >>> error; but does not add any partitions either. It looks like it adds >>> partitions only when the sub-folder is empty; but not >>> when the sub-folder has the data files. I see a fix to this issue here. >>> >>> https://issues.apache.org/jira/browse/HIVE-3231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel >>> >>> But probably it's not commited yet, since the final result says >>> 'ABORTED". >>> >>> Thanks, >>> Sadu >>> >>> On Mon, Jan 28, 2013 at 10:47 PM, Mark Grover < >>> [EMAIL PROTECTED]> wrote: >>> >>>> Sadananda, >>>> See if this helps: >>>> https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-Recoverpartitions >>>> >>>> >>>> On Mon, Jan 28, 2013 at 8:05 PM, Sadananda Hegde <[EMAIL PROTECTED]>wrote: >>>> >>>>> Hello, >>>>> >>>>> My hive table is partitioned by year, month and day. I have defined it >>>>> as external table. The M/R job correctly loads the files into the daily >>>>> subfolders. The hdfs files will be loaded to >>>>> <hivetable>/year=yyyy/month=mm/day=dd/ folders by the scheduled M/R jobs. >>>>> The M/R job has some business logic in determining the values for year, >>>>> month and day; so one run might create / load files into multiple sub >>>>> -folders (multiple days). I am able to query the tables after adding >>>>> partitions using ALTER TABLE ADD PARTITION statement. But how do I automate >>>>> the partition creation step? Basically this script needs to identify the >>>>> subfolders created by the M/R job and create corresponding ALTER TABLE ADD >>>>> PARTITION statements. >>>>> >>>>> For example, say the M/R job loads files into the following 3 >>>>> sub-folders >>>>> >>>>> /user/hive/warehouse/sales/year=2013/month=1/day=21 >>>>> /user/hive/warehouse/sales/year=2013/month=1/day=22 >>>>> /user/hive/warehouse/sales/year=2013/month=1/day=23 >>>>> >>>>> Then it should create 3 alter table statements >>>>> >>>>> ALTER TABLE sales ADD PARTITION (year=2013, month=1, day=21); >>>>> ALTER TABLE sales ADD PARTITION (year=2013, month=1, day=22); >>>>> ALTER TABLE sales ADD PARTITION (year=2013, month=1, day=23); >>>>> >>>>> I thought of changing M/R jobs to load all files into same folder, >>>>> then first load the files into non-partitioned table and then to load the >>>>> partitioned table from non-partitioned table (using dynamic partition); but >>>>> would prefer to avoid that extra step if possible (esp. since data is >>>>> already in the correct sub-folders). >>>>> >>>>> Any help would greately be appreciated. >>>>> >>>>> Regards, >>>>> Sadu >>>>> >>>>> >>>>>
-
Re: Automating the partition creation processDean Wampler 2013-01-30, 02:05
Yes, if the only problem with my script solution is the variable number of
new days, you can just add all possibilities in advance, knowing that some will be empty of content... On Tue, Jan 29, 2013 at 7:49 PM, Sadananda Hegde <[EMAIL PROTECTED]>wrote: > Thanks, Edward. I can probably create all previous days partitions ahead > of time and then use Dean's logic to create new partitions on a daily > basis. I will probably end up having few empty partitions; need to make > sure it does not cause any confusions. > > Thanks, > Sadu > > On Tue, Jan 29, 2013 at 7:21 PM, Edward Capriolo <[EMAIL PROTECTED]>wrote: > >> You can also just create all your partitions ahead of time. They will not >> do any harm if empty. (unless you have an older version and hit this... >> http://issues.apache.org/jira/browse/HIVE-1007 ) >> >> >> On Tue, Jan 29, 2013 at 8:17 PM, Mark Grover <[EMAIL PROTECTED] >> > wrote: >> >>> Hi Sadananda, >>> Sorry to hear that. >>> >>> It got committed, don't worry about the "ABORTED". Here is the commit on >>> the trunk: >>> >>> https://github.com/apache/hive/commit/523f47c3b6e7cb7b6b7b7801c66406e116af6dbc >>> >>> However, there is no Apache Hive release with that patch in it. >>> >>> You have two options: >>> 1. Download the patch, rebuild hive and use it >>> 2. Find a hacky way to recover your partitions when they are empty and >>> populate them later. >>> >>> Sorry for the inconvenience. >>> >>> Mark >>> >>> On Tue, Jan 29, 2013 at 5:09 PM, Sadananda Hegde <[EMAIL PROTECTED]>wrote: >>> >>>> Thanks Mark, >>>> >>>> Recover partition feature will satisfy my needs; but MSCK Repair >>>> Partition < tablename> option is not working for me. It does not give any >>>> error; but does not add any partitions either. It looks like it adds >>>> partitions only when the sub-folder is empty; but not >>>> when the sub-folder has the data files. I see a fix to this issue here. >>>> >>>> https://issues.apache.org/jira/browse/HIVE-3231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel >>>> >>>> But probably it's not commited yet, since the final result says >>>> 'ABORTED". >>>> >>>> Thanks, >>>> Sadu >>>> >>>> On Mon, Jan 28, 2013 at 10:47 PM, Mark Grover < >>>> [EMAIL PROTECTED]> wrote: >>>> >>>>> Sadananda, >>>>> See if this helps: >>>>> https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-Recoverpartitions >>>>> >>>>> >>>>> On Mon, Jan 28, 2013 at 8:05 PM, Sadananda Hegde <[EMAIL PROTECTED]>wrote: >>>>> >>>>>> Hello, >>>>>> >>>>>> My hive table is partitioned by year, month and day. I have >>>>>> defined it as external table. The M/R job correctly loads the files into >>>>>> the daily subfolders. The hdfs files will be loaded to >>>>>> <hivetable>/year=yyyy/month=mm/day=dd/ folders by the scheduled M/R jobs. >>>>>> The M/R job has some business logic in determining the values for year, >>>>>> month and day; so one run might create / load files into multiple sub >>>>>> -folders (multiple days). I am able to query the tables after adding >>>>>> partitions using ALTER TABLE ADD PARTITION statement. But how do I automate >>>>>> the partition creation step? Basically this script needs to identify the >>>>>> subfolders created by the M/R job and create corresponding ALTER TABLE ADD >>>>>> PARTITION statements. >>>>>> >>>>>> For example, say the M/R job loads files into the following 3 >>>>>> sub-folders >>>>>> >>>>>> /user/hive/warehouse/sales/year=2013/month=1/day=21 >>>>>> /user/hive/warehouse/sales/year=2013/month=1/day=22 >>>>>> /user/hive/warehouse/sales/year=2013/month=1/day=23 >>>>>> >>>>>> Then it should create 3 alter table statements >>>>>> >>>>>> ALTER TABLE sales ADD PARTITION (year=2013, month=1, day=21); >>>>>> ALTER TABLE sales ADD PARTITION (year=2013, month=1, day=22); >>>>>> ALTER TABLE sales ADD PARTITION (year=2013, month=1, day=23); >>>>>> >>>>>> I thought of changing M/R jobs to load all files into same folder, >>>>>> then first load the files into non-partitioned table and then to load the *Dean Wampler, Ph.D.* thinkbiganalytics.com +1-312-339-1330 |