Koert Kuipers 2012-02-08, 16:04
bejoy_ks@... 2012-02-08, 16:16
Mark Grover 2012-02-08, 17:23
we have a similar situation and this is what we did.
In our case, the partitions correspond to dates. We also have multiple
external tables set up this way.
The upstream process updates a status file with the earliest and
latest date available. I scan the DFS for new partitions (scan program
should know in advance how deep it should go to avoid excessive
namenode calls) and load any new partition that is not in HIVE and
it's less or equal than what the status file says.
This is done to avoid loading a partition that the upstream service is
not done processing (for this reason also the ALTER TABLE RECOVER
PARTITIONS won't work in our case unless we change how the upstream
process stores data).
Also, I drop the partitions that are older than the lower bound on the
status file. That's because the upstream system deletes older data,
and you don't want a stale partition in hive that points to data that
is not there anymore.
For all these operations I use hiveserver, the thrift interface. We've
been using this system for 2 years with this sync running hourly.
If it's of general interest, I can share the code, my company is open
On Wed, Feb 8, 2012 at 9:23 AM, Mark Grover <[EMAIL PROTECTED]> wrote:
> Hi Koert,
> That's because Hive metastore doesn't know about the partitions you added. I was in a similar situation but I use Amazon EMR and in their version of Hive, one can run the command "alter table src recover partitions" that goes through the directory structure of table (src, in this case) and adds the partitions in Hive metastore. You can look into getting a patch for that, if at all available.
> Instead of running the command through shell, you could try to connect to Hive from within Java (I personally use the Hive JDBC driver to do so) and then issue the desired add partition command but it might be an overkill if this the only reason you want to connect to Hive from Java.
> Mark Grover, Business Intelligence Analyst
> OANDA Corporation
> www: oanda.com www: fxtrade.com
> e: [EMAIL PROTECTED]
> "Best Trading Platform" - World Finance's Forex Awards 2009.
> "The One to Watch" - Treasury Today's Adam Smith Awards 2009.
> ----- Original Message -----
> From: "Koert Kuipers" <[EMAIL PROTECTED]>
> To: [EMAIL PROTECTED]
> Sent: Wednesday, February 8, 2012 11:04:18 AM
> Subject: external partitioned table
> hello all,
> we have an external partitioned table in hive.
> we add to this table by having map-reduce jobs (so not from hive) create new subdirectories with the right format (partitionid=partitionvalue).
> however hive doesn't pick them up automatically. we have to go into hive shell and run "alter table sometable add partition (partitionid=partitionvalue)". to make matter worse hive doesnt really lend itself to running such an add-partition-operation from java (or for that matter: hive doesn't lend itself to any easy programmatic manipulations... grrr. but i will stop now before i go on a a rant).
> any suggestions how to approach this? thanks!
> best, koert