You have a couple of options for saving your intermediate state: 1. If your
metastore is HA, you can save your state in the metastore (e.g. ALTER TABLE
<tname> SET TBLPROPERTIES ("job.state" = "DoneTill:121122")). 2. You can
periodically save your state on EMR-local drives and upload it to S3. You
can use any custom format for your state information (mysql dump is also a
To make all this work, you can start your EMR cluster in hive-interactive
mode and run the driver process. This process can be written in any
programming (scripting) language of your choice.
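As a rough illustration of the two options above, here is a minimal Python
sketch of the state-handling part of such a driver. The function names, the
JSON file layout, and the "done_till" field are hypothetical choices made for
this example; only the "job.state" property and the "DoneTill:" value format
come from the example above. Running the statement through Hive and copying
the file to S3 are left to whatever client you already use.

```python
import json
import os
import tempfile

STATE_KEY = "job.state"  # property name from the TBLPROPERTIES example


def load_state(path):
    """Return the saved state dict, or an empty state if no file exists yet."""
    if not os.path.exists(path):
        return {"done_till": None}
    with open(path) as fh:
        return json.load(fh)


def save_state(path, state):
    """Write the state atomically: temp file in the same directory, then rename."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as fh:
        json.dump(state, fh)
    # os.replace swaps the file in one step, so a crash never leaves
    # a half-written state file behind.
    os.replace(tmp_path, path)


def tblproperties_statement(table, done_till):
    """Build the HiveQL that records the state in the metastore (option 1)."""
    return 'ALTER TABLE {0} SET TBLPROPERTIES ("{1}" = "DoneTill:{2}")'.format(
        table, STATE_KEY, done_till
    )
```

The driver would call load_state() at startup, process the pending
partitions, and then call save_state() (followed by an upload to S3) or run
the generated ALTER TABLE statement, depending on which option you pick.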
You can use your state to know about newly added partition (periodically
Hive/HCatalog has a way to programmatically alert on data availability (take
a look at HIVE-2038, HCatalog-3).
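For a polling driver, finding the newly added partitions can be as simple as
comparing the partition list against the saved state. The sketch below is
only an assumption about how that might look: it expects the text output of
SHOW PARTITIONS for a table with a single partition column (hypothetically
named dt) holding MMDDHH values as in the question below, and both function
names are made up for this example.

```python
def parse_partitions(show_partitions_output):
    """Extract partition values from lines such as 'dt=121122'."""
    values = []
    for line in show_partitions_output.splitlines():
        line = line.strip()
        if "=" in line:
            values.append(line.split("=", 1)[1])
    return values


def new_partitions(all_values, done_till):
    """Return the values strictly after done_till, oldest first.

    Plain string comparison is only correct while all values have the same
    width and fall within one year, because MMDDHH carries no year.
    """
    if done_till is None:
        return sorted(all_values)
    return sorted(v for v in all_values if v > done_till)
```

With a saved state of "121122", a partition list containing 121121, 121122,
and 121123 would yield only 121123 as pending work. Adding the year to the
partition name would avoid the ordering problem at year boundaries.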
On Sun, Dec 11, 2011 at 4:48 PM, Cam Bazz <[EMAIL PROTECTED]> wrote:
> Hello All,
> So I had a single-node pseudo cluster that has been calculating some
> statistics for me, running for a year. Finally it grew into more than a
> do-it-at-home task.
> So I have my data uploaded to s3, and I have configured everything so
> that I can load my tables, and load the partitions, and the data is
> available to the elastic map reduce.
> I have a number of problems I need to solve before I can use this in any
> useful manner.
> First: I load data and I must run a number of queries where the input is
> a partition name, usually MMDDHH. So each time the script runs, I must
> keep a state of where I left off last, and then it must do some processing
> for the partitions that are newly loaded.
> Considering I am using S3, how can I store state? Perhaps in some
> other table that is also stored in S3? Is it a good approach to keep
> states and such things in other tables, like in SQL's old days?
> another problem I am having is how to implement a function that will
> increase partition. how will i know what are the newest loaded
> Also, is there something like a cursor in HQL?
> Best regards,