I have been using the AWS EMR setup for some time, and I am now in the process of setting up Spark/Shark on my own cluster. I am installing from https://github.com/downloads/mesos/spark/spark-0.6.0-sources.tar.gz, which includes Hive 0.9.0. I am using this with S3, and I am unable to recover partitions from a directory that contains one subdirectory per partition. I want two partitions, 2012-10-25 and 2012-10-26, each containing its respective files. For example, I have the following files located at s3://varickTest3/nn/:
drwxrwxrwx - 0 1970-01-01 00:00 /nn/ds=2012-10-25
-rwxrwxrwx 1 49696432 2012-12-10 20:55 /nn/ds=2012-10-25/part-00000
-rwxrwxrwx 1 49696432 2012-12-10 20:55 /nn/ds=2012-10-25/part-00001
drwxrwxrwx - 0 1970-01-01 00:00 /nn/ds=2012-10-26
-rwxrwxrwx 1 49696432 2012-12-10 20:55 /nn/ds=2012-10-26/part-00000
-rwxrwxrwx 1 49696432 2012-12-10 20:55 /nn/ds=2012-10-26/part-00001
When I run the following statements in Hive (not Shark):
CREATE EXTERNAL TABLE wiki(id BIGINT, title STRING, last_modified STRING, xml STRING, text STRING)
PARTITIONED BY (ds STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LOCATION 's3n://varickTest3/nn';
ALTER TABLE wiki RECOVER PARTITIONS;
This results in an empty table.
I have tried many variations of this, and nothing has worked so far, including adding:
MSCK REPAIR TABLE wiki;
And using s3 rather than s3n (credentials for both types are set in core-site.xml)
And setting the options:
Although if I use:
The table will have content, but I am still unable to recover partitions.
Is there any way, through settings or the data layout (rather than writing a script), to have Hive partition the table from the directory structure, as I can on AWS EMR?
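For reference, the manual alternative I would like to avoid is registering each partition explicitly. This is only a sketch, assuming the standard Hive ALTER TABLE ... ADD PARTITION syntax and assuming the partition locations follow the listing above under the s3n scheme:

```sql
-- Register each ds=... directory as a partition by hand.
-- The locations are assumed from the directory listing above.
ALTER TABLE wiki ADD PARTITION (ds='2012-10-25')
  LOCATION 's3n://varickTest3/nn/ds=2012-10-25';
ALTER TABLE wiki ADD PARTITION (ds='2012-10-26')
  LOCATION 's3n://varickTest3/nn/ds=2012-10-26';

-- List the partitions the metastore now knows about.
SHOW PARTITIONS wiki;
```

With only two partitions this is manageable, but it means one statement per directory, which is why I am hoping for something automatic.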
Thank you for any help anyone can give me.
Mark Grover 2012-12-14, 09:01