Hive >> mail # user >> Partition by directory


Partition by directory
Hello All,

I have been using the AWS setup for EMR for some time, and I am currently implementing Spark/Shark on my own cluster. I am installing from https://github.com/downloads/mesos/spark/spark-0.6.0-sources.tar.gz, which includes Hive 0.9.0. I am using this with S3 and am unable to recover partitions from a directory that contains a series of subdirectories (the partitions). I want two partitions, 2012-10-25 and 2012-10-26, each containing its respective files. For example, I have the following files located at s3://varickTest3/nn/:
drwxrwxrwx   -          0 1970-01-01 00:00 /nn/ds=2012-10-25
-rwxrwxrwx   1   49696432 2012-12-10 20:55 /nn/ds=2012-10-25/part-00000
-rwxrwxrwx   1   49696432 2012-12-10 20:55 /nn/ds=2012-10-25/part-00001
drwxrwxrwx   -          0 1970-01-01 00:00 /nn/ds=2012-10-26
-rwxrwxrwx   1   49696432 2012-12-10 20:55 /nn/ds=2012-10-26/part-00000
-rwxrwxrwx   1   49696432 2012-12-10 20:55 /nn/ds=2012-10-26/part-00001

When I run the following statements in Hive (not Shark):

CREATE EXTERNAL TABLE wiki(id BIGINT, title STRING, last_modified STRING, xml STRING, text STRING)
PARTITIONED BY (ds STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LOCATION 's3n://varickTest3/nn';

ALTER TABLE wiki RECOVER PARTITIONS;

the result is an empty table.
I have tried many variations of this and nothing has worked so far. These include adding:

MSCK REPAIR TABLE wiki;

using s3 rather than s3n (credentials for both schemes are set in core-site.xml), and setting the options:

SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

If I instead use:

LOCATION 's3n://varickTest3/nn/*'

the table has content, but I am still unable to recover the partitions.
Is there any way, through settings or the layout of the data (rather than writing a script), to have the table partitioned by the directories, as I can on AWS?
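
For reference, the manual fallback I am trying to avoid would presumably look like the sketch below. It assumes the standard Hive ALTER TABLE ... ADD PARTITION syntax and the directory layout listed above; I have not confirmed that it behaves correctly against s3n on this build.

```sql
-- Sketch only: register each ds=... directory as a partition by hand.
-- Assumes the table 'wiki' was created as an external table partitioned by ds.
ALTER TABLE wiki ADD PARTITION (ds='2012-10-25')
  LOCATION 's3n://varickTest3/nn/ds=2012-10-25';

ALTER TABLE wiki ADD PARTITION (ds='2012-10-26')
  LOCATION 's3n://varickTest3/nn/ds=2012-10-26';

-- List the partitions the metastore now knows about.
SHOW PARTITIONS wiki;
```

This needs one statement per directory, which is why I would prefer a setting or a directory convention that Hive picks up on its own.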
Thank you for any help anyone can give me.