

Re: HIVE and S3 via EMR?
I've made the bucket - which is derived from the Enron emails - available
at s3://rjurney_public_web/from_to_date and a sample is available at
http://s3.amazonaws.com/rjurney_public_web/from_to_date/part-m-00004

I am using Hive 0.9.0.  I don't care about partitioning - I just want to
load my data any which way at this point.  Create table isn't working, so
I'm trying alter table now.  I really want to create a table, then load the
data into it, but an external table would be fine.
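For reference, the pattern that eventually worked in this thread can be sketched as one HiveQL sequence. The bucket path and column names are taken from the messages below; the s3n:// scheme and the exact schema are assumptions based on the replies, not something verified against this bucket:

```sql
-- Define an external table directly over the TSV data on S3.
-- Note: s3n:// URIs require the bucket name to be a valid hostname;
-- the "Invalid hostname" error quoted below is commonly triggered by
-- underscores in the bucket name on older Hive/Hadoop versions.
create external table from_to (
  from_address string,
  to_address   string,
  dt           string
)
row format delimited fields terminated by '\t'
stored as textfile
location 's3n://rjurney_public_web/from_to_date';

-- Query it in place; no LOAD DATA step is needed for an external table,
-- which is why it sidesteps the "only file or hdfs" restriction.
select from_address, to_address, dt from from_to limit 10;
```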

On Tue, May 29, 2012 at 2:42 PM, Aniket Mokashi <[EMAIL PROTECTED]> wrote:

> I think the right URI scheme is s3n://abc/def. We use that with the EMR version of
> hive in production.
>
> create table test (schema string) location 's3n://abc/def'; should work.
>
> On Tue, May 29, 2012 at 2:35 PM, Balaji Rao <[EMAIL PROTECTED]> wrote:
>
>> To partition on s3, one would create folders like:
>> s3://mybucket/path/dt=2012-05-20
>>                             dt=2012-05-21
>>                             dt=2012-05-22
>>
>> You can then use:
>> create external table from_to(from_address string, to_address string)
>> partitioned by (dt string) row format delimited fields terminated by
>> '\t' stored as textfile location 's3://mybucket/path';
>>
>> Then issue the command:
>> alter table from_to recover partitions;
>>
>> You will be able to then use the partitions:
>> select from_address, to_address, dt from from_to where dt >='2012-05-21'
>>
>> On Tue, May 29, 2012 at 5:19 PM, Russell Jurney
>> <[EMAIL PROTECTED]> wrote:
>> > I get an error when I create an external table.  btw - I can partition
>> on dt
>> > or from/to address.  I'm just not clear on how to partition - my efforts
>> > fail.
>> >
>> > hive> create external table from_to(from_address string, to_address
>> string,
>> > dt string)
>> >     >     row format delimited fields terminated by '\t' stored as
>> textfile
>> > location 's3n://rjurney_public_web/from_to_date';
>> > FAILED: Error in metadata: java.lang.IllegalArgumentException: Invalid
>> > hostname in URI s3n://rjurney_public_web/from_to_date
>> > FAILED: Execution Error, return code 1 from
>> > org.apache.hadoop.hive.ql.exec.DDLTask
>> >
>> >
>> > However, I just upgraded to HIVE 0.9, and it works :)  No reason to use
>> the
>> > old stuff when I can scp the new one up.
>> >
>> > Thanks!
>> >
>> > On Tue, May 29, 2012 at 1:34 PM, Balaji Rao <[EMAIL PROTECTED]>
>> wrote:
>> >>
>> >> If you are using hive on EMR, you can create a table directly from the
>> >> data on S3:
>> >>
>> >> From hive, you can create tables that use S3 data like this:
>> >>
>> >> create external table from_to(from_address string, to_address string,
>> >> dt string) row format delimited fields terminated by '\t' stored as
>> >> textfile location 's3://rjurney_public_web/from_to_date';
>> >>
>> >> You could then:
>> >>  select * from from_to
>> >>
>> >> Balaji
>> >>
>> >> On Tue, May 29, 2012 at 4:20 PM, Russell Jurney
>> >> <[EMAIL PROTECTED]> wrote:
>> >> > How do I load data from S3 into Hive using Amazon EMR?  I've booted a
>> >> > small
>> >> > cluster, and I want to load a 3-column TSV file from Pig into a table
>> >> > like
>> >> > this:
>> >> >
>> >> > create table from_to (from_address string, to_address string, dt
>> >> > string);
>> >> >
>> >> >
>> >> > When I run something like this:
>> >> >
>> >> > load data inpath 's3n://rjurney_public_web/from_to_date' into table
>> >> > from_to;
>> >> >
>> >> >
>> >> > I get errors:
>> >> >
>> >> > FAILED: Error in semantic analysis: Line 1:17 Invalid path
>> >> > 's3n://rjurney_public_web/from_to_date': only "file" or "hdfs" file
>> >> > systems
>> >> > accepted. s3n file system is not supported.
>> >> >
>> >> >
>> >> > There is no distcp on the master node of my EMR cluster, so I can't
>> copy
>> >> > it
>> >> > over.  I've read the documentation... and so far after a day of
>> trying,
>> >> > I
>> >> > can't load data into HIVE via EMR.
>> >> >
>> >> > What am I missing?  Thanks!
>> >> > --
>> >> > Russell

Russell Jurney twitter.com/rjurney [EMAIL PROTECTED] datasyndrome.com