Hive >> mail # user >> loading datafiles in s3


Re: loading datafiles in s3
I think the answer to 1 is No but you can confirm on the AWS EMR forum.

The problem I've been having is that if you have x=foo in the prefix of your
S3 path, EMR will try to use it as part of your partitioning key even if you
don't want it.
Say the path is x=foo/y=bar/data and you want to partition on y only: EMR Hive
can get confused. Sometimes it works, other times it complains that x is not
part of your INSERT .. PARTITION(y) clause. I haven't quite figured out when
and why.
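
For question 1, the workaround Pat describes in the quoted message below (generating a stack of alter table statements from a separate script) might look roughly like this. It is only a sketch: the table name `logs`, the `dt` value format, and the helper `partition_statements` are assumptions made for illustration, and it relies on Hive's `ALTER TABLE ... ADD PARTITION ... LOCATION` form to point each partition at an existing YYYY/MM/DD/HH directory, so the directories need no "key=" prefix.

```python
from datetime import datetime, timedelta


def partition_statements(table, base, start, hours):
    """Yield one ALTER TABLE ... ADD PARTITION statement per hourly directory.

    `table` and the `dt` partition column are hypothetical; `base` is the S3
    prefix that contains the YYYY/MM/DD/HH directories.
    """
    for offset in range(hours):
        ts = start + timedelta(hours=offset)
        # Existing directory layout from the original question.
        location = "{}/{}/".format(base, ts.strftime("%Y/%m/%d/%H"))
        # Partition value chosen for this sketch: one string per hour.
        dt_value = ts.strftime("%Y-%m-%d-%H")
        yield "ALTER TABLE {} ADD PARTITION (dt='{}') LOCATION '{}';".format(
            table, dt_value, location
        )


statements = list(
    partition_statements("logs", "s3://bucketname/dir", datetime(2011, 6, 27, 22), 3)
)

if __name__ == "__main__":
    for statement in statements:
        print(statement)
```

Each generated statement maps one hour of logs to one partition, so the script can be rerun for whatever time range needs to be registered. Whether EMR Hive then leaves other "key=value" path segments alone is a separate matter that this sketch does not address.
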
On Tue, Jun 28, 2011 at 11:42 AM, Christopher, Pat <
[EMAIL PROTECTED]> wrote:

> allo,
>
> 1 dunno.  I generate my EMR scripts in a separate script so generating a
> stack of ‘alter table…’ queries is easy for me.
>
> 2 event_b will have a null value in column 4.
>
> 2b (you didn’t ask) what happens with this row:
>
>   event_c user_id  france 500 afifthcolumn
>
> afifthcolumn will be truncated and you’ll have only event_c through 500 in
> the row.
>
> Pat
>
>
> *From:* Kennon Lee [mailto:[EMAIL PROTECTED]]
> *Sent:* Monday, June 27, 2011 5:50 PM
> *To:* [EMAIL PROTECTED]
> *Subject:* loading datafiles in s3
>
> Hello,
>
> We're using Hive on Amazon Elastic MapReduce to process logs on S3, and I
> had a couple of basic questions. Apologies if they've been answered already; I
> gathered most info from the Hive tutorial on Amazon (
> http://aws.amazon.com/articles/2855), as well as from skimming the Hive
> wiki pages, but I'm still very new to all of this. So, questions:
>
> 1) Is it possible to partition on directories that do not have the "key="
> prefix? Our logs are organized like s3://bucketname/dir/YYYY/MM/DD/HH/*.bz2
> and so ideally we could partition on that structure instead of adding "dt="
> to every directory name. I found an old thread discussing this (
> http://search-hadoop.com/m/SGTqLox5Il/partition+directory/v=threaded)
> but couldn't find the actual syntax.
>
> 2) How does Hive handle tab-delimited files where rows sometimes have
> different column counts? For instance, if we are parsing an event log that
> contains multiple events, some of which have more columns associated with
> them:
>
> event_a        user_id        apple          300
> event_b        user_id        cat
>
> If I define my Hive table to have 4 columns, how will Hive react to the
> event_b row?
>
> Thanks!