Re: loading datafiles in s3
I think the answer to 1 is No but you can confirm on the AWS EMR forum.

The problem I've been having is that if you have x=foo in the prefix of your
S3 path, EMR will try to use it as part of your partitioning key even if you
don't want it.
Say the path is x=foo/y=bar/data and you want to partition on y only: EMR Hive
can get confused. Sometimes it works, other times it complains that x is not
part of your INSERT .. PARTITION(y) clause. I haven't quite figured out when
and why.
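
For what it's worth, the stack of 'alter table' queries Pat mentions below
could be generated with something like the following. This is only a minimal
Python sketch: the table name `logs` and the partition columns `dt` and `hr`
are made up for illustration, and the bucket path is the placeholder from the
original question. It relies on Hive's ALTER TABLE ... ADD PARTITION ...
LOCATION form, which lets a partition point at a directory that has no "key="
prefix.

```python
from datetime import datetime, timedelta


def partition_statements(table, base, start, end):
    """Yield one ALTER TABLE statement per hour in [start, end).

    `table` and the partition columns dt/hr are hypothetical names;
    `base` is the S3 prefix above the YYYY/MM/DD/HH directories.
    """
    hour = start
    while hour < end:
        # Directory layout from the question: <base>/YYYY/MM/DD/HH/
        location = f"{base}/{hour:%Y/%m/%d/%H}/"
        yield (
            f"ALTER TABLE {table} ADD PARTITION "
            f"(dt='{hour:%Y-%m-%d}', hr='{hour:%H}') "
            f"LOCATION '{location}';"
        )
        hour += timedelta(hours=1)


if __name__ == "__main__":
    for stmt in partition_statements(
        "logs",
        "s3://bucketname/dir",
        datetime(2011, 6, 27, 0),
        datetime(2011, 6, 27, 3),
    ):
        print(stmt)
```

The printed statements can be pasted into a Hive script; the table itself
would still need to be declared as partitioned by the same (assumed) columns.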

On Tue, Jun 28, 2011 at 11:42 AM, Christopher, Pat <[EMAIL PROTECTED]> wrote:

> allo,
>
> 1 dunno.  I generate my EMR scripts in a separate script so generating a
> stack of ‘alter table…’ queries is easy for me
>
> 2 event_b will have a null value in column 4.
>
> 2b (you didn’t ask) what happens with this row:
>
>   event_c user_id  france 500 afifthcolumn
>
> afifthcolumn will be truncated and you’ll have only event_c through 500 in
> the row
>
> Pat
>
> *From:* Kennon Lee [mailto:[EMAIL PROTECTED]]
> *Sent:* Monday, June 27, 2011 5:50 PM
> *To:* [EMAIL PROTECTED]
> *Subject:* loading datafiles in s3
>
> Hello,
>
> We're using hive on amazon elastic mapreduce to process logs on s3, and I
> had a couple basic questions. Apologies if they've been answered already-- I
> gathered most info from the hive tutorial on amazon (
> http://aws.amazon.com/articles/2855), as well as from skimming the hive
> wiki pages, but I'm still very new to all of this. So, questions:
>
> 1) Is it possible to partition on directories that do not have the "key="
> prefix? Our logs are organized like s3://bucketname/dir/YYYY/MM/DD/HH/*.bz2
> and so ideally we could partition on that structure instead of adding "dt="
> to every directory name. I found an old thread discussing this (
> http://search-hadoop.com/m/SGTqLox5Il/partition+directory/v=threaded)
> but couldn't find the actual syntax.
>
> 2) How does hive handle tab-delimited files where rows sometimes have
> different column counts? For instance, if we are parsing an event log that
> contains multiple events, some of which have more columns associated with
> them:
>
> event_a        user_id        apple          300
> event_b        user_id        cat
>
> If I define my hive table to have 4 columns, how will hive react to the
> event_b row?
>
> Thanks!
>
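
On question 2, the behavior Pat describes (missing trailing columns read as
NULL, extra columns dropped) can be mimicked in a few lines of Python. This
only models what is described in this thread; it is not Hive's actual SerDe
code, and `fit_row` is a hypothetical helper with None standing in for NULL.

```python
def fit_row(line, num_columns, delimiter="\t"):
    """Fit one delimited line to a fixed column count.

    Extra trailing fields are dropped and missing trailing fields
    become None, mirroring the behavior described in the thread.
    """
    fields = line.rstrip("\n").split(delimiter)
    fields = fields[:num_columns]                    # drop extra columns
    fields += [None] * (num_columns - len(fields))   # pad short rows
    return fields


if __name__ == "__main__":
    rows = [
        "event_a\tuser_id\tapple\t300",
        "event_b\tuser_id\tcat",
        "event_c\tuser_id\tfrance\t500\tafifthcolumn",
    ]
    for row in rows:
        print(fit_row(row, 4))
```

Running a sample of the real logs through something like this before defining
the table is a cheap way to see which events would end up with NULLs.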