RE: S3/EMR Hive: Load contents of a single file
A bit more info - do an extended description of the table:

$ desc extended gsrc1;

And the "location" field is "location:s3://mybucket/path/to/data/src1.txt"

Do the same on a table created with a location pointing at the directory and the same info gives (not surprisingly) "location:s3://mybucket/path/to/data/"
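
If your Hive build supports it (an assumption about the EMR Hive version in use), "show create table" is another way to get the same confirmation in one shot:

$ show create table gsrc1;

The LOCATION clause in the output should come back as s3://mybucket/path/to/data/src1.txt for the file-backed table, and as the directory for the other one.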

From: Tony Burton [mailto:[EMAIL PROTECTED]]
Sent: 27 March 2013 08:46
To: '[EMAIL PROTECTED]'
Subject: RE: S3/EMR Hive: Load contents of a single file

Thanks for the reply Keith.

> you could have dispensed with the additional "alter table" business and simply created the original table around the directory in the first place

Yep, but I have multiple files in that directory and wanted to create a table based upon one file per table.
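
(One alternative I haven't tried here, so treat the paths as hypothetical: give each file its own prefix, e.g. s3://mybucket/path/to/data/src1/src1.txt, and then each table can be created around a directory in a single statement:

$ create external table gsrc1 (gdate string, c int) row format delimited fields terminated by '~' stored as textfile location 's3://mybucket/path/to/data/src1/';

That sidesteps the file-vs-directory question entirely, at the cost of rearranging the bucket.)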

> Do you know for certain that it isn't using other files also in that directory as part of the same table
> or if it is currently empty, that if you add a new file to the directory after creating the table in your
> described fashion, it doesn't immediately become visible as part of the table?

I've got two files in my s3://mybucket/path/to/data/ directory: s3://mybucket/path/to/data/src1.txt and s3://mybucket/path/to/data/src2.txt. Both contain lists of ~-separated date/count pairs, e.g. 20130101~12345, and both contain data for just the month of February this year.
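
For reference, the contents of src1.txt are just rows like these (middle rows elided; the first and last values here are taken from the query results further down):

20130201~5153
...
20130228~7051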

Create two tables:

$ create external table gsrc1 (gdate string, c int) row format delimited fields terminated by '~' stored as textfile;
$ alter table gsrc1 set location 's3://mybucket/path/to/data/src1.txt';
$ create external table gsrc2 (gdate string, c int) row format delimited fields terminated by '~' stored as textfile;
$ alter table gsrc2 set location 's3://mybucket/path/to/data/src2.txt';

Count(*) on each table:

$ select count(*) from gsrc1;
28
$ select count(*) from gsrc2;
28

Ok, but both tables could be pointing at the same data. Check max, min and first/last entry from both tables:

$ select min(c), max(c) from gsrc1;
2935 23130
$ select min(c), max(c) from gsrc2;
865953 2768868

$ select * from gsrc1 where gdate="20130201";
20130201 5153
$ select * from gsrc1 where gdate="20130228";
20130228 7051
$ select * from gsrc2 where gdate="20130201";
20130201 1472017
$ select * from gsrc2 where gdate="20130228";
20130228 1323241

Without copying in the whole data set here, I'm 100% confident that these values match the contents of the individual files in S3. Maybe other readers could try a similar exercise and present their results? Are there other tests I could run to further verify my findings?
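
One further check I could run, assuming the INPUT__FILE__NAME virtual column is available in this EMR Hive build, is to ask each table which file its rows actually come from:

$ select distinct INPUT__FILE__NAME from gsrc1;
$ select distinct INPUT__FILE__NAME from gsrc2;

If each query returns only its own .txt object, that would settle that each table is scoped to a single file rather than to the whole directory.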

Tony

-----Original Message-----
From: Keith Wiley [mailto:[EMAIL PROTECTED]]
Sent: 26 March 2013 19:40
To: [EMAIL PROTECTED]
Subject: Re: S3/EMR Hive: Load contents of a single file

Are you sure this is doing what you think it's doing? Since Hive associates tables with directories (well external tables at least, I'm not very familiar with internal tables), my suspicion is that even if your approach described below works, what Hive actually did was use s3://mybucket/path/to/data/ as the table location...in which case you could have dispensed with the additional "alter table" business and simply created the original table around the directory in the first place...or I could be completely wrong. Do you know for certain that it isn't using other files also in that directory as part of the same table...or if it is currently empty, that if you add a new file to the directory after creating the table in your described fashion, it doesn't immediately become visible as part of the table? I eagerly await clarification.
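
For example (paths and file name hypothetical, just to sketch the test): copy a third file into the same prefix and re-run a count on the existing table:

$ hadoop fs -cp s3://mybucket/path/to/data/src1.txt s3://mybucket/path/to/data/src3.txt
$ select count(*) from gsrc1;

If the count changes, the table is really bound to the directory rather than to the single file.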

On Mar 26, 2013, at 10:39 , Tony Burton wrote:

>
> Thanks for the quick reply Sanjay.
>
> ALTER TABLE is the key, but slightly different to your suggestion. I create the table as before, but don't specify location:
>
> $ create external table myData (str1 string, str2 string, count1 int)
> partitioned by <snip> row format <snip> stored as textfile;
>
> Then use ALTER TABLE like this:
________________________________________________________________________________
Keith Wiley [EMAIL PROTECTED] keithwiley.com music.keithwiley.com

"Yet mark his perfect self-contentment, and hence learn his lesson, that to be self-contented is to be vile and ignorant, and that to aspire is better than to be blindly and impotently happy."
