Tony Burton 2013-03-26, 17:11
Ramki Palle 2013-03-26, 17:41
Sanjay Subramanian 2013-03-26, 17:21
Tony Burton 2013-03-26, 17:39
Sanjay Subramanian 2013-03-26, 17:41
Tony Burton 2013-03-26, 17:45
Keith Wiley 2013-03-26, 19:39
-RE: S3/EMR Hive: Load contents of a single file
Tony Burton 2013-03-27, 08:46
Thanks for the reply Keith.
> you could have dispensed with the additional "alter table" business and simply created the original table around the directory in the first place
Yep, but I have multiple files in that directory and wanted to create a table based upon one file per table.
> Do you know for certain that it isn't using other files also in that directory as part of the same table
> or if it is currently empty, that if you add a new file to the directory after creating the table in your
> described fashion, it doesn't immediately become visible as part of the table?
I've got two files in my s3://mybucket/path/to/data/ directory, s3://mybucket/path/to/data/src1.txt and s3://mybucket/path/to/data/src2.txt - both contain lists of ~-separated date/count pairs, eg 20130101~12345. Both contain data for just the month of February this year.
Create two tables:
$ create external table gsrc1 (gdate string, c int) row format delimited fields terminated by '~' stored as textfile;
$ alter table gsrc1 set location 's3://spinmetrics/global/src1.txt';
$ create external table gsrc2 (gdate string, c int) row format delimited fields terminated by '~' stored as textfile;
$ alter table gsrc2 set location 's3://spinmetrics/global/src2.txt';
Count(*) on each table:
$ select count(*) from gsrc1:
$ select count(*) from gsrc2:
Ok, but both tables could be pointing at the same data. Check max, min and first/last entry from both tables:
$ select min(c), max(c) from gsrc1;
$ select min(c), max(c) from gsrc2;
$ select * from gsrc1 where gdate="20130201"
$ select * from gsrc1 where gdate="20130228"
$ select * from gsrc2 where gdate="20130201"
$ select * from gsrc2 where gdate="20130228"
And without copying in the whole data set I am 100% confident that these values match the contents of the individual files in s3. Maybe other readers could try a similar exercise and present their results? Are there other tests I could try to further verify my findings?
From: Keith Wiley [mailto:[EMAIL PROTECTED]]
Sent: 26 March 2013 19:40
To: [EMAIL PROTECTED]
Subject: Re: S3/EMR Hive: Load contents of a single file
Are you sure this is doing what you think it's doing? Since Hive associates tables with directories (well external tables at least, I'm not very familiar with internal tables), my suspicion is that even if your approach described below works, what Hive actually did was use s3://mybucket/path/to/data/ as the table location...in which case you could have dispensed with the additional "alter table" business and simply created the original table around the directory in the first place...or I could be completely wrong. Do you know for certain that it isn't using other files also in that directory as part of the same table...or if it is currently empty, that if you add a new file to the directory after creating the table in your described fashion, it doesn't immediately become visible as part of the table? I eagerly await clarification.
On Mar 26, 2013, at 10:39 , Tony Burton wrote:
> Thanks for the quick reply Sanjay.
> ALTER TABLE is the key, but slightly different to your suggestion. I create the table as before, but don't specify location:
> $ create external table myData (str1 string, str2 string, count1 int)
> partitioned by <snip> row format <snip> stored as textfile;
> Then use ALTER TABLE like this:
> $ ALTER TABLE myData SET LOCATION '
> s3://mybucket/path/to/data/src1.txt ';
> Bingo, I can now run queries with myData in the same way I can when the LOCATION is a directory. Cool!
> From: Sanjay Subramanian [mailto:[EMAIL PROTECTED]]
> Sent: 26 March 2013 17:22
> To: [EMAIL PROTECTED]
> Subject: Re: S3/EMR Hive: Load contents of a single file
> Hi Tony
> Can u create the table without any location.
Keith Wiley [EMAIL PROTECTED] keithwiley.com music.keithwiley.com
"Yet mark his perfect self-contentment, and hence learn his lesson, that to be self-contented is to be vile and ignorant, and that to aspire is better than to be blindly and impotently happy."
-- Edwin A. Abbott, Flatland ________________________________________________________________________________
Please consider the environment before printing this email
Inbound Email has been scanned for viruses and SPAM
Please consider the environment before printing this email or attachments
This email and any attachments are confidential, protected by copyright and may be legally privileged. If you are not the intended recipient, then the dissemination or copying of this email is prohibited. If you have received this in error, please notify the sender by replying by email and then delete the email completely from your system. Neither Sporting Index nor the sender accepts responsibility for any virus, or any other defect which might affect any computer or IT system into which the email is received and/or opened. It is the responsibility of the recipient to scan the email and no responsibility is accepted for any loss or damage arising in any way from receipt or use of this email. Sporting Index Ltd is a company registered in England and Wales with company number 2636842, whose registered office is at Gateway House, Milverton Street, London, SE11 4AP. Sporting Index Ltd is authorised and regulated by the UK Financial Services Authority (reg. no. 150404) and Gambling Commission (reg. no. 000-027343-R-308898-001). Any financial promotion contained herein has been issued
and approved by Sporting Index Ltd.
Outbound email has been scanned for viruses and SPAM
Tony Burton 2013-03-27, 09:58
Keith Wiley 2013-03-27, 17:02
Tony Burton 2013-03-27, 17:18