|
|
-
Creating external table poiting to s3 folder with files not loading data
Fernando Andrés Doglio Tu... 2012-12-11, 14:05
Long subject, I know.. let me explain a bit more about the problem:
I'm trying to load a file into a hive table (this is on an EMR instance) for that I create an external table, and I set the location to the folder on an s3 bucket, where the file resides. The problem is that even though the table is created correctly, when I do a "select * from table" it returns nothing. I'm not seeing errors on the logs either, so I don't know what can be happening....
Also, probably important: I'm using a custom SerDe that I did not write...but I do have the code for it.
I'm quite new to hive, so I appreciate any kind of pointers you can throw at me.
Thanks! Fernando Doglio
+
Fernando Andrés Doglio Tu... 2012-12-11, 14:05
-
Re: Creating external table poiting to s3 folder with files not loading data
Dean Wampler 2012-12-14, 15:22
A couple of clarifying questions and suggestions. First, keep in mind that Hive doesn't care if you have a typo of some kind in your external location ;) Use DESCRIBE FORMATTED to verify the path is right. For an external partitioned table, DESCRIBE FORMATTED table PARTITION(col1=val1,col2=val2,...).
A dumb mistake I've often made is use a variable in a script, e.g., "... LOCATION '${DATA}/foo/bar/baz';" and forget to define DATA when invoking the script.
When you said "load a file", did you mean using the LOAD DATA ... INPATH 's3n://...' command? I've read that s3n is not supported for these statements, but I'm not sure that's actually true.
If everything looks correct, you should be able to do hadoop fs -ls s3n://... successfully. Actually, since your hive environment could have different settings for some filesystem properties, it might be a better check to use dfs -ls ... at the hive CLI prompt.
Otherwise, it's probably the SerDe, as Mark suggested. If possible, I would attempt to use the data in some temporary external table using a built-in SerDe, like the default, just to confirm that it's not a file system issue and it's probably the SerDe.
Hope that helps. dean
On Tue, Dec 11, 2012 at 8:05 AM, Fernando Andrés Doglio Turissini < [EMAIL PROTECTED]> wrote:
> Long subject, I know.. let me explain a bit more about the problem: > > I'm trying to load a file into a hive table (this is on an EMR instance) > for that I create an external table, and I set the location to the folder > on an s3 bucket, where the file resides. > The problem is that even though the table is created correctly, when I do > a "select * from table" it returns nothing. I'm not seeing errors on the > logs either, so I don't know what can be happening.... > > Also, probably important: I'm using a custom SerDe that I did not > write...but I do have the code for it. > > I'm quite new to hive, so I appreciate any kind of pointers you can throw > at me. > > Thanks! > Fernando Doglio >
-- *Dean Wampler, Ph.D.* thinkbiganalytics.com +1-312-339-1330
+
Dean Wampler 2012-12-14, 15:22
-
Re: Creating external table poiting to s3 folder with files not loading data
Fernando Andrés Doglio Tu... 2012-12-17, 11:32
Hello, and thank you both for your answers... I think I found the problem... keep in mind I'm quite new to all this Hive/Hadoop stuff :)
I think my problem was due to the fact that the create table statement had the partition defined but the information was not partitioned on the file system (it was just 1 file inside a folder).
I'm guessing that what I have to do, is load the data into a non-partitioned table and then copy the information using hive and dynamic partitioning the data in the same query... is that right?
Thanks again!
On Fri, Dec 14, 2012 at 1:22 PM, Dean Wampler < [EMAIL PROTECTED]> wrote:
> A couple of clarifying questions and suggestions. First, keep in mind that > Hive doesn't care if you have a typo of some kind in your external location > ;) Use DESCRIBE FORMATTED to verify the path is right. For an external > partitioned table, DESCRIBE FORMATTED table > PARTITION(col1=val1,col2=val2,...). > > A dumb mistake I've often made is use a variable in a script, e.g., "... > LOCATION '${DATA}/foo/bar/baz';" and forget to define DATA when invoking > the script. > > When you said "load a file", did you mean using the LOAD DATA ... INPATH > 's3n://...' command? I've read that s3n is not supported for these > statements, but I'm not sure that's actually true. > > If everything looks correct, you should be able to do hadoop fs -ls > s3n://... successfully. Actually, since your hive environment could have > different settings for some filesystem properties, it might be a better > check to use dfs -ls ... at the hive CLI prompt. > > Otherwise, it's probably the SerDe, as Mark suggested. If possible, I > would attempt to use the data in some temporary external table using a > built-in SerDe, like the default, just to confirm that it's not a file > system issue and it's probably the SerDe. > > Hope that helps. > dean > > On Tue, Dec 11, 2012 at 8:05 AM, Fernando Andrés Doglio Turissini < > [EMAIL PROTECTED]> wrote: > >> Long subject, I know.. let me explain a bit more about the problem: >> >> I'm trying to load a file into a hive table (this is on an EMR instance) >> for that I create an external table, and I set the location to the folder >> on an s3 bucket, where the file resides. >> The problem is that even though the table is created correctly, when I do >> a "select * from table" it returns nothing. I'm not seeing errors on the >> logs either, so I don't know what can be happening.... >> >> Also, probably important: I'm using a custom SerDe that I did not >> write...but I do have the code for it. >> >> I'm quite new to hive, so I appreciate any kind of pointers you can throw >> at me. >> >> Thanks! >> Fernando Doglio >> > > > > -- > *Dean Wampler, Ph.D.* > thinkbiganalytics.com > +1-312-339-1330 > > >
+
Fernando Andrés Doglio Tu... 2012-12-17, 11:32
-
Re: Creating external table poiting to s3 folder with files not loading data
Dean Wampler 2012-12-17, 15:07
You raise an important point; "metadata" commands like create table and alter table only affect metadata, not the actual data itself. So, you have to write the files into the partition directories yourself and in the correct schema. One way to do the latter is to stage the raw data in a "temporary" table and write a query that INSERTS INTO the correct partitions of the final table.
However, dynamic partitioning with external tables requires a little care. By default, if the partitions don't already exist, the directories will be created under hive.metastore.warehouse.dir, e.g., /data/hive/warehouse, as if the table is managed, even if you intended the files to be somewhere else. BUT, if you create the partitions in advance, the data will end up in the correct directories for each partition.
Here's an example, adapted from the courseware we use at Think Big Analytics:
First, assume I have some "raw" stocks data in a staged table name raw_stocks and schema (ymd STRING, symbol STRING, closing_price FLOAT, ...) where "ymd" is a year-month-day string, e.g., YYYY-MM-DD.
Next, create an EXTERNAL stocks table partitioned by year, for example:
CREATE EXTERNAL TABLE stocks (ymd STRING, symbol STRING, closing_price FLOAT, ...) PARTITIONED BY (year INT);
Now, if you don't create the partitions in advance and run the following, watch what happens:
set hive.exec.dynamic.partition=true; set hive.exec.dynamic.partition.mode=nonstrict;
INSERT INTO TABLE stocks PARTITION(year) SELECT ymd, symbol, price_close, year(ymd) FROM raw_stocks;
I'll end up with directories under /data/hive/warehouse/stocks/, e.g., /data/hive/warehouse/stocks/year=1984: /data/hive/warehouse/stocks/year=1985: ... /data/hive/warehouse/stocks/year=2011: /data/hive/warehouse/stocks/year=2012:
You can use DESCRIBE FORMATTED stocks PARTITION(year=XXXX) to confirm these locations (as well as dfs -ls ...., of course).
In other words, just like managed/internal tables. Also, just to be clear, if you now drop the table, the directories won't be deleted, even though they are in the managed table location, because the table is external.
HOWEVER, if you run a query first over raw_stocks to determine all the years, then you can create the partitions in advance: (I wrote a bash script to generate these statements and yes, you can build a single statement that does all of them at once...)
ALTER TABLE stocks2 ADD PARTITION(year=1984) LOCATION '/data/stocks/1984'; ... ALTER TABLE stocks2 ADD PARTITION(year=1984) LOCATION '/data/stocks/2012';
Now, the dynamic partitions query will run as before, but the partitions will be in
/data/stocks/year=1984: /data/stocks/year=1985: ... /data/stocks/year=2011: /data/stocks/year=2012:
as desired. It should work for S3, etc., as well.
dean
On Mon, Dec 17, 2012 at 5:32 AM, Fernando Andrés Doglio Turissini < [EMAIL PROTECTED]> wrote:
> Hello, and thank you both for your answers... > I think I found the problem... keep in mind I'm quite new to all this > Hive/Hadoop stuff :) > > I think my problem was due to the fact that the create table statement had > the partition defined but the information was not partitioned on the file > system (it was just 1 file inside a folder). > > I'm guessing that what I have to do, is load the data into a > non-partitioned table and then copy the information using hive and dynamic > partitioning the data in the same query... is that right? > > Thanks again! > > > On Fri, Dec 14, 2012 at 1:22 PM, Dean Wampler < > [EMAIL PROTECTED]> wrote: > >> A couple of clarifying questions and suggestions. First, keep in mind >> that Hive doesn't care if you have a typo of some kind in your external >> location ;) Use DESCRIBE FORMATTED to verify the path is right. For an >> external partitioned table, DESCRIBE FORMATTED table >> PARTITION(col1=val1,col2=val2,...). >> >> A dumb mistake I've often made is use a variable in a script, e.g., "... >> LOCATION '${DATA}/foo/bar/baz';" and forget to define DATA when invoking *Dean Wampler, Ph.D.* thinkbiganalytics.com +1-312-339-1330
+
Dean Wampler 2012-12-17, 15:07
-
Re: Creating external table poiting to s3 folder with files not loading data
Mark Grover 2012-12-14, 09:05
Fernando, It is more likely related to your SerDe and the underlying data not matching up to it and than being related to the table being external on S3.
Mark
On Tue, Dec 11, 2012 at 6:05 AM, Fernando Andrés Doglio Turissini <[EMAIL PROTECTED]> wrote: > Long subject, I know.. let me explain a bit more about the problem: > > I'm trying to load a file into a hive table (this is on an EMR instance) for > that I create an external table, and I set the location to the folder on an > s3 bucket, where the file resides. > The problem is that even though the table is created correctly, when I do a > "select * from table" it returns nothing. I'm not seeing errors on the logs > either, so I don't know what can be happening.... > > Also, probably important: I'm using a custom SerDe that I did not > write...but I do have the code for it. > > I'm quite new to hive, so I appreciate any kind of pointers you can throw at > me. > > Thanks! > Fernando Doglio
+
Mark Grover 2012-12-14, 09:05
|
|