Hive, mail # user - Creating external table pointing to s3 folder with files not loading data


Re: Creating external table pointing to s3 folder with files not loading data
Dean Wampler 2012-12-17, 15:07
You raise an important point: "metadata" commands like CREATE TABLE and
ALTER TABLE only affect metadata, not the actual data itself. So, you
have to write the files into the partition directories yourself, and in
the correct schema. One way to do the latter is to stage the raw data in
a "temporary" table and write a query that INSERTs INTO the correct
partitions of the final table.

However, dynamic partitioning with external tables requires a little
care. By default, if the partitions don't already exist, the directories
will be created under hive.metastore.warehouse.dir, e.g.,
/data/hive/warehouse, as if the table were managed, even if you intended
the files to be somewhere else. BUT, if you create the partitions in
advance, the data will end up in the correct directories for each
partition.

Here's an example, adapted from the courseware we use at Think Big
Analytics:

First, assume I have some "raw" stocks data in a staging table named
raw_stocks with schema (ymd STRING, symbol STRING, closing_price FLOAT,
...), where "ymd" is a year-month-day string, e.g., YYYY-MM-DD.

Next, create an EXTERNAL stocks table partitioned by year, for example:

CREATE EXTERNAL TABLE stocks (ymd STRING, symbol STRING, closing_price FLOAT, ...)
PARTITIONED BY (year INT);

Now, if you don't create the partitions in advance and run the following,
watch what happens:

set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;

INSERT INTO TABLE stocks PARTITION(year)
SELECT ymd, symbol, closing_price, year(ymd) FROM raw_stocks;

I'll end up with directories under /data/hive/warehouse/stocks/, e.g.,
  /data/hive/warehouse/stocks/year=1984
  /data/hive/warehouse/stocks/year=1985
  ...
  /data/hive/warehouse/stocks/year=2011
  /data/hive/warehouse/stocks/year=2012

You can use DESCRIBE FORMATTED stocks PARTITION(year=XXXX) to confirm
these locations (as well as dfs -ls ..., of course).
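
For example, picking one year (the partition value is arbitrary):

DESCRIBE FORMATTED stocks PARTITION(year=2012);
dfs -ls /data/hive/warehouse/stocks/year=2012;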

In other words, just like managed/internal tables. Also, just to be clear,
if you now drop the table, the directories won't be deleted, even though
they are in the managed table location, because the table is external.
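
If you want to see that for yourself (you'd have to recreate the table
afterwards to keep following along):

DROP TABLE stocks;
-- the partition directories are still there:
dfs -ls /data/hive/warehouse/stocks/;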

HOWEVER, if you first run a query over raw_stocks to determine all the
years, then you can create the partitions in advance. (I wrote a bash
script to generate these statements, and yes, you can build a single
statement that does all of them at once; see below.)

ALTER TABLE stocks ADD PARTITION(year=1984) LOCATION '/data/stocks/1984';
...
ALTER TABLE stocks ADD PARTITION(year=2012) LOCATION '/data/stocks/2012';
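
The all-at-once version would look something like this; Hive accepts
several partition specs in one ALTER TABLE ... ADD statement:

ALTER TABLE stocks ADD IF NOT EXISTS
  PARTITION(year=1984) LOCATION '/data/stocks/1984'
  PARTITION(year=1985) LOCATION '/data/stocks/1985'
  ...
  PARTITION(year=2012) LOCATION '/data/stocks/2012';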

Now, the dynamic partitions query will run as before, but the partitions
will be in

  /data/stocks/1984
  /data/stocks/1985
  ...
  /data/stocks/2011
  /data/stocks/2012

as desired. It should work for S3, etc., as well.
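
For S3, the LOCATION simply becomes an S3 URI, e.g. (bucket name made
up; the scheme might be s3:// or s3n://, depending on your Hadoop/EMR
setup):

ALTER TABLE stocks ADD PARTITION(year=2012)
LOCATION 's3n://your-bucket/stocks/2012';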

dean

On Mon, Dec 17, 2012 at 5:32 AM, Fernando Andrés Doglio Turissini <
[EMAIL PROTECTED]> wrote:

> Hello, and thank you both for your answers...
> I think I found the problem... keep in mind I'm quite new to all this
> Hive/Hadoop stuff :)
>
> I think my problem was due to the fact that the create table statement had
> the partition defined but the information was not partitioned on the file
> system (it was just 1 file inside a folder).
>
> I'm guessing that what I have to do is load the data into a
> non-partitioned table and then copy the information using Hive,
> dynamically partitioning the data in the same query... is that right?
>
> Thanks again!
>
>
> On Fri, Dec 14, 2012 at 1:22 PM, Dean Wampler <
> [EMAIL PROTECTED]> wrote:
>
>> A couple of clarifying questions and suggestions. First, keep in mind
>> that Hive doesn't care if you have a typo of some kind in your external
>> location ;) Use DESCRIBE FORMATTED to verify the path is right. For an
>> external partitioned table, DESCRIBE FORMATTED table
>> PARTITION(col1=val1,col2=val2,...).
>>
>> A dumb mistake I've often made is to use a variable in a script, e.g., "...
>> LOCATION '${DATA}/foo/bar/baz';" and forget to define DATA when invoking
*Dean Wampler, Ph.D.*
thinkbiganalytics.com
+1-312-339-1330