Hive, mail # user - confused on different behavior of Bucketized tables do not support INSERT INTO


Re: confused on different behavior of Bucketized tables do not support INSERT INTO
Mark Grover 2012-06-01, 04:26
Hi Bruce,
It seems to me that your issue may be two-fold.

1) The JIRA that introduced "insert into <table>" (https://issues.apache.org/jira/browse/HIVE-306) in Hive 0.8 only works for non-bucketed tables or partitions. This is because appending into a table (or its partition, if it's partitioned) works by adding a new file within the folder on HDFS. Since bucketing in a partition is implemented as files within a folder, this would break bucketing. To allow insert into support for bucketed tables, we will probably have to regenerate the entire bucket(s) within the partition (a rough sketch of that kind of manual regeneration follows after the summary below). Not sure if this is being done or has already been done in a newer release.

2) It seems like there was a minor bug in the implementation. As mentioned in this ticket (https://issues.apache.org/jira/browse/HIVE-3064), "insert into <table>" works as "insert overwrite <table>" if the table name in the insert statement has upper case characters (black magic, eh?). Regardless of whether your table was created using upper or lower case letters, this bug manifests itself if you use upper case characters for the table name in your insert into query.
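For illustration only (a hypothetical example of the HIVE-3064 behaviour on affected versions; the table t and the source table src here are made up, not from your cluster):

create table t (foo int, bar string);
insert into table t select * from src;   -- appends, as intended
insert into table T select * from src;   -- same table, but the upper case name makes it behave like insert overwrite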

To summarize, you shouldn't be calling insert into on bucketed tables for the reasons stated above. Also, if you are using the insert into command, try to restrict your table names to lower case letters.
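Regarding point 1, one way to get append-like behaviour on a bucketed table is to regenerate the buckets yourself with an overwrite. A rough sketch (assuming a staging table new_rows with the same (foo int, bar string) schema as the test tables further down in this thread; Hive does not do this for you):

set hive.enforce.bucketing=true;
set hive.enforce.sorting=true;
-- rewrite the whole bucketed table from its current contents plus the new rows
insert overwrite table test2
select * from (
  select foo, bar from test2
  union all
  select foo, bar from new_rows
) t;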

I have also verified that "insert into table dummy select * from dummy" works when dummy is a non-bucketed table. As a matter of fact, "insert overwrite table dummy select * from dummy" works too.
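In case it's useful, that check looks roughly like this (a sketch, assuming the same (foo int, bar string) columns and the kv1.txt sample file used further down in this thread):

create table dummy (foo int, bar string);          -- no clustered by, so not bucketed
load data local inpath 'hive/examples/files/kv1.txt' into table dummy;
insert into table dummy select * from dummy;       -- appends, row count doubles
insert overwrite table dummy select * from dummy;  -- also allowed, the table is rewritten in place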

I would have expected your query "insert into table test2 select * from test;" to fail since test2 is bucketed as well. However, it doesn't. This seems to be a bug and I have created a JIRA for this (https://issues.apache.org/jira/browse/HIVE-3077).

Hope that helps,
Mark

----- Original Message -----
From: "Bruce Bian" <[EMAIL PROTECTED]>
To: [EMAIL PROTECTED]
Sent: Thursday, May 31, 2012 10:16:03 AM
Subject: Re: confused on different behavior of Bucketized tables do not support INSERT INTO

So I did another test on this.

hive> create table test(foo int,bar string) clustered by(foo) sorted by (foo asc) into 2 buckets;
OK
Time taken: 0.097 seconds
hive> create table test2 (foo int,bar string) clustered by(foo) sorted by (foo asc) into 2 buckets;
OK

hive> LOAD DATA LOCAL INPATH 'hive/examples/files/kv1.txt' OVERWRITE INTO TABLE test;

hive> set hive.enforce.bucketing=true;
hive> set hive.enforce.sorting=true;

hive> insert into table test2 select * from test;
Total MapReduce jobs = 1
Launching Job 1 out of 1
……

hive> insert into table test2 select * from test2;
FAILED: Error in semantic analysis: Bucketized tables do not support INSERT INTO: Table: test2
Seems like the error "FAILED: Error in semantic analysis: Bucketized tables do not support INSERT INTO" is only thrown when inserting into a bucketized table from the same table? And when inserting into a bucketized table multiple times, it creates an <original_file>_copy_n under the same bucket:

-rw-r--r-- 3 wbian supergroup 2856 2012-05-31 22:03 /user/hive/warehouse/test2/000000_0
-rw-r--r-- 3 wbian supergroup 2856 2012-05-31 22:04 /user/hive/warehouse/test2/000000_0_copy_1
-rw-r--r-- 3 wbian supergroup 2956 2012-05-31 22:03 /user/hive/warehouse/test2/000001_0
-rw-r--r-- 3 wbian supergroup 2956 2012-05-31 22:04 /user/hive/warehouse/test2/000001_0_copy_1
And since what I want to do is an SMB Map Join, the following triggered the SMB Map Join successfully:

set hive.optimize.bucketmapjoin= true;
set hive.optimize.bucketmapjoin.sortedmerge = true;
set hive.input.format=org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;
select /*+mapjoin(test)*/ * from pokes join test on pokes.foo=test.foo;
So what's the reason for throwing that error (I mean, why not support insert into a bucketized table from the same table)? And isn't that error message kind of misleading?
On Thu, May 31, 2012 at 6:43 PM, Bruce Bian < [EMAIL PROTECTED] > wrote:
I'm using hive 0.9.0

On Thursday, May 31, 2012, Bruce Bian wrote:
Hi,
I've got a table vt_new_data which is defined as follows:

CREATE TABLE VT_NEW_DATA
(
V_ACCOUNT_NUM string
,V_ACCOUNT_MODIFIER_NUM string
,V_DEPOSIT_TYPE_CD string
,V_DEPOSIT_TERM int
,V_LEDGER_SUBJECT_ID string
,V_ACCOUNTING_ORG_CD string
,V_OPEN_DT string
,V_CLOSE_DT string
,V_CURRENCY_CD string
,V_ACCOUNT_BAL float
,V_INNER_MONTH_DELAY_ACCUM float
) CLUSTERED BY (V_ACCOUNT_NUM,V_ACCOUNT_MODIFIER_NUM) SORTED BY (V_ACCOUNT_NUM,V_ACCOUNT_MODIFIER_NUM ASC) INTO 256 BUCKETS STORED AS RCFile;
when I execute the following query
explain insert into table vt_new_data select * from vt_new_data limit 1; (this is just a test)
a "FAILED: Error in semantic analysis: Bucketized tables do not support INSERT INTO: Table: vt_new_data" error occurs,
but when I execute the query:

explain insert into table vt_new_data
select /*+ MAPJOIN(T4) */
t1.account_num as v_account_num
,t1.account_modifier_num as v_account_modifier_num
,'3006' as v_deposit_type_cd
,0 as v_deposit_term
,'23201000' v_ledger_subject_id
,coalesce(t2.party_id,'') as v_accounting_org_cd
,coalesce(t3.card_begin_dt,'19000101') as v_open_dt
,coalesce(t3.card_live_dt,'19000101') as v_close_dt
,coalesce(t4.currency_cd,substr(t1.account_modifier_num,3,3)) as v_currency_cd
,coalesce(t4.agt_amt,0) as v_account_bal
,0 as v_inner_month_delay_accum
from t03_e_cash_bucket t1
left outer join t03_agt_amount_h_bucket t4
on t1.account_num=t4.account_num
and t1.account_modifier_num=t4.account_modifier_num
and t4.agt_amt_type_cd = '001'
and t4.start_date<='$TXNDATE'
and t4.end_date>'$TXNDATE'
left outer join t01_party_card_rela_h_bucket t2
on t1.card_no = t2.card_no
and t2.party_card_rela_type_cd = '01'
and t2.start_date<='$TXNDATE'
and t2.end_date>'$TXNDATE'
left outer join t03_card_bucket t3
on t1.card_no = t3.card_no;
the