Home | About | Sematext search-lucene.com search-hadoop.com
 Search Hadoop and all its subprojects:

Switch to Threaded View
Hive, mail # user - Hive skewed tables

Copy link to this message
Hive skewed tables
Rajesh Balamohan 2013-11-14, 01:05
Hi All,

I have the following skewed table "addresses_1"

select id, count(*) c from addresses_1 group by id order by c desc limit 10;
142624653    1554806
198477395    958492
102641838    220181
138947865    211331
156483436    193429
96411677    179771
210082076    168033
800174765    152421
139116901    141207
704352025    137263

I was able to create the following table with the skew information.  And I
was able to load the data into the table as well.

 CREATE  TABLE skew_addresses_1(
  id bigint,
  address_id bigint,
  address_lines string,
  city string,
  state string,
  postal_code string,
  country string,
  latitude string,
  longitude string,
  ) PARTITIONED BY (dateTS string) SKEWED BY (id) ON (142624653, 198477395,
102641838, 138947865, 156483436, 96411677, 210082076, 800174765, 139116901,
   stored as rcfile;

select id,count(*) c from skew_addresses_1 where id=142624653 group by id
order by c limit 10;

*However, at the time of running select query,  entire dataset is
scanned. *I thought only the relevant dataset (with skew information
will be
scanned).  Am I missing anything here?  Any help will be appreciated.  I am
using Hive 10.x

I have enabled hive.optimize.skewjoin.compiletime=true and I can see the
skew information populated in SKEWED_COL_NAMES in metadata.  But there is
no information in SKEWED_COL_VALUE_LOC_MAP table.