In my understanding,
when you are saying scanning entire dataset it is looking at all your
partitions because your data has been partitioned by the date column.
A skewed table is a table where there will be different files created for
all your skewed keys in all the partitions.
So for your query it will look at all partitions.
The setting you have kept is only applicable to join queries as it clearly
says skewjoin. Non join queries it does not have an affect.
On Thu, Nov 14, 2013 at 6:35 AM, Rajesh Balamohan <
[EMAIL PROTECTED]> wrote:
> Hi All,
> I have the following skewed table "addresses_1"
> select id, count(*) c from addresses_1 group by id order by c desc limit
> 142624653 1554806
> 198477395 958492
> 102641838 220181
> 138947865 211331
> 156483436 193429
> 96411677 179771
> 210082076 168033
> 800174765 152421
> 139116901 141207
> 704352025 137263
> I was able to create the following table with the skew information. And I
> was able to load the data into the table as well.
> CREATE TABLE skew_addresses_1(
> id bigint,
> address_id bigint,
> address_lines string,
> city string,
> state string,
> postal_code string,
> country string,
> latitude string,
> longitude string,
> ) PARTITIONED BY (dateTS string) SKEWED BY (id) ON (142624653,
> 198477395, 102641838, 138947865, 156483436, 96411677, 210082076, 800174765,
> 139116901, 704352025)
> stored as rcfile;
> select id,count(*) c from skew_addresses_1 where id=142624653 group by id
> order by c limit 10;
> *However, at the time of running select query, entire dataset is
> scanned. * I thought only the relevant dataset (with skew information
> will be scanned). Am I missing anything here? Any help will be
> appreciated. I am using Hive 10.x
> I have enabled hive.optimize.skewjoin.compiletime=true and I can see the
> skew information populated in SKEWED_COL_NAMES in metadata. But there is
> no information in SKEWED_COL_VALUE_LOC_MAP table.