Home | About | Sematext search-lucene.com search-hadoop.com
 Search Hadoop and all its subprojects:

Switch to Threaded View
Hive >> mail # user >> Hive skewed tables


Copy link to this message
-
Re: Hive skewed tables
Thanks Nitin.   I have only one partition in this table for testing.   I
thought within the partition it will scan only certain files based on
skewed fields.   However it is scanning the entire data within the
partition.
On Nov 14, 2013 9:38 AM, "Nitin Pawar" <[EMAIL PROTECTED]> wrote:

> In my understanding,
> when you are saying scanning entire dataset it is looking at all your
> partitions because your data has been partitioned by the date column.
>
> A skewed table is a table where there will be different files created for
> all your skewed keys in all the partitions.
> So for your query it will look at all partitions.
>
> The setting you have kept is only applicable to join queries as it clearly
> says skewjoin.  Non join queries it does not have an affect.
>
>
> Thanks,
> Nitin
>
>
>
>
> On Thu, Nov 14, 2013 at 6:35 AM, Rajesh Balamohan <
> [EMAIL PROTECTED]> wrote:
>
>> Hi All,
>>
>> I have the following skewed table "addresses_1"
>>
>> select id, count(*) c from addresses_1 group by id order by c desc limit
>> 10;
>> 142624653    1554806
>> 198477395    958492
>> 102641838    220181
>> 138947865    211331
>> 156483436    193429
>> 96411677    179771
>> 210082076    168033
>> 800174765    152421
>> 139116901    141207
>> 704352025    137263
>>
>> I was able to create the following table with the skew information.  And
>> I was able to load the data into the table as well.
>>
>>  CREATE  TABLE skew_addresses_1(
>>   id bigint,
>>   address_id bigint,
>>   address_lines string,
>>   city string,
>>   state string,
>>   postal_code string,
>>   country string,
>>   latitude string,
>>   longitude string,
>>   ) PARTITIONED BY (dateTS string) SKEWED BY (id) ON (142624653,
>> 198477395, 102641838, 138947865, 156483436, 96411677, 210082076, 800174765,
>> 139116901, 704352025)
>>    stored as rcfile;
>>
>> select id,count(*) c from skew_addresses_1 where id=142624653 group by id
>> order by c limit 10;
>>
>> *However, at the time of running select query,  entire dataset is
>> scanned. * I thought only the relevant dataset (with skew information
>> will be scanned).  Am I missing anything here?  Any help will be
>> appreciated.  I am using Hive 10.x
>>
>> I have enabled hive.optimize.skewjoin.compiletime=true and I can see the
>> skew information populated in SKEWED_COL_NAMES in metadata.  But there is
>> no information in SKEWED_COL_VALUE_LOC_MAP table.
>>
>>
>> --
>> ~Rajesh.B
>>
>
>
>
> --
> Nitin Pawar
>