Home | About | Sematext search-lucene.com search-hadoop.com
 Search Hadoop and all its subprojects:

Switch to Threaded View
Hive >> mail # user >> Hive skewed tables


Copy link to this message
-
Re: Hive skewed tables
I mentioned that as it scanned all files based on hdfs bytes read.. Table
is not compressed and hdfs bytes read matched the data size in the
partition.

I had bucketing enabled.   But somehow when I joined with another table it
had long tail issue where most of the data went to single reducer.   Hence
I thought of using skewed join..
On Nov 14, 2013 11:33 AM, "Nitin Pawar" <[EMAIL PROTECTED]> wrote:

> how did u check its looking at all files inside the partition?
> If you want more restriction on limit on filse to be accessed, you can
> bucket them as well. That way you really dont have to worry about which
> data is skewed and let the framework handle it.
>
>
>
> On Thu, Nov 14, 2013 at 11:16 AM, Rajesh Balamohan <
> [EMAIL PROTECTED]> wrote:
>
>> Thanks Nitin.   I have only one partition in this table for testing.   I
>> thought within the partition it will scan only certain files based on
>> skewed fields.   However it is scanning the entire data within the
>> partition.
>>  On Nov 14, 2013 9:38 AM, "Nitin Pawar" <[EMAIL PROTECTED]> wrote:
>>
>>> In my understanding,
>>> when you are saying scanning entire dataset it is looking at all your
>>> partitions because your data has been partitioned by the date column.
>>>
>>> A skewed table is a table where there will be different files created
>>> for all your skewed keys in all the partitions.
>>> So for your query it will look at all partitions.
>>>
>>> The setting you have kept is only applicable to join queries as it
>>> clearly says skewjoin.  Non join queries it does not have an affect.
>>>
>>>
>>> Thanks,
>>> Nitin
>>>
>>>
>>>
>>>
>>> On Thu, Nov 14, 2013 at 6:35 AM, Rajesh Balamohan <
>>> [EMAIL PROTECTED]> wrote:
>>>
>>>> Hi All,
>>>>
>>>> I have the following skewed table "addresses_1"
>>>>
>>>> select id, count(*) c from addresses_1 group by id order by c desc
>>>> limit 10;
>>>> 142624653    1554806
>>>> 198477395    958492
>>>> 102641838    220181
>>>> 138947865    211331
>>>> 156483436    193429
>>>> 96411677    179771
>>>> 210082076    168033
>>>> 800174765    152421
>>>> 139116901    141207
>>>> 704352025    137263
>>>>
>>>> I was able to create the following table with the skew information.
>>>> And I was able to load the data into the table as well.
>>>>
>>>>  CREATE  TABLE skew_addresses_1(
>>>>   id bigint,
>>>>   address_id bigint,
>>>>   address_lines string,
>>>>   city string,
>>>>   state string,
>>>>   postal_code string,
>>>>   country string,
>>>>   latitude string,
>>>>   longitude string,
>>>>   ) PARTITIONED BY (dateTS string) SKEWED BY (id) ON (142624653,
>>>> 198477395, 102641838, 138947865, 156483436, 96411677, 210082076, 800174765,
>>>> 139116901, 704352025)
>>>>    stored as rcfile;
>>>>
>>>> select id,count(*) c from skew_addresses_1 where id=142624653 group by
>>>> id order by c limit 10;
>>>>
>>>> *However, at the time of running select query,  entire dataset is
>>>> scanned. * I thought only the relevant dataset (with skew information
>>>> will be scanned).  Am I missing anything here?  Any help will be
>>>> appreciated.  I am using Hive 10.x
>>>>
>>>> I have enabled hive.optimize.skewjoin.compiletime=true and I can see
>>>> the skew information populated in SKEWED_COL_NAMES in metadata.  But there
>>>> is no information in SKEWED_COL_VALUE_LOC_MAP table.
>>>>
>>>>
>>>> --
>>>> ~Rajesh.B
>>>>
>>>
>>>
>>>
>>> --
>>> Nitin Pawar
>>>
>>
>
>
> --
> Nitin Pawar
>