Home | About | Sematext search-lucene.com search-hadoop.com
NEW: Monitor These Apps!
elasticsearch, apache solr, apache hbase, hadoop, redis, casssandra, amazon cloudwatch, mysql, memcached, apache kafka, apache zookeeper, apache storm, ubuntu, centOS, red hat, debian, puppet labs, java, senseiDB
 Search Hadoop and all its subprojects:

Switch to Threaded View
Hive >> mail # user >> Hive skewed tables


Copy link to this message
-
Re: Hive skewed tables
I mentioned that as it scanned all files based on hdfs bytes read.. Table
is not compressed and hdfs bytes read matched the data size in the
partition.

I had bucketing enabled.   But somehow when I joined with another table it
had long tail issue where most of the data went to single reducer.   Hence
I thought of using skewed join..
On Nov 14, 2013 11:33 AM, "Nitin Pawar" <[EMAIL PROTECTED]> wrote:

> how did u check its looking at all files inside the partition?
> If you want more restriction on limit on filse to be accessed, you can
> bucket them as well. That way you really dont have to worry about which
> data is skewed and let the framework handle it.
>
>
>
> On Thu, Nov 14, 2013 at 11:16 AM, Rajesh Balamohan <
> [EMAIL PROTECTED]> wrote:
>
>> Thanks Nitin.   I have only one partition in this table for testing.   I
>> thought within the partition it will scan only certain files based on
>> skewed fields.   However it is scanning the entire data within the
>> partition.
>>  On Nov 14, 2013 9:38 AM, "Nitin Pawar" <[EMAIL PROTECTED]> wrote:
>>
>>> In my understanding,
>>> when you are saying scanning entire dataset it is looking at all your
>>> partitions because your data has been partitioned by the date column.
>>>
>>> A skewed table is a table where there will be different files created
>>> for all your skewed keys in all the partitions.
>>> So for your query it will look at all partitions.
>>>
>>> The setting you have kept is only applicable to join queries as it
>>> clearly says skewjoin.  Non join queries it does not have an affect.
>>>
>>>
>>> Thanks,
>>> Nitin
>>>
>>>
>>>
>>>
>>> On Thu, Nov 14, 2013 at 6:35 AM, Rajesh Balamohan <
>>> [EMAIL PROTECTED]> wrote:
>>>
>>>> Hi All,
>>>>
>>>> I have the following skewed table "addresses_1"
>>>>
>>>> select id, count(*) c from addresses_1 group by id order by c desc
>>>> limit 10;
>>>> 142624653    1554806
>>>> 198477395    958492
>>>> 102641838    220181
>>>> 138947865    211331
>>>> 156483436    193429
>>>> 96411677    179771
>>>> 210082076    168033
>>>> 800174765    152421
>>>> 139116901    141207
>>>> 704352025    137263
>>>>
>>>> I was able to create the following table with the skew information.
>>>> And I was able to load the data into the table as well.
>>>>
>>>>  CREATE  TABLE skew_addresses_1(
>>>>   id bigint,
>>>>   address_id bigint,
>>>>   address_lines string,
>>>>   city string,
>>>>   state string,
>>>>   postal_code string,
>>>>   country string,
>>>>   latitude string,
>>>>   longitude string,
>>>>   ) PARTITIONED BY (dateTS string) SKEWED BY (id) ON (142624653,
>>>> 198477395, 102641838, 138947865, 156483436, 96411677, 210082076, 800174765,
>>>> 139116901, 704352025)
>>>>    stored as rcfile;
>>>>
>>>> select id,count(*) c from skew_addresses_1 where id=142624653 group by
>>>> id order by c limit 10;
>>>>
>>>> *However, at the time of running select query,  entire dataset is
>>>> scanned. * I thought only the relevant dataset (with skew information
>>>> will be scanned).  Am I missing anything here?  Any help will be
>>>> appreciated.  I am using Hive 10.x
>>>>
>>>> I have enabled hive.optimize.skewjoin.compiletime=true and I can see
>>>> the skew information populated in SKEWED_COL_NAMES in metadata.  But there
>>>> is no information in SKEWED_COL_VALUE_LOC_MAP table.
>>>>
>>>>
>>>> --
>>>> ~Rajesh.B
>>>>
>>>
>>>
>>>
>>> --
>>> Nitin Pawar
>>>
>>
>
>
> --
> Nitin Pawar
>
NEW: Monitor These Apps!
elasticsearch, apache solr, apache hbase, hadoop, redis, casssandra, amazon cloudwatch, mysql, memcached, apache kafka, apache zookeeper, apache storm, ubuntu, centOS, red hat, debian, puppet labs, java, senseiDB