Home | About | Sematext search-lucene.com search-hadoop.com
 Search Hadoop and all its subprojects:

Switch to Plain View
Hive, mail # user - Partition performance


+
Ian 2013-04-04, 23:01
+
Sanjay Subramanian 2013-04-04, 23:06
Copy link to this message
-
Re: Partition performance
Ramki Palle 2013-04-04, 23:21
Is it possible for you to send the explain plan of these two queries?

Regards,
Ramki.
On Thu, Apr 4, 2013 at 4:06 PM, Sanjay Subramanian <
[EMAIL PROTECTED]> wrote:

>  The slow down is most possibly due to large number of partitions.
> I believe the Hive book authors tell us to be cautious with large number
> of partitions :-)  and I abide by that.
>
>  Users
> Please add your points of view and experiences
>
>  Thanks
> sanjay
>
>   From: Ian <[EMAIL PROTECTED]>
> Reply-To: "[EMAIL PROTECTED]" <[EMAIL PROTECTED]>, Ian <
> [EMAIL PROTECTED]>
> Date: Thursday, April 4, 2013 4:01 PM
> To: "[EMAIL PROTECTED]" <[EMAIL PROTECTED]>
> Subject: Partition performance
>
>   Hi,
>
> I created 3 years of hourly log files (totally 26280 files), and use
> External Table with partition to query. I tried two partition methods.
>
> 1). Log files are stored as /test1/2013/04/02/16/000000_0 (A directory per
> hour). Use date and hour as partition keys. Add 3 years of directories to
> the table partitions. So there are 26280 partitions.
>         CREATE EXTERNAL TABLE test1 (logline string) PARTITIONED BY (dt
> string, hr int);
>         ALTER TABLE test1 ADD PARTITION (dt='2013-04-02', hr=16) LOCATION
> '/test1/2013/04/02/16';
>
> 2). Log files are stored as /test2/2013/04/02/16_000000_0 (A directory per
> day, 24 files in each directory). Use date as partition key. Add 3 years of
> directories to the table partitions. So there are 1095 partitions.
>          CREATE EXTERNAL TABLE test2 (logline string) PARTITIONED BY (dt
> string);
>         ALTER TABLE test2 ADD PARTITION (dt='2013-04-02') LOCATION
> '/test2/2013/04/02';
>
> When doing a simple query like
>     SELECT * FROM  test1/test2  WHERE  dt >= '2013-02-01' and dt <> '2013-02-14'
>  Using approach #1 takes 320 seconds, but #2 only takes 70 seconds.
>
> I'm wondering why there is a big performance difference between these two?
> These two approaches have the same number of files, only the directory
> structure is different. So Hive is going to load the same amount of files.
> Why does the number of partitions have such big impact? Does that mean #2
> is a better partition strategy?
>
> Thanks.
>
>
>
> CONFIDENTIALITY NOTICE
> =====================> This email message and any attachments are for the exclusive use of the
> intended recipient(s) and may contain confidential and privileged
> information. Any unauthorized review, use, disclosure or distribution is
> prohibited. If you are not the intended recipient, please contact the
> sender by reply email and destroy all copies of the original message along
> with any attachments, from your computer system. If you are the intended
> recipient, please be advised that the content of this message is subject to
> access, review and disclosure by the sender's Email System Administrator.
>
+
Owen OMalley 2013-04-04, 23:25
+
Dean Wampler 2013-04-04, 23:28
+
Ian 2013-04-05, 18:36
+
Ramki Palle 2013-04-05, 20:12
+
Ian 2013-04-11, 22:25
+
Peter Marron 2013-07-02, 09:34
+
Owen OMalley 2013-07-02, 14:51
+
David Morel 2013-07-03, 12:19
+
Edward Capriolo 2013-07-03, 14:22
+
Owen OMalley 2013-07-03, 14:56
+
Peter Marron 2013-07-04, 07:37
+
Peter Marron 2013-07-04, 09:25
+
Dean Wampler 2013-07-03, 13:51