Hive >> mail # user >> Partition performance


Re: Partition performance
There is only one map task because it's using the CombineHiveInputFormat (In my test cases, all files are very small). If I set hive.input.format to HiveInputFormat, then it has 336 map tasks in the first case. But the performance is even worse since there are too many map tasks and each one is only handling a small file.
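
A minimal sketch of how the two input formats can be switched per session for comparison. The class names are the standard Hive ones; the table name test1 is an assumption taken from the alias in the log line quoted further down, and the query is only a placeholder.

```sql
-- Combine small files into as few splits as possible (one map task in this test).
set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
SELECT COUNT(*) FROM test1;   -- placeholder query, table name assumed

-- One split per file (336 map tasks in the first case described above).
set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;
SELECT COUNT(*) FROM test1;   -- same placeholder query
```

Running the same query under both settings separates the cost of launching many small map tasks from any time spent before the job is submitted.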
 
It takes a lot of time before it actually submits the job. So maybe querying the metastore for partition info takes time?
 

________________________________
 From: Ramki Palle <[EMAIL PROTECTED]>
To: [EMAIL PROTECTED]; Ian <[EMAIL PROTECTED]>
Sent: Friday, April 5, 2013 1:12 PM
Subject: Re: Partition performance
  
Can you tell how many map tasks are there in each scenario?

If my assumption is correct, you should have 336 in the first case and 14 in second case.

It looks like it is combining all the small files in a folder and running one map task for all 24 files in that folder, whereas it runs a separate task for each file when the files are in different partitions (folders).
You can try to reuse the JVM and see if the response time is similar.

Can you please try the following and let us know how long each strategy takes?
hive> set mapred.job.reuse.jvm.num.tasks = 24;
Run your query that has more partitions and see if the response time is lower.
Regards,

Ramki.
On Fri, Apr 5, 2013 at 11:36 AM, Ian <[EMAIL PROTECTED]> wrote:

Thanks. This is just a test from my local box. So each file is only 1kb. I shared the query plans of these two tests at:
>http://codetidy.com/paste/raw/5198
>http://codetidy.com/paste/raw/5199

>Also in the Hadoop log, there is this line for each partition: org.apache.hadoop.hive.ql.exec.MapOperator: Adding alias test1 to work list for file hdfs://localhost:8020/test1/2011/02/01/01
>Does that mean each partition will become a map task?
>
>I'm still new to Hive, and I'm wondering what the common strategy is for partitioning hourly logs. I know we shouldn't have too many partitions, but what's the reason behind it? If I run this on a real cluster, maybe the two cases won't perform so differently?
>
>Thanks. 
> From: Dean Wampler <[EMAIL PROTECTED]>
>To: [EMAIL PROTECTED]
>Sent: Thursday, April 4, 2013 4:28 PM
>Subject: Re: Partition performance
>
>
>
>Also, how big are the files in each directory? Are they roughly the size of one HDFS block, or a multiple of it? Lots of small files will mean lots of mapper tasks with little to do.
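
One way to check the file sizes without leaving the Hive CLI is its dfs command. A small sketch, assuming the directory layout shown in the log line elsewhere in this thread; substitute your own paths.

```sql
-- List the files in one hourly partition directory (path taken from the log line in this thread).
dfs -ls hdfs://localhost:8020/test1/2011/02/01/01;

-- Show the size of each hourly directory under one day.
dfs -du hdfs://localhost:8020/test1/2011/02/01;
```

The comments are for the reader; if you type the commands interactively, enter only the dfs lines. Sizes far below one HDFS block point to the small-files situation described above.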
>
>
>You can also compare the job tracker console output for each job. I bet the slow one has a lot of very short map and reduce tasks, while the faster one has fewer tasks that run longer. A rule of thumb is that any one task should run for 20 seconds or more, so that the few seconds spent on start-up per task are amortized.
>
>
>In other words, if you think about what's happening at the HDFS and MR level, you can learn to predict how fast or slow things will run. Learning to read the output of EXPLAIN or EXPLAIN EXTENDED helps with this.
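
As an illustration of the point about reading plans, the two forms look like this. The table name test1 is assumed from the alias in the log line in this thread, and the query itself is a placeholder.

```sql
-- Stage-level plan: which map/reduce stages run and which operators they contain.
EXPLAIN
SELECT COUNT(*) FROM test1;

-- Same plan with more detail, including the input paths and partitions each stage reads.
EXPLAIN EXTENDED
SELECT COUNT(*) FROM test1;
```

Comparing the extended output of the two test queries should show how many partition directories each one has to touch.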
>
>
>dean
>
>
>On Thu, Apr 4, 2013 at 6:25 PM, Owen O'Malley <[EMAIL PROTECTED]> wrote:
>
>See slide #9 from my Optimizing Hive Queries talk http://www.slideshare.net/oom65/optimize-hivequeriespptx . Certainly, we will improve it, but for now you are much better off with 1,000 partitions than 10,000.
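
To make the partition-count trade-off concrete for hourly logs, here is a hedged sketch of two layouts. All table and column names are hypothetical, not taken from the thread. One year of data gives 24 x 365 = 8,760 partitions with the hourly layout and 365 with the daily one.

```sql
-- Hourly layout: one partition per hour, about 8,760 partitions per year.
CREATE EXTERNAL TABLE logs_hourly (line STRING)
PARTITIONED BY (dt STRING, hr STRING)
LOCATION '/logs_hourly';      -- hypothetical location

-- Daily layout: one partition per day, 365 partitions per year.
-- The hour is kept inside the data (here as part of a timestamp column).
CREATE EXTERNAL TABLE logs_daily (ts STRING, line STRING)
PARTITIONED BY (dt STRING)
LOCATION '/logs_daily';       -- hypothetical location

-- A single day can still be selected by partition pruning on dt.
SELECT COUNT(*) FROM logs_daily WHERE dt = '2011-02-01';
```

With the daily layout a query for a single hour scans the whole day, so this only pays off when a day's worth of data is small enough to read cheaply.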
>>
>>-- Owen
>>
>>
>>
>>On Thu, Apr 4, 2013 at 4:21 PM, Ramki Palle <[EMAIL PROTECTED]> wrote:
>>
>>Is it possible for you to send the explain plan of these two queries?
>>>
>>>Regards,
>>>Ramki.
>>>
>>>
>>>
>>>
>>>On Thu, Apr 4, 2013 at 4:06 PM, Sanjay Subramanian <[EMAIL PROTECTED]> wrote:
>>>
>>>The slowdown is most likely due to the large number of partitions.
>>>>I believe the Hive book authors tell us to be cautious with a large number of partitions :-) and I abide by that.
>>>>
>>>>
>>>>Users
>>>>Please add your points of view and experiences
>>>>
>>>>
>>>>Thanks
>>>>sanjay
>>>>
>>>> From: Ian <[EMAIL PROTECTED]>
>>>>Reply-To: "[EMAIL PROTECTED]" <[EMAIL PROTECTED]>, Ian <[EMAIL PROTECTED]>