Hive, mail # user - Performance Issues in Hive with S3 and Partitions


richin.jain@... 2012-07-24, 03:33
Igor Tatarinov 2012-07-24, 04:37
richin.jain@... 2012-07-24, 15:47
Edward Capriolo 2012-07-24, 15:52
richin.jain@... 2012-07-27, 18:42

Re: Performance Issues in Hive with S3 and Partitions
Edward Capriolo 2012-07-27, 19:02
Use a different partitioning scheme or consider using clustered /
bucketed tables.
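
As a rough sketch of the bucketing suggestion: the statements below copy the existing `logs` table into a bucketed table so that the data ends up in a small, fixed number of larger files. The table name `logs_bucketed`, the STRING column types, and the bucket count of 16 are assumptions for illustration, not details from this thread.

```sql
-- Hypothetical target table: name, column types, and bucket count are assumptions.
CREATE TABLE logs_bucketed (
  id STRING,
  ts STRING
)
CLUSTERED BY (id) INTO 16 BUCKETS;

-- Ask Hive to honor the bucket definition on insert
-- (one reducer, and therefore one output file, per bucket).
SET hive.enforce.bucketing=true;

-- Rewrite the many small files as 16 bucket files.
INSERT OVERWRITE TABLE logs_bucketed
SELECT id, ts
FROM logs;
```

Because the example query groups by `id`, clustering on `id` is a natural choice here; a different query mix might call for a different bucketing column or a single coarser partition level instead.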

On 7/27/12, [EMAIL PROTECTED] <[EMAIL PROTECTED]> wrote:
> Igor,
>
> I did not see any major improvement in performance even after setting
> "hive.optimize.s3.query=true", although the AWS team suggested the same.
>
> My problem is that I have too many small files: 3 levels of partitions,
> 6500+ files, and each file is < 1 MB.
> I know Hadoop and HDFS are not meant to deal with lots of small files,
> but if that is the way to go, is there any workaround?
>
> Thanks,
> Richin
>
> From: Jain Richin (Nokia-LC/Boston)
> Sent: Tuesday, July 24, 2012 11:49 AM
> To: [EMAIL PROTECTED]
> Subject: RE: Performance Issues in Hive with S3 and Partitions
>
> Hi Igor,
>
> Thanks for the response. Yes I am using EMR.
> I will make changes and let you know if that helps.
>
> Richin
>
> From: ext Igor Tatarinov [mailto:[EMAIL PROTECTED]]
> Sent: Tuesday, July 24, 2012 12:38 AM
> To: [EMAIL PROTECTED]
> Subject: Re: Performance Issues in Hive with S3 and Partitions
>
> Are you using EMR?
> Have you tried setting
> hive.optimize.s3.query=true
>
> as mentioned in
> http://docs.amazonwebservices.com/ElasticMapReduce/latest/DeveloperGuide/emr-hive-version-details.html
>
> I haven't tried using that option myself. I am curious if it helps in your
> scenario. The above page also mentions another fix that's supposed to help
> with partitioned tables. Optimizing queries with thousands of input files
> used to take a lot of time. But it looks like that fix is enabled by default
> now.
>
> Just in case, also check your JVM reuse option. If it's too low, performance
> will suffer. I had it set to 3 to avoid running out of memory. Using the
> default value of 20 really helps when reading lots of small files.
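
For reference, the two settings discussed above could be applied at the start of a Hive session roughly as follows. This is a sketch, not a tested recipe: `hive.optimize.s3.query` is the EMR-specific option named in the message, while `mapred.job.reuse.jvm.num.tasks` is the classic Hadoop MapReduce property for JVM reuse and is an assumption here, since the message does not name the property.

```sql
-- EMR-specific S3 optimization mentioned in the message.
SET hive.optimize.s3.query=true;

-- JVM reuse: how many tasks one JVM may run before it is replaced.
-- Property name is an assumption (classic Hadoop MapReduce setting);
-- 20 is the value the message describes as the default.
SET mapred.job.reuse.jvm.num.tasks=20;
```

Lowering the reuse count trades throughput on many small files for a smaller memory footprint per JVM, which matches the trade-off described in the message.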
>
> igor
> decide.com
> On Mon, Jul 23, 2012 at 8:33 PM, [EMAIL PROTECTED] wrote:
> Hi,
>
> Sorry, this is an AWS-specific Hive question. I have two external Hive
> tables for my custom logs:
>
> 1. Flat directory structure on AWS S3, no partitions, files in bz2
> compressed format (a few big files)
>
> 2. 3 levels of partitions on AWS S3 (lots of small uncompressed files)
>
> I noticed that my queries on the partitioned table are taking forever to
> run. The same queries run fine and finish quickly on the table with no
> partitions.
> Am I missing something? I suspect this has something to do with the way S3
> behaves.
>
> An example query is:
>
> select id, (max(unix_timestamp(ts, "MM/dd/yyyy HH:mm")) -
> min(unix_timestamp(ts, "MM/dd/yyyy HH:mm")))/(60*60)
> from logs
> group by id;
>
> Thanks,
> Richin
>
>
Bejoy Ks 2012-07-27, 19:05
richin.jain@... 2012-07-27, 19:09
Connell, Chuck 2012-07-27, 19:39