Re: Performance Issues in Hive with S3 and Partitions
Edward Capriolo 2012-07-27, 19:02
Use a different partitioning scheme or consider using clustered /
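
As a rough sketch of what a coarser, clustered layout could look like (the table name logs_bucketed, the dt partition column, the column types, the bucket count, and the date filter are all assumptions for illustration; only logs, id, and ts appear in this thread):

```sql
-- Hypothetical layout: one coarse partition key instead of three levels,
-- bucketed by id so each partition holds a fixed number of larger files.
CREATE TABLE logs_bucketed (
  id STRING,
  ts STRING
)
PARTITIONED BY (dt STRING)
CLUSTERED BY (id) INTO 16 BUCKETS;

-- Asks Hive to write one file per bucket on insert
-- (needed on Hive releases of this era; an assumption for your version).
SET hive.enforce.bucketing=true;

-- Rewrite one day's worth of small files into 16 bucket files.
-- The filter assumes ts strings look like "MM/dd/yyyy HH:mm", as in the query below.
INSERT OVERWRITE TABLE logs_bucketed PARTITION (dt = '2012-07-23')
SELECT id, ts
FROM logs
WHERE ts LIKE '07/23/2012%';
```

With 16 buckets per day, a query reads a handful of larger files per partition rather than thousands of sub-megabyte ones; the right bucket count depends on your data volume.
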
On 7/27/12, [EMAIL PROTECTED] <[EMAIL PROTECTED]> wrote:
> I did not see any major improvement in performance even after setting
> "hive.optimize.s3.query=true", although the AWS team suggested it.
> My problem is that I have too many small files: 3 levels of partitions, 6500+
> files, and each file is < 1 MB.
> Now I know Hadoop and HDFS are not meant to deal with a lot of small files,
> but if that is the way to go, is there any workaround?
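
If the small files have to stay, one possible workaround is to let each map task read many files and to merge outputs whenever the data is rewritten. The following is a minimal sketch using stock Hive and Hadoop 1.x property names; the byte values are illustrative assumptions, and whether split combining behaves well against S3 on your EMR release is something to verify rather than take for granted.

```sql
-- Let one map task read several small files instead of one task per file.
SET hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
-- Upper bound on a combined split, in bytes (256 MB here, illustrative).
SET mapred.max.split.size=268435456;

-- When a job rewrites the data, merge small output files at the end.
SET hive.merge.mapfiles=true;
SET hive.merge.mapredfiles=true;
-- Trigger the merge step when the average output file is below 128 MB (illustrative).
SET hive.merge.smallfiles.avgsize=134217728;
```

These settings only reduce per-file task overhead; they do not reduce the time spent listing thousands of S3 objects across partitions.
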
> From: Jain Richin (Nokia-LC/Boston)
> Sent: Tuesday, July 24, 2012 11:49 AM
> To: [EMAIL PROTECTED]
> Subject: RE: Performance Issues in Hive with S3 and Partitions
> Hi Igor,
> Thanks for the response. Yes I am using EMR.
> I will make changes and let you know if that helps.
> From: ext Igor Tatarinov
> [mailto:[EMAIL PROTECTED]]
> Sent: Tuesday, July 24, 2012 12:38 AM
> To: [EMAIL PROTECTED]
> Subject: Re: Performance Issues in Hive with S3 and Partitions
> Are you using EMR?
> Have you tried setting
> as mentioned in
> I haven't tried using that option myself. I am curious if it helps in your
> scenario. The above page also mentions another fix that's supposed to help
> with partitioned tables. Optimizing queries with thousands of input files
> used to take a lot of time. But it looks like that fix is enabled by default
> Just in case, also check your JVM reuse option. If it's too low, performance
> will suffer. I had it set to 3 to avoid running out of memory. Using the
> default value of 20 really helps when reading lots of small files.
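
For reference, a sketch of checking and setting JVM reuse from the Hive session, assuming the Hadoop 1.x property name mapred.job.reuse.jvm.num.tasks is the option meant above:

```sql
-- Print the current value for this session.
SET mapred.job.reuse.jvm.num.tasks;

-- Reuse each task JVM for up to 20 tasks of the same job
-- (20 is the value mentioned above; -1 would mean no limit).
SET mapred.job.reuse.jvm.num.tasks=20;
```

Higher values save JVM startup time per small file, at the cost of memory staying allocated longer within each reused JVM.
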
> On Mon, Jul 23, 2012 at 8:33 PM,
> <[EMAIL PROTECTED]> wrote:
> Sorry, this is an AWS Hive-specific question. I have two external Hive
> tables for my custom logs:
> 1. Flat directory structure on AWS S3, no partitions, and files in bz2
> compressed format (a few big files)
> 2. 3 levels of partitions on AWS S3 (a lot of small uncompressed files)
> I noticed that my queries on the partitioned table are taking forever to
> run. The same queries run fine and finish up quickly on the table with no
> Am I missing something? I suspect this has something to do with the way S3
> A query example is:
> select id, (max(unix_timestamp(ts, "MM/dd/yyyy HH:mm")) -
> min(unix_timestamp(ts, "MM/dd/yyyy HH:mm")))/(60*60)
> from logs
> group by id;