Performance Issues in Hive with S3 and Partitions
richin.jain@... 2012-07-24, 03:33
Hi,
Sorry, this is an AWS-specific Hive question. I have two external Hive tables for my custom logs:
1. A flat directory structure on AWS S3, with no partitions and files in bz2 compressed format (a few big files).
2. Three levels of partitions on AWS S3 (a lot of small uncompressed files).
I noticed that my queries on the partitioned table take forever to run. The same queries run fine and finish quickly on the table with no partitions. Am I missing something? I suspect this has something to do with the way S3 behaves.
An example query is:
select id, (max(unix_timestamp(ts, "MM/dd/yyyy HH:mm")) - min(unix_timestamp(ts, "MM/dd/yyyy HH:mm")))/(60*60)
from logs
group by id;
Thanks, Richin
Re: Performance Issues in Hive with S3 and Partitions
Igor Tatarinov 2012-07-24, 04:37
Are you using EMR? Have you tried setting
hive.optimize.s3.query=true
as mentioned in http://docs.amazonwebservices.com/ElasticMapReduce/latest/DeveloperGuide/emr-hive-version-details.html
I haven't tried using that option myself. I am curious if it helps in your scenario. The above page also mentions another fix that's supposed to help with partitioned tables. Optimizing queries with thousands of input files used to take a lot of time. But it looks like that fix is enabled by default now.
Just in case, also check your JVM reuse option. If it's too low, performance will suffer. I had it set to 3 to avoid running out of memory. Using the default value of 20 really helps when reading lots of small files.
igor
decide.com
RE: Performance Issues in Hive with S3 and Partitions
richin.jain@... 2012-07-24, 15:47
Hi Igor,
Thanks for the response. Yes I am using EMR. I will make changes and let you know if that helps.
Richin
Re: Performance Issues in Hive with S3 and Partitions
Edward Capriolo 2012-07-24, 15:52
Generally, you cannot optimize this beyond a certain point. Hadoop tasks have startup and teardown overhead, so if your input format cannot pack the input files into fewer map tasks (as a combine input format does), the performance is not going to be great.
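To put rough numbers on the overhead described above, here is a back-of-the-envelope sketch in Python. The 2-second per-task overhead and the 64-files-per-split figure are illustrative assumptions, not measurements, and the helper names are hypothetical. The model also ignores parallelism, so it shows total work rather than wall-clock time.

```python
import math


def map_task_count(num_files: int, files_per_split: int = 1) -> int:
    """Number of map tasks when each split covers `files_per_split` small files."""
    return math.ceil(num_files / files_per_split)


def total_overhead_seconds(num_tasks: int, per_task_overhead_s: float) -> float:
    """Fixed startup/teardown cost summed over all tasks (ignores parallelism)."""
    return num_tasks * per_task_overhead_s


NUM_FILES = 6500            # file count reported later in this thread
PER_TASK_OVERHEAD_S = 2.0   # assumed value, purely illustrative

# One map task per small file versus an input format that combines files.
one_file_per_task = map_task_count(NUM_FILES)
combined = map_task_count(NUM_FILES, files_per_split=64)  # 64 is an assumption

print(one_file_per_task, total_overhead_seconds(one_file_per_task, PER_TASK_OVERHEAD_S))
print(combined, total_overhead_seconds(combined, PER_TASK_OVERHEAD_S))
```

Under these assumptions, combining files cuts the task count by roughly the number of files per split, and the fixed overhead shrinks in proportion.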
RE: Performance Issues in Hive with S3 and Partitions
richin.jain@... 2012-07-27, 18:42
Igor,
I did not see any major improvement in performance even after setting "hive.optimize.s3.query=true", although the AWS team suggested the same.
My problem is that I have too many small files: 3 levels of partitions, 6500+ files, and each file is < 1 MB. I know Hadoop and HDFS are not meant to deal with lots of small files, but if that is the way to go, is there any workaround?
Thanks, Richin
Re: Performance Issues in Hive with S3 and Partitions
Edward Capriolo 2012-07-27, 19:02
Use a different partitioning scheme or consider using clustered / bucketed tables.
Re: Performance Issues in Hive with S3 and Partitions
Bejoy Ks 2012-07-27, 19:05
Hi Richin,
I agree with Edward on this. You have to design your partitions in such a way that each partition holds at least an HDFS block's worth of data.
Regards, Bejoy KS
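The sizing rule above can be turned into a quick calculation. This is a minimal Python sketch, assuming a 64 MB block size (check dfs.block.size on your own cluster) and taking the 1 MB per-file upper bound mentioned in the thread as the average. The function names are hypothetical.

```python
import math

BLOCK_SIZE_MB = 64  # assumed HDFS block size; verify against your cluster


def files_needed_per_partition(avg_file_mb: float,
                               block_mb: float = BLOCK_SIZE_MB) -> int:
    """How many files of avg_file_mb must share a partition to reach one block."""
    return math.ceil(block_mb / avg_file_mb)


def max_useful_partitions(total_mb: float,
                          block_mb: float = BLOCK_SIZE_MB) -> int:
    """Upper bound on partitions if each should hold at least one block."""
    return max(1, int(total_mb // block_mb))


total_mb = 6500 * 1.0  # 6500 files at the 1 MB upper bound from the thread

print(files_needed_per_partition(1.0))
print(max_useful_partitions(total_mb))
```

With these assumed numbers, the whole data set supports on the order of a hundred partitions rather than thousands, which is why a coarser scheme is the usual advice.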
RE: Performance Issues in Hive with S3 and Partitions
richin.jain@... 2012-07-27, 19:09
Thanks, guys. I am changing my partitioning so that each partition holds a day's worth of data, which should be good enough for Hive to operate on.
Thanks, Richin
RE: Performance Issues in Hive with S3 and Partitions
Connell, Chuck 2012-07-27, 19:39
What about making your small files bigger, by ZIPping them together? Of course, you have to think about this carefully, so MapReduce can efficiently retrieve the files it needs without unzipping everything every time.
Chuck
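One way to act on this suggestion is to merge the small files of a partition into a single compressed file before uploading. The Python sketch below uses bz2 instead of ZIP, since bz2 is the format already used by the flat table in this thread. It assumes the logs are line-oriented text sitting in a local directory; the function name and the paths are hypothetical, and copying to and from S3 is left out.

```python
import bz2
from pathlib import Path


def merge_small_files(src_dir: Path, dest_file: Path) -> int:
    """Concatenate every regular file in src_dir into one bz2 file.

    Returns the number of files merged. Assumes line-oriented text logs,
    so plain concatenation keeps each record intact.
    """
    merged = 0
    with bz2.open(dest_file, "wb") as out:
        dest_resolved = dest_file.resolve()
        # Sort for a deterministic order; skip the output file itself
        # in case it lives inside src_dir.
        for path in sorted(src_dir.iterdir()):
            if not path.is_file() or path.resolve() == dest_resolved:
                continue
            data = path.read_bytes()
            out.write(data)
            # Guard against a final record that lacks a trailing newline.
            if data and not data.endswith(b"\n"):
                out.write(b"\n")
            merged += 1
    return merged
```

Merging per partition (for example, one output file per day) keeps partition pruning useful while avoiding one map task per tiny file.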