-
Hive query taking too much time
Savant, Keshav 2011-12-06, 11:00
Hi All,

My setup is:
hadoop-0.20.203.0
hive-0.7.1

I have a 5-node cluster: 4 data nodes and 1 namenode (which also acts as the secondary namenode). On the namenode I have set up Hive with HiveDerbyServerMode to support multiple Hive server connections.

I have loaded plain-text CSV files into HDFS using Hive 'LOAD DATA' statements. There are 2624 files with a combined size of only 713 MB, which is very small from a Hadoop perspective, since Hadoop can handle TBs of data easily.

The problem is that a simple count query (i.e. select count(*) from a_table) takes too long to execute. For instance, it takes almost 17 minutes when the table has 950,000 rows, which is far too long for such a small amount of data. This is only a dev environment; in production the number of files and their combined size will run into millions and GBs respectively.

I have analyzed the logs on all the datanodes and on the namenode/secondary namenode and found no errors in them.

I have also tried setting mapred.reduce.tasks to a fixed number, but the number of reducers always remains 1, while the number of maps is determined by Hive.

Any suggestions on what I am doing wrong, or on how I can improve the performance of Hive queries? Any pointer is highly appreciated.

Keshav
_____________
The information contained in this message is proprietary and/or confidential. If you are not the intended recipient, please: (i) delete the message and all copies; (ii) do not disclose, distribute or use the message in any manner; and (iii) notify the sender immediately. In addition, please be aware that any message addressed to our domain is subject to archiving and review by persons other than the intended recipient. Thank you.
-
Re: Hive query taking too much time
Wojciech Langiewicz 2011-12-06, 12:51
Hi,

In your case the total file size isn't the main factor reducing performance; the number of files is. To test this, try merging those 2000+ files into one (or a few) big files, then upload them to HDFS and test Hive performance (it should be noticeably better). If this works, you should think about merging the files before or after loading them into HDFS.

The second issue is counts. Try to observe how your jobs use mappers and reducers. In my experience, simple count() jobs can be stuck on one reducer (the one that does all the counting) for a long time. I have not resolved this issue, but it was not significant in my case. Setting mapred.reduce.tasks=xyz doesn't change that behavior, but, for example, using GROUP with COUNT works much faster.

I hope this helps.
--
Wojciech Langiewicz
-
RE: Hive query taking too much time
Paul Mackles 2011-12-06, 14:44
How much time is it spending in the map and reduce phases, respectively? The large number of files could be creating a lot of mappers, which adds a lot of overhead. What happens if you merge the 2624 files into a smaller number, like 24 or 48? That should speed up the map phase significantly.
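The merge suggested above can be sketched as a small shell function. This is only an illustration and was not posted in the thread: merge_into_buckets, the part-NN.csv naming and the directory layout are made up for the example, and it assumes all CSVs share the same columns and carry no header row.

```shell
#!/usr/bin/env bash
# Hypothetical helper: merge every *.csv in SRC_DIR into BUCKETS larger files
# (part-00.csv, part-01.csv, ...) in OUT_DIR, assigning inputs round-robin.
# OUT_DIR must be a different directory from SRC_DIR.
merge_into_buckets() {
    local src_dir=$1 out_dir=$2 buckets=$3
    local i=0 file bucket
    mkdir -p "$out_dir"
    for file in "$src_dir"/*.csv; do
        [ -e "$file" ] || continue              # the glob matched nothing
        bucket=$(printf 'part-%02d.csv' $(( i % buckets )))
        cat "$file" >> "$out_dir/$bucket"       # append to the chosen bucket
        i=$(( i + 1 ))
    done
    echo "$i"                                   # number of input files merged
}
```

For example, merge_into_buckets ./csv ./merged 24 would leave at most 24 files in ./merged, each of which could then be loaded with a 'load data local inpath' statement. Round-robin assignment keeps the buckets roughly equal only if the inputs are of similar size.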
-
Re: Hive query taking too much time
Mohit Gupta 2011-12-06, 14:46
Hi Paul,

I am having the same problem. Do you know of an efficient way to merge the files?

-Mohit
--
Best Regards,
Mohit Gupta
Software Engineer at Vdopia Inc.
-
Re: Hive query taking too much time
Vikas Srivastava 2011-12-07, 06:00
Hey, if all the files have the same columns, you can easily merge them with a shell script:

    table=yourtable
    # Keep the merged file outside the *.csv glob so it is never re-read as input
    merged=new_file.merged
    : > "$merged"
    for file in *.csv
    do
        cat "$file" >> "$merged"
    done
    # Load the merged file, not the last input file
    hive -e "load data local inpath '$merged' into table $table"

This merges all the files into a single file, which you can then load with the same hive command.

--
With Regards
Vikas Srivastava
DWH & Analytics Team
-
Re: Hive query taking too much time
Ayon Sinha 2011-12-07, 06:36
How about a simple Pig script with a load and a store statement? Set the maximum number of reducers to, say, 20 or 30; that way you will only have 20-30 files as output. Then put these files in the Hive directory. Make sure the delimiters match between Hive and Pig.

-Ayon
-
RE: Hive query taking too much time
Savant, Keshav 2011-12-07, 10:43
Hi Wojciech Langiewicz/Paul Mackles,

I tried your suggestion and it worked; performance has improved many times over. Here are the results from my testing after implementing your suggestion:

Files on HDFS | File size | count(*) time (s) | count(*) result
1 (created from 2624 CSVs) | 708.8 MB | 66.258 | 3,567,922
3 (each created from 2624 CSVs) | 708.8 MB * 3 | 119.92 | 10,703,766
3 (each created from 2624 CSVs) + 14 (each created from almost 200 CSVs) | 708.8 MB * 3 + 708.8 MB (combined size of the 14 files, which range from 48 MB to 68 MB) | 153.306 | 14,271,688

Thanks a lot for your help.

Kind Regards,
Keshav C Savant
-
Re: Hive query taking too much time
Wojciech Langiewicz 2011-12-07, 14:45
Hi,

In this case it's much easier and faster to merge all the files with this command:

    cat *.csv > output.csv
    hive -e "load data local inpath 'output.csv' into table $table"
-
RE: Hive query taking too much time
Savant, Keshav 2011-12-08, 05:05
You are right, Wojciech Langiewicz; we did the same thing and posted the results yesterday. Now we are planning to do this with a shell script, because our environment is dynamic and files keep arriving. We will schedule the shell script as a cron job.

A question on this: we are planning to merge files using one of the following approaches.

1. Based on file count: once the file count reaches X files, merge them and insert the result into HDFS.
2. Based on merged file size: once the merged file size exceeds X bytes, insert it into HDFS.

I think option 2 is better, because that way all merged files will be of roughly the same size. What do you suggest?

Kind Regards,
Keshav C Savant
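The size-based approach (option 2) could look roughly like the following sketch. It is not taken from the thread: merge_when_large_enough, the staging directory and the output path are hypothetical names, and it assumes that finished CSVs are moved into the staging directory atomically, so a file that is still being written is never picked up.

```shell
#!/usr/bin/env bash
# Hypothetical cron helper for the size-based approach (option 2).
# If the *.csv files in STAGING_DIR total at least THRESHOLD bytes, append them
# to OUT_FILE, delete the originals and return 0; otherwise leave everything in
# place and return 1. OUT_FILE must live outside STAGING_DIR.
merge_when_large_enough() {
    local staging_dir=$1 threshold=$2 out_file=$3
    local total=0 file size
    for file in "$staging_dir"/*.csv; do
        [ -e "$file" ] || continue              # the glob matched nothing
        size=$(wc -c < "$file")                 # size of this file in bytes
        total=$(( total + size ))
    done
    [ "$total" -ge "$threshold" ] || return 1   # not enough data yet
    for file in "$staging_dir"/*.csv; do
        [ -e "$file" ] || continue
        cat "$file" >> "$out_file" && rm -- "$file"
    done
    return 0
}
```

A cron-driven wrapper might call this every few minutes and run the 'load data local inpath' statement only when the function returns 0. A file-count trigger (option 1) would differ only in the condition that is tested before merging.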
-
Re: Hive query taking too much timeAniket Mokashi 2011-12-08, 08:03
You can also take a look at--
https://issues.apache.org/jira/browse/HIVE-74

On Wed, Dec 7, 2011 at 9:05 PM, Savant, Keshav <[EMAIL PROTECTED]> wrote:

> You are right Wojciech Langiewicz, we did the same thing and posted my
> result yesterday. Now we are planning to do this using a shell script
> because of dynamicity of our environment where file keep on coming. We
> will schedule the shell script using cron job.
>
> A query on this, we are planning to merge files based on either of the
> following approach
> 1. Based on file count: If file count goes to X number of files, then
> merge and insert in HDFS.
> 2. Based on merged file size: If merged file size crosses beyond X
> number of bytes, then insert into HDFS.
>
> I think option 2 is better because in that way we can say that all
> merged files will be almost of same bytes. What do you suggest?
>
> Kind Regards,
> Keshav C Savant
>
>
> -----Original Message-----
> From: Wojciech Langiewicz [mailto:[EMAIL PROTECTED]]
> Sent: Wednesday, December 07, 2011 8:15 PM
> To: [EMAIL PROTECTED]
> Subject: Re: Hive query taking too much time
>
> Hi,
> In this case it's much easier and faster to merge all files using this
> command:
>
> cat *.csv > output.csv
> hive -e "load data local inpath 'output.csv' into table $table"
>
> On 07.12.2011 07:00, Vikas Srivastava wrote:
> > hey if u having the same col of all the files then you can easily
> > merge by shell script
> >
> > list=`*.csv`
> > $table=yourtable
> > for file in $list
> > do
> > cat $file>>new_file.csv
> > done
> > hive -e "load data local inpath '$file' into table $table"
> >
> > it will merge all the files in single file then you can upload it in
> > the same query
> >
> > On Tue, Dec 6, 2011 at 8:16 PM, Mohit Gupta <[EMAIL PROTECTED]> wrote:
> >
> >> Hi Paul,
> >> I am having the same problem. Do you know any efficient way of
> >> merging the files?
> >>
> >> -Mohit
> >>
> >>
> >> On Tue, Dec 6, 2011 at 8:14 PM, Paul Mackles<[EMAIL PROTECTED]> wrote:
> >>
> >>> How much time is it spending in the map/reduce phases, respectively?
> >>> The large number of files could be creating a lot of mappers which
> >>> create a lot of overhead. What happens if you merge the 2624 files
> >>> into a smaller number like 24 or 48. That should speed up the mapper
> >>> phase significantly.
> >>>
> >>> *From:* Savant, Keshav [mailto:[EMAIL PROTECTED]]
> >>> *Sent:* Tuesday, December 06, 2011 6:01 AM
> >>> *To:* [EMAIL PROTECTED]
> >>> *Subject:* Hive query taking too much time
> >>>
> >>> Hi All,
> >>>
> >>> My setup is
> >>>
> >>> hadoop-0.20.203.0
> >>>
> >>> hive-0.7.1
> >>>
> >>> I am having a total of 5 node cluster: 4 data nodes, 1 namenode (it
> >>> is also acting as secondary name node). On namenode I have setup
> >>> hive with HiveDerbyServerMode to support multiple hive server
> >>> connection.
> >>>
> >>> I have inserted plain text CSV files in HDFS using 'LOAD DATA' hive
> >>> query statements, total number of files is 2624 an their combined
> >>> size is only
> >>> 713 MB, which is very less from Hadoop perspective that can handle
> >>> TBs of data very easily.
> >>>
> >>> The problem is, when I run a simple count query (i.e. *select
> >>> count(*) from a_table*), it takes too much time in executing the
> >>> query.
> >>>
> >>> For instance it takes almost 17 minutes to execute the said query if
> >>> the table has 950,000 rows, I understand that time is too much for
> >>> executing a query with only such small data.
> >>>
> >>> This is only a dev environment and in production environment the
> >>> number of files and their combined size will move into millions and
> >>> GBs
> >>> respectively.
> >>>
> >>> On analyzing the logs on all the datanodes and namenode/secondary
> >>> namenode I do not find any error in them.

"...:::Aniket:::... Quetzalco@tl"
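
For reference, the merge step quoted above could be sketched as a small bash function. This is only an illustration under stated assumptions: the function name merge_csv, the paths and the table name are hypothetical, all files are assumed to share the same columns, and the hive CLI is assumed to be on PATH for the commented usage lines.

```shell
#!/usr/bin/env bash
# Minimal sketch: concatenate every *.csv file in a directory into one file.
# merge_csv SRC_DIR OUT_FILE   (function name and arguments are illustrative)
merge_csv() {
    local src_dir=$1 out_file=$2 f
    : > "$out_file"                           # start from an empty output file
    for f in "$src_dir"/*.csv; do
        [ -f "$f" ] || continue               # skip the literal pattern when nothing matches
        [ "$f" -ef "$out_file" ] && continue  # never append the output file to itself
        cat "$f" >> "$out_file"
    done
}

# Hypothetical usage, loading the merged file (not the loop variable) into Hive:
#   merge_csv /data/incoming /tmp/merged.csv
#   hive -e "load data local inpath '/tmp/merged.csv' into table yourtable"
```

Quoting the variables and iterating over the glob directly avoids the word-splitting problems of collecting file names into a plain string first.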
-
Re: Hive query taking too much time
Wojciech Langiewicz 2011-12-08, 10:30
Using CombineFileInputFormat might help, but keeping many small files in HDFS still creates overhead.

I don't know the details of your requirements, but option 2 seems to be better. Make sure that X is at least the size of a few HDFS blocks. You could also merge files incrementally: for example, merge first every hour, then merge those results again after 12 hours, and so on.

You can use the -getmerge option of hadoop fs, or use this class (I have not used it):
http://hadoop.apache.org/hdfs/docs/r0.21.0/api/org/apache/hadoop/hdfs/tools/HDFSConcat.html

On 08.12.2011 09:03, Aniket Mokashi wrote:
> You can also take a look at--
> https://issues.apache.org/jira/browse/HIVE-74
>
> On Wed, Dec 7, 2011 at 9:05 PM, Savant, Keshav<[EMAIL PROTECTED]> wrote:
>
>> You are right Wojciech Langiewicz, we did the same thing and posted my
>> result yesterday. Now we are planning to do this using a shell script
>> because of dynamicity of our environment where file keep on coming. We
>> will schedule the shell script using cron job.
>>
>> A query on this, we are planning to merge files based on either of the
>> following approach
>> 1. Based on file count: If file count goes to X number of files, then
>> merge and insert in HDFS.
>> 2. Based on merged file size: If merged file size crosses beyond X
>> number of bytes, then insert into HDFS.
>>
>> I think option 2 is better because in that way we can say that all
>> merged files will be almost of same bytes. What do you suggest?
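
A rough sketch of the size-based approach (option 2) discussed in this thread, written as bash functions that a cron job could call. Everything named here is an assumption made for illustration: the function names pending_bytes and merge_if_large, the directories, the table name, and the threshold of four 64 MB blocks. It also assumes that new files are moved into the incoming directory atomically, so a half-written file is never merged, and that the output file lives outside that directory.

```shell
#!/usr/bin/env bash
# Sketch of "option 2": merge pending CSV files only once their combined
# size reaches a threshold. All names are illustrative.

# Print the combined size in bytes of DIR/*.csv.
pending_bytes() {
    local dir=$1 total=0 f size
    for f in "$dir"/*.csv; do
        [ -f "$f" ] || continue          # skip the literal pattern when nothing matches
        size=$(wc -c < "$f")
        total=$((total + size))
    done
    echo "$total"
}

# Merge DIR/*.csv into OUT_FILE and delete each input after it is appended,
# but only when the combined size is at least THRESHOLD bytes.
# Returns 1 (and changes nothing) when the threshold is not reached.
merge_if_large() {
    local dir=$1 threshold=$2 out_file=$3 f
    [ "$(pending_bytes "$dir")" -ge "$threshold" ] || return 1
    : > "$out_file"
    for f in "$dir"/*.csv; do
        [ -f "$f" ] || continue
        cat "$f" >> "$out_file" && rm -f "$f"
    done
}

# Hypothetical cron-driven usage (threshold value is an assumption):
#   THRESHOLD=$((4 * 64 * 1024 * 1024))
#   merge_if_large /data/incoming "$THRESHOLD" /data/merged/batch.csv &&
#       hive -e "load data local inpath '/data/merged/batch.csv' into table yourtable"
```

Checking the size before merging keeps every loaded file at or above the threshold, which is what keeps the number of mappers per query low.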