Hive, mail # user - Hive query taking too much time


Re: Hive query taking too much time
Aniket Mokashi 2011-12-08, 08:03
You can also take a look at--
https://issues.apache.org/jira/browse/HIVE-74
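
For reference, a minimal sketch of compacting a table's many small files
with Hive's merge settings (hive.merge.*). The setting names are as in
Hive 0.7, the size values shown are the usual defaults, and
a_table_compacted is a hypothetical table with the same schema as
a_table:

hive -e "
  set hive.merge.mapfiles=true;
  set hive.merge.mapredfiles=true;
  set hive.merge.size.per.task=256000000;
  set hive.merge.smallfiles.avgsize=16000000;
  -- rewriting the table forces the job output to be merged into larger files
  insert overwrite table a_table_compacted select * from a_table;
"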

On Wed, Dec 7, 2011 at 9:05 PM, Savant, Keshav <[EMAIL PROTECTED]> wrote:

> You are right, Wojciech Langiewicz; we did the same thing and posted our
> results yesterday. Now we are planning to do this with a shell script,
> because of the dynamic nature of our environment, where files keep
> arriving. We will schedule the shell script with a cron job.
>
> A question on this: we are planning to merge files based on one of the
> following approaches.
> 1. Based on file count: if the file count reaches X files, then merge
> and insert into HDFS.
> 2. Based on merged file size: if the merged file size crosses X bytes,
> then insert into HDFS.
>
> I think option 2 is better, because that way all the merged files will
> be of almost the same size. What do you suggest? A sketch of option 2
> is below.
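>
> A minimal sketch of option 2, assuming a 64 MB threshold and
> hypothetical names (an incoming/ directory, merged.csv, mytable):
>
> #!/bin/bash
> threshold=$((64 * 1024 * 1024))   # assumed cutoff, in bytes
> table=mytable
> for f in incoming/*.csv; do
>     # append each new file to the running merge file
>     cat "$f" >> merged.csv && rm "$f"
>     # file size in bytes (GNU stat); load and start over once past the cutoff
>     if [ "$(stat -c%s merged.csv)" -ge "$threshold" ]; then
>         hive -e "load data local inpath 'merged.csv' into table $table"
>         rm merged.csv
>     fi
> done
>
> Run from cron, this keeps the loaded files at roughly the same size,
> which matches the reasoning behind option 2.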
>
> Kind Regards,
> Keshav C Savant
>
>
> -----Original Message-----
> From: Wojciech Langiewicz [mailto:[EMAIL PROTECTED]]
> Sent: Wednesday, December 07, 2011 8:15 PM
> To: [EMAIL PROTECTED]
> Subject: Re: Hive query taking too much time
>
> Hi,
> In this case it's much easier and faster to merge all the files using
> these commands:
>
> cat *.csv > output.csv
> hive -e "load data local inpath 'output.csv' into table $table"
>
> On 07.12.2011 07:00, Vikas Srivastava wrote:
> > Hey, if all the files have the same columns, then you can easily
> > merge them with a shell script:
> >
> > table=yourtable
> > # append every CSV into a single file
> > for file in *.csv
> > do
> >     cat "$file" >> new_file.csv
> > done
> > # load the merged file once, in a single query
> > hive -e "load data local inpath 'new_file.csv' into table $table"
> >
> > It will merge all the files into a single file, which you can then
> > load with one query.
> >
> > On Tue, Dec 6, 2011 at 8:16 PM, Mohit Gupta
> > <[EMAIL PROTECTED]> wrote:
> >
> >> Hi Paul,
> >> I am having the same problem. Do you know any efficient way of
> >> merging the files?
> >>
> >> -Mohit
> >>
> >>
> >> On Tue, Dec 6, 2011 at 8:14 PM, Paul Mackles <[EMAIL PROTECTED]> wrote:
> >>
> >>> How much time is it spending in the map/reduce phases, respectively?
> >>> The large number of files could be creating a lot of mappers, which
> >>> create a lot of overhead. What happens if you merge the 2624 files
> >>> into a smaller number, like 24 or 48? That should speed up the mapper
> >>> phase significantly.
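> >>>
> >>> A rough sketch of one way to do that merge before loading, assuming
> >>> roughly uniform row lengths (the chunk_ prefix and the target of 48
> >>> files are arbitrary):
> >>>
> >>> # concatenate everything, then split into ~48 equal-sized chunks
> >>> cat *.csv > all.csv
> >>> lines=$(wc -l < all.csv)
> >>> split -l $(( (lines + 47) / 48 )) all.csv chunk_
> >>> for f in chunk_*; do
> >>>     hive -e "load data local inpath '$f' into table a_table"
> >>> done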
> >>>
> >>> *From:* Savant, Keshav [mailto:[EMAIL PROTECTED]]
> >>> *Sent:* Tuesday, December 06, 2011 6:01 AM
> >>> *To:* [EMAIL PROTECTED]
> >>> *Subject:* Hive query taking too much time
> >>>
> >>> Hi All,
> >>>
> >>> My setup is:
> >>>
> >>> hadoop-0.20.203.0
> >>> hive-0.7.1
> >>>
> >>> I have a 5-node cluster in total: 4 datanodes and 1 namenode (which
> >>> also acts as the secondary namenode). On the namenode I have set up
> >>> Hive with HiveDerbyServerMode to support multiple Hive server
> >>> connections.
> >>>
> >>> I have loaded plain-text CSV files into HDFS using 'LOAD DATA' Hive
> >>> statements. The total number of files is 2624, and their combined
> >>> size is only 713 MB, which is very small from a Hadoop perspective,
> >>> given that Hadoop can easily handle TBs of data.
> >>>
> >>> The problem is that when I run a simple count query (i.e. *select
> >>> count(*) from a_table*), it takes too much time to execute.
> >>>
> >>> For instance, it takes almost 17 minutes to execute that query when
> >>> the table has 950,000 rows; that is far too long for a query over
> >>> such a small amount of data.
> >>>
> >>> This is only a dev environment; in the production environment the
> >>> number of files and their combined size will grow into the millions
> >>> and GBs, respectively.
> >>>
> >>> On analyzing the logs on all the datanodes and the namenode/secondary
> >>> namenode, I do not find any errors in them.

"...:::Aniket:::... Quetzalco@tl"