Hive >> mail # user >> Hive query taking too much time


Re: Hive query taking too much time
You can also take a look at:
https://issues.apache.org/jira/browse/HIVE-74
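If the small files are already in the table, a related option (a sketch,
assuming your Hive build ships CombineHiveInputFormat) is to let Hive
combine small splits at read time instead of merging on disk; the split
size below is illustrative:

set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
-- ~256 MB max per combined split (illustrative)
set mapred.max.split.size=268435456;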

On Wed, Dec 7, 2011 at 9:05 PM, Savant, Keshav <
[EMAIL PROTECTED]> wrote:

> You are right Wojciech Langiewicz, we did the same thing and posted the
> results yesterday. Now we are planning to do this using a shell script,
> because our environment is dynamic and files keep coming in. We will
> schedule the shell script as a cron job.
>
> A query on this: we are planning to merge files based on either of the
> following approaches:
> 1. Based on file count: if the file count reaches X files, then merge
> and insert into HDFS.
> 2. Based on merged file size: if the merged file size exceeds X bytes,
> then insert into HDFS.
>
> I think option 2 is better because that way all merged files will be of
> nearly the same size. What do you suggest?
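>
> A minimal sketch of option 2 (size-based merging), assuming the incoming
> files land in a staging directory; the paths, table name, and threshold
> are all illustrative:
>
> #!/bin/bash
> # Append incoming CSVs to a merged file; once it crosses the size
> # threshold, load it into Hive and start a new merged file.
> THRESHOLD=$((64 * 1024 * 1024))   # 64 MB, illustrative
> STAGING=/data/staging             # illustrative path
> TABLE=yourtable
> n=0
> merged="merged_$n.csv"
> for f in "$STAGING"/*.csv; do
>   cat "$f" >> "$merged" && rm "$f"
>   if [ "$(stat -c %s "$merged")" -ge "$THRESHOLD" ]; then
>     hive -e "load data local inpath '$merged' into table $TABLE"
>     n=$((n + 1))
>     merged="merged_$n.csv"
>   fi
> done
> # any remainder below the threshold is left for the next cron run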
>
> Kind Regards,
> Keshav C Savant
>
>
> -----Original Message-----
> From: Wojciech Langiewicz [mailto:[EMAIL PROTECTED]]
> Sent: Wednesday, December 07, 2011 8:15 PM
> To: [EMAIL PROTECTED]
> Subject: Re: Hive query taking too much time
>
> Hi,
> In this case it's much easier and faster to merge all the files using
> these commands:
>
> cat *.csv > output.csv
> hive -e "load data local inpath 'output.csv' into table $table"
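>
> (One caveat: on a repeated run output.csv itself matches *.csv, so it is
> safer to write the merged file under a non-.csv name or into another
> directory.)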
>
> On 07.12.2011 07:00, Vikas Srivastava wrote:
> > Hey, if all the files have the same columns, then you can easily
> > merge them with a shell script:
> >
> > table=yourtable
> > for file in *.csv
> > do
> >   cat "$file" >> new_file.csv
> > done
> > hive -e "load data local inpath 'new_file.csv' into table $table"
> >
> > It will merge all the files into a single file, which you can then
> > load with the same query.
> >
> > On Tue, Dec 6, 2011 at 8:16 PM, Mohit Gupta
> > <[EMAIL PROTECTED]> wrote:
> >
> >> Hi Paul,
> >> I am having the same problem. Do you know of an efficient way to
> >> merge the files?
> >>
> >> -Mohit
> >>
> >>
> >> On Tue, Dec 6, 2011 at 8:14 PM, Paul Mackles <[EMAIL PROTECTED]>
> wrote:
> >>
> >>> How much time is it spending in the map/reduce phases, respectively?
> >>> The large number of files could be creating a lot of mappers, which
> >>> creates a lot of overhead. What happens if you merge the 2624 files
> >>> into a smaller number like 24 or 48? That should speed up the mapper
> >>> phase significantly.
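> >>>
> >>> A minimal sketch of that kind of merge, assuming the CSVs sit in the
> >>> current directory and 48 output buckets (both illustrative):
> >>>
> >>> K=48
> >>> i=0
> >>> for f in *.csv; do
> >>>   # round-robin the small files into K larger ones
> >>>   # (the *.csv glob is expanded once, before the loop body runs)
> >>>   cat "$f" >> "merged_$((i % K)).csv"
> >>>   i=$((i + 1))
> >>> done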
> >>>
> >>> *From:* Savant, Keshav [mailto:[EMAIL PROTECTED]]
> >>> *Sent:* Tuesday, December 06, 2011 6:01 AM
> >>> *To:* [EMAIL PROTECTED]
> >>> *Subject:* Hive query taking too much time
> >>>
> >>> Hi All,
> >>>
> >>> My setup is:
> >>>
> >>> hadoop-0.20.203.0
> >>> hive-0.7.1
> >>>
> >>> I have a 5-node cluster in total: 4 datanodes and 1 namenode (which
> >>> is also acting as the secondary namenode). On the namenode I have
> >>> set up hive with HiveDerbyServerMode to support multiple hive server
> >>> connections.
> >>>
> >>> I have inserted plain text CSV files into HDFS using 'LOAD DATA' hive
> >>> query statements. The total number of files is 2624, and their
> >>> combined size is only 713 MB, which is very small from a Hadoop
> >>> perspective; Hadoop can handle TBs of data very easily.
> >>>
> >>> The problem is that when I run a simple count query (i.e. *select
> >>> count(*) from a_table*), it takes too much time to execute.
> >>>
> >>> For instance, it takes almost 17 minutes to execute the said query
> >>> when the table has 950,000 rows; I understand that is far too long
> >>> for a query over such a small amount of data.
> >>>
> >>> This is only a dev environment; in the production environment the
> >>> number of files and their combined size will grow into the millions
> >>> and GBs respectively.
> >>>
> >>> On analyzing the logs on all the datanodes and the namenode/secondary
> >>> namenode, I do not find any errors in them.

"...:::Aniket:::... Quetzalco@tl"