Re: Hive query taking a lot of time just to launch map-reduce jobs
On 25 Nov 2013, at 11:50, Sreenath wrote:

> Hi all,
>
> We are using Hive for ad-hoc querying and have a Hive table which is
> partitioned on two fields (date, id). For each date there are around
> 1400 ids, so on a single day around that many partitions are added. The
> actual data resides in S3. The issue we are facing is that when we run a
> select count(*) over a month of the table, it takes a long time
> (approx. 1 hr 52 min) just to launch the map-reduce job.
> When I ran the query in Hive verbose mode, I could see that it spends
> this time deciding how many mappers to spawn (calculating splits). Is
> there any way to reduce this lag before the map-reduce job launches?
>
> This is one of the log messages logged during this lag time:
>
> 13/11/19 07:11:06 INFO mapred.FileInputFormat: Total input paths to process : 1
> 13/11/19 07:11:06 WARN httpclient.RestS3Service: Response '/Analyze%2F2013%2F10%2F03%2F465' - Unexpected response code 404, expected 200
>
> Does anyone have a quick fix for this?

So we're talking about 30 days x 1400 IDs x the number of files per ID
(usually more than 1).

That is at least 42,000 file paths, and (regardless of the error you
posted) Hive won't perform well when it has to go through that many
files to plan the query.
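
To make that arithmetic concrete, here is a rough sketch in Python. The function name and the date-only layout are my own illustrative assumptions, not something taken from your setup, and the counts are lower bounds that assume at least one path per partition.

```python
def estimated_input_paths(days: int, ids_per_day: int, files_per_partition: int = 1) -> int:
    """Lower-bound count of paths that must be listed before splits are computed.

    Hypothetical helper for illustration only; it is not part of Hive.
    """
    return days * ids_per_day * files_per_partition


# Figures from this thread: one month of data, about 1400 ids per day.
partitioned_by_date_and_id = estimated_input_paths(30, 1400)

# Assumed alternative layout: partition by date only, so one partition
# directory per day (each directory may still hold several files).
partitioned_by_date_only = estimated_input_paths(30, 1)

print(partitioned_by_date_and_id)  # 42000
print(partitioned_by_date_only)    # 30
```

Each of those paths has to be looked up on S3 before any mapper starts, so the count matters more than it would on a local HDFS cluster.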

It is IMHO a typical case of over-partitioning. I'd use RCFile and
partition by date only, keeping the ID as a regular, unpartitioned column.

What volume of data are we talking about here? What's the volume of the
biggest ID for a day, and the average?

David