The maximum split size (mapred.max.split.size) can be increased to decrease the number of mappers. Reference: http://www.slideshare.net/ye.mikez/hive-tuning (slide number 38)
Also, using CombineHiveInputFormat (the default input format) will combine multiple small files into one large split, and hence produce fewer mappers.
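As a rough sketch of what that could look like in a Hive session (the byte values are illustrative assumptions, not tested recommendations, and I am assuming the split-size settings are honoured by the LZO input format your table uses):

```sql
-- Combine small inputs into larger splits (this is already the default input format).
set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;

-- Assumed example: cap each split at 1 GB (value is in bytes).
-- With about 12 GB of compressed input that would give on the order of 12 mappers.
set mapred.max.split.size=1073741824;

-- Assumed example: ask for splits of at least 512 MB so that
-- very small splits are not scheduled as separate map tasks.
set mapred.min.split.size=536870912;
```

Larger values mean fewer mappers, so you may need to try a few sizes to find what keeps all 4 nodes busy without too much per-task overhead.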
On Feb 3, 2014, at 10:20 AM, KingDavies <[EMAIL PROTECTED]> wrote:
> Our platform has a 40GB raw data file that was compressed with LZO (12GB compressed) to reduce network IO to and from S3.
> Without indexing, the file is unsplittable, resulting in 1 map task and poor cluster utilisation.
> After indexing the file to make it splittable, the hive query produces 120 map tasks.
> However, with the 120 tasks distributed over a small 4 node cluster it takes longer to process the data than when it wasn’t splittable and processing was done by a single node (1h20mins vs 17mins). This was with a fairly simple select from where query, without distinct, group by or order.
> I’d like to utilise all nodes in the cluster, to reduce query time. What’s the best way to have the data crunched in parallel but with fewer mappers?