Hive, mail # user - Re: What are all the factors that go into the number of mappers - ORC - 2014-02-03, 06:23
Re: What are all the factors that go into the number of mappers - ORC
Hi John

The number of mappers is equal to the number of splits generated. The following factors go into split generation:
1) HDFS block size
2) Max split size

A split is cut when either
1) the cumulative size of adjacent stripes is greater than the HDFS block size, or
2) the cumulative size of adjacent stripes is greater than the max split size
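
To make the rule above concrete, here is a rough sketch in Python (not Hive's actual code). The function name cut_splits and its parameters are made up for illustration, and it assumes the stripe that pushes the running total over a limit stays in the split being closed; the real implementation may handle that boundary differently.

```python
def cut_splits(stripe_sizes, block_size, max_split_size):
    """Group adjacent stripes into splits (illustrative sketch only).

    stripe_sizes: stripe sizes in bytes, in file order.
    Returns a list of splits, each a list of stripe indices.
    """
    splits = []
    current = []        # stripe indices in the split being built
    current_size = 0    # cumulative size of those stripes

    for index, size in enumerate(stripe_sizes):
        current.append(index)
        current_size += size
        # Cut when the running total exceeds either limit (assumed boundary rule).
        if current_size > block_size or current_size > max_split_size:
            splits.append(current)
            current = []
            current_size = 0

    # Whatever is left over forms the final split.
    if current:
        splits.append(current)
    return splits
```

With a single 2000-byte stripe and limits in the hundreds of megabytes, this returns one split, which would mean one mapper.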

The HDFS block size for ORC files will be min(1.5GB, 2*stripe_size) in the current version of Hive (and probably Hive 0.12 too). In older versions, HDFS block size = min(2GB, 2*stripe_size).
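
As a quick check of the block-size formula, here is a small Python sketch. The helper name orc_block_size and the stripe sizes used with it are illustrative assumptions, not values taken from any particular Hive configuration.

```python
GB = 1024 ** 3
MB = 1024 ** 2

def orc_block_size(stripe_size, cap=int(1.5 * GB)):
    """Return min(cap, 2 * stripe_size) in bytes.

    cap defaults to 1.5GB (current behaviour as described above);
    pass cap=2 * GB to model the older behaviour.
    """
    return min(cap, 2 * stripe_size)
```

For an assumed stripe size of 256MB this gives a 512MB block; the cap only takes effect once the stripe size exceeds 768MB (or 1GB with the older 2GB cap).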

The other important thing to note is that ORC splits are generated only when HiveInputFormat is used. By default, Hive uses CombineHiveInputFormat, which uses a different strategy to generate splits: many small files are combined to form one large logical split.
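
The idea of combining small files can be pictured with a simple greedy packing sketch in Python. This is only an illustration of the concept, not the algorithm CombineHiveInputFormat actually uses (it ignores considerations such as data locality), and the name combine_files is hypothetical.

```python
def combine_files(files, max_split_size):
    """Pack (name, size_in_bytes) pairs into logical splits, in order.

    A new split is started when adding the next file would push the
    current split past max_split_size. A file larger than the limit
    ends up in a split of its own.
    """
    splits = []
    current = []
    current_size = 0

    for name, size in files:
        if current and current_size + size > max_split_size:
            splits.append(current)
            current = []
            current_size = 0
        current.append(name)
        current_size += size

    if current:
        splits.append(current)
    return splits
```

Under this simplified model, a table made of many tiny files would still produce only a handful of splits, and therefore a handful of mappers.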

In any case, for the size you mentioned (2000 bytes) there should be only one mapper. Can you provide the values of the following configs so that we can understand the situation better?

1) hive.input.format
2) hive.min.split.size
3) hive.max.split.size
4) total size on disk for the table

Thanks
Prasanth Jayachandran

On Feb 2, 2014, at 5:25 PM, John Omernik <[EMAIL PROTECTED]> wrote:
